Initial commit with all features

This commit is contained in:
LUIS NOVO 2024-10-21 14:56:10 -03:00
commit bcd260a28b
52 changed files with 6897 additions and 0 deletions

8
.dockerignore Normal file
View file

@ -0,0 +1,8 @@
notebooks/
data/
.uploads/
.venv/
.mypy_cache/
.ruff_cache/
.env
sqlite-db/

27
.env.example Normal file
View file

@ -0,0 +1,27 @@
# YOUR LLM API KEYS
OPENAI_API_KEY=API_KEY
# MODEL_CONFIGURATIONS
# Only OpenAI models are supported for now
DEFAULT_MODEL="gpt-4o-mini" # This is the default model used for all the features
SUMMARIZATION_MODEL="gpt-4o-mini" # This is the model used for summarization, defaults to the DEFAULT_MODEL if empty
RETRIEVAL_MODEL="gpt-4o-mini" # This is the model used for retrieval, defaults to the DEFAULT_MODEL if empty
# CONNECTION DETAILS FOR YOUR SURREAL DB
SURREAL_ADDRESS="ws://localhost:8000/rpc"
SURREAL_USER="root"
SURREAL_PASS="root"
SURREAL_NAMESPACE="open_notebook"
SURREAL_DATABASE="staging"
# This is used for the summarization feature when the content is too big to fit in a single context window
# It is measured in characters, not tokens.
SUMMARY_CHUNK_SIZE=200000
SUMMARY_CHUNK_OVERLAP=1000
# This is used for vector embeddings
# It is measured in characters, not tokens.
EMBEDDING_CHUNK_SIZE=1000
EMBEDDING_CHUNK_OVERLAP=50
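Both chunk settings above are measured in characters, so splitting reduces to plain string slicing. A minimal sketch of character-based chunking with overlap (the project's actual `split_text` helper in `open_notebook.utils` may differ):

```python
def split_text(text: str, chunk: int = 1000, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk` characters, each overlapping
    the previous window by `overlap` characters."""
    if overlap >= chunk:
        raise ValueError("overlap must be smaller than the chunk size")
    step = chunk - overlap
    return [text[i:i + chunk] for i in range(0, len(text), step)]

# 2500 characters with chunk=1000, overlap=50 -> 3 windows
pieces = split_text("x" * 2500, chunk=1000, overlap=50)
```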

118
.gitignore vendored Normal file
View file

@ -0,0 +1,118 @@
notebooks/
data/
.uploads/
sqlite-db/
surreal-data/
docker.env
# Python-specific
*.py[cod]
__pycache__/
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# PyCharm
.idea/
# VS Code
.vscode/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# macOS
.DS_Store
# Windows
Thumbs.db
ehthumbs.db
desktop.ini
# Linux
*~
# Log files
*.log
# Database files
*.db
*.sqlite3
# Virtual environment
.python-version

9
.pre-commit-config.yaml Normal file
View file

@ -0,0 +1,9 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.4.4
hooks:
- id: ruff
args: ["--fix"]
exclude: "templates"
- id: ruff-format
exclude: "templates"

30
.streamlit/config.toml Normal file
View file

@ -0,0 +1,30 @@
[server]
port = 8502
maxMessageSize = 500
[browser]
serverPort = 8502
# [theme]
# # The preset Streamlit theme that your custom theme inherits from.
# # One of "light" or "dark".
# base =
# # Primary accent color for interactive elements.
# primaryColor =
# # Background color for the main content area.
# backgroundColor =
# # Background color used for the sidebar and most interactive widgets.
# secondaryBackgroundColor =
# # Color used for almost all text.
# textColor =
# # Font family for all text in the app, except code blocks. One of "sans serif",
# # "serif", or "monospace".
# font =

52
CONTRIBUTING.md Normal file
View file

@ -0,0 +1,52 @@
# Contributing to Open Notebook
First off, thank you for considering contributing to Open Notebook! What makes open source great is the fact that we can work together and accomplish things we would never do on our own. All suggestions are welcome.
## Code of Conduct
By participating in this project, you are expected to uphold our Code of Conduct (to be created separately).
## How Can I Contribute?
### Reporting Bugs
- Ensure the bug was not already reported by searching on GitHub under [Issues](https://github.com/yourusername/open-notebook/issues).
- If you're unable to find an open issue addressing the problem, [open a new one](https://github.com/yourusername/open-notebook/issues/new). Be sure to include a title and clear description, as much relevant information as possible, and a code sample or an executable test case demonstrating the expected behavior that is not occurring.
### Suggesting Enhancements
- Open a new issue with a clear title and detailed description of the suggested enhancement.
- Provide any relevant examples or mockups if applicable.
### Pull Requests
1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. Ensure the test suite passes.
4. Make sure your code lints.
5. Issue that pull request!
## Styleguides
### Git Commit Messages
- Use the present tense ("Add feature" not "Added feature")
- Use the imperative mood ("Move cursor to..." not "Moves cursor to...")
- Limit the first line to 72 characters or less
- Reference issues and pull requests liberally after the first line
### Python Styleguide
- Follow PEP 8 guidelines
- Use type hints where possible
- Write docstrings for all functions, classes, and modules
### Documentation Styleguide
- Use Markdown for documentation files
- Reference functions and classes appropriately
## Additional Notes
Thank you for contributing to Open Notebook!

28
Dockerfile Normal file
View file

@ -0,0 +1,28 @@
# Use an official Python runtime as a base image
FROM python:3.11.7-slim-bullseye
# Install system dependencies required for building certain Python packages
RUN apt-get update && apt-get install -y \
gcc \
curl wget libmagic-dev \
&& rm -rf /var/lib/apt/lists/*
# Set the working directory in the container to /app
WORKDIR /app
RUN pip install poetry --no-cache-dir
RUN poetry self add poetry-plugin-dotenv
RUN poetry config virtualenvs.create false
COPY pyproject.toml poetry.lock* /app/
RUN poetry install --only main
#--no-root
COPY . /app
WORKDIR /app
EXPOSE 8502
RUN mkdir -p /app/sqlite-db
CMD ["poetry", "run", "streamlit", "run", "app_home.py"]

17
LICENSE Normal file
View file

@ -0,0 +1,17 @@
MIT License
Copyright (c) 2024 Luis Novo
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

27
Makefile Normal file
View file

@ -0,0 +1,27 @@
.PHONY: run check ruff database lint docker-build docker-push
# Get version from pyproject.toml
VERSION := $(shell grep -m1 version pyproject.toml | cut -d'"' -f2)
IMAGE_NAME := lfnovo/open_notebook
database:
docker compose up -d
run:
poetry run streamlit run app_home.py
lint:
poetry run python -m mypy .
ruff:
ruff check . --fix
docker-build:
docker build . -t $(IMAGE_NAME):$(VERSION)
docker tag $(IMAGE_NAME):$(VERSION) $(IMAGE_NAME):latest
docker-push:
docker push $(IMAGE_NAME):$(VERSION)
docker push $(IMAGE_NAME):latest
# Combined build and push
docker-release: docker-build docker-push

90
README.md Normal file
View file

@ -0,0 +1,90 @@
# Open Notebook
An open source, privacy-focused alternative to Google's Notebook LM. Why give Google more of our data when we can take control of our own research workflows?
In a world dominated by Artificial Intelligence, the ability to think 🧠 and acquire new knowledge 💡 is a skill that should not be a privilege for a few, nor restricted to a single company.
Open Notebook empowers you to manage your research, generate AI-assisted notes, and interact with your content—on your terms.
## ⚙️ Setting Up
Go to the [Setup Guide](docs/SETUP.md) to learn how to set up the tool.
## Usage Instructions
Go to the [Usage](docs/USAGE.md) page to learn how to use all features.
## 🚀 Features
- **Multi-Notebook Support**: Organize your research across multiple notebooks effortlessly.
- **Broad Content Integration**: Works with links, PDFs, TXT files, PowerPoint presentations, YouTube videos, and pasted text (audio/video support coming soon).
- **AI-Powered Notes**: Write notes yourself or let the AI assist you in generating insights.
- **Recursive Summarization**: Tackle large content by recursively summarizing it.
- **Integrated Search Engines**: Built-in full-text and vector search for faster information retrieval.
- **Fine-Grained Context Management**: Choose exactly what to share with the AI to maintain control.
- **Cost Estimation**: Estimate costs for large context processing to keep budget control in check.
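The recursive summarization feature can be pictured as a simple loop: split the content, summarize each piece, then summarize the summaries until the result fits one context window. A sketch with `summarize` as a stand-in for the LLM call (the real implementation lives in the project's summary graph):

```python
def recursive_summary(text, summarize, chunk_size=200_000, overlap=1_000):
    """Summarize text that may not fit one context window: split into
    overlapping character chunks, summarize each, then recurse on the
    joined summaries until the result fits a single chunk."""
    if len(text) <= chunk_size:
        return summarize(text)
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    return recursive_summary(
        "\n".join(summarize(c) for c in chunks), summarize, chunk_size, overlap
    )
```

The `chunk_size` and `overlap` defaults mirror the `SUMMARY_CHUNK_SIZE` / `SUMMARY_CHUNK_OVERLAP` settings, which are measured in characters, not tokens.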
### 📝 Notebook Page
Three intuitive columns to streamline your work:
1. **Sources**: Manage all research materials.
2. **Notes**: Create or AI-generate notes.
3. **Chat**: Chat with the AI, leveraging your content.
### ⚙️ Context Configuration
Take control of your data. Decide what gets sent to the AI with three context options:
- No context
- Summary only
- Full content
Plus, you can add your project description to help the AI provide more accurate and helpful responses.
### 🔍 Integrated Search for Your Items
Locate anything across your research with ease using full-text and vector-based search.
### 💬 Powerful Open Prompts
Jinja-based prompts that are easy to customize to your preferences.
## 🌟 Coming Soon
- **Podcast Generator**: Automatically convert your notes into a podcast format.
- **Multi-model support**: Anthropic, Gemini, Mistral, Ollama coming soon.
- **Enhanced Citations**: Improved layout and finer control for citations.
- **Insight Generation**: New tools for creating insights, leveraging the Fabric framework.
- **Better Embeddings & Summarization**: Smarter ways to distill information.
- **Multiple Chat Sessions**: Juggle different discussions within the same notebook.
- **Live Front-End Updates**: Real-time UI updates for a smoother experience.
- **Async Processing**: Faster UI through asynchronous content processing.
- **Improved Error Handling**: Making everything more robust.
- **Cross-Notebook Sources and Notes**: Reuse research notes across projects.
- **Bookmark Integration**: Integrate with your favorite bookmarking app.
## 💻 Tech Stack
- **Streamlit**: For the front-end (Looking to move out of Streamlit. Contributors welcome!).
- **SurrealDB**: Fast, scalable database solution.
- **Langchain/Langgraph**: The backbone for LLM interactions.
## 🙌 Help Wanted
We would love your contributions! Specifically, we're looking for help with:
- **Front-End Development**: Improve the UI/UX by moving beyond Streamlit.
- **Testing & Bug Fixes**: Help make Open Notebook more robust.
- **Feature Development**: Let's make the coolest note-taking tool together!
See more at [CONTRIBUTING](CONTRIBUTING.md).
## 📄 License
Open Notebook is MIT licensed. See the [LICENSE](LICENSE) file for details.
---
Your contributions, feature requests, and bug reports are always welcome. Let's build a research tool that respects our privacy and makes learning truly open for everyone. ✨

19
app_home.py Normal file
View file

@ -0,0 +1,19 @@
import streamlit as st
from open_notebook.exceptions import InvalidDatabaseSchema
from open_notebook.repository import check_version, execute_migration
try:
check_version()
except InvalidDatabaseSchema as e:
st.error(e)
if st.button("Execute Migration.."):
try:
execute_migration()
st.success("Migration executed successfully")
st.rerun()
except Exception as e:
st.error(e)
st.stop()
st.switch_page("pages/2_📒_Notebooks.py")

196
db_setup.surrealql Normal file
View file

@ -0,0 +1,196 @@
REMOVE table IF EXISTS source;
REMOVE table IF EXISTS reference;
REMOVE table IF EXISTS notebook;
REMOVE table IF EXISTS note;
REMOVE table IF EXISTS artifact;
REMOVE table IF EXISTS source_chunk;
REMOVE table IF EXISTS source_insight;
REMOVE ANALYZER IF EXISTS my_analyzer;
REMOVE FUNCTION IF EXISTS fn::text_search;
REMOVE INDEX IF EXISTS idx_source_full ON TABLE source_chunk;
REMOVE INDEX IF EXISTS idx_source_embed_chunk ON TABLE source_embedding;
REMOVE INDEX IF EXISTS idx_source_insight ON TABLE source_insight;
REMOVE INDEX IF EXISTS idx_note ON TABLE note;
REMOVE INDEX IF EXISTS idx_source_title ON TABLE source;
REMOVE INDEX IF EXISTS idx_note_title ON TABLE note;
DEFINE TABLE IF NOT EXISTS source SCHEMAFULL;
DEFINE FIELD asset
ON TABLE source
FLEXIBLE TYPE option<object>;
DEFINE FIELD title ON TABLE source TYPE option<string>;
-- DEFINE FIELD summary ON TABLE source TYPE option<string>;
DEFINE FIELD topics ON TABLE source TYPE option<array<string>>;
DEFINE FIELD created ON source DEFAULT time::now() VALUE $before OR time::now();
DEFINE FIELD updated ON source DEFAULT time::now() VALUE time::now();
-- temporary workaround while SurrealDB doesn't fix the SDK
DEFINE TABLE IF NOT EXISTS source_chunk SCHEMAFULL;
DEFINE FIELD source ON TABLE source_chunk TYPE record<source>;
DEFINE FIELD order ON TABLE source_chunk TYPE int;
DEFINE FIELD content ON TABLE source_chunk TYPE string;
DEFINE TABLE IF NOT EXISTS source_embedding SCHEMAFULL;
DEFINE FIELD source ON TABLE source_embedding TYPE record<source>;
DEFINE FIELD order ON TABLE source_embedding TYPE int;
DEFINE FIELD content ON TABLE source_embedding TYPE string;
DEFINE FIELD embedding ON TABLE source_embedding TYPE array<float>;
DEFINE TABLE IF NOT EXISTS source_insight SCHEMAFULL;
DEFINE FIELD source ON TABLE source_insight TYPE record<source>;
DEFINE FIELD insight_type ON TABLE source_insight TYPE string;
DEFINE FIELD content ON TABLE source_insight TYPE string;
DEFINE FIELD embedding ON TABLE source_insight TYPE array<float>;
DEFINE TABLE IF NOT EXISTS note SCHEMAFULL;
DEFINE FIELD title ON TABLE note TYPE option<string>;
DEFINE FIELD summary ON TABLE note TYPE option<string>;
DEFINE FIELD content ON TABLE note TYPE option<string>;
DEFINE FIELD embedding ON TABLE note TYPE array<float>;
DEFINE FIELD created ON note DEFAULT time::now() VALUE $before OR time::now();
DEFINE FIELD updated ON note DEFAULT time::now() VALUE time::now();
DEFINE TABLE IF NOT EXISTS notebook SCHEMAFULL;
DEFINE FIELD name ON TABLE notebook TYPE option<string>;
DEFINE FIELD description ON TABLE notebook TYPE option<string>;
DEFINE FIELD created ON notebook DEFAULT time::now() VALUE $before OR time::now();
DEFINE FIELD updated ON notebook DEFAULT time::now() VALUE time::now();
DEFINE TABLE reference
TYPE RELATION
FROM source TO notebook;
DEFINE TABLE artifact
TYPE RELATION
FROM note TO notebook;
-- TODO: understand the analyzer
DEFINE ANALYZER my_analyzer TOKENIZERS blank,class,camel,punct FILTERS snowball(english), lowercase;
DEFINE INDEX idx_source_title ON TABLE source COLUMNS title SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX idx_source_full ON TABLE source_chunk COLUMNS content SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX idx_source_embed_chunk ON TABLE source_embedding COLUMNS content SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX idx_source_insight ON TABLE source_insight COLUMNS content SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX idx_note ON TABLE note COLUMNS content SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX idx_note_title ON TABLE note COLUMNS title SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE FUNCTION IF NOT EXISTS fn::text_search($query_text: string, $match_count: int, $sources:bool, $show_notes:bool) {
let $source_title_search =
IF $sources {(
SELECT id as item_id, math::max(search::score(1)) AS relevance
FROM source
WHERE title @1@ $query_text
GROUP BY item_id)}
ELSE { [] };
let $source_embedding_search =
IF $sources {(
SELECT source as item_id, math::max(search::score(1)) AS relevance
FROM source_embedding
WHERE content @1@ $query_text
GROUP BY item_id)}
ELSE { [] };
let $source_chunk_search =
IF $sources {(
SELECT source as item_id, math::max(search::score(1)) AS relevance
FROM source_chunk
WHERE content @1@ $query_text
GROUP BY item_id)}
ELSE { [] };
let $source_insight_search =
IF $sources {(
SELECT source as item_id, math::max(search::score(1)) AS relevance
FROM source_insight
WHERE content @1@ $query_text
GROUP BY item_id)}
ELSE { [] };
let $note_title_search =
IF $show_notes {(
SELECT id as item_id, math::max(search::score(1)) AS relevance
FROM note
WHERE title @1@ $query_text
GROUP BY item_id)}
ELSE { [] };
let $note_content_search =
IF $show_notes {(
SELECT id as item_id, math::max(search::score(1)) AS relevance
FROM note
WHERE content @1@ $query_text
GROUP BY item_id)}
ELSE { [] };
let $source_chunk_results = array::union($source_embedding_search, $source_chunk_search);
let $source_asset_results = array::union($source_title_search, $source_insight_search);
let $source_results = array::union($source_chunk_results, $source_asset_results );
let $note_results = array::union($note_title_search, $note_content_search );
let $final_results = array::union($source_results, $note_results );
RETURN (SELECT item_id, math::max(relevance) as relevance from $final_results
group by item_id ORDER BY relevance DESC LIMIT $match_count);
};
REMOVE FUNCTION fn::vector_search;
DEFINE FUNCTION IF NOT EXISTS fn::vector_search($query: array<float>, $match_count: int, $sources:bool, $show_notes:bool) {
let $source_embedding_search =
IF $sources {(
SELECT source as item_id, content, vector::similarity::cosine(embedding, $query) as similarity
FROM source_embedding LIMIT $match_count)}
ELSE { [] };
let $source_insight_search =
IF $sources {(
SELECT source as item_id, content, vector::similarity::cosine(embedding, $query) as similarity
FROM source_insight LIMIT $match_count)}
ELSE { [] };
let $note_content_search =
IF $show_notes {(
SELECT id as item_id, content, vector::similarity::cosine(embedding, $query) as similarity
FROM note LIMIT $match_count)}
ELSE { [] };
let $source_results = array::union($source_embedding_search, $source_insight_search);
let $note_results = $note_content_search;
let $final_results = array::union($source_results, $note_results );
RETURN (SELECT item_id, math::max(similarity) as similarity from $final_results
group by item_id ORDER BY similarity DESC LIMIT $match_count);
};
CREATE open_notebook:database_info SET
version= "0.0.1";
UPDATE open_notebook:database_info SET
version= "0.0.1";
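The merge step at the end of `fn::text_search` — union the per-table result sets, keep the best relevance score per item, rank, and cut to `match_count` — is equivalent to this plain-Python sketch (illustrative only, not the project's code):

```python
def merge_results(result_sets, match_count):
    """Union result sets of (item_id, relevance) pairs, keep the max
    relevance per item, and return the top `match_count` items."""
    best: dict[str, float] = {}
    for results in result_sets:
        for item_id, relevance in results:
            if relevance > best.get(item_id, float("-inf")):
                best[item_id] = relevance
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:match_count]
```

An item matched in several tables (e.g. both its title and a chunk) appears once, scored by its best match — the same dedup that `GROUP BY item_id` with `math::max(relevance)` performs in the SurrealQL function.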

22
docker-compose.dev.yml Normal file
View file

@ -0,0 +1,22 @@
version: '3'
services:
surrealdb:
image: surrealdb/surrealdb:v2
ports:
- "8000:8000"
volumes:
- ./surreal-data:/mydata
user: "${UID}:${GID}"
command: start --log trace --user root --pass root rocksdb:mydatabase.db
pull_policy: always
open_notebook:
build:
context: .
dockerfile: Dockerfile
ports:
- "8080:8502"
volumes:
- ./docker.env:/app/.env
depends_on:
- surrealdb

22
docker-compose.yml Normal file
View file

@ -0,0 +1,22 @@
version: '3'
services:
surrealdb:
image: surrealdb/surrealdb:v2
ports:
- "8000:8000"
volumes:
- ./surreal-data:/mydata
user: "${UID}:${GID}"
command: start --log trace --user root --pass root rocksdb:mydatabase.db
pull_policy: always
open_notebook:
image: lfnovo/open_notebook:latest
ports:
- "8080:8502"
volumes:
- ./docker.env:/app/.env
depends_on:
- surrealdb
pull_policy: always

128
docs/SETUP.md Normal file
View file

@ -0,0 +1,128 @@
# Installing Open Notebook
## 📦 Installing from Source
Quickly get started by cloning and installing the dependencies.
```sh
git clone https://github.com/lfnovo/open_notebook.git
cd open_notebook
poetry install
```
Make a copy of `.env.example` and rename it to `.env`.
You need to enter at least your `OPENAI_API_KEY` and the SurrealDB connection details.
```
OPENAI_API_KEY=
# CONNECTION DETAILS FOR YOUR SURREAL DB
SURREAL_ADDRESS="ws://localhost:8000/rpc"
SURREAL_USER="root"
SURREAL_PASS="root"
SURREAL_NAMESPACE="open_notebook"
SURREAL_DATABASE="staging"
```
Then, run it by using:
```sh
poetry run streamlit run app_home.py
```
or use the shortcut:
```sh
make run
```
## 🐳 Docker Setup
Alternatively, you can use Docker for easy setup.
Copy the `.env.example` file and name it `docker.env`
```sh
docker run -d \
--name open_notebook \
-p 8080:8502 \
-v $(pwd)/docker.env:/app/.env \
lfnovo/open_notebook:latest
```
You can pass the environment variables manually if you want:
```sh
docker run -d \
--name open_notebook \
-p 8080:8502 \
-e OPENAI_API_KEY=API_KEY \
-e DEFAULT_MODEL="gpt-4o-mini" \
-e SURREAL_ADDRESS="ws://localhost:8000/rpc" \
-e SURREAL_USER="root" \
-e SURREAL_PASS="root" \
-e SURREAL_NAMESPACE="open_notebook" \
-e SURREAL_DATABASE="staging" \
lfnovo/open_notebook:latest
```
If you need to run SurrealDB in Docker as well, it's easier to use Docker Compose, like this:
```yaml
services:
surrealdb:
image: surrealdb/surrealdb:v2
ports:
- "8000:8000"
volumes:
- ./surreal-data:/mydata
user: "${UID}:${GID}"
command: start --log trace --user root --pass root rocksdb:mydatabase.db
pull_policy: always
open_notebook:
image: lfnovo/open_notebook:latest
ports:
- "8080:8502"
volumes:
- ./docker.env:/app/.env
depends_on:
- surrealdb
pull_policy: always
```
or with the environment variables:
```yaml
services:
surrealdb:
image: surrealdb/surrealdb:v2
ports:
- "8000:8000"
volumes:
- ./surreal-data:/mydata
user: "${UID}:${GID}"
command: start --log trace --user root --pass root rocksdb:mydatabase.db
pull_policy: always
open_notebook:
image: lfnovo/open_notebook:latest
ports:
- "8080:8502"
environment:
- OPENAI_API_KEY=API_KEY
- DEFAULT_MODEL=gpt-4o-mini
- SURREAL_ADDRESS=ws://surrealdb:8000/rpc
- SURREAL_USER=root
- SURREAL_PASS=root
- SURREAL_NAMESPACE=open_notebook
- SURREAL_DATABASE=staging
depends_on:
- surrealdb
pull_policy: always
```
## Running the app
After the app is running, you can access it at http://localhost:8080.
The first time you connect, the app checks whether the database schema is ready and, if it isn't, creates the database for you.

49
docs/USAGE.md Normal file
View file

@ -0,0 +1,49 @@
# Using Open Notebook
This first release of Open Notebook is inspired by Notebook LM, so you will find a very similar workflow.
## Creating a new notebook
![New Notebook](assets/new_notebook.png)
Just type a name and description for the Notebook and you are good to go. Make the description as detailed as possible since it will be used by the LLM to understand the context of the notebook and provide you with better answers.
## Adding sources
Just click on Add Source and enter a URL, upload a file, or paste the content of your source.
![New Notebook](assets/add_source.png)
You'll find your new source in the first column of the Notebook Page.
![New Notebook](assets/asset_list.png)
## Talk to the Assistant
Once you have enough content in the notebook, you can decide which items will be visible to the LLM before sending your question.
![New Notebook](assets/context.png)
- Not in Context: the LLM won't receive this item as part of the context.
- Summary: the LLM will receive the summary of the content and can ask for the full document if needed.
- Full Content: the LLM will receive the full transcript of the content together with your question.
It's recommended to use the least amount of context possible, so that you can reduce your API spend.
## Making Notes
There are two ways to make notes:
Manually by clicking on New Note
![New Notebook](assets/human_note.png)
Or by turning any LLM message into a Note.
![New Notebook](assets/ai_note.png)
## Searching
The search page gives you an overview of all the notes you have made and the sources you have added. You can query the database by keyword as well as by vector search.
![New Notebook](assets/search.png)

BIN
docs/assets/add_source.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 23 KiB

BIN
docs/assets/ai_note.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 27 KiB

BIN
docs/assets/asset_list.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 63 KiB

BIN
docs/assets/context.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 24 KiB

BIN
docs/assets/human_note.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 28 KiB

BIN
docs/assets/new_notebook.png Normal file
Binary file not shown.

After

Width:  |  Height:  |  Size: 41 KiB

BIN
docs/assets/search.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 48 KiB

8
mypy.ini Normal file
View file

@ -0,0 +1,8 @@
[mypy]
# Disable PEP 561 checks
ignore_missing_imports = True
check_untyped_defs = True
# Alternatively, you can ignore specific modules
[mypy-some_module]
ignore_missing_imports = True


433
open_notebook/domain.py Normal file
View file

@ -0,0 +1,433 @@
import os
from datetime import datetime
from typing import Any, ClassVar, Dict, List, Literal, Optional, Type, TypeVar
from langchain_core.runnables.config import RunnableConfig
from loguru import logger
from pydantic import BaseModel, Field, field_validator
from open_notebook.exceptions import (
DatabaseOperationError,
InvalidInputError,
NotFoundError,
)
from open_notebook.graphs.summary import graph as summarizer
from open_notebook.repository import (
repo_create,
repo_delete,
repo_query,
repo_relate,
repo_update,
)
from open_notebook.utils import get_embedding, split_text, surreal_clean
T = TypeVar("T", bound="ObjectModel")
class ObjectModel(BaseModel):
id: Optional[str] = None
table_name: ClassVar[str] = ""
created: Optional[datetime] = None
updated: Optional[datetime] = None
@classmethod
def get_all(cls: Type[T]) -> List[T]:
try:
result = repo_query(f"SELECT * FROM {cls.table_name}")
objects = [cls(**obj) for obj in result]
return objects
except Exception as e:
logger.error(f"Error fetching all {cls.table_name}: {str(e)}")
logger.exception(e)
raise DatabaseOperationError(f"Failed to fetch all {cls.table_name}")
@classmethod
def get(cls: Type[T], id: str) -> Optional[T]:
if not id:
raise InvalidInputError("ID cannot be empty")
try:
result = repo_query(f"SELECT * FROM {id}")
if result:
return cls(**result[0])
return None
except Exception as e:
logger.error(f"Error fetching {cls.table_name} with id {id}: {str(e)}")
logger.exception(e)
raise NotFoundError(f"{cls.table_name} with id {id} not found")
def needs_embedding(self) -> bool:
return False
def get_embedding_content(self) -> Optional[str]:
return None
def save(self) -> None:
try:
data = self._prepare_save_data()
if self.needs_embedding():
embedding_content = self.get_embedding_content()
if embedding_content:
data["embedding"] = get_embedding(embedding_content)
if self.id is None:
logger.debug("Creating new record")
repo_result = repo_create(self.__class__.table_name, data)
else:
logger.debug(f"Updating record with id {self.id}")
repo_result = repo_update(self.id, data)
# Update the current instance with the result
for key, value in repo_result.items():
if hasattr(self, key):
setattr(self, key, value)
except Exception as e:
logger.error(f"Error saving {self.__class__.table_name}: {str(e)}")
logger.exception(e)
raise DatabaseOperationError(f"Failed to save {self.__class__.table_name}")
def _prepare_save_data(self) -> Dict[str, Any]:
data = self.model_dump()
logger.debug(f"Preparing data for save: {data}")
del data["created"]
del data["updated"]
return {key: value for key, value in data.items() if value is not None}
def delete(self) -> bool:
if self.id is None:
raise InvalidInputError("Cannot delete object without an ID")
try:
logger.debug(f"Deleting record with id {self.id}")
return repo_delete(self.id)
except Exception as e:
logger.error(
f"Error deleting {self.__class__.table_name} with id {self.id}: {str(e)}"
)
raise DatabaseOperationError(
f"Failed to delete {self.__class__.table_name}"
)
def relate(self, relationship: str, target_id: str) -> Any:
if not relationship or not target_id:
raise InvalidInputError("Relationship and target ID must be provided")
try:
return repo_relate(self.id, relationship, target_id)
except Exception as e:
logger.error(f"Error creating relationship: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to create relationship")
class Notebook(ObjectModel):
table_name: ClassVar[str] = "notebook"
name: str
description: str
@field_validator("name")
@classmethod
def name_must_not_be_empty(cls, v):
if not v.strip():
raise InvalidInputError("Notebook name cannot be empty")
return v
@property
def sources(self) -> List["Source"]:
try:
srcs = repo_query(f"""
select * from (
select
<- source as source
from reference
where out={self.id}
fetch source
)
order by source.updated desc
""")
return [Source(**src["source"][0]) for src in srcs] if srcs else []
except Exception as e:
logger.error(f"Error fetching sources for notebook {self.id}: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to fetch sources for notebook")
@property
def notes(self) -> List["Note"]:
try:
srcs = repo_query(f"""
select * from (
select
<- note as note
from artifact
where out={self.id}
fetch note
)
order by updated desc
""")
return [Note(**src["note"][0]) for src in srcs] if srcs else []
except Exception as e:
logger.error(f"Error fetching notes for notebook {self.id}: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to fetch notes for notebook")
class Asset(BaseModel):
file_path: Optional[str] = None
url: Optional[str] = None
class SourceInsight(ObjectModel):
insight_type: str
content: str
@field_validator("insight_type")
@classmethod
def validate_insight_type(cls, v):
allowed_types = ["summary", "key_points", "analysis"] # Add more as needed
if v not in allowed_types:
raise InvalidInputError(
f"Invalid insight type. Allowed types are: {', '.join(allowed_types)}"
)
return v
class Source(ObjectModel):
table_name: ClassVar[str] = "source"
asset: Optional[Asset] = None
title: Optional[str] = None
topics: Optional[List[str]] = Field(default_factory=list)
def get_context(
self, context_size: Literal["short", "long"] = "short"
) -> Dict[str, Any]:
if context_size == "long":
return dict(
id=self.id,
title=self.title,
insights=self.insights,
full_text=self.full_text,
)
else:
return dict(id=self.id, title=self.title, insights=self.insights)
@property
def insights(self) -> List[SourceInsight]:
try:
result = repo_query(
"""
SELECT * FROM source_insight WHERE source=$id
""",
{"id": self.id},
)
return [SourceInsight(**insight) for insight in result]
except Exception as e:
logger.error(f"Error fetching insights for source {self.id}: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to fetch insights for source")
@property
def full_text(self) -> str:
try:
results = []
chunk_indexes = repo_query(
"""
select order
from source_chunk
where source=$id
order by order
""",
{"id": self.id},
)
for chunk_index in chunk_indexes:
chunk = repo_query(
f"""
select content
from source_chunk
where source={self.id} and order={chunk_index['order']}
"""
)
results.append(chunk[0]["content"])
return "".join(results)
except Exception as e:
logger.error(f"Error fetching full text for source {self.id}: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to fetch full text for source")
def add_to_notebook(self, notebook_id: str) -> Any:
if not notebook_id:
raise InvalidInputError("Notebook ID must be provided")
return self.relate("reference", notebook_id)
def save_chunks(self, text: str) -> None:
if not text:
raise InvalidInputError("Text cannot be empty")
try:
chunks = split_text(text, chunk=500000, overlap=1000)
logger.debug(f"Split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
logger.debug(f"Saving chunk {i}")
repo_create(
"source_chunk",
{"source": self.id, "order": i, "content": surreal_clean(chunk)},
)
except Exception as e:
logger.error(f"Error saving chunks for source {self.id}: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to save chunks for source")
def vectorize(self) -> None:
try:
full_text = self.full_text
if not full_text:
return
chunks = split_text(
self.full_text,
chunk=int(os.environ.get("EMBEDDING_CHUNK_SIZE", 1000)),
overlap=int(os.environ.get("EMBEDDING_CHUNK_OVERLAP", 50)),
)
logger.debug(f"Split into {len(chunks)} chunks")
# future: we can increase the batch size after surreal launches their new SDK
for i, chunk in enumerate(chunks):
repo_create(
"source_embedding",
{
"source": self.id,
"order": i,
"content": surreal_clean(chunk),
"embedding": get_embedding(chunk),
},
)
except Exception as e:
logger.error(f"Error vectorizing source {self.id}: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to vectorize source")
@classmethod
def search(cls, query: str) -> List[Dict[str, Any]]:
if not query:
raise InvalidInputError("Search query cannot be empty")
try:
result = repo_query(
"""
SELECT * omit full_text
FROM source
WHERE string::lowercase(title) CONTAINS $query or title @@ $query
OR string::lowercase(summary) CONTAINS $query or summary @@ $query
OR string::lowercase(full_text) CONTAINS $query or full_text @@ $query
""",
{"query": query},
)
return result
except Exception as e:
logger.error(f"Error searching sources: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to search sources")
def _add_insight(self, insight_type: str, content: str) -> Any:
if not insight_type or not content:
raise InvalidInputError("Insight type and content must be provided")
try:
embedding = get_embedding(content)
return repo_create(
"source_insight",
{
"source": self.id,
"insight_type": insight_type,
"content": surreal_clean(content),
"embedding": embedding,
},
)
except Exception as e:
logger.error(f"Error adding insight to source {self.id}: {str(e)}")
raise DatabaseOperationError("Failed to add insight to source")
def summarize(self) -> "Source":
try:
config = RunnableConfig(configurable=dict(thread_id=self.id))
result = summarizer.invoke({"content": self.full_text}, config=config)[
"summary"
]
self._add_insight("summary", surreal_clean(result.summary))
self.title = surreal_clean(result.title)
self.topics = result.topics
self.save()
return self
except Exception as e:
logger.error(f"Error summarizing source {self.id}: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to summarize source")
class Note(ObjectModel):
table_name: ClassVar[str] = "note"
title: Optional[str] = None
note_type: Optional[Literal["human", "ai"]] = "human"
content: Optional[str] = None
@field_validator("content")
@classmethod
def content_must_not_be_empty(cls, v):
if v is not None and not v.strip():
raise InvalidInputError("Note content cannot be empty")
return v
def add_to_notebook(self, notebook_id: str) -> Any:
if not notebook_id:
raise InvalidInputError("Notebook ID must be provided")
return self.relate("artifact", notebook_id)
def get_context(
self, context_size: Literal["short", "long"] = "short"
) -> Dict[str, Any]:
if context_size == "long":
return dict(id=self.id, title=self.title, content=self.content)
else:
return dict(
id=self.id,
title=self.title,
content=self.content[:100] if self.content else None,
)
def needs_embedding(self) -> bool:
return True
def get_embedding_content(self) -> Optional[str]:
return self.content
def text_search(
keyword: str, max_results: int, source: bool = True, note: bool = True
) -> List[Dict[str, Any]]:
if not keyword:
raise InvalidInputError("Search keyword cannot be empty")
try:
results = repo_query(
"""
SELECT * FROM fn::text_search($keyword, $results, $source, $note);
""",
{"keyword": keyword, "results": max_results, "source": source, "note": note},
)
return results
except Exception as e:
logger.error(f"Error performing text search: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to perform text search")
def vector_search(
keyword: str, max_results: int, source: bool = True, note: bool = True
) -> List[Dict[str, Any]]:
if not keyword:
raise InvalidInputError("Search keyword cannot be empty")
try:
results = repo_query(
"""
SELECT * FROM fn::vector_search($keyword, $results, $source, $note);
""",
{"keyword": keyword, "results": max_results, "source": source, "note": note},
)
return results
except Exception as e:
logger.error(f"Error performing vector search: {str(e)}")
logger.exception(e)
raise DatabaseOperationError("Failed to perform vector search")

@@ -0,0 +1,64 @@
class OpenNotebookError(Exception):
"""Base exception class for Open Notebook errors."""
pass
class DatabaseOperationError(OpenNotebookError):
"""Raised when a database operation fails."""
pass
class InvalidInputError(OpenNotebookError):
"""Raised when invalid input is provided."""
pass
class NotFoundError(OpenNotebookError):
"""Raised when a requested resource is not found."""
pass
class AuthenticationError(OpenNotebookError):
"""Raised when there's an authentication problem."""
pass
class ConfigurationError(OpenNotebookError):
"""Raised when there's a configuration problem."""
pass
class ExternalServiceError(OpenNotebookError):
"""Raised when an external service (e.g., AI model) fails."""
pass
class RateLimitError(OpenNotebookError):
"""Raised when a rate limit is exceeded."""
pass
class FileOperationError(OpenNotebookError):
"""Raised when a file operation fails."""
pass
class NetworkError(OpenNotebookError):
"""Raised when a network operation fails."""
pass
class InvalidDatabaseSchema(OpenNotebookError):
"""Raised when the database is not under the expected schema."""
pass
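Because every exception above derives from OpenNotebookError, callers can catch the whole family with a single handler. A minimal sketch (the two classes are re-declared inline so the snippet runs on its own; `fetch_or_report` is a hypothetical caller):

```python
class OpenNotebookError(Exception):
    """Base exception class for Open Notebook errors."""


class DatabaseOperationError(OpenNotebookError):
    """Raised when a database operation fails."""


def fetch_or_report():
    # Any error in the Open Notebook family collapses to one branch
    try:
        raise DatabaseOperationError("query failed")
    except OpenNotebookError as e:
        return f"handled: {e}"


print(fetch_or_report())  # → handled: query failed
```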

@@ -0,0 +1,52 @@
import os
from langchain_core.runnables import (
RunnableConfig,
)
from langchain_openai import ChatOpenAI
from langgraph.graph import END, START, StateGraph
from loguru import logger
from typing_extensions import TypedDict
from open_notebook.domain import Note, Notebook, Source
from open_notebook.prompter import Prompter
class AskState(TypedDict):
doc_id: str
doc_content: str
question: str
answer: str
notebook: Notebook
def call_model_with_messages(state: AskState, config: RunnableConfig) -> dict:
model = ChatOpenAI(
model=os.environ.get("RETRIEVAL_MODEL", os.environ["DEFAULT_MODEL"]),
temperature=0,
)
system_prompt = Prompter(prompt_template="ask_content").render(data=state)
logger.debug(f"System prompt: {system_prompt}")
ai_message = model.invoke(system_prompt)
return {"answer": ai_message}
# todo: there is probably a better way to do this and avoid repetition
def get_content(state: AskState) -> dict:
doc_id = state["doc_id"]
if "note:" in doc_id:
doc = Note.get(id=doc_id)
elif "source:" in doc_id:
doc = Source.get(id=doc_id)
else:
raise ValueError(f"Unsupported document id: {doc_id}")
doc_content = doc.get_context("long") if doc else None
return {"doc_content": doc_content}
agent_state = StateGraph(AskState)
agent_state.add_node("get_content", get_content)
agent_state.add_node("agent", call_model_with_messages)
agent_state.add_edge(START, "get_content")
agent_state.add_edge("get_content", "agent")
agent_state.add_edge("agent", END)
graph = agent_state.compile()

@@ -0,0 +1,74 @@
import os
import sqlite3
from typing import Annotated, List, Optional
from langchain_core.runnables import (
RunnableConfig,
)
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.graph import START, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition
from loguru import logger
from pydantic import BaseModel, Field
from typing_extensions import TypedDict
from open_notebook.domain import Notebook
from open_notebook.graphs.tools import ask_the_document, get_current_timestamp
from open_notebook.prompter import Prompter
tools = [get_current_timestamp, ask_the_document]
tool_node = ToolNode(tools)
class ChatResponse(BaseModel):
"""Respond to the user with this"""
title: Optional[str] = Field(
description="A title to be used if your answer would become a new note on the project"
)
message: str = Field(
description="The actual message you'd like to reply to the user"
)
citations: Optional[List[str]] = Field(
description="The ids for the documents you used to formulate your answer"
)
class ThreadState(TypedDict):
messages: Annotated[list, add_messages]
notebook: Optional[Notebook]
context: Optional[str]
context_config: Optional[dict]
response: Optional[ChatResponse]
def call_model_with_messages(state: ThreadState, config: RunnableConfig) -> dict:
model = ChatOpenAI(model=os.environ["DEFAULT_MODEL"], temperature=0).bind_tools(
tools
)
messages = state["messages"]
system_prompt = Prompter(prompt_template="chat").render(data=state)
logger.warning(f"System prompt: {system_prompt}")
ai_message = model.invoke([system_prompt] + messages)
return {"messages": ai_message}
conn = sqlite3.connect(
os.environ.get("CHECKPOINT_DATA_PATH", "sqlite-db/checkpoints.sqlite"),
check_same_thread=False,
)
memory = SqliteSaver(conn)
agent_state = StateGraph(ThreadState)
agent_state.add_node("agent", call_model_with_messages)
agent_state.add_node("tools", tool_node)
agent_state.add_edge(START, "agent")
agent_state.add_conditional_edges(
"agent",
tools_condition,
)
agent_state.add_edge("tools", "agent")
graph = agent_state.compile(checkpointer=memory)
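The agent/tools cycle this graph compiles can be sketched in plain Python: the agent either requests a tool (tools_condition routes to the tool node, which feeds its result back) or produces a final answer and the loop ends. `fake_agent` and `fake_tool` are stand-ins for the LLM and tool node:

```python
def fake_agent(messages):
    # Stand-in for the bound LLM: request the clock tool once, then answer
    if not any(m.startswith("tool:") for m in messages):
        return {"tool_call": "get_current_timestamp"}
    return {"final": "done"}


def fake_tool(name):
    # Stand-in for ToolNode: run the tool and return its message
    return f"tool:{name}:result"


def run_loop(question):
    messages = [question]
    while True:
        step = fake_agent(messages)            # "agent" node
        if "tool_call" in step:                # tools_condition -> "tools"
            messages.append(fake_tool(step["tool_call"]))
            continue                           # "tools" -> "agent" edge
        return step["final"]                   # no tool call -> END
```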

@@ -0,0 +1,217 @@
import re
import fitz # type: ignore
import magic
import requests # type: ignore
from langgraph.graph import END, START, StateGraph
from typing_extensions import TypedDict
from youtube_transcript_api import YouTubeTranscriptApi # type: ignore
from youtube_transcript_api.formatters import TextFormatter # type: ignore
class SourceState(TypedDict):
content: str
file_path: str
url: str
source_type: str
identified_type: str
identified_provider: str
def source_identification(state: SourceState):
"""
Identify the content source based on parameters
"""
if state.get("content"):
doc_type = "text"
elif state.get("file_path"):
doc_type = "file"
elif state.get("url"):
doc_type = "url"
else:
raise ValueError("No source provided.")
return {"source_type": doc_type}
def url_provider(state: SourceState):
"""
Identify the provider
"""
return_dict = {}
url = state.get("url")
if url:
if "youtube.com" in url or "youtu.be" in url:
return_dict["identified_type"] = (
"youtube" # playlists, channels in the future
)
else:
return_dict["identified_type"] = "article"
# article providers in the future
return return_dict
def file_type(state: SourceState):
"""
Identify the file using python-magic
"""
return_dict = {}
file_path = state.get("file_path")
if file_path is not None:
return_dict["identified_type"] = magic.from_file(file_path, mime=True)
return return_dict
def _extract_text_from_pdf(pdf_path):
doc = fitz.open(pdf_path)
text = ""
for page in doc:
text += page.get_text()
doc.close()
return text
def extract_pdf(state: SourceState):
"""
Extract the text content from a PDF file.
"""
return_dict = {}
if (
state.get("file_path") is not None
and state.get("identified_type") == "application/pdf"
):
file_path = state.get("file_path")
try:
text = _extract_text_from_pdf(file_path)
return_dict["content"] = text
except FileNotFoundError:
raise FileNotFoundError(f"File not found at {file_path}")
except Exception as e:
raise Exception(f"An error occurred: {e}")
return return_dict
def extract_url(state: SourceState):
"""
Get the content of a URL
"""
response = requests.get(f"https://r.jina.ai/{state.get('url')}")
return {"content": response.text}
def extract_txt(state: SourceState):
"""
Read the text file and return its content.
"""
return_dict = {}
if (
state.get("file_path") is not None
and state.get("identified_type") == "text/plain"
):
file_path = state.get("file_path")
if file_path is not None:
try:
with open(file_path, "r", encoding="utf-8") as file:
content = file.read()
return_dict["content"] = content
except FileNotFoundError:
raise FileNotFoundError(f"File not found at {file_path}")
except Exception as e:
raise Exception(f"An error occurred: {e}")
return return_dict
def _extract_youtube_id(url):
"""
Extract the YouTube video ID from a given URL using regular expressions.
Args:
url (str): The YouTube URL from which to extract the video ID.
Returns:
str: The extracted YouTube video ID or None if no valid ID is found.
"""
# Define a regular expression pattern to capture the YouTube video ID
youtube_regex = (
r"(?:https?://)?" # Optional scheme
r"(?:www\.)?" # Optional www.
r"(?:"
r"youtu\.be/" # Shortened URL
r"|youtube\.com" # Main URL
r"(?:" # Group start
r"/embed/" # Embed URL
r"|/v/" # Older video URL
r"|/watch\?v=" # Standard watch URL
r"|/watch\?.+&v=" # Other watch URL
r")" # Group end
r")" # End main group
r"([\w-]{11})" # 11 characters (YouTube video ID)
)
# Search the URL for the pattern
match = re.search(youtube_regex, url)
# Return the video ID if a match is found
return match.group(1) if match else None
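A quick sanity check of the pattern above, with the regex restated inline so the snippet is self-contained (the sample URLs are hypothetical):

```python
import re

# Same pattern as _extract_youtube_id, condensed into one raw string
YOUTUBE_REGEX = (
    r"(?:https?://)?"
    r"(?:www\.)?"
    r"(?:youtu\.be/|youtube\.com(?:/embed/|/v/|/watch\?v=|/watch\?.+&v=))"
    r"([\w-]{11})"
)


def video_id(url):
    match = re.search(YOUTUBE_REGEX, url)
    return match.group(1) if match else None


# Hypothetical URLs covering the supported shapes
assert video_id("https://www.youtube.com/watch?v=abcdefghijk") == "abcdefghijk"
assert video_id("https://youtu.be/abcdefghijk") == "abcdefghijk"
assert video_id("https://www.youtube.com/embed/abcdefghijk") == "abcdefghijk"
assert video_id("https://example.com/page") is None
```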
def extract_youtube_transcript(state: SourceState):
"""
Fetch the YouTube transcript and format it as plain text.
"""
transcript = YouTubeTranscriptApi.get_transcript(
_extract_youtube_id(state.get("url")), languages=["pt", "en"]
)
formatter = TextFormatter()
return {"content": formatter.format_transcript(transcript)}
def should_continue(data: SourceState):
if data.get("source_type") == "url":
return "parse_url"
else:
return "end"
workflow = StateGraph(SourceState)
workflow.add_node("source", source_identification)
workflow.add_node("url_provider", url_provider)
workflow.add_node("file_type", file_type)
workflow.add_node("extract_txt", extract_txt)
workflow.add_node("extract_pdf", extract_pdf)
workflow.add_node("extract_url", extract_url)
workflow.add_node("extract_youtube_transcript", extract_youtube_transcript)
workflow.add_edge(START, "source")
workflow.add_conditional_edges(
"source",
lambda x: x.get("source_type"),
{
"url": "url_provider",
"file": "file_type",
"text": END,
},
)
workflow.add_conditional_edges(
"file_type",
lambda x: x.get("identified_type"),
{
"text/plain": "extract_txt",
"application/pdf": "extract_pdf",
},
)
workflow.add_conditional_edges(
"url_provider",
lambda x: x.get("identified_type"),
{"article": "extract_url", "youtube": "extract_youtube_transcript"},
)
workflow.add_edge("url_provider", END)
workflow.add_edge("file_type", END)
workflow.add_edge("extract_txt", END)
workflow.add_edge("extract_pdf", END)
workflow.add_edge("extract_url", END)
workflow.add_edge("extract_youtube_transcript", END)
graph = workflow.compile()

@@ -0,0 +1,96 @@
import os
from typing import List, Literal
from langchain_core.runnables import (
RunnableConfig,
)
from langchain_openai import ChatOpenAI
from langgraph.graph import END, START, StateGraph
from langgraph.prebuilt import ToolNode
from pydantic import BaseModel, Field
from typing_extensions import TypedDict
from open_notebook.graphs.tools import get_current_timestamp
from open_notebook.prompter import Prompter
from open_notebook.utils import split_text
tools = [get_current_timestamp]
tool_node = ToolNode(tools)
class SummaryResponse(BaseModel):
"""Respond to the user with this"""
summary: str = Field(description="The summary of the content")
topics: List[str] = Field(description="List of 4-7 topics related to this content")
title: str = Field(description="The title of the content")
class SummaryState(TypedDict):
chunks: List[str]
content: str
summary: SummaryResponse
def build_chunks(state: SummaryState) -> dict:
"""
Split the input text into chunks.
"""
return {
"chunks": split_text(
state["content"],
chunk=int(os.environ.get("SUMMARY_CHUNK_SIZE", 200000)),
overlap=int(os.environ.get("SUMMARY_CHUNK_OVERLAP", 1000)),
)
}
def setup_next_chunk(state: SummaryState) -> dict:
"""
Move the next chunk into the content field for processing
"""
state["content"] = state["chunks"].pop(0)
return {"chunks": state["chunks"], "content": state["content"]}
def chunk_condition(state: SummaryState) -> Literal["get_chunk", END]: # type: ignore
"""
Checks whether there are more chunks to process.
"""
if len(state["chunks"]) > 0:
return "get_chunk"
return END
# todo: build a helper method for LLM communication on all graphs
def call_model_with_messages(state: SummaryState, config: RunnableConfig) -> dict:
model = (
ChatOpenAI(
model=os.environ.get("SUMMARIZATION_MODEL", os.environ["DEFAULT_MODEL"]),
temperature=0,
)
.bind_tools(tools)
.with_structured_output(SummaryResponse)
)
system_prompt = Prompter(prompt_template="summarize").render(data=state)
ai_message = model.invoke(system_prompt)
return {"summary": ai_message}
agent_state = StateGraph(SummaryState)
agent_state.add_node("setup_chunk", build_chunks)
agent_state.add_edge(START, "setup_chunk")
agent_state.add_conditional_edges(
"setup_chunk",
chunk_condition,
)
agent_state.add_node("get_chunk", setup_next_chunk)
agent_state.add_node("agent", call_model_with_messages)
agent_state.add_edge("get_chunk", "agent")
agent_state.add_conditional_edges(
"agent",
chunk_condition,
)
graph = agent_state.compile()
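The chunk loop this graph encodes can be sketched in plain Python: pop chunks off the queue one at a time and feed each to the model node until the queue is empty. `fake_summarize` is a hypothetical stand-in for the structured-output LLM call:

```python
def fake_summarize(prior, content):
    # Hypothetical stand-in for the LLM node's structured summary
    return (prior + " | " + content) if prior else content


def summarize_in_chunks(chunks):
    # chunks: the list produced by build_chunks
    summary = ""
    while chunks:                      # chunk_condition: more chunks left?
        content = chunks.pop(0)        # setup_next_chunk
        summary = fake_summarize(summary, content)  # "agent" node
    return summary


print(summarize_in_chunks(["a", "b", "c"]))  # → a | b | c
```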

@@ -0,0 +1,24 @@
from datetime import datetime
from langchain.tools import tool
@tool
def get_current_timestamp() -> str:
"""
Returns the current timestamp in the format YYYYMMDDHHmmss.
"""
return datetime.now().strftime("%Y%m%d%H%M%S")
@tool
def ask_the_document(doc_id: str, question: str):
"""
Use this tool to ask a question to the document.
Another LLM will read the document and answer the question.
Be specific and complete in your query, since the LLM that will process it is very capable.
"""
from open_notebook.graphs.ask_content import graph
result = graph.invoke({"doc_id": doc_id, "question": question})
return result["answer"]

open_notebook/prompter.py Normal file
@@ -0,0 +1,93 @@
"""
A prompt management module using Jinja to generate complex prompts with simple templates.
"""
import os
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional, Union
from jinja2 import Environment, FileSystemLoader, Template
env = Environment(loader=FileSystemLoader(os.environ.get("PROMPT_PATH", "prompts")))
@dataclass
class Prompter:
"""
A class for managing and rendering prompt templates.
Attributes:
prompt_template (str, optional): The name of the prompt template file.
prompt_variation (str, optional): The variation of the prompt template.
prompt_text (str, optional): The raw prompt text.
template (Union[str, Template], optional): The Jinja2 template object.
"""
prompt_template: Optional[str] = None
prompt_variation: Optional[str] = "default"
prompt_text: Optional[str] = None
template: Optional[Union[str, Template]] = None
parser: Optional[Any] = None
def __init__(self, prompt_template=None, prompt_text=None):
"""
Initialize the Prompter with either a template file or raw text.
Args:
prompt_template (str, optional): The name of the prompt template file.
prompt_text (str, optional): The raw prompt text.
"""
self.prompt_template = prompt_template
self.prompt_text = prompt_text
self.setup()
def setup(self):
"""
Set up the Jinja2 template based on the provided template file or text.
Raises:
ValueError: If neither prompt_template nor prompt_text is provided.
"""
if self.prompt_template:
self.template = env.get_template(f"{self.prompt_template}.jinja")
elif self.prompt_text:
self.template = Template(self.prompt_text)
else:
raise ValueError("Prompter must have a prompt_template or prompt_text")
assert self.prompt_template or self.prompt_text, "Prompt is required"
@classmethod
def from_text(cls, text: str):
"""
Create a Prompter instance from raw text, which can contain Jinja code.
Args:
text (str): The raw prompt text.
Returns:
Prompter: A new Prompter instance.
"""
return cls(prompt_text=text)
def render(self, data) -> str:
"""
Render the prompt template with the given data.
Args:
data (dict): The data to be used in rendering the template.
Returns:
str: The rendered prompt text.
Raises:
AssertionError: If the template is not defined or not a Jinja2 Template.
"""
data["current_time"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
if self.parser:
data["format_instructions"] = self.parser.get_format_instructions()
assert self.template, "Prompter template is not defined"
assert isinstance(
self.template, Template
), "Prompter template is not a Jinja2 Template"
return self.template.render(data)

open_notebook/repository.py Normal file
@@ -0,0 +1,109 @@
import asyncio
import os
from contextlib import asynccontextmanager
from loguru import logger
from surrealdb import Surreal
from open_notebook.exceptions import InvalidDatabaseSchema
EXPECTED_VERSION = "0.0.1"
@asynccontextmanager
async def db_connection():
db = Surreal(os.environ["SURREAL_ADDRESS"])
try:
await db.connect()
await db.signin(
{"user": os.environ["SURREAL_USER"], "pass": os.environ["SURREAL_PASS"]}
)
await db.use(os.environ["SURREAL_NAMESPACE"], os.environ["SURREAL_DATABASE"])
yield db
finally:
await db.close()
def repo_query(query_str, vars=None):
async def _query():
async with db_connection() as db:
result = await db.query(query_str, vars)
return result
result = asyncio.run(_query())
return result[0]["result"]
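Each repo_* helper wraps an async connection inside asyncio.run so callers stay synchronous. The pattern can be illustrated with stdlib-only code, where `fake_connection` stands in for db_connection:

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def fake_connection():
    # Stand-in for db_connection(): acquire, yield, release
    db = {"greeting": "hello"}
    try:
        yield db
    finally:
        db.clear()  # release the "connection"


def sync_query(key):
    async def _query():
        async with fake_connection() as db:
            return db[key]
    return asyncio.run(_query())


print(sync_query("greeting"))  # → hello
```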
def check_version():
async def _check_version():
async with db_connection() as db:
result = await db.query("select * from open_notebook:database_info;")
return result
try:
result = asyncio.run(_check_version())
if len(result) == 0 or len(result[0]["result"]) == 0:
raise InvalidDatabaseSchema("Database schema not found")
version = result[0]["result"][0]["version"]
logger.info(f"Connected to SurrealDB, using schema version {version}")
if version != EXPECTED_VERSION:
raise InvalidDatabaseSchema(
f"Version mismatch. Expected {EXPECTED_VERSION}, got {version}"
)
except Exception as e:
logger.error(e)
raise e
def repo_create(table, data):
async def _create():
async with db_connection() as db:
result = await db.create(table, data)
return result
result = asyncio.run(_create())
return result
def repo_update(id, data):
async def _update():
async with db_connection() as db:
result = await db.update(id, data)
return result
result = asyncio.run(_update())
return result
def repo_delete(id):
async def _delete():
async with db_connection() as db:
result = await db.delete(id)
return result
result = asyncio.run(_delete())
return result
def repo_relate(source, relationship, target):
async def _relate():
async with db_connection() as db:
query = f"RELATE {source}->{relationship}->{target};"
result = await db.query(query)
return result
result = asyncio.run(_relate())
return result
def execute_migration():
async def _query():
content = None
with open("db_setup.surrealql", "r") as file:
content = file.read()
async with db_connection() as db:
result = await db.query(content)
return result
result = asyncio.run(_query())
return result[0]["result"]

open_notebook/utils.py Normal file
@@ -0,0 +1,83 @@
from langchain_text_splitters import CharacterTextSplitter
from openai import OpenAI
client = OpenAI()
def split_text(txt: str, chunk=1000, overlap=0, separator=" "):
"""
Split the input text into chunks.
Args:
txt (str): The input text to be split.
chunk (int): The size of each chunk. Default is 1000.
overlap (int): The number of characters to overlap between chunks. Default is 0.
separator (str): The separator to use when splitting the text. Default is " ".
Returns:
list: A list of text chunks.
"""
text_splitter = CharacterTextSplitter(
chunk_size=chunk, chunk_overlap=overlap, separator=separator
)
return text_splitter.split_text(txt)
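What chunk and overlap mean can be shown with a simplified character-window sketch; note the real CharacterTextSplitter also merges on the separator, so its output can differ:

```python
def window_chunks(txt, chunk=4, overlap=1):
    # Each window starts (chunk - overlap) characters after the previous one
    step = chunk - overlap
    return [txt[i:i + chunk] for i in range(0, len(txt), step)]


print(window_chunks("abcdefghij"))  # → ['abcd', 'defg', 'ghij', 'j']
```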
def token_count(input_string):
"""
Count the number of tokens in the input string using the 'o200k_base' encoding.
Args:
input_string (str): The input string to count tokens for.
Returns:
int: The number of tokens in the input string.
"""
import tiktoken
encoding = tiktoken.get_encoding("o200k_base")
tokens = encoding.encode(input_string)
token_count = len(tokens)
return token_count
def token_cost(token_count, cost_per_million=0.150):
"""
Calculate the cost of tokens based on the token count and cost per million tokens.
Args:
token_count (int): The number of tokens.
cost_per_million (float): The cost per million tokens. Default is 0.150.
Returns:
float: The calculated cost for the given token count.
"""
return cost_per_million * (token_count / 1_000_000)
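A quick arithmetic check of token_cost, restated inline so it runs standalone: at the default rate of $0.150 per million tokens, 1.5M tokens cost $0.225.

```python
def token_cost(token_count, cost_per_million=0.150):
    # Linear cost: rate per million tokens times millions of tokens
    return cost_per_million * (token_count / 1_000_000)


print(round(token_cost(1_500_000), 3))  # → 0.225
```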
def get_embedding(text, model="text-embedding-3-small"):
"""
Get the embedding for the input text using the specified model.
Args:
text (str): The input text to get the embedding for.
model (str): The name of the embedding model to use. Default is "text-embedding-3-small".
Returns:
list: The embedding vector for the input text.
"""
text = text.replace("\n", " ")
return client.embeddings.create(input=[text], model=model).data[0].embedding
def surreal_clean(text):
"""
Clean the input text by escaping colons for SurrealDB compatibility.
Args:
text (str): The input text to clean.
Returns:
str: The cleaned text with escaped colons.
"""
text = text.replace(":", "\\:")
return text

pages/2_📒_Notebooks.py Normal file
@@ -0,0 +1,115 @@
import streamlit as st
from humanize import naturaltime
from open_notebook.domain import Notebook
from stream_app.chat import chat_sidebar
from stream_app.note import add_note, note_card
from stream_app.source import add_source, source_card
from stream_app.utils import setup_stream_state
st.set_page_config(
layout="wide", page_title="📒 Open Notebook", initial_sidebar_state="expanded"
)
def notebook_header(current_notebook):
c1, c2, c3 = st.columns([8, 2, 2])
c1.header(current_notebook.name)
if c2.button("Back to the list", icon="🔙"):
st.session_state["current_notebook"] = None
st.rerun()
if c3.button("Refresh", icon="🔄"):
st.rerun()
current_description = current_notebook.description
with st.expander(
current_description
if current_description
else "click to add a description"
):
notebook_name = st.text_input("Name", value=current_notebook.name)
notebook_description = st.text_area(
"Description",
value=current_description,
placeholder="Add as much context as you can as this will be used by the AI to generate insights.",
)
if st.button("Save", key="edit_notebook"):
current_notebook.name = notebook_name
current_notebook.description = notebook_description
current_notebook.save()
st.rerun()
if st.button("Delete forever", icon="☠️"):
current_notebook.delete()
st.session_state["current_notebook"] = None
st.rerun()
def notebook_page(current_notebook_id):
current_notebook: Notebook = Notebook.get(current_notebook_id)
if not current_notebook:
st.error("Notebook not found")
return
if current_notebook_id not in st.session_state.keys():
st.session_state[current_notebook_id] = current_notebook
session_id = st.session_state["active_session"]
st.session_state[session_id]["notebook"] = current_notebook
sources = current_notebook.sources
notes = current_notebook.notes
notebook_header(current_notebook)
work_tab, chat_tab = st.columns([4, 2])
with work_tab:
sources_tab, notes_tab = st.columns(2)
with sources_tab:
with st.container(border=True):
if st.button("Add Source", icon=""):
add_source(session_id)
for source in sources:
source_card(session_id=session_id, source=source)
with notes_tab:
with st.container(border=True):
if st.button("Write a Note", icon="📝"):
add_note(session_id)
for note in notes:
note_card(session_id=session_id, note=note)
with chat_tab:
chat_sidebar(session_id=session_id)
if "current_notebook" not in st.session_state:
st.session_state["current_notebook"] = None
if st.session_state["current_notebook"]:
notebook_page(st.session_state["current_notebook"])
st.stop()
st.title("📒 My Notebooks")
st.caption("Here are all your notebooks")
notebooks = Notebook.get_all()
for notebook in notebooks:
with st.container(border=True):
st.subheader(notebook.name)
st.caption(
f"Created: {naturaltime(notebook.created)}, updated: {naturaltime(notebook.updated)}"
)
st.write(notebook.description)
if st.button("Open", key=f"open_notebook_{notebook.id}"):
setup_stream_state(notebook.id)
st.session_state["current_notebook"] = notebook.id
st.rerun()
with st.container(border=True):
new_notebook_title = st.text_input("New Notebook Name")
new_notebook_description = st.text_area("Description")
if st.button("Create a new Notebook", icon=""):
notebook = Notebook(
name=new_notebook_title, description=new_notebook_description
)
notebook.save()
st.rerun()

pages/3_🔍_Search.py Normal file
@@ -0,0 +1,65 @@
import streamlit as st
from open_notebook.domain import text_search, vector_search
from open_notebook.utils import get_embedding
from stream_app.note import note_list_item
from stream_app.source import source_list_item
st.set_page_config(
layout="wide", page_title="🔍 Open Notebook", initial_sidebar_state="expanded"
)
# search_tab, ask_tab = st.tabs(["Search", "Ask"])
# notebooks = Notebook.get_all()
if "search_results" not in st.session_state:
st.session_state["search_results"] = []
# with search_tab:
with st.container(border=True):
st.subheader("🔍 Search")
st.caption("Search your knowledge base for specific keywords or concepts")
search_term = st.text_input("Search", "")
search_type = st.radio("Search Type", ["Text Search", "Vector Search"])
search_sources = st.checkbox("Search Sources", value=True)
search_notes = st.checkbox("Search Notes", value=True)
if st.button("Search"):
if search_type == "Text Search":
st.write(f"Searching for {search_term}")
st.session_state["search_results"] = text_search(
search_term, 100, search_sources, search_notes
)
elif search_type == "Vector Search":
st.write(f"Searching for {search_term}")
embed_query = get_embedding(search_term)
st.session_state["search_results"] = vector_search(
embed_query, 100, search_sources, search_notes
)
for item in st.session_state["search_results"]:
score = item.get("relevance", item.get("similarity", 0))
if item.get("item_id"):
if "source:" in item["item_id"]:
source_list_item(item["item_id"], score)
elif "note:" in item["item_id"]:
note_list_item(item["item_id"], score)
# coming soon
# with ask_tab:
# with st.form(key="ask_form"):
# st.subheader("Ask Your Knowledge Base")
# st.caption("Let the LLM formulate an answer based on your query")
# question = st.text_input("Your question", "")
# notebooks = st.multiselect(
# "Notebooks",
# notebooks,
# notebooks,
# format_func=lambda x: x.name,
# )
# search_sources = st.multiselect(
# "Use Sources",
# ["Sources", "Notes"],
# ["Sources", "Notes"],
# )
# if st.form_submit_button("Search"):
# st.write(f"Searching for {search_term}")

poetry.lock generated Normal file
File diff suppressed because it is too large

poetry.toml Normal file
@@ -0,0 +1,3 @@
[virtualenvs]
in-project = true
path = "."

prompts/ask_content.jinja Normal file
@@ -0,0 +1,26 @@
# BACKGROUND
You are a cognitive assistant that helps me study and research.
# OUR WORKING FRAMEWORK
You have access to some information about the project I am working on
as well as the content of a specific item I am interested in.
Your goal is to respond to the question using purely the content in your CONTEXT.
If the content in CONTEXT is not enough to answer the question, do not make up any information and just reply that you can't answer that.
Kindly tell the user what sort of things you'd be able to talk about.
# PROJECT INFO
{{ notebook }}
# CONTENT
{{ doc_content }}
# QUESTION
{{ question }}

prompts/chat.jinja Normal file
@@ -0,0 +1,45 @@
# BACKGROUND
You are a cognitive assistant that helps me study and research.
# OUR WORKING FRAMEWORK
We are working within a virtual Notebook,
which is a learning workspace for a specific project.
You have access to some information about the project,
the contents that are selected for discussion, and relevant contexts.
Your goal is to respond to the user's commands and questions,
using purely the content in your context.
# YOUR TOOLS
You might find that some of the documents in the CONTEXT are worth an extra look. For that, you can use the `ask_the_document` tool.
Just ask the question as if you were talking to someone who knows the document deeply, and the tool will provide you with the answer.
Use the document id to specify which source or note you'd like to ask about.
# INSTRUCTIONS
- You can call tools until you are satisfied with the information
- Your answer has an optional field called title. Only use this field if you believe your answer is important enough to be saved as a note. If it's just a quick chat, send an empty string.
- Please add to the citations list all the ids of the sources and notes you used for your answer.
# RESPONSE FORMAT
```
[ANSWER]
### CITATIONS
- id1
- id2
- etc
```
# PROJECT INFO
{{ notebook }}
# CONTEXT
{{ context }}
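The RESPONSE FORMAT above is free text, so the application has to split the answer from the citation ids itself. A minimal parsing sketch; the `parse_chat_response` helper is hypothetical and not part of the repository, only the `### CITATIONS` marker comes from the template:

```python
import re


def parse_chat_response(raw: str) -> tuple[str, list[str]]:
    """Split a chat answer from its trailing '### CITATIONS' list.

    Returns (answer_text, citation_ids). The section name follows the
    RESPONSE FORMAT in prompts/chat.jinja; this helper is a sketch.
    """
    answer, sep, tail = raw.partition("### CITATIONS")
    if not sep:
        # No citations section: the whole message is the answer
        return raw.strip(), []
    # Each citation is a markdown bullet like "- source:abc123"
    ids = re.findall(r"^-\s*(\S+)", tail, flags=re.MULTILINE)
    return answer.strip(), ids
```
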

11
prompts/spr.jinja Normal file
View file

@ -0,0 +1,11 @@
# MISSION
You are a Sparse Priming Representation (SPR) writer. An SPR is a particular kind of use of language for advanced NLP, NLU, and NLG tasks, particularly useful for the latest generation of Large Language Models (LLMs). You will be given information by the USER which you are to render as an SPR.
# THEORY
LLMs are a kind of deep neural network. They have been demonstrated to embed knowledge, abilities, and concepts, ranging from reasoning to planning, and even to theory of mind. These are called latent abilities and latent content, collectively referred to as latent space. The latent space of an LLM can be activated with the correct series of words as inputs, which will create a useful internal state of the neural network. This is not unlike how the right shorthand cues can prime a human mind to think in a certain way. Like human minds, LLMs are associative, meaning you only need to use the correct associations to "prime" another model to think in the same way.
# METHODOLOGY
Render the input as a distilled list of succinct statements, assertions, associations, concepts, analogies, and metaphors. The idea is to capture as much, conceptually, as possible but with as few words as possible. Write it in a way that makes sense to you, as the future audience will be another language model, not a human. Use complete sentences.
{# thanks to https://github.com/daveshap/SparsePrimingRepresentations #}

28
prompts/summarize.jinja Normal file
View file

@ -0,0 +1,28 @@
{% include "spr.jinja" %}
# YOUR TASK
You are part of a content summarization platform.
Sometimes, you need to summarize the content gradually, since it might be very large.
Please summarize the content below in a few sentences, making it as complete, dense, and SPR-compatible as you can.
## INSTRUCTIONS
- If the content already has a current summary, rewrite the summary to add the new information without losing the previous context
- Always make it dense and SPR compatible
- Do not reply with any feedback or message other than the summary itself
## FORMATTING INSTRUCTIONS
{{ format_instructions }}
## CONTENT
{{ content }}
## PREVIOUS SUMMARY
{{ summary }}
## SUMMARY
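The template above is designed to be applied repeatedly: each chunk is summarized together with the summary accumulated so far. A sketch of that rolling loop, where `summarize(chunk, previous_summary)` is a stand-in for the LLM call that renders this template; the defaults mirror the `SUMMARY_CHUNK_SIZE` / `SUMMARY_CHUNK_OVERLAP` settings (characters, not tokens):

```python
from typing import Callable


def rolling_summary(
    text: str,
    summarize: Callable[[str, str], str],
    chunk_size: int = 200_000,
    overlap: int = 1_000,
) -> str:
    """Summarize a long text chunk by chunk, carrying the summary forward.

    `summarize` stands in for the LLM call that renders summarize.jinja
    with CONTENT = chunk and PREVIOUS SUMMARY = the running summary.
    """
    summary = ""
    start = 0
    while start < len(text):
        chunk = text[start : start + chunk_size]
        summary = summarize(chunk, summary)
        # Step forward, keeping `overlap` characters of shared context
        start += chunk_size - overlap
    return summary
```
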

59
pyproject.toml Normal file
View file

@ -0,0 +1,59 @@
[tool.poetry]
name = "open-notebook"
version = "0.0.1"
description = "An open source implementation of a research assistant, inspired by Google Notebook LM"
authors = ["Luis Novo <lfnovo@gmail.com>"]
license = "MIT"
readme = "README.md"
classifiers = [
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.11",
]
[tool.poetry.dependencies]
python = "^3.11"
streamlit = "^1.39.0"
watchdog = "^5.0.3"
pydantic = "^2.9.2"
loguru = "^0.7.2"
icecream = "^2.1.3"
langchain = "^0.3.3"
langgraph = "^0.2.38"
humanize = "^4.11.0"
streamlit-tags = "^1.2.8"
streamlit-scrollable-textbox = "^0.0.3"
tiktoken = "^0.8.0"
streamlit-monaco = "^0.1.3"
langgraph-checkpoint-sqlite = "^2.0.0"
pymupdf = "1.24.11"
python-magic = "^0.4.27"
langdetect = "^1.0.9"
youtube-transcript-api = "^0.6.2"
surrealdb = "^0.3.2"
openai = "^1.52.0"
pre-commit = "^4.0.1"
langchain-community = "^0.3.3"
langchain-openai = "^0.2.3"
[tool.poetry.group.dev.dependencies]
ipykernel = "^6.29.5"
ruff = "^0.5.5"
mypy = "^1.11.1"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
[tool.isort]
profile = "black"
line_length = 88
[tool.ruff]
line-length = 88
[tool.ruff.lint]
select = ["E", "F", "I"]
ignore = ["E501"]

0
stream_app/__init__.py Normal file
View file

89
stream_app/chat.py Normal file
View file

@ -0,0 +1,89 @@
import streamlit as st
from langchain_core.runnables import RunnableConfig
from open_notebook.domain import Note, Source
from open_notebook.graphs.chat import graph as chat_graph
from open_notebook.utils import token_cost, token_count
# todo: build a smarter, more robust context manager function
def build_context(session_id):
st.session_state[session_id]["context"] = dict(note=[], source=[])
for id, status in st.session_state[session_id]["context_config"].items():
if not id:
continue
item_type, item_id = id.split(":")
if item_type not in ["note", "source"]:
continue
if "not in" in status:
continue
if item_type == "note":
item: Note = Note.get(id)
elif item_type == "source":
item: Source = Source.get(id)
else:
continue
if not item:
continue
if "summary" in status:
st.session_state[session_id]["context"][item_type] += [
item.get_context(context_size="short")
]
elif "full content" in status:
st.session_state[session_id]["context"][item_type] += [
item.get_context(context_size="long")
]
return st.session_state[session_id]["context"]
def execute_chat(txt_input, session_id):
current_state = st.session_state[session_id]
current_state["messages"] += [txt_input]
result = chat_graph.invoke(
input=current_state,
config=RunnableConfig(configurable={"thread_id": session_id}),
)
return result
# todo: if we end up keeping the token count, it needs to be configurable
# it would also be nice to have a total token count somewhere in the admin
def chat_sidebar(session_id):
context = build_context(session_id=session_id)
tokens = token_count(str(context))
cost = token_cost(tokens)
with st.container(border=True):
request = st.chat_input("Enter your question")
st.caption(f"Total tokens: {tokens}, cost: ${cost:.4f}")
if request:
response = execute_chat(txt_input=request, session_id=session_id)
st.session_state[session_id]["messages"] = response["messages"]
for msg in st.session_state[session_id]["messages"][::-1]:
if msg.type not in ["human", "ai"]:
continue
if not msg.content:
continue
with st.chat_message(name=msg.type):
st.write(msg.content)
if msg.type == "ai":
if st.button("💾 New Note", key=f"render_save_{msg.id}"):
title = "New Note"
content = msg.content
note = Note(
title=title,
content=content,
note_type="ai",
)
note.save()
note.add_to_notebook(
st.session_state[session_id]["notebook"].id
)
st.rerun()

5
stream_app/consts.py Normal file
View file

@ -0,0 +1,5 @@
context_icons = [
"⛔ not in context",
"🟡 summary",
"🟢 full content",
]

89
stream_app/note.py Normal file
View file

@ -0,0 +1,89 @@
import streamlit as st
from humanize import naturaltime
from loguru import logger
from streamlit_monaco import st_monaco # type: ignore
from open_notebook.domain import Note
from .consts import context_icons
@st.dialog("Write a Note", width="large")
def add_note(session_id):
note_title = st.text_input("Title")
note_content = st.text_area("Content")
if st.button("Save", key="add_note"):
logger.debug("Adding note")
note = Note(title=note_title, content=note_content, note_type="human")
note.save()
note.add_to_notebook(st.session_state[session_id]["notebook"].id)
st.rerun()
@st.dialog("Note", width="large")
def note_panel(session_id=None, note_id=None):
if note_id:
note: Note = Note.get(note_id)
else:
note: Note = Note()
t_preview, t_edit = st.tabs(["Preview", "Edit"])
with t_preview:
st.subheader(note.title)
st.markdown(note.content)
with t_edit:
note.title = st.text_input("Title", value=note.title)
note.content = st_monaco(
value=note.content, height="600px", language="markdown"
)
if st.button("Save", key=f"edit_note_{note_id}"):
logger.debug("Editing note")
is_new_note = not note.id
note.save()
if is_new_note:
note.add_to_notebook(st.session_state[session_id]["notebook"].id)
st.rerun()
if st.button("Delete", key=f"delete_note_{note_id}"):
logger.debug("Deleting note")
note.delete()
st.rerun()
def note_card(session_id, note):
if note.note_type == "human":
icon = "🤵"
else:
icon = "🤖"
context_state = st.selectbox(
"Context",
label_visibility="collapsed",
options=context_icons,
index=0,
key=f"note_{note.id}",
)
with st.expander(f"{icon} **{note.title}** {naturaltime(note.updated)}"):
st.write(note.content)
with st.popover("Actions"):
if st.button("Edit Note", icon="📝", key=f"edit_note_{note.id}"):
note_panel(session_id, note.id)
if st.button("Delete", icon="🗑️", key=f"delete_options_{note.id}"):
note.delete()
st.rerun()
st.session_state[session_id]["context_config"][note.id] = context_state
def note_list_item(note_id, score=None):
logger.debug(note_id)
note: Note = Note.get(note_id)
if note.note_type == "human":
icon = "🤵"
else:
icon = "🤖"
score_label = f"[{score:.2f}] " if score is not None else ""
with st.expander(
f"{icon} {score_label}**{note.title}** {naturaltime(note.updated)}"
):
st.write(note.content)
if st.button("Edit Note", icon="📝", key=f"x_edit_note_{note.id}"):
note_panel(note_id=note.id)

161
stream_app/source.py Normal file
View file

@ -0,0 +1,161 @@
from pathlib import Path
import streamlit as st
import streamlit_scrollable_textbox as stx # type: ignore
from humanize import naturaltime
from loguru import logger
from streamlit_tags import st_tags # type: ignore
from open_notebook.domain import Asset, Source
from open_notebook.graphs.content_process import graph
from open_notebook.utils import token_cost, token_count
from .consts import context_icons
uploads_dir = Path("./.uploads")
uploads_dir.mkdir(parents=True, exist_ok=True)
@st.dialog("Source", width="large")
def source_panel(source_id):
source: Source = Source.get(source_id)
if not source:
st.error("Source not found")
return
title = st.empty()
if source.title:
title.subheader(source.title)
st.caption(f"Created {naturaltime(source.created)}")
# st.markdown(f"**URL:** {source.url}, **File:** {source.file_path}")
summary = st.empty()
for insight in source.insights:
summary.write(insight.insight_type)
summary.write(insight.content)
topics = source.topics or []
if len(topics) > 0:
st_tags(
label="",
text="Press enter to add more",
value=source.topics,
suggestions=source.topics,
maxtags=10,
key="1",
)
if st.button("Delete", icon="🗑️"):
source.delete()
st.rerun()
cost = token_cost(token_count(source.full_text)) * 1.2
if st.button(f"Summarize (about ${cost:.4f})", icon="📝"):
source.summarize()
st.rerun(scope="fragment")
cost_embedding = token_cost(token_count(source.full_text), 0.02)
if st.button(f"Embed (${cost_embedding:.4f})", icon="📝"):
source.vectorize()
st.success("Embedding complete")
st.subheader("Content")
stx.scrollableTextbox(source.full_text, height=300)
@st.dialog("Add a Source", width="large")
def add_source(session_id):
source_link = None
source_file = None
source_text = None
source_type = st.radio("Type", ["Link", "Upload", "Text"])
req = {}
if source_type == "Link":
source_link = st.text_input("Link")
req["url"] = source_link
elif source_type == "Upload":
source_file = st.file_uploader("Upload")
if source_file is not None:
# Get the file name and extension
file_name = source_file.name
file_extension = Path(file_name).suffix
# Generate a unique file name
base_name = Path(file_name).stem
counter = 1
new_path = uploads_dir / file_name
while new_path.exists():
new_file_name = f"{base_name}_{counter}{file_extension}"
new_path = uploads_dir / new_file_name
counter += 1
req["file_path"] = str(new_path)
# Save the file
with open(new_path, "wb") as f:
f.write(source_file.getbuffer())
else:
source_text = st.text_area("Text")
req["content"] = source_text
if st.button("Process", key="add_source"):
logger.debug("Adding source")
with st.status("Processing...", expanded=True):
st.write("Processing document...")
result = graph.invoke(req)
st.write("Saving...")
source = Source(
asset=Asset(url=req.get("url"), file_path=req.get("file_path")),
)
source.save()
source.save_chunks(result["content"])
source.add_to_notebook(st.session_state[session_id]["notebook"].id)
st.write("Summarizing...")
source.summarize()
st.rerun()
# else:
# st.stop()
def source_card(session_id, source):
icon = "🔗"
context_state = st.selectbox(
"Context",
label_visibility="collapsed",
options=context_icons,
index=0,
key=f"source_{source.id}",
)
with st.expander(f"**{source.title}**"):
st.markdown(f"{icon} Updated: {naturaltime(source.updated)}")
st.markdown("**" + ", ".join(source.topics) + "**")
for insight in source.insights:
st.write(insight.insight_type)
st.write(insight.content)
with st.popover("Actions"):
if st.button("Edit Source", icon="📝", key=source.id):
result = source_panel(source.id)
st.write(result)
if st.button("Delete", icon="🗑️", key=f"delete_options_{source.id}"):
source.delete()
st.rerun()
st.session_state[session_id]["context_config"][source.id] = context_state
def source_list_item(source_id, score=None):
source: Source = Source.get(source_id)
if not source:
st.error("Source not found")
return
icon = "🔗"
score_label = f"[{score:.2f}] " if score is not None else ""
with st.expander(
f"{icon} {score_label}**{source.title}** {naturaltime(source.updated)}"
):
for insight in source.insights:
st.markdown(f"**{insight.insight_type}**")
st.write(insight.content)
if st.button("Edit source", icon="📝", key=f"x_edit_source_{source.id}"):
source_panel(source_id=source.id)

18
stream_app/utils.py Normal file
View file

@ -0,0 +1,18 @@
import streamlit as st
from open_notebook.graphs.chat import ThreadState, graph
def setup_stream_state(session_id) -> None:
"""
Sets the value of the current session_id for langgraph thread state.
If there is no existing thread state for this session_id, it creates a new one.
"""
existing_state = graph.get_state({"configurable": {"thread_id": session_id}}).values
if len(existing_state.keys()) == 0:
st.session_state[session_id] = ThreadState(
messages=[], context=None, notebook=None, context_config={}, response=None
)
else:
st.session_state[session_id] = existing_state
st.session_state["active_session"] = session_id

1
tests/README.md Normal file
View file

@ -0,0 +1 @@
Coming Soon

66
todo.md Normal file
View file

@ -0,0 +1,66 @@
Auto summarize
# Code stuff
- Linting
# Future versions:
- Support more models besides OpenAI
- Any LLM, like CrewAI does
- Allow more than one vectorizer
- Add Gemini as the document-query model
- Allow models such as Ollama, among others
- Try using the Pydantic Output Parser
- Try removing langchain_openai and anthropic
- DB consistency
- delete notebook (what to do with its children)
- Two summaries are accumulating
- Delete children when deleting their parents
- Play with the Streamlit theme
- Docstrings
- Fix the chat when tools are used
- Implement streaming in the chat as well
- Citations: explain where the insights came from
- Use the project's purpose when summarizing
- Improve citations: explain where the insights came from
- Improve the embedding, content-cleaning, and indexing strategies
- Improve streamlit navigation and refresh
- More than one chat session?
- Database improvements: fewer, smarter tables
- Live Query for the front end
- Implement the Fabric idea of prompts and recommended questions
- Menu bar: sources, notes, projects, search, topics
- Bring in some kind of search system
- Multiple study sessions
- Improve the data views
- Use the proper queries in Surreal
- Give the models a bit of polish
- Turn everything into lambdas?
- Put information on the edges for context?
- Processing should be async
- It is breaking on large files
- It needs a queue system
- Automate the analysis process
- Support audio and video transcription
https://www.youtube.com/watch?v=mdLBr9IMmgI
- Langgraph
- Move the thread memory to SurrealDB
- More powerful strategies that combine fabric with embeddings
- A nice idea would be to use a very cheap LLM to clean up texts and a vision model to understand PDFs
----
There is a known issue with the surreal sdk for large content
FEATURES
- Recursive summarization for texts above 500k characters
- Vectorizer cost estimate for contents
- Context Manager - fine grained
- Text, vector, and hybrid search field
- Vector search on my own notes