Initial commit with all features
.dockerignore (new file, 8 lines)
@@ -0,0 +1,8 @@
notebooks/
data/
.uploads/
.venv/
.mypy_cache/
.ruff_cache/
.env
sqlite-db/
.env.example (new file, 27 lines)
@@ -0,0 +1,27 @@
# YOUR LLM API KEYS
OPENAI_API_KEY=API_KEY

# MODEL CONFIGURATIONS
# Only OpenAI models are supported for now
DEFAULT_MODEL="gpt-4o-mini" # The default model used for all features
SUMMARIZATION_MODEL="gpt-4o-mini" # The model used for summarization; defaults to DEFAULT_MODEL if empty
RETRIEVAL_MODEL="gpt-4o-mini" # The model used for retrieval; defaults to DEFAULT_MODEL if empty


# CONNECTION DETAILS FOR YOUR SURREAL DB
SURREAL_ADDRESS="ws://localhost:8000/rpc"
SURREAL_USER="root"
SURREAL_PASS="root"
SURREAL_NAMESPACE="open_notebook"
SURREAL_DATABASE="staging"

# Used by the summarization feature when the content is too big to fit in a single context window.
# It is measured in characters, not tokens.
SUMMARY_CHUNK_SIZE=200000
SUMMARY_CHUNK_OVERLAP=1000

# Used for vector embeddings.
# It is measured in characters, not tokens.
EMBEDDING_CHUNK_SIZE=1000
EMBEDDING_CHUNK_OVERLAP=50
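The four chunk settings above are measured in characters, with consecutive chunks sharing an overlap region. A minimal sketch of that kind of splitting (a hypothetical helper for illustration only; the project's actual splitter is `open_notebook.utils.split_text`):

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character chunks; consecutive chunks share `overlap` characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk reached the end of the text
        start += chunk_size - overlap
    return chunks
```

With `EMBEDDING_CHUNK_SIZE=1000` and `EMBEDDING_CHUNK_OVERLAP=50`, each chunk would repeat the final 50 characters of the previous one, so sentences cut at a boundary still appear whole in at least one chunk.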
.gitignore (new file, vendored, 118 lines)
@@ -0,0 +1,118 @@
notebooks/
data/
.uploads/
sqlite-db/
surreal-data/
docker.env

# Python-specific
*.py[cod]
__pycache__/
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# PyCharm
.idea/

# VS Code
.vscode/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# macOS
.DS_Store

# Windows
Thumbs.db
ehthumbs.db
desktop.ini

# Linux
*~

# Log files
*.log

# Database files
*.db
*.sqlite3

# Virtual environment
.python-version
.pre-commit-config.yaml (new file, 9 lines)
@@ -0,0 +1,9 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff
        args: ["--fix"]
        exclude: "templates"
      - id: ruff-format
        exclude: "templates"
.streamlit/config.toml (new file, 30 lines)
@@ -0,0 +1,30 @@
[server]

port = 8502
maxMessageSize = 500

[browser]
serverPort = 8502

# [theme]

# # The preset Streamlit theme that your custom theme inherits from.
# # One of "light" or "dark".
# base =

# # Primary accent color for interactive elements.
# primaryColor =

# # Background color for the main content area.
# backgroundColor =

# # Background color used for the sidebar and most interactive widgets.
# secondaryBackgroundColor =

# # Color used for almost all text.
# textColor =

# # Font family for all text in the app, except code blocks. One of "sans serif",
# # "serif", or "monospace".
# font =
CONTRIBUTING.md (new file, 52 lines)
@@ -0,0 +1,52 @@
# Contributing to Open Notebook

First off, thank you for considering contributing to Open Notebook! What makes open source great is that we can work together and accomplish things we could never do on our own. All suggestions are welcome.

## Code of Conduct

By participating in this project, you are expected to uphold our Code of Conduct (to be created separately).

## How Can I Contribute?

### Reporting Bugs

- Ensure the bug was not already reported by searching on GitHub under [Issues](https://github.com/yourusername/open-notebook/issues).
- If you're unable to find an open issue addressing the problem, [open a new one](https://github.com/yourusername/open-notebook/issues/new). Be sure to include a title and clear description, as much relevant information as possible, and a code sample or an executable test case demonstrating the expected behavior that is not occurring.

### Suggesting Enhancements

- Open a new issue with a clear title and detailed description of the suggested enhancement.
- Provide any relevant examples or mockups if applicable.

### Pull Requests

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. Ensure the test suite passes.
4. Make sure your code lints.
5. Open that pull request!

## Styleguides

### Git Commit Messages

- Use the present tense ("Add feature", not "Added feature")
- Use the imperative mood ("Move cursor to...", not "Moves cursor to...")
- Limit the first line to 72 characters or fewer
- Reference issues and pull requests liberally after the first line

### Python Styleguide

- Follow PEP 8 guidelines
- Use type hints where possible
- Write docstrings for all functions, classes, and modules

### Documentation Styleguide

- Use Markdown for documentation files
- Reference functions and classes appropriately

## Additional Notes

Thank you for contributing to Open Notebook!
Dockerfile (new file, 28 lines)
@@ -0,0 +1,28 @@
# Use an official Python runtime as a base image
FROM python:3.11.7-slim-bullseye

# Install system dependencies required for building certain Python packages
RUN apt-get update && apt-get install -y \
    gcc \
    curl wget libmagic-dev \
    && rm -rf /var/lib/apt/lists/*

# Set the working directory in the container to /app
WORKDIR /app

RUN pip install poetry --no-cache-dir
RUN poetry self add poetry-plugin-dotenv
RUN poetry config virtualenvs.create false
COPY pyproject.toml poetry.lock* /app/

RUN poetry install --only main
#--no-root
COPY . /app
WORKDIR /app

EXPOSE 8502

RUN mkdir -p /app/sqlite-db

CMD ["poetry", "run", "streamlit", "run", "app_home.py"]
LICENSE (new file, 17 lines)
@@ -0,0 +1,17 @@
MIT License

Copyright (c) 2024 Luis Novo

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Makefile (new file, 27 lines)
@@ -0,0 +1,27 @@
.PHONY: run check ruff database lint docker-build docker-push

# Get version from pyproject.toml
VERSION := $(shell grep -m1 version pyproject.toml | cut -d'"' -f2)
IMAGE_NAME := lfnovo/open_notebook

database:
	docker compose up -d

run:
	poetry run streamlit run app_home.py

lint:
	poetry run python -m mypy .

ruff:
	ruff check . --fix

docker-build:
	docker build . -t $(IMAGE_NAME):$(VERSION)
	docker tag $(IMAGE_NAME):$(VERSION) $(IMAGE_NAME):latest

docker-push:
	docker push $(IMAGE_NAME):$(VERSION)
	docker push $(IMAGE_NAME):latest

# Combined build and push
docker-release: docker-build docker-push
README.md (new file, 90 lines)
@@ -0,0 +1,90 @@
# Open Notebook

An open source, privacy-focused alternative to Google's Notebook LM. Why give Google more of our data when we can take control of our own research workflows?

In a world dominated by Artificial Intelligence, the ability to think 🧠 and acquire new knowledge 💡 is a skill that should not be a privilege for a few, nor restricted to a single company.

Open Notebook empowers you to manage your research, generate AI-assisted notes, and interact with your content—on your terms.

## ⚙️ Setting Up

Go to the [Setup Guide](docs/SETUP.md) to learn how to set up the tool.

## Usage Instructions

Go to the [Usage](docs/USAGE.md) page to learn how to use all features.

## 🚀 Features

- **Multi-Notebook Support**: Organize your research across multiple notebooks effortlessly.
- **Broad Content Integration**: Works with links, PDFs, TXT files, PowerPoint presentations, YouTube videos, and pasted text (audio/video support coming soon).
- **AI-Powered Notes**: Write notes yourself or let the AI assist you in generating insights.
- **Recursive Summarization**: Tackle large content by recursively summarizing it.
- **Integrated Search Engines**: Built-in full-text and vector search for faster information retrieval.
- **Fine-Grained Context Management**: Choose exactly what to share with the AI to maintain control.
- **Cost Estimation**: Estimate costs for large context processing to keep your budget in check.

### 📝 Notebook Page

Three intuitive columns to streamline your work:

1. **Sources**: Manage all research materials.
2. **Notes**: Create notes yourself or generate them with AI.
3. **Chat**: Chat with the AI, leveraging your content.

### ⚙️ Context Configuration

Take control of your data. Decide what gets sent to the AI with three context options:

- No context
- Summary only
- Full content

Plus, you can add your project description to help the AI provide more accurate and helpful responses.

### 🔍 Integrated Search for Your Items

Locate anything across your research with ease using full-text and vector-based search.

### 💬 Powerful Open Prompts

Jinja-based prompts that are easy to customize to your own preferences.

## 🌟 Coming Soon

- **Podcast Generator**: Automatically convert your notes into a podcast format.
- **Multi-Model Support**: Anthropic, Gemini, Mistral, and Ollama coming soon.
- **Enhanced Citations**: Improved layout and finer control for citations.
- **Insight Generation**: New tools for creating insights, leveraging the Fabric framework.
- **Better Embeddings & Summarization**: Smarter ways to distill information.
- **Multiple Chat Sessions**: Juggle different discussions within the same notebook.
- **Live Front-End Updates**: Real-time UI updates for a smoother experience.
- **Async Processing**: Faster UI through asynchronous content processing.
- **Improved Error Handling**: Making everything more robust.
- **Cross-Notebook Sources and Notes**: Reuse research notes across projects.
- **Bookmark Integration**: Integrate with your favorite bookmarking app.

## 💻 Tech Stack

- **Streamlit**: For the front-end (looking to move away from Streamlit; contributors welcome!).
- **SurrealDB**: Fast, scalable database solution.
- **Langchain/Langgraph**: The backbone for LLM interactions.

## 🙌 Help Wanted

We would love your contributions! Specifically, we're looking for help with:

- **Front-End Development**: Improve the UI/UX by moving beyond Streamlit.
- **Testing & Bug Fixes**: Help make Open Notebook more robust.
- **Feature Development**: Let's make the coolest note-taking tool together!

See more at [CONTRIBUTING](CONTRIBUTING.md).

## 📄 License

Open Notebook is MIT licensed. See the [LICENSE](LICENSE) file for details.

---

Your contributions, feature requests, and bug reports are always welcome. Let's build a research tool that respects our privacy and makes learning truly open for everyone. ✨
app_home.py (new file, 19 lines)
@@ -0,0 +1,19 @@
import streamlit as st

from open_notebook.exceptions import InvalidDatabaseSchema
from open_notebook.repository import check_version, execute_migration

try:
    check_version()
except InvalidDatabaseSchema as e:
    st.error(e)
    if st.button("Execute Migration.."):
        try:
            execute_migration()
            st.success("Migration executed successfully")
            st.rerun()
        except Exception as e:
            st.error(e)
    st.stop()

st.switch_page("pages/2_📒_Notebooks.py")
db_setup.surrealql (new file, 196 lines)
@@ -0,0 +1,196 @@
REMOVE table IF EXISTS source;
REMOVE table IF EXISTS reference;
REMOVE table IF EXISTS notebook;
REMOVE table IF EXISTS note;
REMOVE table IF EXISTS artifact;
REMOVE table IF EXISTS source_chunk;
REMOVE table IF EXISTS source_insight;
REMOVE ANALYZER IF EXISTS my_analyzer;
REMOVE FUNCTION IF EXISTS fn::text_search;

REMOVE INDEX IF EXISTS idx_source_full ON TABLE source_chunk;
REMOVE INDEX IF EXISTS idx_source_embed_chunk ON TABLE source_embedding;
REMOVE INDEX IF EXISTS idx_source_insight ON TABLE source_insight;
REMOVE INDEX IF EXISTS idx_note ON TABLE note;
REMOVE INDEX IF EXISTS idx_source_title ON TABLE source;
REMOVE INDEX IF EXISTS idx_note_title ON TABLE note;

DEFINE TABLE IF NOT EXISTS source SCHEMAFULL;

DEFINE FIELD asset
    ON TABLE source
    FLEXIBLE TYPE option<object>;

DEFINE FIELD title ON TABLE source TYPE option<string>;
-- DEFINE FIELD summary ON TABLE source TYPE option<string>;
DEFINE FIELD topics ON TABLE source TYPE option<array<string>>;

DEFINE FIELD created ON source DEFAULT time::now() VALUE $before OR time::now();
DEFINE FIELD updated ON source DEFAULT time::now() VALUE time::now();

-- temporary while surreal doesn't fix the sdk
DEFINE TABLE IF NOT EXISTS source_chunk SCHEMAFULL;
DEFINE FIELD source ON TABLE source_chunk TYPE record<source>;
DEFINE FIELD order ON TABLE source_chunk TYPE int;
DEFINE FIELD content ON TABLE source_chunk TYPE string;

DEFINE TABLE IF NOT EXISTS source_embedding SCHEMAFULL;
DEFINE FIELD source ON TABLE source_embedding TYPE record<source>;
DEFINE FIELD order ON TABLE source_embedding TYPE int;
DEFINE FIELD content ON TABLE source_embedding TYPE string;
DEFINE FIELD embedding ON TABLE source_embedding TYPE array<float>;

DEFINE TABLE IF NOT EXISTS source_insight SCHEMAFULL;
DEFINE FIELD source ON TABLE source_insight TYPE record<source>;
DEFINE FIELD insight_type ON TABLE source_insight TYPE string;
DEFINE FIELD content ON TABLE source_insight TYPE string;
DEFINE FIELD embedding ON TABLE source_insight TYPE array<float>;

DEFINE TABLE IF NOT EXISTS note SCHEMAFULL;

DEFINE FIELD title ON TABLE note TYPE option<string>;
DEFINE FIELD summary ON TABLE note TYPE option<string>;
DEFINE FIELD content ON TABLE note TYPE option<string>;
DEFINE FIELD embedding ON TABLE note TYPE array<float>;

DEFINE FIELD created ON note DEFAULT time::now() VALUE $before OR time::now();
DEFINE FIELD updated ON note DEFAULT time::now() VALUE time::now();

DEFINE TABLE IF NOT EXISTS notebook SCHEMAFULL;

DEFINE FIELD name ON TABLE notebook TYPE option<string>;
DEFINE FIELD description ON TABLE notebook TYPE option<string>;

DEFINE FIELD created ON notebook DEFAULT time::now() VALUE $before OR time::now();
DEFINE FIELD updated ON notebook DEFAULT time::now() VALUE time::now();

DEFINE TABLE reference
    TYPE RELATION
    FROM source TO notebook;

DEFINE TABLE artifact
    TYPE RELATION
    FROM note TO notebook;

-- TODO: understand the analyzer
DEFINE ANALYZER my_analyzer TOKENIZERS blank,class,camel,punct FILTERS snowball(english), lowercase;

DEFINE INDEX idx_source_title ON TABLE source COLUMNS title SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX idx_source_full ON TABLE source_chunk COLUMNS content SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX idx_source_embed_chunk ON TABLE source_embedding COLUMNS content SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX idx_source_insight ON TABLE source_insight COLUMNS content SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX idx_note ON TABLE note COLUMNS content SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX idx_note_title ON TABLE note COLUMNS title SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS;

DEFINE FUNCTION IF NOT EXISTS fn::text_search($query_text: string, $match_count: int, $sources: bool, $show_notes: bool) {

    let $source_title_search =
        IF $sources {(
            SELECT id as item_id, math::max(search::score(1)) AS relevance
            FROM source
            WHERE title @1@ $query_text
            GROUP BY item_id)}
        ELSE { [] };

    let $source_embedding_search =
        IF $sources {(
            SELECT source as item_id, math::max(search::score(1)) AS relevance
            FROM source_embedding
            WHERE content @1@ $query_text
            GROUP BY item_id)}
        ELSE { [] };

    let $source_chunk_search =
        IF $sources {(
            SELECT source as item_id, math::max(search::score(1)) AS relevance
            FROM source_chunk
            WHERE content @1@ $query_text
            GROUP BY item_id)}
        ELSE { [] };

    let $source_insight_search =
        IF $sources {(
            SELECT source as item_id, math::max(search::score(1)) AS relevance
            FROM source_insight
            WHERE content @1@ $query_text
            GROUP BY item_id)}
        ELSE { [] };

    let $note_title_search =
        IF $show_notes {(
            SELECT id as item_id, math::max(search::score(1)) AS relevance
            FROM note
            WHERE title @1@ $query_text
            GROUP BY item_id)}
        ELSE { [] };

    let $note_content_search =
        IF $show_notes {(
            SELECT id as item_id, math::max(search::score(1)) AS relevance
            FROM note
            WHERE content @1@ $query_text
            GROUP BY item_id)}
        ELSE { [] };

    let $source_chunk_results = array::union($source_embedding_search, $source_chunk_search);
    let $source_asset_results = array::union($source_title_search, $source_insight_search);
    let $source_results = array::union($source_chunk_results, $source_asset_results);
    let $note_results = array::union($note_title_search, $note_content_search);
    let $final_results = array::union($source_results, $note_results);

    RETURN (SELECT item_id, math::max(relevance) as relevance FROM $final_results
            GROUP BY item_id ORDER BY relevance DESC LIMIT $match_count);
};

REMOVE FUNCTION IF EXISTS fn::vector_search;

DEFINE FUNCTION IF NOT EXISTS fn::vector_search($query: array<float>, $match_count: int, $sources: bool, $show_notes: bool) {

    let $source_embedding_search =
        IF $sources {(
            SELECT source as item_id, content, vector::similarity::cosine(embedding, $query) as similarity
            FROM source_embedding LIMIT $match_count)}
        ELSE { [] };

    let $source_insight_search =
        IF $sources {(
            SELECT source as item_id, content, vector::similarity::cosine(embedding, $query) as similarity
            FROM source_insight LIMIT $match_count)}
        ELSE { [] };

    let $note_content_search =
        IF $show_notes {(
            SELECT id as item_id, content, vector::similarity::cosine(embedding, $query) as similarity
            FROM note LIMIT $match_count)}
        ELSE { [] };

    let $source_chunk_results = array::union($source_embedding_search, $source_insight_search);

    let $source_results = array::union($source_chunk_results, $source_insight_search);

    let $note_results = $note_content_search;
    let $final_results = array::union($source_results, $note_results);

    RETURN (SELECT item_id, math::max(similarity) as similarity FROM $final_results
            GROUP BY item_id ORDER BY similarity DESC LIMIT $match_count);
};

CREATE open_notebook:database_info SET
    version = "0.0.1";

UPDATE open_notebook:database_info SET
    version = "0.0.1";
docker-compose.dev.yml (new file, 22 lines)
@@ -0,0 +1,22 @@
version: '3'

services:
  surrealdb:
    image: surrealdb/surrealdb:v2
    ports:
      - "8000:8000"
    volumes:
      - ./surreal-data:/mydata
    user: "${UID}:${GID}"
    command: start --log trace --user root --pass root rocksdb:mydatabase.db
    pull_policy: always
  open_notebook:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8502"
    volumes:
      - ./docker.env:/app/.env
    depends_on:
      - surrealdb
docker-compose.yml (new file, 22 lines)
@@ -0,0 +1,22 @@
version: '3'

services:
  surrealdb:
    image: surrealdb/surrealdb:v2
    ports:
      - "8000:8000"
    volumes:
      - ./surreal-data:/mydata
    user: "${UID}:${GID}"
    command: start --log trace --user root --pass root rocksdb:mydatabase.db
    pull_policy: always
  open_notebook:
    image: lfnovo/open_notebook:latest
    ports:
      - "8080:8502"
    volumes:
      - ./docker.env:/app/.env
    depends_on:
      - surrealdb
    pull_policy: always
docs/SETUP.md (new file, 128 lines)
@@ -0,0 +1,128 @@
# Installing Open Notebook

## 📦 Installing from Source

Quickly get started by cloning the repo and installing the dependencies.

```sh
git clone https://github.com/lfnovo/open_notebook.git
cd open_notebook
poetry install
```

Make a copy of `.env.example` and rename it to `.env`.

You need to enter at least your `OPENAI_API_KEY` and the SurrealDB connection details.

```
OPENAI_API_KEY=

# CONNECTION DETAILS FOR YOUR SURREAL DB
SURREAL_ADDRESS="ws://localhost:8000/rpc"
SURREAL_USER="root"
SURREAL_PASS="root"
SURREAL_NAMESPACE="open_notebook"
SURREAL_DATABASE="staging"
```

Then, run it with:

```sh
poetry run streamlit run app_home.py
```

or the shortcut:

```sh
make run
```

## 🐳 Docker Setup

Alternatively, you can use Docker for an easy setup.
Copy the `.env.example` file and name it `docker.env`.

```sh
docker run -d \
  --name open_notebook \
  -p 8080:8502 \
  -v $(pwd)/docker.env:/app/.env \
  lfnovo/open_notebook:latest
```

You can pass the environment variables manually if you prefer:

```sh
docker run -d \
  --name open_notebook \
  -p 8080:8502 \
  -e OPENAI_API_KEY=API_KEY \
  -e DEFAULT_MODEL="gpt-4o-mini" \
  -e SURREAL_ADDRESS="ws://localhost:8000/rpc" \
  -e SURREAL_USER="root" \
  -e SURREAL_PASS="root" \
  -e SURREAL_NAMESPACE="open_notebook" \
  -e SURREAL_DATABASE="staging" \
  lfnovo/open_notebook:latest
```

If you need to run SurrealDB on Docker as well, it's easier to use docker-compose, like this:

```yaml
services:
  surrealdb:
    image: surrealdb/surrealdb:v2
    ports:
      - "8000:8000"
    volumes:
      - ./surreal-data:/mydata
    user: "${UID}:${GID}"
    command: start --log trace --user root --pass root rocksdb:mydatabase.db
    pull_policy: always
  open_notebook:
    image: lfnovo/open_notebook:latest
    ports:
      - "8080:8502"
    volumes:
      - ./docker.env:/app/.env
    depends_on:
      - surrealdb
    pull_policy: always
```

or with the environment variables:

```yaml
services:
  surrealdb:
    image: surrealdb/surrealdb:v2
    ports:
      - "8000:8000"
    volumes:
      - ./surreal-data:/mydata
    user: "${UID}:${GID}"
    command: start --log trace --user root --pass root rocksdb:mydatabase.db
    pull_policy: always
  open_notebook:
    image: lfnovo/open_notebook:latest
    ports:
      - "8080:8502"
    environment:
      - OPENAI_API_KEY=API_KEY
      - DEFAULT_MODEL=gpt-4o-mini
      - SURREAL_ADDRESS=ws://surrealdb:8000/rpc
      - SURREAL_USER=root
      - SURREAL_PASS=root
      - SURREAL_NAMESPACE=open_notebook
      - SURREAL_DATABASE=staging
    depends_on:
      - surrealdb
    pull_policy: always
```

## Running the app

After the app is running, you can access it at http://localhost:8080.

The first time you connect, it will check whether the database schema is ready. If not, it will create the schema for you.
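The SurrealDB variables documented in the setup guide above can be gathered in one place before opening a connection. A minimal sketch (a hypothetical helper written for illustration; the variable names and defaults come from `.env.example`):

```python
import os

def surreal_config(env=None) -> dict:
    """Collect the SURREAL_* connection settings, falling back to the
    defaults shown in .env.example when a variable is unset."""
    env = os.environ if env is None else env
    return {
        "address": env.get("SURREAL_ADDRESS", "ws://localhost:8000/rpc"),
        "user": env.get("SURREAL_USER", "root"),
        "password": env.get("SURREAL_PASS", "root"),
        "namespace": env.get("SURREAL_NAMESPACE", "open_notebook"),
        "database": env.get("SURREAL_DATABASE", "staging"),
    }
```

Passing a dict instead of relying on `os.environ` keeps the helper easy to test; note that inside docker-compose the address should point at the `surrealdb` service name rather than `localhost`.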
docs/USAGE.md (new file, 49 lines)
@@ -0,0 +1,49 @@
# Using Open Notebook

This first release of Open Notebook is inspired by Notebook LM, so you will find a very similar workflow.

## Creating a new notebook

![New notebook](/docs/assets/new_notebook.png)

Just type a name and description for the notebook and you are good to go. Make the description as detailed as possible, since it will be used by the LLM to understand the context of the notebook and provide you with better answers.

## Adding sources

Just click on Add Source and enter a URL, upload a file, or paste the content of your source.

![Add source](/docs/assets/add_source.png)

You'll find your new source in the first column of the Notebook page.

![Asset list](/docs/assets/asset_list.png)

## Talk to the Assistant

Once you have enough content in the notebook, you can decide which items will be visible to the LLM before sending your question.

![Context](/docs/assets/context.png)

- Not in Context: the LLM won't receive this item as part of the context.
- Summary: the LLM will receive the summary of the content, and can ask for the full document if needed.
- Full Content: the LLM will receive the full transcript of the content together with your question.

It's recommended to use the least amount of context necessary, to reduce your API spend.
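The three per-source levels above can be pictured as a simple filter over the notebook's sources. A minimal sketch (the dict shape and level names here are hypothetical, chosen only to illustrate the idea; the app's real context assembly lives elsewhere):

```python
def build_context(sources: list[dict]) -> str:
    """Assemble the chat context from per-source visibility settings.

    Each source dict carries a 'level' of 'none' (Not in Context),
    'summary', or 'full' (Full Content).
    """
    parts = []
    for src in sources:
        if src["level"] == "summary":
            parts.append(src["summary"])   # summary only
        elif src["level"] == "full":
            parts.append(src["content"])   # full transcript
        # 'none' sources are skipped entirely
    return "\n\n".join(parts)
```

Keeping most sources at 'none' or 'summary' shrinks the assembled string, which is exactly why minimal context reduces API spend.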
## Making Notes

There are two ways to make notes:

Manually, by clicking on New Note:

![Human note](/docs/assets/human_note.png)

Or by turning any LLM message into a note:

![AI note](/docs/assets/ai_note.png)

## Searching

The search page gives you an overview of all the notes you have made and the sources you have added. You can query the database by keyword or by vector search.

![Search](/docs/assets/search.png)
docs/assets/add_source.png (new binary file, 23 KiB)
docs/assets/ai_note.png (new binary file, 27 KiB)
docs/assets/asset_list.png (new binary file, 63 KiB)
docs/assets/context.png (new binary file, 24 KiB)
docs/assets/human_note.png (new binary file, 28 KiB)
docs/assets/new_notebook.png (new binary file, 41 KiB)
docs/assets/search.png (new binary file, 48 KiB)
mypy.ini (new file, 8 lines)
@@ -0,0 +1,8 @@
[mypy]
# Disable PEP 561 checks
ignore_missing_imports = True
check_untyped_defs = True

# Alternatively, you can ignore specific modules
[mypy-some_module]
ignore_missing_imports = True
open_notebook/__init__.py (new empty file, 0 lines)
433  open_notebook/domain.py  Normal file
@@ -0,0 +1,433 @@
import os
from datetime import datetime
from typing import Any, ClassVar, Dict, List, Literal, Optional, Type, TypeVar

from langchain_core.runnables.config import RunnableConfig
from loguru import logger
from pydantic import BaseModel, Field, field_validator

from open_notebook.exceptions import (
    DatabaseOperationError,
    InvalidInputError,
    NotFoundError,
)
from open_notebook.graphs.summary import graph as summarizer
from open_notebook.repository import (
    repo_create,
    repo_delete,
    repo_query,
    repo_relate,
    repo_update,
)
from open_notebook.utils import get_embedding, split_text, surreal_clean

T = TypeVar("T", bound="ObjectModel")


class ObjectModel(BaseModel):
    id: Optional[str] = None
    table_name: ClassVar[str] = ""
    created: Optional[datetime] = None
    updated: Optional[datetime] = None

    @classmethod
    def get_all(cls: Type[T]) -> List[T]:
        try:
            result = repo_query(f"SELECT * FROM {cls.table_name}")
            objects = [cls(**obj) for obj in result]
            return objects
        except Exception as e:
            logger.error(f"Error fetching all {cls.table_name}: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError(f"Failed to fetch all {cls.table_name}")

    @classmethod
    def get(cls: Type[T], id: str) -> Optional[T]:
        if not id:
            raise InvalidInputError("ID cannot be empty")
        try:
            result = repo_query(f"SELECT * FROM {id}")
            if result:
                return cls(**result[0])
            return None
        except Exception as e:
            logger.error(f"Error fetching {cls.table_name} with id {id}: {str(e)}")
            logger.exception(e)
            raise NotFoundError(f"{cls.table_name} with id {id} not found")

    def needs_embedding(self) -> bool:
        return False

    def get_embedding_content(self) -> Optional[str]:
        return None

    def save(self) -> None:
        try:
            data = self._prepare_save_data()

            if self.needs_embedding():
                embedding_content = self.get_embedding_content()
                if embedding_content:
                    data["embedding"] = get_embedding(embedding_content)

            if self.id is None:
                logger.debug("Creating new record")
                repo_result = repo_create(self.__class__.table_name, data)
            else:
                logger.debug(f"Updating record with id {self.id}")
                repo_result = repo_update(self.id, data)

            # Update the current instance with the result
            for key, value in repo_result.items():
                if hasattr(self, key):
                    setattr(self, key, value)

        except Exception as e:
            logger.error(f"Error saving {self.__class__.table_name}: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError(f"Failed to save {self.__class__.table_name}")

    def _prepare_save_data(self) -> Dict[str, Any]:
        data = self.model_dump()
        logger.debug(f"Preparing data for save: {data}")
        del data["created"]
        del data["updated"]
        return {key: value for key, value in data.items() if value is not None}

    def delete(self) -> bool:
        if self.id is None:
            raise InvalidInputError("Cannot delete object without an ID")
        try:
            logger.debug(f"Deleting record with id {self.id}")
            return repo_delete(self.id)
        except Exception as e:
            logger.error(
                f"Error deleting {self.__class__.table_name} with id {self.id}: {str(e)}"
            )
            raise DatabaseOperationError(
                f"Failed to delete {self.__class__.table_name}"
            )

    def relate(self, relationship: str, target_id: str) -> Any:
        if not relationship or not target_id:
            raise InvalidInputError("Relationship and target ID must be provided")
        try:
            return repo_relate(self.id, relationship, target_id)
        except Exception as e:
            logger.error(f"Error creating relationship: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError("Failed to create relationship")


class Notebook(ObjectModel):
    table_name: ClassVar[str] = "notebook"
    name: str
    description: str

    @field_validator("name")
    @classmethod
    def name_must_not_be_empty(cls, v):
        if not v.strip():
            raise InvalidInputError("Notebook name cannot be empty")
        return v

    @property
    def sources(self) -> List["Source"]:
        try:
            srcs = repo_query(f"""
                select * from (
                    select
                    <- source as source
                    from reference
                    where out={self.id}
                    fetch source
                )
                order by source.updated desc
            """)
            return [Source(**src["source"][0]) for src in srcs] if srcs else []
        except Exception as e:
            logger.error(f"Error fetching sources for notebook {self.id}: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError("Failed to fetch sources for notebook")

    @property
    def notes(self) -> List["Note"]:
        try:
            srcs = repo_query(f"""
                select * from (
                    select
                    <- note as note
                    from artifact
                    where out={self.id}
                    fetch note
                )
                order by updated desc
            """)
            return [Note(**src["note"][0]) for src in srcs] if srcs else []
        except Exception as e:
            logger.error(f"Error fetching notes for notebook {self.id}: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError("Failed to fetch notes for notebook")


class Asset(BaseModel):
    file_path: Optional[str] = None
    url: Optional[str] = None


class SourceInsight(ObjectModel):
    insight_type: str
    content: str

    @field_validator("insight_type")
    @classmethod
    def validate_insight_type(cls, v):
        allowed_types = ["summary", "key_points", "analysis"]  # Add more as needed
        if v not in allowed_types:
            raise InvalidInputError(
                f"Invalid insight type. Allowed types are: {', '.join(allowed_types)}"
            )
        return v


class Source(ObjectModel):
    table_name: ClassVar[str] = "source"
    asset: Optional[Asset] = None
    title: Optional[str] = None
    topics: Optional[List[str]] = Field(default_factory=list)

    def get_context(
        self, context_size: Literal["short", "long"] = "short"
    ) -> Dict[str, Any]:
        if context_size == "long":
            return dict(
                id=self.id,
                title=self.title,
                insights=self.insights,
                full_text=self.full_text,
            )
        else:
            return dict(id=self.id, title=self.title, insights=self.insights)

    @property
    def insights(self) -> List[SourceInsight]:
        try:
            result = repo_query(
                """
                SELECT * FROM source_insight WHERE source=$id
                """,
                {"id": self.id},
            )
            return [SourceInsight(**insight) for insight in result]
        except Exception as e:
            logger.error(f"Error fetching insights for source {self.id}: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError("Failed to fetch insights for source")

    @property
    def full_text(self) -> str:
        try:
            results = []
            chunk_indexes = repo_query(
                """
                select order
                from source_chunk
                where source=$id
                order by order
                """,
                {"id": self.id},
            )
            for chunk_index in chunk_indexes:
                chunk = repo_query(
                    f"""
                    select content
                    from source_chunk
                    where source={self.id} and order={chunk_index['order']}
                    """
                )
                results.append(chunk[0]["content"])
            return "".join(results)
        except Exception as e:
            logger.error(f"Error fetching full text for source {self.id}: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError("Failed to fetch full text for source")

    def add_to_notebook(self, notebook_id: str) -> Any:
        if not notebook_id:
            raise InvalidInputError("Notebook ID must be provided")
        return self.relate("reference", notebook_id)

    def save_chunks(self, text: str) -> None:
        if not text:
            raise InvalidInputError("Text cannot be empty")
        try:
            chunks = split_text(text, chunk=500000, overlap=1000)
            logger.debug(f"Split into {len(chunks)} chunks")
            for i, chunk in enumerate(chunks):
                logger.debug(f"Saving chunk {i}")
                repo_create(
                    "source_chunk",
                    {"source": self.id, "order": i, "content": surreal_clean(chunk)},
                )
        except Exception as e:
            logger.error(f"Error saving chunks for source {self.id}: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError("Failed to save chunks for source")

    def vectorize(self) -> None:
        try:
            full_text = self.full_text
            if not full_text:
                return
            # Reuse the text already fetched above instead of re-querying;
            # the overlap fallback matches the .env.example default of 50.
            chunks = split_text(
                full_text,
                chunk=int(os.environ.get("EMBEDDING_CHUNK_SIZE", 1000)),
                overlap=int(os.environ.get("EMBEDDING_CHUNK_OVERLAP", 50)),
            )
            logger.debug(f"Split into {len(chunks)} chunks")

            # future: we can increase the batch size after surreal launches their new SDK
            for i, chunk in enumerate(chunks):
                repo_create(
                    "source_embedding",
                    {
                        "source": self.id,
                        "order": i,
                        "content": surreal_clean(chunk),
                        "embedding": get_embedding(chunk),
                    },
                )
        except Exception as e:
            logger.error(f"Error vectorizing source {self.id}: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError(e)

    @classmethod
    def search(cls, query: str) -> List[Dict[str, Any]]:
        if not query:
            raise InvalidInputError("Search query cannot be empty")
        try:
            result = repo_query(
                """
                SELECT * omit full_text
                FROM source
                WHERE string::lowercase(title) CONTAINS $query or title @@ $query
                OR string::lowercase(summary) CONTAINS $query or summary @@ $query
                OR string::lowercase(full_text) CONTAINS $query or full_text @@ $query
                """,
                {"query": query},
            )
            return result
        except Exception as e:
            logger.error(f"Error searching sources: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError("Failed to search sources")

    def _add_insight(self, insight_type: str, content: str) -> Any:
        if not insight_type or not content:
            raise InvalidInputError("Insight type and content must be provided")
        try:
            embedding = get_embedding(content)
            return repo_create(
                "source_insight",
                {
                    "source": self.id,
                    "insight_type": insight_type,
                    "content": surreal_clean(content),
                    "embedding": embedding,
                },
            )
        except Exception as e:
            logger.error(f"Error adding insight to source {self.id}: {str(e)}")
            raise DatabaseOperationError("Failed to add insight to source")

    def summarize(self) -> "Source":
        try:
            config = RunnableConfig(configurable=dict(thread_id=self.id))
            result = summarizer.invoke({"content": self.full_text}, config=config)[
                "summary"
            ]
            self._add_insight("summary", surreal_clean(result.summary))
            self.title = surreal_clean(result.title)
            self.topics = result.topics
            self.save()
            return self
        except Exception as e:
            logger.error(f"Error summarizing source {self.id}: {str(e)}")
            logger.exception(e)
            raise DatabaseOperationError("Failed to summarize source")


class Note(ObjectModel):
    table_name: ClassVar[str] = "note"
    title: Optional[str] = None
    note_type: Optional[Literal["human", "ai"]] = "human"
    content: Optional[str] = None

    @field_validator("content")
    @classmethod
    def content_must_not_be_empty(cls, v):
        if v is not None and not v.strip():
            raise InvalidInputError("Note content cannot be empty")
        return v

    def add_to_notebook(self, notebook_id: str) -> Any:
        if not notebook_id:
            raise InvalidInputError("Notebook ID must be provided")
        return self.relate("artifact", notebook_id)

    def get_context(
        self, context_size: Literal["short", "long"] = "short"
    ) -> Dict[str, Any]:
        if context_size == "long":
            return dict(id=self.id, title=self.title, content=self.content)
        else:
            return dict(
                id=self.id,
                title=self.title,
                content=self.content[:100] if self.content else None,
            )

    def needs_embedding(self) -> bool:
        return True

    def get_embedding_content(self) -> Optional[str]:
        return self.content


def text_search(
    keyword: str, results: int, source: bool = True, note: bool = True
) -> List[Dict[str, Any]]:
    if not keyword:
        raise InvalidInputError("Search keyword cannot be empty")
    try:
        search_results = repo_query(
            """
            SELECT * FROM fn::text_search($keyword, $results, $source, $note);
            """,
            {"keyword": keyword, "results": results, "source": source, "note": note},
        )
        return search_results
    except Exception as e:
        logger.error(f"Error performing text search: {str(e)}")
        logger.exception(e)
        raise DatabaseOperationError("Failed to perform text search")


def vector_search(
    keyword: str, results: int, source: bool = True, note: bool = True
) -> List[Dict[str, Any]]:
    if not keyword:
        raise InvalidInputError("Search keyword cannot be empty")
    try:
        search_results = repo_query(
            """
            SELECT * FROM fn::vector_search($keyword, $results, $source, $note);
            """,
            {"keyword": keyword, "results": results, "source": source, "note": note},
        )
        return search_results
    except Exception as e:
        logger.error(f"Error performing vector search: {str(e)}")
        logger.exception(e)
        raise DatabaseOperationError("Failed to perform vector search")
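The save path in `ObjectModel.save` first runs `_prepare_save_data`, which drops the database-managed timestamps and omits unset fields from the payload. A minimal stdlib sketch of that rule — `prepare_save_data` here is a hypothetical standalone rewrite for illustration, not the repo's pydantic-backed method:

```python
from typing import Any, Dict


def prepare_save_data(data: Dict[str, Any]) -> Dict[str, Any]:
    # Mirror ObjectModel._prepare_save_data: drop the timestamps the
    # database manages, then omit unset (None) fields from the payload.
    data = dict(data)
    data.pop("created", None)
    data.pop("updated", None)
    return {key: value for key, value in data.items() if value is not None}


payload = prepare_save_data(
    {
        "id": None,
        "name": "research",
        "description": None,
        "created": "2024-01-01",
        "updated": "2024-01-02",
    }
)
```

Because only fields with real values survive, an update never overwrites existing columns with nulls, and the create/update branches can share one payload.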
64  open_notebook/exceptions.py  Normal file
@@ -0,0 +1,64 @@
class OpenNotebookError(Exception):
    """Base exception class for Open Notebook errors."""

    pass


class DatabaseOperationError(OpenNotebookError):
    """Raised when a database operation fails."""

    pass


class InvalidInputError(OpenNotebookError):
    """Raised when invalid input is provided."""

    pass


class NotFoundError(OpenNotebookError):
    """Raised when a requested resource is not found."""

    pass


class AuthenticationError(OpenNotebookError):
    """Raised when there's an authentication problem."""

    pass


class ConfigurationError(OpenNotebookError):
    """Raised when there's a configuration problem."""

    pass


class ExternalServiceError(OpenNotebookError):
    """Raised when an external service (e.g., AI model) fails."""

    pass


class RateLimitError(OpenNotebookError):
    """Raised when a rate limit is exceeded."""

    pass


class FileOperationError(OpenNotebookError):
    """Raised when a file operation fails."""

    pass


class NetworkError(OpenNotebookError):
    """Raised when a network operation fails."""

    pass


class InvalidDatabaseSchema(OpenNotebookError):
    """Raised when the database is not under the expected schema."""

    pass
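The point of rooting every error at `OpenNotebookError` is that callers can catch the whole family with one handler while still raising specific subclasses. A self-contained sketch (re-declaring a slice of the hierarchy so it runs without the package installed):

```python
# Minimal re-declaration of part of the hierarchy from exceptions.py.
class OpenNotebookError(Exception):
    """Base exception class for Open Notebook errors."""


class DatabaseOperationError(OpenNotebookError):
    """Raised when a database operation fails."""


class NotFoundError(OpenNotebookError):
    """Raised when a requested resource is not found."""


def fetch(id: str) -> str:
    # A hypothetical lookup that always fails, for demonstration.
    raise NotFoundError(f"source {id} not found")


try:
    fetch("source:123")
except OpenNotebookError as e:  # one handler covers every domain error
    message = str(e)
```

UI code can catch `OpenNotebookError` at the top level, while lower layers distinguish `NotFoundError` from `DatabaseOperationError`.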
52  open_notebook/graphs/ask_content.py  Normal file
@@ -0,0 +1,52 @@
import os

from langchain_core.runnables import (
    RunnableConfig,
)
from langchain_openai import ChatOpenAI
from langgraph.graph import END, START, StateGraph
from loguru import logger
from typing_extensions import TypedDict

from open_notebook.domain import Note, Notebook, Source
from open_notebook.prompter import Prompter


class AskState(TypedDict):
    doc_id: str
    doc_content: str
    question: str
    answer: str
    notebook: Notebook


def call_model_with_messages(state: AskState, config: RunnableConfig) -> dict:
    model = ChatOpenAI(
        model=os.environ.get("RETRIEVAL_MODEL", os.environ["DEFAULT_MODEL"]),
        temperature=0,
    )
    system_prompt = Prompter(prompt_template="ask_content").render(data=state)
    logger.debug(f"System prompt: {system_prompt}")
    ai_message = model.invoke(system_prompt)
    return {"answer": ai_message}


# todo: there is probably a better way to do this and avoid repetition
def get_content(state: AskState) -> dict:
    doc_id = state["doc_id"]
    if "note:" in doc_id:
        doc = Note.get(id=doc_id)
    elif "source:" in doc_id:
        doc = Source.get(id=doc_id)
    else:
        # Guard against ids that are neither notes nor sources, which would
        # otherwise leave `doc` unbound below.
        raise ValueError(f"Unsupported document id: {doc_id}")
    doc_content = doc.get_context("long") if doc else None
    return {"doc_content": doc_content}


agent_state = StateGraph(AskState)
agent_state.add_node("get_content", get_content)
agent_state.add_node("agent", call_model_with_messages)
agent_state.add_edge(START, "get_content")
agent_state.add_edge("get_content", "agent")
agent_state.add_edge("agent", END)

graph = agent_state.compile()
74  open_notebook/graphs/chat.py  Normal file
@@ -0,0 +1,74 @@
import os
import sqlite3
from typing import Annotated, List, Optional

from langchain_core.runnables import (
    RunnableConfig,
)
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.graph import START, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition
from loguru import logger
from pydantic import BaseModel, Field
from typing_extensions import TypedDict

from open_notebook.domain import Notebook
from open_notebook.graphs.tools import ask_the_document, get_current_timestamp
from open_notebook.prompter import Prompter

tools = [get_current_timestamp, ask_the_document]
tool_node = ToolNode(tools)


class ChatResponse(BaseModel):
    """Respond to the user with this"""

    title: Optional[str] = Field(
        description="A title to be used if your question would become a new note on the project"
    )
    message: str = Field(
        description="The actual message you'd like to reply to the user"
    )
    citations: Optional[List[str]] = Field(
        description="The ids for the documents you used to formulate your answer"
    )


class ThreadState(TypedDict):
    messages: Annotated[list, add_messages]
    notebook: Optional[Notebook]
    context: Optional[str]
    context_config: Optional[dict]
    response: Optional[ChatResponse]


def call_model_with_messages(state: ThreadState, config: RunnableConfig) -> dict:
    model = ChatOpenAI(model=os.environ["DEFAULT_MODEL"], temperature=0).bind_tools(
        tools
    )
    messages = state["messages"]
    system_prompt = Prompter(prompt_template="chat").render(data=state)
    logger.debug(f"System prompt: {system_prompt}")
    ai_message = model.invoke([system_prompt] + messages)
    return {"messages": ai_message}


conn = sqlite3.connect(
    os.environ.get("CHECKPOINT_DATA_PATH", "sqlite-db/checkpoints.sqlite"),
    check_same_thread=False,
)
memory = SqliteSaver(conn)

agent_state = StateGraph(ThreadState)
agent_state.add_node("agent", call_model_with_messages)
agent_state.add_node("tools", tool_node)
agent_state.add_edge(START, "agent")
agent_state.add_conditional_edges(
    "agent",
    tools_condition,
)
agent_state.add_edge("tools", "agent")

graph = agent_state.compile(checkpointer=memory)
217  open_notebook/graphs/content_process.py  Normal file
@@ -0,0 +1,217 @@
import re

import fitz  # type: ignore
import magic
import requests  # type: ignore
from langgraph.graph import END, START, StateGraph
from typing_extensions import TypedDict
from youtube_transcript_api import YouTubeTranscriptApi  # type: ignore
from youtube_transcript_api.formatters import TextFormatter  # type: ignore


class SourceState(TypedDict):
    content: str
    file_path: str
    url: str
    source_type: str
    identified_type: str
    identified_provider: str


def source_identification(state: SourceState):
    """
    Identify the content source based on parameters
    """
    if state.get("content"):
        doc_type = "text"
    elif state.get("file_path"):
        doc_type = "file"
    elif state.get("url"):
        doc_type = "url"
    else:
        raise ValueError("No source provided.")

    return {"source_type": doc_type}


def url_provider(state: SourceState):
    """
    Identify the provider
    """
    return_dict = {}
    url = state.get("url")
    if url:
        if "youtube.com" in url or "youtu.be" in url:
            return_dict["identified_type"] = (
                "youtube"  # playlists, channels in the future
            )
        else:
            return_dict["identified_type"] = "article"
            # article providers in the future
    return return_dict


def file_type(state: SourceState):
    """
    Identify the file using python-magic
    """
    return_dict = {}
    file_path = state.get("file_path")
    if file_path is not None:
        return_dict["identified_type"] = magic.from_file(file_path, mime=True)
    return return_dict


def _extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text


def extract_pdf(state: SourceState):
    """
    Extract the text content from a PDF file.
    """
    return_dict = {}
    if (
        state.get("file_path") is not None
        and state.get("identified_type") == "application/pdf"
    ):
        file_path = state.get("file_path")
        try:
            text = _extract_text_from_pdf(file_path)
            return_dict["content"] = text
        except FileNotFoundError:
            raise FileNotFoundError(f"File not found at {file_path}")
        except Exception as e:
            raise Exception(f"An error occurred: {e}")

    return return_dict


def extract_url(state: SourceState):
    """
    Get the content of a URL
    """
    response = requests.get(f"https://r.jina.ai/{state.get('url')}")
    return {"content": response.text}


def extract_txt(state: SourceState):
    """
    Read a plain-text file and return its content.
    """
    return_dict = {}
    if (
        state.get("file_path") is not None
        and state.get("identified_type") == "text/plain"
    ):
        file_path = state.get("file_path")
        if file_path is not None:
            try:
                with open(file_path, "r", encoding="utf-8") as file:
                    content = file.read()
                return_dict["content"] = content
            except FileNotFoundError:
                raise FileNotFoundError(f"File not found at {file_path}")
            except Exception as e:
                raise Exception(f"An error occurred: {e}")

    return return_dict


def _extract_youtube_id(url):
    """
    Extract the YouTube video ID from a given URL using regular expressions.

    Args:
        url (str): The YouTube URL from which to extract the video ID.

    Returns:
        str: The extracted YouTube video ID or None if no valid ID is found.
    """
    # Define a regular expression pattern to capture the YouTube video ID
    youtube_regex = (
        r"(?:https?://)?"  # Optional scheme
        r"(?:www\.)?"  # Optional www.
        r"(?:"
        r"youtu\.be/"  # Shortened URL
        r"|youtube\.com"  # Main URL
        r"(?:"  # Group start
        r"/embed/"  # Embed URL
        r"|/v/"  # Older video URL
        r"|/watch\?v="  # Standard watch URL
        r"|/watch\?.+&v="  # Other watch URL
        r")"  # Group end
        r")"  # End main group
        r"([\w-]{11})"  # 11 characters (YouTube video ID)
    )

    # Search the URL for the pattern
    match = re.search(youtube_regex, url)

    # Return the video ID if a match is found
    return match.group(1) if match else None


def extract_youtube_transcript(state: SourceState):
    """
    Download and format the transcript of a YouTube video.
    """

    transcript = YouTubeTranscriptApi.get_transcript(
        _extract_youtube_id(state.get("url")), languages=["pt", "en"]
    )
    formatter = TextFormatter()
    return {"content": formatter.format_transcript(transcript)}


def should_continue(data: SourceState):
    if data.get("source_type") == "url":
        return "parse_url"
    else:
        return "end"


workflow = StateGraph(SourceState)
workflow.add_node("source", source_identification)
workflow.add_node("url_provider", url_provider)
workflow.add_node("file_type", file_type)
workflow.add_node("extract_txt", extract_txt)
workflow.add_node("extract_pdf", extract_pdf)
workflow.add_node("extract_url", extract_url)
workflow.add_node("extract_youtube_transcript", extract_youtube_transcript)

workflow.add_edge(START, "source")
workflow.add_conditional_edges(
    "source",
    lambda x: x.get("source_type"),
    {
        "url": "url_provider",
        "file": "file_type",
        "text": END,
    },
)
workflow.add_conditional_edges(
    "file_type",
    lambda x: x.get("identified_type"),
    {
        "text/plain": "extract_txt",
        "application/pdf": "extract_pdf",
    },
)
workflow.add_conditional_edges(
    "url_provider",
    lambda x: x.get("identified_type"),
    {"article": "extract_url", "youtube": "extract_youtube_transcript"},
)
workflow.add_edge("url_provider", END)
workflow.add_edge("file_type", END)
workflow.add_edge("extract_txt", END)
workflow.add_edge("extract_pdf", END)
workflow.add_edge("extract_url", END)
workflow.add_edge("extract_youtube_transcript", END)
graph = workflow.compile()
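The `_extract_youtube_id` helper above works on every common YouTube URL shape (short links, embeds, and standard watch URLs). A self-contained check of the same pattern, copied verbatim from the file so it runs without the package:

```python
import re

# Same pattern as _extract_youtube_id in content_process.py.
youtube_regex = (
    r"(?:https?://)?"  # Optional scheme
    r"(?:www\.)?"  # Optional www.
    r"(?:"
    r"youtu\.be/"  # Shortened URL
    r"|youtube\.com"  # Main URL
    r"(?:"
    r"/embed/"
    r"|/v/"
    r"|/watch\?v="
    r"|/watch\?.+&v="
    r")"
    r")"
    r"([\w-]{11})"  # 11-character video ID
)


def extract_youtube_id(url: str):
    match = re.search(youtube_regex, url)
    return match.group(1) if match else None
```

Non-YouTube URLs fall through to `None`, which matches the routing in `url_provider`: they are treated as articles instead.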
96  open_notebook/graphs/summary.py  Normal file
@@ -0,0 +1,96 @@
import os
from typing import List, Literal

from langchain_core.runnables import (
    RunnableConfig,
)
from langchain_openai import ChatOpenAI
from langgraph.graph import END, START, StateGraph
from langgraph.prebuilt import ToolNode
from pydantic import BaseModel, Field
from typing_extensions import TypedDict

from open_notebook.graphs.tools import get_current_timestamp
from open_notebook.prompter import Prompter
from open_notebook.utils import split_text

tools = [get_current_timestamp]
tool_node = ToolNode(tools)


class SummaryResponse(BaseModel):
    """Respond to the user with this"""

    summary: str = Field(description="The summary of the content")
    topics: List[str] = Field(description="List of 4-7 topics related to this content")
    title: str = Field(description="The title of the content")


class SummaryState(TypedDict):
    chunks: List[str]
    content: str
    summary: SummaryResponse


def build_chunks(state: SummaryState) -> dict:
    """
    Split the input text into chunks.
    """
    return {
        "chunks": split_text(
            state["content"],
            chunk=int(os.environ.get("SUMMARY_CHUNK_SIZE", 200000)),
            overlap=int(os.environ.get("SUMMARY_CHUNK_OVERLAP", 1000)),
        )
    }


def setup_next_chunk(state: SummaryState) -> dict:
    """
    Move the next item in the chunk list to the processing area.
    """
    state["content"] = state["chunks"].pop(0)
    return {"chunks": state["chunks"], "content": state["content"]}


def chunk_condition(state: SummaryState) -> Literal["get_chunk", END]:  # type: ignore
    """
    Checks whether there are more chunks to process.
    """
    if len(state["chunks"]) > 0:
        return "get_chunk"
    return END


# todo: build a helper method for LLM communication on all graphs
def call_model_with_messages(state: SummaryState, config: RunnableConfig) -> dict:
    model = (
        ChatOpenAI(
            model=os.environ.get("SUMMARIZATION_MODEL", os.environ["DEFAULT_MODEL"]),
            temperature=0,
        )
        .bind_tools(tools)
        .with_structured_output(SummaryResponse)
    )

    system_prompt = Prompter(prompt_template="summarize").render(data=state)
    ai_message = model.invoke(system_prompt)
    return {"summary": ai_message}


agent_state = StateGraph(SummaryState)
agent_state.add_node("setup_chunk", build_chunks)
agent_state.add_edge(START, "setup_chunk")
agent_state.add_conditional_edges(
    "setup_chunk",
    chunk_condition,
)
agent_state.add_node("get_chunk", setup_next_chunk)
agent_state.add_node("agent", call_model_with_messages)
agent_state.add_edge("get_chunk", "agent")
agent_state.add_conditional_edges(
    "agent",
    chunk_condition,
)

graph = agent_state.compile()
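`build_chunks` relies on `split_text` from `open_notebook.utils`, with sizes measured in characters (per the `.env.example` comments). The real implementation is not shown in this diff; a hypothetical character-window version, consistent with those `chunk`/`overlap` parameters, could look like this:

```python
from typing import List


def split_text(text: str, chunk: int, overlap: int) -> List[str]:
    # Hypothetical sketch: fixed-size character windows that step forward
    # by (chunk - overlap) characters, so neighbors share `overlap` chars.
    step = chunk - overlap
    return [text[i : i + chunk] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = split_text("abcdefghij", chunk=4, overlap=1)
```

With `chunk=4, overlap=1` the 10-character input yields three windows that each repeat one character of their neighbor, which is why the summary loop can process chunks independently without losing sentence boundaries at the seams.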
24  open_notebook/graphs/tools.py  Normal file
@@ -0,0 +1,24 @@
from datetime import datetime

from langchain.tools import tool


@tool
def get_current_timestamp() -> str:
    """
    Returns the current timestamp in the format YYYYMMDDHHmmss.
    """
    return datetime.now().strftime("%Y%m%d%H%M%S")


@tool
def ask_the_document(doc_id: str, question: str):
    """
    Use this tool to ask a question to the document.
    Another LLM will read the document and answer the question.
    Be specific and complete in your query, given that the LLM that will process it is very capable.
    """
    from open_notebook.graphs.ask_content import graph

    result = graph.invoke({"doc_id": doc_id, "question": question})
    return result["answer"]
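The `YYYYMMDDHHmmss` format used by `get_current_timestamp` is a fixed-width, digits-only string, so timestamps sort correctly as plain strings. A quick stdlib check of the format string on a fixed date:

```python
from datetime import datetime

# Same strftime format as get_current_timestamp, applied to a fixed
# datetime so the output is deterministic.
stamp = datetime(2024, 5, 1, 13, 45, 7).strftime("%Y%m%d%H%M%S")
```

Fourteen digits, zero-padded, with the most significant units first — which is what makes lexicographic and chronological order coincide.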
93
open_notebook/prompter.py
Normal file
@@ -0,0 +1,93 @@
"""
A prompt management module using Jinja to generate complex prompts from simple templates.
"""

import os
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional, Union

from jinja2 import Environment, FileSystemLoader, Template

env = Environment(loader=FileSystemLoader(os.environ.get("PROMPT_PATH", "prompts")))


@dataclass
class Prompter:
    """
    A class for managing and rendering prompt templates.

    Attributes:
        prompt_template (str, optional): The name of the prompt template file.
        prompt_variation (str, optional): The variation of the prompt template.
        prompt_text (str, optional): The raw prompt text.
        template (Union[str, Template], optional): The Jinja2 template object.
    """

    prompt_template: Optional[str] = None
    prompt_variation: Optional[str] = "default"
    prompt_text: Optional[str] = None
    template: Optional[Union[str, Template]] = None
    parser: Optional[Any] = None

    def __init__(self, prompt_template=None, prompt_text=None):
        """
        Initialize the Prompter with either a template file or raw text.

        Args:
            prompt_template (str, optional): The name of the prompt template file.
            prompt_text (str, optional): The raw prompt text.
        """
        self.prompt_template = prompt_template
        self.prompt_text = prompt_text
        self.setup()

    def setup(self):
        """
        Set up the Jinja2 template based on the provided template file or text.

        Raises:
            ValueError: If neither prompt_template nor prompt_text is provided.
        """
        if self.prompt_template:
            self.template = env.get_template(f"{self.prompt_template}.jinja")
        elif self.prompt_text:
            self.template = Template(self.prompt_text)
        else:
            raise ValueError("Prompter must have a prompt_template or prompt_text")

    @classmethod
    def from_text(cls, text: str):
        """
        Create a Prompter instance from raw text, which can contain Jinja code.

        Args:
            text (str): The raw prompt text.

        Returns:
            Prompter: A new Prompter instance.
        """
        return cls(prompt_text=text)

    def render(self, data) -> str:
        """
        Render the prompt template with the given data.

        Args:
            data (dict): The data to be used in rendering the template.

        Returns:
            str: The rendered prompt text.

        Raises:
            AssertionError: If the template is not defined or not a Jinja2 Template.
        """
        data["current_time"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        if self.parser:
            data["format_instructions"] = self.parser.get_format_instructions()
        assert self.template, "Prompter template is not defined"
        assert isinstance(
            self.template, Template
        ), "Prompter template is not a Jinja2 Template"
        return self.template.render(data)
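A quick sketch of the flow `Prompter.render` implements: copy the caller's data, inject `current_time`, then substitute it into the template. This is a hypothetical, dependency-free illustration using the stdlib `string.Template` as a stand-in for `jinja2.Template`; the real class resolves `{{ ... }}` placeholders via Jinja2.

```python
from datetime import datetime
from string import Template  # stand-in for jinja2.Template in this sketch


def render(template_text: str, data: dict) -> str:
    # Mirror Prompter.render: inject current_time before substitution
    data = dict(data, current_time=datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    return Template(template_text).substitute(data)


# Roughly equivalent to Prompter.from_text("...").render({"question": "..."})
prompt = render("Q: $question (asked at $current_time)", {"question": "What is SPR?"})
print(prompt.startswith("Q: What is SPR?"))  # True
```

The copy-before-mutate step matters: the real `render` writes `current_time` straight into the caller's dict, which this sketch deliberately avoids.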
109
open_notebook/repository.py
Normal file
@@ -0,0 +1,109 @@
import asyncio
import os
from contextlib import asynccontextmanager

from loguru import logger
from surrealdb import Surreal

from open_notebook.exceptions import InvalidDatabaseSchema

EXPECTED_VERSION = "0.0.1"


@asynccontextmanager
async def db_connection():
    db = Surreal(os.environ["SURREAL_ADDRESS"])
    try:
        await db.connect()
        await db.signin(
            {"user": os.environ["SURREAL_USER"], "pass": os.environ["SURREAL_PASS"]}
        )
        await db.use(os.environ["SURREAL_NAMESPACE"], os.environ["SURREAL_DATABASE"])
        yield db
    finally:
        await db.close()


def repo_query(query_str, vars=None):
    async def _query():
        async with db_connection() as db:
            result = await db.query(query_str, vars)
            return result

    result = asyncio.run(_query())
    return result[0]["result"]


def check_version():
    async def _check_version():
        async with db_connection() as db:
            result = await db.query("select * from open_notebook:database_info;")
            return result

    try:
        result = asyncio.run(_check_version())
        if len(result) == 0 or len(result[0]["result"]) == 0:
            raise InvalidDatabaseSchema("Database schema not found")
        version = result[0]["result"][0]["version"]
        logger.info(f"Connected to SurrealDB, using schema version {version}")
        if version != EXPECTED_VERSION:
            raise InvalidDatabaseSchema(
                f"Version mismatch. Expected {EXPECTED_VERSION}, got {version}"
            )
    except Exception as e:
        logger.error(e)
        raise e


def repo_create(table, data):
    async def _create():
        async with db_connection() as db:
            result = await db.create(table, data)
            return result

    result = asyncio.run(_create())
    return result


def repo_update(id, data):
    async def _update():
        async with db_connection() as db:
            result = await db.update(id, data)
            return result

    result = asyncio.run(_update())
    return result


def repo_delete(id):
    async def _delete():
        async with db_connection() as db:
            result = await db.delete(id)
            return result

    result = asyncio.run(_delete())
    return result


def repo_relate(source, relationship, target):
    async def _relate():
        async with db_connection() as db:
            query = f"RELATE {source}->{relationship}->{target};"
            result = await db.query(query)
            return result

    result = asyncio.run(_relate())
    return result


def execute_migration():
    async def _query():
        content = None
        with open("db_setup.surrealql", "r") as file:
            content = file.read()
        async with db_connection() as db:
            result = await db.query(content)
            return result

    result = asyncio.run(_query())
    return result[0]["result"]
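Every repository function above follows the same shape: an async inner function opens the connection in a context manager, runs one operation, and a synchronous wrapper drives it with `asyncio.run` so callers never touch the event loop. A minimal stdlib-only sketch of that pattern, with a dummy connection standing in for SurrealDB:

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def fake_connection():
    # Stands in for db_connection(): set up, yield, always tear down
    db = {"connected": True}
    try:
        yield db
    finally:
        db["connected"] = False


def run_query(query: str):
    # Same shape as repo_query: async inner function, sync asyncio.run wrapper
    async def _query():
        async with fake_connection() as db:
            assert db["connected"]
            # Mimic the [{"result": [...]}] envelope the real driver returns
            return [{"result": [query.upper()]}]

    result = asyncio.run(_query())
    return result[0]["result"]


print(run_query("select 1"))  # ['SELECT 1']
```

One consequence of this design is that each call opens and closes a fresh connection, which keeps the functions stateless at the cost of per-call connection overhead.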
83
open_notebook/utils.py
Normal file
@@ -0,0 +1,83 @@
from langchain_text_splitters import CharacterTextSplitter
from openai import OpenAI

client = OpenAI()


def split_text(txt: str, chunk=1000, overlap=0, separator=" "):
    """
    Split the input text into chunks.

    Args:
        txt (str): The input text to be split.
        chunk (int): The size of each chunk. Default is 1000.
        overlap (int): The number of characters to overlap between chunks. Default is 0.
        separator (str): The separator to use when splitting the text. Default is " ".

    Returns:
        list: A list of text chunks.
    """
    text_splitter = CharacterTextSplitter(
        chunk_size=chunk, chunk_overlap=overlap, separator=separator
    )
    return text_splitter.split_text(txt)


def token_count(input_string):
    """
    Count the number of tokens in the input string using the 'o200k_base' encoding.

    Args:
        input_string (str): The input string to count tokens for.

    Returns:
        int: The number of tokens in the input string.
    """
    import tiktoken

    encoding = tiktoken.get_encoding("o200k_base")
    tokens = encoding.encode(input_string)
    return len(tokens)


def token_cost(token_count, cost_per_million=0.150):
    """
    Calculate the cost of tokens based on the token count and cost per million tokens.

    Args:
        token_count (int): The number of tokens.
        cost_per_million (float): The cost per million tokens. Default is 0.150.

    Returns:
        float: The calculated cost for the given token count.
    """
    return cost_per_million * (token_count / 1_000_000)


def get_embedding(text, model="text-embedding-3-small"):
    """
    Get the embedding for the input text using the specified model.

    Args:
        text (str): The input text to get the embedding for.
        model (str): The name of the embedding model to use. Default is "text-embedding-3-small".

    Returns:
        list: The embedding vector for the input text.
    """
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding


def surreal_clean(text):
    """
    Clean the input text by escaping colons for SurrealDB compatibility.

    Args:
        text (str): The input text to clean.

    Returns:
        str: The cleaned text with escaped colons.
    """
    # Use a double backslash: "\:" is an invalid escape sequence in Python
    return text.replace(":", "\\:")
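The cost arithmetic in `token_cost` is simple enough to check by hand: at the default $0.150 per million tokens, a 200,000-token document costs $0.03. A quick sanity check, restating the same formula and the colon-escaping from `surreal_clean` so the snippet runs without the project's dependencies:

```python
def token_cost(token_count, cost_per_million=0.150):
    # Price scales linearly with the token count
    return cost_per_million * (token_count / 1_000_000)


def surreal_clean(text):
    # Escape colons so SurrealDB does not treat them as record-id separators
    return text.replace(":", "\\:")


print(round(token_cost(200_000), 6))  # 0.03
print(surreal_clean("note:123"))      # note\:123
```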
115
pages/2_📒_Notebooks.py
Normal file
@@ -0,0 +1,115 @@
import streamlit as st
from humanize import naturaltime

from open_notebook.domain import Notebook
from stream_app.chat import chat_sidebar
from stream_app.note import add_note, note_card
from stream_app.source import add_source, source_card
from stream_app.utils import setup_stream_state

st.set_page_config(
    layout="wide", page_title="📒 Open Notebook", initial_sidebar_state="expanded"
)


def notebook_header(current_notebook):
    c1, c2, c3 = st.columns([8, 2, 2])
    c1.header(current_notebook.name)
    if c2.button("Back to the list", icon="🔙"):
        st.session_state["current_notebook"] = None
        st.rerun()

    if c3.button("Refresh", icon="🔄"):
        st.rerun()
    current_description = current_notebook.description
    with st.expander(
        current_notebook.description
        if len(current_description) > 0
        else "click to add a description"
    ):
        notebook_name = st.text_input("Name", value=current_notebook.name)
        notebook_description = st.text_area(
            "Description",
            value=current_description,
            placeholder="Add as much context as you can as this will be used by the AI to generate insights.",
        )
        if st.button("Save", key="edit_notebook"):
            current_notebook.name = notebook_name
            current_notebook.description = notebook_description
            current_notebook.save()
            st.rerun()
        if st.button("Delete forever", icon="☠️"):
            current_notebook.delete()
            st.session_state["current_notebook"] = None
            st.rerun()


def notebook_page(current_notebook_id):
    current_notebook: Notebook = Notebook.get(current_notebook_id)
    if not current_notebook:
        st.error("Notebook not found")
        return
    if current_notebook_id not in st.session_state.keys():
        st.session_state[current_notebook_id] = current_notebook

    session_id = st.session_state["active_session"]
    st.session_state[session_id]["notebook"] = current_notebook
    sources = current_notebook.sources
    notes = current_notebook.notes

    notebook_header(current_notebook)

    work_tab, chat_tab = st.columns([4, 2])
    with work_tab:
        sources_tab, notes_tab = st.columns(2)
        with sources_tab:
            with st.container(border=True):
                if st.button("Add Source", icon="➕"):
                    add_source(session_id)
                for source in sources:
                    source_card(session_id=session_id, source=source)

        with notes_tab:
            with st.container(border=True):
                if st.button("Write a Note", icon="📝"):
                    add_note(session_id)
                for note in notes:
                    note_card(session_id=session_id, note=note)
    with chat_tab:
        chat_sidebar(session_id=session_id)


if "current_notebook" not in st.session_state:
    st.session_state["current_notebook"] = None

if st.session_state["current_notebook"]:
    notebook_page(st.session_state["current_notebook"])
    st.stop()

st.title("📒 My Notebooks")
st.caption("Here are all your notebooks")


notebooks = Notebook.get_all()

for notebook in notebooks:
    with st.container(border=True):
        st.subheader(notebook.name)
        st.caption(
            f"Created: {naturaltime(notebook.created)}, updated: {naturaltime(notebook.updated)}"
        )
        st.write(notebook.description)
        if st.button("Open", key=f"open_notebook_{notebook.id}"):
            setup_stream_state(notebook.id)
            st.session_state["current_notebook"] = notebook.id
            st.rerun()

with st.container(border=True):
    new_notebook_title = st.text_input("New Notebook Name")
    new_notebook_description = st.text_area("Description")
    if st.button("Create a new Notebook", icon="➕"):
        notebook = Notebook(
            name=new_notebook_title, description=new_notebook_description
        )
        notebook.save()
        st.rerun()
65
pages/3_🔍_Search.py
Normal file
@@ -0,0 +1,65 @@
import streamlit as st

from open_notebook.domain import text_search, vector_search
from open_notebook.utils import get_embedding
from stream_app.note import note_list_item
from stream_app.source import source_list_item

st.set_page_config(
    layout="wide", page_title="🔍 Open Notebook", initial_sidebar_state="expanded"
)

# search_tab, ask_tab = st.tabs(["Search", "Ask"])
# notebooks = Notebook.get_all()

if "search_results" not in st.session_state:
    st.session_state["search_results"] = []

# with search_tab:
with st.container(border=True):
    st.subheader("🔍 Search")
    st.caption("Search your knowledge base for specific keywords or concepts")
    search_term = st.text_input("Search", "")
    search_type = st.radio("Search Type", ["Text Search", "Vector Search"])
    search_sources = st.checkbox("Search Sources", value=True)
    search_notes = st.checkbox("Search Notes", value=True)
    if st.button("Search"):
        if search_type == "Text Search":
            st.write(f"Searching for {search_term}")
            st.session_state["search_results"] = text_search(
                search_term, 100, search_sources, search_notes
            )
        elif search_type == "Vector Search":
            st.write(f"Searching for {search_term}")
            embed_query = get_embedding(search_term)
            st.session_state["search_results"] = vector_search(
                embed_query, 100, search_sources, search_notes
            )
    for item in st.session_state["search_results"]:
        score = item.get("relevance", item.get("similarity", 0))
        if item.get("item_id"):
            if "source:" in item["item_id"]:
                source_list_item(item["item_id"], score)
            elif "note:" in item["item_id"]:
                note_list_item(item["item_id"], score)

# coming soon
# with ask_tab:
#     with st.form(key="ask_form"):
#         st.subheader("Ask Your Knowledge Base")
#         st.caption("Let the LLM formulate an answer based on your query")
#         question = st.text_input("Your question", "")

#         notebooks = st.multiselect(
#             "Notebooks",
#             notebooks,
#             notebooks,
#             format_func=lambda x: x.name,
#         )
#         search_sources = st.multiselect(
#             "Use Sources",
#             ["Sources", "Notes"],
#             ["Sources", "Notes"],
#         )
#         if st.form_submit_button("Search"):
#             st.write(f"Searching for {search_term}")
4021
poetry.lock
generated
Normal file
3
poetry.toml
Normal file
@@ -0,0 +1,3 @@
[virtualenvs]
in-project = true
path = "."
26
prompts/ask_content.jinja
Normal file
@@ -0,0 +1,26 @@

# BACKGROUND

You are a cognitive assistant that helps me study and research.

# OUR WORKING FRAMEWORK

You have access to some information about the project I am working on,
as well as the content of a specific item I am interested in.

Your goal is to respond to the question using purely the content in your CONTEXT.

If the content in CONTEXT is not enough to answer the question, do not make up any information and just reply that you can't answer that.
Kindly tell the user what sort of things you'd be able to talk about.

# PROJECT INFO

{{ notebook }}

# CONTENT

{{ doc_content }}

# QUESTION

{{ question }}
45
prompts/chat.jinja
Normal file
@@ -0,0 +1,45 @@

# BACKGROUND

You are a cognitive assistant that helps me study and research.

# OUR WORKING FRAMEWORK

We are working within a virtual Notebook,
which is a learning workspace for a specific project.

You have access to some information about the project,
the contents that are selected for discussion, and relevant contexts.

Your goal is to respond to the user's commands and questions,
using purely the content in your context.

# YOUR TOOLS

You might find that some of the documents in the CONTEXT are worth an extra look. For that, you can use the `ask_the_document` tool.
Just ask the question as if you were talking to someone who knows the document deeply, and the tool will provide you with the answer.
Use the document id to specify which source or note you'd like to ask about.

# INSTRUCTIONS

- You can keep calling tools until you are satisfied with the information
- You have an optional field in your answer called title. Only use this field if you believe your answer is important enough to be saved as a note. If it's just a quick chat, send an empty string for it.
- Please add to the citations list all the ids for sources and notes that you used for your answer.

# RESPONSE FORMAT
```
[ANSWER]

### CITATIONS
- id1
- id2
- etc
```

# PROJECT INFO

{{ notebook }}

# CONTEXT

{{ context }}
11
prompts/spr.jinja
Normal file
@@ -0,0 +1,11 @@

# MISSION
You are a Sparse Priming Representation (SPR) writer. An SPR is a particular kind of use of language for advanced NLP, NLU, and NLG tasks, particularly useful for the latest generation of Large Language Models (LLMs). You will be given information by the USER which you are to render as an SPR.

# THEORY
LLMs are a kind of deep neural network. They have been demonstrated to embed knowledge, abilities, and concepts, ranging from reasoning to planning, and even to theory of mind. These are called latent abilities and latent content, collectively referred to as latent space. The latent space of an LLM can be activated with the correct series of words as inputs, which will create a useful internal state of the neural network. This is not unlike how the right shorthand cues can prime a human mind to think in a certain way. Like human minds, LLMs are associative, meaning you only need to use the correct associations to "prime" another model to think in the same way.

# METHODOLOGY
Render the input as a distilled list of succinct statements, assertions, associations, concepts, analogies, and metaphors. The idea is to capture as much, conceptually, as possible but with as few words as possible. Write it in a way that makes sense to you, as the future audience will be another language model, not a human. Use complete sentences.

{# thanks to https://github.com/daveshap/SparsePrimingRepresentations #}
28
prompts/summarize.jinja
Normal file
@@ -0,0 +1,28 @@

{% include "spr.jinja" %}

# YOUR TASK

You are part of a content summarization platform.
Sometimes you need to summarize the content gradually, since it might be too large to fit a single context window.
Please summarize the content below in a few sentences, making it as complete, dense, and SPR-compatible as you can.

## INSTRUCTIONS

- If the content already has a current summary, rewrite the summary to add the new information without losing the previous context
- Always make it dense and SPR-compatible
- Do not reply with any feedback or message other than the summary itself

## FORMATTING INSTRUCTIONS

{{ format_instructions }}

## CONTENT

{{ content }}

## PREVIOUS SUMMARY

{{ summary }}

## SUMMARY
59
pyproject.toml
Normal file
@@ -0,0 +1,59 @@
[tool.poetry]
name = "open-notebook"
version = "0.0.1"
description = "An open source implementation of a research assistant, inspired by Google Notebook LM"
authors = ["Luis Novo <lfnovo@gmail.com>"]
license = "MIT"
readme = "README.md"
classifiers = [
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.11",
]

[tool.poetry.dependencies]
python = "^3.11"
streamlit = "^1.39.0"
watchdog = "^5.0.3"
pydantic = "^2.9.2"
loguru = "^0.7.2"
icecream = "^2.1.3"
langchain = "^0.3.3"
langgraph = "^0.2.38"
humanize = "^4.11.0"
streamlit-tags = "^1.2.8"
streamlit-scrollable-textbox = "^0.0.3"
tiktoken = "^0.8.0"
streamlit-monaco = "^0.1.3"
langgraph-checkpoint-sqlite = "^2.0.0"
pymupdf = "1.24.11"
python-magic = "^0.4.27"
langdetect = "^1.0.9"
youtube-transcript-api = "^0.6.2"
surrealdb = "^0.3.2"
openai = "^1.52.0"
pre-commit = "^4.0.1"
langchain-community = "^0.3.3"
langchain-openai = "^0.2.3"

[tool.poetry.group.dev.dependencies]
ipykernel = "^6.29.5"
ruff = "^0.5.5"
mypy = "^1.11.1"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"


[tool.isort]
profile = "black"
line_length = 88

[tool.ruff]
line-length = 88

[tool.ruff.lint]
select = ["E", "F", "I"]
ignore = ["E501"]
0
stream_app/__init__.py
Normal file
89
stream_app/chat.py
Normal file
@@ -0,0 +1,89 @@
import streamlit as st
from langchain_core.runnables import RunnableConfig

from open_notebook.domain import Note, Source
from open_notebook.graphs.chat import graph as chat_graph
from open_notebook.utils import token_cost, token_count


# todo: build a smarter, more robust context manager function
def build_context(session_id):
    st.session_state[session_id]["context"] = dict(note=[], source=[])

    for id, status in st.session_state[session_id]["context_config"].items():
        if not id:
            continue

        item_type, item_id = id.split(":")
        if item_type not in ["note", "source"]:
            continue

        if "not in" in status:
            continue

        if item_type == "note":
            item: Note = Note.get(id)
        elif item_type == "source":
            item: Source = Source.get(id)
        else:
            continue

        if not item:
            continue
        if "summary" in status:
            st.session_state[session_id]["context"][item_type] += [
                item.get_context(context_size="short")
            ]
        elif "full content" in status:
            st.session_state[session_id]["context"][item_type] += [
                item.get_context(context_size="long")
            ]

    return st.session_state[session_id]["context"]


def execute_chat(txt_input, session_id):
    current_state = st.session_state[session_id]
    current_state["messages"] += [txt_input]
    result = chat_graph.invoke(
        input=current_state,
        config=RunnableConfig(configurable={"thread_id": session_id}),
    )
    return result


# todo: if token counting stays, it needs to be configurable
# it would also be nice to have a running token total somewhere in the admin
def chat_sidebar(session_id):
    context = build_context(session_id=session_id)
    tokens = token_count(str(context))
    cost = token_cost(tokens)
    with st.container(border=True):
        request = st.chat_input("Enter your question")
        st.caption(f"Total tokens: {tokens}, cost: ${cost:.4f}")
        if request:
            response = execute_chat(txt_input=request, session_id=session_id)
            st.session_state[session_id]["messages"] = response["messages"]

        for msg in st.session_state[session_id]["messages"][::-1]:
            if msg.type not in ["human", "ai"]:
                continue
            if not msg.content:
                continue

            with st.chat_message(name=msg.type):
                st.write(msg.content)
                if msg.type == "ai":
                    if st.button("💾 New Note", key=f"render_save_{msg.id}"):
                        title = "New Note"
                        content = msg.content
                        note = Note(
                            title=title,
                            content=content,
                            note_type="ai",
                        )
                        note.save()
                        note.add_to_notebook(
                            st.session_state[session_id]["notebook"].id
                        )
                        st.rerun()
5
stream_app/consts.py
Normal file
@@ -0,0 +1,5 @@
context_icons = [
    "⛔ not in context",
    "🟡 summary",
    "🟢 full content",
]
89
stream_app/note.py
Normal file
@@ -0,0 +1,89 @@
import streamlit as st
from humanize import naturaltime
from loguru import logger
from streamlit_monaco import st_monaco  # type: ignore

from open_notebook.domain import Note

from .consts import context_icons


@st.dialog("Write a Note", width="large")
def add_note(session_id):
    note_title = st.text_input("Title")
    note_content = st.text_area("Content")
    if st.button("Save", key="add_note"):
        logger.debug("Adding note")
        note = Note(title=note_title, content=note_content, note_type="human")
        note.save()
        note.add_to_notebook(st.session_state[session_id]["notebook"].id)
        st.rerun()


@st.dialog("Note", width="large")
def note_panel(session_id=None, note_id=None):
    if note_id:
        note: Note = Note.get(note_id)
    else:
        note: Note = Note()

    t_preview, t_edit = st.tabs(["Preview", "Edit"])
    with t_preview:
        st.subheader(note.title)
        st.markdown(note.content)
    with t_edit:
        note.title = st.text_input("Title", value=note.title)
        note.content = st_monaco(
            value=note.content, height="600px", language="markdown"
        )
        if st.button("Save", key=f"edit_note_{note_id}"):
            logger.debug("Editing note")
            # capture newness before save(), which assigns the id
            is_new = not note.id
            note.save()
            if is_new:
                note.add_to_notebook(st.session_state[session_id]["notebook"].id)
            st.rerun()
        if st.button("Delete", key=f"delete_note_{note_id}"):
            logger.debug("Deleting note")
            note.delete()
            st.rerun()


def note_card(session_id, note):
    if note.note_type == "human":
        icon = "🤵"
    else:
        icon = "🤖"

    context_state = st.selectbox(
        "Context",
        label_visibility="collapsed",
        options=context_icons,
        index=0,
        key=f"note_{note.id}",
    )
    with st.expander(f"{icon} **{note.title}** {naturaltime(note.updated)}"):
        st.write(note.content)
        with st.popover("Actions"):
            if st.button("Edit Note", icon="📝", key=f"edit_note_{note.id}"):
                note_panel(session_id, note.id)
            if st.button("Delete", icon="🗑️", key=f"delete_options_{note.id}"):
                note.delete()
                st.rerun()

    st.session_state[session_id]["context_config"][note.id] = context_state


def note_list_item(note_id, score=None):
    logger.debug(note_id)
    note: Note = Note.get(note_id)
    if note.note_type == "human":
        icon = "🤵"
    else:
        icon = "🤖"

    with st.expander(
        f"{icon} [{score:.2f}] **{note.title}** {naturaltime(note.updated)}"
    ):
        st.write(note.content)
        if st.button("Edit Note", icon="📝", key=f"x_edit_note_{note.id}"):
            note_panel(note_id=note.id)
161
stream_app/source.py
Normal file
|
|
@ -0,0 +1,161 @@
|
|||
from pathlib import Path
|
||||
|
||||
import streamlit as st
|
||||
import streamlit_scrollable_textbox as stx # type: ignore
|
||||
from humanize import naturaltime
|
||||
from loguru import logger
|
||||
from streamlit_tags import st_tags # type: ignore
|
||||
|
||||
from open_notebook.domain import Asset, Source
|
||||
from open_notebook.graphs.content_process import graph
|
||||
from open_notebook.utils import token_cost, token_count
|
||||
|
||||
from .consts import context_icons
|
||||
|
||||
uploads_dir = Path("./.uploads")
|
||||
uploads_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
|
||||
@st.dialog("Source", width="large")
|
||||
def source_panel(source_id):
|
||||
source: Source = Source.get(source_id)
|
||||
if not source:
|
||||
st.error("Source not found")
|
||||
return
|
||||
title = st.empty()
|
||||
if source.title:
|
||||
title.subheader(source.title)
|
||||
st.caption(f"Created {naturaltime(source.created)}")
|
||||
# st.markdown(f"**URL:** {source.url}, **File:** {source.file_path}")
|
||||
summary = st.empty()
|
||||
for insight in source.insights:
|
||||
summary.write(insight.insight_type)
|
||||
summary.write(insight.content)
|
||||
|
||||
topics = source.topics or []
|
||||
if len(topics) > 0:
|
||||
st_tags(
|
||||
label="",
|
||||
text="Press enter to add more",
|
||||
value=source.topics,
|
||||
suggestions=source.topics,
|
||||
maxtags=10,
|
||||
key="1",
|
||||
)
|
||||
|
||||
if st.button("Delete", icon="🗑️"):
|
||||
source.delete()
|
||||
st.rerun()
|
||||
|
||||
cost = token_cost(token_count(source.full_text)) * 1.2
|
||||
if st.button(f"Summarize (about ${cost:.4f})", icon="📝"):
|
||||
source.summarize()
|
||||
st.rerun(scope="fragment")
|
||||
|
||||
cost_embedding = token_cost(token_count(source.full_text), 0.02)
|
||||
|
||||
if st.button(f"Embed (${cost_embedding:.4f})", icon="📝"):
|
||||
source.vectorize()
|
||||
st.success("Embedding complete")
|
||||
|
||||
st.subheader("Content")
|
||||
stx.scrollableTextbox(source.full_text, height=300)
|
||||
|
||||
|
||||
@st.dialog("Add a Source", width="large")
|
||||
def add_source(session_id):
|
||||
source_link = None
|
||||
source_file = None
|
||||
source_text = None
|
||||
source_type = st.radio("Type", ["Link", "Upload", "Text"])
|
||||
req = {}
|
||||
if source_type == "Link":
|
||||
source_link = st.text_input("Link")
|
||||
req["url"] = source_link
|
||||
elif source_type == "Upload":
|
||||
source_file = st.file_uploader("Upload")
|
||||
if source_file is not None:
|
||||
# Get the file name and extension
|
||||
file_name = source_file.name
|
||||
|
||||
file_extension = Path(file_name).suffix
|
||||
|
||||
# Generate a unique file name
|
||||
base_name = Path(file_name).stem
|
||||
counter = 1
|
||||
new_path = uploads_dir / file_name
|
||||
while new_path.exists():
|
||||
new_file_name = f"{base_name}_{counter}{file_extension}"
|
||||
new_path = uploads_dir / new_file_name
|
||||
counter += 1
|
||||
|
||||
req["file_path"] = str(new_path)
|
||||
# Save the file
|
||||
with open(new_path, "wb") as f:
|
||||
f.write(source_file.getbuffer())
|
||||
|
||||
else:
|
||||
source_text = st.text_area("Text")
|
||||
req["content"] = source_text
|
||||
if st.button("Process", key="add_source"):
|
||||
logger.debug("Adding source")
|
||||
with st.status("Processing...", expanded=True):
|
||||
st.write("Processing document...")
|
||||
result = graph.invoke(req)
|
||||
st.write("Saving..")
|
||||
source = Source(
|
||||
asset=Asset(url=req.get("url"), file_path=req.get("file_path")),
|
||||
)
|
||||
source.save()
|
||||
source.save_chunks(result["content"])
|
||||
source.add_to_notebook(st.session_state[session_id]["notebook"].id)
|
||||
st.write("Summarizing...")
|
||||
source.summarize()
|
||||
|
||||
st.rerun()
|
||||
# else:
|
||||
# st.stop()
|
||||
|
||||
|
||||
def source_card(session_id, source):
    icon = "🔗"
    context_state = st.selectbox(
        "Context",
        label_visibility="collapsed",
        options=context_icons,
        index=0,
        key=f"source_{source.id}",
    )
    with st.expander(f"**{source.title}**"):
        st.markdown(f"{icon} Updated: {naturaltime(source.updated)}")
        st.markdown("**" + ", ".join(source.topics) + "**")
        for insight in source.insights:
            st.write(insight.insight_type)
            st.write(insight.content)

        with st.popover("Actions"):
            if st.button("Edit Source", icon="📝", key=source.id):
                result = source_panel(source.id)
                st.write(result)
            if st.button("Delete", icon="🗑️", key=f"delete_options_{source.id}"):
                source.delete()
                st.rerun()

    st.session_state[session_id]["context_config"][source.id] = context_state


def source_list_item(source_id, score=None):
    source: Source = Source.get(source_id)
    if not source:
        st.error("Source not found")
        return
    icon = "🔗"

    # Only include the score in the header when one was provided,
    # otherwise formatting None with :.2f raises a TypeError
    score_label = f"[{score:.2f}] " if score is not None else ""
    with st.expander(
        f"{icon} {score_label}**{source.title}** {naturaltime(source.updated)}"
    ):
        for insight in source.insights:
            st.markdown(f"**{insight.insight_type}**")
            st.write(insight.content)
        if st.button("Edit source", icon="📝", key=f"x_edit_source_{source.id}"):
            source_panel(source_id=source.id)
18
stream_app/utils.py
Normal file
@@ -0,0 +1,18 @@
import streamlit as st

from open_notebook.graphs.chat import ThreadState, graph


def setup_stream_state(session_id) -> None:
    """
    Sets the value of the current session_id for the langgraph thread state.
    If there is no existing thread state for this session_id, it creates a new one.
    """
    existing_state = graph.get_state({"configurable": {"thread_id": session_id}}).values
    if len(existing_state.keys()) == 0:
        st.session_state[session_id] = ThreadState(
            messages=[], context=None, notebook=None, context_config={}, response=None
        )
    else:
        st.session_state[session_id] = existing_state
    st.session_state["active_session"] = session_id
1
tests/README.md
Normal file
@@ -0,0 +1 @@
Coming Soon
66
todo.md
Normal file
@@ -0,0 +1,66 @@

Auto summarize


# Code stuff
- Linting

# Future versions:
- Support more models than just OpenAI
- Any LLM via CrewAI
- Allow more than one vectorizer
- Add Gemini as a document-query model
- Allow using models such as Ollama, among others
- Try using the Pydantic Output Parser
- Try removing langchain_openai and anthropic
- DB consistency
- delete notebook (what to do with its children)
- it is accumulating 2 summaries
- delete children when deleting parents
- Play with the Streamlit theme
- Docstrings
- Fix the chat when tools are used
- Implement streaming in the chat as well
- Citations: explain where the insights came from
- Use the project's purpose for summarization
- Improve citations: explain where the insights came from
- Improve the embedding, content-cleaning, and indexing strategies
- Improve streamlit navigation and refresh
- More than one chat session?
- Database improvements: fewer, smarter tables
- Live Query for the front end
- Implement the Fabric idea of prompts and recommended questions
- Menu bar: sources, notes, projects, search, topics
- Bring in some kind of search system
- Multiple study sessions

- Improve the data view
- Use the correct queries in Surreal
- Give the models a bit of polish
- Turn everything into lambdas?
- Put information on the edges for context?

- Processing should be async
- It breaks on large files
- Needs a queue system
- Automate the analysis process
- Support audio and video transcription
https://www.youtube.com/watch?v=mdLBr9IMmgI
- Langgraph
- Move thread memory to SurrealDB

- More powerful strategies that combine Fabric with embeddings

- A nice idea would be to use a very cheap LLM to clean up text and a vision model to understand PDFs

----
There is a known issue with the Surreal SDK for large content


FEATURES

- Recursive summarization for text above 500k characters
- Vectorizer cost estimate for the contents
- Context Manager - fine grained
- Search field for text, vector, and hybrid search
- Vector search on my own notes