open-notebook/docs/3-USER-GUIDE/adding-sources.md
LUIS NOVO e13e4a2d8b docs: restructure documentation with new organized layout
2026-01-03 20:10:24 -03:00

Adding Sources - Getting Content Into Your Notebook

Sources are the raw materials of your research. This guide covers how to add different types of content.


Quick-Start: Add Your First Source

Option 1: Upload a File (PDF, Word, etc.)

1. In your notebook, click "Add Source"
2. Select "Upload File"
3. Choose a file from your computer
4. Click "Upload"
5. Wait 30-60 seconds for processing
6. Done! Source appears in your notebook

Option 2: Add a Web Link

1. Click "Add Source"
2. Select "Web Link"
3. Paste URL: https://example.com/article
4. Click "Add"
5. Wait for processing (usually faster than files)
6. Done!

Option 3: Paste Text

1. Click "Add Source"
2. Select "Text"
3. Paste or type your content
4. Click "Save"
5. Done! Immediately available

Supported File Types

Documents

  • PDF (.pdf) — Best support, including scanned PDFs with OCR
  • Word (.docx, .doc) — Full support
  • PowerPoint (.pptx) — Slides converted to text
  • Excel (.xlsx, .xls) — Spreadsheet data
  • EPUB (.epub) — eBook files
  • Markdown (.md, .txt) — Plain text formats
  • HTML (.html, .htm) — Web page files

File size limits: Up to ~100MB (varies by system)

Processing time: 10 seconds - 2 minutes (depending on length and file type)

Audio & Video

  • Audio: MP3, WAV, M4A, OGG, FLAC (~30 seconds - 3 minutes per hour)
  • Video: MP4, AVI, MOV, MKV, WebM (~3-10 minutes per hour)
  • YouTube: Direct URL support
  • Podcasts: RSS feed URL

Automatic transcription: Audio/video is transcribed to text automatically. This requires enabling speech-to-text in settings.

Web Content

  • Articles: Blog posts, news articles, Medium
  • YouTube: Full videos or playlists
  • PDFs online: Direct PDF links
  • News: News site articles

Just paste the URL in the "Web Link" section.

What Doesn't Work

  • Paywalled content (WSJ, FT, etc.) — Can't extract
  • Password-protected PDFs — Can't open
  • Pure image files (.jpg, .png) — Except scanned PDFs which have OCR
  • Very large files (>100MB) — Timeout

What Happens When You Add a Source

The system automatically does four things:

1. EXTRACT TEXT
   File/URL → Readable text
   (PDFs get OCR if scanned)
   (Videos get transcribed if enabled)

2. BREAK INTO CHUNKS
   Long text → ~500-word pieces
   (So search finds specific parts, not whole document)

3. CREATE EMBEDDINGS
   Each chunk → Vector representation
   (Enables semantic/concept search)

4. INDEX & STORE
   Everything → Database
   (Ready to search and retrieve)

Time to use: After the progress bar completes, the source is ready immediately. Embeddings are created in the background.
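
To make step 2 concrete, here is a minimal sketch of word-based chunking. It is an illustration only: the function name `chunk_words` and the fixed 500-word window are assumptions, and the actual chunker in the application may split on sentences, tokens, or overlapping windows instead.

```python
def chunk_words(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Stand-in for text extracted from a 1200-word document.
document = "word " * 1200
chunks = chunk_words(document)
print([len(chunk.split()) for chunk in chunks])  # [500, 500, 200]
```

Smaller chunks let search return the specific passage that matches a query rather than the whole document.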


Step-by-Step for Different Types

PDFs

Best practices:

Clean PDFs:
  1. Upload → Done
  2. Processing time: ~30-60 seconds

Scanned/Image PDFs:
  1. Upload same way
  2. System auto-detects and uses OCR
  3. Processing time: ~2-3 minutes (longer because of OCR overhead)

Large PDFs (50+ pages):
  1. Consider splitting into smaller files
  2. Or upload as-is (system handles it)
  3. Processing time scales with size

Common issues:

  • "Can't extract text" → PDF is corrupted or has copy protection
  • Solution: Try opening the file in Adobe Acrobat. If it won't open there either, the PDF is likely corrupted or protected.

Web Articles

Best practices:

1. Copy full URL from browser: https://example.com/article-title
2. Paste in "Web Link"
3. Click Add
4. Wait for extraction

Processing time: Usually 5-15 seconds

What works:

  • Standard web articles
  • Blog posts
  • News articles
  • Wikipedia pages
  • Medium posts
  • Substack articles

What doesn't work:

  • Twitter threads (unreliable)
  • Paywalled articles (can't access)
  • JavaScript-heavy sites (content not extracted)

Pro tip: If extraction doesn't work, copy the article text and paste it as "Text" instead.

Audio Files

Best practices:

1. Ensure speech-to-text is enabled in Settings
2. Upload MP3, WAV, or M4A file
3. System automatically transcribes to text
4. Processing time: ~1 minute per 5 minutes of audio

Example:
  - 1-hour podcast → 12 minutes processing
  - 10-minute recording → 2 minutes processing
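
The estimate above is simple arithmetic, sketched below. The helper name `estimate_transcription_minutes` is hypothetical, and the 1:5 ratio is only the rough figure quoted in this section; real times depend on hardware, the speech-to-text model, and audio quality.

```python
import math


def estimate_transcription_minutes(audio_minutes: float, ratio: float = 5.0) -> int:
    """Rough processing time, assuming ~1 minute of work per `ratio` minutes of audio."""
    if audio_minutes < 0 or ratio <= 0:
        raise ValueError("audio_minutes must be >= 0 and ratio must be > 0")
    return math.ceil(audio_minutes / ratio)


print(estimate_transcription_minutes(60))  # 12 (1-hour podcast)
print(estimate_transcription_minutes(10))  # 2  (10-minute recording)
```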

Quality matters:

  • Clear audio: Fast transcription
  • Muffled/noisy audio: Slower, less accurate transcription
  • Background noise: Try to minimize before uploading

Tip: If audio quality is poor, the transcription may contain errors. You can manually correct the transcription if needed.

YouTube Videos

Best practices:

Two ways to add:

Method 1: Direct URL
  1. Copy YouTube URL: https://www.youtube.com/watch?v=...
  2. Paste in "Web Link"
  3. Click Add
  4. System extracts captions if available; otherwise it transcribes the audio

Method 2: Playlist
  1. Paste playlist URL
  2. System adds all videos as separate sources
  3. Each video processed separately
  4. Takes longer (multiple videos)

What's extracted:

  • Captions/subtitles (if available)
  • Transcription (if captions aren't available)
  • Basic metadata (title, channel, length)

Processing:

  • 10-minute video: ~2-3 minutes
  • 1-hour video: ~10-15 minutes

Text / Paste Content

Best practices:

1. Select "Text" when adding a source
2. Paste or type content
3. System processes immediately
4. No wait time needed

Good for:
  - Notes you want to reference
  - Quotes from books
  - Transcripts you have handy
  - Quick research snippets

Managing Your Sources

Viewing Source Details

Click on a source to see:
  - Original file name/title
  - When it was added
  - Size and format
  - Processing status
  - Number of chunks

Organizing with Metadata

You can add to each source:

  • Title: Better name than original filename
  • Tags: Category labels ("primary research", "background", "competitor analysis")
  • Description: A few notes about what it contains

Why this matters:

  • Makes sources easier to find
  • Helps when contextualizing for Chat
  • Useful for organizing large notebooks

Searching Within Sources

After sources are added, you can:

Text search: "Find exact phrase"
Vector search: "Find conceptually similar"

Both search across all sources in the notebook.
Results show:
  - Which source
  - Which section
  - Relevance score
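
The difference between the two modes can be sketched with a toy example. This is not how the application implements search: real vector search uses embeddings from a model, whereas the sketch below uses plain word counts as a stand-in so that it runs without any dependencies. The function names are hypothetical.

```python
import math
from collections import Counter


def text_search(chunks: list[str], phrase: str) -> list[int]:
    """Return indices of chunks containing the exact phrase (case-insensitive)."""
    needle = phrase.lower()
    return [i for i, chunk in enumerate(chunks) if needle in chunk.lower()]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vector_search(chunks: list[str], query: str) -> list[tuple[float, int]]:
    """Rank chunks by similarity to the query; word counts stand in for embeddings."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(c.lower().split())), i) for i, c in enumerate(chunks)]
    return sorted(scored, reverse=True)


chunks = [
    "solar panels convert sunlight into electricity",
    "wind turbines generate electricity from moving air",
    "the recipe calls for two cups of flour",
]
print(text_search(chunks, "wind turbines"))                       # [1]
print(vector_search(chunks, "electricity from sunlight")[0][1])   # 0
```

Text search only matches when the phrase appears verbatim, while the ranked search returns every chunk with a relevance score, highest first.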

Context Management: How Sources Get Used

You control how AI accesses sources:

Three Levels (for Chat)

Full Content:

AI sees: Complete source text
Cost: 100% of tokens
Use when: Analyzing in detail, need precise citations
Example: "Analyze this methodology paper closely"

Summary Only:

AI sees: AI-generated summary (not full text)
Cost: ~10-20% of tokens
Use when: Background material, reference context
Example: "Use this as context but focus on the main source"

Not in Context:

AI sees: Nothing (excluded)
Cost: 0 tokens
Use when: Confidential, not relevant, or archived
Example: "Keep this in notebook but don't use in this conversation"
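
To see how the three levels affect cost, here is a rough estimate in code. The names `ContextLevel` and `estimate_context_tokens` are hypothetical, and the 0.15 fraction for summaries is an assumption (the midpoint of the ~10-20% range above); actual summary length varies by source.

```python
from enum import Enum


class ContextLevel(Enum):
    FULL = "full"
    SUMMARY = "summary"
    EXCLUDED = "excluded"


# Assumed fraction of a source's tokens sent to the AI at each level.
TOKEN_FRACTION = {
    ContextLevel.FULL: 1.0,
    ContextLevel.SUMMARY: 0.15,  # assumption: midpoint of ~10-20%
    ContextLevel.EXCLUDED: 0.0,
}


def estimate_context_tokens(sources: list[tuple[int, ContextLevel]]) -> int:
    """Estimate tokens sent per chat turn from (token_count, level) pairs."""
    return round(sum(tokens * TOKEN_FRACTION[level] for tokens, level in sources))


sources = [
    (20_000, ContextLevel.FULL),      # methodology paper, analyzed closely
    (40_000, ContextLevel.SUMMARY),   # background report
    (15_000, ContextLevel.EXCLUDED),  # archived notes
]
print(estimate_context_tokens(sources))  # 26000
```

In this example, 75,000 tokens of source material cost roughly 26,000 tokens of context.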

How to Set Context (in Chat)

1. Go to Chat
2. Click "Select Context Sources"
3. For each source:
   - Toggle ON/OFF (include/exclude)
   - Choose level (Full/Summary/Excluded)
4. Click "Save"
5. Now chat uses these settings

Common Mistakes

| Mistake | What Happens | How to Fix |
|---|---|---|
| Upload 200 sources at once | System gets slow, processing stalls | Add 10-20 at a time, wait for processing |
| Use full content for all sources | Token usage skyrockets, expensive | Use "Summary" or "Excluded" for background material |
| Add huge PDFs without splitting | Processing is slow, search results less precise | Consider splitting large PDFs into chapters |
| Forget source titles | Can't distinguish between similar sources | Rename sources with descriptive titles right after uploading |
| Don't tag sources | Hard to find and organize later | Add tags immediately: "primary", "background", etc. |
| Mix languages in one source | Transcription/embedding quality drops | Keep each language in separate sources |
| Use same source multiple times | Takes up space, creates confusion | Add once; reuse in multiple chats/notebooks |
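
If you are preparing a large collection, it can help to plan the batches first. The sketch below only groups file names; the helper `in_batches` is hypothetical, and the upload-and-wait step is left as a comment because it happens in the app interface.

```python
from itertools import islice
from typing import Iterable, Iterator


def in_batches(items: Iterable[str], size: int = 20) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


files = [f"paper_{n:03d}.pdf" for n in range(1, 46)]  # 45 example file names
for number, batch in enumerate(in_batches(files, 20), start=1):
    # Upload this batch, then wait until every source shows "Ready".
    print(f"Batch {number}: {len(batch)} files")
```

With 45 files and a batch size of 20, this produces batches of 20, 20, and 5.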

Processing Status & Troubleshooting

What the Status Indicators Mean

🟡 Processing
  → Source is being extracted and embedded
  → Wait 30 seconds - 3 minutes depending on size
  → Don't use in Chat yet

🟢 Ready
  → Source is processed and searchable
  → Can use immediately in Chat
  → Can apply transformations

🔴 Error
  → Something went wrong
  → Common reasons:
    - Unsupported file format
    - File too large or corrupted
    - Network timeout

⚪ Not in Context
  → Source added but excluded from Chat
  → Still searchable, not sent to AI

Common Errors & Solutions

"Unsupported file type"

  • You tried to upload a format not in the list (e.g., .webp image)
  • Solution: Convert to supported format (PDF for documents, MP3 for audio)

"Processing timeout"

  • Very large file (>100MB) or very long audio
  • Solution: Split into smaller pieces or try uploading again

"Transcription failed"

  • Audio quality too poor or language not detected
  • Solution: Re-record with better quality, or paste text transcript manually

"Web link won't extract"

  • Website blocks automated access or uses JavaScript for content
  • Solution: Copy the article text and paste as "Text" instead

Tips for Best Results

For PDFs

  • Clean, digital PDFs work best
  • Remove copy protection if present (legally)
  • Scanned PDFs work but take longer

For Web Articles

  • Use full URL including domain
  • Avoid cookie/popup-laden sites
  • If extraction fails, copy-paste text instead

For Audio

  • Clear, well-recorded audio transcribes better
  • Remove background noise if possible
  • YouTube videos usually have good transcriptions built-in

For Large Documents

  • Consider splitting into smaller sources
  • Gives more precise search results
  • Processing is faster for smaller pieces

For Organization

  • Name sources clearly (not "document_2.pdf")
  • Add tags immediately after uploading
  • Use descriptions for complex documents

What Comes After: Using Your Sources

Once you've added sources, you can use them in Chat, Search, and Transformations.


Summary Checklist

Before adding sources, confirm:

  • File is in supported format
  • File is under 100MB (split larger files first)
  • Web links are full URLs (not shortened)
  • Audio files have clear speech (if transcription-dependent)
  • You've named the source clearly
  • You've added tags for organization
  • You understand context levels (Full/Summary/Excluded)
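
The first three checks can be automated before uploading. The following is a sketch under stated assumptions: the extension set is copied from the "Supported File Types" lists in this guide, the 100MB cap is the approximate limit quoted above, and the function names are hypothetical. The URL check only confirms a scheme and domain are present; it cannot tell whether a link has been shortened.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extensions taken from the supported-format lists in this guide.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".xlsx", ".xls", ".epub",
    ".md", ".txt", ".html", ".htm",
    ".mp3", ".wav", ".m4a", ".ogg", ".flac",
    ".mp4", ".avi", ".mov", ".mkv", ".webm",
}
MAX_BYTES = 100 * 1024 * 1024  # approximate limit; varies by system


def check_file(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    extension = Path(name).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds ~100MB; consider splitting it")
    return problems


def is_full_url(url: str) -> bool:
    """True if the URL has an http(s) scheme and a domain."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(check_file("report.pdf", 5_000_000))          # []
print(check_file("photo.webp", 1_000))              # ['unsupported format: .webp']
print(is_full_url("https://example.com/article"))   # True
print(is_full_url("example.com/article"))           # False
```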

Done! Sources are now ready for Chat, Search, Transformations, and more.