# Adding Sources - Getting Content Into Your Notebook
Sources are the raw materials of your research. This guide covers how to add different types of content.
---
## Quick-Start: Add Your First Source
### Option 1: Upload a File (PDF, Word, etc.)
```
1. In your notebook, click "Add Source"
2. Select "Upload File"
3. Choose a file from your computer
4. Click "Upload"
5. Wait 30-60 seconds for processing
6. Done! Source appears in your notebook
```
### Option 2: Add a Web Link
```
1. Click "Add Source"
2. Select "Web Link"
3. Paste URL: https://example.com/article
4. Click "Add"
5. Wait for processing (usually faster than files)
6. Done!
```
### Option 3: Paste Text
```
1. Click "Add Source"
2. Select "Text"
3. Paste or type your content
4. Click "Save"
5. Done! Immediately available
```
---
## Supported File Types
### Documents
- **PDF** (.pdf) — Best support, including scanned PDFs with OCR
- **Word** (.docx, .doc) — Full support
- **PowerPoint** (.pptx) — Slides converted to text
- **Excel** (.xlsx, .xls) — Spreadsheet data
- **EPUB** (.epub) — eBook files
- **Markdown** (.md, .txt) — Plain text formats
- **HTML** (.html, .htm) — Web page files
**File size limits:** Up to ~100MB (varies by system)
**Processing time:** 10 seconds - 2 minutes (depending on length and file type)
### Audio & Video
- **Audio**: MP3, WAV, M4A, OGG, FLAC (~30 seconds - 3 minutes of processing per hour of audio)
- **Video**: MP4, AVI, MOV, MKV, WebM (~3-10 minutes of processing per hour of video)
- **YouTube**: Direct URL support
- **Podcasts**: RSS feed URL
**Automatic transcription**: Audio/video is transcribed to text automatically. This requires enabling speech-to-text in settings.
### Web Content
- **Articles**: Blog posts, news articles, Medium
- **YouTube**: Full videos or playlists
- **PDFs online**: Direct PDF links
- **News**: News site articles
**Just paste the URL** in the "Web Link" section.
### What Doesn't Work
- Paywalled content (WSJ, FT, etc.): can't be extracted
- Password-protected PDFs: can't be opened
- Pure image files (.jpg, .png): not supported; OCR applies only to scanned PDFs
- Very large files (>100MB): processing times out
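
The format lists and the ~100MB limit above can be turned into a quick pre-upload check. This is a minimal sketch and not part of Open Notebook: `check_upload`, `SUPPORTED_EXTENSIONS`, and `MAX_SIZE_BYTES` are hypothetical names, the lowercase extensions for the audio and video formats are assumed spellings, and the real size limit varies by system.

```python
from pathlib import Path

# Extensions copied from the lists in this guide (audio/video spellings are assumed).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".xlsx", ".xls", ".epub",
    ".md", ".txt", ".html", ".htm",
    ".mp3", ".wav", ".m4a", ".ogg", ".flac",
    ".mp4", ".avi", ".mov", ".mkv", ".webm",
}
# Approximate limit mentioned in this guide; the actual limit varies by system.
MAX_SIZE_BYTES = 100 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file is larger than ~100MB")
    return problems


print(check_upload("paper.pdf", 2_000_000))  # no problems expected
print(check_upload("photo.jpg", 500_000))    # image files are not supported
```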
---
## What Happens When You Add a Source
The system automatically does four things:
```
1. EXTRACT TEXT
File/URL → Readable text
(PDFs get OCR if scanned)
(Videos get transcribed if enabled)
2. BREAK INTO CHUNKS
Long text → ~500-word pieces
(So search finds specific parts, not whole document)
3. CREATE EMBEDDINGS
Each chunk → Vector representation
(Enables semantic/concept search)
4. INDEX & STORE
Everything → Database
(Ready to search and retrieve)
```
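To make step 2 concrete, here is a minimal sketch of word-based chunking. It assumes a plain whitespace split with a fixed size of 500 words; `chunk_words` is a hypothetical name, and the chunking Open Notebook actually performs may differ (for example, by respecting sentence boundaries or overlapping chunks).

```python
def chunk_words(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive pieces of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A 1,200-word document yields three chunks: 500, 500, and 200 words.
document = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_words(document)
print([len(chunk.split()) for chunk in chunks])
```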
**Time to use:** After the progress bar completes, the source is ready immediately. Embeddings are created in the background.
---
## Step-by-Step for Different Types
### PDFs
**Best practices:**
```
Clean PDFs:
1. Upload → Done
2. Processing time: ~30-60 seconds
Scanned/Image PDFs:
1. Upload same way
2. System auto-detects and uses OCR
3. Processing time: ~2-3 minutes (longer because of OCR overhead)
Large PDFs (50+ pages):
1. Consider splitting into smaller files
2. Or upload as-is (system handles it)
3. Processing time scales with size
```
**Common issues:**
- "Can't extract text" → PDF is corrupted or has copy protection
- Solution: Try opening the file in a PDF reader such as Adobe Acrobat. If it won't open, the PDF is likely corrupted or protected.
### Web Links / Articles
**Best practices:**
```
1. Copy full URL from browser: https://example.com/article-title
2. Paste in "Web Link"
3. Click Add
4. Wait for extraction
Processing time: Usually 5-15 seconds
```
**What works:**
- Standard web articles
- Blog posts
- News articles
- Wikipedia pages
- Medium posts
- Substack articles
**What doesn't work:**
- Twitter threads (unreliable)
- Paywalled articles (can't access)
- JavaScript-heavy sites (content not extracted)
**Pro tip:** If extraction fails, copy the article text and paste it as "Text" instead.
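
Many failed link imports come down to an incomplete URL. The sketch below uses Python's standard `urllib.parse` to check that a link has a scheme and a domain before you paste it. `is_full_url` is a hypothetical helper; it only checks the shape of the URL and cannot tell whether the page is paywalled or JavaScript-heavy.

```python
from urllib.parse import urlparse


def is_full_url(url: str) -> bool:
    """True if the URL has an http(s) scheme and a domain."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_full_url("https://example.com/article-title"))  # full URL
print(is_full_url("example.com/article-title"))          # missing scheme
```

A link copied from the browser's address bar normally passes this check; a link typed from memory without `https://` does not.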
### Audio Files
**Best practices:**
```
1. Ensure speech-to-text is enabled in Settings
2. Upload MP3, WAV, or M4A file
3. System automatically transcribes to text
4. Processing time: ~1 minute per 5 minutes of audio
Example:
- 1-hour podcast → 12 minutes processing
- 10-minute recording → 2 minutes processing
```
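The rule of thumb above is simple arithmetic. As a minimal sketch, assuming the ~1 minute per 5 minutes ratio holds (actual times depend on audio quality and your speech-to-text setup), an estimate can be computed like this; `estimate_transcription_minutes` is a hypothetical helper name.

```python
import math


def estimate_transcription_minutes(audio_minutes: float, ratio: float = 5.0) -> int:
    """Estimate processing time, assuming ~1 minute per `ratio` minutes of audio."""
    return math.ceil(audio_minutes / ratio)


print(estimate_transcription_minutes(60))  # 1-hour podcast -> 12
print(estimate_transcription_minutes(10))  # 10-minute recording -> 2
```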
**Quality matters:**
- Clear audio: Fast transcription
- Muffled/noisy audio: Slower, less accurate transcription
- Background noise: Try to minimize before uploading
**Tip:** If audio quality is poor, the AI might misinterpret content. You can manually correct the transcription if needed.
### YouTube Videos
**Best practices:**
```
Two ways to add:
Method 1: Direct URL
1. Copy YouTube URL: https://www.youtube.com/watch?v=...
2. Paste in "Web Link"
3. Click Add
4. System extracts captions if available; otherwise it transcribes the audio
Method 2: Playlist
1. Paste playlist URL
2. System adds all videos as separate sources
3. Each video processed separately
4. Takes longer (multiple videos)
```
**What's extracted:**
- Captions/subtitles (if available)
- Transcription (if captions aren't available)
- Basic metadata (title, channel, length)
**Processing:**
- 10-minute video: ~2-3 minutes
- 1-hour video: ~10-15 minutes
### Text / Paste Content
**Best practices:**
```
1. Select "Text" when adding source
2. Paste or type content
3. System processes immediately
4. No wait time needed
Good for:
- Notes you want to reference
- Quotes from books
- Transcripts you have handy
- Quick research snippets
```
---
## Managing Your Sources
### Viewing Source Details
```
Click on source → See:
- Original file name/title
- When it was added
- Size and format
- Processing status
- Number of chunks
```
### Organizing with Metadata
You can add to each source:
- **Title**: Better name than original filename
- **Tags**: Category labels ("primary research", "background", "competitor analysis")
- **Description**: A few notes about what it contains
**Why this matters:**
- Makes sources easier to find
- Helps when contextualizing for Chat
- Useful for organizing large notebooks
### Searching Within Sources
```
After sources are added, you can:
Text search: "Find exact phrase"
Vector search: "Find conceptually similar"
Both search across all sources in notebook.
Results show:
- Which source
- Which section
- Relevance score
```
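The difference between the two search modes can be illustrated with a toy example. This is a sketch under stated assumptions, not Open Notebook's implementation: the function names are hypothetical, and the two-dimensional vectors are made up by hand, whereas real embeddings come from an embedding model and have hundreds of dimensions.

```python
import math


def text_search(chunks: list[str], phrase: str) -> list[str]:
    """Exact-phrase search: keep chunks that literally contain the phrase."""
    phrase = phrase.lower()
    return [chunk for chunk in chunks if phrase in chunk.lower()]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec: list[float], chunk_vecs: list[list[float]]) -> list[tuple[int, float]]:
    """Rank chunk indices by similarity to the query, most similar first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


chunks = ["Solar panels convert sunlight into power", "Cats sleep most of the day"]
toy_vectors = [[0.9, 0.1], [0.1, 0.9]]  # hand-made stand-ins for real embeddings

print(text_search(chunks, "solar panels"))      # matches only the literal phrase
print(vector_search([1.0, 0.0], toy_vectors))   # ranks the first chunk highest
```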
---
## Context Management: How Sources Get Used
You control how the AI accesses sources:
### Three Levels (for Chat)
**Full Content:**
```
AI sees: Complete source text
Cost: 100% of tokens
Use when: Analyzing in detail, need precise citations
Example: "Analyze this methodology paper closely"
```
**Summary Only:**
```
AI sees: AI-generated summary (not full text)
Cost: ~10-20% of tokens
Use when: Background material, reference context
Example: "Use this as context but focus on the main source"
```
**Not in Context:**
```
AI sees: Nothing (excluded)
Cost: 0 tokens
Use when: Confidential, not relevant, or archived
Example: "Keep this in notebook but don't use in this conversation"
```
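To see how the three levels affect cost, the sketch below estimates the tokens sent to the AI for a mix of sources. It assumes a summary costs about 15% of the full text (the midpoint of the ~10-20% range above); `estimate_context_tokens` and the level names are hypothetical, and real token counts depend on the model and the summary actually generated.

```python
def estimate_context_tokens(sources: list[tuple[int, str]], summary_ratio: float = 0.15) -> int:
    """Estimate tokens sent to the AI.

    `sources` is a list of (full_token_count, level) pairs, where level is
    "full", "summary", or "excluded".
    """
    total = 0
    for full_tokens, level in sources:
        if level == "full":
            total += full_tokens
        elif level == "summary":
            total += round(full_tokens * summary_ratio)
        elif level != "excluded":
            raise ValueError(f"unknown context level: {level}")
    return total


notebook = [(10_000, "full"), (20_000, "summary"), (5_000, "excluded")]
print(estimate_context_tokens(notebook))  # 10,000 + ~3,000 + 0
```

In this example, switching the 20,000-token source from Full to Summary removes roughly 17,000 tokens from every chat request.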
### How to Set Context (in Chat)
```
1. Go to Chat
2. Click "Select Context Sources"
3. For each source:
- Toggle ON/OFF (include/exclude)
- Choose level (Full/Summary/Excluded)
4. Click "Save"
5. Now chat uses these settings
```
---
## Common Mistakes
| Mistake | What Happens | How to Fix |
|---------|--------------|-----------|
| Upload 200 sources at once | System gets slow, processing stalls | Add 10-20 at a time, wait for processing |
| Use full content for all sources | Token usage skyrockets, expensive | Use "Summary" or "Excluded" for background material |
| Add huge PDFs without splitting | Processing is slow, search results less precise | Consider splitting large PDFs into chapters |
| Forget source titles | Can't distinguish between similar sources | Rename sources with descriptive titles right after uploading |
| Don't tag sources | Hard to find and organize later | Add tags immediately: "primary", "background", etc. |
| Mix languages in one source | Transcription/embedding quality drops | Keep each language in separate sources |
| Add the same source multiple times | Takes up space, creates confusion | Add once; reuse in multiple chats/notebooks |
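
For the first row of the table, a large set of files can be split into small batches before adding them. This is only a sketch of the batching step: `batches` is a hypothetical helper, the file names are invented, and adding each batch and waiting for it to finish processing is still done by hand in the app.

```python
from typing import Iterator


def batches(items: list[str], size: int = 10) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


files = [f"report_{n}.pdf" for n in range(25)]  # invented file names
for number, group in enumerate(batches(files), start=1):
    # Add this group, then wait for processing before moving to the next one.
    print(f"Batch {number}: {len(group)} files")
```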
---
## Processing Status & Troubleshooting
### What the Status Indicators Mean
```
🟡 Processing
→ Source is being extracted and embedded
→ Wait 30 seconds - 3 minutes depending on size
→ Don't use in Chat yet
🟢 Ready
→ Source is processed and searchable
→ Can use immediately in Chat
→ Can apply transformations
🔴 Error
→ Something went wrong
→ Common reasons:
- Unsupported file format
- File too large or corrupted
- Network timeout
⚪ Not in Context
→ Source added but excluded from Chat
→ Still searchable, not sent to AI
```
### Common Errors & Solutions
**"Unsupported file type"**
- You tried to upload a format not in the list (e.g., `.webp` image)
- Solution: Convert to supported format (PDF for documents, MP3 for audio)
**"Processing timeout"**
- Very large file (>100MB) or very long audio
- Solution: Split into smaller pieces or try uploading again
**"Transcription failed"**
- Audio quality too poor or language not detected
- Solution: Re-record with better quality, or paste text transcript manually
**"Web link won't extract"**
- Website blocks automated access or uses JavaScript for content
- Solution: Copy the article text and paste as "Text" instead
---
## Tips for Best Results
### For PDFs
- Clean, digital PDFs work best
- Remove copy protection if present (legally)
- Scanned PDFs work but take longer
### For Web Articles
- Use full URL including domain
- Avoid cookie/popup-laden sites
- If extraction fails, copy-paste text instead
### For Audio
- Clear, well-recorded audio transcribes better
- Remove background noise if possible
- YouTube videos usually come with good built-in captions
### For Large Documents
- Consider splitting into smaller sources
- Gives more precise search results
- Processing is faster for smaller pieces
### For Organization
- Name sources clearly (not "document_2.pdf")
- Add tags immediately after uploading
- Use descriptions for complex documents
---
## What Comes After: Using Your Sources
Once you've added sources, you can:
- **Chat** → Ask questions (see [Chat Effectively](chat-effectively.md))
- **Search** → Find specific content (see [Search Effectively](search.md))
- **Transformations** → Extract structured insights (see [Working with Notes](working-with-notes.md))
- **Ask** → Get comprehensive answers (see [Search Effectively](search.md))
- **Podcasts** → Turn into audio (see [Creating Podcasts](creating-podcasts.md))
---
## Summary Checklist
Before adding sources, confirm:
- [ ] File is in supported format
- [ ] File is under 100MB (or large files have been split)
- [ ] Web links are full URLs (not shortened)
- [ ] Audio files have clear speech (if transcription-dependent)
- [ ] You've named the source clearly
- [ ] You've added tags for organization
- [ ] You understand context levels (Full/Summary/Excluded)
Done! Sources are now ready for Chat, Search, Transformations, and more.