# Adding Sources - Getting Content Into Your Notebook

Sources are the raw materials of your research. This guide covers how to add different types of content.

---

## Quick-Start: Add Your First Source

### Option 1: Upload a File (PDF, Word, etc.)

```
1. In your notebook, click "Add Source"
2. Select "Upload File"
3. Choose a file from your computer
4. Click "Upload"
5. Wait 30-60 seconds for processing
6. Done! Source appears in your notebook
```

### Option 2: Add a Web Link

```
1. Click "Add Source"
2. Select "Web Link"
3. Paste URL: https://example.com/article
4. Click "Add"
5. Wait for processing (usually faster than files)
6. Done!
```

### Option 3: Paste Text

```
1. Click "Add Source"
2. Select "Text"
3. Paste or type your content
4. Click "Save"
5. Done! Immediately available
```

---

## Supported File Types

### Documents
- **PDF** (.pdf): Best support, including scanned PDFs with OCR
- **Word** (.docx, .doc): Full support
- **PowerPoint** (.pptx): Slides converted to text
- **Excel** (.xlsx, .xls): Spreadsheet data
- **EPUB** (.epub): eBook files
- **Markdown and plain text** (.md, .txt): Plain text formats
- **HTML** (.html, .htm): Web page files

**File size limits:** Up to ~100MB (varies by system)

**Processing time:** 10 seconds to 2 minutes (depending on length and file type)

### Audio & Video
- **Audio**: MP3, WAV, M4A, OGG, FLAC (~30 seconds to 3 minutes of processing per hour of audio)
- **Video**: MP4, AVI, MOV, MKV, WebM (~3-10 minutes of processing per hour of video)
- **YouTube**: Direct URL support
- **Podcasts**: RSS feed URL

**Automatic transcription**: Audio and video are transcribed to text automatically once speech-to-text is enabled in Settings.

### Web Content
- **Articles**: Blog posts, news articles, Medium
- **YouTube**: Full videos or playlists
- **PDFs online**: Direct PDF links
- **News**: News site articles

**Just paste the URL** in the "Web Link" section.

### What Doesn't Work
- Paywalled content (WSJ, FT, etc.): Can't be extracted
- Password-protected PDFs: Can't be opened
- Pure image files (.jpg, .png): Not supported; OCR applies only to scanned PDFs
- Very large files (>100MB): Processing times out

---

## What Happens When You Add a Source

The system automatically does four things:

```
1. EXTRACT TEXT
   File/URL → Readable text
   (PDFs get OCR if scanned)
   (Videos get transcribed if enabled)

2. BREAK INTO CHUNKS
   Long text → ~500-word pieces
   (So search finds specific parts, not whole document)

3. CREATE EMBEDDINGS
   Each chunk → Vector representation
   (Enables semantic/concept search)

4. INDEX & STORE
   Everything → Database
   (Ready to search and retrieve)
```
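
To make step 2 concrete, here is a minimal sketch of fixed-size chunking. The function name `chunk_text`, the whitespace-based word split, and the hard 500-word window are assumptions for illustration only; this guide does not describe how Open Notebook actually splits text beyond the "~500-word pieces" figure.

```python
def chunk_text(text: str, max_words: int = 500) -> list[str]:
    """Split text into consecutive pieces of at most max_words words.

    Illustrative only: a real splitter may respect sentence or
    paragraph boundaries and overlap neighbouring chunks.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


document = "word " * 1200  # stand-in for text extracted from a source
chunks = chunk_text(document)
print(len(chunks))                       # 3
print([len(c.split()) for c in chunks])  # [500, 500, 200]
```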

**Time to use:** Once the progress bar completes, the source is ready to use. Embeddings continue to be created in the background.

---

## Step-by-Step for Different Types

### PDFs

**Best practices:**
```
Clean PDFs:
1. Upload → Done
2. Processing time: ~30-60 seconds

Scanned/Image PDFs:
1. Upload same way
2. System auto-detects and uses OCR
3. Processing time: ~2-3 minutes
   (Higher, due to OCR overhead)

Large PDFs (50+ pages):
1. Consider splitting into smaller files
2. Or upload as-is (system handles it)
3. Processing time scales with size
```

**Common issues:**
- "Can't extract text" → PDF is corrupted or has copy protection
- Solution: Try opening the file in Adobe Acrobat. If it won't open there either, the PDF is likely corrupted or protected.

### Web Links / Articles

**Best practices:**
```
1. Copy full URL from browser: https://example.com/article-title
2. Paste in "Web Link"
3. Click Add
4. Wait for extraction

Processing time: Usually 5-15 seconds
```

**What works:**
- Standard web articles
- Blog posts
- News articles
- Wikipedia pages
- Medium posts
- Substack articles

**What doesn't work:**
- Twitter threads (unreliable)
- Paywalled articles (can't access)
- JavaScript-heavy sites (content not extracted)

**Pro tip:** If it doesn't work, copy the article text and paste as "Text" instead.

### Audio Files

**Best practices:**
```
1. Ensure speech-to-text is enabled in Settings
2. Upload an audio file (MP3, WAV, M4A, OGG, or FLAC)
3. System automatically transcribes to text
4. Processing time: ~1 minute per 5 minutes of audio

Example:
- 1-hour podcast → 12 minutes processing
- 10-minute recording → 2 minutes processing
```
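
The examples above follow directly from the 1:5 rule of thumb, as this small sketch shows. The helper name `estimate_transcription_minutes` and its parameter are hypothetical, and the ratio is the guide's rough estimate rather than a measured guarantee; actual times depend on hardware and the speech-to-text model.

```python
def estimate_transcription_minutes(
    audio_minutes: float,
    audio_minutes_per_processing_minute: float = 5.0,
) -> float:
    """Estimate processing time using the '1 minute per 5 minutes of audio' rule."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must not be negative")
    return audio_minutes / audio_minutes_per_processing_minute


print(estimate_transcription_minutes(60))  # 12.0 (1-hour podcast)
print(estimate_transcription_minutes(10))  # 2.0 (10-minute recording)
```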

**Quality matters:**
- Clear audio: Fast transcription
- Muffled/noisy audio: Slower, less accurate transcription
- Background noise: Try to minimize before uploading

**Tip:** If audio quality is poor, the transcription may contain errors. You can correct the transcript manually if needed.

### YouTube Videos

**Best practices:**
```
Two ways to add:

Method 1: Direct URL
1. Copy YouTube URL: https://www.youtube.com/watch?v=...
2. Paste in "Web Link"
3. Click Add
4. System extracts captions (if available) or transcribes the audio

Method 2: Playlist
1. Paste playlist URL
2. System adds all videos as separate sources
3. Each video processed separately
4. Takes longer (multiple videos)
```

**What's extracted:**
- Captions/subtitles (if available)
- Transcription (if captions aren't available)
- Basic metadata (title, channel, length)

**Processing:**
- 10-minute video: ~2-3 minutes
- 1-hour video: ~10-15 minutes

### Text / Paste Content

**Best practices:**
```
1. Select "Text" when adding source
2. Paste or type content
3. System processes immediately
4. No wait time needed

Good for:
- Notes you want to reference
- Quotes from books
- Transcripts you have handy
- Quick research snippets
```

---

## Managing Your Sources

### Viewing Source Details

```
Click on source → See:
- Original file name/title
- When it was added
- Size and format
- Processing status
- Number of chunks
```

### Organizing with Metadata

You can add to each source:
- **Title**: Better name than original filename
- **Tags**: Category labels ("primary research", "background", "competitor analysis")
- **Description**: A few notes about what it contains

**Why this matters:**
- Makes sources easier to find
- Helps when contextualizing for Chat
- Useful for organizing large notebooks

### Searching Within Sources

```
After sources are added, you can:

Text search: "Find exact phrase"
Vector search: "Find conceptually similar"

Both search across all sources in notebook.
Results show:
- Which source
- Which section
- Relevance score
```
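
The difference between the two search modes can be shown with a toy example. Everything here is an assumption made for illustration: the function names, the sample chunks, and especially the hand-made 3-dimensional vectors, which stand in for the embeddings a real model would produce. The sketch uses cosine similarity as the relevance score, a common choice that this guide does not confirm for Open Notebook.

```python
import math


def text_search(query: str, chunks: list[str]) -> list[int]:
    """Return indexes of chunks containing the exact phrase (case-insensitive)."""
    needle = query.lower()
    return [i for i, chunk in enumerate(chunks) if needle in chunk.lower()]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(query_vec: list[float],
                  chunk_vecs: list[list[float]]) -> list[tuple[int, float]]:
    """Rank chunk indexes by similarity to the query, best match first."""
    scored = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


chunks = [
    "Solar panels convert sunlight into electricity.",
    "Photovoltaic cells generate power from light.",
    "The recipe calls for two cups of flour.",
]
# Made-up vectors: the first axis loosely means "about solar energy".
chunk_vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the query "solar"

print(text_search("solar", chunks))                            # [0]
print([i for i, _ in vector_search(query_vec, chunk_vecs)])    # [0, 1, 2]
```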

---

## Context Management: How Sources Get Used

You control how AI accesses sources:

### Three Levels (for Chat)

**Full Content:**
```
AI sees: Complete source text
Cost: 100% of tokens
Use when: Analyzing in detail, need precise citations
Example: "Analyze this methodology paper closely"
```

**Summary Only:**
```
AI sees: AI-generated summary (not full text)
Cost: ~10-20% of tokens
Use when: Background material, reference context
Example: "Use this as context but focus on the main source"
```

**Not in Context:**
```
AI sees: Nothing (excluded)
Cost: 0 tokens
Use when: Confidential, not relevant, or archived
Example: "Keep this in notebook but don't use in this conversation"
```
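
To see how the three levels affect cost, here is a rough budgeting sketch. The names `ContextLevel`, `TOKEN_FRACTION`, and `estimate_context_tokens` are hypothetical, the file names and token counts are invented, and 0.15 is simply the midpoint of the ~10-20% range quoted above.

```python
from enum import Enum


class ContextLevel(Enum):
    FULL = "full"
    SUMMARY = "summary"
    EXCLUDED = "excluded"


# Fraction of a source's tokens sent to the model at each level.
# 0.15 is an assumed midpoint of the ~10-20% summary range.
TOKEN_FRACTION = {
    ContextLevel.FULL: 1.0,
    ContextLevel.SUMMARY: 0.15,
    ContextLevel.EXCLUDED: 0.0,
}


def estimate_context_tokens(sources: dict[str, tuple[int, ContextLevel]]) -> int:
    """Sum the approximate tokens each source contributes to a chat."""
    total = sum(tokens * TOKEN_FRACTION[level]
                for tokens, level in sources.values())
    return round(total)


sources = {
    "methodology-paper.pdf": (12_000, ContextLevel.FULL),
    "background-survey.pdf": (30_000, ContextLevel.SUMMARY),
    "old-draft.docx": (8_000, ContextLevel.EXCLUDED),
}
print(estimate_context_tokens(sources))  # 16500
```

Under the same assumptions, switching the background survey from Summary to Full would raise the estimate to 42,000 tokens, which is why Summary suits background material.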

### How to Set Context (in Chat)

```
1. Go to Chat
2. Click "Select Context Sources"
3. For each source:
   - Toggle ON/OFF (include/exclude)
   - Choose level (Full/Summary/Excluded)
4. Click "Save"
5. Now chat uses these settings
```

---

## Common Mistakes

| Mistake | What Happens | How to Fix |
|---------|--------------|-----------|
| Upload 200 sources at once | System gets slow, processing stalls | Add 10-20 at a time, wait for processing |
| Use full content for all sources | Token usage skyrockets, expensive | Use "Summary" or "Excluded" for background material |
| Add huge PDFs without splitting | Processing is slow, search results less precise | Consider splitting large PDFs into chapters |
| Forget source titles | Can't distinguish between similar sources | Rename sources with descriptive titles right after uploading |
| Don't tag sources | Hard to find and organize later | Add tags immediately: "primary", "background", etc. |
| Mix languages in one source | Transcription/embedding quality drops | Keep each language in separate sources |
| Add the same source multiple times | Takes up space, creates confusion | Add once; reuse in multiple chats/notebooks |

---

## Processing Status & Troubleshooting

### What the Status Indicators Mean

```
🟡 Processing
   → Source is being extracted and embedded
   → Wait 30 seconds - 3 minutes depending on size
   → Don't use in Chat yet

🟢 Ready
   → Source is processed and searchable
   → Can use immediately in Chat
   → Can apply transformations

🔴 Error
   → Something went wrong
   → Common reasons:
     - Unsupported file format
     - File too large or corrupted
     - Network timeout

⚪ Not in Context
   → Source added but excluded from Chat
   → Still searchable, not sent to AI
```

### Common Errors & Solutions

**"Unsupported file type"**
- You tried to upload a format not in the list (e.g., `.webp` image)
- Solution: Convert to supported format (PDF for documents, MP3 for audio)

**"Processing timeout"**
- Very large file (>100MB) or very long audio
- Solution: Split into smaller pieces or try uploading again

**"Transcription failed"**
- Audio quality too poor or language not detected
- Solution: Re-record with better quality, or paste text transcript manually

**"Web link won't extract"**
- Website blocks automated access or uses JavaScript for content
- Solution: Copy the article text and paste as "Text" instead

---

## Tips for Best Results

### For PDFs
- Clean, digital PDFs work best
- Remove copy protection if present (legally)
- Scanned PDFs work but take longer

### For Web Articles
- Use full URL including domain
- Avoid cookie/popup-laden sites
- If extraction fails, copy-paste text instead

### For Audio
- Clear, well-recorded audio transcribes better
- Remove background noise if possible
- YouTube videos usually have good transcriptions built-in

### For Large Documents
- Consider splitting into smaller sources
- Gives more precise search results
- Processing is faster for smaller pieces

### For Organization
- Name sources clearly (not "document_2.pdf")
- Add tags immediately after uploading
- Use descriptions for complex documents

---

## What Comes After: Using Your Sources

Once you've added sources, you can:

- **Chat** → Ask questions (see [Chat Effectively](chat-effectively.md))
- **Search** → Find specific content (see [Search Effectively](search.md))
- **Transformations** → Extract structured insights (see [Working with Notes](working-with-notes.md))
- **Ask** → Get comprehensive answers (see [Search Effectively](search.md))
- **Podcasts** → Turn into audio (see [Creating Podcasts](creating-podcasts.md))

---

## Summary Checklist

When adding sources, confirm:

- [ ] File is in a supported format
- [ ] File is under 100MB (or split into smaller files)
- [ ] Web links are full URLs (not shortened)
- [ ] Audio files have clear speech (if transcription-dependent)
- [ ] You've named the source clearly
- [ ] You've added tags for organization
- [ ] You understand context levels (Full/Summary/Excluded)

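
The first three checklist items lend themselves to a quick local check before uploading. The sketch below is not part of Open Notebook: `check_file` and `is_full_url` are hypothetical helpers, the extension set is copied from the formats listed in this guide, and 100MB is the approximate guideline rather than an enforced limit. The URL check only confirms a scheme and host are present; it cannot tell whether a link has been shortened.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extensions listed under "Supported File Types" in this guide.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".xlsx", ".xls",
    ".epub", ".md", ".txt", ".html", ".htm",
    ".mp3", ".wav", ".m4a", ".ogg", ".flac",
    ".mp4", ".avi", ".mov", ".mkv", ".webm",
}
# The ~100MB guideline; the real limit varies by system.
MAX_SIZE_BYTES = 100 * 1024 * 1024


def check_file(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'no extension'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("larger than 100MB: consider splitting")
    return problems


def is_full_url(url: str) -> bool:
    """True if the URL has an http(s) scheme and a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(check_file("report.pdf", 5_000_000))          # []
print(check_file("photo.webp", 5_000_000))          # ['unsupported format: .webp']
print(is_full_url("https://example.com/article"))   # True
print(is_full_url("example.com/article"))           # False
```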

Done! Sources are now ready for Chat, Search, Transformations, and more.