open-notebook/docs/2-CORE-CONCEPTS/podcasts-explained.md
Luis Novo eac837d555
feat(podcasts): model registry integration, credential passthrough & new features (#632)
* feat(podcasts): integrate model registry for profiles and credential passthrough

Replace loose provider/model string fields with record<model> references
in podcast profiles, enabling credential passthrough to podcast-creator.

Backend:
- EpisodeProfile: outline_llm, transcript_llm (record<model>) replace
  outline_provider/outline_model strings. New language field (BCP 47).
- SpeakerProfile: voice_model (record<model>) replaces tts_provider/
  tts_model strings. Per-speaker voice_model override support.
- Migration 14: schema changes making legacy fields optional, adding new
  record<model> fields.
- Data migration (migration.py): auto-converts legacy profiles to model
  registry references on startup. Idempotent.
- podcast_commands.py: resolves credentials for ALL profiles before
  calling podcast-creator.
- New /api/languages endpoint (pycountry + babel) with BCP 47 locale
  codes (pt-BR, en-US, etc.).

Frontend:
- Episode/speaker profile forms use ModelSelector instead of manual
  provider/model dropdowns.
- Language dropdown with BCP 47 codes in episode profile form.
- Per-speaker TTS voice model override in speaker profile form.
- "Templates" tab renamed to "Profiles".
- Setup required badge on unconfigured profiles.
- i18n updated across all 8 locales.

Closes #486, closes #552

* fix(i18n): remove unused legacy podcast provider/model keys

Remove 10 orphaned i18n keys across all 8 locales that were left behind
after replacing manual provider/model dropdowns with ModelSelector.

* fix: address review violations in podcast model registry

- P1: Remove profiles with failed model resolution from dicts to prevent
  podcast-creator validation errors on unrelated profiles
- P2: Use centralized QUERY_KEYS.languages instead of inline key
- P3: Fix ISO 639-1 → BCP 47 in model field description and CLAUDE.md
- P3: Update "templates" → "profiles" in locale string values (all 8)

* chore: bump version to 1.8.0
2026-02-27 11:06:47 -03:00


Podcasts Explained - Research as Audio Dialogue

Podcasts are Open Notebook's highest-level transformation: converting your research into audio dialogue for a different consumption pattern.


Why Podcasts Matter

The Problem

Research naturally accumulates as text: PDFs, articles, web pages, notes. This creates a friction point:

To consume research, you must:

  • Sit down at a desk
  • Focus intently
  • Read actively
  • Take notes
  • Set aside dedicated time

But much of life is passive time:

  • Commuting
  • Exercising
  • Doing dishes
  • Driving
  • Walking
  • Idle moments

The Solution

Convert your research into audio dialogue so you can consume it passively.

Before (Text-based):
  Research pile → Must schedule reading time → Requires focus

After (Podcast):
  Research pile → Podcast → Can listen while commuting
                         → Absorb while exercising
                         → Understand while walking
                         → Engage without screen time

What Makes It Special: Open Notebook vs. Competitors

Google Notebook LM Podcasts

  • Fixed format: 2 hosts, always conversational
  • Limited customization: You can't choose who the "hosts" are
  • One TTS voice per speaker: Can't customize voices
  • Only uses cloud services: No local options

Open Notebook Podcasts

  • Customizable format: 1-4 speakers, you design them
  • Rich speaker profiles: Create personas with backstories and expertise
  • Multiple TTS options:
    • OpenAI (natural, fast)
    • Google TTS (high quality)
    • ElevenLabs (beautiful voices, accents)
    • Local TTS (privacy-first, no API calls)
  • Async generation: Doesn't block your work
  • Full control: Choose outline structure, tone, depth

How Podcast Generation Works

Stage 1: Content Selection

You choose what goes into the podcast:

Notebook content → Which sources? → Which notes?
                → Which topics to focus on?
                → Depth of coverage?

Stage 2: Episode Profile

You define how you want the podcast structured:

Episode Profile
├─ Topic: "AI Safety Approaches"
├─ Length: 20 minutes
├─ Tone: Academic but accessible
├─ Format: Debate (2 speakers with opposing views)
├─ Audience: Researchers new to the field
└─ Focus areas: Main approaches, pros/cons, open questions
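
The profile above can be pictured as a plain data structure. The field names here (`topic`, `length_minutes`, `focus_areas`, `language`) are illustrative, not Open Notebook's actual schema:

```python
# Illustrative sketch of an episode profile; field names are
# hypothetical and do not reflect Open Notebook's real schema.
episode_profile = {
    "topic": "AI Safety Approaches",
    "length_minutes": 20,
    "tone": "Academic but accessible",
    "format": "debate",            # 2 speakers with opposing views
    "audience": "Researchers new to the field",
    "focus_areas": [
        "Main approaches",
        "Pros/cons",
        "Open questions",
    ],
    "language": "en-US",           # BCP 47 locale code
}

print(episode_profile["format"])
```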

Stage 3: Speaker Configuration

You create speaker personas (1-4 speakers):

Speaker 1: "Expert Alex"
├─ Expertise: "Deep knowledge of alignment research"
├─ Personality: "Rigorous, academic, patient with explanations"
├─ Accent: (Optional) "British English"
└─ Voice Model: Selected from model registry (e.g., OpenAI TTS)
   └─ Optional per-speaker override of the episode's default voice model

Speaker 2: "Researcher Sam"
├─ Expertise: "Field observer, pragmatic perspective"
├─ Personality: "Curious, asks clarifying questions"
├─ Accent: "American English"
└─ Voice Model: Selected from model registry (e.g., ElevenLabs TTS)
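
The two personas above, and the per-speaker voice override, can be sketched the same way. `default_voice_model`, `voice_model`, and `resolve_voice` are hypothetical names for illustration, not the real implementation:

```python
# Hypothetical speaker profile structures; real field names may differ.
default_voice_model = "openai-tts"  # episode-level default (illustrative id)

speakers = [
    {
        "name": "Expert Alex",
        "expertise": "Deep knowledge of alignment research",
        "personality": "Rigorous, academic, patient with explanations",
        "accent": "British English",
        "voice_model": None,  # None -> fall back to the episode default
    },
    {
        "name": "Researcher Sam",
        "expertise": "Field observer, pragmatic perspective",
        "personality": "Curious, asks clarifying questions",
        "accent": "American English",
        "voice_model": "elevenlabs-tts",  # per-speaker override
    },
]

def resolve_voice(speaker: dict) -> str:
    """Per-speaker override wins; otherwise use the episode default."""
    return speaker["voice_model"] or default_voice_model

for s in speakers:
    print(s["name"], "->", resolve_voice(s))
```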

Stage 4: Outline Generation

System generates episode outline:

EPISODE: "AI Safety Approaches"

1. Introduction (2 min)
   Alex: Introduces topic and speakers
   Sam: What will we cover today?

2. Main Approaches (8 min)
   Alex: Explains top 3 approaches
   Sam: Asks about tradeoffs

3. Debate: Best approach? (6 min)
   Alex: Advocates for approach A
   Sam: Argues for approach B

4. Open Questions (3 min)
   Both: What's unsolved?

5. Conclusion (1 min)
   Recap and where to learn more

Stage 5: Dialogue Generation

System generates dialogue based on outline:

Alex: "Today we're exploring three major approaches to AI alignment..."

Sam: "That's a great start. Can you break down what we mean by alignment?"

Alex: "Good question. Alignment means ensuring AI systems pursue the goals
       we actually want them to pursue, not just what we literally asked for.
       There's a classic example of a paperclip maximizer..."

Sam: "Interesting. So it's about solving the intention problem?"

Alex: "Exactly. And that's where the three approaches come in..."

Stage 6: Text-to-Speech

System converts dialogue to audio using the voice models configured in the model registry. Credentials are automatically resolved from each model's configuration.

Alex's text → Voice model (from registry) → Alex's voice (audio file)
Sam's text → Voice model (from registry) → Sam's voice (audio file)
Audio files → Mix together → Final podcast MP3
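
Stage 6 can be pictured as a tiny pipeline: one TTS call per dialogue line, concatenated in speaking order. `synthesize` is a stand-in stub, not a real TTS API, and a real mixer would join audio segments rather than bytes:

```python
# Toy pipeline: one TTS call per dialogue line, then concatenate in order.
# `synthesize` stands in for a call to the configured voice model.

def synthesize(voice: str, text: str) -> bytes:
    # A real implementation would call the TTS provider here.
    return f"[{voice}] {text}".encode()

def build_podcast(dialogue: list[tuple[str, str]], voices: dict[str, str]) -> bytes:
    segments = [synthesize(voices[speaker], line) for speaker, line in dialogue]
    return b"".join(segments)  # a real mixer joins audio, not bytes

voices = {"Alex": "openai-tts", "Sam": "elevenlabs-tts"}
dialogue = [
    ("Alex", "Today we're exploring three major approaches..."),
    ("Sam", "Can you break down what we mean by alignment?"),
]
audio = build_podcast(dialogue, voices)
```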

When Things Go Wrong: Failures & Retry

Podcast generation involves multiple steps (outline, transcript, TTS) and depends on external AI providers. Sometimes things fail.

What Happens on Failure

When podcast generation fails (e.g., wrong model configured, API key expired, provider outage):

  • The episode is marked as Failed with a red badge
  • The error message from the AI provider is displayed so you can understand what went wrong
  • No duplicate episodes are created — automatic retries are disabled to prevent confusion

How to Retry a Failed Episode

  1. Go to the podcast's Episodes tab
  2. Find the failed episode — it shows a red "FAILED" badge and an error details box
  3. Click the Retry button
  4. The failed episode is deleted and a new generation job is submitted
  5. The new episode appears with "pending" status

Common Failure Causes

Error                 What to Do
Invalid API key       Check Settings -> Credentials for the TTS and language model providers
Model not found       Verify the model exists in the model registry and has valid credentials configured
Rate limit exceeded   Wait a few minutes and retry
Provider unavailable  Check provider status page; retry later
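
A generic way to handle the transient failures in this table (rate limits, provider outages) is retry with exponential backoff. This is an illustrative sketch only, not Open Notebook's retry behavior, which deletes the failed episode and submits a fresh job:

```python
import time

# Generic retry-with-backoff sketch for transient failures such as
# rate limits or provider outages.
def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    for attempt in range(attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```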

Key Architecture Decisions

1. Asynchronous Processing

Podcasts are generated in the background. You request an episode → the system generates it → you listen when it's ready.

Why? Podcast generation takes time (10+ minutes for a 30-minute episode). Blocking would lock up your interface.
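
The non-blocking pattern can be sketched with a thread pool: submit the job, keep working, collect the result when it's ready. Open Notebook runs its own background-job system; `generate_podcast` here is a stand-in:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Minimal sketch of non-blocking generation. `generate_podcast` is a
# stand-in for the real pipeline, which takes minutes, not 0.1s.
def generate_podcast(topic: str) -> str:
    time.sleep(0.1)  # stands in for outline + transcript + TTS work
    return f"{topic}.mp3"

with ThreadPoolExecutor() as pool:
    job = pool.submit(generate_podcast, "AI Safety Approaches")
    # ... the interface stays responsive while the job runs ...
    episode = job.result()  # collect the finished episode when ready
```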

2. Multi-Speaker Support

Unlike Google Notebook LM (always 2 hosts), you choose 1-4 speakers.

Why? Different discussions work better with different formats:

  • Expert monologue (1 speaker)
  • Interview (2 speakers: host + expert)
  • Debate (2 speakers: opposing views)
  • Panel discussion (3-4 speakers: different expertise)

3. Speaker Customization

You create rich speaker profiles, not just "Host A" and "Host B".

Why? Makes podcasts more engaging and authentic. Different speakers bring different perspectives.

4. Multiple TTS Providers

You're not locked into one voice provider.

Why?

  • Cost optimization (some providers cheaper)
  • Quality preferences (some voices more natural)
  • Privacy options (local TTS for sensitive content)
  • Accessibility (different accents, genders, styles)

5. Local TTS Option

Can generate podcasts entirely offline with local text-to-speech.

Why? For sensitive research, nothing is ever sent to external APIs.


Use Cases Show Why This Matters

Academic Publishing

Traditional: Academic paper → PDF
Problem: Hard to consume, linear reading required

Open Notebook:
Research materials → Podcast (expert explaining methodology)
                  → Podcast (debate format: different interpretations)
                  → Different consumption for different audiences

Content Creation

Blog creator: Has research pile on a topic
Problem: Doesn't have time to write the article

Solution:
Add research → Create podcast → Transcribe → Becomes article
OR: Podcast BECOMES the content (upload to podcast platforms)

Educational Content

Educator: Has reading materials for a course
Problem: Students don't read the papers

Solution:
Create podcast with expert explaining papers
Students listen → Better engagement → Discussions can reference podcast

Market Research

Product manager: Has interviews with customers
Problem: Too many hours of audio to review

Solution:
Create podcast with debate format (customer perspective vs. team perspective)
Much more engaging than raw transcripts

Knowledge Transfer

Domain expert: Leaving the organization
Problem: How to preserve expertise?

Solution:
Create expert-mode podcast explaining frameworks, decision-making, context
New team member listens, gets context faster than reading 100 documents

The Difference: Active vs. Passive Learning

Text-Based Research (Active)

  • Effort: High (must focus, read, synthesize)
  • When: Dedicated study time
  • Cost: High (can't multitask)
  • Best for: Deep dives, precise information
  • Format: Whatever you write (notes, articles, books)

Audio Podcast (Passive)

  • Effort: Low (just listen)
  • When: Anywhere, anytime
  • Cost: Low (can multitask)
  • Best for: Overview, context, exploration
  • Format: Dialogue (more engaging than narration)

They complement each other:

  1. First encounter: Listen to podcast (passive, get context)
  2. Deep dive: Read source materials (active, precise)
  3. Mastery: Both together (understand big picture + details)

How Podcasts Fit Into Your Workflow

1. Build notebook (add sources)
   ↓
2. Apply transformations (extract insights)
   ↓
3. Chat/Ask (explore content)
   ↓
4. Decide on podcast
   ├─→ Create speaker profiles
   ├─→ Define episode profile
   ├─→ Configure voice models (from model registry)
   └─→ Generate podcast
   ↓
5. Listen while commuting/exercising
   ↓
6. Reference sources for deep dive
   ↓
7. Repeat for different formats/speakers/focus

Advanced: Multiple Podcasts from Same Research

You can create different podcasts from the same sources:

Example: AI Safety Research

Podcast 1: "Expert Monologue"
  Speaker: Researcher explaining field
  Format: Educational, comprehensive
  Audience: Students new to field

Podcast 2: "Debate Format"
  Speakers: Optimist vs. skeptic
  Format: Discussion of tradeoffs
  Audience: Advanced researchers

Podcast 3: "Interview Format"
  Speakers: Journalist + expert
  Format: Q&A about practical applications
  Audience: Industry practitioners

Each tells the same story from different angles.


Privacy & Data Considerations

Where Your Data Goes

Option 1: Cloud TTS (Faster, Higher Quality)

Your outline → API call to TTS provider
            → Audio returned
            → Stored in your notebook

Provider sees: Your outlined script (not raw sources)
Privacy level: Medium (outline is shared, sources aren't)

Option 2: Local TTS (Slower, Maximum Privacy)

Your outline → Local TTS engine (runs on your machine)
            → Audio generated locally
            → Stored in your notebook

Provider sees: Nothing
Privacy level: Maximum (everything local)

Recommendation

  • Sensitive research: Use local TTS, no API calls
  • Less sensitive: Use ElevenLabs or Google (both handle audio data professionally)
  • Mixed: Use local TTS for speakers reading sensitive content

Cost Considerations

Cloud TTS Costs

Provider     Cost                 Quality      Speed
OpenAI       ~$0.015 per minute   Good         Fast
Google       ~$0.004 per minute   Excellent    Fast
ElevenLabs   ~$0.10 per minute    Exceptional  Medium
Local TTS    Free                 Basic        Slow

A 30-minute podcast costs:

  • OpenAI: ~$0.45
  • Google: ~$0.12
  • ElevenLabs: ~$3.00
  • Local: Free (but slow)
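
These estimates follow directly from the per-minute rates, as a quick sanity check (the rates are the approximate figures above, not authoritative pricing):

```python
# Back-of-envelope cost check for a 30-minute episode, using the
# approximate per-minute rates from the table above.
RATES = {  # USD per audio minute (approximate)
    "openai": 0.015,
    "google": 0.004,
    "elevenlabs": 0.10,
    "local": 0.0,
}

def episode_cost(provider: str, minutes: int) -> float:
    return round(RATES[provider] * minutes, 2)

for provider in RATES:
    print(provider, episode_cost(provider, 30))
```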

Summary: Why Podcasts Are Special

Podcasts transform your research consumption:

Aspect            Text                 Podcast
How consumed?     Active reading       Passive listening
Where consumed?   Desk                 Anywhere
Multitasking      Hard                 Easy
Time commitment   Scheduled            Flexible
Format            Whatever you write   Natural dialogue
Engagement        Academic             Conversational
Accessibility     Text-based           Audio-based

In Open Notebook specifically:

  • Full customization — you create speakers and format
  • Privacy options — local TTS for sensitive content
  • Cost control — choose TTS provider based on budget
  • Non-blocking — generates in background
  • Multiple versions — create different podcasts from same research

This is why podcasts matter: they change when and how you can consume your research.