Wise Songs — Video Pipeline Spec

Two parallel tracks. Track A is our own sequential scene pipeline (better visual coherence than Revid, cheaper, full control). Track B is Revid automation (for movie-mode content where Revid genuinely wins).

Track A — Sequential Scene Pipeline (`--level scene`)

The Core Idea

Every existing tool (Revid included) generates each scene image independently from a text prompt. This produces visual incoherence: different styles, random palettes, jarring cuts.

Our approach: img2img chaining. Each scene is generated from the previous scene using a new prompt at a controlled strength. The result is a visually continuous world that evolves — same palette, same style, same general aesthetic — while the content changes per verse.

Revid cannot do this. It is a genuine differentiator.
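
A minimal sketch of the chaining loop, to make the data flow concrete. It assumes two hypothetical callables, `text_to_image(prompt)` and `image_to_image(prompt, image_url, strength)`, that wrap whichever model calls the pipeline ends up using and return an output URL. Those names and signatures are illustrative and are not an existing API.

```python
from typing import Callable, List


def generate_chain(
    art_style: str,
    setting: str,
    descriptions: List[str],
    text_to_image: Callable[[str], str],
    image_to_image: Callable[[str, str, float], str],
    strength: float = 0.55,
) -> List[str]:
    """Return one image URL per scene, each derived from the previous one."""
    if not descriptions:
        return []
    # Scene 0 has no predecessor, so it is plain text-to-image.
    first_prompt = f"{art_style}, {setting}, {descriptions[0]}, no text, 16:9"
    urls = [text_to_image(first_prompt)]
    # Every later scene starts from the previous scene's output image.
    for description in descriptions[1:]:
        prompt = f"{art_style}, {description}, no text, 16:9"
        urls.append(image_to_image(prompt, urls[-1], strength))
    return urls
```

Passing the generators in as arguments keeps the loop testable without network calls and makes it easy to swap models later.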

Pipeline Steps

1. INGEST
   song_data (id, title, lyrics, category, style_prompt, image_url)
   mp3 path

2. VISUAL WORLD GENERATION  [GPT-4o, ~$0.002/song]
   Input: title + full lyrics + category + style_prompt
   Output (JSON):
     {
       "art_style": "warm watercolor, amber and indigo palette, soft rounded edges",
       "setting": "forest clearing, ancient library, ocean horizon...",
       "mood": "hopeful, tense, playful",
       "character_anchors": "small fox with orange scarf, wise owl in glasses",
       "scenes": [
         {
           "verse_index": 0,
           "section": "verse",
           "cinematic_description": "Wide shot: fox standing at crossroads, golden light...",
           "transition_to_next": "dissolve"
         },
         ...
       ]
     }
   Prompt template:
     "You are a visual director. Given these song lyrics, produce a JSON visual world for a
      music video. Art style should match the category ({category}). Output valid JSON only."

3. IMAGE GENERATION  [Replicate, ~$0.003–$0.25/song depending on scene count]
   Scene 0:
     FLUX-schnell text-to-image
     prompt = "{art_style}, {setting}, {scene[0].cinematic_description}, no text, 16:9"
     cost: $0.003

   Scene N (for N > 0):
     FLUX-dev img2img
     image = scene[N-1] output URL (pass directly to Replicate, no re-download needed)
     prompt = "{art_style}, {scene[N].cinematic_description}, no text, 16:9"
     prompt_strength = 0.55  (tune per category: fables 0.6, GRE 0.45, cerebral 0.65)
     cost: $0.025/image

   Total cost example (7 scenes): $0.003 + 6 × $0.025 = $0.153/song
   Total cost example (5 scenes): $0.003 + 4 × $0.025 = $0.103/song

4. VIDEO ASSEMBLY  [ffmpeg + moviepy, free]
   Per image: Ken Burns effect (zoompan filter)
     - Alternate zoom-in / zoom-out per scene
     - Pan direction varies: center, left-drift, right-drift, diagonal
     - Duration = verse duration + 1.5s overlap for crossfade

   Transitions (xfade filter, 1.0s):
     Selected per scene pair from visual world "transition_to_next" field:
       verse→verse:   "fade" or "dissolve"
       verse→chorus:  "wipeleft" or "circleopen"
       chorus→verse:  "wiperight" or "fadeblack"
       bridge/outro:  "pixelize" or "radial"
     Fallback: "fade" for any unrecognised type

   ffmpeg xfade supports 40+ transition types, so the full list is available if we want more variety.

5. KARAOKE TEXT OVERLAY  [Whisper + PIL + moviepy, ~$0.006/min audio]
   Whisper word timestamps → yellow highlight on active word
   Verse text fades in/slides up at segment boundary (0.18s)
   Semi-transparent text panel so scene imagery shows through
   Font: Impact 72px for lyrics, 44px for title bar

6. HOOK EXTRACTION  [ffmpeg, free]
   Trim first 30s → 9:16 reformat (crop center) → same karaoke + scene imagery
   Outputs to hooks/{channel}/{slug}_hook.mp4

7. DB UPDATE
   video_assets record: path, format, duration, scene_count, cost_usd, whisper_words
   pipeline_stage → "video_ready"
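
The transition handling in step 4 can be sketched as two small helpers: one applies the "fade" fallback, the other works out where each crossfade starts when clips are chained. The function names are hypothetical, and `KNOWN_TRANSITIONS` lists only the names used in this spec, not everything xfade accepts.

```python
from typing import List

# Transition names used in this spec; a subset of what ffmpeg's xfade accepts.
KNOWN_TRANSITIONS = {
    "fade", "dissolve", "wipeleft", "circleopen",
    "wiperight", "fadeblack", "pixelize", "radial",
}


def pick_transition(requested: str) -> str:
    """Use the requested transition if recognised, otherwise fall back to 'fade'."""
    return requested if requested in KNOWN_TRANSITIONS else "fade"


def xfade_offsets(clip_durations: List[float], transition: float = 1.0) -> List[float]:
    """Start time of each crossfade, measured on the output timeline."""
    offsets = []
    elapsed = 0.0
    for k, duration in enumerate(clip_durations[:-1]):
        elapsed += duration
        # Subtract one transition length per crossfade so far (this one
        # included), because each crossfade overlaps two clips.
        offsets.append(round(elapsed - (k + 1) * transition, 3))
    return offsets
```

For clips of 5 s, 6 s and 4 s with a 1.0 s transition, `xfade_offsets([5.0, 6.0, 4.0])` gives `[4.0, 9.0]`. Each value goes into the filter as `xfade=transition=<name>:duration=1:offset=<value>`.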

Cost Table

Songs	Scenes avg	Image cost	Whisper	GPT	Total
10	7	$1.53	$0.06	$0.02	~$1.61
50	7	$7.65	$0.30	$0.10	~$8.05
100	7	$15.30	$0.60	$0.20	~$16.10

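That is about $0.16/song, fully automated. Revid charges $39/mo for ~100–400 videos depending on tier, which works out to ~$0.10–$0.39 per video. At 100 songs/month we pay about $16 in total versus Revid's flat $39. The two break even at roughly 250 songs/month ($39 / $0.16 ≈ 244): below that our pipeline is cheaper, and above it Revid's flat fee wins as long as the tier's video allowance covers the volume. Either way the costs are in the same range, so price is not the deciding factor.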

The reason to build our own anyway:

- Visual coherence: img2img chaining gives visibly better output
- Full control over style per channel
- No dependency on a Revid account
- Can integrate motion video clips (Wan/LTX) as scene inserts when budget allows
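
The cost figures in this section can be reproduced from the per-unit prices in the pipeline steps. The sketch below assumes one minute of audio per song (which is what the Whisper column implies) and uses hypothetical constant and function names.

```python
FLUX_SCHNELL_COST = 0.003   # scene 0, text-to-image
FLUX_DEV_COST = 0.025       # each later scene, img2img
WHISPER_PER_MIN = 0.006     # per minute of audio
GPT_PER_SONG = 0.002        # visual world generation
REVID_MONTHLY = 39.0        # flat subscription fee quoted above


def song_cost(scenes: int, audio_minutes: float = 1.0) -> float:
    """Estimated cost in USD for one song through the scene pipeline."""
    image = FLUX_SCHNELL_COST + max(scenes - 1, 0) * FLUX_DEV_COST
    return image + audio_minutes * WHISPER_PER_MIN + GPT_PER_SONG


def break_even_songs(per_song: float, flat_fee: float = REVID_MONTHLY) -> float:
    """Monthly volume at which per-song spend equals the flat fee."""
    return flat_fee / per_song
```

With 7 scenes, `song_cost(7)` comes to about $0.161, and `break_even_songs(0.16)` is 243.75 songs per month. Longer songs raise only the Whisper term, so the estimate is not very sensitive to the one-minute assumption.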

Video Generation (Future: Actual Motion Per Scene)

Replicate has motion video models if we want actual video clips per scene instead of Ken Burns stills:

Model	Cost	Quality	Speed	Notes
LTX-Video	~$0.04/sec	Good	Fast	Best cost/quality for our use case
Wan 2.1 480p	$0.09/sec	Good	Moderate	Open source, controllable
Wan 2.1 720p	$0.25/sec	Excellent	Slow	Worth it for hero content
Kling	varies	Excellent	Slow	Revid's movie mode uses this or similar

For a 3s clip per scene (7 scenes × 3s): LTX = $0.84/song, Wan 480p = $1.89/song, Wan 720p = $5.25/song.
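
The per-song figures follow directly from price per second × scenes × clip length. A small helper (hypothetical name, prices copied from the table above) makes it easy to try other clip lengths or scene counts:

```python
# Per-second prices from the model table; Kling is omitted because its price varies.
PER_SECOND = {
    "LTX-Video": 0.04,
    "Wan 2.1 480p": 0.09,
    "Wan 2.1 720p": 0.25,
}


def motion_cost(model: str, scenes: int = 7, clip_seconds: float = 3.0) -> float:
    """Cost in USD of one motion clip per scene for a single song."""
    return round(PER_SECOND[model] * scenes * clip_seconds, 2)
```

For example, `motion_cost("LTX-Video")` returns 0.84, matching the figure above.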

Recommendation: Use Ken Burns stills for standard production. Reserve Wan/LTX video clips for "hero" content (first GRE song, featured Aesop fable) where we want maximum quality.

prompt_strength Tuning by Channel

On Replicate, prompt_strength 1.0 discards the input image entirely, so higher values mean more change between scenes and lower values mean more continuity. The reasons in the table read as if the scale ran the other way (GRE wants the most change yet has the lowest value), so treat these ranges as starting points and confirm the direction in the 3-song test.

Channel	Recommended strength	Reason
GRE Word Wizards	0.40–0.45	Each word is a new concept — more visual change desired
Aesop's Fables	0.55–0.65	Narrative continuity — scenes should feel like same world
STEM Nursery Rhymes	0.50	Balance: new concept per verse but consistent style
Cerebral / Mental Models	0.60–0.70	Abstract visuals benefit from slow evolution
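
One way to carry this table into code is a lookup keyed by channel. The sketch below picks the midpoint of each range and falls back to the pipeline default of 0.55 for unknown channels. Both choices are assumptions, as are the names `STRENGTH_RANGES` and `strength_for`.

```python
# Recommended prompt_strength ranges (low, high) per channel, from the table above.
STRENGTH_RANGES = {
    "GRE Word Wizards": (0.40, 0.45),
    "Aesop's Fables": (0.55, 0.65),
    "STEM Nursery Rhymes": (0.50, 0.50),
    "Cerebral / Mental Models": (0.60, 0.70),
}
DEFAULT_STRENGTH = 0.55  # default from the image generation step


def strength_for(channel: str) -> float:
    """Midpoint of the channel's recommended range, or the pipeline default."""
    low, high = STRENGTH_RANGES.get(channel, (DEFAULT_STRENGTH, DEFAULT_STRENGTH))
    return round((low + high) / 2, 3)
```

Keeping the ranges rather than single numbers leaves room to tune within the band per song without editing the table.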

Track B — Revid Automation

Why Revid Still Matters

Revid's movie mode (likely Kling or equivalent under the hood) generates actual motion video — not Ken Burns on stills. For Aesop's fables and high-production cerebral songs, this is meaningfully better. The subscription is already paid.

The problem: no public API. Automation must go through the browser UI.

Automation Approach: Playwright

revid_automation.py
  class RevidSession:
    - login(email, password)         # cookie-based, persist session
    - create_video(mp3_path, config) # upload + configure
    - poll_status(job_id)            # wait for completion
    - download_result(job_id, dest)  # save to wise-songs/videos/

Config object:

{
  "title": "The Dog and His Reflection — Aesop's Fable Song",
  "style": "cinematic",
  "duration": "auto",
  "captions": True,
  "aspect_ratio": "16:9",
  "hook_clip": True,
}

Session flow:

1. Load saved cookies → skip login if valid
2. Navigate to create page
3. Upload mp3
4. Set title, style, aspect ratio, captions
5. Click generate
6. Poll job status (every 30s, timeout 20min)
7. Download mp4 when complete
8. Save to ~/sai-workspace/content/wise-songs/videos/{channel}/
9. Update content.db: video_assets, pipeline_stage → "video_ready"
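
Step 6 of the flow (poll every 30s, give up after 20min) can be sketched independently of the browser automation. Here `check` is a hypothetical callable that returns the download URL once the job is finished and `None` otherwise. How it inspects the Revid UI is left open, since that depends on selectors this spec does not define.

```python
import time
from typing import Callable, Optional


def wait_for_job(
    check: Callable[[], Optional[str]],
    interval: float = 30.0,
    timeout: float = 20 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call check() until it returns a download URL or the timeout passes."""
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        # Stop if the next poll would land after the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"job not finished after {timeout:.0f}s")
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be tested with a fake clock. In production the defaults apply.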

Revid Config Per Channel

Channel	Style	Use Revid?	Notes
Aesop's Fables	cinematic or anime	Yes — movie mode	Stories benefit most from actual motion
Cerebral	cinematic or artistic	Yes for hero content	Philosophical visuals
Mental Models	cinematic	Optional	Scene pipeline acceptable
GRE Word Wizards	—	No	Kinetic vocab cards — our pipeline is better
STEM Nursery Rhymes	—	No	Diagram style — our pipeline is better

Make.com as Alternative

If Playwright proves fragile (UI changes breaking selectors), Make.com is the fallback:

Webhook → Make.com scenario:
  1. Receive {slug, mp3_url, title, style}
  2. Revid module: create video
  3. Wait for completion
  4. HTTP module: POST result URL back to our webhook receiver
  5. Our receiver downloads + updates DB

Cost: Make.com free tier = 1000 ops/month. Each video ≈ 5 ops → covers ~200 videos/month free.

Decision Matrix — Which Pipeline Per Song

song arrives at video_gen stage
    ↓
channel?
    ├── GRE Word Wizards      → scene pipeline (img2img, strength 0.40–0.45)
    ├── STEM Nursery Rhymes   → scene pipeline (strength 0.50)
    ├── Aesop's Fables        → Revid (movie mode) + scene pipeline as backup
    ├── Mental Models         → scene pipeline (strength 0.60) or Revid
    └── Cerebral              → Revid preferred; scene pipeline acceptable
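
The matrix above maps naturally onto a lookup table. In this sketch the `Route` type, the `ROUTES` name, and the default for unknown channels (scene pipeline at 0.55) are assumptions. Where the matrix gives a range or omits a strength, the single value from the image generation step is used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    primary: str               # "scene" or "revid"
    fallback: Optional[str]    # pipeline to try if the primary is unavailable
    strength: Optional[float]  # prompt_strength when the scene pipeline runs


ROUTES = {
    "GRE Word Wizards": Route("scene", None, 0.45),
    "STEM Nursery Rhymes": Route("scene", None, 0.50),
    "Aesop's Fables": Route("revid", "scene", 0.60),
    "Mental Models": Route("scene", "revid", 0.60),
    "Cerebral": Route("revid", "scene", 0.65),
}


def route_for(channel: str) -> Route:
    """Pipeline choice for a channel; unknown channels use the scene pipeline."""
    return ROUTES.get(channel, Route("scene", None, 0.55))
```

Keeping the routing in data rather than in branching code means a channel can be moved between pipelines with a one-line change.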

Implementation Order

1. Scene mode in video_pipeline.py: GPT-4o visual world + FLUX img2img chain + ffmpeg assembly
2. Test on 3 songs (one GRE, one Aesop, one cerebral) and compare against viral mode
3. revid_automation.py: Playwright session, login, upload, poll, download
4. Content DB integration: track source, cost, style per video asset
5. Review workflow: never auto-publish; human review before upload
6. youtube_upload.py batch mode: queue reviewed videos for upload with metadata