Wise Songs — Video Pipeline Spec
Two parallel tracks. Track A is our own sequential scene pipeline (better visual coherence than Revid, cheaper, full control). Track B is Revid automation (for movie-mode content where Revid genuinely wins).
Track A — Sequential Scene Pipeline (--level scene)
The Core Idea
Every existing tool (Revid included) generates each scene image independently from a text prompt. This produces visual incoherence: different styles, random palettes, jarring cuts.
Our approach: img2img chaining. Each scene is generated from the previous scene using a new prompt at a controlled strength. The result is a visually continuous world that evolves — same palette, same style, same general aesthetic — while the content changes per verse.
Revid cannot do this. It is a genuine differentiator.
Pipeline Steps
1. INGEST
song_data (id, title, lyrics, category, style_prompt, image_url)
mp3 path
2. VISUAL WORLD GENERATION [GPT-4o, ~$0.002/song]
Input: title + full lyrics + category + style_prompt
Output (JSON):
{
"art_style": "warm watercolor, amber and indigo palette, soft rounded edges",
"setting": "forest clearing, ancient library, ocean horizon...",
"mood": "hopeful, tense, playful",
"character_anchors": "small fox with orange scarf, wise owl in glasses",
"scenes": [
{
"verse_index": 0,
"section": "verse",
"cinematic_description": "Wide shot: fox standing at crossroads, golden light...",
"transition_to_next": "dissolve"
},
...
]
}
Prompt template:
"You are a visual director. Given these song lyrics, produce a JSON visual world for a
music video. Art style should match the category ({category}). Output valid JSON only."
3. IMAGE GENERATION [Replicate, ~$0.003–$0.25/song depending on scene count]
Scene 0:
FLUX-schnell text-to-image
prompt = "{art_style}, {setting}, {scene[0].cinematic_description}, no text, 16:9"
cost: $0.003
Scene N (for N > 0):
FLUX-dev img2img
image = scene[N-1] output URL (pass directly to Replicate, no re-download needed)
prompt = "{art_style}, {scene[N].cinematic_description}, no text, 16:9"
prompt_strength = 0.55 (tune per category: fables 0.6, GRE 0.45, cerebral 0.65)
cost: $0.025/image
Total cost example (7 scenes): $0.003 + 6 × $0.025 = $0.153/song
Total cost example (5 scenes): $0.003 + 4 × $0.025 = $0.103/song
4. VIDEO ASSEMBLY [ffmpeg + moviepy, free]
Per image: Ken Burns effect (zoompan filter)
- Alternate zoom-in / zoom-out per scene
- Pan direction varies: center, left-drift, right-drift, diagonal
- Duration = verse duration + 1.5s overlap for crossfade
Transitions (xfade filter, 1.0s):
Selected per scene pair from visual world "transition_to_next" field:
verse→verse: "fade" or "dissolve"
verse→chorus: "wipeleft" or "circleopen"
chorus→verse: "wiperight" or "fadeblack"
bridge/outro: "pixelize" or "radial"
Fallback: "fade" for any unrecognised type
ffmpeg xfade supports 40+ types — full list at disposal.
5. KARAOKE TEXT OVERLAY [Whisper + PIL + moviepy, ~$0.006/min audio]
Whisper word timestamps → yellow highlight on active word
Verse text fades in/slides up at segment boundary (0.18s)
Semi-transparent text panel so scene imagery shows through
Font: Impact 72px for lyrics, 44px for title bar
6. HOOK EXTRACTION [ffmpeg, free]
Trim first 30s → 9:16 reformat (crop center) → same karaoke + scene imagery
Outputs to hooks/{channel}/{slug}_hook.mp4
7. DB UPDATE
video_assets record: path, format, duration, scene_count, cost_usd, whisper_words
pipeline_stage → "video_ready"
Cost Table
| Songs | Scenes avg | Image cost | Whisper | GPT | Total |
|---|---|---|---|---|---|
| 10 | 6 | $1.53 | $0.06 | $0.02 | ~$1.61 |
| 50 | 6 | $7.65 | $0.30 | $0.10 | ~$8.05 |
| 100 | 6 | $15.30 | $0.60 | $0.20 | ~$16.10 |
$0.16/song fully automated. Revid charges $39/mo for ~100–400 videos depending on tier. At 100 songs/month we pay $0.16 vs ~$0.10–$0.39/Revid video. Parity at ~250 songs/month. Below that, Revid is competitive on price.
The reason to build our own anyway:
- Visual coherence (img2img chaining) is genuinely better output
- Full control over style per channel
- No Revid account dependency
- Can integrate motion video clips (Wan/LTX) as scene inserts when budget allows
Video Generation (Future: Actual Motion Per Scene)
Replicate has motion video models if we want actual video clips per scene instead of Ken Burns stills:
| Model | Cost | Quality | Speed | Notes |
|---|---|---|---|---|
| LTX-Video | ~$0.04/sec | Good | Fast | Best cost/quality for our use case |
| Wan 2.1 480p | $0.09/sec | Good | Moderate | Open source, controllable |
| Wan 2.1 720p | $0.25/sec | Excellent | Slow | Worth it for hero content |
| Kling | varies | Excellent | Slow | Revid's movie mode uses this or similar |
For a 3s clip per scene (7 scenes × 3s): LTX = $0.84/song, Wan 480p = $1.89/song, Wan 720p = $5.25/song.
Recommendation: Use Ken Burns stills for standard production. Reserve Wan/LTX video clips for "hero" content (first GRE song, featured Aesop fable) where we want maximum quality.
prompt_strength Tuning by Channel
| Channel | Recommended strength | Reason |
|---|---|---|
| GRE Word Wizards | 0.40–0.45 | Each word is a new concept — more visual change desired |
| Aesop's Fables | 0.55–0.65 | Narrative continuity — scenes should feel like same world |
| STEM Nursery Rhymes | 0.50 | Balance: new concept per verse but consistent style |
| Cerebral / Mental Models | 0.60–0.70 | Abstract visuals benefit from slow evolution |
Track B — Revid Automation
Why Revid Still Matters
Revid's movie mode (likely Kling or equivalent under the hood) generates actual motion video — not Ken Burns on stills. For Aesop's fables and high-production cerebral songs, this is meaningfully better. The subscription is already paid.
The problem: no public API. Automation must go through the browser UI.
Automation Approach: Playwright
revid_automation.py
class RevidSession:
- login(email, password) # cookie-based, persist session
- create_video(mp3_path, config) # upload + configure
- poll_status(job_id) # wait for completion
- download_result(job_id, dest) # save to wise-songs/videos/
Config object:
{
"title": "The Dog and His Reflection — Aesop's Fable Song",
"style": "cinematic",
"duration": "auto",
"captions": True,
"aspect_ratio": "16:9",
"hook_clip": True,
}
Session flow:
1. Load saved cookies → skip login if valid
2. Navigate to create page
3. Upload mp3
4. Set title, style, aspect ratio, captions
5. Click generate
6. Poll job status (every 30s, timeout 20min)
7. Download mp4 when complete
8. Save to ~/sai-workspace/content/wise-songs/videos/{channel}/
9. Update content.db: video_assets, pipeline_stage → "video_ready"
Revid Config Per Channel
| Channel | Style | Use Revid? | Notes |
|---|---|---|---|
| Aesop's Fables | cinematic or anime | Yes — movie mode | Stories benefit most from actual motion |
| Cerebral | cinematic or artistic | Yes for hero content | Philosophical visuals |
| Mental Models | cinematic | Optional | Scene pipeline acceptable |
| GRE Word Wizards | — | No | Kinetic vocab cards — our pipeline is better |
| STEM Nursery Rhymes | — | No | Diagram style — our pipeline is better |
Make.com as Alternative
If Playwright proves fragile (UI changes breaking selectors), Make.com is the fallback:
Webhook → Make.com scenario:
1. Receive {slug, mp3_url, title, style}
2. Revid module: create video
3. Wait for completion
4. HTTP module: POST result URL back to our webhook receiver
5. Our receiver downloads + updates DB
Cost: Make.com free tier = 1000 ops/month. Each video ≈ 5 ops → covers ~200 videos/month free.
Decision Matrix — Which Pipeline Per Song
song arrives at video_gen stage
↓
channel?
├── GRE Word Wizards → scene pipeline (img2img, strength 0.40–0.45)
├── STEM Nursery Rhymes → scene pipeline (strength 0.50)
├── Aesop's Fables → Revid (movie mode) + scene pipeline as backup
├── Mental Models → scene pipeline (strength 0.60) or Revid
└── Cerebral → Revid preferred; scene pipeline acceptable
Implementation Order
scenemode in video_pipeline.py — GPT-4o visual world + FLUX img2img chain + ffmpeg assembly- Test on 3 songs — one GRE, one Aesop, one cerebral — compare against viral mode
revid_automation.py— Playwright session, login, upload, poll, download- Content DB integration — track source, cost, style per video asset
- Review workflow — never auto-publish; human review before upload
youtube_upload.pybatch mode — queue reviewed videos for upload with metadata