
Case Study

AI Explainer Podcast from Images

Hackathon Developer & AI Engineer · Ship It Sunday · AI Hacker House Shanghai · Sep 2025

Overview

Built a two-host AI podcast explainer app that turns images and topics into animated, lip-synced video commentary using WAN 2.2, ElevenLabs, and GPT-4o.

Key Technologies

WAN 2.2 · ElevenLabs · GPT-4o

Story & Process

This project was built during the **Ship It Sunday** hackathon in September 2025, together with a teammate.

We created an AI explainer app that generates a podcast-style video with **two virtual hosts** who discuss any topic of interest, starting from an image or prompt. The idea was to explore how conversational, dual-host formats can make complex topics easier to understand.

### Concept: Two-Host AI Explainer Podcast

- Generate a short podcast episode where two AI hosts comment on a topic derived from an input image or text prompt.
- Use a **teacher–student**, **peer–peer**, or even **adversarial debate** dynamic depending on the subject and difficulty (see the script sketch after this list).
- Target use cases include research summarization, decomposing hard technical concepts, and turning long, dense materials into approachable dialogue.
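To make the script generation concrete, here is a minimal sketch of how a structured two-host dialogue could be requested from GPT-4o via the OpenAI Python SDK. The prompt wording and JSON schema are illustrative assumptions, not our exact production prompt:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative system prompt (not our production prompt): ask for a fixed
# JSON shape so the downstream voice/video steps can iterate over the turns.
SYSTEM_PROMPT = """\
You write short two-host podcast scripts. Given a topic, choose a dynamic
("teacher-student", "peer-peer", or "debate") based on how difficult the
topic is, and return JSON of the form:
{"dynamic": "...", "turns": [{"host": "A", "line": "..."}, ...]}
"""

def generate_script(topic: str) -> dict:
    """Ask GPT-4o for a structured two-host dialogue about `topic`."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force parseable JSON output
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Topic: {topic}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Keeping the dynamic as an explicit field means the rest of the pipeline can stay format-agnostic: a debate episode and a teacher–student episode flow through the same voice and animation steps.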

Conversational learning formats are widely reported to improve engagement and retention. We wanted to validate that a two-host pattern works especially well for breaking down hard topics.

### Why Two Hosts?

- Long, dense materials (research papers, technical docs, etc.) are tiring to read.
- A single narrator can still feel like a lecture.
- Two hosts can naturally explore **questions, misunderstandings, and clarifications** that mirror how real learners think.

By framing explanations as a dialogue, we aimed to:

- Make abstract concepts concrete through back-and-forth discussion.
- Model different viewpoints (teacher vs. student, expert vs. novice, peers, or debate partners).
- Increase learner satisfaction when the format is used alongside traditional study materials and podcasts.

### Core Tech Stack

- **WAN 2.2** – Video generation and fine-grained control over character animation and timing.
- **ElevenLabs** – Voice cloning and speech generation for natural, consistent AI host voices.
- **GPT-4o** – Large language model for topic understanding, script generation, and decomposing difficult subjects into a conversational format.

The pipeline takes an image or topic, uses GPT-4o to generate a structured dialogue between the two hosts, synthesizes each host's voice with ElevenLabs, and finally drives WAN 2.2 to animate lip-synced human avatars, producing the final explainer video.
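Sketched below is what that assembly could look like end to end. The helper functions, voice IDs, and avatar images are hypothetical placeholders standing in for the real ElevenLabs and WAN 2.2 integrations; only `generate_script` refers to the earlier sketch:

```python
from pathlib import Path

# Hypothetical helpers (placeholders, not real SDK calls):
def describe_image(image: Path) -> str: ...               # GPT-4o vision: image -> topic
def synthesize_voice(line: str, voice: str) -> Path: ...  # ElevenLabs TTS -> audio clip
def animate_host(avatar: str, audio: Path) -> Path: ...   # WAN 2.2 lip-synced render
def concat_video(parts: list[Path], out: Path) -> Path: ...  # stitch segments (e.g. ffmpeg)

VOICES = {"A": "voice_id_a", "B": "voice_id_b"}    # placeholder cloned-voice IDs
AVATARS = {"A": "host_a.png", "B": "host_b.png"}   # placeholder avatar reference images

def build_episode(image_path: Path, out_path: Path) -> Path:
    """Image -> GPT-4o script -> ElevenLabs audio -> WAN 2.2 video."""
    topic = describe_image(image_path)       # derive the topic from the input image
    script = generate_script(topic)          # structured dialogue (see earlier sketch)
    segments = []
    for turn in script["turns"]:
        clip = synthesize_voice(turn["line"], voice=VOICES[turn["host"]])
        segments.append(animate_host(avatar=AVATARS[turn["host"]], audio=clip))
    return concat_video(segments, out_path)  # final lip-synced explainer episode
```

One render per dialogue turn keeps individual WAN 2.2 jobs small and makes it easy to regenerate a single line without redoing the whole episode.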