The Real Architecture Behind an AI Video Generator


Most people think an AI video generator is basically a prompt box connected to a model.

That is true for a demo.

It is not true for a product.

Once we tried to make video generation usable in a real app, the hard part stopped being prompting. The real work became everything around the model call: request validation, long-running jobs, provider failures, storage, retries, delivery, and keeping users informed while a generation is still running.

That is the part users do not see, but it is the part that decides whether the product feels reliable or fragile.

The architecture people imagine

The naive version looks like this:

  1. User enters a prompt

  2. Frontend sends it to the server

  3. Server calls a video model API

  4. Model returns a result

  5. User downloads the video

That works fine until the first real constraints appear:

  • generation takes 1 to 5 minutes

  • serverless functions time out

  • the provider returns status updates asynchronously

  • the output file is large

  • a webhook arrives twice

  • a job gets stuck in processing

  • the user refreshes the page and expects state to survive

At that point, you are no longer building a simple AI feature. You are building an async media-processing system.

Our actual architecture

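A more realistic flow looks like this:

  1. The request is validated and moderated before anything expensive starts

  2. A durable job record is created in the database

  3. The generation is submitted to the provider without blocking the HTTP request

  4. Webhooks and a periodic sync task update the job record

  5. The output is persisted to storage

  6. The frontend reads the job status until the result is ready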

The key shift is that the model call is only one step inside a larger workflow.

We used Next.js for the app layer, Zod for request validation, a database-backed job table for durable state, and async processing so generation was not tied to a single HTTP request. That part matters because a direct request-response design breaks down quickly on platforms like Vercel, where long-running video generation is a poor fit for normal function lifecycles.
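To make the "not tied to a single HTTP request" idea concrete, here is a minimal sketch of the submit step. It is an illustration under assumptions, not our production code: `jobTable`, `submitGeneration`, and `getJobStatus` are hypothetical names, and a `Map` stands in for the database-backed job table.

```typescript
// Hypothetical names throughout; a Map stands in for the database job table.
type JobStatus = "pending" | "queued" | "processing" | "succeeded" | "failed" | "expired";

interface JobRecord {
  id: string;
  userId: string;
  prompt: string;
  status: JobStatus;
  createdAt: Date;
  updatedAt: Date;
}

const jobTable = new Map<string, JobRecord>();
let nextId = 1;

// Called from the HTTP handler: persist the job and return right away.
// The provider call happens later, outside this request's lifecycle.
function submitGeneration(userId: string, prompt: string, now: Date = new Date()): { jobId: string } {
  const id = `job_${nextId++}`;
  jobTable.set(id, { id, userId, prompt, status: "pending", createdAt: now, updatedAt: now });
  return { jobId: id };
}

// Called by the frontend (on page load or while polling) to read durable state.
function getJobStatus(jobId: string): JobStatus | undefined {
  return jobTable.get(jobId)?.status;
}
```

The handler returns a job ID in milliseconds, so the function lifecycle never has to outlive the generation itself.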

Why durable job state matters

The first thing we learned is that video generation cannot be treated like a normal request-response API call.

A single generation may take a few minutes, the provider may return progress asynchronously, and users will absolutely refresh the page while waiting. Once that happens, in-memory state is useless. We need a durable job record in the database that survives refreshes, retries, delayed callbacks, and temporary failures.

That job record becomes the source of truth for the whole system. The frontend reads from it, webhooks update it, polling reconciles it, and support investigations usually start there too.

A simplified job model looks like this:

  type VideoJob = {
    id: string;
    userId: string;
    status: "pending" | "queued" | "processing" | "succeeded" | "failed" | "expired";
    provider: string;
    providerJobId?: string;
    prompt: string;
    outputUrl?: string;
    errorMessage?: string;
    createdAt: Date;
    updatedAt: Date;
  };

The exact schema will vary, but the principle is the same: the application needs explicit state that survives outside a single request lifecycle.

This is what makes the rest of the system coherent. The frontend can read the current job status instead of guessing. Webhooks can update a known record instead of inventing state on arrival. Background sync jobs can detect generations that have been stuck in processing for too long. Support can inspect what happened to a user request without reading raw provider logs. Billing logic can also be tied to real execution outcomes instead of assumptions.

The key idea is simple: once video generation becomes asynchronous, state management becomes a product concern, not just a backend detail.

That is the difference between a demo and a system people can trust.
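One way to keep that explicit state honest is to write down which status changes are legal. The transition table below is an assumption for illustration; the statuses come from the job model above, but the allowed moves are a guess at a sensible default rather than a description of any specific system.

```typescript
type JobStatus = "pending" | "queued" | "processing" | "succeeded" | "failed" | "expired";

// Assumed transition table: each status lists the statuses it may move to.
const allowedTransitions: Record<JobStatus, readonly JobStatus[]> = {
  pending: ["queued", "failed"],
  queued: ["processing", "failed", "expired"],
  processing: ["succeeded", "failed", "expired"],
  succeeded: [],
  failed: [],
  expired: [],
};

// True when a job may move from one status to another.
function canTransition(from: JobStatus, to: JobStatus): boolean {
  return allowedTransitions[from].includes(to);
}

// Terminal statuses have no outgoing transitions.
function isTerminal(status: JobStatus): boolean {
  return allowedTransitions[status].length === 0;
}
```

Every writer (webhook handler, sync task, admin tool) can call `canTransition` before updating a job, so a late event cannot drag a finished job back into processing.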

Validation is more important than it sounds

When generation is expensive, validation is not just good API hygiene. It is cost control.

We validate the request before it enters the pipeline. A simplified schema looks like this:

  import { z } from "zod";

  export const GenerationRequestSchema = z.object({
    prompt: z.string().min(1).max(1000),
    aspectRatio: z.enum(["16:9", "9:16", "1:1"]),
    duration: z.number().min(5).max(10),
    imageUrl: z.string().url().optional(),
  });

This sounds basic, but it prevents a lot of avoidable waste:

  • empty or malformed prompts

  • unsupported duration values

  • invalid asset URLs

  • requests that exceed what the model or pricing plan should allow

For AI products, rejecting bad input early is much cheaper than discovering it after a provider call has already started.
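The schema catches malformed input, but the last bullet (what the pricing plan should allow) needs context a schema does not have. A minimal sketch of that second check follows. The plan names and limits are hypothetical, and `GenerationRequest` simply mirrors the shape of the schema above.

```typescript
// Hypothetical plan names and limits, for illustration only.
type Plan = "free" | "pro";

// Mirrors the shape of the validation schema shown earlier.
interface GenerationRequest {
  prompt: string;
  aspectRatio: "16:9" | "9:16" | "1:1";
  duration: number; // seconds
  imageUrl?: string;
}

const maxDurationByPlan: Record<Plan, number> = { free: 5, pro: 10 };

type LimitCheck = { ok: true } | { ok: false; reason: string };

// Runs after schema validation and before any provider call.
function checkPlanLimits(req: GenerationRequest, plan: Plan): LimitCheck {
  const max = maxDurationByPlan[plan];
  if (req.duration > max) {
    return { ok: false, reason: `duration ${req.duration}s exceeds the ${plan} plan limit of ${max}s` };
  }
  return { ok: true };
}
```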

Webhooks are not enough

A lot of people assume webhooks solve async state updates cleanly.

In practice, they help, but they are not enough by themselves.

One of the less obvious problems we ran into was that callback delivery is not the same thing as callback reliability. A webhook can arrive late, arrive twice, or fail while your own app is cold-starting or temporarily unavailable. If your system depends on “the webhook will definitely show up once and in order,” it will eventually drift into inconsistent state.

So we treated webhooks as a fast path, not as the only path.

The more reliable pattern was:

  • accept webhook updates when they arrive

  • verify and deduplicate them

  • update the job record idempotently

  • run a periodic sync task to find stale processing jobs and reconcile them

That last part is boring, but important. A simple cron-style sync job that scans old in-flight jobs every few minutes is often what keeps async systems honest.
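The pattern above can be sketched in a few lines. Everything here is an assumption for illustration: the `WebhookEvent` shape (in particular a unique `eventId` per delivery), the in-memory `Map` and `Set` standing in for database tables, and the status ranking. Signature verification is omitted because it is provider-specific.

```typescript
type JobStatus = "pending" | "queued" | "processing" | "succeeded" | "failed" | "expired";

interface Job {
  id: string;
  status: JobStatus;
  updatedAt: Date;
}

interface WebhookEvent {
  eventId: string; // assumed: the provider sends a unique ID per event
  jobId: string;
  status: JobStatus;
}

// Assumed ordering: a job only ever moves to a higher rank.
const statusRank: Record<JobStatus, number> = {
  pending: 0, queued: 1, processing: 2, succeeded: 3, failed: 3, expired: 3,
};

// Returns true only when the job record actually changed.
function applyWebhook(jobs: Map<string, Job>, seenEventIds: Set<string>, event: WebhookEvent, now: Date): boolean {
  if (seenEventIds.has(event.eventId)) return false; // duplicate delivery
  seenEventIds.add(event.eventId);

  const job = jobs.get(event.jobId);
  if (!job) return false; // unknown job: do not invent state
  if (statusRank[event.status] <= statusRank[job.status]) return false; // late or out-of-order event

  job.status = event.status;
  job.updatedAt = now;
  return true;
}

// Periodic sync: find processing jobs that have not been updated recently.
function findStaleJobs(jobs: Map<string, Job>, now: Date, maxAgeMs: number): Job[] {
  return Array.from(jobs.values()).filter(
    (job) => job.status === "processing" && now.getTime() - job.updatedAt.getTime() > maxAgeMs,
  );
}
```

The sync task would then ask the provider for the current status of each stale job and feed the answer through the same update path, so webhooks and polling cannot disagree about the rules.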

Provider abstraction matters earlier than you think

It is easy to let one model provider shape your entire application. That is usually a mistake.

Different providers return different job IDs, status names, callback payloads, artifact formats, and failure semantics. If those details leak directly into your UI and database logic, changing providers later becomes painful.

We found it much cleaner to normalize provider-specific behavior at the integration boundary. Internally, the rest of the app only cares about our own job model:

  • internal job ID

  • current status

  • provider name

  • provider job ID

  • output URL

  • error reason

  • retry count

  • created and updated timestamps

That separation is not overengineering. It is just enough structure to stop the provider API from becoming your architecture.
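As a rough sketch of what normalizing at the boundary can look like, the mapping below translates provider status strings into internal ones. The provider names (`providerA`, `providerB`) and their status strings are invented for illustration; real providers will use different values.

```typescript
type JobStatus = "queued" | "processing" | "succeeded" | "failed";

// Hypothetical provider names and status strings; real providers will differ.
const statusMaps: Record<string, Record<string, JobStatus>> = {
  providerA: { IN_QUEUE: "queued", RUNNING: "processing", COMPLETED: "succeeded", ERROR: "failed" },
  providerB: { starting: "queued", processing: "processing", succeeded: "succeeded", canceled: "failed" },
};

// Translate a raw provider status into the internal vocabulary.
// Unknown values throw so that a provider-side change is noticed, not guessed at.
function normalizeStatus(provider: string, rawStatus: string): JobStatus {
  const mapped = statusMaps[provider]?.[rawStatus];
  if (!mapped) {
    throw new Error(`Unmapped status "${rawStatus}" from provider "${provider}"`);
  }
  return mapped;
}
```

Only this module knows that `RUNNING` and `processing` mean the same thing. The UI and the database never see either provider's vocabulary.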

Storage is part of the product

Large AI-generated media changes the storage conversation.

Returning a raw provider URL or exposing a direct object-storage path is tempting early on, but it usually creates problems later around expiration, access control, bandwidth, and inconsistent delivery.

For generated video, storage needs to be treated as a first-class part of the system:

  • where uploads live

  • where outputs are persisted

  • whether URLs are public or signed

  • how long artifacts are retained

  • how results are delivered quickly enough to feel good

This is one of those areas where a product can feel fine in early testing and then get messy at scale. Media accumulates. Costs rise quietly. Old artifacts stick around. Support has no clear retention policy. None of that is visible in the demo, but all of it becomes part of the real architecture.
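A retention policy does not need to be elaborate to be useful. The sketch below assumes two artifact kinds and made-up retention windows (7 and 30 days are placeholders, not recommendations); the storage deletion call itself is left out because it depends on the storage provider.

```typescript
// Hypothetical retention windows in days; pick values that match your own policy.
const retentionDays = { upload: 7, output: 30 } as const;
type ArtifactKind = keyof typeof retentionDays;

interface Artifact {
  key: string; // object-storage key
  kind: ArtifactKind;
  createdAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// True when the artifact is older than the window for its kind.
function isPastRetention(artifact: Artifact, now: Date): boolean {
  const ageMs = now.getTime() - artifact.createdAt.getTime();
  return ageMs > retentionDays[artifact.kind] * DAY_MS;
}

// Returns the storage keys a cleanup task should delete.
function selectForDeletion(artifacts: Artifact[], now: Date): string[] {
  return artifacts.filter((a) => isPastRetention(a, now)).map((a) => a.key);
}
```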

Moderation is not a side feature

I wrote recently about moderation, and it matters here too.

In a video-generation system, moderation belongs on the main request path, not off to the side as an afterthought. If moderation happens too late, you may spend real money generating content you cannot serve. If it is too aggressive or too vague, legitimate users get blocked without understanding why.

The practical lesson is that moderation needs to be part of the workflow design:

  • check prompts before expensive generation starts

  • check uploaded media if the product allows references

  • return predictable failure states instead of generic errors

That is not just policy. It is architecture, because it changes what is allowed into the pipeline and what kind of abuse cost the system can absorb.
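Here is a small sketch of a moderation gate that returns predictable failure states. The failure codes and the `Classifier` type are hypothetical, and the classifiers are synchronous only to keep the example short; a real moderation service call would be asynchronous.

```typescript
// Hypothetical failure codes; the point is that each one is specific and stable.
type ModerationResult =
  | { allowed: true }
  | { allowed: false; code: "prompt_rejected" | "reference_image_rejected"; message: string };

// Stand-in for a real moderation call; returns true when the input is acceptable.
type Classifier = (input: string) => boolean;

// Runs before any provider call, so rejected requests cost nothing to generate.
function moderateRequest(
  req: { prompt: string; imageUrl?: string },
  checkText: Classifier,
  checkImage: Classifier,
): ModerationResult {
  if (!checkText(req.prompt)) {
    return { allowed: false, code: "prompt_rejected", message: "The prompt violates the content policy." };
  }
  if (req.imageUrl !== undefined && !checkImage(req.imageUrl)) {
    return { allowed: false, code: "reference_image_rejected", message: "The reference image violates the content policy." };
  }
  return { allowed: true };
}
```

Because the result is a typed union, the frontend can show a specific message per code instead of a generic error.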

The hidden cost problem

One of the fastest ways to build a bad AI video product is to ignore cost until after launch.

Video generation is expensive enough that architecture choices directly affect margins. Resolution, duration, retries, duplicate submissions, storage retention, and failed jobs all compound.

A few examples:

  • higher resolution sounds great, but should not always be the default

  • long durations increase cost and wait time at the same time

  • unbounded retries can quietly burn money

  • keeping every generated artifact forever creates a storage bill that grows without discipline

This is why cost control is engineering work, not just finance work.

The architecture should enforce sensible constraints before scale forces them on you.
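Two of those constraints, bounded retries and duplicate submissions, are easy to sketch. The limits below are placeholders, and `nextRetryDelayMs` and `dedupeKey` are hypothetical helper names; the right values depend on real provider pricing and failure rates.

```typescript
// Hypothetical limits; tune them against real provider pricing and failure rates.
const MAX_RETRIES = 2;
const BASE_DELAY_MS = 30_000;

// Returns the delay before the next attempt, or null when the retry budget is spent.
function nextRetryDelayMs(retryCount: number): number | null {
  if (retryCount >= MAX_RETRIES) return null;
  return BASE_DELAY_MS * 2 ** retryCount; // exponential backoff: doubles each attempt
}

// Identical submissions from the same user produce the same key,
// so a double-click can be mapped to the existing job instead of a new one.
function dedupeKey(userId: string, prompt: string, duration: number, aspectRatio: string): string {
  return JSON.stringify([userId, prompt.trim(), duration, aspectRatio]);
}
```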

The frontend has to reflect backend truth

A weak frontend can make a good backend feel broken.

If the real system is async, the UI needs to show async state honestly. “Loading...” is not enough for a workflow that might take several minutes and can fail in multiple ways.

The frontend should map directly to backend job states:

  • queued

  • processing

  • succeeded

  • failed

  • expired

That seems small, but it changes user trust. When the UI reflects what the system actually knows, slow generation feels understandable. When the UI hides that complexity behind a spinner, the product feels random.
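A minimal sketch of that mapping, assuming the statuses from the job model earlier in the article. The `StatusView` shape and the label text are hypothetical; the useful part is the exhaustive `switch`, which makes the compiler complain if a new backend status is added without a matching UI state.

```typescript
type JobStatus = "pending" | "queued" | "processing" | "succeeded" | "failed" | "expired";

interface StatusView {
  label: string;
  showSpinner: boolean;
  canRetry: boolean;
}

// Maps every backend status to what the UI should show.
function viewForStatus(status: JobStatus): StatusView {
  switch (status) {
    case "pending":
    case "queued":
      return { label: "Waiting in queue", showSpinner: true, canRetry: false };
    case "processing":
      return { label: "Generating video (this can take a few minutes)", showSpinner: true, canRetry: false };
    case "succeeded":
      return { label: "Ready", showSpinner: false, canRetry: false };
    case "failed":
      return { label: "Generation failed", showSpinner: false, canRetry: true };
    case "expired":
      return { label: "Result expired", showSpinner: false, canRetry: true };
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a status is unhandled.
      const unreachable: never = status;
      throw new Error(`Unhandled status: ${String(unreachable)}`);
    }
  }
}
```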

Good AI UX is often just backend truth exposed clearly.

What changed my mental model

The biggest mindset shift was this:

An AI video generator is not a prompt feature with a model behind it.

It is an async job system for expensive media generation, and the model is only one component inside that system.

Once you think about it that way, the architecture gets clearer. You stop over-focusing on the model call and start thinking in terms of state machines, storage, reconciliation, retries, and delivery.

That is where the product actually becomes usable.

Closing thought

The visible part of AI video generation is the prompt and the output.

The real product is everything in between.

Validation, moderation, durable job records, async orchestration, webhook recovery, storage, and delivery are not support layers. They are the architecture that turns a model capability into something people can depend on.

That is the real architecture behind an AI video generator.

If you want to see how I am applying that thinking in practice, take a look at VideoFlux.
