Imagine You Could Speak a View Into Existence
A conversational AI agent that generates, refines, and iterates on native SwiftUI views — built on top of a schema-driven design system.
Describe What You Want
I've had this idea in my head for about eight years. Describe what you want, and it appears. Not a wireframe. Not a mockup. A real, functioning view. Every tool I'd tried got close but never closed the gap between intent and output. With the design system in place (schema-driven, structured, every component composable from JSON) the gap was finally closable.
The first version was rough. Generation lived inside the Context View's settings: tap a glyph, type a prompt into a modal, hit generate, wait. The whole thing was bolted onto the catalog like an afterthought. The UX was clunky. But I wasn't trying to build a product yet. I was trying to answer one question: does the LLM actually follow instructions, or is it just guessing?

I typed a specific request: welcome a friend by name, suggest getting a beer, add a legal disclaimer as a joke. The generated view came back with the friend's name in the title, the right SF Symbol, the exact CTA labels I'd described, and the disclaimer at the bottom. Every specific instruction fulfilled. Not approximate. Precise. The LLM was reading the schema, following the rules, and producing valid structured output. The concept was validated, even though the experience around it was terrible.
Generation worked. But it was a feature buried inside the catalog, not the experience itself.
If the Premise Is Conversation, Make It Conversational
If the premise is "describe what you want and it appears," that should be the primary experience, not a button hidden in a settings sheet. So I restructured the entire app's information architecture. The builder became a chat-first experience. The catalog became the reference library.

Making it feel like the consumer AI tools people already use made it relatable: Sign in with Apple, shimmer effects, ambient mesh gradients. These details matter. Once it became conversational, it stopped being a developer library and became an actual product.
The Agent Knows the System
This isn't text-to-UI generation. The agent has internalised the entire design system — not just component names, but their capabilities, constraints, valid configurations, and content rules. You can ask it "What row types does a list item support?" and it comes back with three variants, their trailing types, status indicators, and icon styles — all from the capability manifest, not from memory.


That knowledge changes the interaction completely. You don't describe a UI. You describe a problem — and the agent figures out which components solve it.
I told it I needed a screen that helps people raise their spending limits: users upload a government ID, do a biometric scan, the step is optional, their data is encrypted, the whole thing takes two minutes. The agent proposed an Instruction View with a header explaining why limits matter, an info banner for reassurance, a two-row requirements checklist, and a button stack with "Start verification" and "Do this later." I said go for it.


A generation pipeline kicked in — assembling, painting, sculpting — cycling through loading states while the specialist generator built the view from the schema. A "Design Ready" card appeared when Version 1.0 was ready to preview.
But Version 1.0 wasn't right. The requirement rows it chose truncated mid-sentence — "Upload a government-issued..." cut off with an ellipsis. I told the agent to switch to informative rows. It looked up the component, confirmed informative rows support multi-line content, swapped the row type, and preserved everything else. Version 1.0 to 1.2, through plain conversation.


Every change is surgical — the agent modifies what you asked for and retains everything else. 1.0 is a first generation, 1.1 is a refinement, 2.0 is a full regeneration. And the same flow works across every view type the system supports.
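The patch strings in the debug transcript later in this piece look like RFC 6902 JSON Patch ops, which is what makes refinements surgical: one op touches one path, and the rest of the spec is untouched. Here is a minimal Swift sketch of applying a single "add" op to a view spec held as nested dictionaries. The key names ("requirementsSection", "rows") follow the transcript; everything else is illustrative, not the app's real code.

```swift
import Foundation

// Apply one RFC 6902-style "add" op to a view spec. Only the
// "/requirementsSection/rows/-" append shape is handled here
// ("-" is the JSON Pointer token for "end of array").
func appendRow(to spec: [String: Any], path: String, row: [String: Any]) -> [String: Any] {
    guard path == "/requirementsSection/rows/-",
          var section = spec["requirementsSection"] as? [String: Any],
          var rows = section["rows"] as? [[String: Any]] else { return spec }
    rows.append(row)
    section["rows"] = rows
    var updated = spec
    updated["requirementsSection"] = section
    return updated
}

var spec: [String: Any] = [
    "requirementsSection": ["rows": [
        ["title": "Upload a government-issued ID"],
        ["title": "Complete a biometric scan"],
    ]]
]
// One patch op, one change; everything else in the spec survives as-is.
spec = appendRow(to: spec,
                 path: "/requirementsSection/rows/-",
                 row: ["title": "See which documents are supported"])
```

The version bump (1.1, 1.2, …) then falls out naturally: each applied patch produces a new minor version of the same spec.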


When the agent hits a wall, it says so. Ask for a date of birth picker and it won't hallucinate one — it tells you the design system doesn't have a native date picker, proposes workarounds, and offers a "Track as DS Improvement" action that logs the gap. The system isn't just generating views. It's generating a roadmap for its own improvement.

Conversation in, native view out. It worked. But it didn't start this way.
"Is It Using Tools, or Just Re-prompting?"
My friend Sergey, a platform engineer, asked a simple question over lunch: "When you say it's an agent, do you know what it's actually doing? Is it using tools, or just re-prompting?"
I didn't know. I took notes on my phone while he ate a burger, went home, and started digging. When I asked the LLM to "turn on the secondary button," it was regenerating the entire view rather than modifying that one piece. Like asking someone to redraft an entire document because you wanted to fix a typo.
But knowing the problem existed wasn't the same as understanding it. When a button is broken, you can see it. When an agent misfires, the output looks plausible — the reasoning behind it is invisible. I needed to see what the agent was actually thinking.
So I built a hidden transcript exporter. Tap the builder view five times and a debug icon appears in the toolbar. Tap it and the system exports the full conversation as plaintext: every user prompt, every agent response, every decision the agent made along the way. Which action it classified. How many tool calls it made. What patches it applied. When errors were thrown and why.
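The exporter itself can be small. Here is a sketch, assuming each transcript entry records the role, timestamp, classified action, and tool-call count visible in the excerpt below; the types and field names are hypothetical.

```swift
import Foundation

// One logged turn of the conversation. Field names are illustrative.
struct TranscriptEntry {
    let role: String        // "User", "Agent", "Build Result", "Error"
    let timestamp: String   // e.g. "11:01"
    let action: String?     // "converse" | "generate" | "refine"
    let toolCalls: Int?     // how many capability tools the agent queried
    let text: String
}

// Render every turn as plaintext, exactly the shape the debug
// export produces: a bracketed header line, then the content.
func exportPlaintext(_ entries: [TranscriptEntry]) -> String {
    entries.map { entry in
        var header = "[\(entry.role)] \(entry.timestamp)"
        if let action = entry.action, let tools = entry.toolCalls {
            header += " (action: \(action), tools: \(tools))"
        }
        return header + "\n" + entry.text
    }.joined(separator: "\n")
}
```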

The transcript became the single most useful debugging tool in the entire project. Here's a real excerpt: I asked the agent to add a third row to an instruction view that already had two. Pay attention to the tools: 0 count on every agent turn:
[User] 11:00
Can we add a third row that says "see which documents
are supported", with a trailing info icon.
[Agent] 11:01 (action: converse, tools: 0)
The Instruction View supports a maximum of 5 rows, so I
cannot add a third row as requested.
Limitation: Maximum of 5 rows in the requirements section.
[User] 11:01
But you only have two currently, so you can add one more.
[Agent → Refine] 11:01 (action: refine, tools: 0)
Adding a third row for 'See which documents are supported.'
patch: [{"op":"add","path":"/requirementsSection/rows/-"...}]
[Build Result] 11:01 Version 1.3
[User] 11:02
Add an SF symbol to that new row and use the info variant.
[Agent → Refine] 11:02 (action: refine, tools: 0)
patch: [{"op":"add","path":"/requirementsSection/rows/-"...}]
[Build Result] 11:02 Version 1.4
[User] 11:02
You added a fourth row. Remove it. Apply the symbol to
the third row.
[Agent → Refine] 11:02 (action: refine, tools: 0)
patch: [{"op":"remove","path":"/requirementsSection/rows/2"},
{"op":"replace","path":"/requirementsSection/rows/1"...}]
[Build Result] 11:02 Version 1.5
[User] 11:03
You removed the 2nd row instead of the third.
[Error] 11:03
Refinement failed: The data couldn't be read because it
isn't in the correct format.
Three failures in three minutes, and every single one reads tools: 0. The agent hallucinated a limitation that didn't exist because it never looked up the schema. It added a duplicate instead of modifying in place because it didn't query the current view state. It patched the wrong index because it was guessing at the structure from memory. Every failure traced back to the same root cause: the agent was improvising when it should have been checking.
This is the equivalent of snapshot testing for AI behaviour. Snapshots gave the design system eyeballs to verify visual output. Transcripts gave me eyeballs to verify the agent's reasoning. Both close the same kind of loop: making invisible work visible so you can actually improve it.
How It Actually Works
The transcript made the problems legible. Fixing them meant three structural changes.
First: classify before acting. Every message now gets triaged — is the user conversing, generating, or refining? — before the agent does anything. This alone eliminated the misfires where a question triggered a generation or a refinement request triggered a full rebuild. Structured JSON output replaced freeform text. Zero parse failures after this change.
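The triage step can be sketched in a few lines, assuming the model is asked for structured JSON rather than freeform text. The field names and the example payload here are hypothetical.

```swift
import Foundation

// The three intents every incoming message is classified into.
enum Intent: String, Codable {
    case converse   // answer in chat, touch nothing
    case generate   // build a new view from scratch
    case refine     // patch the current spec
}

// Structured output contract for the classifier. Because the shape is
// fixed and Codable, a malformed response fails loudly at decode time
// instead of being half-parsed out of freeform text.
struct Classification: Codable {
    let action: Intent
    let confidence: Double
}

// A hypothetical structured response from the model:
let modelOutput = #"{"action": "refine", "confidence": 0.92}"#
let decision = try! JSONDecoder().decode(Classification.self,
                                         from: Data(modelOutput.utf8))
// Only a .generate classification may trigger a full rebuild;
// .refine is routed to the patching path, .converse to plain chat.
```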
Second: give it tools, not memory. Each component now declares a capability manifest — properties, tokens, valid values, constraints — all exposed as tools the agent can query at runtime. The tools: 0 failures from the transcript happened because these tools didn't exist yet. Once they did, the agent stopped improvising and started checking. The row-type swap in the refinement carousel? That's the agent calling a tool, confirming informative rows support multi-line content, and acting on real data instead of guessing.
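A capability manifest exposed as a queryable tool might look like this sketch. The component data (row types, multi-line support, trailing types) is loosely based on the list-item example earlier and is illustrative, not the real manifest.

```swift
// One row variant's declared capabilities.
struct RowCapability {
    let type: String
    let supportsMultiline: Bool
    let trailingTypes: [String]
}

// The manifest a list-item component declares about itself.
let listItemManifest = [
    RowCapability(type: "requirement", supportsMultiline: false,
                  trailingTypes: ["status", "chevron"]),
    RowCapability(type: "informative", supportsMultiline: true,
                  trailingTypes: ["icon", "none"]),
    RowCapability(type: "navigation", supportsMultiline: false,
                  trailingTypes: ["chevron"]),
]

// The tool call behind the row-type swap: instead of guessing from
// memory, the agent asks which row types can render multi-line content.
func rowTypes(supportingMultiline: Bool) -> [String] {
    listItemManifest.filter { $0.supportsMultiline == supportingMultiline }
                    .map { $0.type }
}
```

The answer comes back from data the component itself declared, which is exactly what turns tools: 0 into tools: 1.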
Third: specialise. Rather than one generalist agent stretching across every view type, each type — Context, Instruction, List, Form — has its own specialist generator with its own schema, content rules, and validation. The conversation agent acts as a router. A team of specialists behind a single interface.
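The router-plus-specialists shape reduces to one protocol, one generator per view type, and a dispatch table. All names and the stub specs below are illustrative.

```swift
// Every specialist exposes the same narrow interface.
protocol ViewGenerator {
    func generate(from prompt: String) -> String   // returns a view spec
}

// Each specialist owns its own schema, content rules, and validation;
// here they just return stub specs to show the shape.
struct InstructionGenerator: ViewGenerator {
    func generate(from prompt: String) -> String { #"{"type":"instruction"}"# }
}

struct ListGenerator: ViewGenerator {
    func generate(from prompt: String) -> String { #"{"type":"list"}"# }
}

// The conversation agent owns the chat; specialists own generation.
let specialists: [String: ViewGenerator] = [
    "instruction": InstructionGenerator(),
    "list": ListGenerator(),
]

// Routing: the classified view type picks the specialist.
func route(viewType: String, prompt: String) -> String? {
    specialists[viewType]?.generate(from: prompt)
}
```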
Optimise Cheap, Then Upgrade
All three changes — intent classification, capability tools, specialist generators — were built on GPT-4o. I assumed I didn't need a smarter model, just better architecture. The output got genuinely good. Then I checked my bill: less than a dollar. So I upgraded.
The generation quality jumped dramatically. But I'm glad I didn't start there. A powerful model will paper over bad architecture — it'll generate plausible output from a vague schema, and you'll think your system is working when really the model is just guessing well. Starting cheap forced me to make the schema airtight, the tools precise, and the constraints explicit. Every upgrade after that compounded on solid foundations rather than compensating for weak ones.
The Invisible Work
I spent a full day teaching the agent that "optional" should appear once in a row's status indicator, not repeated in the description. Another day getting SF Symbol validation to reject invalid glyphs without rejecting valid aliases. Another on token budgets so the system prompt didn't crowd out the user's actual request. None of this is visible to the user. All of it is why the product works.
That's the actual job. The difference between an AI feature and an AI product is the invisible work: the intent classification that stops the agent from guessing, the tool use that gives it surgical precision, the structured output that eliminates parse failures, the schema constraints that make the model's job tractable. Nobody sees it. They just see a view appear.
What's Next: Kill the Templates
The builder works. But it's constrained. Four view types, four schemas, four generators. Ask for a form with a list section below it and the system can't express it — that's not one of the four shapes. The architecture got the agent this far, but the ceiling is the templates themselves.
The fix is slots. Instead of four rigid schemas, every component declares which roles it can fill — header, content, actions, banner — and the LLM composes screens from an ordered list of slots. No view types. No routing. One universal renderer. A form field next to a list section next to an info banner becomes a valid screen without any template changes.
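As a sketch of the idea, and nothing more (this part isn't built yet): components declare which roles they can fill, a screen is an ordered list of filled slots, and validation collapses to checking each pairing. Roles and component names here are illustrative.

```swift
// The roles a slot can play on a screen.
enum SlotRole: String {
    case header, content, actions, banner
}

// A component declares which roles it is allowed to fill.
struct Component {
    let name: String
    let roles: Set<SlotRole>
}

let formField   = Component(name: "FormField",   roles: [.content])
let listSection = Component(name: "ListSection", roles: [.content])
let infoBanner  = Component(name: "InfoBanner",  roles: [.banner, .content])

// A screen the four fixed templates cannot express: a form field,
// a list section, and a banner composed together.
let screen: [(SlotRole, Component)] = [
    (.content, formField),
    (.content, listSection),
    (.banner,  infoBanner),
]

// Validation is just: does each component declare the role it fills?
let isValid = screen.allSatisfy { role, component in
    component.roles.contains(role)
}
```

No view types, no routing: one universal renderer walks the slot list in order.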
It's the same principle that runs through this entire project: don't hardcode what you can compose. The design system replaced hardcoded views with composable components. The builder replaced manual composition with conversational generation. Slots would replace fixed templates with composable structure. Each layer removes a constraint from the layer before it.
I haven't built this yet. But when I do, it'll be the case study about removing the last ceiling. In the meantime, I'd been so fixated on building in code that I hadn't opened Figma in weeks.
Everything lived in code. Figma had no idea any of this existed.