Why Your First AI Feature Shouldn't Be a Chat
Two years ago I built our first AI feature at Southern Lights. A prompt interface on top of our app. I demoed it once -- one carefully crafted prompt -- and it magically generated a complete project design from my input. The room was ecstatic.
The problem: that was the only prompt that worked.
To make that one demo work I had to feed the model everything we had. Documentation, data model explanations, customer goals. All of it. And it still only reliably handled the one scenario I'd rehearsed. Any other input fell apart. Instant context rot. We never shipped the feature. If I had known then what I know now, I would have built a button, not a prompt.
Everyone wants a chat
When I talk to companies exploring AI features, they almost always want the same thing: a prompt interface that "does everything." A text box where the user types what they want and the AI handles it.
This makes sense. It's how we all use LLMs today: you chat, you get a response or an action, you continue. It feels natural.
It's also the wrong first AI feature to build.
Why the chat-with-everything approach fails
Every time I ask a company what they want their AI prompt to be able to do, they list specific things. "Generate reports," "suggest configurations," "draft proposals."
They have concrete features in mind. Things their users have already asked for. But instead of building those features, they try to build a general-purpose chat that can handle all of them. And then they discover that making a general chat reliably do specific things takes enormous effort -- the right context, the right RAG pipeline, guardrails, evals -- and most teams underestimate this by an order of magnitude.
That's just the engineering side. There's a bigger problem.
A prompt with free text input and LLM output is non-deterministic. You can't test it like a normal feature. You can't validate that the user experience is good. Sure, you can let users rate messages as good or bad, but that's instrumenting the wrong end of the problem: you're measuring satisfaction with individual outputs instead of whether the feature itself works.
And the more freedom you give the user, the harder it gets. If they can ask anything, they will ask anything. And to handle anything, the LLM needs the right context -- which at that point means the entire product. Every data model, every business rule, every edge case. The surface area explodes.
Build buttons, not prompts
Here's what I tell every company instead.
Take that list of specific things you want the AI to do. Build each one as a button. Not a prompt. A button.
That button should trigger the AI behind the scenes and return structured output -- not free text. A JSON list, for example, that your UI renders into something the user can actually work with. The user then validates or modifies the result before it gets executed or saved into the system.
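Concretely, a single button's handler might look something like this minimal TypeScript sketch. The `/api/llm` endpoint, the prompt, and the `suggestConfigurations` name are placeholders I've made up for illustration, and zod stands in for whatever schema validation you prefer:

```typescript
import { z } from "zod";

// Hypothetical model call -- swap in whatever LLM client or endpoint you actually use.
async function callModel(prompt: string): Promise<string> {
  const res = await fetch("/api/llm", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  return (await res.json()).text;
}

// The button's contract: a fixed output shape, not free text.
const SuggestionList = z.array(
  z.object({
    title: z.string(),
    description: z.string(),
  })
);
type SuggestionList = z.infer<typeof SuggestionList>;

// Handler for one "Suggest configurations" button.
// The prompt is fixed by the feature, not typed by the user.
export async function suggestConfigurations(projectId: string): Promise<SuggestionList> {
  const prompt =
    `Suggest configurations for project ${projectId}. ` +
    `Respond with a JSON array of { "title", "description" } objects and nothing else.`;

  const raw = await callModel(prompt);
  const parsed = SuggestionList.safeParse(JSON.parse(raw));

  // If the model drifts from the contract, fail loudly -- don't render garbage.
  if (!parsed.success) throw new Error("Model output did not match the expected schema");
  return parsed.data; // the UI shows this as an editable list the user reviews before saving
}
```

The point is the contract: the model either produces the shape your UI expects, or the feature fails visibly. No free text leaks through.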
This changes everything.
It's fast. Did you want your prompt to handle five different things? Build five different buttons. Release them like you'd release any other feature. Iterate on each one independently.
It's controllable. You define the input, you define the output format, you own the scope. No free text input means no surprises. If you can build it without a text field, always do.
It's measurable. Do users click the button? Do they change the output before accepting? Do they accept at all? This is data you can use to iterate on the prompt and context until it actually works. Real product metrics instead of thumbs up/thumbs down on chat messages.
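To make that concrete, here's a hypothetical sketch of the kind of metrics each button can produce. The event shapes and names are made up, not from any real analytics library:

```typescript
// Hypothetical event log for a single AI button.
type AiButtonEvent =
  | { kind: "clicked" }
  | { kind: "accepted"; edited: boolean }
  | { kind: "dismissed" };

// Acceptance and edit rates turn "does the prompt work?" into ordinary product metrics.
function buttonMetrics(events: AiButtonEvent[]) {
  const clicked = events.filter((e) => e.kind === "clicked").length;
  const accepted = events.flatMap((e) => (e.kind === "accepted" ? [e] : []));
  return {
    // Did users keep the output at all?
    acceptanceRate: accepted.length / Math.max(clicked, 1),
    // How often did they have to fix it before accepting?
    editRate: accepted.filter((e) => e.edited).length / Math.max(accepted.length, 1),
  };
}
```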
Always keep a human in the loop
Any output from an LLM should be treated as unsafe. Connecting a model to anything with write access to your system is a risk. Never let an AI auto-execute actions on behalf of the user.
Put a human validation step between the AI output and the actual operation. Always. Human review doesn't eliminate the risk of bad AI output, but it adds a necessary layer of protection before anything touches your real data.
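One way to make that rule structural rather than a convention -- sketched here with made-up types, not any particular framework -- is to let the AI write only to a draft state, with the real write gated behind an explicit user action:

```typescript
type Draft<T> = { payload: T; status: "pending_review" | "approved" | "rejected" };

// The model's output never touches real data directly; it only produces a draft.
function stageAiOutput<T>(payload: T): Draft<T> {
  return { payload, status: "pending_review" };
}

// Only a human action moves a draft into the system, after they've seen (and possibly edited) it.
async function approveDraft<T>(draft: Draft<T>, save: (payload: T) => Promise<void>): Promise<void> {
  if (draft.status !== "pending_review") throw new Error("Draft already resolved");
  await save(draft.payload); // the actual write, gated behind the user's explicit approval
  draft.status = "approved";
}
```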
When free text actually works
There is one case where a conversational interface earns its place: reading and interpreting large amounts of data. If you have a large dataset and can give the model specific context about what the user can ask about, a dialogue with an LLM can be extremely valuable. This still requires context scoping -- the model needs to understand your data, its structure, and its boundaries to give useful answers. But the key difference is that this is read-only. The user is exploring data, not triggering actions. The risk profile is completely different.
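As a rough illustration of what that scoping can look like, here's one possible shape of a read-only Q&A path, using text-to-SQL as the example. The schema, the `askModel` signature, and the SELECT-only guard are assumptions for the sketch, not a recipe:

```typescript
// Scoped context: the tables the user may ask about, and nothing else.
const SCHEMA_CONTEXT = `
Tables:
  projects(id, name, status, created_at)
  reports(id, project_id, total_hours)
`;

// askModel and runReadOnlyQuery are whatever LLM client and read-only DB connection you already have.
async function answerDataQuestion(
  question: string,
  askModel: (prompt: string) => Promise<string>,
  runReadOnlyQuery: (sql: string) => Promise<unknown[]>
): Promise<unknown[]> {
  const sql = await askModel(
    `${SCHEMA_CONTEXT}\nWrite one SQL SELECT statement that answers: ${question}\nReturn only the SQL.`
  );

  // Guardrail: this path can only ever read. Anything else is rejected.
  if (!/^\s*select\b/i.test(sql.trim())) throw new Error("Only read queries are allowed");
  return runReadOnlyQuery(sql);
}
```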
AI is a feature, not a layer
The instinct to build a big chat interface comes from a misunderstanding of what AI is in your product. It's not a layer you put on top of everything and hope it solves your users' problems.
AI functionality is a subset of your features. Specific, scoped, measurable. Build it like you'd build any other feature -- with clear inputs, clear outputs, and a way to know if it's working.
Start with one button. Ship it. Measure it. Then build the next one.