Framing
I don't design chatbots. I design judgment-support systems where conversation is the interface.
Over the past year, I've built and iterated on AI systems embedded in real learning, performance, and evaluation workflows (promotion writing, performance reflection, enablement, and self-assessment) where outputs don't just save time. They shape judgment, confidence, fairness, and accountability.
In these contexts, the cost of getting things wrong isn't primarily technical. It's human. What a system infers, hides, or smooths over can influence how people think and decide long after a single interaction ends.
This piece is a snapshot of what I've learned so far from building, testing, and revising these systems in real use. It's a record of the tradeoffs I now design around, and the practices I return to because they consistently hold up when stakes are real.
The question I keep coming back to: Will this system make the user clearer and more capable next time—or just faster this time?
That question guides almost every decision below.
What Mattered Less Than I Expected
Early on, I assumed much of the work would live in prompts—finding the right phrasing, tone, or examples to get better outputs.
Prompts matter. But they turned out to be the least interesting part of the problem.
Once AI systems touch learning, evaluation, or performance, what matters far more is the system wrapped around the model:
- what the system is allowed to decide (and what it isn't)
- how much thinking it leaves with the user
- how it behaves when inputs are weak, biased, or incomplete
- how it signals uncertainty, limits, and responsibility
Most of the real design work has been about boundaries, workflows, and failure modes—not clever phrasing.
Tensions I Now Design Around
The following tensions surfaced once people started using these systems for real work—and they now show up explicitly in my designs.
Speed vs. Skill-Building
AI can dramatically increase speed in writing-heavy workflows. For busy people under pressure, that speed can feel genuinely delightful: paste messy notes, get a clean draft, move on.
Skill-building can also be delightful, but in a different way. There's satisfaction in clarifying your own thinking, strengthening how you articulate impact, or realizing (over time) that you're getting better at a task.
The tension isn't that one is good and the other is bad. It's that they reinforce different behaviors.
I ran into this directly while designing AI-assisted writing flows. Some users wanted the fastest possible path. Others wanted help thinking—surfacing gaps, sharpening claims, and improving how they write over time.
Rather than treating one approach as correct, I've learned to be explicit about what each path optimizes for:
- Guided paths ask users to draft first, then use AI to surface gaps and improve clarity. These reinforce thinking and writing muscle. They can feel like more work, but it's work well spent.
- Fast paths allow users to paste rough context and receive synthesis quickly. These prioritize speed and reduced cognitive load. They're easier and often more appealing, but they can lead to skill atrophy.
Both can be the right choice depending on the moment. Designing responsibly means making those tradeoffs visible, not accidental.
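One way to keep those tradeoffs visible is to make them part of the design artifact itself, so the cost of each path is shown at the moment of choice rather than buried in UI copy. A minimal sketch; the path names and fields are my own illustration, not any production system's:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WritingPath:
    """One entry point into an AI-assisted writing flow."""
    name: str
    optimizes_for: str  # what this path honestly prioritizes
    tradeoff: str       # the cost we surface to the user, not hide

# Hypothetical path definitions: the point is that the tradeoff is
# a field in the artifact, so it can't silently disappear from the UI.
GUIDED = WritingPath(
    name="guided",
    optimizes_for="skill-building: draft first, then AI surfaces gaps",
    tradeoff="more effort up front",
)
FAST = WritingPath(
    name="fast",
    optimizes_for="speed: paste rough notes, get a synthesized draft",
    tradeoff="less practice; risk of skill atrophy over time",
)

def describe(path: WritingPath) -> str:
    """Copy shown before the user commits to a path."""
    return f"{path.name}: {path.optimizes_for} (tradeoff: {path.tradeoff})"
```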
Automation vs. Transparency
AI systems are very good at collapsing steps. They can infer intent, fill in gaps, apply rules, and produce outputs that look complete and confident—often without showing how they got there.
In learning- and performance-adjacent systems, the risk isn't primarily about learning loss. It's about invisible decision-making.
When a system automates too much without making its assumptions legible, users can't tell:
- what was inferred vs. explicitly provided
- where uncertainty exists
- what criteria were applied
- where the system stopped and their responsibility began
That lack of visibility shifts ownership, and the shift isn't always obvious to the user. Outputs get pasted downstream on the assumption that they're correct, not because users were told to trust them, but because the system didn't make its limits clear.
Because of this, I favor transparency over maximum automation whenever judgment or accountability is involved.
In practice, this often means:
- asking follow-up questions instead of inferring missing intent
- surfacing uncertainty or incomplete evidence instead of hiding it
- framing outputs as drafts, not conclusions
- including explicit responsibility reminders when outputs feed real decisions
These choices can make systems feel slower or less "magical." But they preserve trust, accountability, and human judgment—especially as stakes rise.
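To make "inferred vs. explicitly provided" concrete, here's a minimal sketch of an output envelope that keeps assumptions and open questions attached to the draft instead of letting them dissolve into fluent text. The structure and field names are my own illustration, not a specific product's schema:

```python
from dataclasses import dataclass, field

@dataclass
class DraftOutput:
    """An AI output framed as a draft, with its assumptions kept visible."""
    text: str
    provided_facts: list[str] = field(default_factory=list)       # stated by the user
    inferred_assumptions: list[str] = field(default_factory=list) # filled in by the model
    open_questions: list[str] = field(default_factory=list)       # gaps worth a follow-up

    def render(self) -> str:
        """Show the draft with its limits, not just its prose."""
        parts = [f"DRAFT (the final call is yours):\n{self.text}"]
        if self.inferred_assumptions:
            parts.append("Assumed, not stated: " + "; ".join(self.inferred_assumptions))
        if self.open_questions:
            parts.append("Worth clarifying first: " + "; ".join(self.open_questions))
        return "\n\n".join(parts)
```

The design choice is that uncertainty travels with the output: anyone who pastes the draft downstream also carries its caveats.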
Delight Before Seriousness
When I say "delight," I don't mean playfulness or novelty. I mean the conditions that make people want to engage.
In learning-, reflection-, or performance-adjacent systems, engagement isn't guaranteed. People may be anxious, rushed, or unsure what's expected. If a system feels judgmental or overwhelming from the start, rigor won't help—users will disengage or defer their thinking to the tool.
That's why I think about delight before constraints.
Delight is how a system earns trust and motivation. Seriousness is how it protects judgment and outcomes. They aren't opposites—they're sequential.
What changes is how delight shows up. In low-stakes systems, it may look expressive: curiosity, identity, or lightness that lowers fear and self-censorship. As stakes rise, delight becomes quieter. It shows up as speed, clarity, predictability, and emotional safety—not playfulness.
Once engagement is earned, structure can do its work. Clear workflows and firm boundaries don't feel restrictive when users feel oriented and safe. They make serious thinking possible.
Practices I Now Default To
Alongside these tensions, there are a handful of concrete practices I return to. They're responses to things that broke, confused users, or undermined the goals of the system.
Treat Conversation Like a Workflow (Not an Open Chat)
Early on, I experimented with more open-ended conversational designs. They felt flexible and friendly—until people started using them for real work.
What I saw instead:
- users jumping ahead before the system had enough context
- feedback being interpreted without the full picture
- endless loops because it wasn't clear when the task was "done"
I now default to designing conversational systems as explicit workflows. That usually means:
- a clear sequence (input → processing → output → optional refinement)
- gating feedback until enough information is collected
- making the end state visible and explicit
In one reflection system, I deliberately designed feedback to appear only after all questions were answered. This allowed the system to reason across inputs, catch contradictions, and offer feedback that reflected the whole picture rather than isolated responses.
Some users asked for feedback after each response. While that may have felt more responsive, it also would have encouraged premature interpretation—both by the system and by the user—before enough context existed.
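Reduced to a skeleton, that gating pattern looks something like the sketch below. The question names and the feedback stub are illustrative placeholders, not the production flow:

```python
def generate_feedback(answers: dict[str, str]) -> str:
    """Stub for a model call that reasons across all answers at once."""
    return f"Feedback drawing on all {len(answers)} answers..."

class ReflectionSession:
    """Conversation as workflow: feedback is gated until every input exists."""

    QUESTIONS = ["situation", "actions", "impact", "what_you_would_change"]

    def __init__(self) -> None:
        self.answers: dict[str, str] = {}

    def submit(self, question: str, answer: str) -> str:
        self.answers[question] = answer
        missing = [q for q in self.QUESTIONS if q not in self.answers]
        if missing:
            # No partial feedback: the system doesn't interpret until it can
            # reason across all inputs and catch contradictions between them.
            return f"Noted. Still to answer: {', '.join(missing)}"
        return generate_feedback(self.answers)
```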
Flexibility feels nice. Structure is what makes complex thinking possible.
Never Let AI Smooth Over Weak Thinking
One of the most tempting things to do with AI is to quietly fix weak inputs.
In learning- and performance-adjacent systems, this is often the wrong move.
When AI rewrites vague claims, hedging language, or biased phrasing automatically, the output improves—but the user never sees what was wrong. The system gets credit for improvement, and the learning moment disappears.
I now design systems to preserve weakness long enough to address it. In practice, that means:
- keeping problematic language visible
- surfacing gaps or unsupported claims
- asking users to reframe or add evidence before regenerating
In one manager-writing tool, phrases like "they're not really leadership material" or "I just helped out where I could" were surfaced verbatim rather than quietly rewritten. The system flagged these phrases and prompted the manager to reframe them using observable behaviors, evidence, or concrete outcomes.
Nothing was auto-corrected. The manager had to notice the problem, confront it, and do the reframing themselves—with guidance, but without the system inventing clarity on their behalf.
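As a sketch of that flag-don't-fix contract: the patterns below are illustrative stand-ins (a real system would likely rely on model-based detection rather than a fixed list), but the contract is the point. The function returns prompts; it never rewrites:

```python
import re

# Illustrative patterns for vague or judgment-laden phrasing.
FLAG_PATTERNS = [
    (r"not really \w+ material", "Replace the label with observable behaviors."),
    (r"\bjust helped out\b", "Name what you actually did and its outcome."),
    (r"\bsort of\b", "If you have evidence, state it directly."),
]

def review(text: str) -> list[str]:
    """Surface weak phrasing verbatim with a reframing prompt.

    Deliberately returns prompts instead of rewritten text:
    the user, not the system, does the reframing.
    """
    notes = []
    for pattern, prompt in FLAG_PATTERNS:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            notes.append(f'Flagged: "{match.group(0)}" -> {prompt}')
    return notes
```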
The output improves, but more importantly, the underlying judgment improves too. Automatically smoothing over weak or biased thinking produces cleaner text. Preserving it (briefly and deliberately) is what actually teaches people what "good" looks like.
Design Knowing You Won't Have Full Control
No matter how careful you are, AI systems will sometimes:
- overreach
- hallucinate
- ignore constraints under pressure
I stopped trying to eliminate this entirely (trust me, it's a losing battle) and started designing for recovery instead.
That often shows up as:
- visible uncertainty rather than hidden confidence
- explicit reminders of user responsibility
- clear ways for users to correct or reject outputs
In one coaching system, testing revealed that users were accepting advice they didn't actually find helpful—not because it was good, but because disagreeing felt like friction. The system looked successful on paper, but it was subtly encouraging compliance.
Adding an explicit "this didn't land—try again" path changed that dynamic. It signaled that disagreement was expected, not exceptional, and made it easier for users to correct the system rather than defer to it.
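In interaction terms, the change was small: rejection became a first-class outcome rather than the absence of acceptance. A minimal sketch, with hypothetical names:

```python
from enum import Enum, auto

class AdviceOutcome(Enum):
    ACCEPTED = auto()
    DIDNT_LAND = auto()  # explicit, low-friction disagreement

def next_step(outcome: AdviceOutcome) -> str:
    """Route each outcome; disagreement gets its own path, not a dead end."""
    if outcome is AdviceOutcome.DIDNT_LAND:
        # Rejection routes to a retry that asks *why* it missed, so the
        # correction carries signal instead of just regenerating blindly.
        return "What didn't land? I'll take another pass with that in mind."
    return "Glad that helped. Want to go deeper on any part?"
```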
Perfect control isn't realistic. Designing systems that support correction, disagreement, and user ownership is.
Modularize Early
This isn't specific to learning or performance systems. It's true of most AI systems—and frankly, most complex software. But in judgment-adjacent systems, the consequences of getting this wrong compound quickly, so I'm calling it out explicitly.
As systems grow, large monolithic prompts become brittle very quickly.
Early on, modularity felt like an implementation concern. Over time, I realized it was a design control.
When everything lives in one giant prompt:
- it's unclear which part of the system is responsible for which behavior
- failures are hard to diagnose, and even harder to explain
- small changes can have cascading effects that are difficult to trace
- judgment boundaries become implicit instead of enforced
I now default to separating concerns explicitly. That usually means isolating:
- UX flow definitions (user-facing text and sequencing)
- system instructions (role, guardrails, enforcement)
- evaluation logic (signals, rubrics, thresholds)
- coaching or response templates (how feedback is phrased)
This separation does more than improve maintainability. It makes the system legible (to collaborators and to future me). When something goes wrong, failures are local and traceable rather than opaque.
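A toy version of that separation, with each concern owned by its own artifact and composed only at the last step. The instructions, rubric, and template here are illustrative, not lifted from a real system:

```python
# Each concern lives in its own artifact; composition happens last.
SYSTEM_INSTRUCTIONS = (
    "You are a reflection coach. Never invent evidence. "
    "Never score; scoring is handled by evaluation logic, not you."
)  # role + guardrails

EVALUATION_RUBRIC = {
    "specificity": "Claims are tied to observable behaviors or outcomes.",
    "evidence": "Each claim cites at least one concrete example.",
}  # signals and thresholds live here, not inside the prose of the prompt

COACH_TEMPLATE = "Strength: {strength}\nGap: {gap}\nOne next step: {step}"  # phrasing only

def build_prompt(user_input: str) -> str:
    """Assemble the final prompt from independently editable parts."""
    rubric_text = "\n".join(f"- {k}: {v}" for k, v in EVALUATION_RUBRIC.items())
    return (
        f"{SYSTEM_INSTRUCTIONS}\n\n"
        f"Evaluate against:\n{rubric_text}\n\n"
        f"Respond using this template:\n{COACH_TEMPLATE}\n\n"
        f"User input:\n{user_input}"
    )
```

When the rubric changes, only EVALUATION_RUBRIC changes; when the tone changes, only the template does. That locality is the legibility payoff.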
Questions I'm Still Thinking About
As I keep building and testing these systems, a few questions keep resurfacing:
- How do we design for skill-building without creating friction that turns people away in time-pressured environments?
- Where is the right line between helpful synthesis and over-automation as models get more capable?
- What responsibilities do we have as designers as these systems start shaping norms, not just outputs?
Designing With Limits (and Evolution)
These systems aren't static. They shape behavior over time—sometimes in ways you didn't intend. They also encode values, whether you name them or not.
I've learned to treat early versions as hypotheses: to watch where users hesitate, defer, or misinterpret, and to revisit tradeoffs once real usage patterns emerge. What feels helpful in a demo can behave very differently once a system is embedded in real work.
Where I stand now isn't where I expect to stand forever. But this is where I stand today.
Right now, I believe that designing AI systems close to judgment requires resisting the urge to optimize for speed alone. It means being deliberate about where automation stops, where human thinking must remain visible, and where responsibility clearly rests with the user.
That's why I keep returning to the same question: Will this system make the user clearer and more capable next time—or just faster this time?
If this reflection helps you notice different tradeoffs, pressure-test your own instincts, or design with a bit more intention around judgment and accountability, then it's done what I hoped it would do.
And if your answers differ from mine, I'm genuinely interested in how others are navigating these same tensions.