Trust Has to Be Engineered: Notes from ODSC East Day 3 by Dakota Kim

Day 3 at ODSC East felt like a natural continuation of the day before. I left feeling both inspired and energized.

In my previous recap, the theme that stood out was maturity. Enterprise AI is moving beyond the clean demo space and into the messier world of ownership, reliability, data access, evals, governance, and cost. Day 3 pushed that one layer deeper: what does it take for people to trust these systems when the stakes start to matter?

For leaders, trust is not abstract. It is the difference between a pilot people admire and a system they are willing to put into a real workflow. Trust means knowing where an answer came from, what the system was allowed to do, how it was evaluated, and what happens when it gets something wrong.

Put differently: Day 2 made the operational layer visible. Day 3 made the trust layer unavoidable.

Trust Has to Be Engineered

Agents were still everywhere, but the most useful conversations were about making agentic workflows testable.

That matters because agents fail in ways that feel different from traditional software. They can call the wrong tool, misunderstand context, follow a plausible but bad path, satisfy the surface form of a request while missing the intent, or behave differently when the surrounding workflow changes.

Traditional tests still matter, but they do not cover the whole surface. A unit test can tell you whether a function behaves correctly for a known input. An agent also needs to be tested against situations: messy users, ambiguous instructions, missing data, bad tool outputs, policy constraints, and multi-step workflows where one small mistake compounds.

That is why the simulation-as-evals idea stood out to me. By simulating realistic users, tools, workflows, and failure scenarios, teams can watch how an agent behaves before it reaches production. Simulation cannot guarantee safety, but it gives teams a better way to find weak points than waiting for real users to discover them.
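
To make that concrete, here is a minimal sketch of what a scenario-based eval harness could look like. Everything here (the Scenario fields, run_scenario, the stub agent) is a hypothetical illustration, not a specific framework from the conference; the point is that the test fixture is a situation, not a single prompt.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Scenario:
    """One simulated situation the agent must handle."""
    name: str
    user_messages: list[str]              # scripted, possibly messy, user turns
    tool_responses: dict[str, str]        # canned tool outputs, including bad ones
    checks: list[Callable[[str], bool]]   # assertions on the agent's final answer

@dataclass
class ScenarioResult:
    name: str
    passed: bool
    failures: list[str] = field(default_factory=list)

def run_scenario(agent: Callable[[list[str], dict[str, str]], str],
                 scenario: Scenario) -> ScenarioResult:
    """Drive the agent through one simulated situation and score the outcome."""
    answer = agent(scenario.user_messages, scenario.tool_responses)
    failures = [f"check {i} failed" for i, check in enumerate(scenario.checks)
                if not check(answer)]
    return ScenarioResult(scenario.name, passed=not failures, failures=failures)

# One scenario: an ambiguous request plus a degraded tool, the kind of
# situation a clean benchmark prompt never exercises.
refund = Scenario(
    name="ambiguous-refund-request",
    user_messages=["I want my money back for the thing from last week"],
    tool_responses={"order_lookup": "ERROR: timeout"},
    checks=[lambda a: "refund" in a.lower(),
            lambda a: "which order" in a.lower()],
)

def stub_agent(messages: list[str], tools: dict[str, str]) -> str:
    return "Happy to help with a refund. Which order do you mean?"

print(run_scenario(stub_agent, refund))  # passed=True
```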

Projects like Vending-Bench and Andon Labs' longitudinal research are interesting for this reason. They explore whether AI systems can maintain goal coherence over longer horizons while conditions change around them. That feels much closer to the real problem than asking whether a model can answer one clean prompt in isolation.

The same idea showed up in evaluation. Static benchmarks are useful for broad comparison, but they are often too shallow for systems that plan, act, and adapt. If an agent uses tools, makes plans, asks follow-up questions, and changes course, the evaluation environment needs to reflect that.

This is where trust becomes an engineering practice. The system needs a clear job, known boundaries, realistic evaluation, and a record of what happened when things went wrong. The more agency we give these systems, the more important it becomes to understand their behavior before users rely on them.

A related point is how difficult it can be to define the goal or intent of an agent without introducing noise.

With traditional software, intent is usually distributed across requirements, code, tests, documentation, and the assumptions shared by the team. With agents, some of that intent moves into prompts, tool descriptions, policies, examples, retrieval context, and runtime instructions. That creates a larger surface area for misunderstanding.

A small change in wording, tool behavior, available context, or workflow design can change what the agent thinks it is supposed to do. The system may still appear to work, but it may be optimizing for the wrong thing, skipping an important constraint, or satisfying the surface form of a request while missing the real need behind it.

That becomes harder as teams move faster with AI. Code can change quickly. Prompts can change quickly. Tools can be added or swapped. Workflows can evolve before everyone has the same mental model of the system. The faster the system changes, the more important it becomes to preserve shared understanding.

Teams need better ways to see what changed, why it changed, and whether the system still behaves as expected. That means versioning more than code. It means tracking prompts, tools, policies, retrieval behavior, eval results, and the assumptions behind major design decisions.
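
As a rough sketch of what "versioning more than code" could mean in practice, the manifest below snapshots every behavior-relevant input into one fingerprint that can be logged with each decision. The names and fields (SystemManifest, eval_run_id, and so on) are illustrative assumptions, not a standard schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class SystemManifest:
    """Snapshot of everything the agent's behavior depends on,
    not just the application code."""
    code_version: str                 # git SHA of the application
    prompt_versions: dict[str, str]   # prompt name -> content hash
    tool_versions: dict[str, str]     # tool name -> schema/description hash
    policy_version: str               # governance rules in effect
    retrieval_index: str              # which knowledge snapshot was queried
    eval_run_id: str                  # the eval results this snapshot passed

    def fingerprint(self) -> str:
        """Stable ID for 'the system as it behaved at this moment'."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

manifest = SystemManifest(
    code_version="a1b2c3d",
    prompt_versions={"triage": "sha256:9f3e12", "summarize": "sha256:77ab04"},
    tool_versions={"order_lookup": "sha256:c4d1f8"},
    policy_version="refund-policy-v4",
    retrieval_index="kb-snapshot-2025-05-14",
    eval_run_id="eval-3182",
)
print(manifest.fingerprint())  # attach this ID to every logged agent decision
```

With something like this in place, "the system changed and nobody noticed" becomes "the fingerprint changed, here is why."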

Regulation Looks a Lot Like Good Engineering

One of the more useful framings from the day was around AI regulation.

Regulation can sound like a separate universe from engineering: legal language, slow-moving process, and requirements that arrive late in the build. In practice, though, a lot of what regulators care about maps directly onto good ML and software practice: observable systems, traceable data, accountable outputs, documented processes, monitoring, and clear responsibility.

That was especially clear in the healthcare and BioPharma sessions. Real-time AI risk monitoring, the AI-enabled clinician, drug development, and foundation model evaluation for in-silico target discovery all point at the same reality. In high-stakes domains, usefulness is inseparable from safety, auditability, and outcomes.

Healthcare is a helpful forcing function because vague trust does not go very far. If a model drifts, a data feed changes, or recommendations become unreliable, the consequences can be serious. The system needs monitoring, workflow-aware governance, and human-in-the-loop design that matches the stakes.

The same lesson applies outside healthcare too. Regulated domains make the discipline visible, but every team deploying AI eventually has to answer similar questions. Where did the data come from? Who was allowed to access it? What process produced the output? What evidence supports it? What happens when the system is wrong?

Those questions belong to compliance, product quality, engineering, and trust at the same time.

Healthcare Makes This Concrete

Healthcare makes these questions feel less theoretical to me. I have worked in admin and research roles in healthcare before, so it is easy to imagine how helpful these tools could be when they are designed carefully.

There is so much coordination work, documentation, research support, patient context, and operational friction in healthcare. AI could help with a lot of that. But the bar has to be different because the work is tied to real people, real workflows, and real consequences.

The stakes demand better questions than "is the model impressive?" Does it fit the workflow? Does it improve outcomes, reduce burden, or help people make better decisions? Can clinicians trust it? Can researchers audit it?

Domain-specific AI requires humility. The model is only one part of the system. The workflow, the user, the risk, the data, and the outcome all matter, and maintaining that trust requires ongoing research, evaluation, and engineering work.

Data Quality Is Part of Trust

The data theme also got sharper: data quality is part of trust.

Data quality goes well beyond retrieval quality. It shapes training, evaluation, model behavior, latency, freshness, and production reliability. The session "Models Are What They Eat" captured a lot of this. If teams want smaller, cheaper, more specialized, or more capable models, data curation becomes part of the system.

There was also a thread around reasoning and retrieval. RAG has become a familiar pattern, but the move from RAG to agents changes the expectations. An agent may search, use tools, self-correct, take actions, or coordinate across systems. The data layer starts supporting decisions in addition to lookup.
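
A small, hypothetical sketch of that shift: in the function below, retrieved order and policy data does not just shape the wording of an answer, it decides whether the agent acts, asks, or escalates. The field names and the 24-hour freshness rule are invented for illustration.

```python
from datetime import datetime, timezone

def handle_refund(order: dict, policy: dict, execute_refund, ask_user):
    """The data layer gates *whether* to act, not just what the answer says."""
    # Decision 1: freshness. Stale order data means ask, do not act.
    # (order["updated_at"] is assumed to be a timezone-aware datetime.)
    age_h = (datetime.now(timezone.utc) - order["updated_at"]).total_seconds() / 3600
    if age_h > 24:
        return ask_user("The order record looks stale; can you confirm the order?")

    # Decision 2: policy bounds. Small refunds proceed, large ones escalate.
    if order["total"] <= policy["auto_refund_limit"]:
        return execute_refund(order["id"])
    return ask_user("This refund is above my limit; routing to a human for approval.")
```

In a lookup-only pipeline, a stale record or a missing policy just produces a worse answer. Here it changes what the system is allowed to do, which is exactly why data quality becomes part of trust.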

The production reliability session added another useful wrinkle. AI systems can behave differently across environments because of nondeterminism, hardware differences, runtime behavior, latency changes, and hidden reliability gaps. A system that works in development may not behave the same way in production.

That is uncomfortable, but useful to face directly. It pushes teams toward better observability, better evals, better deployment practices, and more humility about what they think they know.

What I Am Taking Away

Day 2 made me think about maturity. Day 3 made me think about trust.

The two are tied together. A mature AI system does not just perform a task; it can also be evaluated, monitored, governed, revised, and understood by the people depending on it.

The practical takeaway is simple enough to write down and hard enough to keep honest: build the eval loop early.

Stress-test your applied AI. Build systems that continue to test as prompts, models, runtimes, tools, data sources, and workflows evolve. Many organizations have already learned that MLOps requires dedicated practice and ownership. Evals and governance are likely to demand the same kind of dedicated practice as part of building trust in AI products.
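
One way to keep that loop honest is to make eval results a gate in the same pipeline that ships code. The sketch below is illustrative only: it assumes the ScenarioResult records from the earlier harness sketch, and build_agent and SCENARIOS are hypothetical stand-ins for however a team constructs its agent and scenario suite.

```python
import sys

PASS_RATE_FLOOR = 0.95  # below this, the change does not ship

def eval_gate(results) -> int:
    """Turn scenario results into a CI exit code: 0 = ship, 1 = block.

    `results` is any iterable of objects with .name / .passed / .failures,
    such as the ScenarioResult records from the harness sketched earlier.
    """
    results = list(results)
    rate = sum(r.passed for r in results) / len(results)
    for r in results:
        if not r.passed:
            print(f"FAIL {r.name}: {r.failures}")
    print(f"pass rate: {rate:.1%} (floor {PASS_RATE_FLOOR:.0%})")
    return 0 if rate >= PASS_RATE_FLOOR else 1

if __name__ == "__main__":
    # Hypothetical wiring: rebuild the agent, replay every scenario, gate the build.
    results = [run_scenario(build_agent(), s) for s in SCENARIOS]
    sys.exit(eval_gate(results))
```

The specifics will vary by team, but the discipline is the same: every change to a prompt, model, or tool has to clear the eval gate before users see it.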

That means knowing what data was used, what changed, how the system was evaluated, and where human judgment still belongs. It means treating regulation and governance as design constraints, and giving users a way to trust the system without asking them to trust it blindly.

That, to me, is one of the important next phases of AI work. The teams that do this well may not always have the flashiest demos. But they will have something more durable: clearer ownership, better evals, stronger data practices, better logs, and a deeper understanding of the people depending on the system.

Dakota Kim