Sonos Engineering Leader Manushi Sheth on Why ML Operationalization Is the Hardest Part of Building Mental Health Software
An engineering leader who has spent years operationalizing machine learning systems on cloud infrastructure spent two weeks reviewing 72-hour mental health prototypes — and found that the gap between an AI feature that works in a notebook and an AI feature that helps a real user in a real moment is wider, and more consequential, than almost any team in the field has yet acknowledged.
In the modern data and ML stack, there is a discipline that goes by the name of operationalization. It is the work of taking a model that performs well in a research environment and turning it into a system that performs well in production: instrumented, monitored, versioned, audited, gracefully degrading when the world stops looking the way the training data assumed it would. The discipline does not produce features that the user can see. It produces a system that does not break in ways that the user has to live with.
Manushi Sheth has spent her career inside this discipline. As an Engineering Leader at Sonos, with extensive experience in data analytics, machine learning operationalization, and AI ethics, her practice spans cloud infrastructure on AWS, modern data stacks built on Apache Iceberg, and the hands-on Python and SQL implementation work that the operationalization layer demands. When Hackathon Raptors invited her to evaluate seven projects from MINDCODE 2026 — an international 72-hour hackathon focused on software for human health — she encountered a category of system in which the operationalization gap was not just a technical concern but an ethical one. The cost of a poorly operationalized recommendation system at a music company is a user listening to the wrong song. The cost of a poorly operationalized intervention model in a mental health product can be measured in entirely different units.
“The hardest part of building mental health software with AI in it is not the AI,” Sheth observes. “It is the part after the AI works. It is the part where you have to know whether the model is still doing the right thing, whether the data it sees in production looks like the data it was trained on, whether its outputs are still safe under conditions you did not anticipate, and what happens to the user if any of those answers turns out to be no. Most hackathon teams in this space had built impressive AI demos. Very few had built the operational layer underneath that determines whether the AI is safe to deploy.”
The Discipline of Knowing When the Model Is Wrong
A pattern Sheth repeatedly saw across MINDCODE submissions was the use of large language models, classification systems, and recommendation engines as the product’s central intelligence. The teams varied widely in how they integrated the models. Some used external API calls to commercial inference providers. Some used fine-tuned open-source models. Some used retrieval-augmented generation pipelines built on vector databases. The technical sophistication was real. What was almost universally missing was the layer of instrumentation that, in any production ML system, exists to answer one question: is the model still doing what we think it is doing?
“In production ML, the assumption is that the model will eventually be wrong about something,” Sheth explains. “Either because the input distribution shifts, or because the user population changes, or because someone exploits an adversarial pattern, or because the model provider silently updated the weights, or because the system started seeing inputs nobody anticipated. The operational layer exists to detect wrongness early enough to act on it. Without that layer, the model is wrong for as long as it takes someone to notice — which, in a consumer mental health product, could be a very long time.”
Her recommendation in this domain was specific: instrument the model from the first integration. Log every input and every output, with appropriate privacy protections. Track the distribution of outputs over time. Sample a fraction of outputs for human review. Define what an “unusual” output looks like and trigger an alert when one is produced. Build a kill switch that can disable the model independently of the rest of the product. Build a feedback loop so user reports of harmful or wrong outputs flow back into the operational layer.
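The checklist above can be sketched as a thin wrapper around any inference call. This is a hedged illustration, not a reconstruction of any MINDCODE team's system: the `model_fn` callable, the label set, and the alert labels are all hypothetical placeholders.

```python
import hashlib
import logging
import time
from collections import Counter

logger = logging.getLogger("model_ops")

class InstrumentedModel:
    """Minimal sketch of the operational layer described above.
    The wrapped model and its label vocabulary are assumptions."""

    ALERT_LABELS = {"crisis", "unknown"}   # what an "unusual" output looks like, defined up front

    def __init__(self, model_fn, enabled=True):
        self.model_fn = model_fn           # the underlying inference call
        self.enabled = enabled             # kill switch, independent of the rest of the product
        self.output_counts = Counter()     # output distribution tracked over time

    def predict(self, user_input):
        if not self.enabled:               # kill switch tripped: caller falls back to a safe default
            return None
        # Log a hash of the input rather than raw text (privacy protection).
        input_id = hashlib.sha256(user_input.encode()).hexdigest()[:12]
        output = self.model_fn(user_input)
        self.output_counts[output] += 1
        logger.info("input=%s output=%s ts=%s", input_id, output, time.time())
        if output in self.ALERT_LABELS:    # trigger an alert on a defined unusual output
            logger.warning("ALERT: unusual output %r for input %s", output, input_id)
        return output

    def report_harm(self, input_id, note):
        # Feedback loop: user reports of harmful outputs flow back into this layer.
        logger.warning("user-report input=%s note=%s", input_id, note)
```

The point of the wrapper is that every property Sheth lists — logging, distribution tracking, alerting, the kill switch, the feedback loop — lives in one place that can be audited independently of the model itself.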
“None of this is exotic,” she observes. “All of this is what you would do for a recommendation system at a music company, because the music company has a quality bar and an internal accountability process. The mental health domain has to apply the same disciplines, but with a higher quality bar and a more serious accountability process — because the consequences of getting it wrong are not a user listening to the wrong song. The consequences are a user being offered the wrong response in a moment when they most needed the right one.”
The Honesty Problem in AI-Generated Demos
A pattern that drew sharper criticism in Sheth’s reviews was what she described as the honesty gap in AI-generated demonstration material. Several MINDCODE submissions used AI to produce parts of the product itself, including video walkthroughs, voice narration, or generated text that was presented as the team’s own communication. For an engineer who has spent her career building systems where the line between machine output and human authorship is operationally consequential, this practice was not aesthetically distasteful — it was a sign of an architectural confusion that would matter in production.
“AI doing a presentation of a project is not the same as a human explaining the project,” Sheth notes. “When a real human walks through a demo, you learn what the human understands about the system. When an AI walks through the demo, you learn what the team is willing to claim about the system. Those are very different things, and the second one is much weaker evidence that the team understands what they have built. In a domain as serious as mental health software, the willingness to substitute AI presentation for actual understanding is a signal about how the team will handle the next architectural decision they have to make.”
Her observation was not a stylistic preference. It was a structural concern. The teams that produced AI-generated demonstration material were, in her experience, the same teams that produced AI-generated documentation, AI-generated safety claims, and AI-generated explanations of how their model handled crisis input. None of those things should be AI-generated in a system that the user is supposed to trust with their mental health. The integrity of the system has to come from somewhere, and if the team has outsourced the integrity to a model, the system does not have it.
“The strongest projects in my batch all had a real human walking through the work,” she observes. “Even when the implementation had limitations, the team understood what those limitations were and could explain them. That is the discipline I want to see in this space — the willingness to be honest about what the product does and does not do, and to take responsibility for the explanation rather than handing it to a model.”
When Innovation Outruns Safety
Across her reviews, Sheth flagged a recurring tension between innovation and safety in mental health software. The pressure to demonstrate novelty in 72 hours drove teams toward features that pushed the boundary of what AI in mental health could responsibly do. Some of those features were impressive. Some of them were ahead of the safety architecture that should have accompanied them. And some of them were designed in a way that suggested the team had not yet been forced to consider what would happen when the feature encountered an edge case it could not handle.
“There is a category of feature that looks innovative until you ask what the worst-case path through it looks like,” Sheth observes. “An AI mood interpreter that triggers an intervention is a great idea until you realize the intervention might fire in response to a journal entry that was actually a user describing a crisis they wanted to be heard, not a user asking for the system to act. A wellness coaching agent is a great idea until you realize the coaching might contradict the actual clinical advice the user is following. A mental state classifier is a great idea until you realize the classification is now in a database that will outlive the user’s relationship with the product.”
Her recommendation to teams in this space was a matter of discipline: build the safety architecture before you build the feature, not after. Define the failure modes you are willing to accept and the ones you are not. Build a path for the user to override the system’s interpretation of their state. Build a path for the user to delete the model’s session record. Build a path for the system to defer to a human when the input is ambiguous in a way the model cannot reliably resolve. None of these is a creative limitation. All of them are the substrate that allows creative work to be done responsibly.
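The three paths above — user override, deferral to a human, and deletion — can be sketched in a few dozen lines. This is an illustrative sketch under stated assumptions: the `classify_mood` callable, its `(label, confidence)` return shape, and the confidence threshold are all hypothetical.

```python
# Hedged sketch of the safety rails described above; the classifier
# signature and the threshold are illustrative assumptions, not a real API.

CONFIDENCE_FLOOR = 0.80            # below this, the system defers rather than acts

def interpret_entry(classify_mood, entry, user_override=None):
    """Return (label, source). The user's own statement of their state
    always wins over the model's interpretation of it."""
    if user_override is not None:              # path for the user to override the system
        return user_override, "user"
    label, confidence = classify_mood(entry)   # assumed to return (label, confidence)
    if confidence < CONFIDENCE_FLOOR:          # ambiguous input: defer to a human
        return None, "deferred_to_human"
    return label, "model"

class SessionStore:
    """Session records with an explicit deletion path."""

    def __init__(self):
        self._records = {}

    def append(self, user_id, record):
        self._records.setdefault(user_id, []).append(record)

    def delete_user(self, user_id):
        # Deletion path: the model's session record does not outlive
        # the user's relationship with the product.
        self._records.pop(user_id, None)
```

The design choice worth noting is that deferral is a first-class return value, not an exception: the caller is forced to handle the “the model does not know” case on every path.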
“The teams that scored highest in my batch were not necessarily the most innovative,” she notes. “They were the teams whose innovation was bounded by an explicit safety architecture — where the team had thought, before they shipped the feature, about the worst path the user could walk through it and had built rails to keep the worst path from being the default. That bounded innovation is more valuable in this domain than unbounded innovation, because the unbounded version exposes the user to risk the user did not consent to bear.”
Data Architecture as the Substrate of Trust
A theme that ran through Sheth’s evaluations was how the data architecture of a mental health product determines the trust a user can rationally place in it. The user cannot see this layer directly: it is the schema decisions, the storage decisions, the retention policies, the access control rules, the audit logs, the deletion paths, and the way all of those interact under load. But the user can see the consequences of getting it wrong, and so can the regulators and the engineers who eventually inherit the system from the team that built it.
“In the modern data stack, you can build systems that are operationally honest about what they remember and what they forget,” Sheth observes. “Apache Iceberg gives you table-format guarantees about schema evolution and time travel that let you reason about historical data states. AWS gives you the infrastructure to enforce data residency and access control at scale. Python and SQL give you the tools for implementation. The question is whether the team uses those tools deliberately or accidentally. The teams that use them deliberately produce systems whose data architecture supports the trust the product asks the user to place in it. The teams that use them accidentally produce systems where the user’s data is wherever the path of least resistance leaves it.”
Her advice for hackathon teams in this space was concrete and unromantic: write down the data flow. For every input the user provides, document where it goes, how long it stays there, who can access it, and what happens to it when the user deletes their account. Most teams could not do this when she asked them to. The few who could had built systems that her operational reflexes recognized as defensible. The rest had built systems that would inevitably, in production, disclose data the user thought they had not disclosed.
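The “write down the data flow” exercise can be made machine-checkable. A minimal sketch, assuming hypothetical field names and stores — the inputs, destinations, and retention figures below are illustrative, not drawn from any real submission:

```python
# Hedged sketch: a machine-readable version of "write down the data flow".
# Every field name, store, and retention period here is an illustrative assumption.

from dataclasses import dataclass

@dataclass(frozen=True)
class DataFlow:
    input_name: str          # what the user provides
    stored_in: str           # where it goes
    retention_days: int      # how long it stays there
    accessible_to: tuple     # who can access it
    on_account_delete: str   # what happens when the user deletes their account

FLOWS = [
    DataFlow("journal_entry", "entries table (encrypted)", 30,
             ("app_backend",), "hard delete"),
    DataFlow("mood_label", "analytics warehouse", 90,
             ("app_backend", "analytics"), "anonymize"),
]

def audit(flows):
    """Fail loudly if any flow leaves one of the four questions unanswered."""
    problems = []
    for f in flows:
        if f.retention_days <= 0:
            problems.append(f"{f.input_name}: no retention bound")
        if f.on_account_delete not in {"hard delete", "anonymize"}:
            problems.append(f"{f.input_name}: unclear deletion behavior")
        if not f.accessible_to:
            problems.append(f"{f.input_name}: no access list")
    return problems
```

Run in CI, an audit like this turns the question Sheth asked the teams — where does each input go, and what happens on deletion — into a check the build cannot pass without answering.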
“Data architecture in mental health software is not a back-end concern,” she reflects. “It is the substrate of the trust relationship between the product and the user. If the back end does not support the trust the front end implies, the product is making promises it cannot keep. That is the gap I want hackathon teams in this space to start closing in the first 72 hours, not the last.”
What the Strongest Submissions Demonstrated
The submissions that scored highest in Sheth’s batch shared a quality that her background in ML operationalization made impossible to ignore. They had thought about the AI not as the central feature of the product, but as one component in a system that had to be safe, observable, and recoverable. They had thought of the data not as a passive resource to be queried, but as a structured asset with retention obligations and access constraints. They had thought about the user not as an abstract recipient of intelligence, but as a specific person whose mental health was being entrusted, in some small way, to the product the team had just built.
“The teams that took the work seriously,” Sheth notes, “produced submissions that I would feel comfortable handing to a real user in a real moment. The teams that did not take it seriously produced submissions that were impressive demos and would have been irresponsible deployments. That gap is the most important variable in the mental health software space right now, and it does not get smaller because the team had a great idea or wrote elegant code. It only gets smaller because the team made specific architectural decisions about safety, observability, and data integrity that the rest of the field has not yet been forced to make.”
Her closing observation was deliberately practical. The ML operationalization disciplines that other regulated industries have already developed are not secrets. The patterns are public. The frameworks are mature. The hard part is the willingness to apply them in a domain that is still largely being built by teams that have not yet faced the failures that would teach them to want the discipline in the first place.
MINDCODE 2026 — Software for Human Health was an international 72-hour hackathon organized by Hackathon Raptors from February 27 to March 2, 2026, with the official evaluation period running March 3–14. The competition attracted over 200 registrants and yielded 21 valid submissions in the mental health and wellness domain. Submissions were independently reviewed by a panel of judges across three evaluation batches. Projects were assessed against five weighted criteria: Impact & Vision (35%), Execution (25%), Innovation (20%), User Experience (15%), and Presentation (5%). Hackathon Raptors is a United Kingdom Community Interest Company (CIC No. 15557917) that curates technically rigorous international hackathons and engineering initiatives focused on meaningful innovation in software systems.