
Amazon Principal Supply Chain Manager Sushil Choubey on Why a Slop Detector Has to Pass the Same Test as a Factory Inspection Station

Every quality-control station on a production line answers two questions before it earns its place: can it keep up with the line, and does it catch the defect before the defect reaches the customer. A station that needs a human to hand-feed each unit is not a station, it is a bottleneck wearing a badge. A station that passes flawed units through while looking busy is worse than no station at all, because it gives everyone downstream permission to stop checking. Sushil Choubey has spent twelve years building and running systems that answer those two questions at Amazon scale, and when he sat down to evaluate AI slop detectors, he asked them of every tool in front of him.

Choubey, a Principal Supply Chain Manager at Amazon whose career spans supply chain management, program management, and operations, judged the AI Slop Scan Hackathon through an operations lens the panel otherwise lacked. To him a slop detector is not a clever algorithm. It is an inspection station on a content production line, and it lives or dies on throughput, manual touch points, and whether it actually stops defects from flowing downstream.

AI Slop Scan, organized by Hackathon Raptors, asked teams to build tools that detect, measure, or mitigate AI-generated low-quality content across code review, documentation, marketplace reviews, and general writing. Forty-three teams shipped. Most of them framed the work as a detection problem. Choubey reframed it as a throughput problem, because that is the question that decides whether a detector is ever used twice. A factory does not adopt an inspection method because it is accurate in a lab. It adopts it because it survives contact with the line: the volume, the speed, the operators who will route around anything that slows them down.

His career is a long study in exactly that gap between a process that works in a demo and one that works at volume. “In operations you learn fast that the smartest solution that needs a person babysitting it is not a solution, it is a cost,” he says. “The questions I ask are boring and they are the only ones that matter. How many of these can it handle an hour. Where does a human have to touch it. What happens when the input is malformed. Where does it slow down, and where does it quietly let a bad unit through. I asked these slop detectors the same things I ask a sortation process, because a detector that cannot run at the speed of the content it inspects is decoration.”

The bottleneck that breaks the line

The clearest example of the failure mode Choubey is trained to spot came from a project called Orbit. The tool worked, but it required a user to manually add reviews before it could produce an output. To most evaluators that is a minor setup step. To an operations leader it is the whole problem.

“The product requires manual reviews to be added to get the output, which can be a blocker for usability,” he wrote in his assessment. He is direct about why that single requirement caps the tool’s ceiling. “The moment a process needs a human to load it by hand, you have put a person in the critical path, and that person is now your throughput limit. It does not matter how good the analysis is downstream. The line moves at the speed of the slowest manual step, and you just made the manual step mandatory. In a real deployment, where the whole point is to inspect content at the rate it is produced, a tool that needs to be hand-fed will sit unused while the content it was supposed to check flows right past it.” For Choubey, the engineering behind Orbit was not the issue. The operating model was, and the operating model is what determines whether anyone runs it at scale.
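The arithmetic behind "the line moves at the speed of the slowest manual step" can be sketched in a few lines of Python. This is an illustration only: the stage names and hourly rates below are hypothetical and are not measurements of Orbit or any other submission. It assumes a simple serial line in which every unit passes through every stage.

```python
def line_throughput(stage_rates_per_hour):
    """Sustained rate of a serial line: the rate of its slowest stage."""
    if not stage_rates_per_hour:
        raise ValueError("need at least one stage")
    return min(stage_rates_per_hour.values())


def bottleneck(stage_rates_per_hour):
    """Name of the stage that limits the whole line."""
    return min(stage_rates_per_hour, key=stage_rates_per_hour.get)


# Hypothetical rates (units per hour), chosen only to illustrate the point.
stages = {
    "manual_load": 60,      # a person pasting reviews in by hand
    "analysis": 50_000,     # the automated detector
    "report": 20_000,       # writing out results
}

print(bottleneck(stages), line_throughput(stages))
```

Under these assumed numbers, making the analysis stage ten times faster changes nothing. Only removing or automating the manual load raises the rate of the line.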

Constraints that look small and cap the whole system

A related pattern showed up in Momo, a multi-domain detector Choubey found genuinely capable, undercut by limits on how much it could process at once. “The character limit or file limit is really restricting the usability of the tool,” he wrote. The note is short, and the principle behind it is one he applies constantly.

“A throughput cap is the most underestimated defect there is, because it does not show up in the demo,” he says. “In a demo you paste in one short example and it works perfectly. Then someone points it at a real corpus, a whole documentation set, a quarter of pull requests, a feed of thousands of reviews, and the limit that seemed reasonable becomes the wall the entire workflow hits. I have watched good processes die on exactly this. Not because they were wrong, but because they were sized for the sample and not for the volume. An inspection tool has to be built for the busiest day, not the demo day.” His scoring consistently rewarded tools that could plausibly run at content scale and marked down ones whose constraints, however reasonable on a single input, would collapse against real volume.
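One common way to keep a per-request character limit from becoming the wall a corpus hits is to split each document into pieces that fit the limit and combine the results. The sketch below is a minimal illustration under stated assumptions, not a description of how Momo or any other submission works: `score_chunk` is a hypothetical stand-in for whatever detector is being called, and the length-weighted average is one simple choice for combining chunk scores, not the only one.

```python
from typing import Callable, Iterable, Iterator, List


def chunk_text(text: str, max_chars: int) -> Iterator[str]:
    """Yield consecutive slices of text, each at most max_chars long."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    for start in range(0, len(text), max_chars):
        yield text[start:start + max_chars]


def scan_corpus(
    documents: Iterable[str],
    score_chunk: Callable[[str], float],  # hypothetical detector call
    max_chars: int,
) -> List[float]:
    """Return one length-weighted mean score per document."""
    results = []
    for doc in documents:
        total_chars = 0
        weighted_sum = 0.0
        for chunk in chunk_text(doc, max_chars):
            weighted_sum += score_chunk(chunk) * len(chunk)
            total_chars += len(chunk)
        # An empty document has nothing to score; report 0.0 by convention.
        results.append(weighted_sum / total_chars if total_chars else 0.0)
    return results
```

Because `documents` can be any iterable, the same function accepts a generator that reads files or a feed lazily, so memory use stays tied to one document at a time rather than to the size of the corpus. Fixed-width slicing can cut a sentence in half, so a real tool would likely split on paragraph or sentence boundaries instead.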

Inspecting the inspector

The most revealing moment of Choubey’s evaluation was when he turned a detector on itself. Reviewing a tool called ahead, he did what an incoming-inspection manager does by reflex: he ran the product against real input and read the output critically. “I used to rate the code for this tool and it gave me 0.52 rating,” he wrote. “Multiple hollow comments or type-gaps detected. Review codebase stubs.”

He had fed the tool the tool’s own codebase, and the detector reported that the thing inspecting slop was itself full of hollow comments and stubbed-out logic. He treats this as the single most important test a quality station can be put to. “If your inspection tool flags your own work as defective, that is not embarrassing, that is the most honest result in the building,” he says. “It tells you the detector is actually measuring something, and it tells you the team shipped a station that would have failed its own inspection. In operations we audit the auditors for exactly this reason. A scale that cannot weigh itself accurately is not a scale. A slop detector whose own codebase is full of stubs and empty comments is reporting, truthfully, that it is not yet a product. The number, 0.52, is more useful than any pitch, because it is a measurement instead of a claim.”

Differentiation is an operations question too

Not every note Choubey made was about speed. Reviewing SoloByte, he questioned whether the tool earned its place at all. “It is not clear how this is different than an inbuilt task manager, and the usability of this product,” he wrote. The instinct behind that is the same one a supply chain leader applies to any proposed new step in a process: does this station add value, or is it duplicating something the line already does for free.

“Every step you add to a process has to justify its existence, because every step is cost, latency, and another thing that can break,” he says. “When I look at a tool and cannot immediately tell what it does that the existing, free, already-installed option does not, that is not a small marketing gap. It is a sign the team did not define the defect they are catching. An inspection station exists to catch a specific failure that nothing else catches. If you cannot name that failure, you have not built a station, you have added a step.” It is the discipline of an operator who has spent years removing redundant steps from systems, applied to a field that tends to add them.

What a slop detector needs to survive the line

Choubey’s evaluations resolve into a short checklist, the operations version of a detection rubric. It is the set of questions he would put to any team shipping a tool meant to inspect content at volume.

Where is the human in the critical path, and can you remove them. A tool that must be hand-loaded, like Orbit, runs at human speed no matter how fast its analysis is. The goal is a station the line feeds automatically.

What is the throughput ceiling, and is it sized for the busiest day. Character and file limits that look reasonable on a single input, as in Momo, become the wall a real corpus hits. Build for volume, not for the demo.

Does it catch defects when you turn it on real input, including its own. The ahead result, 0.52 on its own codebase, is the kind of honest measurement that separates a working station from a confident one. Run the inspector against reality before you trust it.

Does this station catch a defect nothing else catches. If a tool’s function overlaps something already installed and free, as with SoloByte, it has not defined the failure it exists to stop. Name the defect first.

Is the output usable without a second pass. Tools that require manual cleanup, re-formatting, or a person to interpret the result are adding labor, not removing it, and labor is the cost an inspection step is supposed to save.

The verdict, and the discipline coming to content

For Choubey, AI Slop Scan was a preview of a problem operations solved decades ago, arriving now in a new domain. As AI floods every pipeline with content, organizations are about to face the same question a factory faces with a flood of units: you cannot inspect everything by hand, so the inspection itself has to be automated, fast, and trustworthy enough that people stop checking behind it. That is a quality-control problem, and quality control is a solved discipline with hard-won rules.

“The content world is about to learn what manufacturing learned a long time ago,” he says. “When volume explodes, manual inspection stops being an option, and the quality of your automated inspection becomes the quality of everything you ship. The teams that treated their detector as a station on a line, something that has to run fast, run unattended, and actually stop defects, were building the right thing. The ones that built a clever analysis that needs a human to operate it built a demo. The next few years are going to be about putting inspection stations on every content pipeline in the company, and the people who understand throughput, bottlenecks, and honest measurement are going to build the ones that hold. Accuracy in a lab is where this starts. Surviving the line is where it counts.”

AI Slop Scan was organized by Hackathon Raptors, a Community Interest Company supporting innovation in software development. The event challenged 43 teams to build tools that detect, measure, and mitigate AI-generated low-quality content across code review, documentation, marketplace reviews, and general writing. Sushil Choubey, a Principal Supply Chain Manager at Amazon with more than twelve years in supply chain management, program management, and operations, served as a judge evaluating submissions for detection accuracy, practical usefulness, technical execution, innovation, and presentation.
