Security Observability

Okta Director of Security Engineering Arun Kumar Elengovan on What Security Observability Teaches About Monitoring Systems in Permanent Failure States

The engineer who built observability infrastructure for a platform handling more than a billion identity transactions daily evaluated hackathon projects in which system failure is the intended outcome. He discovered that the hardest problem in security operations applies just as forcefully when collapse is the feature, not the bug.

In 2019, Capital One’s security team had every tool a modern Security Operations Center could want. SIEM platforms ingested logs from hundreds of services and monitored alert rules for known attack patterns. Dashboards visualized traffic anomalies in real time. A former Amazon engineer still exfiltrated 100 million customer records by exploiting a misconfigured web application firewall—a misconfiguration that generated alerts the SOC had already classified as noise. The breach wasn’t a failure of detection technology. It was a failure of observability design: when every system constantly generates alerts, the one alert that matters drowns in a sea of expected abnormalities.

This is the central paradox of modern security observability. At scale, Security Operations Centers process between 10,000 and 150,000 alerts per day. Studies by the Ponemon Institute consistently find that more than 40 percent are false positives. Analyst fatigue isn’t a staffing problem — it’s an architectural one. The systems designed to detect threats become threats themselves when they desensitize the humans who depend on them.

Arun Kumar Elengovan has spent his career on both sides of this equation. As Director of Security Engineering at Okta, he founded and leads the Engineering Security organization that protects the identity platform, which serves millions of users across 19,000 customer organizations. Okta processes authentication events at a scale where “normal” encompasses enormous behavioral variance — users signing in from new devices, traveling across time zones, accessing applications at unusual hours. A Forbes Technology Council member with international recognition in applied cryptography and AI-driven defense systems, Elengovan has built the kind of monitoring infrastructure where baseline instability is the starting condition, not the exception.

That experience gave him an unusual lens for evaluating System Collapse 2026 — a 72-hour hackathon organized by Hackathon Raptors where 26 teams built software designed to thrive on instability. Elengovan assessed eight projects in his batch, and his scoring patterns reveal the concerns of an engineer whose professional life revolves around a single problem: distinguishing genuine security incidents from the ambient noise of systems whose normal state already appears to be failure.

Shifting Baselines and the Death of Anomaly Detection

Security observability depends on baselines. User and Entity Behavior Analytics platforms build behavioral profiles—login times, access patterns, data volumes, and geographic locations—and flag deviations that exceed statistical thresholds. The approach works when “normal” is relatively stable. A user who logs in from New York every weekday at 9 am and suddenly authenticates from Lagos at 3 am triggers an obvious anomaly. The approach collapses when the baseline itself is in constant motion.
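The statistical-threshold idea behind that example can be sketched in a few lines. This is a toy z-score test with invented numbers; real UEBA platforms build far richer multidimensional profiles, but the core question is the same: how many standard deviations from the baseline is this event?

```python
from statistics import mean, stdev

def is_anomalous(history_hours, new_hour, threshold=3.0):
    """Flag a login hour that deviates from the user's historical
    baseline by more than `threshold` standard deviations."""
    mu, sigma = mean(history_hours), stdev(history_hours)
    if sigma == 0:
        return new_hour != mu
    return abs(new_hour - mu) / sigma > threshold

# A user who logs in around 9 am every weekday (illustrative data):
baseline = [9, 9, 10, 9, 8, 9, 9, 10, 9, 9]
print(is_anomalous(baseline, 9))   # another 9 am login: not anomalous
print(is_anomalous(baseline, 3))   # a 3 am login: anomalous
```

The approach is only as good as the assumption that `history_hours` describes a stable population, which is exactly the assumption the next sections pick apart.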

Gravity Shift by team Nomrelol_ earned a perfect 5.00 in Elengovan’s evaluation across all three criteria. The player guides a ball through eight levels, where each collision increases an entropy counter, causing visual corruption, physical inversions, and reality distortions. The player navigates not just the current state of the world but the compounding effects of every previous interaction. Controls that worked reliably in level one behave unpredictably by level four. The environment that was “normal” five seconds ago no longer exists.

“In security monitoring, we call this baseline drift,” Elengovan explains. “Your SIEM learns what normal looks like for a system or user over time. However, if the system itself is changing—deployments, configuration updates, traffic pattern shifts—the baseline becomes a moving target. Gravity Shift captures this perfectly. The player can’t rely on what worked before because the rules have changed underneath them.”

The game’s entropy mechanic mirrors a specific problem that plagues large-scale identity platforms. Okta’s customers deploy configuration changes frequently—new applications added to SSO, conditional access policies modified, user groups restructured. Each change shifts the behavioral baseline. A UEBA platform that learned “normal” last month may flag legitimate behavior this month as anomalous simply because the organization changed its own rules. The result is alert fatigue: analysts encounter so many false positives from baseline drift that they begin to ignore the alert category entirely.
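One common mitigation for baseline drift is to let the baseline itself adapt, for example with an exponentially weighted moving average so that recent behavior dominates older history. A minimal sketch, with an invented step change standing in for an organization restructuring its SSO configuration:

```python
def ewma_baseline(observations, alpha=0.3):
    """Exponentially weighted moving average: recent observations
    dominate, so the baseline tracks legitimate behavioral drift."""
    baseline = observations[0]
    for x in observations[1:]:
        baseline = alpha * x + (1 - alpha) * baseline
    return baseline

# Daily login counts before and after a config change (illustrative):
before = [100] * 10                # stable period: ~100 logins/day
after = [100] * 10 + [180] * 10    # step change: new "normal" is 180

static = sum(before) / len(before)   # frozen baseline stays at 100
adaptive = ewma_baseline(after)      # re-converges toward 180
```

A frozen baseline keeps flagging the new normal indefinitely; the adaptive one stops within days. The cost, of course, is that an attacker who moves slowly enough can drag the adaptive baseline along with them, which is its own trade-off.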

Gravity Shift makes this dynamic visceral. By level eight, the player has accumulated so much entropy that every aspect of the game behaves abnormally. There is no baseline to deviate from because deviation is the permanent state. The player must develop a different kind of situational awareness—not “is this normal?” but “is this the kind of abnormal I expect given what I’ve done so far?” This is precisely the cognitive shift that senior SOC analysts develop over years of experience: the ability to distinguish expected abnormality from unexpected abnormality in systems that are never truly stable.

Incident Response Patterns in Recovery Systems

When a security incident is detected, the response follows a well-documented lifecycle: identification, containment, eradication, recovery, and lessons learned. The recovery phase is where most organizations struggle. Restoring a compromised system isn’t simply rolling back to a pre-incident state — the attacker may have planted persistence mechanisms or modified configurations in ways that aren’t immediately visible. True recovery requires understanding not just what changed during the incident but what the system learned from surviving it.

After the Stroke by team Gladiators also earned a perfect 5.00 from Elengovan. The project is an evolutionary drawing application in which strokes persist, decay, and mutate over time, creating emergent glitch art through the collapse of an autonomous system. Users deliberately draw something; the system transforms it into an unrecognizable form through sustained degradation. The original input survives only as a ghost — a trace of what was, embedded in what has become.

“Incident response teams deal with this exact dynamic,” Elengovan observes. “After a breach, you clean up the obvious artifacts — revoke compromised credentials, patch the vulnerability, remove the malware. But the system isn’t the same system it was before. Configuration changes made during containment persist. Emergency access grants that were never revoked. Monitoring rules added during the incident now generate alerts for which no one remembers the context. The system carries its incident history in its current state.”

After the Stroke’s decay mechanics model what security architects call “configuration drift under incident pressure.” During an active incident, teams make rapid changes to contain the threat: modifying firewall rules, revoking access, adjusting network segmentation. These changes are necessary but often poorly documented. The drawing the user created remains technically available, but subsequent response actions have transformed it. “Every major incident at Okta leaves behind a layer of configuration changes,” Elengovan notes. “Six months later, someone asks why a particular firewall rule exists, and nobody can trace it back to the incident that created it. The system remembers the response even when the responders have forgotten.”

The project’s use of a Go backend with WebSocket-based real-time synchronization introduces an additional dimension of observability. Multiple users can draw simultaneously, meaning the system’s state reflects the combined interactions of multiple independent agents—the same challenge that enterprise SIEM platforms face when correlating events from dozens of data sources, each generating its own stream of potentially meaningful signals.

Modeling Alert Fatigue Through Stress and Burnout

The human cost of security observability failures is well documented. A 2022 study by Tines found that 71 percent of SOC analysts reported burnout, with alert fatigue cited as the primary cause. Analysts who process thousands of alerts daily develop coping mechanisms that reduce their effectiveness: automatically closing alert categories known to produce false positives, or cutting investigation time below the threshold at which meaningful analysis is possible. The monitoring system works perfectly. The humans monitoring the monitors do not.

Life Simulator by team VrajC0Dee, a browser-based particle simulation disguised as a shop management game, scored 4.70 in Elengovan’s evaluation. Users restock inventory, serve customers, and adjust parameters like greed and hustle. The system models stress, burnout, and renewal through a collapse mechanism in which excessive effort causes the entire simulation to break down. The collapse isn’t a failure state — it’s the thematic core. The system tells you, through its mechanics, that sustained maximum performance is unsustainable.

“This is the SOC analyst problem expressed as a game mechanic,” Elengovan notes. “You can run your security operations at maximum alert sensitivity. Every anomaly gets investigated. Every potential threat gets triaged. But the humans doing the work have finite cognitive bandwidth. Push them beyond their sustainable throughput, and the quality of their analysis collapses—not gradually, but catastrophically. One missed critical alert during a burnout episode can be worse than running at reduced sensitivity from the start.”

Life Simulator’s greed and hustle parameters map directly to SOC configuration decisions. Increasing alert sensitivity (hustle) catches more potential threats but generates more noise. Expanding the scope of monitoring (greed) provides broader visibility but demands more analyst attention. The game teaches through consequence what security architects learn through incidents: the optimal operating point is never the maximum operating point. A SIEM configured to alert on every deviation will generate so many notifications that the team stops taking any of them seriously.
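The trade-off can be made concrete with a toy model: assume an analyst team can properly investigate only a fixed number of alerts per day, and that triage quality degrades once volume exceeds that capacity. Every number below is invented for illustration; the point is the shape of the curve, not the values.

```python
def expected_catches(alerts_per_day, true_positive_rate, capacity=200):
    """Toy model of analyst throughput: at most `capacity` alerts per
    day get a proper investigation, and triage quality on even those
    degrades once total volume exceeds capacity (a crude stand-in
    for fatigue)."""
    investigated = min(alerts_per_day, capacity)
    quality = 1.0 if alerts_per_day <= capacity else capacity / alerts_per_day
    return investigated * true_positive_rate * quality

# Cranking sensitivity up tenfold floods the queue but catches less:
low = expected_catches(200, 0.02)    # volume matched to capacity
high = expected_catches(2000, 0.02)  # capacity-bound and fatigued
```

Under these assumptions, the tenfold-sensitivity configuration catches fewer real threats than the matched one: the extra alerts are never investigated, and the fatigue penalty erodes the investigations that do happen. That is the “optimal operating point is never the maximum operating point” lesson in arithmetic form.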

Other evaluators recognized the simulation’s depth beneath its deceptively simple interface. Harshit Kohli noted “perfect theme execution, zero dependencies, elegant orbital physics, exceptional system design.” Shubhankar Shilpi described “a surprisingly thoughtful and emotionally resonant simulation” where “the collapse mechanic is not just a failure state — it’s the thematic core of the experience.” For Elengovan, what resonated was the system’s modeling of a truth that security operations teams learn painfully: the humans-in-the-loop are the most fragile component, and no amount of tooling compensates for the cognitive damage inflicted by chronic overload.

Cascading Failure and Lateral Movement Detection

In cybersecurity, lateral movement refers to an attacker’s progression through a network following initial compromise. The attacker gains access to one system, uses it to discover and access adjacent systems, and gradually expands their foothold until they reach high-value targets. Detecting lateral movement is one of the hardest problems in security observability because each action — such as a user accessing a file share, querying a directory service, or connecting to a database — is individually legitimate. The threat exists only in the pattern: the sequence, velocity, and scope of access across systems that no single monitoring point can see in isolation.

System Collapse by team keystone earned 4.00 from Elengovan. The game inverts the typical relationship between player action and world stability: every action the player takes destabilizes the environment. Progress and decay are the same mechanic. The goal isn’t to win but to experience the most architecturally interesting destruction possible before the system gives in entirely. Andrei Dzeikalo, another evaluator, described the experience as understanding “how tough it is to keep it under control while still maintaining a balance”—a sentiment that security operations teams would recognize immediately.

“Lateral movement in a real breach looks exactly like this,” Elengovan explains. “The attacker doesn’t break in and immediately grab the crown jewels. They move methodically—accessing one system, pivoting to the next, escalating privileges incrementally. Each action is defensible. The accumulated effect is catastrophic. System Collapse models this brilliantly: each player’s action is small, but the compound effect destabilizes everything.”

The challenge of detecting lateral movement is fundamentally an observability correlation problem. Network monitoring sees connections. Endpoint detection sees process execution. Identity platforms see authentication events. No single data source reveals the full attack chain. Security teams address this through SOAR platforms that correlate events across sources, building timeline views that reveal patterns invisible to any individual sensor.
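The correlation step can be sketched as grouping events per entity, sorting them into a timeline, and flagging entities whose activity spans several independent sensors inside a short window. The event shapes, field names, and thresholds below are hypothetical, not any specific SIEM or SOAR product’s API:

```python
from datetime import datetime, timedelta

# Hypothetical events from three independent sensors. Each is
# defensible in isolation; only the combined timeline is suspicious.
events = [
    {"src": "identity", "user": "alice", "ts": "2026-01-10T02:01:00", "action": "login"},
    {"src": "network",  "user": "alice", "ts": "2026-01-10T02:04:00", "action": "smb_connect"},
    {"src": "endpoint", "user": "alice", "ts": "2026-01-10T02:06:00", "action": "new_service"},
    {"src": "identity", "user": "alice", "ts": "2026-01-10T02:09:00", "action": "priv_escalation"},
]

def correlate(events, window_minutes=15, min_sources=3):
    """Flag users whose events span at least `min_sources` independent
    sensors within a single `window_minutes` window."""
    by_user = {}
    for e in events:
        by_user.setdefault(e["user"], []).append(e)
    flagged = []
    for user, evts in by_user.items():
        evts.sort(key=lambda e: e["ts"])  # ISO timestamps sort lexically
        times = [datetime.fromisoformat(e["ts"]) for e in evts]
        sources = {e["src"] for e in evts}
        if (times[-1] - times[0] <= timedelta(minutes=window_minutes)
                and len(sources) >= min_sources):
            flagged.append(user)
    return flagged

print(correlate(events))  # flags alice: 3 sensors within 8 minutes
```

Real correlation engines weigh far more context, but the structure is the same: the signal lives in the join across sources, not in any single row.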

System Collapse’s progressive destabilization captures the defender’s experience during an active breach. Each signal is ambiguous in isolation. The compound picture is clear only in retrospect, after the damage is done. The game forces players into the same cognitive position: watching a system deteriorate, unable to determine the exact point where manageable instability becomes irreversible collapse.

Emergent Threats and Zero-Day Detection

Conway’s Game of Life produces complex emergent behavior from simple deterministic rules—glider guns, spaceships, and self-replicating patterns that cannot be predicted without running the simulation. None of these behaviors is encoded in the rules. They emerge from interactions that resist analytical prediction.

The Variant of Conway’s Game of Life by Team Lawless extends this principle by mutating the rules themselves every N generations. The cellular automaton doesn’t just produce emergent behavior — it produces emergent behavior from rules that are themselves changing. Patterns that were stable under one rule set become unstable under the next. Structures that were impossible become inevitable. The system generates novelty not just from complexity but from the instability of its own governing logic.
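The mechanic is simple to sketch: run a standard Life-like automaton, but periodically toggle a neighbor count in the birth or survival rule set. The team’s actual implementation is in C++; this Python sketch only illustrates the idea, with classic B3/S23 as the starting rules.

```python
import random

def step(cells, birth, survive):
    """One generation of a Life-like automaton over a set of live cells."""
    counts = {}
    for (x, y) in cells:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx or dy:
                    n = (x + dx, y + dy)
                    counts[n] = counts.get(n, 0) + 1
    return {c for c, n in counts.items()
            if (n in birth and c not in cells) or (n in survive and c in cells)}

def run(cells, generations, mutate_every=10, seed=0):
    """Start from B3/S23; every `mutate_every` generations, randomly
    toggle one neighbor count in the birth or survival set."""
    rng = random.Random(seed)
    birth, survive = {3}, {2, 3}
    for g in range(1, generations + 1):
        cells = step(cells, birth, survive)
        if g % mutate_every == 0:
            target = rng.choice([birth, survive])
            target.symmetric_difference_update({rng.randint(1, 8)})
    return cells

# A glider: under unmutated B3/S23 it translates by (1, 1) every
# four generations. After a rule mutation, that guarantee vanishes.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
```

The security analogy holds precisely here: a detection rule tuned to the glider’s trajectory is valid only until the governing rules mutate underneath it.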

Elengovan scored the project 4.00, and his evaluation reflects a specific concern that preoccupies security observability teams in organizations such as Okta. “Zero-day detection is the hardest problem in security because you’re looking for something you’ve never seen before,” he explains. “Signature-based detection fails because there’s no signature. Behavioral detection fails if the attack mimics legitimate behavior. The only reliable approach is anomaly detection against a well-understood baseline — and as we’ve discussed, baselines are themselves unstable.”

Conway’s variant illustrates the fundamental limits of rule-based security monitoring. “We maintain thousands of detection rules at Okta,” Elengovan says. “Each one encodes a known attack pattern. They catch what we’ve seen before. The problem is that sophisticated attackers study your defenses and mutate their approach specifically to avoid your signatures.” The mutating rules in the Conway variant mirror the behavior of sophisticated threat actors, who adapt their tactics based on the defenses they observe. Detection rules written for one attack pattern become obsolete when the attacker shifts to a new technique, just as cellular automata patterns that thrived under one rule set collapse when the rules mutate.

The project’s C++ implementation — praised by Dzeikalo as “an additional plus” for its “clean modular architecture” and “object-oriented structure” — demonstrates the engineering discipline required to build systems that operate under constantly changing conditions. In security observability, the equivalent discipline is building detection pipelines that ingest new threat intelligence and update behavioral models without rebuilding the entire monitoring stack. The organizations that detect zero-days are those whose observability infrastructure evolves as rapidly as the threats it monitors.

The Observability Paradox

Across his eight evaluations, Elengovan’s scores reveal a consistent pattern rooted in his experience of observability. The projects that scored highest — Gravity Shift and After the Stroke, at 5.00 each — implemented instability as a fundamental transformation of system behavior, mirroring the baseline drift and post-incident configuration entropy that make real-world security monitoring difficult. Projects in the middle range — Life Simulator at 4.70, System Collapse and Conway’s Variant at 4.00 each — modeled specific observability challenges: analyst burnout, lateral movement ambiguity, and the limits of rule-based detection. The lowest-scoring viable project, The Atomic Simulator at 3.00, implemented particle interactions without the kind of systemic feedback loops that make observability problems particularly challenging.

The distinction maps to what Elengovan calls the observability paradox. “The systems that are most important to monitor are the ones that are hardest to monitor,” he explains. “A stable system with predictable behavior is easy to observe — any deviation from the baseline is meaningful. But the systems that matter most in security are the ones under active attack, undergoing incident response, or operating under abnormal load. Those systems exhibit constant deviation. The challenge isn’t seeing what’s happening — it’s determining which of the thousand things happening right now is the one that requires immediate action.”

The System Collapse hackathon theme—systems that thrive on instability—unintentionally created a laboratory for exploring this paradox. Every project that embraced genuine instability, rather than decorative chaos, forced its users into the cognitive position of a security analyst during an active incident: surrounded by abnormal signals, unable to establish a reliable baseline, forced to make decisions about which anomalies matter and which are simply the system behaving as its new, unstable self.

“Most security professionals spend their careers trying to make systems more stable, more predictable, more observable,” Elengovan observes. “These hackathon projects did the opposite — they made instability the point. And in doing so, they accidentally demonstrated why security observability is so hard. You can’t monitor your way out of fundamental instability. You have to build systems — and teams — that can operate effectively within it. The best SOC analysts aren’t the ones who eliminate noise. They’re the ones who develop intuition for which noise matters.”

The shift from “detect all anomalies” to “develop judgment about which anomalies warrant investigation” is the maturity curve that every security operation follows. Junior analysts want more alerts. Senior analysts want better alerts. The most experienced analysts want fewer, higher-fidelity signals with enough context to enable immediate action. The hackathon projects that scored highest in Elengovan’s evaluation were those that required users to navigate the same curve—from confusion caused by overwhelming instability, through pattern recognition, to mastery of operating in a permanently degraded environment.

System Collapse 2026 was organized by Hackathon Raptors, a Community Interest Company supporting innovation in software development. The event featured 26 teams competing across 72 hours, building systems designed to thrive on instability. Arun Kumar Elengovan served as a judge evaluating projects for technical execution, system design, creativity, and expression.
