Low-quality data will prevent successful AI SOC transformation
Defenders have long known that richer evidence improves security outcomes by enabling faster triage, deeper analysis, and more complete investigation. Corelight was founded on this premise, but quantifying the impact of better network data has been hard for us. Until now.
Recently, we built an agentic test harness to measure the success of frontier LLMs in responding to real-world attack scenarios, using a range of source data.
For background, Corelight hosts over 100 Capture the Flag (CTF) exercises annually. These realistic, gamified competitions are an opportunity for security practitioners to train and prove their skills, while working with Corelight data to solve a range of challenges. As a result, we have access to a large, well-understood collection of CTF data that has never been published online and could not have been part of any LLM's training set.*
To explore the impact of data quality on SOC performance, we asked agents to compete in a realistic CTF exercise inspired by the Volt Typhoon threat actor. We then ran dozens of competitions, keeping the language model fixed and changing only the network data available.
The results were striking, so we set out to validate our findings with a second, independent experiment: an agentic application tasked with generating a series of incident response (IR) reports for a Salt Typhoon compromise. Just as before, we varied the data source available to the AI, this time to measure how much analytical power is lost when evidence is degraded.
With these experiments complete, we can now share the results:
- Corelight logs improved CTF scores by over 350% compared to NetFlow logs, and over 60% compared to firewall logs.
- Corelight logs provided evidence for nearly 50% more IR findings compared to firewall logs, and almost 300% more compared to NetFlow.
- Corelight data enabled AI to answer CTF questions almost twice as fast as AI working with lower-quality network logs.
In short, data quality has a critical impact on SOC performance.
Reassuringly, the leading frontier models performed roughly the same, and we observed very little hallucination. The complete research paper on this topic contains much more information about our methodology, results, and the steps we took to ensure fairness.
Our results show that data quality sets a hard ceiling on SOC performance. Even the most advanced models could not overcome the limitations imposed by low-quality network data. This finding has important implications for SOC architecture going forward, especially because efficiency improvements compound in an increasingly automated environment. If agentic automation boosts SOC efficiency by 10x in the next year or two, then choosing better data and lifting investigative success rates from 30% to 90% will multiply into a dramatic increase in overall efficiency.
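The compounding arithmetic is worth spelling out. A minimal sketch, using the illustrative figures above (a 10x automation boost and success rates of 30% versus 90%, which are scenario assumptions, not measured results):

```python
# Illustrative arithmetic for the compounding claim above. The 10x
# automation figure and the 30% / 90% success rates are the scenario's
# assumptions, not measured results.

def effective_throughput(automation_multiplier: float, success_rate: float) -> float:
    """Successful investigations per unit of baseline analyst effort."""
    return automation_multiplier * success_rate

baseline = effective_throughput(1.0, 0.30)               # manual SOC, low-quality data
automation_only = effective_throughput(10.0, 0.30)       # 10x automation, same data
automation_plus_data = effective_throughput(10.0, 0.90)  # 10x automation, better data

print(round(automation_only / baseline, 6))       # 10.0 -> automation alone
print(round(automation_plus_data / baseline, 6))  # 30.0 -> automation plus better data
```

Because the gains multiply, tripling the investigative success rate triples the value of every automation investment layered on top of it.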
The largest performance improvement available to any security organization wanting to use AI for automation is not an upgraded model. The most direct path to better SOC performance is better source data.
Invest in the evidence. Everything else follows from it.
* In case you are not familiar: Corelight data is produced through deep packet inspection, using the underlying power of Zeek and Suricata. It takes the form of structured logs summarizing network conversations across dozens of protocols, interlinked by unique connection identifiers. It also incorporates proprietary content generated by a range of analysis engines, including behavioral, statistical, and machine learning models (focused on C2 detection, lateral movement, encrypted traffic analysis, entity analysis, and analysis of extracted files, among other goals). For fairness, we chose a Corelight configuration similar to what most customers deploy. Note that Corelight supports additional detection features, capabilities, and integrations (including unsupervised machine learning for anomaly detection, and integrations with EDR, CMDB, and identity providers) that were not enabled or tested in these experiments.
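To make "interlinked by unique connection identifiers" concrete, here is a minimal, hand-written sketch (not actual Corelight output) of pivoting between two such logs. Field names follow Zeek conventions; the records and values are invented for illustration.

```python
# A hand-written sketch of how structured network logs interlink through
# a shared connection identifier. Field names follow Zeek conventions
# (uid, id.orig_h, id.resp_h, ...); the records themselves are invented.

conn_log = [  # connection summary, in the spirit of Zeek's conn.log
    {"uid": "CAbc123", "id.orig_h": "10.0.0.5", "id.resp_h": "93.184.216.34",
     "proto": "tcp", "service": "http"},
]
http_log = [  # protocol detail, in the spirit of Zeek's http.log
    {"uid": "CAbc123", "method": "GET", "host": "example.com", "uri": "/"},
]

# Pivot from a protocol-level record back to its connection summary:
by_uid = {c["uid"]: c for c in conn_log}
pivots = [
    (by_uid[r["uid"]]["id.orig_h"], by_uid[r["uid"]]["id.resp_h"],
     r["method"], r["uri"])
    for r in http_log
]
print(pivots)  # [('10.0.0.5', '93.184.216.34', 'GET', '/')]
```

This shared-identifier structure is what lets an analyst, or an agent, move from a single suspicious HTTP request to the full connection context in one step.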