In my security research role at Corelight, I often have to go through large, complex datasets to detect subtle anomalies and threats. It reminds me of a quote often attributed to Abraham Lincoln:
"Give me six hours to chop down a tree and I will spend the first four sharpening the axe."
For me, that means investing time up front to build tools that allow a large language model (LLM) to do the heavy lifting on key tasks, namely those that teams of analysts would have handled in the past.
One such tool is our map-reduce script, which overcomes the inherent context limitations of LLMs by processing vast amounts of data in smaller, manageable chunks.
Modern LLMs are incredibly powerful but are constrained by a fixed context window: they can only process a certain amount of data at once. When running models locally, available memory (RAM or VRAM) is often the biggest constraint, because a larger context window requires more memory.
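To make the constraint concrete, here is a rough sketch of checking whether a piece of text is likely to fit in a context window. The ratio of about four characters per token is only a common rule of thumb (real tokenizers vary by model, and logs or code often tokenize less efficiently), and `estimate_tokens` and `fits_in_context` are hypothetical helper names for illustration, not part of our script.

```python
import math

# Assumption: roughly 4 characters per token. This is a rule of thumb only;
# use the model's own tokenizer when an exact count matters.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for text, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text: str, num_ctx: int, reserved: int = 0) -> bool:
    """Check whether text likely fits in a window of num_ctx tokens,
    leaving `reserved` tokens for the prompt and the model's answer."""
    return estimate_tokens(text) + reserved <= num_ctx


if __name__ == "__main__":
    document = "x" * 1_000_000  # stand-in for a large log file
    print(estimate_tokens(document))          # 250000
    print(fits_in_context(document, 37_500))  # False
```

Under this estimate, a one-megabyte log is several times larger than a 37,500-token window, which is why the input has to be split before the model sees it.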
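Our map-reduce approach addresses this challenge by splitting the input into manageable chunks, running the query against each chunk independently (the map step), and then consolidating the per-chunk results into a single answer (the reduce step).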
This approach makes it possible to analyze huge datasets efficiently, and also transforms the way we extract actionable insights from data in cybersecurity.
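As a minimal sketch of the idea (not the actual script), the flow looks roughly like this. The `split_text` helper is a simplified fixed-size character splitter that I am swapping in for illustration, `llm` is any callable that takes a prompt and returns text, and the prompt wording is made up. The default chunk size and overlap mirror the defaults listed for the script's options.

```python
from typing import Callable, List


def split_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character chunks that overlap slightly,
    so content cut at a boundary also appears at the start of the next chunk."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


def map_reduce(
    text: str,
    query: str,
    llm: Callable[[str], str],
    chunk_size: int = 100_000,
    chunk_overlap: int = 100,
) -> str:
    """Ask `query` of every chunk (map), then merge the answers (reduce)."""
    chunks = split_text(text, chunk_size, chunk_overlap)
    # Map: each chunk is queried independently, so each call fits the window.
    partial_answers = [llm(f"{query}\n\n---\n{chunk}") for chunk in chunks]
    # Reduce: one final call consolidates the per-chunk answers.
    combined = "\n\n".join(partial_answers)
    return llm(f"Combine these partial answers to the question: {query}\n\n{combined}")


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        """Stand-in for a real model call (hypothetical)."""
        return f"[reply to {len(prompt)}-char prompt]"

    print(map_reduce("abcdefghij", "What is here?", fake_llm, chunk_size=4, chunk_overlap=1))
```

One design point to keep in mind: if the combined partial answers are themselves too large for the context window, the reduce step would need to be applied in several rounds. This sketch does not handle that case.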
(It is important to note that LangChain and LangGraph offer some map-reduce functionality here: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain/. Still, we will implement our method in this article without complicating the process with a state graph.)
This map-reduce methodology is transformative for several key cybersecurity activities, including threat hunting, incident response, and forensic analysis.
Below is an overview of the prerequisites, installation, and step-by-step usage instructions. After that, I will show two examples where this script could be used to analyze Zeek scripts and logs.
Once you have the prerequisites in place, install the required Python packages by running:
To execute the script, use the following command:
-d, --directory:
-p, --path:
-q, --query:
-f, --query_file:
-m, --model: (default: phi4).
-c, --chunk_size: (default: 100000).
-o, --chunk_overlap: (default: 100).
-t, --temperature: (default: 0.0).
-x, --num_ctx: (default: 37500).
-u, --output:
-s, --tika_server: (default: http://localhost:9998).
-z, --debug:
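For readers who want to see how such an interface could be wired up, here is a hedged `argparse` sketch. The flag names and default values come from the list above. The help strings, the argument types, and treating `--debug` as a boolean switch are my assumptions for illustration and may not match the script exactly.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options listed above.
    Help texts and types are assumptions, not copied from the script."""
    parser = argparse.ArgumentParser(
        description="Map-reduce querying of documents with an LLM (illustrative sketch)."
    )
    parser.add_argument("-d", "--directory", help="(assumed) directory of files to analyze")
    parser.add_argument("-p", "--path", help="(assumed) single file to analyze")
    parser.add_argument("-q", "--query", help="(assumed) question to ask about the data")
    parser.add_argument("-f", "--query_file", help="(assumed) file containing the question")
    parser.add_argument("-m", "--model", default="phi4", help="model name")
    parser.add_argument("-c", "--chunk_size", type=int, default=100000,
                        help="chunk size passed to the text splitter")
    parser.add_argument("-o", "--chunk_overlap", type=int, default=100,
                        help="overlap between consecutive chunks")
    parser.add_argument("-t", "--temperature", type=float, default=0.0,
                        help="sampling temperature")
    parser.add_argument("-x", "--num_ctx", type=int, default=37500,
                        help="context window size")
    parser.add_argument("-u", "--output", help="(assumed) file to write the final answer to")
    parser.add_argument("-s", "--tika_server", default="http://localhost:9998",
                        help="Tika server URL")
    parser.add_argument("-z", "--debug", action="store_true",
                        help="(assumed) enable verbose output")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a real command line.
    args = build_parser().parse_args(["-p", "notice.log", "-q", "Summarize suspicious activity"])
    print(args.model, args.chunk_size, args.num_ctx)
```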
The script splits documents into chunks with RecursiveCharacterTextSplitter, making it easier to handle large documents.
To illustrate how this tool works in practice, consider the following example using the Zeek NetSupport detector developed by Corelight. This Zeek package monitors network traffic to detect the usage of NetSupport, an administrative tool that is often exploited by malware operators to facilitate unauthorized remote access.
Example Command:
Output:
You can see in the final LLM output that it successfully distilled the complex Zeek detection logic into three methods:
This LLM response makes the detection mechanisms easier to understand for those who may not be familiar with Zeek source code. It also illustrates how the LLM can translate intricate code into actionable insights. By breaking down the logic into these distinct methods and providing citations to the relevant source files, the output serves as an invaluable resource that bridges the gap between complex technical implementations and practical security analysis.
Below is another example where we use the map-reduce script to analyze just the Zeek logs (no source code) produced from the testing PCAP in the NetSupport repository. In this case, the tool reviews multiple log files and returns a consolidated analysis of suspicious or malicious activities, complete with direct quotes from the raw logs for context:
This example clearly demonstrates the power of the map-reduce approach in cybersecurity log analysis. Even if the LLM’s response is imperfect, it gives the network analyst a head start when looking into these logs. By breaking down extensive Zeek log data into manageable chunks and then consolidating the results, the tool efficiently distills complex network activities into actionable insights. The final output not only highlights key suspicious behaviors—such as potential NetSupport malware communications, anomalous connection patterns, DNS irregularities, and protocol issues—but also directly references raw log excerpts to provide context.
This detailed yet concise summary enables security teams to rapidly assess threats and prioritize further investigation, ultimately enhancing their incident response and forensic capabilities.
By investing time in sharpening our analytical tools—just as Lincoln advised with his axe—we enable LLMs to process complex, large-scale data efficiently. The map-reduce approach allows us to extract actionable insights from massive datasets, fundamentally transforming threat hunting, incident response, and forensic analysis.
Whether you’re analyzing network logs, dissecting source code like the Zeek NetSupport detector, or exploring new ways to automate data analysis, this methodology paves the way for more agile and accurate cybersecurity practices.
Explore this approach further in the LLM-Ninja repository, and join me in harnessing the full power of LLMs to stay ahead of evolving cyber threats.