Corelight Bright Ideas Blog: NDR & Threat Hunting Blog

Leveraging Map-Reduce & LLMs for Network Detection | Corelight

Written by Keith J. Jones | Mar 25, 2025 3:42:28 PM

In my security research role at Corelight, I often have to go through large, complex data sets to detect subtle anomalies and threats. It reminds me of a famous quote by Abraham Lincoln:

Give me six hours to chop down a tree and I will spend the first four sharpening the axe.

For me, that means investing time up front to build tools that allow a large language model (LLM) to do the heavy lifting on key tasks, namely those that teams of analysts would have handled in the past.

One such tool is our map-reduce script, which overcomes the inherent context limitations of LLMs by processing vast amounts of data in smaller, manageable chunks.

Overcoming LLM limitations with map-reduce

Modern LLMs are incredibly powerful but are constrained by a fixed context window—they can only attend to a limited number of tokens at once. (When running models locally, available RAM or VRAM also limits how large a context window is practical.)

Our map-reduce approach addresses this challenge by:

  • Breaking Down Data: Splitting enormous datasets into smaller chunks that can be processed individually.
  • Mapping: Sending each chunk to the LLM with a tailored query (for example, summarizing events or highlighting key indicators), so that even enormous datasets can be handled without exceeding token limits.
  • Reducing: Aggregating the individual responses into a unified final report that includes citations to the original source files—ensuring traceability and facilitating further investigation.
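The three steps above can be sketched in a few lines of Python. Note that `ask_llm` below is a hypothetical stand-in for a real model call (such as ChatOllama); here it just reports chunk lengths so the control flow is runnable without a model server.

```python
def ask_llm(prompt):
    # Placeholder for an actual LLM call (e.g., via ChatOllama).
    return f"[summary of {len(prompt)} chars]"

def split_into_chunks(text, chunk_size, overlap):
    """Break the input into overlapping chunks that fit the context window."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def map_reduce(text, query, chunk_size=100_000, overlap=100):
    # Map: send each chunk to the LLM with the tailored query.
    partials = [ask_llm(f"{query}\n\n{chunk}")
                for chunk in split_into_chunks(text, chunk_size, overlap)]
    # Reduce: aggregate the individual responses into one final answer.
    return ask_llm(f"{query}\n\nCombine these partial answers:\n" + "\n".join(partials))
```

The real script adds file ingestion, regex filtering, and citation tracking on top of this skeleton, but the map and reduce stages follow the same shape.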

This approach makes it possible to analyze huge datasets efficiently, and also transforms the way we extract actionable insights from data in cybersecurity.

(It is worth noting that LangChain and LangGraph offer built-in map-reduce functionality: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain/. In this article, however, we implement the method directly, without the added complexity of a state graph.)

Real-world applications in cybersecurity

This map-reduce methodology is transformative for several key cybersecurity activities:

  • Threat Hunting and Incident Response: When an incident occurs, it is crucial to quickly sift through repositories of network logs to pinpoint suspicious activities or indicators of compromise (IoCs). Using our tool, you can generate a coherent summary of all related events with direct references to the original files, significantly speeding up investigations.
  • Automated Forensic Analysis: In the aftermath of a security incident, forensic teams can reconstruct timelines and piece together evidence from disparate logs. This automated consolidation of information helps in rapidly identifying the root cause and understanding the sequence of events.
  • Complex Source Code Analysis: This approach lets us analyze intricate source code and extract only the relevant information we need—such as the logic that detects malicious activity in a Zeek script. As demonstrated in the next example with Zeek's NetSupport detector, the tool can distill complex scripts into actionable insights.

General usage

Below is an overview of the prerequisites, installation, and step-by-step usage instructions. After that, I will show two examples where this script could be used to analyze Zeek scripts and logs.

Prerequisites

  • Python 3.7+
    Make sure your environment is running Python 3.7 or later.
  • Apache Tika Server
    Download and start the Tika server manually. The script is configured to use the endpoint: http://localhost:9998. More details: Apache Tika Server.
  • Required Python Packages:
    • tika
    • beautifulsoup4
    • langchain
    • langchain_ollama
    • argparse (included in the standard library)
  • Ollama Installation and Model Download:
    To integrate with ChatOllama, install Ollama and download the required model (the default used here is phi4).

Installation

Once you have the prerequisites in place, install the required Python packages by running:
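Based on the package list in the prerequisites above, the installation step is presumably a single pip command (argparse ships with the standard library, so it needs no install):

```shell
pip install tika beautifulsoup4 langchain langchain_ollama
```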

Running the tool

To execute the script, use the following command:
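The exact invocation depends on how the script is named in the LLM-Ninja repository; assuming a script called `map_reduce.py` and a local directory of files to analyze, a minimal run might look like:

```shell
python map_reduce.py -d ./zeek-logs -q "Summarize suspicious activity in these logs." -m phi4
```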

Command-line arguments

  • -d, --directory:
    (Required) The directory containing the files you wish to process.
  • -p, --path:
    Regular expression(s) to filter file paths. You can provide multiple regexes separated by commas (default: .*).
  • -q, --query:
    A single query to send to the LLM. If provided alongside --query_file, this argument takes precedence.
  • -f, --query_file:
    Path to a file containing a multi-line query.
  • -m, --model:
    Specify the Ollama model (default: phi4).
  • -c, --chunk_size:
    The size (in characters) used to split documents into chunks (default: 100000).
  • -o, --chunk_overlap:
    The amount of overlap between chunks (default: 100).
  • -t, --temperature:
    The temperature setting for the ChatOllama model (default: 0.0).
  • -x, --num_ctx:
    The context window size for ChatOllama (default: 37500).
  • -u, --output:
    If specified, the final response will be written to the given file.
  • -s, --tika_server:
    The URL endpoint for the Tika server (default: http://localhost:9998).
  • -z, --debug:
    Enable debug output to generate detailed logs.

How it works

  1. Document Ingestion:
    The script traverses the specified directory recursively and uses Apache Tika to extract text from each file.
  2. Text Splitting:
    Extracted content is divided into manageable chunks using LangChain’s RecursiveCharacterTextSplitter, making it easier to handle large documents.
  3. Map Stage:
    Each text chunk is individually processed. The tool sends the chunk along with the query to ChatOllama, generating preliminary results for each section.
  4. Reduce Stage:
    The map outputs are then aggregated. If the combined output exceeds the model’s context window, the script intelligently consolidates intermediate results until a final, coherent answer is produced.
  5. Final Output:
    The final answer—complete with citations referencing the original document sources—is printed to the console or written to a specified file if the output option is used.

Example 1: Analyzing Zeek's NetSupport detector

To illustrate how this tool works in practice, consider the following example using the Zeek NetSupport detector developed by Corelight. This Zeek package monitors network traffic to detect the usage of NetSupport—an administrative tool that is often exploited by malware operators to facilitate unauthorized remote access.

Example Command:
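A command along these lines would point the tool at the detector's source; the script name and local directory path here are hypothetical:

```shell
python map_reduce.py \
  -d ./zeek-netsupport-detector \
  -q "Explain how this Zeek package detects NetSupport activity."
```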

Output:

You can see in the final LLM output that it successfully distilled the complex Zeek detection logic into three methods:

  • HTTP Header Inspection: The LLM identified that the package scrutinizes HTTP headers—specifically fields like "USER-AGENT" and "SERVER"—for the presence of the string "NetSupport." This technique helps flag network traffic that might indicate abnormal or potentially malicious use of a legitimate administrative tool.
  • CMD=POLL Command Detection: The response shows that the script monitors TCP payloads for the command string "CMD=POLL." By employing a regular expression that looks for this command, the system can detect when NetSupport is being used for command-and-control activities, a common tactic in compromised environments.
  • CMD=ENCD Command Detection: Similarly, the LLM output points out that the package also checks for the "CMD=ENCD" command within TCP payloads. This additional layer of detection further reinforces the identification of potentially nefarious remote administration activities.

This LLM response makes the detection mechanisms easier to understand for those who may not be familiar with Zeek source code. It also illustrates how the LLM can translate intricate code into actionable insights. By breaking down the logic into these distinct methods and providing citations to the relevant source files, the output serves as an invaluable resource that bridges the gap between complex technical implementations and practical security analysis.

Example 2: Analyzing NetSupport activity in Zeek logs

Below is another example where we use the map-reduce script to analyze just the Zeek logs (no source code) produced from the testing PCAP in the NetSupport repository. In this case, the tool reviews multiple log files and returns a consolidated analysis of suspicious or malicious activities, complete with direct quotes from the raw logs for context:
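The invocation for this log-analysis run might look something like the following; the script name and log directory path are hypothetical:

```shell
python map_reduce.py \
  -d ./netsupport-logs \
  -p ".*\.log$" \
  -q "Identify suspicious or malicious activity in these Zeek logs, quoting the raw log lines as evidence."
```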

This example clearly demonstrates the power of the map-reduce approach in cybersecurity log analysis. Even if the LLM’s response is imperfect, it gives the network analyst a head start when looking into these logs. By breaking down extensive Zeek log data into manageable chunks and then consolidating the results, the tool efficiently distills complex network activities into actionable insights. The final output not only highlights key suspicious behaviors—such as potential NetSupport malware communications, anomalous connection patterns, DNS irregularities, and protocol issues—but also directly references raw log excerpts to provide context.

This detailed yet concise summary enables security teams to rapidly assess threats and prioritize further investigation, ultimately enhancing their incident response and forensic capabilities.

Conclusion

By investing time in sharpening our analytical tools—just as Lincoln advised with his axe—we enable LLMs to process complex, large-scale data efficiently. The map-reduce approach allows us to extract actionable insights from massive datasets, fundamentally transforming threat hunting, incident response, and forensic analysis.

Whether you’re analyzing network logs, dissecting source code like the Zeek NetSupport detector, or exploring new ways to automate data analysis, this methodology paves the way for more agile and accurate cybersecurity practices.

Explore this approach further in the LLM-Ninja repository, and join me in harnessing the full power of LLMs to stay ahead of evolving cyber threats.