Leveraging map-reduce and LLMs for enhanced cybersecurity network detection

March 25, 2025 by Keith J. Jones

In my security research role at Corelight, I often have to go through large, complex datasets to detect subtle anomalies and threats. It reminds me of a famous quote often attributed to Abraham Lincoln:

Give me six hours to chop down a tree and I will spend the first four sharpening the axe.

For me, that means investing time up front to build tools that allow a large language model (LLM) to do the heavy lifting on key tasks, namely those that teams of analysts would have handled in the past.

One such tool is our map-reduce script, which overcomes the inherent context limitations of LLMs by processing vast amounts of data in smaller, manageable chunks.

Overcoming LLM limitations with map-reduce

Modern LLMs are incredibly powerful but are constrained by a fixed context window: they can only process a certain amount of data at once. When a model runs locally, available memory is often what limits how large that window can be in practice.

Our map-reduce approach addresses this challenge by:

Breaking Down Data: Splitting enormous datasets into smaller chunks that can be processed individually.
Mapping: Sending each chunk to the LLM with a tailored query (for example, summarizing events or highlighting key indicators), so that even enormous datasets can be handled without exceeding token limits.
Reducing: Aggregating the individual responses into a unified final report that includes citations to the original source files—ensuring traceability and facilitating further investigation.

This approach makes it possible to analyze huge datasets efficiently, and also transforms the way we extract actionable insights from data in cybersecurity.
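To make the three steps concrete, here is a minimal sketch in plain Python. It is not the script from this article: the function names (`split_text`, `map_reduce`), the prompt wording, and the `ask` callable that stands in for the LLM call are all assumptions made for illustration, and the splitter is a simple fixed-size character splitter rather than anything smarter.

```python
from typing import Callable, Dict, List


def split_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into character chunks of at most chunk_size that overlap slightly."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the chunk just added already reaches the end of the text
        start += step
    return chunks


def map_reduce(docs: Dict[str, str], query: str, ask: Callable[[str], str],
               chunk_size: int = 100_000, overlap: int = 100) -> str:
    """Run the query over every chunk (map), then merge the answers (reduce).

    `ask` is a hypothetical stand-in for whatever function sends a prompt
    to the model and returns its text response.
    """
    partials: List[str] = []
    for name, text in docs.items():
        for n, chunk in enumerate(split_text(text, chunk_size, overlap), start=1):
            # Map: one prompt per chunk, labelled with its source for later citation.
            prompt = f"Source: {name} (chunk {n})\n{chunk}\n\nQuestion: {query}"
            partials.append(f"[{name} chunk {n}] {ask(prompt)}")
    # Reduce: a single prompt over all of the partial answers.
    combined = "\n".join(partials)
    return ask(f"Combine these partial answers and cite their sources:\n"
               f"{combined}\n\nQuestion: {query}")
```

Passing the model call in as a plain function keeps the sketch easy to try with a stub before pointing it at a real model.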

(Note that LangChain and LangGraph offer some map-reduce functionality here: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain/. Still, we will implement our method in this article without complicating the process with a state graph.)

Real-world applications in cybersecurity

This map-reduce methodology is transformative for several key cybersecurity activities:

Threat Hunting and Incident Response: When an incident occurs, it is crucial to quickly sift through repositories of network logs to pinpoint suspicious activities or indicators of compromise (IoCs). Using our tool, you can generate a coherent summary of all related events with direct references to the original files, significantly speeding up investigations.
Automated Forensic Analysis: In the aftermath of a security incident, forensic teams can reconstruct timelines and piece together evidence from disparate logs. This automated consolidation of information helps in rapidly identifying the root cause and understanding the sequence of events.
Complex Source Code Analysis: This approach lets us analyze intricate source code and extract only the information we need, such as the logic that detects malicious activity in a Zeek script. As demonstrated in Example 1 with Zeek's NetSupport detector, the approach can distill complex scripts into actionable insights.

General usage

Below is an overview of the prerequisites, installation, and step-by-step usage instructions. After that, I will show two examples where this script could be used to analyze Zeek scripts and logs.

Prerequisites

Python 3.7+
Make sure your environment is running Python 3.7 or later.
Apache Tika Server
Download and start the Tika server manually. The script is configured to use the endpoint: http://localhost:9998. More details: Apache Tika Server.
Required Python Packages:
- tika
- beautifulsoup4
- langchain
- langchain_ollama
- argparse (included in the standard library)
Ollama Installation and Model Download:
To integrate with ChatOllama, install Ollama and download the required model (the default used here is phi4).
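Before running the tool, it can help to confirm that the Tika server is actually listening. The following is a small preflight sketch using only the standard library. It assumes the server exposes its `/version` resource, as stock Tika Server builds do; the helper name `tika_is_up` and the injectable `opener` parameter are hypothetical and not part of the script.

```python
import urllib.error
import urllib.request


def tika_is_up(base_url: str = "http://localhost:9998", timeout: float = 3.0,
               opener=urllib.request.urlopen) -> bool:
    """Return True if a GET to <base_url>/version answers with HTTP 200.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    check can be exercised without a running server.
    """
    url = f"{base_url.rstrip('/')}/version"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: refused connection,
        # timeout, DNS failure, or a non-2xx response all end up here.
        return False
```

Calling `tika_is_up()` with no arguments checks the default endpoint listed above.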

Installation

Once you have the prerequisites in place, install the required Python packages by running:

    
     pip install -r requirements.txt

Running the tool

To execute the script, use the following command:

    
     python map-reduce.py --directory /path/to/your/documents --query "Your query here"

Command-line arguments

-d, --directory:
(Required) The directory containing the files you wish to process.
-p, --path:
Regular expression(s) to filter file paths. You can provide multiple regexes separated by commas (default: .*).
-q, --query:
A single query to send to the LLM. If provided alongside --query_file, this argument takes precedence.
-f, --query_file:
Path to a file containing a multi-line query.
-m, --model:
Specify the Ollama model (default: phi4).
-c, --chunk_size:
The size (in characters) used to split documents into chunks (default: 100000).
-o, --chunk_overlap:
The amount of overlap between chunks (default: 100).
-t, --temperature:
The temperature setting for the ChatOllama model (default: 0.0).
-x, --num_ctx:
The context window size for ChatOllama (default: 37500).
-u, --output:
If specified, the final response will be written to the given file.
-s, --tika_server:
The URL endpoint for the Tika server (default: http://localhost:9998).
-z, --debug:
Enable debug output to generate detailed logs.
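For readers who want to see how the options above fit together, here is a hedged `argparse` sketch that mirrors the documented flags and defaults. It is a reconstruction from the list above, not the script's actual source. The helpers `resolve_query` and `path_matches` are hypothetical, and using `re.search` for the path filter is an assumption, since the article does not say how the regexes are applied.

```python
import argparse
import re


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags and defaults listed in this article."""
    p = argparse.ArgumentParser(description="Map-reduce over documents with an LLM")
    p.add_argument("-d", "--directory", required=True, help="directory to process")
    p.add_argument("-p", "--path", default=".*", help="comma-separated path regexes")
    p.add_argument("-q", "--query", help="single query for the LLM")
    p.add_argument("-f", "--query_file", help="file containing a multi-line query")
    p.add_argument("-m", "--model", default="phi4", help="Ollama model name")
    p.add_argument("-c", "--chunk_size", type=int, default=100000)
    p.add_argument("-o", "--chunk_overlap", type=int, default=100)
    p.add_argument("-t", "--temperature", type=float, default=0.0)
    p.add_argument("-x", "--num_ctx", type=int, default=37500)
    p.add_argument("-u", "--output", help="write the final response to this file")
    p.add_argument("-s", "--tika_server", default="http://localhost:9998")
    p.add_argument("-z", "--debug", action="store_true")
    return p


def resolve_query(args: argparse.Namespace) -> str:
    """Apply the documented precedence: --query wins over --query_file."""
    if args.query:
        return args.query
    if args.query_file:
        with open(args.query_file, encoding="utf-8") as fh:
            return fh.read()
    raise SystemExit("Provide --query or --query_file")


def path_matches(path: str, patterns: str) -> bool:
    """True if any of the comma-separated regexes is found in the path.

    Assumption: patterns are applied with re.search. Splitting on commas
    means a regex that itself contains a comma (e.g. "{1,3}") would break.
    """
    return any(re.search(rx, path) for rx in patterns.split(","))
```

With the filter from Example 1, `path_matches` accepts `README.md` and anything under `scripts/` ending in `.zeek` or `.sig`, and rejects other files in the repository.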

How it works

Document Ingestion:
The script traverses the specified directory recursively and uses Apache Tika to extract text from each file.
Text Splitting:
Extracted content is divided into manageable chunks using LangChain’s RecursiveCharacterTextSplitter, making it easier to handle large documents.
Map Stage:
Each text chunk is individually processed. The tool sends the chunk along with the query to ChatOllama, generating preliminary results for each section.
Reduce Stage:
The map outputs are then aggregated. If the combined output exceeds the model’s context window, the script consolidates the intermediate results in further passes until a final, coherent answer is produced.
Final Output:
The final answer—complete with citations referencing the original document sources—is printed to the console or written to a specified file if the output option is used.
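The reduce stage is the least obvious part of the flow, so here is one way its "consolidate until it fits" behavior could look. This is a sketch under stated assumptions, not the script's implementation: a character budget stands in for the token limit, the greedy batching in `batch_by_budget` is a hypothetical strategy, `ask` is a stand-in for the model call, and `max_rounds` is a guard I added so the loop always terminates.

```python
from typing import Callable, List


def batch_by_budget(items: List[str], budget: int) -> List[List[str]]:
    """Greedily group items so each group's total length stays within budget.

    A single item longer than the budget still gets a group of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for item in items:
        if current and used + len(item) > budget:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += len(item)
    if current:
        batches.append(current)
    return batches


def reduce_until_fits(partials: List[str], query: str, ask: Callable[[str], str],
                      budget: int, max_rounds: int = 10) -> str:
    """Condense partial answers in rounds until they fit, then ask for the final answer."""
    for _ in range(max_rounds):
        if len(partials) <= 1 or sum(len(p) for p in partials) <= budget:
            break  # everything fits in one final prompt
        condensed: List[str] = []
        for batch in batch_by_budget(partials, budget):
            notes = "\n".join(batch)
            condensed.append(ask(f"Condense these notes, keeping source citations:\n"
                                 f"{notes}\n\nQuestion: {query}"))
        partials = condensed
    notes = "\n".join(partials)
    return ask(f"Write the final answer with citations:\n{notes}\n\nQuestion: {query}")
```

When the map outputs already fit within the budget, as in both examples below, the loop is skipped and only the final prompt is sent.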

Example 1: Analyzing Zeek's NetSupport detector

To illustrate how this tool works in practice, consider the following example using the Zeek NetSupport detector developed by Corelight. This Zeek package monitors network traffic to detect the usage of NetSupport, an administrative tool that is often exploited by malware operators to facilitate unauthorized remote access.

Example Command:

    
     % python map-reduce.py -d ~/Source/zeek-netsupport-detector -q "How does this Zeek package detect NetSupport?" --path "(?i).*readme\.md,.*/scripts/.*\.(zeek|sig)"

Output:

    
     Ingesting file: /Users/keith.jones/Source/zeek-netsupport-detector/README.md
Ingesting file: /Users/keith.jones/Source/zeek-netsupport-detector/scripts/netsupport.sig
Ingesting file: /Users/keith.jones/Source/zeek-netsupport-detector/scripts/main.zeek
Ingesting file: /Users/keith.jones/Source/zeek-netsupport-detector/scripts/__load__.zeek
File README.md produced 1 chunks.
File netsupport.sig produced 1 chunks.
File main.zeek produced 1 chunks.
File __load__.zeek produced 1 chunks.
[Map] Sending chunk from README.md (chunk 1/4) to the model...
[Map] Sending chunk from netsupport.sig (chunk 2/4) to the model...
[Map] Sending chunk from main.zeek (chunk 3/4) to the model...
[Map] Sending chunk from __load__.zeek (chunk 4/4) to the model...
[Reduce] Starting reduction stage...
[Reduce] Combining map outputs for reduction...
[Reduce] Combined output within context limit. Sending final prompt to model...
Final Answer:
The Zeek package detects NetSupport Command and Control (C2) traffic by analyzing network traffic captured in PCAP files for specific patterns associated with the use of NetSupport—a legitimate administrative tool that is often exploited by malware operators for remote administration. The detection process involves two primary mechanisms: HTTP headers analysis and command strings detection within TCP payloads.
 
1. **HTTP Headers Detection**:
   - The package identifies suspicious C2 traffic by examining HTTP headers in the network packets. It looks for characteristics such as "USER-AGENT" or "SERVER" headers containing the string "NetSupport". When these patterns are observed, a notice is triggered indicating potential NetSupport-related activity (`NetSupport::C2_Traffic_Observed_HTTP_Headers`). This mechanism is detailed in `main.zeek` (Global chunk 3 of 4).
 
2. **Command Strings Detection**:
   - The package also detects specific command strings within TCP payloads, such as `CMD=POLL` and `CMD=ENCD`, which are indicative of NetSupport's use in remote administration.
     - For the pattern `CMD=POLL`, a regular expression `/.*(\x0a|\x0d)CMD=POLL(\x0a|\x0d)/` is used to match occurrences bounded by newline characters. Detection triggers the function `NetSupport::netsupport_cmd_poll_match`.
     - Similarly, for `CMD=ENCD`, the corresponding pattern is matched using `/.*(\x0a|\x0d)CMD=ENCD(\x0a|\x0d)/`. Upon detection, the function `NetSupport::netsupport_cmd_encd_match` is executed.
   - These detection signatures are defined in `netsupport.sig` (Global chunk 2 of 4).
The process is automated through Zeek scripts that parse PCAP files and apply these rules to identify potential misuse of NetSupport. Detailed logs capture information such as timestamps, IP addresses, ports, and notes on the detected events.
 
**Citations:**
- `README.md`
- `netsupport.sig` (Global chunk 2 of 4)
- `main.zeek` (Global chunk 3 of 4)
- `__load__.zeek` (Global chunk 4 of 4)

You can see in the final LLM output that it successfully distilled the complex Zeek detection logic into three methods:

HTTP Header Inspection: The LLM identified that the package scrutinizes HTTP headers—specifically fields like "USER-AGENT" and "SERVER"—for the presence of the string "NetSupport." This technique helps flag network traffic that might indicate abnormal or potentially malicious use of a legitimate administrative tool.
CMD=POLL Command Detection: The response shows that the script monitors TCP payloads for the command string "CMD=POLL." By employing a regular expression that looks for this command, the system can detect when NetSupport is being used for command-and-control activities, a common tactic in compromised environments.
CMD=ENCD Command Detection: Similarly, the LLM output points out that the package also checks for the "CMD=ENCD" command within TCP payloads. This additional layer of detection further reinforces the identification of potentially nefarious remote administration activities.

This LLM response makes the detection mechanisms easier to understand for those who may not be familiar with Zeek source code. It also illustrates how the LLM can translate intricate code into actionable insights. By breaking the logic into distinct methods and citing the relevant source files, the output connects the technical implementation to practical security analysis.
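To get a feel for what the quoted patterns match, they can be transcribed into Python's `re` module. This is only an approximation for experimentation: the two regexes are copied from the model's answer above, while the `re.DOTALL` flag, the use of `match()` from the start of the payload, the case handling in the header check, and the helper names are my assumptions and say nothing definitive about how Zeek's signature engine evaluates them.

```python
import re
from typing import Dict, List

# Transcriptions of the two payload patterns quoted in the model's answer.
# DOTALL lets the leading ".*" run across earlier lines of the payload.
CMD_POLL = re.compile(rb".*(\x0a|\x0d)CMD=POLL(\x0a|\x0d)", re.DOTALL)
CMD_ENCD = re.compile(rb".*(\x0a|\x0d)CMD=ENCD(\x0a|\x0d)", re.DOTALL)


def payload_hits(payload: bytes) -> List[str]:
    """Return which command strings appear bounded by CR or LF on both sides."""
    hits: List[str] = []
    if CMD_POLL.match(payload):
        hits.append("CMD=POLL")
    if CMD_ENCD.match(payload):
        hits.append("CMD=ENCD")
    return hits


def headers_mention_netsupport(headers: Dict[str, str]) -> bool:
    """True if a USER-AGENT or SERVER header value contains "NetSupport".

    Header names are compared case-insensitively here; that choice is an
    assumption, not something stated in the model's answer.
    """
    return any(name.upper() in ("USER-AGENT", "SERVER") and "NetSupport" in value
               for name, value in headers.items())
```

Note that a payload consisting only of `CMD=POLL` followed by a newline does not match, because the pattern requires a CR or LF before the command as well as after it.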

Example 2: Analyzing NetSupport activity in Zeek logs

Below is another example where we use the map-reduce script to analyze just the Zeek logs (no source code) produced from the testing PCAP in the NetSupport repository. In this case, the tool reviews multiple log files and returns a consolidated analysis of suspicious or malicious activities, complete with direct quotes from the raw logs for context:

    
     % time python map-reduce.py -d ~/Desktop/logs -q "Review these Zeek logs representing network traffic and tell me about any suspicious or malicious cybersecurity activities. Quote the raw logs to support your arguments, for context." --path "(?i).+[^/]\.log$"
Ingesting file: /Users/keith.jones/Desktop/logs/notice.log
Ingesting file: /Users/keith.jones/Desktop/logs/x509.log
Ingesting file: /Users/keith.jones/Desktop/logs/conn.log
Ingesting file: /Users/keith.jones/Desktop/logs/ssl.log
Ingesting file: /Users/keith.jones/Desktop/logs/files.log
Ingesting file: /Users/keith.jones/Desktop/logs/analyzer.log
Ingesting file: /Users/keith.jones/Desktop/logs/http.log
Ingesting file: /Users/keith.jones/Desktop/logs/packet_filter.log
Ingesting file: /Users/keith.jones/Desktop/logs/weird.log
Ingesting file: /Users/keith.jones/Desktop/logs/dns.log
Ingesting file: /Users/keith.jones/Desktop/logs/ocsp.log
File notice.log produced 1 chunks.
File x509.log produced 1 chunks.
File conn.log produced 1 chunks.
File ssl.log produced 1 chunks.
File files.log produced 1 chunks.
File analyzer.log produced 1 chunks.
File http.log produced 1 chunks.
File packet_filter.log produced 1 chunks.
File weird.log produced 1 chunks.
File dns.log produced 1 chunks.
File ocsp.log produced 1 chunks.
[Map] Sending chunk from notice.log (chunk 1/11) to the model...
[Map] Sending chunk from x509.log (chunk 2/11) to the model...
[Map] Sending chunk from conn.log (chunk 3/11) to the model...
[Map] Sending chunk from ssl.log (chunk 4/11) to the model...
[Map] Sending chunk from files.log (chunk 5/11) to the model...
[Map] Sending chunk from analyzer.log (chunk 6/11) to the model...
[Map] Sending chunk from http.log (chunk 7/11) to the model...
[Map] Sending chunk from packet_filter.log (chunk 8/11) to the model...
[Map] Sending chunk from weird.log (chunk 9/11) to the model...
[Map] Sending chunk from dns.log (chunk 10/11) to the model...
[Map] Sending chunk from ocsp.log (chunk 11/11) to the model...
[Reduce] Starting reduction stage...
[Reduce] Combining map outputs for reduction...
[Reduce] Combined output within context limit. Sending final prompt to model...
Final Answer:
The analysis of the provided Zeek log data reveals several instances of potentially suspicious or malicious network activity associated with NetSupport malware and other cybersecurity concerns. Here is a consolidated summary:
 
### Suspicious Activities Identified
 
1. **NetSupport Malware Command-and-Control (C2) Traffic:**
   - The `notice.log` entries indicate multiple instances of NetSupport C2 traffic, identified through specific HTTP headers and commands (`CMD=POLL`, `CMD=ENCD`). These patterns are consistent with malware operations using NetSupport for remote control and data exfiltration.
     - **HTTP Headers Detection:**
       ```
       1717442617.920239 CQ7b0y4Vd4NVQ3nJRi 192.168.100.146 49741 45.134.174.143 443 tcp NetSupport::C2_Traffic_Observed_HTTP_Headers
       ```
     - **CMD=POLL Detection:**
       ```
       1717442617.920239 CQ7b0y4Vd4NVQ3nJRi 192.168.100.146 49741 45.134.174.143 443 tcp NetSupport::C2_Traffic_Observed_CMD_POLL
       ```
     - **CMD=ENCD Detection:**
       ```
       1717442617.955368 CQ7b0y4Vd4NVQ3nJRi 192.168.100.146 49741 45.134.174.143 443 tcp NetSupport::C2_Traffic_Observed_CMD_ENCD
       ```
2. **Repeated Connections and Long Duration Traffic:**
   - The `conn.log` entries show repeated connections to a single IP address (`4.231.128.59`) with no data transferred, which could indicate scanning or probing activities.
     ```
     1717442509.310809 ClEkJM2Vm5giqnMf4h 192.168.100.146 49676 4.231.128.59 443 tcp -
     ```
   - Long duration connections, such as to `40.126.32.76`, suggest potential data exfiltration or sustained C2 communication.
     ```
     1717442518.264335 C9rXSW3KSpTYvPrlI1 192.168.100.146 49689 40.126.32.76 443 tcp ssl
     ```
3. **DNS and HTTP Anomalies:**
   - The `dns.log` entries reveal repeated queries to suspicious domains like `gurjlxyozkjnjqf.gurjlxyozkjnjqf`, which could indicate DNS tunneling or automated scripts.
     ```
     1717442517.960287 CwjjYJ2WqgTbAqiHl6 192.168.100.146 50229 192.168.100.2 53 udp gurjlxyozkjnjqf.gurjlxyozkjnjqf
     ```
   - The `http.log` entries show multiple OCSP and CRL requests, suggesting a compromised system querying certificate statuses.
     ```
     1717442517.843005 C3eiCBGOLw3VtHfOj 192.168.100.146 49687 192.229.221.95 80 GET ocsp.digicert.com
     ```
4. **TCP Connection Anomalies:**
   - The `weird.log` entries indicate TCP acknowledgment and sequence number irregularities, which could be exploited for network attacks.
     ```
     1717442550.344203 CfQutTEgs3g8sUrsa 192.168.100.146 49700 23.14.28.77 443 TCP_ack_underflow_or_misorder
     ```
5. **Invalid TLS Version:**
   - An entry in `analyzer.log` shows an invalid TLS version being used, potentially pointing to attempts to exploit vulnerabilities.
     ```
     1717442617.920239 violation protocol SSL CQ7b0y4Vd4NVQ3nJRi - 192.168.100.146 49741 45.134.174.143 443 Invalid version in TLS connection. Version: 20307
     ```
 
### Conclusion
 
The consolidated analysis of logs from `notice.log`, `conn.log`, `dns.log`, `http.log`, `weird.log`, and `analyzer.log` suggests several cybersecurity threats, including potential malware activity, scanning or probing attempts, DNS tunneling, and protocol anomalies. These insights, backed by direct log quotes, highlight the value of using the map-reduce approach to rapidly sift through extensive log data and identify actionable intelligence.
**Citations:**
- `notice.log`
- `conn.log`
- `dns.log`
- `http.log`
- `weird.log`
- `analyzer.log`
python map-reduce.py -d ~/Desktop/logs -q  --path "(?i).+[^/]\.log$"  6.93s user 1.53s system 0% cpu 19:58.50 total

This example clearly demonstrates the power of the map-reduce approach in cybersecurity log analysis. Even if the LLM’s response is imperfect, it gives the network analyst a head start when looking into these logs. By breaking down extensive Zeek log data into manageable chunks and then consolidating the results, the tool efficiently distills complex network activities into actionable insights. The final output not only highlights key suspicious behaviors—such as potential NetSupport malware communications, anomalous connection patterns, DNS irregularities, and protocol issues—but also directly references raw log excerpts to provide context.

This detailed yet concise summary enables security teams to rapidly assess threats and prioritize further investigation, ultimately enhancing their incident response and forensic capabilities.
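Because the model's quotes are condensed versions of the raw log lines, a quick sanity check before acting on them is worthwhile. The sketch below is a hypothetical helper, not part of the map-reduce script: it assumes each quoted line starts with a Zeek-style epoch timestamp followed by a second field (usually the connection UID), and it reports any such pair that does not begin a line in the raw logs.

```python
import re
from typing import Dict, List, Set, Tuple

# Epoch timestamp with microseconds, e.g. 1717442617.920239
TS = re.compile(r"\d{10}\.\d{6}")


def leading_pairs(text: str) -> Set[Tuple[str, str]]:
    """Collect (timestamp, next field) from every line that starts with a timestamp."""
    pairs: Set[Tuple[str, str]] = set()
    for line in text.splitlines():
        tokens = line.split()  # splits on tabs and spaces alike
        if len(tokens) >= 2 and TS.fullmatch(tokens[0]):
            pairs.add((tokens[0], tokens[1]))
    return pairs


def unverified_quotes(answer: str, raw_logs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return quoted (timestamp, field) pairs that begin no line in any raw log."""
    seen: Set[Tuple[str, str]] = set()
    for text in raw_logs.values():
        seen |= leading_pairs(text)
    return sorted(leading_pairs(answer) - seen)
```

An empty result means every quoted line can be traced back to a real log entry; it does not mean the model's interpretation of that entry is correct.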

Conclusion

By investing time in sharpening our analytical tools—just as Lincoln advised with his axe—we enable LLMs to process complex, large-scale data efficiently. The map-reduce approach allows us to extract actionable insights from massive datasets, fundamentally transforming threat hunting, incident response, and forensic analysis.

Whether you’re analyzing network logs, dissecting source code like the Zeek NetSupport detector, or exploring new ways to automate data analysis, this methodology paves the way for more agile and accurate cybersecurity practices.

Explore this approach further in the LLM-Ninja repository, and join me in harnessing the full power of LLMs to stay ahead of evolving cyber threats.

Corelight Bright Ideas Blog