Databricks + Corelight: A Powerful Combination

Incident response, threat hunting and cybersecurity in general rely on great data. Just as virtually everything else these days is data-driven, from self-driving cars to personalized medicine, effective security strategies need to be data-driven too.

Whatever security solution, service or business process you implement, it probably relies on data from just four sources: logs, networks, hosts and third-party intelligence feeds. Some or all of that data is fed into an analytics stack like Apache Spark™, where advanced analytics, artificial intelligence, machine learning and other techniques can be applied.

When it comes to data about network traffic, though, most organizations are stuck between “not enough data” and “way too much.” The former is normally NetFlow data, which provides basic information about network traffic: source and destination IPs, time and date, bytes sent and received, and a few other important (but sparse) pieces of information. The latter is typically PCAP (packet capture), where every bit on the wire is stored. Since big networks carry a tremendous amount of data, storing it all is non-trivial and can become expensive and cumbersome.

The Alternative

So what’s the alternative? Well, for over 20 years, an open source project called Bro has been used in production at some of the world’s largest organizations and biggest networks, namely government agencies, research universities and very large, web-scale companies.

Bro is a network monitoring framework that ingests a copy of all traffic on a network, parses and analyzes it in real time using its event processing engines and scripts, and then outputs data (“Bro logs”) to some external analytics system like Spark. There are dozens of Bro logs for most common protocols like SMTP, SSL, SMB, DHCP, DNS and many others. Each Bro log contains 10 to 40 fields describing that part of the network traffic.
Furthermore, Bro logs include key pivot points in the data that allow incident responders to follow their instincts or observations by taking advantage of unique identifiers for critical aspects of network traffic: all connections (Connection UID) and files (FUID) are uniquely identified. Along with consistent and precise timestamps across all logs, Bro gives incident responders access to everything they’ll need to resolve most security incidents quickly without having to resort to PCAP except in rare instances.
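To make the pivot idea concrete, here is a minimal sketch in plain Python (not Spark) that parses two tab-separated Bro logs and collects every record sharing one connection UID. The column names (ts, uid, id.orig_h, id.resp_h, id.resp_p, proto, query) follow Bro's conn.log and dns.log conventions, but the sample records, the shortened UIDs and the helper names parse_bro_log and pivot_on_uid are invented for illustration.

```python
import io

def parse_bro_log(text):
    """Parse a Bro TSV log: the '#fields' line names the columns,
    and every other line starting with '#' is metadata that we skip."""
    fields, rows = [], []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
        elif not line.startswith("#"):
            rows.append(dict(zip(fields, line.split("\t"))))
    return rows

# Made-up sample records, trimmed to a handful of fields.
CONN_LOG = (
    "#fields\tts\tuid\tid.orig_h\tid.resp_h\tid.resp_p\tproto\n"
    "1531749600.100000\tCa1\t10.0.0.5\t8.8.8.8\t53\tudp\n"
    "1531749601.200000\tCb2\t10.0.0.5\t93.184.216.34\t443\ttcp\n"
)
DNS_LOG = (
    "#fields\tts\tuid\tquery\n"
    "1531749600.100500\tCa1\texample.com\n"
)

def pivot_on_uid(uid, **logs):
    """Collect every record that shares a connection UID, grouped by log name."""
    return {
        name: [row for row in rows if row.get("uid") == uid]
        for name, rows in logs.items()
    }

conn = parse_bro_log(CONN_LOG)
dns = parse_bro_log(DNS_LOG)
related = pivot_on_uid("Ca1", conn=conn, dns=dns)
print(related["dns"][0]["query"])  # example.com
```

The same join on uid carries over to an analytics system like Spark, where each Bro log would typically become its own table.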

Corelight was founded by the creators of Bro to deliver network security solutions built on this powerful and widely used framework. Since Corelight does not provide an analytics application, we are excited to work with leading companies like Databricks to combine the power of Bro data with the intelligence of Apache Spark and all of its analytic capability.

The challenge of managing threats in a big data world

Staying abreast of the latest threat isn’t the only challenge. The increasing volume and complexity of threats require security teams to capture and mine mountains of data in order to avoid a breach. Yet the Security Information and Event Management (SIEM) and threat detection tools they’ve come to rely on were not built with big data in mind, resulting in a number of challenges:

Inability to scale cost-efficiently
Companies deploy logging and monitoring devices across their networks, end-user devices and production machines to help detect suspicious behavior. These tools produce petabytes of log data that need to be contextualized and analyzed in real time. Processing petabytes of data takes significant computing power. Unfortunately, most SIEM tools were built for on-premises environments, requiring significant build-outs to meet processing demands. Additionally, most SIEM tools charge customers per GB of data ingested. This makes scaling threat detection tools for large volumes of data incredibly cost-prohibitive.
Inability to conduct historical reviews in real time
Identifying a cybersecurity breach as soon as it happens is critical to minimizing data theft, damages, and creation of backlogs. As soon as an event occurs, security analysts need to conduct deep historical analyses to fully investigate the validity and breadth of an attack. Without a means to efficiently scale existing tools, most security teams only have access to a few weeks of historical data. This limits the ability of security teams to identify attacks over long time horizons or conduct forensic reviews in real time.
Abundance of false positives
Another common challenge is the high volume of false positives produced by SIEM tools. The massive amounts of data captured in OS logs, cloud infrastructure logs, intrusion detection systems and other monitoring devices produce events that, in isolation or in connection with other events, may signify a compromised network. Most events need further investigation to determine if the threat is legitimate. Relying on individuals to review hundreds of alerts, including a large number of false positives, results in alert fatigue. Eventually, overwhelmed security teams disregard or overlook events that are in fact legitimate threats.

In order to effectively detect and remediate threats in today’s environment, security teams need to find a better way to process and correlate massive amounts of real-time and historical data, detect patterns that exist outside pre-defined rules and reduce the number of false positives.

Enhancing threat detection with scalable analytics and AI

Databricks offers security teams a new set of tools to combat the growing challenges of big data and sophisticated threats. Where existing tools fall short, the Databricks Unified Analytics Platform fills the void with a platform for data scientists and cybersecurity analysts to easily build, scale, and deploy real-time analytics and machine learning models in minutes, leading to better detection and remediation.
Databricks complements existing threat detection efforts with the following capabilities:

Full enterprise visibility
Native to the cloud and built on Apache Spark by its original creators, Databricks is optimized to process large volumes of streaming and historical data for real-time threat analysis and review. Security teams can query petabytes of historical data stretching months or years into the past, making it possible to profile long-term threats and conduct deep forensic reviews to uncover backdoors left behind by hackers. Security teams can also integrate all types of enterprise data (SIEM logs, cloud logs, system security logs, threat feeds, etc.) for a more complete view of the threat environment.
Proactive threat analytics
Databricks enables security teams to build predictive threat intelligence with a powerful, easy-to-use platform for developing AI and machine learning models. Data scientists can build machine learning models that better score alerts from SIEM tools, reducing reviewer fatigue caused by too many false positives. Data scientists can also use Databricks to build machine learning models that detect anomalous behaviors that exist outside pre-defined rules and known threat patterns.
Collaborative investigations
Interactive notebooks and dashboards enable data scientists, analysts and security teams to collaborate in real time. Multiple users can run queries, share visualizations and make comments within the same workspace to keep investigations moving forward without interruption.
Cost-efficient scale
The Databricks platform is fully managed in the cloud with cost-efficient pricing designed for big data processing. Security teams don’t need to absorb the costly burden of building and maintaining a homegrown cybersecurity analytics platform or paying per GB of data ingested and retained.
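As a rough illustration of detecting behavior "outside pre-defined rules," the sketch below scores each host by how far today's activity sits from its own history, using a simple z-score. This is a deliberately simple statistical stand-in for a trained machine learning model, not a Databricks API. The host addresses, connection counts, the threshold of 3.0 and the function name anomaly_scores are all assumptions made for the example.

```python
from statistics import mean, stdev

def anomaly_scores(baseline, current):
    """Return a z-score per host: how many standard deviations today's
    count sits from that host's own historical mean."""
    scores = {}
    for host, history in baseline.items():
        mu, sigma = mean(history), stdev(history)
        observed = current.get(host, 0)
        scores[host] = 0.0 if sigma == 0 else (observed - mu) / sigma
    return scores

# Hypothetical daily outbound connection counts per host.
baseline = {
    "10.0.0.5": [100, 110, 90, 105, 95],
    "10.0.0.9": [20, 22, 18, 21, 19],
}
today = {"10.0.0.5": 102, "10.0.0.9": 400}

scores = anomaly_scores(baseline, today)
# No rule names a specific host or port; a host is flagged only because
# it deviates sharply from its own past behavior.
flagged = [host for host, z in scores.items() if abs(z) > 3.0]
print(flagged)  # ['10.0.0.9']
```

A score like this could also be used to rank SIEM alerts so that reviewers see the most unusual events first.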

How a Fortune 100 company uses Databricks and advanced cybersecurity analytics to combat threats

A leading technology company employs a large cybersecurity operations center to monitor, analyze and investigate trillions of threat signals each day. Data flows in from a diverse set of sources including intrusion detection systems, network infrastructure and server logs, application logs and more, totaling petabytes in size.

When a suspicious event is identified, threat response teams need to run queries in real time against large historical datasets to verify the extent and validity of a potential breach. To keep pace with the threat environment, the team needed a solution capable of:

Large data volumes at low latency: Analyze billions of records within seconds
Correct and consistent data: Partial and failed writes cannot show up in user queries
Fast, flexible queries on current and historical data: Security analysts need to explore petabytes of data with multiple languages (e.g. Python, SQL)
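The second requirement, that partial and failed writes never show up in queries, can be illustrated with a small write-then-rename sketch. This is a generic pattern shown under stated assumptions, not a description of how Databricks or the company implemented it. The file layout, the .jsonl and .tmp suffixes and the function names atomic_write_batch and read_committed are hypothetical.

```python
import json
import os
import tempfile

def atomic_write_batch(directory, name, records):
    """Write records to a temporary file, then rename it into place.
    Readers never see the file until the rename succeeds."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
        # os.replace is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_path, os.path.join(directory, name))
    except BaseException:
        os.unlink(tmp_path)  # discard the partial batch
        raise

def read_committed(directory):
    """Read only committed '.jsonl' batches; in-flight '.tmp' files are ignored."""
    rows = []
    for fname in sorted(os.listdir(directory)):
        if fname.endswith(".jsonl"):
            with open(os.path.join(directory, fname)) as f:
                rows.extend(json.loads(line) for line in f)
    return rows

with tempfile.TemporaryDirectory() as d:
    atomic_write_batch(d, "batch-0001.jsonl", [{"uid": "Ca1", "bytes": 512}])
    try:
        # The second record cannot be serialized, so this batch fails midway.
        atomic_write_batch(d, "batch-0002.jsonl", [{"uid": "Cb2"}, {"uid": object()}])
    except TypeError:
        pass
    committed = read_committed(d)

print(committed)  # [{'uid': 'Ca1', 'bytes': 512}]
```

The failed batch had already written its first record to the temporary file, yet a reader sees none of it, which is the behavior the requirement asks for.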

As an example of how customers are using advanced cybersecurity analytics, check out this recent video from Apple.

For another more general explanation of how Corelight and Databricks can be deployed together, Databricks produced this video that includes an explanation of how they ingest Bro logs as part of their security solution.

The Challenge

It took a team of twenty engineers over six months to build their legacy architecture that consisted of various data lakes, data warehouses, and ETL tools to try to meet these requirements. Even then, the team was only able to store two weeks of data in its data warehouses due to cost, limiting its ability to look backward in time. Furthermore, the data warehouses chosen were not able to run machine learning.

The Solution

Using the Databricks Unified Analytics Platform, the company was able to put its new architecture into production in just two weeks with a team of five engineers.

Their new architecture is simple and performant. End-to-end latency is low (seconds to minutes), and the threat response team saw up to 100x query speed improvements over open source Apache Spark on Parquet. Moreover, using Databricks, the team is now able to run interactive queries on all its historical data, not just two weeks' worth, making it possible to better detect threats over longer time horizons and conduct deep forensic reviews. They also gained the ability to leverage Apache Spark for machine learning and advanced analytics.

Final Thoughts

As cybercriminals continue to evolve their techniques, cybersecurity teams need to evolve how they detect and prevent threats. Comprehensive network traffic data extracted by Bro is the highest-quality data available to incident responders and threat hunters, and an essential ingredient in the Databricks analytics platform for cybersecurity. Big data analytics and AI offer new hope for organizations looking to improve their security posture, but choosing the right platform is critical to success.

Download our Cybersecurity Analytics Solution Brief or watch the replay of our recent webinar “Enhancing Threat Detection with Big Data and AI” to learn how Databricks can enhance your security posture, including ingestion of Bro logs for network traffic monitoring.

If you’re not at all familiar with Bro, watch Corelight’s two-minute video here.