What Is MTTR?
Learn what MTTR really means in cybersecurity, why it matters, and how to improve response times with the right strategies, tools, and metrics.
MTTR can stand for Mean Time to Respond, Mean Time to Recover, or Mean Time to Resolve, and it is crucial to know which one is meant. Each measures a different aspect of performance, and which one applies depends on the specific context. Essentially, all forms of MTTR calculate the average time required to achieve a particular objective across the incidents in a set period.
Matters get more complicated because organizations may define “respond,” “resolve,” and “recover” differently. This article is not a comprehensive list of MTTR definitions; instead, it covers three possible definitions of these terms and how they can be used in a security program’s metrics.
Definitions of MTTR and their uses
This article covers Mean Time to Respond, Mean Time to Recovery, and Mean Time to Resolve. As much as possible, it uses guidance from the National Institute of Standards and Technology (NIST) to form these definitions.
MTTR Term | Definition |
---|---|
Mean Time to Respond |
The NIST Cybersecurity Framework (CSF) describes the response phase of an intrusion as the actions taken after indications of compromise have been analyzed and “declared” to be a genuine incident. The response includes execution of a response plan, triage, analysis, escalation, communication, mitigation, and ultimately containment and/or eradication.
The Time to Respond measures how long it takes to go from detection to completion of the steps in the Response phase of an intrusion. Essentially, this measurement shows how quickly the organization was able to stop the damage once an intrusion was detected. The Mean Time to Respond is the average Time to Respond across all of the intrusions being measured. The formula is:
MTTRespond = (sum of time to respond for all incidents) / number of incidents
This does not mean the victim has completely recovered from the intrusion, as there will still be Recovery tasks to complete before achieving a full restoration of the organization’s functionality. |
Mean Time to Recovery |
NIST defines Mean Time to Recovery as “a metric that tracks the average amount of time that it takes to recover from a product or system failure.”
The point at which this metric ends is fairly straightforward: It’s when the system or product has been recovered. When the clock starts, however, is an open question. Some definitions start from the moment the initial compromise occurs. Others begin once response is complete and final recovery begins. Organizations may find one definition or another works best for them; what matters is that the definition of “recovery” is clear and that it’s applied consistently.
If an organization decides that the time to Recovery begins once the Response is complete, then they are measuring only the Recovery phase of NIST’s CSF. The formula for this metric is:
MTTRecovery = (sum of time to recovery for all incidents) / number of incidents |
Mean Time to Resolve |
NIST does not provide a definition of this metric. As with the Mean Time to Recovery, the Mean Time to Resolve has a logical ending point. The incident response process ends after a lessons-learned phase and a debrief on how the response went and what could be improved. It seems reasonable to say that the time to resolve ends once this debrief is done. But where does the timer start?
This metric can begin at either the time of initial compromise or when the compromise was first detected. Organizations should select the one that aligns with their capabilities to detect and investigate intrusions. The most accurate may be from the time of the initial compromise, but it may not be possible to reliably determine when an intrusion occurred. Ideally, organizations want to improve their detections as well as their resolutions, and this more expansive definition highlights the value of discovering incidents as early as possible in the cyber kill chain.
If the Time to Resolve is measured from initial compromise to the completion of lessons learned, then it creates a metric for the entire scope of the intrusion. Its formula is:
MTTResolve = (sum of time to resolve for all incidents) / number of incidents |
These metrics can be visualized by putting them on a timeline of an intrusion. The graphic below includes another metric: Mean Time to Detect (MTTD). By laying these metrics out next to each other, it is easier to see what they measure and how they interact with each other.
Mean Time to Detect flows into the Mean Time to Respond, which moves next into Mean Time to Recovery. The Mean Time to Resolve measures the entire scope of an incident, including learning from the incident.
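The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Incident` fields are hypothetical names for points on the intrusion timeline, and the sketch assumes that Recovery starts when Response completes and that Resolve runs from initial compromise to the end of the debrief. The timestamps are invented sample data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names; each marks one point on the intrusion timeline.
    compromised: datetime  # initial compromise (often an estimate)
    detected: datetime     # intrusion detected
    contained: datetime    # Response phase complete
    recovered: datetime    # systems restored
    debriefed: datetime    # lessons-learned review complete


def mean_hours(durations: list[timedelta]) -> float:
    """Average a list of durations and return the result in hours."""
    return mean(d.total_seconds() for d in durations) / 3600


def mttr_report(incidents: list[Incident]) -> dict[str, float]:
    """Apply (sum of times) / (number of incidents) to each metric."""
    return {
        "MTTD": mean_hours([i.detected - i.compromised for i in incidents]),
        "MTTRespond": mean_hours([i.contained - i.detected for i in incidents]),
        "MTTRecovery": mean_hours([i.recovered - i.contained for i in incidents]),
        "MTTResolve": mean_hours([i.debriefed - i.compromised for i in incidents]),
    }


# Invented sample data for two incidents.
incidents = [
    Incident(
        compromised=datetime(2024, 1, 1, 0, 0),
        detected=datetime(2024, 1, 1, 4, 0),
        contained=datetime(2024, 1, 1, 10, 0),
        recovered=datetime(2024, 1, 2, 10, 0),
        debriefed=datetime(2024, 1, 4, 0, 0),
    ),
    Incident(
        compromised=datetime(2024, 2, 1, 0, 0),
        detected=datetime(2024, 2, 1, 8, 0),
        contained=datetime(2024, 2, 1, 18, 0),
        recovered=datetime(2024, 2, 2, 6, 0),
        debriefed=datetime(2024, 2, 3, 0, 0),
    ),
]

report = mttr_report(incidents)
for name, hours in report.items():
    print(f"{name}: {hours:.1f} hours")
```

An organization that starts the Recovery or Resolve clock at a different point would change only the subtraction inside `mttr_report`. Whichever start point is chosen, it needs to stay the same from one reporting period to the next.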
Why MTTR is important in cybersecurity
The reason to use MTTR in cybersecurity (in all its forms) is to measure how well our processes and procedures are working. The information that these metrics contain is valuable because it shows organizations where things might be breaking down.
However, having three definitions for one acronym may make it difficult to know where to start the measurement process. One suggestion is to pick the metrics you want to measure that will help you understand how your security organization is performing, and whether performance is improving or declining. It is necessary to establish each metric’s definition and stick with it over time. If you change the definition of what you’re measuring mid-stream, you won’t be able to compare metrics from before the change to those from after the change.
For example, measuring MTTD can help you understand if your detection strategy is working, and whether the introduction or removal of a detection tool is improving or hindering your detection efficacy. Measuring Mean Time to Respond and Mean Time to Recovery can help you understand whether your organization has sufficient staff and equipment to handle security incidents quickly once they arise.
For instance, if an organization can determine that it’s detecting intrusions fairly quickly, but the response phase is getting worse over time, it’s clear where more investment could be beneficial. Deeper analysis of the response phase could reveal a number of factors contributing to the slowdown, e.g., a procedural obstacle that needs to be addressed, or a new tool the security staff is still figuring out how to use.
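One way to spot this kind of drift is to group a metric by reporting period and compare the periods. The sketch below is illustrative only: the function names, the monthly grouping, and the strict "every month is worse than the last" rule are assumptions, and the sample values are invented. A real program would likely want more incidents per period and a more forgiving trend test.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_mean_hours(samples: list[tuple[datetime, float]]) -> dict[str, float]:
    """Group (detection time, hours to respond) pairs by month and average each group."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for detected_at, hours in samples:
        buckets[detected_at.strftime("%Y-%m")].append(hours)
    # Sort by month so the result reads in chronological order.
    return {month: mean(values) for month, values in sorted(buckets.items())}


def is_worsening(monthly: dict[str, float]) -> bool:
    """True when each month's mean is higher than the previous month's."""
    values = list(monthly.values())
    return len(values) > 1 and all(a < b for a, b in zip(values, values[1:]))


# Invented sample data: (when the incident was detected, hours to respond).
samples = [
    (datetime(2024, 1, 5), 6.0),
    (datetime(2024, 1, 20), 8.0),
    (datetime(2024, 2, 3), 9.0),
    (datetime(2024, 2, 17), 11.0),
    (datetime(2024, 3, 9), 14.0),
]

monthly = monthly_mean_hours(samples)
print(monthly)
print("Response time worsening:", is_worsening(monthly))
```

The comparison is only meaningful if the definition of Time to Respond stayed the same across all of the months being compared.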
Factors that influence and improve MTTR
There are many things that can impact the various MTTR metrics. Security monitoring and incident response can be very complex and involve many different teams and departments. However, most potential impacts can be funneled into these categories: procedures and authority, preparation, and visibility.
Factors | Context |
---|---|
Procedures and authority |
In some ways, having effective procedures and the authority to perform them is the most difficult category for organizations to address. Teams with different responsibilities and priorities have to work out how an incident will be handled, when a step can be taken, and who will execute it. For example, deciding when to take a business system offline to prevent an attacker from moving to other systems is going to be a difficult, and potentially contentious, discussion. Even smaller-scale decisions, like whether security will have any access to affected systems, can be very involved. The system owner doesn’t want security to take an action without understanding the impact it could have, and security wants to contain the adversary as quickly as possible. These objectives may come into conflict. |
Preparation |
Preparation for an incident also has a major impact on MTTR. An organization can implement tools, define procedures, and create meaningful metrics, but there still can be serious impacts from an intrusion if any stakeholders have not learned and practiced their roles. Security operations center (SOC) teams need to know how to triage alerts and how to escalate them appropriately. Incident responders need to be well-versed in using tools to restrict adversary movement and contain their actions. System owners need to understand what is required to stop an intrusion and why security might ask for procedures to be exercised.
Organizations need to practice responding to incidents in ways that familiarize their teams with procedures, and the simulations should induce some stress so that responders can become more capable of acting under pressure. |
Visibility |
Without sufficient visibility into its systems, an organization may not detect an intrusion or may not know when it has actually resolved one. Maintaining adequate visibility is an ongoing effort, since networks, applications, systems, and other technology resources are rarely static. Organizations should regularly assess what they are able to see in their environments and what they can’t see. |
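A visibility assessment can start as a simple comparison between what the organization believes it owns and what its monitoring tools actually report on. The sketch below assumes two hypothetical inputs, an asset inventory and the set of hosts currently producing telemetry; the host names are invented, and real inventories are rarely this tidy.

```python
def visibility_gaps(inventory: set[str], monitored: set[str]) -> dict[str, set[str]]:
    """Compare an asset inventory against the hosts that produce telemetry."""
    return {
        # Known assets that no monitoring tool reports on (potential blind spots).
        "unmonitored": inventory - monitored,
        # Hosts producing telemetry that are missing from the inventory.
        "uninventoried": monitored - inventory,
    }


# Invented sample data.
inventory = {"ws-12", "fs-01", "dc-01", "printer-3"}
monitored = {"ws-12", "fs-01", "dc-01", "cam-07"}

gaps = visibility_gaps(inventory, monitored)
print("No telemetry:", sorted(gaps["unmonitored"]))
print("Not in inventory:", sorted(gaps["uninventoried"]))
```

Running a check like this on a schedule reflects the point that visibility is an ongoing effort rather than a one-time project.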
The best tools for reducing MTTR
Endpoint detection and response (EDR), security information and event management (SIEM), and network detection and response (NDR) systems are some of the best tools to help organizations reduce MTTR in all its definitions. These tools allow SOCs, incident responders, and threat hunters to identify and respond to the actions of adversaries when used effectively and actively.
EDR gives detailed information about what is executing on compromised hosts, what accounts were used, what files were dropped, and where they are stored. SIEM aggregates security and event information into a single place for defenders to analyze.
NDR gives these teams the ability to see all traffic between hosts, whether those hosts are managed by EDR or not. It allows incident responders to see what hosts a known compromised host was interacting with to determine the extent of an adversary’s control and the full extent of the incident. These technologies provide critical information that can be combined to provide a comprehensive view of what occurred during an intrusion, if it is still occurring, and where it is happening. Together they form the SOC Visibility Triad and give SOC operators the ability to respond effectively to intrusions.
Purchasing these technologies is not enough; organizations must also invest in their procedures to get the most out of their tools. This includes building automation to improve speed and efficiency and using each platform’s machine learning capabilities to respond to changing adversary behaviors. Organizations must also take advantage of integrations between platforms to respond rapidly to adversaries’ actions. SOCs should also be testing AI capabilities to improve their current workflows and preparing for the impact that AI tools will have in the future.
Common challenges in reducing MTTR
There are a number of challenges that most organizations face in responding to adversaries and managing their security posture. Organizations need to effectively prioritize the threats that they face and must address. They must ensure that they have a complete view of their environments so that they can detect the threats they face. This must include monitoring all networked devices, such as printers, cameras, third-party systems, and appliances. Adversaries will take advantage of these devices as they may present a blind spot for some security monitoring.
Organizations must also be able to determine how far an adversary has penetrated their environment when an intrusion occurs. What systems did they interact with, modify, or use for remote access? Without this information, organizations will not be able to effectively isolate compromised systems, implement firewall rules to block activity, or take the other steps needed to eradicate the adversary from the environment. If organizations do not completely identify the scope of an intrusion, they risk the adversary using unidentified systems to return to the environment.
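Scoping questions like these can be framed as a graph search over connection records. The following is a rough sketch under stated assumptions: the `(source, destination)` pair format and the host names are hypothetical, not the output of any particular product, and the search treats every logged connection as a possible path. The result is a list of candidate hosts to investigate, not proof that each one was compromised.

```python
from collections import defaultdict, deque
from typing import Iterable, Optional


def blast_radius(
    connections: Iterable[tuple[str, str]],
    start: str,
    max_hops: Optional[int] = None,
) -> dict[str, int]:
    """Breadth-first search: hosts reachable from `start`, mapped to hop count."""
    graph: dict[str, set[str]] = defaultdict(set)
    for src, dst in connections:
        # Treat connections as undirected: either side may have initiated contact.
        graph[src].add(dst)
        graph[dst].add(src)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        host = queue.popleft()
        if max_hops is not None and hops[host] >= max_hops:
            continue  # do not expand beyond the requested depth
        for peer in graph[host]:
            if peer not in hops:
                hops[peer] = hops[host] + 1
                queue.append(peer)
    return hops


# Invented sample data: (source host, destination host) pairs.
connections = [
    ("ws-12", "fs-01"),
    ("fs-01", "dc-01"),
    ("ws-12", "printer-3"),
    ("ws-40", "mail-01"),
]

scope = blast_radius(connections, "ws-12")
for host, distance in sorted(scope.items(), key=lambda item: (item[1], item[0])):
    print(f"{host}: {distance} hop(s) from the known compromised host")
```

Hosts closest to the known compromised system are natural first candidates for triage. Unmanaged devices such as printers appear in the result only if their traffic is being captured in the first place.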
How Corelight can help
Corelight’s Open NDR Platform provides organizations with the visibility and capability to detect malicious activity in their environments across the entire network. Additionally, Corelight’s integrations with other security platforms enable SOCs and incident responders to identify threats and respond quickly and effectively. Corelight’s NDR enables threat hunters and SOCs to monitor for adversary actions with real-time network visibility of on-premises, cloud, and hybrid environments. This capability allows organizations to reduce MTTR by identifying adversary activity quickly and determining the blast radius of an intrusion so that containment and eradication efforts can effectively remove the adversary from the environment.