Using normalization and machine learning to get more from your threat data
By Wayne Chiang, Chief Architect and Co-Founder, ThreatQuotient
While the year-end results haven’t been tallied, 2017 saw a record number of breaches with 3,833 reported through the end of September 2017, exposing over 7 billion records. Obviously, the Equifax breach skews the number of records exposed, but the number of reported breaches is still up 18% compared to the same period in 2016.
As organizations look to better protect themselves from such attacks, they may think more threat intelligence will help. But organizations typically have more threat intelligence than they know what to do with. They have multiple data feeds, some from commercial sources, some open source, some industry and some from their existing security vendors – each in a different format. On top of that, each point product within their layers of defense has its own intelligence.
More threat data isn’t necessarily the solution. The biggest challenge is organizing the data you do have and optimizing the value to your enterprise. This requires aggregating intelligence from disparate sources into a central repository and designing a strategy for normalizing data and minimizing segmentation.
When it comes to organizing the data, one of the first tasks is the mapping of different attributes for indicators of compromise (IOCs) that all refer to the same thing. For example, using Lockheed Martin’s Cyber Kill Chain, we frequently see several variants from different feeds: Command and Control, C2, C&C, C2IP, C and C, Command & Control, and CnC.
How can we prevent segmentation when these values all point to the same thing? The problem rears its ugly head when, for example, we search for IOCs created within the past 48 hours that fall within the Command and Control stage of the attack. Without a strategy for preventing segmentation, we have to perform multiple searches with different variants of the C2 value or devise some sort of convoluted wildcard search. These manual, time-intensive methods are far from ideal and may even cause IOCs to slip through the cracks.
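To make the search problem concrete, here is a minimal Python sketch. It assumes IOCs are plain dictionaries with hypothetical `value`, `stage`, and `created` fields, which is not any particular product's data model. The variant list comes from the examples above; the sample IOCs are invented and use documentation-only IP addresses.

```python
from datetime import datetime, timedelta, timezone

# Variants listed in the article, all meaning the Command and Control stage.
C2_VARIANTS = {"Command and Control", "C2", "C&C", "C2IP", "C and C",
               "Command & Control", "CnC"}
CANONICAL_C2 = "Command and Control"
_KNOWN = {v.casefold() for v in C2_VARIANTS}


def normalize_stage(value):
    """Map any known C2 spelling to one canonical label; pass others through."""
    cleaned = value.strip()
    return CANONICAL_C2 if cleaned.casefold() in _KNOWN else cleaned


def recent_c2(iocs, now, hours=48):
    """Return IOC values in the C2 stage created within the last `hours`."""
    cutoff = now - timedelta(hours=hours)
    return [
        ioc["value"]
        for ioc in iocs
        if ioc["created"] >= cutoff
        and normalize_stage(ioc["stage"]) == CANONICAL_C2
    ]


# Invented sample data: three feeds spell the same stage three different ways.
now = datetime(2018, 1, 10, 12, 0, tzinfo=timezone.utc)
iocs = [
    {"value": "192.0.2.10", "stage": "C2", "created": now - timedelta(hours=5)},
    {"value": "192.0.2.11", "stage": "CnC", "created": now - timedelta(hours=30)},
    {"value": "192.0.2.12", "stage": "Command & Control",
     "created": now - timedelta(hours=72)},
    {"value": "192.0.2.13", "stage": "Delivery", "created": now - timedelta(hours=1)},
]

# An exact-match search on the raw stage value misses every variant spelling.
naive = [
    ioc["value"]
    for ioc in iocs
    if ioc["created"] >= now - timedelta(hours=48) and ioc["stage"] == CANONICAL_C2
]
matched = recent_c2(iocs, now)
```

With this sample data the exact-match search finds nothing, while the normalized search returns the two C2 indicators seen within the 48-hour window.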
Here are two strategies to approach this challenge:
- Normalization. This approach involves defining a schema of standardized values. For example, the “Impact” indicator attribute can only come from a list of possible values: low, medium, high and critical. Standardized values prevent segmentation by ensuring that all values within the system are limited to approved values. When coming up with a predefined list of values, we will have to consider future flexibility to account for the constantly changing threat landscape and the ways to describe adversaries’ tools, techniques, and procedures (TTPs). This also results in one of the challenges with standardizing values: creating a comprehensive schema that will cover all different threat intel artifacts.
By using a central repository to aggregate and correlate threat intel, you can quickly see all IOC attributes available within your environment. This can be a good starting point for seeding your initial schema values and understanding what kind of data your providers are publishing. Once you build a schema, you can use the repository to automatically enforce standardized values by preventing users from creating new values within the system. When users want to add a new attribute to an IOC, they will have to select it from the schema list.
When it comes to normalizing values from different vendors, we can also employ a translation layer strategy. For example, if an incoming value matches any of “Fancy Bear,” “Operation Pawn Storm,” “Strontium,” “Sednit,” “Sofacy” or “Tsar Team,” we can rename the output value to the organization’s designated name: APT28. This “rosetta stone” allows the various values used by different vendors to be remapped and reduces confusion for analysts. One thing to consider with this strategy is whether to store the original value from the vendor to preserve data integrity in the future. It is also important to note that normalization often requires work on the analysts’ part, as the mappings will eventually become their own.
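The schema enforcement and translation layer described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any product: the `SCHEMA` and `ADVERSARY_ALIASES` tables and both function names are hypothetical, and only the “Impact” values and the APT28 aliases are taken from the text. The sketch keeps the vendor's original value next to the designated name, which is one possible answer to the data integrity question raised above.

```python
# Schema of approved values per attribute (the "Impact" list is from the article).
SCHEMA = {"Impact": {"low", "medium", "high", "critical"}}

# Vendor names for the same adversary, mapped to the organization's chosen name.
ADVERSARY_ALIASES = {
    name.casefold(): "APT28"
    for name in ["Fancy Bear", "Operation Pawn Storm", "Strontium",
                 "Sednit", "Sofacy", "Tsar Team"]
}


def set_attribute(ioc, name, value):
    """Attach an attribute only if its value appears in the schema list."""
    allowed = SCHEMA.get(name)
    if allowed is None or value.casefold() not in allowed:
        raise ValueError(f"{name}={value!r} is not in the schema")
    ioc.setdefault("attributes", {})[name] = value.casefold()


def translate_adversary(vendor_value):
    """Return the designated name together with the untouched vendor value."""
    cleaned = vendor_value.strip()
    designated = ADVERSARY_ALIASES.get(cleaned.casefold(), cleaned)
    return {"adversary": designated, "original_value": vendor_value}


ioc = {"value": "192.0.2.10"}          # invented example indicator
set_attribute(ioc, "Impact", "High")   # accepted and stored as "high"
record = translate_adversary("Sofacy")  # remapped to APT28, original kept
```

Values outside the schema, such as an “Impact” of “severe”, raise an error instead of silently creating a new variant. Unknown adversary names pass through unchanged so an analyst can decide later whether they need a mapping.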
- Machine learning. In this approach, we leverage machine learning as an innovative way to predict the meaning of various types of values. With a sufficiently large dataset, we can analyze the overlap of different artifacts such as adversary names or malware families and reduce the variants into a single shared value. For example, the following malware family names can be distilled into a common shared value: Ramnit
An otherwise manually intensive process of pivoting back and forth between IOCs can quickly be solved by applying some novel machine learning. Not only does this reduce confusion with a simplified common language, but we can also derive new, interesting relationships that may exist between different objects. This is especially true when we start to look at intrusion datasets that share common artifacts. We can start to build an understanding of how various events are tied together as well as the various vendor-applied names used to describe the attacks. As an organization’s threat library grows, the predictive accuracy of machine learning can increase with the addition of new threat artifact relationships. However, it is important to remember that machine learning will always require analysts to correct any mistakes or add new values it missed. In other words, machine learning is NOT a silver bullet and does require human oversight.
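As a rough illustration of the overlap analysis, the sketch below groups names whose associated indicators largely coincide. It is a deliberately simple stand-in (Jaccard similarity plus union-find grouping), not a trained model and not the method any vendor uses. Every name other than Ramnit, every indicator, and the 0.5 threshold are invented for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of indicators two names have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_names(observations, threshold=0.5):
    """Group names whose indicator sets overlap at or above `threshold`."""
    parent = {name: name for name in observations}

    def find(name):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(observations, 2):
        if jaccard(observations[a], observations[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in observations:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical vendor-applied names and the indicators each was seen with.
observations = {
    "Ramnit": {"hash-1", "hash-2", "hash-3", "192.0.2.10"},
    "vendor-b-name": {"hash-1", "hash-2", "hash-3"},
    "vendor-c-name": {"hash-2", "hash-3", "192.0.2.10"},
    "unrelated-family": {"hash-9", "198.51.100.7"},
}
suggested = group_names(observations)
```

Here the three overlapping names end up in one suggested group and the unrelated family stays on its own. The output is only a suggestion for an analyst to confirm or correct, which matches the human oversight described above.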
To start to reverse the upward trend of breaches each year, organizations need to devise ways to make better use of the threat intelligence they have. These two strategies for tackling data segmentation within your team will help. As your team starts building requirements for your threat intelligence repository, fine details like indicator normalization will play a critical role in the effectiveness and usability of your data. Balancing flexibility against preventing data schema abuse is a persistent challenge within the industry, but one that is surmountable with the right strategies, tools and talent working in concert.
About the Author
Wayne Chiang, Chief Architect and Co-Founder, ThreatQuotient
Wayne is a cybersecurity professional with a passion for implementing elegant solutions to complex problems and perpetually optimizing everything he touches. He leverages his cross-functional industry experience in software engineering and cybersecurity to develop innovative strategies for mitigating risk from advanced cyber threats. Wayne is also exceptionally precise and adept in describing his accomplishments in the third person.