The only question that remained was how the threat actor got the key in the first place. MSA signing keys are closely guarded by Microsoft due to their sensitivity. Well, ask no more, as Microsoft has released a report explaining how they believe the key was compromised. Of course, as with many reports of this type, it raises almost as many questions as it answers. Before I dive in and give you my thoughts on this, I want to say that the Microsoft team did a great job in identifying the root cause. I also want to add that nation-state actors have massive amounts of resources to task on a target like Microsoft. With over 200,000 employees and massive amounts of data floating around on the darker parts of the internet, phishing (including spearphishing) is probably an every-minute thing for them. Still, there are some clear lapses and failures on the part of Microsoft, and while I can, and will, applaud them for the good, I am also going to call out the bad.
Microsoft’s report creates a timeline of the incident that begins back in 2018 with the creation of a particular consumer MSA signing key. This key was then expired in April 2021. Shortly after the expiration, there was a crash of a particular consumer signing system (which one was not described) and a crash dump was created. For those who are not familiar, a crash dump is a snapshot of everything going on at the time of an application or service crash. These dumps include whatever is resident in memory at the time of the crash. In the case of the signing service, the crash dump contained cryptographic information about a signing key. This dump is created inside the production area (a high-side area). The fact that a crash dump contains this type of information is not new; it happens all the time. Because of this, Microsoft has a built-in system to scrub this type of information from any crash dumps related to the signing process. After this system runs, there is a check of the file to ensure that the scrub worked. The crash dump is then moved to a low-side area (the corporate debug environment).
In this particular case the scrub failed due to a race condition. A race condition, in this context, is when more than one process tries to access the same file at the same time. One process will win and the other will generate an error, typically “file in use” or “access denied”. This type of issue is common with security tools, especially when there is an overlap in what they are doing. While working at Cylance we saw this quite a bit when Cylance Protect was combined with an MDR like Carbon Black; one or the other would win in the race to check out the file. What other process was looking at the dump file? Microsoft is not telling us, but it could be something like a Data Loss Prevention system that was cataloging the file and, due to its size, had not released it before the scrubber tried to do its job. That is a guess on my part, but it is not out of the realm of possibility when you look at the details. After the failure of the scrub there was a cascading failure: the check of the dump file for any remaining cryptographic data also missed the key (I never thought I would see a resonance cascade, let alone cause one! For you Half-Life fans). The dump file was then moved to the debug area, still containing the key.
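To make the race concrete, here is a quick sketch of my own (not Microsoft’s code) showing two workers contending for the same dump file. A threading lock stands in for an exclusive file handle, and the DLP scanner and scrubber names are purely hypothetical:

```python
import threading
import time

file_lock = threading.Lock()  # stand-in for an exclusive OS-level file handle

def dlp_scan(path: str) -> None:
    # Hypothetical long-running job (e.g. a DLP catalog pass) that holds the file.
    with file_lock:
        print(f"[dlp] scanning {path} ...")
        time.sleep(2)  # a large dump file keeps the handle open for a while
    print("[dlp] released the file")

def scrub_dump(path: str) -> None:
    # The scrubber tries to grab the same file while the scan still holds it.
    if not file_lock.acquire(blocking=False):
        # This is the moment that needs to fail closed: alert and retry,
        # not silently continue the pipeline.
        print(f"[scrub] ERROR: {path} is in use -- scrub did not run")
        return
    try:
        print(f"[scrub] scrubbing key material from {path}")
    finally:
        file_lock.release()

scanner = threading.Thread(target=dlp_scan, args=("signing_service.dmp",))
scanner.start()
time.sleep(0.1)  # let the scanner win the race, as appears to have happened here
scrub_dump("signing_service.dmp")
scanner.join()
```

Whichever process loses the race gets the error; the damage comes from what happens next, when that error is swallowed instead of acted on.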
To pause here, let’s talk about the comedy of errors we can already see. A high-side service crashes, and Microsoft already knows the dump is likely to contain sensitive data (that is why the scrubber exists), yet there was no flag when the scrubber tied to this type of sensitive data failed. Next, the sanity check failed to see the sensitive data despite the failure of the scrubber. Here we have two events that, in context, are massive red flags. The failure of the initial scrub (for any reason) should have generated a flag for manual review, or for a rescrub. A context rule should have been in place (in my opinion) to flag and alert when a file that failed a scrub then came back clean on the sanity check; I sketch that idea below. To those who say this is overly burdensome due to the volume involved, I would reply that the sensitivity of the data warrants the controls. Next, the data was moved from high-side to low-side, into a debug area in the corporate environment, without a second check of the data (the scrub is the action, not a check). Even with a scrub and two checks you are talking about moving a dump file of a sensitive process into an insecure area. This is not exactly best practice.
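Here is roughly what I mean by failing closed and by that context rule, sketched in Python. Again, this is my illustration of the policy, not Microsoft’s pipeline; every name in it is made up:

```python
from dataclasses import dataclass

@dataclass
class DumpResult:
    path: str
    scrub_succeeded: bool        # did the scrubber run to completion?
    check_found_secrets: bool    # did the post-scrub sanity check find key material?

def alert(message: str) -> None:
    # Stand-in for paging, ticketing, or SIEM alerting.
    print(f"ALERT: {message}")

def context_rule(result: DumpResult) -> None:
    # The context rule described above: a failed scrub followed by a "clean"
    # check is itself suspicious and should trigger a manual review.
    if not result.scrub_succeeded and not result.check_found_secrets:
        alert(f"{result.path}: scrub failed but check reported clean -- manual review required")

def ok_to_move_low_side(result: DumpResult) -> bool:
    if not result.scrub_succeeded:
        # Any scrub failure is a hard stop: re-scrub or review, never move.
        alert(f"scrub failed for {result.path}; holding in the high-side area")
        return False
    if result.check_found_secrets:
        # Scrub "succeeded" but secrets remain: also a hard stop.
        alert(f"sanity check found key material in {result.path}")
        return False
    return True

# The scenario from the report: the scrub failed and the check still came back clean.
dump = DumpResult("signing_service.dmp", scrub_succeeded=False, check_found_secrets=False)
context_rule(dump)
print("move to debug area:", ok_to_move_low_side(dump))
```

Run against the scenario in the report, both the context rule and the gate fire instead of quietly letting the dump move low-side.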
The good news is that Microsoft says they have fixed the scrubber to prevent the race condition and added additional monitoring, validation, and checking to ensure the files are clean, but they have not talked about changing where this type of sensitive data is dropped to be reviewed and worked on.
Now, all of the above happened in April of 2021. The next step is not entirely clear from the Microsoft report. The report says that a developer account with access to the debug area in the corporate environment was compromised. The actor that compromised this account used correct credentials and potentially a valid MFA option (not a hardware token, just normal MFA). When this compromise happened is not clear. The language of the report in some areas makes me think the account was compromised prior to the crash in April of 2021, but other areas make it seem like it happened after. Either way, a developer account was compromised that had access to the debug area of Microsoft’s corporate environment. There is no indication, at this time, that the threat actor got into the secure production area or had any hand in the crash and subsequent failures of the scrubbing and checking systems. There is some conversation in certain circles about whether the developer account was targeted or not. I believe it was, as this pattern has been established through analysis of other attacks of this nature. I also believe that this attack started as an attempt at a supply chain attack and, through a fortuitous event (finding the key in the dump file), turned into something else.
I also do not believe the threat actors accessed or attempted to access the production (high-side) network, not because of any technical insufficiency, but because they might have felt it would increase the risk of detection. Microsoft goes to great lengths to describe the protections on the production network, including the use of hardware tokens for authentication. It is very likely that the attackers knew, or suspected, this and left that part of the environment alone by choice. Remember, with most nation-state actors it is low and slow. Dwell time and information gathering are the goal, so the longer you can remain in the environment, the better.
So, we are up to the point where we have had a crash of the signing system, a cascading failure has allowed sensitive data to leak outside of a secure environment, and a threat actor has access to the environment where the data leaked. All that remains, in very simple terms, is for the threat actor to start going through “the trash” looking for a prize, and boy did they find one. Even without a signing key, crash dump files are very valuable bits of information, as they can be used to understand how a process functions, what other processes it relies on, and so on. This is why I do believe this started out as a general supply chain attack with an intelligence-gathering component (they almost always are). While I also believe that the targeting of a developer was intentional, there might have been a bonus in this particular account: I believe it belonged to an Exchange engineer. My reasoning will be explained below.
Now, we have said more than once that the signing key was an expired consumer key. Yet it was used to compromise enterprise and potentially GCC low-side tenants. Microsoft had an explanation for this, and I will use their own language and offer my thoughts on it after.
“As part of a pre-existing library of documentation and helper APIs, Microsoft provided an API to help validate the signatures cryptographically but did not update these libraries to perform this scope validation automatically (this issue has been corrected). The mail systems were updated to use the common metadata endpoint in 2022. Developers in the mail system incorrectly assumed libraries performed complete validation and did not add the required issuer/scope validation. Thus, the mail system would accept a request for enterprise email using a security token signed with the consumer key (this issue has been corrected using the updated libraries).”
This paragraph is why I believe the developer in question was on the Exchange team; the attackers appear to have known their consumer key would work before they tried it. It also highlights an issue with large companies (not just Microsoft): changes in the way things work and the documentation around them are often at odds, simply due to the massive amounts of documentation and changes that occur in routine cycles. In this case, a misreading of the documentation, along with a possible (Microsoft is not clear on this) missed review of the changes implemented, allowed the threat actor to use their shiny expired consumer signing key to access enterprise tenants. If there was a review, then the same ambiguity in the documentation allowed this flaw to be rolled out into a production environment in 2022, leaving the system vulnerable.
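To illustrate the class of bug Microsoft is describing, here is a deliberately simplified sketch using PyJWT and a shared secret. The real system used asymmetric MSA keys and Microsoft-internal libraries, and every issuer, claim, and key name here is a stand-in. The point is the difference between validating only the signature and also validating where the token came from:

```python
import jwt  # PyJWT

CONSUMER_KEY = "consumer-signing-secret"            # stand-in for the leaked consumer MSA key
CONSUMER_ISSUER = "https://login.live.com"          # illustrative consumer issuer
ENTERPRISE_ISSUER = "https://login.microsoftonline.com/contoso"  # what the mail system should demand

# A token minted with the consumer key, claiming an enterprise mailbox scope.
token = jwt.encode(
    {"iss": CONSUMER_ISSUER, "sub": "victim@contoso.com", "scp": "Mail.Read"},
    CONSUMER_KEY,
    algorithm="HS256",
)

# The flawed pattern: cryptographic validation only. The signature checks out,
# so the token is accepted even though it was issued under the consumer identity system.
claims = jwt.decode(token, CONSUMER_KEY, algorithms=["HS256"])
print("accepted (signature-only check), issuer:", claims["iss"])

# The corrected pattern: also validate the issuer (and, in the real system, the
# key's scope) before trusting the token for an enterprise tenant.
try:
    jwt.decode(token, CONSUMER_KEY, algorithms=["HS256"], issuer=ENTERPRISE_ISSUER)
except jwt.InvalidIssuerError:
    print("rejected: token was issued under the consumer issuer, not the enterprise tenant")
```

The developers assumed the helper library did the second half for them; it only did the first.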
We now have a perfect storm: a sophisticated threat actor inside the environment, in possession of a highly prized developer account with access to all sorts of data inside crash dump files from the production environment, an MSA signing key, AND a system that is missing proper validation checks. To get to this stage of the incident took some fortuitous events (the leakage of the key) and a sophisticated attacker who knew where to look for treasure.
After the release of the report there have been some questions about how Microsoft identified the race condition through their IR efforts, and I do want to address that here. Microsoft knew which signing key was compromised; they had the thumbprint. That information would have allowed them to search the environment for the data to see how it might have leaked. Once the information was found inside the crash dump, they could use the date/time stamp to find the crash event and check the systems around it. As a race condition will generate an error indicating an inability to access the file in question, that should have been relatively easy to do. The big challenge would have been the massive amount of logs and files to go through to find that nugget. This is the exceptionally impressive part to me. Although the IR and forensic team had a thumbprint and an idea of where it came from, they still had mountains of stuff to go through and then connect the dots. In some cases they did not have all of the logs, but they were still able to identify a highly likely path that the threat actor took to find and remove the data used for the account takeover attacks. I do not find the detection of a race condition all that surprising, as we used to find them during IRs where “fileless” malware was used through an Office product. I would see DLP systems holding onto the file looking for sensitive data, and the MDR would not be able to scan or quarantine the file. Race conditions are not unusual, and a lot of work has been put into finding ways to prevent and/or avoid them so that threats do not slip through.
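If I had to guess at the mechanics of that search, it would look something like the sketch below: sweep the candidate files for the key’s thumbprint, then pivot on the timestamps of any hits. This is my speculation, not Microsoft’s tooling, and the path and thumbprint value are made up:

```python
from pathlib import Path

# Made-up values: a placeholder thumbprint and a placeholder debug-environment path.
THUMBPRINT = "aabbccddeeff00112233445566778899aabbccdd"
DEBUG_ROOT = "/data/debug_environment"

def files_containing(needle: str, root: str) -> list[Path]:
    hits = []
    patterns = (needle.lower().encode("ascii"), needle.upper().encode("ascii"))
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()  # fine for a sketch; real dump sets would be streamed in chunks
        if any(p in data for p in patterns):
            hits.append(path)
    return hits

for hit in files_containing(THUMBPRINT, DEBUG_ROOT):
    # A hit's timestamps point you at the crash event, and the logs around that
    # window (scrubber errors, "file in use" failures) tell the rest of the story.
    print(hit, hit.stat().st_mtime)
```

The hard part is not the search itself; it is the sheer volume of material and the gaps in the logs you have to reason across.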
The Microsoft IR team did a fantastic job of digging in and identifying the chain of events that led to the original incident. It was a cascading failure of checks, combined with a vulnerability and a very sophisticated actor in the environment. When you are dealing with that much information sprawl it can be challenging to find the path an actor took to do their work. Microsoft should also be lauded for the transparency of the report as it is. Sure, there are some things we will never know, but that is expected in some cases (you do not want to give away the secret sauce). Microsoft states they have fixed the issues in the production environment that led to this particular failure by correcting the race condition, adding monitoring of and additional validation for the scrubber, adding additional monitoring of and validation on the check of the scrubbed data, and correcting the validation check issue in Exchange Online. They have also added an additional check in the debug environment looking for any signing data there (I would have preferred the debug area get moved into a more controlled environment, but at least this is a good step).
“Identified and resolved race Condition that allowed the signing key to be present in crash dumps
Enhanced prevention, detection, and response for key material erroneously included in crash dumps
Enhanced credential scanning to better detect presence of signing key in the debugging environment
Released enhanced libraries to automate key scope validation in authentication libraries, and clarified related documentation”
All of these are great steps and do show a commitment by Microsoft to correct the issues identified in this forensic review. I am sure it did not hurt that government agencies were doing the asking when it came to “how did this happen?”, but hey, we are getting improved security controls around critical data and a statement committing to a continuous improvement cycle on these very sensitive systems.
A quick (and simplified) summary of the report:
Microsoft found:
Inadequate controls around sensitive information and the processes that handle it.
Inadequate detection controls for the compromise of developer accounts.
Poor documentation and library configuration allowed an insecure configuration to be pushed to production (Exchange Online).
You can read the report for yourself (if you have not already).