AI researchers at Microsoft accidentally exposed terabytes of sensitive data, including private keys and passwords, while publishing an open source training data repository on GitHub. The exposure was caused by a misconfigured URL that granted permissions over an entire Azure storage account rather than read-only access. Microsoft said that no customer data was affected and that no other internal services were put at risk by the issue, which has since been resolved.
Cloud security provider Wiz discovered 38 TB of private Microsoft data that had been accidentally exposed by the company's artificial intelligence researchers. Wiz published its research in a blog post on Monday as part of a coordinated disclosure with Microsoft.
Understanding the mechanism
In Azure, a shared access signature (SAS) token is a signed URL that provides access to Azure Storage data. The access level is configured by the user: permissions range from read-only to full control, and the scope can be a single file, a container, or an entire storage account. Expiration is also fully customizable, allowing the user to create access tokens that never expire. This granularity gives users great flexibility, but it also carries the risk of granting too much access; in the most permissive case, a token can grant full control over an entire account, providing essentially the same level of access as the account key itself.
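The risky settings described above are all visible in the token's query parameters (`sp` for signed permissions, `srt` for signed resource types, `se` for signed expiry). As a rough illustration of how one might flag an overly permissive SAS URL, here is a stdlib-only sketch; the thresholds and the sample URL are made up for this example, and real audits would need to handle every SAS parameter:

```python
from urllib.parse import urlparse, parse_qs

def audit_sas_url(url: str) -> list[str]:
    """Flag risky settings in an Azure SAS URL (illustrative heuristic, not exhaustive)."""
    params = parse_qs(urlparse(url).query)
    warnings = []
    # 'sp' = signed permissions; anything beyond 'r' (read) and 'l' (list) is write-class access
    perms = params.get("sp", [""])[0]
    if set(perms) - set("rl"):
        warnings.append(f"write-class permissions granted: sp={perms}")
    # 'srt' = signed resource types on an account SAS; 'c' (container) and 's' (service)
    # widen the scope well beyond individual objects ('o')
    srt = params.get("srt", [""])[0]
    if "c" in srt or "s" in srt:
        warnings.append(f"broad resource scope: srt={srt}")
    # 'se' = signed expiry (ISO 8601); far-future dates behave like tokens that never expire
    expiry = params.get("se", [""])[0]
    if expiry >= "2040":
        warnings.append(f"very distant expiry: se={expiry}")
    return warnings

# A hypothetical URL resembling the incident: full control, account-wide, expiring in 2051
demo = ("https://myaccount.blob.core.windows.net/models"
        "?sv=2021-08-06&ss=b&srt=sco&sp=rwdlac&se=2051-10-05T00:00:00Z&sig=abc")
issues = audit_sas_url(demo)
```

On the hypothetical URL above, all three checks fire; a read-only, object-scoped, short-lived token passes cleanly.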
There are three types of SAS tokens: account SAS, service SAS, and user delegation SAS. The Microsoft researchers used the most popular type: account SAS tokens.
Creating an account SAS token is a simple process. As shown in the screenshot below, the user configures the token's scope, permissions, and expiration date, then generates it. Behind the scenes, the browser downloads the account key from Azure and signs the generated token with that key. This entire process takes place on the client side; it is not an Azure process, and the resulting token is not an Azure object.
Creation of a SAS token with a high level of privileges and an unlimited validity period.
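The client-side signing step boils down to an HMAC-SHA256 over a version-specific "string to sign", keyed with the base64-decoded account key. The sketch below shows that mechanism with the standard library; the key and the field list are made up for illustration (the exact fields and their order depend on the storage service version, as documented by Azure):

```python
import base64
import hashlib
import hmac

def sign_account_sas(account_key_b64: str, string_to_sign: str) -> str:
    """Compute the 'sig' parameter of an account SAS.

    Azure signs a newline-joined string to sign (account name, permissions,
    services, resource types, start, expiry, IP range, protocol, version, ...)
    with HMAC-SHA256, keyed with the base64-decoded account key.
    """
    key = base64.b64decode(account_key_b64)
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode()

# Illustrative only: a fake account key and a simplified string-to-sign
demo_key = base64.b64encode(b"not-a-real-account-key").decode()
fields = ["myaccount", "rwdlac", "b", "sco", "",
          "2051-10-05T00:00:00Z", "", "https", "2021-08-06", ""]
sig = sign_account_sas(demo_key, "\n".join(fields))
```

Nothing in this computation touches Azure's servers, which is exactly why Azure has no record that the token exists until someone presents it.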
For this reason, when a user creates a high-permission token that never expires, an administrator has no way of knowing that the token exists or where it is circulating. Revoking a token is not simple either: it requires rotating the account key that signed the token, which also invalidates every other token signed with the same key. These unique pitfalls make the service an easy target for attackers looking for exposed data.
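The all-or-nothing nature of revocation follows directly from the signing scheme: validity is just "does the signature verify under the current key". This stdlib sketch models that mechanism (it is a simplification of what Azure does server-side, with made-up token payloads):

```python
import base64
import hashlib
import hmac
import secrets

def sign(key: bytes, payload: str) -> str:
    return base64.b64encode(
        hmac.new(key, payload.encode(), hashlib.sha256).digest()
    ).decode()

def verify(key: bytes, payload: str, sig: str) -> bool:
    # A token is "valid" iff its signature verifies under the current account key
    return hmac.compare_digest(sign(key, payload), sig)

account_key = secrets.token_bytes(32)
token_a = sign(account_key, "sp=r&se=2030-01-01")        # a modest read-only token
token_b = sign(account_key, "sp=rwdlac&se=2051-01-01")   # an over-privileged token

# Revoking token_b means rotating the account key...
account_key = secrets.token_bytes(32)

# ...which also kills every other token signed with the old key:
verify(account_key, "sp=r&se=2030-01-01", token_a)       # False
verify(account_key, "sp=rwdlac&se=2051-01-01", token_b)  # False
```

There is no per-token revocation: the key is the only root of trust, so rotating it is a blunt instrument that takes out well-behaved tokens along with the leaked one.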
In addition to the risk of accidental exposure, these pitfalls make SAS tokens an effective tool for attackers seeking to maintain persistence in compromised storage accounts. A recent Microsoft report indicates that attackers are taking advantage of the lack of monitoring to plant privileged SAS tokens as backdoors. Since the issuance of a token is not logged anywhere, there is no way to know that one has been issued and act accordingly.
An overly permissive SAS token
According to Wiz security researchers Hillai Ben-Sasson and Ronnie Greenberg, authors of the study, Microsoft's artificial intelligence research team included an overly permissive shared access signature (SAS) token in the URL, accidentally exposing 38 TB of private data.
The exposed data included personal backups of two Microsoft employees, which contained passwords for Microsoft services, private keys, and more than 30,000 internal Microsoft Teams messages from 359 Microsoft employees. The URL, which had exposed this data since 2020, was also misconfigured to allow full control rather than read-only access, meaning anyone who knew where to look could potentially delete, replace, or inject malicious content into it.
In addition to its overly broad access scope, the token was misconfigured to grant full control permissions instead of read-only. This meant an attacker could not only view every file in the storage account, but also delete or overwrite existing files.
This is especially interesting given the repository's original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file is in ckpt format, a format produced by the TensorFlow library. It is serialized with Python's pickle formatter, which by design allows arbitrary code execution. This means an attacker could have injected malicious code into every AI model in that storage account, infecting every user who trusted Microsoft's GitHub repository.
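The danger of pickle-based formats is easy to demonstrate: pickle's `__reduce__` hook lets an object specify an arbitrary callable to invoke during deserialization. The minimal, benign example below uses `eval` on a harmless expression as a stand-in; a real attacker embedding a payload in a model file would call something like `os.system` instead:

```python
import pickle

class Payload:
    # __reduce__ tells pickle how to reconstruct the object; whatever
    # callable it returns is invoked during pickle.loads. Here it is a
    # benign eval of "6 * 7" standing in for attacker-controlled code.
    def __reduce__(self):
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # eval("6 * 7") runs during deserialization
print(result)  # 42
```

Simply loading the file is enough to run the embedded code, which is why loading pickled model checkpoints from untrusted sources is considered unsafe.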
“AI offers enormous potential for technology companies,” said Ami Luttwak, co-founder and CTO of Wiz. “However, as data scientists and engineers work to bring new AI solutions into production, the massive volumes of data they handle require additional security controls. With many development teams needing to manipulate massive amounts of data, share it with colleagues, or collaborate on public open source projects, cases like Microsoft's are becoming increasingly difficult to track and prevent.”
Wiz said it shared its findings with Microsoft on June 22, 2023, and that Microsoft revoked the SAS token two days later, on June 24. On July 7, the SAS token on GitHub was replaced. Microsoft said it completed its investigation into potential organizational impact on August 16. In a blog post, the Microsoft Security Response Center said that “no customer data was exposed, and no other internal services were put at risk because of this issue.”
Microsoft said that, as a result of Wiz's research, it has expanded GitHub's secret scanning service, which monitors all public open source code changes for credentials and other secrets exposed in the clear, to cover any SAS token with overly permissive expirations or privileges.
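At its core, this kind of scanning means matching committed text against patterns for known secret shapes. GitHub's actual rules are far more elaborate and are not public in full, so the sketch below is only a simplified illustration of the idea, using a made-up code snippet containing a SAS-style URL:

```python
import re

# Simplified pattern: an Azure Blob Storage URL carrying a SAS signature.
# Real secret-scanning rules are much more sophisticated than this.
SAS_PATTERN = re.compile(
    r"https://[a-z0-9]+\.blob\.core\.windows\.net/\S*[?&]sig=[A-Za-z0-9%+/=]+"
)

def find_sas_tokens(text: str) -> list[str]:
    """Return candidate SAS URLs found in a chunk of committed text."""
    return SAS_PATTERN.findall(text)

# Hypothetical committed line containing a leaked SAS URL
snippet = ('MODEL_URL = "https://myaccount.blob.core.windows.net/'
           'models/data.ckpt?sv=2021-08-06&sp=rwdlac&sig=Zm9vYmFy%3D"')
hits = find_sas_tokens(snippet)
```

A scanner like this could then feed matches into a check such as the expiry/permission audit described earlier, flagging only the tokens that are long-lived or over-privileged.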
And you?
What are the potential consequences of Microsoft AI researchers accidentally disclosing sensitive data?
How can Microsoft improve the security of its open source training data and prevent similar incidents from happening in the future?
What do you think of GitHub's secret scanning service, which monitors public changes to open source code to detect credentials and other secrets exposed in the clear?
Do you think AI researchers should be more careful when sharing their open-source training data with the scientific community?