Microsoft's Open-Source AI Project Leaks 38TB of Personal Data

Microsoft's AI research team inadvertently exposed a staggering 38 terabytes of personal data while sharing open-source training data on GitHub, Engadget reports.

The breach, discovered by cybersecurity firm Wiz, has raised concerns about the security of AI projects and the handling of sensitive information.

What Happened

In a bid to foster collaboration and offer valuable resources to the AI community, Microsoft's AI research division decided to upload training data on GitHub. 

However, this seemingly well-intentioned gesture took an unexpected turn. Among the files shared was a link that exposed backups of Microsoft employees' computers, inadvertently revealing a trove of sensitive information.

Wiz's researchers were quick to spot the security lapse and reported it to Microsoft on June 22, 2023. The exposed data included not only backups of two former employees' workstations but also passwords, secret keys, and over 30,000 internal Microsoft Teams messages from hundreds of employees.

Azure's SAS Tokens: The Culprit

The breach stemmed from using Azure's Shared Access Signature (SAS) tokens, a feature that facilitates controlled access to Azure Storage data. 

While SAS tokens offer fine-grained control over data access, their misconfiguration in this instance led to the exposure of the entire storage account.

Microsoft's researchers inadvertently configured a token that was meant to grant access only to specific files so that it shared the complete storage account instead.

This oversight exposed not only the intended open-source models but also vast amounts of private data, including sensitive communications.
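To illustrate how the scope of a SAS token makes the difference, here is a minimal sketch using Azure's Python SDK (azure-storage-blob). The account, key, container, and blob names are placeholders for illustration, not details from the actual incident:

```python
# Sketch: how SAS scope differs, using the azure-storage-blob Python SDK.
# Account name, key, container, and blob names are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import (
    AccountSasPermissions,
    BlobSasPermissions,
    ResourceTypes,
    generate_account_sas,
    generate_blob_sas,
)

ACCOUNT = "examplestorageaccount"   # placeholder
KEY = "<account-key>"               # placeholder

# Narrow scope: read-only access to a single blob, expiring in one hour.
blob_sas = generate_blob_sas(
    account_name=ACCOUNT,
    container_name="public-models",
    blob_name="model-weights.ckpt",
    account_key=KEY,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

# Broad scope: an account-level SAS like this covers every container and blob
# in the storage account -- the kind of over-permissive token described above.
account_sas = generate_account_sas(
    account_name=ACCOUNT,
    account_key=KEY,
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True, write=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=365),
)
```

Anyone holding a URL built from the second token can list and read (and here, even write) everything in the account until it expires, which is why a long-lived, account-wide SAS shared publicly is so dangerous.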


Potential Risks and Impact

The incident's severity lies in the potential risks it posed. Had malicious actors discovered this vulnerability, they could have injected harmful code into AI models stored in the affected account, potentially impacting users who trust Microsoft's GitHub repository.

Moreover, this breach is a wake-up call for organizations as they increasingly leverage AI and work with massive training datasets. The need for stringent security checks and safeguards in handling such data is now more critical than ever.

GitHub's Role and Security Measures

GitHub, the platform hosting the open-source data, plays a pivotal role in detecting such breaches. 

Its secret scanning service monitors public open-source code changes for plaintext exposure of credentials and secrets, including SAS tokens. 

The service did detect the SAS token in question, but the finding was initially marked as a "false positive." GitHub has since expanded its detection capabilities to catch overly permissive SAS tokens.

SAS Tokens Explained

Shared Access Signature (SAS) tokens are Azure's mechanism for delegating access to data within a storage account.

These tokens offer granular control over what resources a client can access, what operations they can perform, and for how long. However, as the incident illustrates, creating and handling SAS tokens requires meticulous attention to detail.

Azure Storage recommends several best practices when working with SAS URLs, including applying the Principle of Least Privilege, using short-lived SAS tokens, handling them carefully, and having a revocation plan. These practices can significantly reduce the risk of unintended access or abuse.
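A minimal sketch of what those recommendations can look like in practice follows, again using the azure-storage-blob Python SDK together with azure-identity. It issues a short-lived, read-only token for a single blob, signed with a user delegation key rather than the account key; the storage account URL, container, and blob names are placeholders:

```python
# Sketch of the recommended pattern: a short-lived, read-only SAS for one blob,
# signed with a user delegation key (Azure AD) instead of the account key.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobSasPermissions, BlobServiceClient, generate_blob_sas

account_url = "https://examplestorageaccount.blob.core.windows.net"  # placeholder
client = BlobServiceClient(account_url, credential=DefaultAzureCredential())

start = datetime.now(timezone.utc)
expiry = start + timedelta(minutes=15)  # short-lived: minutes, not years

# A user delegation key can be revoked, invalidating any SAS signed with it --
# a cleaner revocation path than rotating the whole account key.
delegation_key = client.get_user_delegation_key(
    key_start_time=start, key_expiry_time=expiry
)

sas = generate_blob_sas(
    account_name="examplestorageaccount",        # placeholder
    container_name="public-models",              # placeholder
    blob_name="model-weights.ckpt",              # placeholder
    user_delegation_key=delegation_key,
    permission=BlobSasPermissions(read=True),    # least privilege: read only
    start=start,
    expiry=expiry,
)
print(f"{account_url}/public-models/model-weights.ckpt?{sas}")
```

The key ideas are the same ones Azure's guidance stresses: grant only the permissions a recipient actually needs, keep the token's lifetime short, and retain a way to revoke access if a URL leaks.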

Stay posted here at Tech Times.
