- Truffle Security found thousands of valid secrets, such as API keys and passwords, in Common Crawl
- The archives are used to train some of the biggest LLMs today
- The researchers notified the vendors and helped fix the problem
Cybersecurity researchers have found thousands of login credentials and other secrets in the Common Crawl dataset.
Common Crawl is a nonprofit organization that provides a freely accessible archive of web data, collected through large-scale web crawling. As of recent estimates, the organization hosts over 250 petabytes of web data, with monthly crawls adding several petabytes more.
Security researchers from Truffle Security recently analyzed roughly 400 terabytes of data, collected from 2.67 billion web pages archived in 2024. They said they found almost 12,000 valid secrets (API keys, passwords, and similar) hardcoded in the archive. The secrets spanned more than 200 different types, but the majority were for Amazon Web Services (AWS), Mailchimp, and WalkScore.
Training AI
“Nearly 1,500 unique Mailchimp API keys were hard coded in front-end HTML and JavaScript,” the researchers said, noting many secrets were found in multiple instances. In fact, almost two-thirds (63%) were found on multiple pages, with one WalkScore API key appearing “57,029 times across 1,871 subdomains”.
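To make the idea of a key being "found on multiple pages" concrete, here is a minimal sketch of how one might look for key-shaped strings in page bodies and track which pages each one appears on. It is not Truffle Security's tooling. The two regular expressions are assumptions based on commonly cited key shapes (an AWS access key ID starting with `AKIA`, and a Mailchimp key made of 32 hex characters plus a `-usNN` data center suffix), and the function names are hypothetical. A pattern match only flags a candidate; confirming that a secret is valid would require a separate check against the service.

```python
import re
from collections import defaultdict

# Assumed key shapes for illustration only; real scanners use many more
# detectors and verify each candidate against the issuing service.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_candidates(pages):
    """Map each (kind, secret) pair to the set of page URLs where it appears.

    `pages` is a dict of URL -> page body (HTML or JavaScript as text).
    """
    seen = defaultdict(set)
    for url, body in pages.items():
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(body):
                seen[(kind, match.group(0))].add(url)
    return seen


def reused(seen):
    """Keep only the candidates that show up on more than one page."""
    return {key: urls for key, urls in seen.items() if len(urls) > 1}
```

Calling `reused(find_candidates(pages))` on a small dictionary of URL to page text returns only the candidates that recur, which is the kind of reuse the researchers describe.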
Software developers often hardcode login credentials and other secrets to simplify their work during development. However, many seem to forget to remove them before the code goes live, leaving an easy way in for malicious actors.
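One common alternative to hardcoding is to read the secret from the environment at runtime, so it never appears in the source. The sketch below assumes a hypothetical variable name, `EXAMPLE_API_KEY`, and a hypothetical helper, `load_api_key`; neither comes from the research.

```python
import os


def load_api_key(name="EXAMPLE_API_KEY"):
    """Read a secret from an environment variable (the name is hypothetical).

    Fails loudly when the variable is missing, so there is no temptation
    to fall back to a hardcoded value.
    """
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key
```

This only helps for code that runs on a server. Anything shipped to the browser in HTML or JavaScript is public by definition, so a key that must stay private should be used from back-end code instead.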
Cybercriminals could scour the archives for the secrets themselves, but there is an even bigger problem here. Many of the world’s most popular large language models (LLMs), such as the ones from OpenAI, DeepSeek, Google, Meta, and others, are trained using Common Crawl’s archives, meaning that crooks could use generative AI to uncover login credentials and other secrets.
LLMs are not trained on entirely raw data. It is first filtered to remove sensitive information, but the question remains how well those filters work, and how many secrets make it through.
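As a rough illustration of what such a filter might look like, and why it can miss things, here is a minimal pattern-based redaction sketch. It does not describe how any LLM vendor actually cleans its training data. The pattern covers only two assumed key shapes (AWS access key IDs and Mailchimp API keys), and `redact` is a hypothetical helper.

```python
import re

# Assumed key shapes for illustration; a secret in any other format
# passes through this filter untouched.
SECRET_PATTERN = re.compile(
    r"\bAKIA[0-9A-Z]{16}\b|\b[0-9a-f]{32}-us[0-9]{1,2}\b"
)


def redact(text, placeholder="[REDACTED_SECRET]"):
    """Replace recognized secrets and report how many were replaced."""
    cleaned, count = SECRET_PATTERN.subn(placeholder, text)
    return cleaned, count
```

A filter like this catches only the formats it knows about, so a password or a key from an unlisted service would survive unchanged.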
That said, Truffle Security says it reached out to the affected vendors and helped them revoke the compromised keys.
Via BleepingComputer