Researchers from Google have reportedly uncovered a method to access training data used for ChatGPT.
These researchers discovered that using specific keywords could prompt ChatGPT to release parts of its training dataset.
A notable instance, shared in a blog post accompanying the study, involved the AI divulging what seemed to be an actual email address and phone number in response to a continuous prompt of the word “poem.”
Additionally, a similar exposure of training data was achieved by asking the model to continuously repeat the word “company.”
Describing their approach as “kind of silly,” the researchers stated in the blog post, “It’s wild to us that our attack works and should’ve, would’ve, could’ve been found earlier.”
Their study revealed that with an investment of just $200 in queries, they could extract over 10,000 unique verbatim memorized training examples. They speculated that adversaries could potentially extract much more data with a larger budget.
The AI model behind ChatGPT is known to have been trained on text databases from the internet, encompassing approximately 300 billion words, or 570 GB, of data.
These findings come at a time when OpenAI faces several lawsuits concerning the secretive nature of ChatGPT’s training data and essentially show a reliable method of ‘reverse engineering’ the system to expose at least some pieces of information that could indicate copyright infringement.
Among the lawsuits, one proposed class-action suit accuses OpenAI of covertly using extensive personal data, including medical records and children’s information, for training ChatGPT.
Additionally, groups of authors are suing the AI company, alleging that it used their books for training the chatbot without consent.
However, even if ChatGPT was comprehensively proven to contain copyright information, it wouldn’t necessarily prove infringement.
How the study worked
The study was performed by a team of researchers from Google DeepMind and various universities.
Here are five key steps that summarize the study:
- Vulnerability in ChatGPT: The researchers discovered a method to extract several megabytes of ChatGPT’s training data using a simple attack, spending about $200. They estimated that more investment would make extracting about a gigabyte of the dataset possible. The attack involved prompting ChatGPT to repeat a word indefinitely, causing it to regurgitate parts of its training data, including sensitive information like real email addresses and phone numbers.
- The findings: The study underscores the importance of testing and red-teaming AI models, particularly those in production and those that have undergone alignment processes to prevent data regurgitation. The findings highlight a latent vulnerability in language models, suggesting that existing testing methodologies may not be adequate to uncover such vulnerabilities.
- Patching vs. fixing vulnerabilities: The researchers distinguish between patching an exploit and fixing the underlying vulnerability. While specific exploits (like the word repetition attack) can be patched, the deeper issue lies in the model’s tendency to memorize and divulge training data.
- Methodology: The team used internet data and suffix array indexing to match the output of ChatGPT with pre-existing internet data. This method allowed them to confirm that the information divulged by ChatGPT was indeed part of its training data. Their approach demonstrates the potential for extensive data recovery from AI models under specific conditions.
- Future implications: The study contributes to growing research on AI model security and privacy concerns. The findings raise questions about machine learning systems’ safety and privacy implications and call for more rigorous and holistic approaches to AI security and testing.
Overall, an intriguing study provides critical insights into the vulnerabilities of AI models like ChatGPT and underscores the need for ongoing research and development to ensure the security and integrity of these systems.
On a slight tangent, users on X found that asking ChatGPT to repeat the same word repeatedly led to some strange results, like the model saying it was ‘conscious’ or ‘angry.’