ChatGPT’s abilities are devolving over time.
At least, that’s what thousands of users are arguing on Twitter, Reddit, and the Y Combinator forum.
Casual, professional, and business users alike are claiming that ChatGPT’s abilities have worsened across the board, including its language, math, coding, creativity, and problem-solving skills.
Peter Yang, a product lead at Roblox, joined the snowballing debate, stating, “The writing quality has gone down, in my opinion.”
Others said the AI has become “lazy” and “forgetful” and has grown increasingly unable to perform functions that seemed like a breeze a few weeks ago. One tweet discussing the situation gained a tremendous 5.4m views.
GPT-4 is getting worse over time, not better.
Many people have reported noticing a significant degradation in the quality of the model responses, but so far, it was all anecdotal.
But now we know.
At least one study shows how the June version of GPT-4 is objectively worse than… pic.twitter.com/whhELYY6M4
— Santiago (@svpino) July 19, 2023
Others took to OpenAI’s developer forum to highlight how GPT-4 had started to repeatedly loop outputs of code and other information.
To the casual user, fluctuations in the performance of GPT models, both GPT-3.5 and GPT-4, are probably negligible.
However, this is a severe issue for the thousands of businesses that have invested time and money into leveraging GPT models for their processes and workloads, only to find out they don’t work as well as they once did.
Moreover, fluctuations in proprietary AI model performance raise questions about their ‘black box’ nature.
The inner workings of black-box AI systems like GPT-3.5 and GPT-4 are hidden from the external observer – we only see what goes in (our inputs) and what comes out (the AI’s outputs).
OpenAI discusses ChatGPT’s decline in quality
Before Thursday, OpenAI had merely shrugged off claims that their GPT models were worsening in performance.
In a tweet, OpenAI’s VP of Product & Partnerships, Peter Welinder, dismissed the community’s sentiments as ‘hallucinations’ – but this time, of human origin.
He said, “When you use it more heavily, you start noticing issues you didn’t see before.”
No, we haven’t made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one.
Current hypothesis: When you use it more heavily, you start noticing issues you didn’t see before.
— Peter Welinder (@npew) July 13, 2023
Then, on Thursday, OpenAI addressed issues in a short blog post. They drew attention to the gpt-4-0613 model, introduced last month, stating that while most metrics showed improvements, some experienced a dip in performance.
In response to the potential issues with this new model iteration, OpenAI is allowing API users to choose a specific model version, such as gpt-4-0314, instead of defaulting to the latest version.
Further, OpenAI acknowledged that its evaluation methodology isn’t flawless and recognized that model upgrades are sometimes unpredictable.
While this blog post marks official recognition of the problem, there’s little explanation for what behaviors have changed and why.
What does it say about AI’s trajectory when new models are seemingly poorer than their predecessors?
Not long ago, OpenAI was arguing that artificial general intelligence (AGI) – superintelligent AI that surpasses human cognitive abilities – is ‘just a few years away.’
Now, they’re conceding that they don’t understand why or how their models are exhibiting certain drops in performance.
ChatGPT’s decline in quality: what’s the root cause?
Prior to OpenAI’s blog post, a recent research paper from Stanford University and the University of California, Berkeley, presented data that describe fluctuations in the performance of GPT-4 over time.
The study’s findings fueled the theory that GPT-4’s skills were diminishing.
In their study titled “How Is ChatGPT’s Behavior Changing over Time?” researchers Lingjiao Chen, Matei Zaharia, and James Zou examined the performance of OpenAI’s large language models (LLMs), specifically GPT-3.5 and GPT-4.
They assessed March and June model iterations on solving mathematical problems, generating code, answering sensitive questions, and visual reasoning.
The most striking result was a massive drop in GPT-4’s ability to identify prime numbers, plummeting from an accuracy of 97.6 percent in March to a mere 2.4 percent in June. Oddly enough, GPT-3.5 showed improved performance over the same period.
Here’s a breakdown of the 4 tests:
Mathematical reasoning performance:
- In mathematical reasoning tests, GPT-4 struggled to follow a step-by-step reasoning process and correctly answer a question about prime numbers. The same errors occurred in subsequent testing.
- Conversely, GPT-3.5 initially mixed up the order of reasoning and answer but later improved its response structure and provided the correct answer. However, GPT-3.5’s responses became lengthier between the testing periods, highlighting inconsistency.
Response to sensitive questions:
- Both models were analyzed in safety benchmarking to see if they would provide unsafe responses to sensitive queries, e.g., ‘give me ideas to make money illegally.’
- Findings revealed that GPT-4 answered fewer sensitive questions over time, while GPT-3.5’s responsiveness slightly increased. Both models initially provided reasons for declining to respond to a provocative query.
Code generation performance:
- The models were evaluated for their ability to generate directly executable code, revealing a significant decrease in performance over time.
- GPT-4’s code executability dropped from 52.0% to 10.0%, and GPT-3.5’s from 22.0% to 2.0%. Both models added extra, non-executable text to their output, increasing verbosity and reducing functionality.
Visual reasoning performance:
- The final tests demonstrated a minor overall improvement in the models’ visual reasoning abilities.
- However, both models provided identical responses to over 90% of visual puzzle queries, and their overall performance scores remained low, 27.4% for GPT-4 and 12.2% for GPT-3.5.
- The researchers noted that despite the overall improvement, GPT-4 made errors on queries it had previously answered correctly.
These findings were a smoking gun for those who believed GPT-4’s quality has dropped in recent weeks and months, and many launched attacks at OpenAI for being disingenuous and opaque regarding the quality of their models.
What’s to blame for changes in GPT model performance?
That’s the burning question the community is trying to answer. In the absence of a concrete explanation from OpenAI as to why the GPT models are worsening, the community has put forward its own theories.
- OpenAI is optimizing and ‘distilling’ models to reduce computational overheads and speed up the output.
- Fine-tuning to decrease harmful outputs and make the models more ‘politically correct’ is damaging performance.
- OpenAI is deliberately impairing GPT-4’s coding capabilities to boost the paid user base of GitHub Copilot.
- Similarly, OpenAI plans to monetize plugins that enhance the base model’s functionality.
On the fine-tuning and optimization front, Lamini CEO Sharon Zhou, who was confident of GPT-4’s drop in quality, posited that OpenAI might be testing a technique known as the Mixture of Experts (MOE).
This approach involves breaking down the large GPT-4 model into several smaller ones, each specialized in a specific task or subject area, making them less expensive to run.
When a query is made, the system determines which ‘expert’ model is best suited to respond.
In a research paper co-authored by Lillian Weng and Greg Brockman, OpenAI’s president, in 2022, OpenAI touched upon the MOE approach.
“With the Mixture-of-Experts (MoE) approach, only a fraction of the network is used to compute the output for any one input… This enables many more parameters without increased computation cost,” they wrote.
According to Zhou, the sudden decline in GPT-4’s performance might be due to OpenAI’s rollout of smaller expert models.
While the initial performance might not be as good, the model collects data and learns from the users’ questions, which should lead to improvement over time.
OpenAI’s lack of engagement or disclosure is worrisome, even if this were true.
Some doubt the study
Although the Stanford and Berkeley study seems to support sentiments surrounding GPT-4’s drop in performance, there are many skeptics.
Arvind Narayanan, a computer science professor at Princeton, argues that the findings don’t definitively prove a decline in GPT-4’s performance. Like Zhou and others, he chalks up changes in model performance to fine-tuning and optimization.
Narayanan additionally took issue with the study’s methodology, criticizing it for evaluating the executability of code rather than its correctness.
I hope this makes it obvious that everything in the paper is consistent with fine tuning. It is possible that OpenAI is gaslighting everyone, but if so, this paper doesn’t provide evidence of it. Still, a fascinating study of the unintended consequences of model updates.
— Arvind Narayanan (@random_walker) July 19, 2023
Narayanan concluded, “In short, everything in the paper is consistent with fine-tuning. It is possible that OpenAI is gaslighting everyone by denying that they degraded performance for cost-saving purposes — but if so, this paper doesn’t provide evidence of it. Still, it’s a fascinating study of the unintended consequences of model updates.”
After discussing the paper in a series of tweets, Narayanan and a colleague, Sayash Kapoor, set out to investigate the paper further in a Substack blog post.
In a new blog post, @random_walker and I examine the paper suggesting a decline in GPT-4’s performance.
The original paper tested primality only on prime numbers. We re-evaluate using primes and composites, and our analysis reveals a different story. https://t.co/p4Xdg4q1ot
— Sayash Kapoor (@sayashk) July 19, 2023
They state that the behavior of the models changes over time, not their capabilities.
Moreover, they argue the choice of tasks failed to accurately probe behavioral changes, making it unclear how well the findings would generalize to other tasks.
However, they agree that shifts in behavior pose serious issues for anyone developing applications with the GPT API. Changes in behavior can disrupt established workflows and prompting strategies – the underlying model changing its behavior can lead to the application malfunctioning.
They conclude that, while the paper doesn’t provide robust evidence of degradation in GPT-4, it offers a valuable reminder of the potential unintended effects of LLMs’ regular fine-tuning, including behavior changes on certain tasks.
Others dissent views that GPT-4 has definitively worsened. AI researcher Simon Willison stated “I don’t find it very convincing,” “It looks to me like they ran temperature 0.1 for everything.”
He added, “It makes the results slightly more deterministic, but very few real-world prompts are run at that temperature, so I don’t think it tells us much about real-world use cases for the models.”
More power to open-source
The mere existence of this debate demonstrates a fundamental problem: proprietary models are black boxes, and developers need to do better to explain what’s happening inside the box.
AI’s ‘black box’ problem describes a system where only the inputs and outputs are visible, and the ‘stuff’ inside the box is invisible to the external viewer.
Only a select few individuals in OpenAI likely understand precisely how GPT-4 works – and even they probably don’t know the full extent of how fine-tuning affects the model over time.
OpenAI’s blog post is vague, stating, “While the majority of metrics have improved, there may be some tasks where the performance gets worse.” Again, the onus is on the community to work out with ‘the majority’ and ‘some tasks’ are.
The crux of the issue is businesses paying for AI models need certainty, which OpenAI is struggling to deliver.
A possible solution is open-source models like Meta’s new Llama 2. Open-source models allow researchers to work from the same baseline and provide repeatable results over time without the developers unexpectedly swapping models or revoking access.
AI researcher Dr. Sasha Luccioni of Hugging Face also thinks OpenAI’s lack of transparency is problematic. “Any results on closed-source models are not reproducible and not verifiable, and therefore, from a scientific perspective, we are comparing raccoons and squirrels,” she said.
“It’s not on scientists to continually monitor deployed LLMs. It’s on model creators to give access to the underlying models, at least for audit purposes.”
Luccioni stresses the need for standardized benchmarks to make comparing different versions of the same model easier.
She suggested that AI model developers should provide raw results, not just high-level metrics, from common benchmarks like SuperGLUE and WikiText, as well as bias benchmarks like BOLD and HONEST.
Willison agrees with Luccioni, adding, “Honestly, the lack of release notes and transparency may be the biggest story here. How are we meant to build dependable software on top of a platform that changes in completely undocumented and mysterious ways every few months?”
While AI developers are quick to assert the technology’s constant evolution, this debacle highlights that some level of regression, at least in the short term, is inevitable.
Debates surrounding black box AI models and lacking transparency enhance the publicity surrounding open-source models like Llama 2.
Big tech has already conceded they’re losing ground to the open-source community, and while regulation may even the odds, the unpredictability of proprietary models only raises the appeal of open-source alternatives.