Since the recent launch of GPT-4 Turbo, OpenAI’s latest iteration of its language model, the AI community has been abuzz with mixed reactions.
While OpenAI touted GPT-4 Turbo as a more capable and efficient version of its predecessor, anecdotal evidence from users suggests a varied experience, particularly in areas requiring high-level reasoning and programming capabilities.
Concrete evidence from benchmark tests is only just beginning to surface.
In one independent benchmark, a user evaluated GPT-4 Turbo against GPT-4 and GPT-3.5 using three sections (67 questions) from an official 2008-2009 SAT reading test, scored on the old 2400-point scale.
The results indicated a notable difference in performance:
- GPT-3.5 scored 690, with 10 incorrect answers.
- GPT-4 scored 770, with only 3 incorrect answers.
- GPT-4 Turbo, tested in two modes, scored 740 (5 wrong) and 730 (6 wrong).
OpenAI claims GPT4-turbo is “better” than GPT4, but I ran my own tests and don’t think that’s true.
I benchmarked on SAT reading, which is a nice human reference for reasoning ability. Took 3 sections (67 questions) from an official 2008-2009 test (2400 scale) and got the… pic.twitter.com/LzIYS3R9ny
— Jeffrey Wang (@wangzjeff) November 7, 2023
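The tweet does not include the test harness, but the setup is simple to reproduce. Below is a minimal sketch, assuming the official OpenAI Python SDK, of how a multiple-choice benchmark like this could be run across several models. The question data, prompt wording, and single-letter grading rule are placeholder assumptions, and converting a raw score to the old 2400-point SAT scale would additionally require the official scoring tables.

```python
# Minimal sketch of a multiple-choice benchmark across several OpenAI models.
# The questions below are placeholders, not the actual SAT content used above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-3.5-turbo", "gpt-4", "gpt-4-1106-preview"]

# Hypothetical question format: prompt text plus the keyed answer letter.
QUESTIONS = [
    {
        "prompt": "Passage: ...\n\nQuestion: ...\nA) ...\nB) ...\nC) ...\nD) ...\nE) ...",
        "answer": "C",
    },
    # ...the remaining questions would go here
]

def count_wrong(model: str) -> int:
    """Return how many questions the model answers incorrectly."""
    wrong = 0
    for q in QUESTIONS:
        resp = client.chat.completions.create(
            model=model,
            temperature=0,  # reduce run-to-run variance while grading
            messages=[
                {"role": "system", "content": "Answer with a single letter only."},
                {"role": "user", "content": q["prompt"]},
            ],
        )
        letter = resp.choices[0].message.content.strip()[:1].upper()
        if letter != q["answer"]:
            wrong += 1
    return wrong

for m in MODELS:
    print(m, "wrong:", count_wrong(m))
```

Pinning the temperature to 0 keeps grading roughly repeatable, though scores can still vary slightly between runs, which is one reason single-run comparisons like this should be read cautiously.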
Other early benchmarks suggest otherwise
Another preliminary benchmark assessed the code-editing skills of the new version using Aider, an open-source command-line tool designed for AI-assisted code editing.
It found that GPT-4 Turbo (gpt-4-1106-preview) exhibits improved performance on coding tasks, which is, of course, a different skill from the natural-language test above.
The benchmark employed Aider to facilitate interactions between the user and the GPT-4 model for editing code in local git repositories. The test involved completing 133 Exercism Python coding exercises, providing a structured and quantitative assessment of the model’s code editing efficiency and accuracy.
The process was structured in two phases (a simplified sketch follows the list):
- Aider provided the GPT-4 model with the initial code file containing function stubs and natural language problem descriptions. The model’s first response was directly applied to edit the code.
- If the code failed the test suite, Aider presented the model with the test error output, asking it to fix the code.
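Aider's actual benchmark harness lives in its repository; the sketch below only illustrates the two-phase structure described above, under a few simplifying assumptions: the model is prompted directly through the OpenAI SDK rather than through Aider's edit formats, and it is assumed to reply with a complete replacement file.

```python
# Illustrative two-phase loop modeled on the description above; this is not
# Aider's actual harness. Assumes the model replies with a complete file.
import subprocess
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def ask_model(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def run_tests(exercise_dir: Path) -> subprocess.CompletedProcess:
    # Exercism Python exercises ship with a pytest test suite.
    return subprocess.run(
        ["pytest", str(exercise_dir)], capture_output=True, text=True
    )

def solve(exercise_dir: Path, solution_file: str, problem_text: str) -> bool:
    stub = (exercise_dir / solution_file).read_text()

    # Phase 1: send the stub and problem description; apply the first reply.
    reply = ask_model(f"{problem_text}\n\nComplete this file:\n{stub}")
    (exercise_dir / solution_file).write_text(reply)
    result = run_tests(exercise_dir)
    if result.returncode == 0:
        return True  # counted as a first-attempt success

    # Phase 2: feed the failing test output back and ask for one fix.
    reply = ask_model(
        f"The tests failed:\n{result.stdout}\n\nReturn the corrected file."
    )
    (exercise_dir / solution_file).write_text(reply)
    return run_tests(exercise_dir).returncode == 0
```

Limiting the model to a single correction pass, as the benchmark does, makes the two reported numbers directly comparable: first-attempt accuracy and accuracy after one round of test-driven repair.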
GPT-4-1106-Preview results
- Speed improvement: The GPT-4-1106-preview model showed a noticeable increase in processing speed compared to its predecessors.
- First-attempt accuracy: The model correctly solved 53% of the exercises on the first try, an improvement over the 46-47% success rate of previous GPT-4 versions.
- Performance after corrections: After being given a second chance to fix the code based on test-suite errors, the new model reached roughly 62%, close to, though slightly below, the 63-64% achieved by the older GPT-4 models.
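For scale, a 53% first-attempt rate on 133 exercises corresponds to roughly 70 exercises solved outright, and a ~62% final rate to roughly 82 solved after the retry phase.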
User experiences in programming tasks
Developers using GPT-4 Turbo for coding-related tasks have reported mixed experiences.
A variety of users across X and Reddit have noted a decline in the model’s ability to follow instructions accurately or retain context effectively in programming scenarios. Some reverted to using GPT-4 after facing challenges with the new model.
One user expressed frustration on Reddit, stating, “Yes, it’s pretty bad. I run GPT-4 on some scripts and keep sample tests to ensure it performs the same. All those tests failed with the new GPT-4-preview, and I had to revert back to old. It can’t reason properly.”
Another remarked, “It is insane what some of the responses are, it makes me want to cancel my subscription.”
The anecdotes are near-endless. Another says, “I pasted 100 or so lines of code and just asked it some pretty basic things. The code it sent back to me was entirely different from what I had just shown it, and almost entirely wrong. I’ve never seen it hallucinate this bad.”
Regrettably, I’ve noticed some clear setbacks in GPT-4 Turbo compared to GPT-4, especially in following instructions. I’m not the only one in the community feeling this way. Haven’t tested in detail, but hope you’ll take note and improve. Otherwise, it’s quite disappointing.
— Augusdin (@augusdin) November 12, 2023
Despite user reports, OpenAI has emphasized the advancements in GPT-4 Turbo, highlighting its extended knowledge cutoff of April 2023 and a 128K-token context window, equivalent to more than 300 pages of text.
OpenAI also noted the model’s optimized performance, making it more cost-effective. However, details on the specific optimization techniques and their impact on the model’s capabilities remain limited.
OpenAI CEO Sam Altman has since announced that GPT-4 Turbo has been updated, conceding there were issues and asking users to try the model again.
The company faced similar criticism over earlier versions of GPT-4, whose performance seemed to degrade in the months after release.
OpenAI faces criticism surrounding censorship
ChatGPT, developed by OpenAI, has been scrutinized for its handling of censorship and potential political bias.
Critics argue that the model sometimes exhibits a tendency to avoid or skew specific topics, especially those deemed politically sensitive or controversial.
This behavior is often attributed to the training data and the moderation guidelines shaping AI responses.
These guidelines aim to prevent the spread of misinformation, hate speech, and biased content, but some users feel that this approach leads to overcorrection, resulting in perceived censorship or bias in the AI’s responses.
In contrast, xAI’s Grok has been noted for its seemingly less restrictive approach to content moderation.
Users of Grok have observed that the platform appears more willing to engage in a wider range of topics, including those that might be filtered or handled more cautiously by ChatGPT.
Grok, fueled by Elon Musk’s feisty antics, has been positioned as ‘putting the sword’ to ‘woke AI,’ of which ChatGPT is seen as the flagship.
To summarize, benchmark tests on GPT-4 Turbo’s performance are extremely limited right now, and relying on anecdotal reports is problematic.
OpenAI’s rising success has put the company firmly in people’s crosshairs, particularly with the release of xAI’s Grok and its resistance to ‘woke AI.’
Attaining an objective view of GPT-4 Turbo’s performance is exceptionally tough for now, but the debate over whether ChatGPT’s outputs are genuinely improving will continue.