The conventional route to scientific publication relies heavily on peer review, in which other scientists evaluate and critique a study before it is published.
However, this traditional system is increasingly bottlenecked by a surging number of submissions and a shortage of available reviewers.
“It’s getting harder and harder for researchers to get high-quality feedback from reviewers,” says James Zou from Stanford University.
In response to this challenge, Zou and his team turned to ChatGPT to see whether the chatbot could deliver clear, objective feedback on research papers. They used GPT-4 to review more than 3,000 manuscripts from Nature family journals and more than 1,700 papers from the International Conference on Learning Representations (ICLR).
When comparing ChatGPT’s feedback with that of human reviewers on the same papers, they found that more than 50% of the AI’s comments on the Nature papers, and more than 77% of its comments on the ICLR papers, overlapped with points raised by human reviewers.
Extending the experiment, the team also used ChatGPT to assess several hundred yet-to-be-peer-reviewed papers on preprint servers.
Gathering feedback from 308 authors in AI and computational biology, they found that more than 82% of them considered ChatGPT’s feedback more beneficial than at least some of the feedback they had previously received from human reviewers.
Despite these promising results, concerns about the AI’s ability to provide nuanced and technically detailed feedback persist.
Moreover, ChatGPT’s feedback can be unpredictable, with variable results depending on the content of the study.
Zou acknowledges these limitations, noting that some researchers found ChatGPT’s feedback overly vague.
The researchers remain optimistic that GPT-4 can take on some of the heavy lifting of the peer-review process by flagging the more obvious errors and inconsistencies.
More about the study
ChatGPT – specifically, the GPT-4 model – proved effective in practice at reviewing scientific studies and providing rapid feedback.
Here’s more about the study:
- Objective: Motivated by the difficulty of obtaining high-quality human peer reviews, the study explored using large language models (LLMs) such as GPT-4 to provide scientific feedback on research manuscripts.
- Model design: The researchers built an automated pipeline that uses GPT-4 to generate comments on the full PDFs of scientific papers, designed to assess how well LLM-generated feedback can supplement existing peer-review processes in scientific publishing (a minimal illustrative sketch of such a pipeline appears after this list).
- Results: The quality of GPT-4’s feedback was evaluated in two studies. The first was a retrospective analysis comparing the generated feedback with human peer-reviewer feedback on 3,096 papers from 15 Nature family journals and 1,709 papers from the ICLR machine learning conference. The overlap in the points raised by GPT-4 and human reviewers was quantitatively assessed (see the overlap sketch after this list).
- The second was a prospective user study with 308 researchers from 110 US institutions working in AI and computational biology, who gave their perceptions of the feedback the GPT-4 system generated on their own papers.
- Conclusions: The researchers found substantial overlap between the points raised by GPT-4 and those raised by human reviewers, and most participants in the user study perceived the LLM-generated feedback positively. The results suggest that LLM and human feedback can complement each other, although limitations of the LLM-generated feedback were also identified.
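
The paper describes the review pipeline only at a high level, so the following is a minimal sketch of what such a GPT-4 pipeline might look like in Python. The prompt wording, the pypdf-based text extraction, the truncation limit, and the `extract_text`/`review_paper` helper names are all illustrative assumptions, not the authors’ actual implementation.

```python
# Minimal sketch of a GPT-4 paper-review pipeline (illustrative only;
# the prompt and helper names are assumptions, not the study's code).
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt; the study's real instructions were more detailed.
REVIEW_PROMPT = (
    "You are reviewing a scientific manuscript. List the paper's key "
    "strengths and weaknesses, flag unclear claims or inconsistencies, "
    "and suggest concrete improvements, as a numbered list of points."
)

def extract_text(pdf_path: str, max_chars: int = 60_000) -> str:
    """Pull raw text from the manuscript PDF, truncated to fit the context window."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return text[:max_chars]

def review_paper(pdf_path: str) -> str:
    """Send the extracted manuscript text to GPT-4 and return its feedback."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": extract_text(pdf_path)},
        ],
        temperature=0,  # keep the feedback as reproducible as possible
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(review_paper("manuscript.pdf"))
```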
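The paper quantified overlap with its own matching procedure; as a rough stand-in, here is one simple way to estimate the fraction of GPT-4 comments that match a point raised by a human reviewer, using embedding similarity. The embedding model choice and the similarity threshold are arbitrary assumptions for illustration.

```python
# Sketch of quantifying comment overlap via text embeddings (an assumed
# simplification; the paper used its own matching procedure).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed each comment and L2-normalize so dot products are cosine similarities."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def overlap_fraction(gpt_comments: list[str], human_comments: list[str],
                     threshold: float = 0.5) -> float:
    """Fraction of GPT-4 comments whose nearest human comment clears the
    (arbitrarily chosen) similarity threshold."""
    sims = embed(gpt_comments) @ embed(human_comments).T  # pairwise cosine similarity
    return float((sims.max(axis=1) >= threshold).mean())

# Toy example: comments phrased differently but raising similar points.
gpt = ["The sample size is too small.",
       "Figure 2 lacks error bars.",
       "The related-work section omits recent baselines."]
human = ["Statistical power is limited by the small cohort.",
         "Error bars are missing from the figures."]
print(f"Overlap: {overlap_fraction(gpt, human):.0%}")
```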
GPT-4 was almost certainly exposed to hundreds of thousands of scientific papers during training, which likely contributes to its ability to dissect and critique research much as human peer reviewers do.
AI is becoming increasingly intertwined with academic processes. Nature recently surveyed 1,600 researchers about their opinions on generative AIs like ChatGPT, and while many raised concerns about bias, the majority acknowledged that its integration into the scientific process is inevitable.