AI is transforming scientific research, but without proper guidance, it may do more harm than good.
That’s the pointed conclusion of a new paper published in Science Advances by an interdisciplinary team of 19 researchers led by Princeton University computer scientists Arvind Narayanan and Sayash Kapoor.
The team argues that the misuse of machine learning across scientific disciplines is fueling a reproducibility crisis that threatens to undermine the very foundations of science.
“When we graduate from traditional statistical methods to machine learning methods, there are a vastly greater number of ways to shoot oneself in the foot,” said Narayanan, who directs Princeton’s Center for Information Technology Policy.
“If we don’t have an intervention to improve our scientific standards and reporting standards when it comes to machine learning-based science, we risk not just one discipline but many different scientific disciplines rediscovering these crises one after another.”
According to the authors, the problem is that machine learning has been rapidly adopted by nearly every scientific field, often without clear standards to ensure the integrity and reproducibility of the results.
They highlight that thousands of papers using flawed machine learning methods have already been published.
But the Princeton-led team says there’s still time to head off a full-blown crisis. They’ve put forward a simple checklist of best practices that, if widely adopted, could safeguard the reliability of machine learning in science.
The checklist, called REFORMS (Recommendations for Machine-learning-based Science), consists of 32 questions across eight key areas, including:
- Study goals: Clearly state the scientific claim being made and how machine learning will be used to support it. Justify the choice of machine learning over traditional statistical methods.
- Computational reproducibility: Provide the code, data, computing environment specifications, documentation, and a reproduction script needed for others to reproduce the study’s results independently (see the second sketch after this list).
- Data quality: Document the data sources, sampling frame, outcome variables, sample size, and amount of missing data. Justify that the dataset is appropriate and representative of the scientific question.
- Data preprocessing: Report how data was cleaned, transformed, and split into training and test sets. Provide a rationale for any data that was excluded.
- Modeling: Describe and justify all models tried, the method used to select the final model(s), and the hyperparameter tuning process. Compare performance against appropriate baselines.
- Data leakage: Verify that the modeling process didn’t inadvertently use information from the test data and that the input features don’t leak the outcome (illustrated in the first sketch below).
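To make the leakage point concrete, here is a minimal sketch in Python with scikit-learn. The data is random placeholder data and the variable names are ours, not the paper’s. The pitfall is fitting a preprocessing step on the full dataset before splitting, which lets test-set statistics bleed into training; the fix is to split first and fit only on the training rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))    # placeholder features
y = rng.integers(0, 2, size=500)  # placeholder binary outcome

# Leaky: the scaler sees the test rows, so their means and variances
# influence how the training data is transformed.
X_leaky = StandardScaler().fit_transform(X)
X_tr_bad, X_te_bad, y_tr_bad, y_te_bad = train_test_split(
    X_leaky, y, test_size=0.2, random_state=0)

# Correct: split first, fit the scaler on training data only,
# then apply the same transform to the held-out test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print("held-out accuracy:", model.score(scaler.transform(X_te), y_te))
```

On random data the two pipelines score about the same, but with small samples or target-aware steps such as feature selection or imputation, the leaky version can substantially inflate reported performance, which is exactly the kind of error the checklist asks authors to rule out.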
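The computational reproducibility item is partly a packaging exercise. Below is a hedged sketch of the kind of bookkeeping a reproduction script might do; the manifest filename and structure are our own invention, not something prescribed by REFORMS:

```python
import json
import platform
import random
import sys

import numpy as np
import sklearn

SEED = 42  # fix all sources of randomness up front
random.seed(SEED)
np.random.seed(SEED)

# Record the computing environment alongside the results so that
# others can recreate it (a hypothetical manifest file).
manifest = {
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    "scikit-learn": sklearn.__version__,
    "seed": SEED,
}
with open("environment_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Pinning seeds and recording exact package versions doesn’t guarantee identical results across machines, but it removes the most common obstacles to rerunning someone else’s analysis.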
“This is a systematic problem with systematic solutions,” said Kapoor.
However, the costs of getting it wrong could be immense. Faulty science could sink promising research, discourage researchers, and erode public trust in science.
Previous research, such as Nature’s large-scale survey of academics on generative AI in science, suggests that AI’s integration into scientific workflows will only deepen.
Participants highlighted plenty of benefits: 66% noted that AI enables quicker data processing, 58% said it speeds up computations, and 55% said it saves time and money.
However, 53% felt results could be irreproducible, 58% worried about bias, and 55% believed AI might enable fraudulent research.
We saw evidence of this when researchers published an article with nonsensical AI-generated diagrams in a Frontiers journal: a rat with giant testicles, no less. Comical, yes, but it showed that peer review can miss even glaringly obvious uses of AI.
Ultimately, like any tool, AI is only as safe and effective as the human behind it. Careless use, even if unintentional, can lead science astray.
The new guidelines aim to keep “honest people honest,” as Narayanan put it.
Widespread adoption by researchers, reviewers, and journals could set a new standard for scientific integrity in the age of AI.
However, building consensus will be challenging, especially since the reproducibility crisis is still largely flying under the radar.