As AI systems like large language models (LLMs) grow in size and complexity, researchers are uncovering intriguing fundamental limitations.
Recent studies from Google and the National University of Singapore have examined the mechanics behind AI “hallucinations” – where models generate convincing but fabricated information – and the accumulation of “technical debt,” which can create messy, unreliable systems over time.
Beyond the technical challenges, aligning AI’s capabilities and incentives with human values remains an open question.
As companies like OpenAI push towards artificial general intelligence (AGI), securing the path ahead means acknowledging the boundaries of current systems.
However, carefully acknowledging risks runs counter to Silicon Valley’s “move fast and break things” ethos, which has characterized AI R&D just as it did earlier waves of tech innovation.
Study 1: AI models are accruing ‘technical debt’
Machine learning is often touted as continuously scalable, with systems offering a modular, integrated framework for development.
However, in the background, developers may be accruing a high level of ‘technical debt’ that they’ll eventually have to pay down.
In a Google research paper, “Machine Learning: The High-Interest Credit Card of Technical Debt,” researchers discuss the concept of technical debt in the context of ML systems.
Kaggle CEO and long-time Google researcher D. Sculley and colleagues argue that while ML offers powerful tools for rapidly building complex systems, these “quick wins” are often misleading.
The simplicity and speed of deploying ML models can mask the future burdens they impose on system maintainability and evolution.
As the authors describe, this hidden debt arises from several ML-specific risk factors that developers should avoid or refactor.
Here are the key insights:
- ML systems, by their nature, introduce a level of complexity beyond coding alone. This can lead to what the authors call “boundary erosion,” where the clear lines between different system components become blurred due to the interdependencies created by ML models. This makes it difficult to isolate and implement improvements without affecting other parts of the system.
- The paper also highlights the problem of “entanglement,” where changes to any part of an ML system, such as input features or model parameters, can have unpredictable effects on the rest of the system. Altering one small parameter might set off a cascade of effects that compromise the entire model’s function and integrity (see the sketch after this list).
- Another issue is the creation of “hidden feedback loops,” where ML models influence their own training data in unforeseen ways. This can lead to systems that evolve in unintended directions, compounding the difficulty of managing and understanding the system’s behavior.
- The authors also address unstable “data dependencies,” such as input signals that change over time, which are particularly problematic because they’re harder to detect than code dependencies.
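To make the entanglement point concrete, here is a minimal toy sketch (an invented linear model, not an example from the paper): adding one new, correlated input feature shifts the weight the model learns for an existing feature, so even a seemingly local change ripples through the whole model.

```python
# Toy illustration of entanglement (invented example, not from the paper):
# adding one correlated feature changes the weight learned for another.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.7 * x1 + 0.3 * rng.normal(size=n)   # new feature, correlated with x1
y = 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(scale=0.1, size=n)

def fit(X, y):
    """Ordinary least squares weights for the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

w_before = fit(np.column_stack([x1, x2]), y)        # model without x3
w_after = fit(np.column_stack([x1, x2, x3]), y)     # model after adding x3

print("weights before adding x3:", w_before.round(3))
print("weights after adding x3: ", w_after.round(3))
# The learned weight on x1 shifts noticeably once x3 is added (x1 and x3
# are correlated): changing or adding any one input affects the rest of
# the model, i.e. "changing anything changes everything."
```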
Why technical debt matters
Technical debt touches on the long-term health and efficiency of ML systems.
When developers rush to get ML systems up and running, they might ignore the messy intricacies of data handling or the pitfalls of ‘gluing’ together different parts.
This might work in the short term but can lead to a tangled mess that’s hard to dissect, update, or even understand later.
GenAI is an avalanche of technical debt waiting to happen. Just this week:
👉 ChatGPT went “berserk” with almost no real explanation
👉 Sora can’t consistently infer how many legs a cat has
👉 Gemini’s diversity intervention went utterly off the rails…
— Gary Marcus (@GaryMarcus), February 24, 2024
For example, using ML models as-is from a library seems efficient until you’re stuck with a “glue code” nightmare, where most of the system is just duct tape holding together bits and pieces that weren’t meant to fit together.
Or consider “pipeline jungles,” also described by D. Sculley and colleagues, where data preparation becomes a labyrinth of intertwined processes, so making a change feels like defusing a bomb.
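As a purely hypothetical sketch of what that duct tape tends to look like (the field names, formats, and systems below are invented, not taken from the paper), notice how many lines exist only to translate between the shapes that different packages and services happen to expect:

```python
# Hypothetical glue code (invented names and formats): most of these lines
# only shuttle data between the shapes two packages expect, and none of
# them express the actual learning problem.
import json

import numpy as np
import pandas as pd


def rows_to_matrix(raw_rows: list[dict]) -> np.ndarray:
    """Reshape an upstream service's JSON rows into the array a packaged
    model expects. The column order is an unwritten contract."""
    frame = pd.DataFrame(raw_rows)
    frame["age"] = frame["age"].fillna(frame["age"].median())           # ad hoc patch
    frame["plan"] = frame["plan"].map({"free": 0, "pro": 1}).fillna(-1)  # another one
    return frame[["age", "plan", "logins"]].to_numpy(dtype=float)


def scores_to_payload(scores: np.ndarray) -> str:
    """Repackage model output for a downstream system with its own format."""
    return json.dumps({"scores": [round(float(s), 4) for s in scores]})
```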
The implications of technical debt
For starters, the more tangled a system becomes, the harder it is to improve or maintain it. This not only stifles innovation but can also lead to more sinister issues.
For instance, if an ML system starts making decisions based on outdated or biased data because it’s too cumbersome to update, it can reinforce or amplify societal biases.
Moreover, in critical applications like healthcare or autonomous vehicles, such technical debt could have dire consequences, not just in terms of time and money but in human well-being.
As the study describes, “Not all debt is necessarily bad, but technical debt does tend to compound. Deferring the work to pay it off results in increasing costs, system brittleness, and reduced rates of innovation.”
It’s also a reminder for businesses and consumers to demand transparency and accountability in the AI technologies they adopt.
After all, the goal is to harness the power of AI to make life better, not to get bogged down in an endless cycle of technical debt repayment.
Study 2: You can’t separate hallucinations from LLMs
In a different but related study from the National University of Singapore, researchers Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli investigated the inherent limitations of LLMs.
“Hallucination is Inevitable: An Innate Limitation of Large Language Models” explores the nature of AI hallucinations, which describe instances when AI systems generate plausible but inaccurate or entirely fabricated information.
The hallucination phenomenon poses a major technical challenge, as it highlights a fundamental gap between an AI model’s output and the “ground truth” – an ideal model that always produces correct and logical information.
Understanding how and why generative AI hallucinates is paramount as the technology integrates into critical sectors such as policing and justice, healthcare, and law.
What if one could *prove* that hallucinations are inevitable within LLMs?
Would that change
• How you view LLMs?
• How much investment you would make in them?
• How much you would prioritize research in alternatives?
New paper makes the case: https://t.co/r0eP3mFxQg
— Gary Marcus (@GaryMarcus), February 25, 2024
Theoretical foundations of hallucinations
The study begins by laying out a theoretical framework to understand hallucinations in LLMs.
The researchers defined a simplified, controlled setting they call the “formal world,” which allowed them to characterize the conditions under which AI models fail to align with the ground truth.
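As a rough toy reading of that setup (not the paper’s formal definitions), the ground truth and the trained model can be thought of as two functions over the same world of strings, and the model hallucinates on any input where the two disagree:

```python
# Toy reading of the framework (not the paper's formal definitions):
# the ground truth f and the model h are both string-to-string functions,
# and h "hallucinates" on any input where its answer differs from f's.
def f(prompt: str) -> str:
    """Ground truth in a tiny formal world: reverse the string."""
    return prompt[::-1]

def h(prompt: str) -> str:
    """A stand-in 'model' that has only memorized a few short examples."""
    memorized = {"ab": "ba", "abc": "cba"}
    return memorized.get(prompt, prompt)   # otherwise it just guesses

inputs = ["ab", "abc", "abcd"]
hallucinations = [p for p in inputs if h(p) != f(p)]
print(hallucinations)   # prints ['abcd']: wrong outside what it memorized
```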
They then tested two major families of LLMs:
- Llama 2: Specifically, the 70-billion-parameter version (llama2-70b-chat-hf) accessible on HuggingFace was used. This model represents one of the newer entries into the large language model arena, designed for a wide range of text generation and comprehension tasks.
- Generative Pretrained Transformers (GPT): The study included tests on GPT-3.5, specifically the 175-billion-parameter gpt-3.5-turbo-16k model, and GPT-4 (gpt-4-0613), for which the exact number of parameters remains undisclosed.
LLMs were asked to list strings of a given length using a specified alphabet, a seemingly simple computational task.
More specifically, the models were tasked with generating all possible strings of lengths varying from 1 to 7, using alphabets of two characters (e.g., {a, b}) and three characters (e.g., {a, b, c}).
The outputs were evaluated based on whether they contained all and only the strings of the specified length from the given alphabet.
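Because the correct answer is fully determined by the alphabet and the string length, both the expected output and the grading check are easy to reproduce. Below is a minimal sketch, a simplification rather than the paper’s exact evaluation code:

```python
# Sketch of the enumeration task and its grading check (a simplification,
# not the paper's exact evaluation code).
from itertools import product

def all_strings(alphabet: str, length: int) -> set[str]:
    """Every string of exactly `length` characters over `alphabet`."""
    return {"".join(chars) for chars in product(alphabet, repeat=length)}

def is_correct(model_output: list[str], alphabet: str, length: int) -> bool:
    """The model's list must contain all such strings and nothing else."""
    return set(model_output) == all_strings(alphabet, length)

print(sorted(all_strings("ab", 3)))
# ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba', 'bbb']
print(is_correct(["aaa", "aab", "aba"], "ab", 3))   # False: incomplete list
```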
Findings
The results showed a clear limitation in the models’ abilities to complete the task correctly as the complexity increased (i.e., as the string length or the alphabet size increased). Specifically:
- The models performed adequately for shorter strings and smaller alphabets but faltered as the task’s complexity increased.
- Notably, even the advanced GPT-4 model, the most sophisticated LLM tested, couldn’t successfully list all strings beyond certain lengths (the back-of-the-envelope calculation below gives a sense of why).
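Part of what makes the harder settings unforgiving is plain combinatorial growth: a complete answer must contain the alphabet size raised to the power of the string length, so the target output balloons quickly. A quick check:

```python
# The number of strings the model must list grows as |alphabet| ** length.
for size in (2, 3):
    counts = [size ** length for length in range(1, 8)]
    print(f"alphabet size {size}:", counts)
# alphabet size 2: [2, 4, 8, 16, 32, 64, 128]
# alphabet size 3: [3, 9, 27, 81, 243, 729, 2187]
```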
Combined with the paper’s theoretical results, these failures suggest that hallucinations aren’t a simple glitch that can be patched or corrected – they’re a fundamental aspect of how these models understand and replicate human language.
As the study describes, “LLMs cannot learn all of the computable functions and will therefore always hallucinate. Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs.”
The implications for high-stakes applications are vast. In sectors like healthcare, finance, or law, where the accuracy of information can have serious consequences, relying on an LLM without a fail-safe to filter out these hallucinations could lead to serious errors.
This study caught the eye of AI expert Dr. Gary Marcus and eminent cognitive psychologist Dr. Steven Pinker.
Hallucination is inevitable with Large Language Models because of their design: no representation of facts or things, just statistical intercorrelations. New proof of “an innate limitation” of LLMs. https://t.co/Hl1kqxJGXt
— Steven Pinker (@sapinker) February 25, 2024
Deeper issues are at play
The accumulation of technical debt and the inevitability of hallucinations in LLMs are symptomatic of a deeper issue: the current paradigm of AI development may be unable to produce systems that are both highly intelligent and reliably aligned with human values and factual truth.
In sensitive fields, having an AI system that’s right most of the time is not enough. Technical debt and hallucinations both threaten model integrity over time.
Fixing this isn’t just a technical challenge but a multidisciplinary one, requiring input from AI ethics, policy, and domain-specific expertise to navigate safely.
Right now, this is seemingly at odds with the principles of an industry living by the motto “move fast and break things.”
Let’s hope humans aren’t the ‘things.’