Despite rapid advancements in LLMs, our understanding of how these models cope with longer inputs remains poor.
Mosh Levy, Alon Jacoby, and Yoav Goldberg, of Bar-Ilan University and the Allen Institute for AI, investigated how the performance of large language models (LLMs) varies with the length of the input text they are given to process.
They developed a question-answering framework specifically for this purpose, allowing them to dissect the influence of input length on LLM reasoning in a controlled environment.
The framework generated different versions of the same question, each containing the information necessary to answer it, padded with additional, irrelevant text of varying lengths and types.
This enabled the researchers to isolate input length as a variable, ensuring that changes in model performance could be attributed directly to the length of the input.
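To make the setup concrete, here is a minimal Python sketch of how such length-controlled variants could be built. The function name, prompt format, and padding corpus are illustrative assumptions rather than the authors' actual dataset-construction code.

```python
# Illustrative sketch (not the authors' dataset-construction code): building
# length-controlled variants of the same True/False question, so that only
# the amount of irrelevant padding changes between versions.
import random

def build_variants(question, key_facts, padding_corpus, target_lengths):
    """Return one prompt per target word count, each containing the same
    question and key facts, padded with irrelevant sentences."""
    variants = []
    for target in target_lengths:
        padding = []
        # Add irrelevant sentences until the context reaches the target word count.
        while len(" ".join(key_facts + padding).split()) < target:
            padding.append(random.choice(padding_corpus))
        # Shuffle so the key facts appear at varying positions in the context.
        context = key_facts + padding
        random.shuffle(context)
        variants.append("\n".join(context) + f"\n\nQuestion: {question}\nAnswer True or False.")
    return variants

# Toy usage (all names and texts here are hypothetical):
prompts = build_variants(
    question="Was the person mentioned born before the library opened?",
    key_facts=["Dana was born in 1980.", "The town library opened in 1995."],
    padding_corpus=["The weather was mild that spring.", "Trains ran on schedule."],
    target_lengths=[250, 1000, 3000],
)
```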
Key findings
Levy, Jacoby, and Goldberg found that LLMs exhibit a marked decline in reasoning performance at input lengths far shorter than the maximum context sizes their developers advertise. They documented these findings in this study.
The decline was observed consistently across all versions of the dataset, indicating a systemic issue with handling longer inputs rather than a problem tied to specific data samples or model architectures.
As the researchers describe, “Our findings show a notable degradation in LLMs’ reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities.”
Moreover, the study highlights how traditional metrics like perplexity, commonly used to evaluate LLMs, fail to correlate with the models’ performance on reasoning tasks involving long inputs.
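As a rough illustration of that comparison, the sketch below computes next-word perplexity with an off-the-shelf causal LM and correlates it with accuracy per length bucket. The model choice, the bucketing, and the figures in `results` are placeholders for illustration, not values reported in the study.

```python
# Hedged sketch: checking whether next-word perplexity tracks reasoning accuracy
# across input-length buckets. The model, the bucketing, and the numbers in
# `results` are toy placeholders, not values from the study.
import math
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the causal LM."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return math.exp(loss.item())

# Per length bucket: mean perplexity of the padded prompts (e.g. averaged over
# perplexity(prompt) calls) and accuracy on the associated True/False questions.
results = {
    250:  {"ppl": 21.3, "acc": 0.91},   # toy numbers only
    500:  {"ppl": 20.6, "acc": 0.85},
    1000: {"ppl": 19.8, "acc": 0.74},
    3000: {"ppl": 18.9, "acc": 0.55},
}
rho, p = spearmanr([v["ppl"] for v in results.values()],
                   [v["acc"] for v in results.values()])
print(f"Spearman correlation between perplexity and accuracy: {rho:.2f} (p={p:.2f})")
```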
Further exploration found that the degradation in performance was not solely dependent on the presence of irrelevant information (padding) but was observed even when such padding consisted of duplicated relevant information.
When we keep the two core spans together and add text around them, accuracy already drops. Introducing paragraphs between spans, results drop much more. The drop occurs both when the texts we add are similar to the task texts, and when they are completely different. 3/7 pic.twitter.com/c91l9uzyme
— Mosh Levy (@mosh_levy) February 26, 2024
This suggests that the challenge for LLMs lies not only in filtering out irrelevant noise but also in the inherent processing of longer text sequences.
Ignoring instructions
One critical failure mode highlighted in the study is LLMs’ tendency to ignore instructions embedded within the input as the input length increases.
Models would also sometimes generate responses indicating uncertainty or a lack of sufficient information, such as “There is not enough information in the text,” despite all the necessary information being present in the input.
Overall, LLMs seem to consistently struggle to prioritize and focus on key pieces of information, including direct instructions, as input length grows.
Exhibiting biases in responses
Another notable issue was an increase in bias in the models’ responses as inputs became longer.
Specifically, LLMs were biased towards answering “False” as input length increased. This bias indicates a skew in probability estimation or decision-making processes within the model, possibly as a defensive mechanism in response to increased uncertainty due to longer input lengths.
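One simple way to quantify such a skew, sketched below under an assumed record format (not the authors' analysis code), is to compare the model's rate of "False" answers against the gold-label rate within each length bucket.

```python
# Minimal sketch (assumed record format, not the authors' analysis code):
# comparing how often the model answers "False" with the gold-label rate
# of "False" in short versus long prompts.
from collections import defaultdict

# Each record: (prompt_length_in_tokens, model_answer, gold_answer) -- toy examples only.
records = [
    (240, "True", "True"), (260, "False", "False"),
    (2900, "False", "True"), (3100, "False", "False"),
]

def bucket(length: int) -> str:
    return "short (<1k tokens)" if length < 1000 else "long (>=1k tokens)"

pred_false, gold_false, totals = defaultdict(int), defaultdict(int), defaultdict(int)
for length, pred, gold in records:
    b = bucket(length)
    totals[b] += 1
    pred_false[b] += pred == "False"
    gold_false[b] += gold == "False"

for b in totals:
    print(f"{b}: model answers False {pred_false[b] / totals[b]:.0%} of the time "
          f"(gold rate {gold_false[b] / totals[b]:.0%})")
```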
The inclination to favor “False” responses could also reflect an underlying imbalance in the training data or an artifact of the models’ training process, where negative responses may be overrepresented or associated with contexts of uncertainty and ambiguity.
This bias affects the accuracy of the models’ outputs and raises concerns about the reliability and fairness of LLMs in applications requiring nuanced understanding and impartiality.
Implementing robust bias detection and mitigation strategies during model training and fine-tuning phases is essential to reduce unwarranted biases in model responses.
Ensuring that training datasets are diverse, balanced, and representative of a wide range of scenarios can also help minimize biases and improve model generalization.
The study adds to other recent work that similarly highlights fundamental issues in how LLMs operate, a form of ‘technical debt’ that could threaten model functionality and integrity over time.