LLMs produce more inaccurate and biased outputs with longer inputs

Despite rapid advancements in LLMs, our understanding of how these models cope with longer inputs remains poor.

Mosh Levy, Alon Jacoby, and Yoav Goldberg, from Bar-Ilan University and the Allen Institute for AI, investigated how the performance of large language models (LLMs) varies with the length of the input text they are given to process.

They developed a reasoning framework specifically for this purpose, allowing them to dissect the influence of input length on LLM reasoning in a controlled environment.

The framework poses multiple versions of the same question. Each version contains the information needed to answer it, padded with additional, irrelevant text of varying lengths and types.

This enables the isolation of input length as a variable, ensuring that changes in model performance can be attributed directly to the length of the input.
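The padding idea can be sketched in a few lines of Python. This is an illustration only, not the researchers' code: the function name `build_padded_input`, the use of word counts as a stand-in for tokens, and the choice to keep the key spans in their original order are all assumptions made for the example.

```python
import random

def build_padded_input(key_spans, filler_paragraphs, target_words, seed=0):
    """Return the key spans surrounded by filler text, up to about target_words words.

    Hypothetical helper: word counts stand in for token counts here.
    """
    rng = random.Random(seed)
    word_count = sum(len(span.split()) for span in key_spans)

    # Add shuffled filler paragraphs while they still fit in the word budget.
    pool = list(filler_paragraphs)
    rng.shuffle(pool)
    paragraphs = []
    for para in pool:
        n = len(para.split())
        if word_count + n > target_words:
            continue
        paragraphs.append(para)
        word_count += n

    # Drop each key span at a random position, keeping their relative order
    # (an assumption of this sketch).
    positions = sorted(rng.randint(0, len(paragraphs)) for _ in key_spans)
    for offset, (pos, span) in enumerate(zip(positions, key_spans)):
        paragraphs.insert(pos + offset, span)

    return "\n\n".join(paragraphs)


spans = ["Alice is taller than Bob.", "Bob is taller than Carol."]
filler = [f"Filler paragraph number {i} about an unrelated topic." for i in range(50)]
variants = {n: build_padded_input(spans, filler, n) for n in (50, 100, 200)}
```

Every variant carries the same two facts, so any change in a model's accuracy across `variants` can only come from the amount and placement of the surrounding text.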

Key findings

Levy, Jacoby, and Goldberg found that LLMs exhibit a notable decline in reasoning performance at input lengths far below the maximum that developers claim the models can handle. They documented their findings in this study.

The decline was consistently observed across all versions of the dataset, indicating a systemic issue with handling longer inputs rather than a problem tied to specific data samples or model architectures.

As the researchers describe, “Our findings show a notable degradation in LLMs’ reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities.”

As the size of the input increases, the ability to perform reasoning tasks diminishes. These inputs consist of relevant (highlighted in red) and irrelevant (shown in grey) text, which are sourced from various places and expanded upon incrementally. Identifying two specific text segments, which could be situated randomly within the input, is necessary to answer accurately. The performance data is aggregated from 600 samples. Source: Via ArXiv.

Moreover, the study highlights how traditional metrics like perplexity, commonly used to evaluate LLMs, fail to correlate with the models’ performance on reasoning tasks involving long inputs.
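For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to each token. The minimal sketch below assumes you already have per-token natural-log probabilities from a model; the function name is made up for illustration.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from a list of per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that gives every token a probability of 0.5 has a perplexity of about 2.
example = perplexity([math.log(0.5)] * 4)
```

Because the metric only scores how well the next token is predicted, a model can achieve low perplexity on a long text and still fail to combine two facts buried inside it.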

Further exploration found that the degradation in performance was not solely dependent on the presence of irrelevant information (padding) but was observed even when such padding consisted of duplicated relevant information.

When we keep the two core spans together and add text around them, accuracy already drops. Introducing paragraphs between spans, results drop much more. The drop occurs both when the texts we add are similar to the task texts, and when they are completely different. 3/7 pic.twitter.com/c91l9uzyme

— Mosh Levy (@mosh_levy) February 26, 2024

This suggests that the challenge for LLMs lies not only in filtering out noise but also in the inherent difficulty of processing longer text sequences.

Ignoring instructions

One critical failure mode highlighted in the study is LLMs’ tendency to ignore instructions embedded within the input as the input length increases.

Models would also sometimes generate responses indicating uncertainty or lack of sufficient information, such as “There is not enough information in the text,” despite all the necessary information being present.
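One rough way to spot these failures when testing a model yourself is to sort each response into a refusal, a "true" or "false" answer, or something else. The phrase list and the rule of treating the last "true" or "false" in the response as the final answer are assumptions of this sketch, not the method used in the study.

```python
import re

# Hypothetical phrases that signal the model declined to answer.
REFUSAL_PATTERNS = (
    "not enough information",
    "insufficient information",
    "cannot be determined",
)

def classify_response(text):
    """Label a response as 'refusal', 'true', 'false', or 'other'."""
    lowered = text.lower()
    if any(pattern in lowered for pattern in REFUSAL_PATTERNS):
        return "refusal"
    # Treat the last standalone true/false as the model's final answer.
    answers = re.findall(r"\b(true|false)\b", lowered)
    if answers:
        return answers[-1]
    return "other"


labels = [
    classify_response("There is not enough information in the text."),
    classify_response("Alice is taller than Carol, so the answer is True."),
    classify_response("Here is a summary of the passage."),
]
```

A response labeled `other` here is a candidate for an ignored instruction, since the model was asked for a true or false answer and gave neither.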

Overall, LLMs seem to consistently struggle to prioritize and focus on key information pieces, including direct instructions, as input length grows.

Exhibiting biases in responses

Another notable issue was increased bias in the models’ responses as inputs became longer.

Specifically, LLMs were biased towards answering “False” as input length increased. This bias indicates a skew in probability estimation or decision-making processes within the model, possibly as a defensive mechanism in response to increased uncertainty due to longer input lengths.

The inclination to favor “False” responses could also reflect an underlying imbalance in the training data or an artifact of the models’ training process, where negative responses may be overrepresented or associated with contexts of uncertainty and ambiguity.
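A skew like this is simple to measure once responses have been labeled. The sketch below assumes a list of (input length, predicted label) pairs; the lengths and labels in the example are invented for illustration and are not results from the study.

```python
from collections import defaultdict

def false_rate_by_length(records):
    """Share of 'false' predictions for each input length.

    records: iterable of (input_length, predicted_label) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # length -> [false_count, total]
    for length, label in records:
        counts[length][1] += 1
        if label == "false":
            counts[length][0] += 1
    return {length: f / t for length, (f, t) in sorted(counts.items())}


# Invented example data, not figures from the paper.
records = [
    (250, "true"), (250, "false"),
    (3000, "false"), (3000, "false"), (3000, "true"), (3000, "false"),
]
rates = false_rate_by_length(records)
```

If the correct answers are evenly split between true and false, an unbiased model should stay near 0.5 at every length, so a rate that climbs with length points to the kind of bias described above.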

Models exhibited bias towards answering binary questions as “false” as the input length increased. Source: Via ArXiv.

This bias affects the accuracy of the models’ outputs and raises concerns about the reliability and fairness of LLMs in applications requiring nuanced understanding and impartiality.

Implementing robust bias detection and mitigation strategies during model training and fine-tuning phases is essential to reduce unwarranted biases in model responses.

Ensuring that training datasets are diverse, balanced, and representative of a wide range of scenarios can also help minimize biases and improve model generalization.

This study adds to other recent research highlighting fundamental issues in how LLMs work, a form of ‘technical debt’ that could threaten model functionality and integrity over time.

Sam Jeans
