AI isn't always as smart as we think it is

Human supervision of AI isn't enough. Better, more careful design is essential to building models that can be relied on to be accurate, study finds.

Study finds that large language models are good at big things, not so much at small ones

Everyone speaks respectfully of the growing capabilities of large language models (LLMs), often called artificial intelligence, but a new study suggests that confidence may be misplaced.

In short, it says that LLMs like ChatGPT have a problem: these models can be remarkably good at hard tasks while making mistakes on easy ones.

"Ultimately, large language models are becoming increasingly unreliable from a human point of view, and user supervision to correct errors is not the solution, as we tend to rely too much on models and cannot recognise incorrect results at different difficulty levels," said Wout Schellaert, one of the researchers who worked on the study. 

The study is a bit hard to wade through, possibly also calling into question the ability of human intelligence to express itself clearly. Here's a simplified breakdown:

  • Unexpected Mistakes: LLMs can solve complex problems, like PhD-level math, but then make mistakes on simple things, like basic addition. This is surprising because we expect them to get the easy stuff right.

  • No "Safe Zone": There's no type of task where these models are always 100% accurate. They can make mistakes on both easy and difficult tasks, so we can't completely trust them.

  • More Wrong Answers: Newer LLMs are more likely to give a wrong answer than to say "I don't know," which can mislead users who assume the model is correct.

  • Tricky Questions: Phrasing a question in a way that works well for complex tasks can still produce wrong answers on simple ones.

  • Humans Can't Fix It: Even with human supervision, these problems are hard to fix. People tend to overestimate the accuracy of LLMs and may not catch their mistakes.

Could be a problem

The researchers are saying that LLMs are becoming less reliable even as they get better at certain tasks. This is a problem, especially for important uses like healthcare or finance. They argue that we need to rethink how we design and develop these models to make them more trustworthy.

"Models can solve certain complex tasks in line with human abilities, but at the same time, they fail on simple tasks in the same domain. For example, they can solve several PhD-level mathematical problems. Still, they can get a simple addition wrong," said José Hernández Orallo at the Universitat Politecnica de Valencia, Spain, who conducted the study, published in the academic journal Nature. 

The researchers noted that in 2022, Ilya Sutskever, the OpenAI co-founder behind some of the most significant recent advances in artificial intelligence, predicted that "maybe over time that discrepancy will diminish."

But the new study found otherwise, Hernández Orallo said.
