The topic of artificial intelligence has gained enormous momentum since the success of ChatGPT. Yet the seemingly omniscient AI systems fail at a simple logic question. Could you have answered it?

Developments in the field of artificial intelligence are progressing at a rapid pace. Since the successful launch of ChatGPT, AI systems have become increasingly important.

But are the large language models really as smart as they are often portrayed? A new study by the AI research organization LAION has shown that AI can fail even at a simple logic question.

AI fails on simple logic question

The researchers used a “conventional common sense problem” for their study to test the capabilities of large language models such as GPT-3.5/4, Gemini, and Llama 2/3.

However, they came to a surprising conclusion: almost all of the major AI language models failed to answer the researchers' logic question correctly.

Here we demonstrate a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales which claim strong function.

The researchers did not try to trick the AI systems: the logic question was “formulated in concise natural language” and can “easily be solved by humans.”

“The breakdown is dramatic because the models also show strong confidence in their incorrect solutions,” the researchers write in their summary. Even prompting the models to reconsider their incorrect solutions through multi-step re-evaluation did not lead to the correct answer.

Could you have answered this question?

For their study, the researchers turned to the “Alice in Wonderland” problem. Each AI model had to answer a simple logic question based on the following problem formulation:

Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?

For their study, the researchers used different versions of the problem, i.e. different numbers for N and M. For example, if the sentence reads “Alice has 3 brothers and she also has 2 sisters,” then the answer to the question “How many sisters does Alice's brother have?” is three: each brother has Alice's 2 sisters plus Alice herself, since Alice is also a sister of her brothers. In general, each brother therefore has M + 1 sisters.
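To illustrate the structure of the task, here is a minimal Python sketch of how such prompt variations and their ground-truth answers could be generated. This is only an illustration, not the study's actual evaluation code; the function names and the example values for N and M are invented here for demonstration.

# Minimal sketch (not LAION's evaluation harness): build AIW-style
# prompt variations and compute the ground-truth answer for each.

def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    # One variation of the "Alice in Wonderland" question.
    return (
        f"Alice has {n_brothers} brothers and she also has "
        f"{m_sisters} sisters. How many sisters does Alice's brother have?"
    )

def correct_answer(m_sisters: int) -> int:
    # Each brother has Alice's M sisters plus Alice herself: M + 1.
    return m_sisters + 1

# Hypothetical (N, M) pairs, chosen here only for demonstration.
for n, m in [(3, 2), (4, 1), (2, 4)]:
    print(aiw_prompt(n, m), "->", correct_answer(m))

Checking a model's answer against correct_answer(m) across many such variations yields success rates like the ones reported below.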

The AI models tested included OpenAI's GPT-3.5, GPT-4, and GPT-4o as well as Google's Gemini and Meta's Llama models. Yet according to the study's results, all of the models had problems solving the logic question. In some cases, they even insisted on their wrong solutions after repeated questioning.

Only OpenAI's new GPT-4o stood out from the crowd, with a success rate of 65 percent. Claude 3 Opus, by contrast, answered only 43 percent of the questions correctly, while Google's Gemini Pro gave the right answer in a mere 0.8 percent of cases.

Researchers raise questions about the capabilities of AI models

Based on the results of their study, the LAION researchers are calling for “an urgent reassessment of the claimed capabilities” of large language models. To this end, the industry would need to create standardized benchmarks.

Only in this way, they argue, can such “fundamental reasoning deficits” be identified, which “remain undetected in the evaluation procedures and benchmarks currently used.”


Source: https://www.basicthinking.de/blog/2024/06/12/ki-scheitert-an-logikfrage/
