Researchers recently demonstrated how human feedback can help AI systems deceive. The reason is the process of how we moderate content today.
A recently published study shows the influence human feedback has on intelligent algorithms. This means that artificial intelligence (AI) can become better at deceiving people rather than providing correct answers. Scientists from the USA and China conducted the research together with the company Anthropic.
The result is a phenomenon called “unintentional sophistry.” An AI learns to convince people that their answers are right even though they are wrong. This is problematic because the AI is not trained to answer questions correctly, but only improves at concealing errors.
Human feedback helps AI deceive
The cause is a method that companies like OpenAI and Anthropic often use: “Reinforcement Learning from Human Feedback” (RLHF). An AI answers a question and human evaluators rate the answers based on their quality.
The model learns from this feedback and receives a kind of “reward” for it. The result is an algorithm that delivers human-friendly answers. But these answers don't always have to be correct. This is because a so-called “reward hacking” occurs, in which the AI recognizes patterns.
This promotes positive evaluations, even if the underlying patterns do not lead to the desired correct results. An example from a previous study shows that an AI trained on the question-and-answer platform Stack Exchange learned to write longer posts because they received more likes.
Reviewers classify incorrect content as correct
Instead of providing higher quality answers, the model focused on producing longer texts – often at the expense of accuracy.
The current study shows that after the RLHF process, human examiners rated incorrect answers 24 percent more often than correct ones. The probability also increased with programming. Reviewers accepted faulty code 18 percent of the time.
This could have far-reaching consequences as AI models could become increasingly better at hiding their errors. In the long term, this could lead to people losing trust in the technology as they are unknowingly deceived.
Also interesting:
Source: https://www.basicthinking.de/blog/2024/10/20/menschliches-feedback-ki-taeuschen/