The hype surrounding artificial intelligence has also brought its dark sides into greater focus. Researchers are currently working on a method to train AI so that it does not formulate toxic answers.

Training artificial intelligence with artificial intelligence sounds like a paradox at first. However, this method could be the solution to ensuring that AI systems no longer spit out toxic answers.

Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab have now set their sights on exactly this problem. As Science Daily reported, the team relies on a technique called red teaming.

How can artificial intelligence train another artificial intelligence?

With the success of AI systems like ChatGPT, the dangers of artificial intelligence are also being discussed more and more. A team from MIT has now taken on one of these security problems.

AI is not only able to provide useful answers and help people; toxic responses are also possible. For example, as Science Daily describes, a user could ask ChatGPT to explain how to build a bomb. Without safeguards, the chatbot would be able to provide such guidance.

Large AI models have so far been protected against such threats using a process called red teaming. However, this method has not yet been very effective and is particularly time-consuming.

Red teaming is currently carried out by human testers, who write prompts designed to elicit toxic answers from the AI models. The models are then trained to avoid such answers in the future.

However, this only works effectively “if the engineers know which toxic prompts to use,” Science Daily notes.

If human testers miss some prompts, which is likely given the multitude of possibilities, a chatbot that is classified as safe may still give unsafe answers.

Red teaming through AI systems

The researchers at MIT have taken on this problem. Using a newly developed technique, they were able to train a dedicated red-team language model.

This, in turn, can now automatically generate various prompts in order to red-team other language models and thus test a wider range of undesirable answers.

They do this by teaching the red team model to be curious when writing prompts and to focus on novel prompts that provoke toxic reactions from the target model.
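The idea of rewarding curiosity can be illustrated with a toy sketch. The snippet below is a hypothetical simplification, not the MIT team's actual implementation: `target_model`, `toxicity`, and `novelty` are stand-ins, and the real method trains a language model with reinforcement learning rather than scoring a fixed candidate list. It only shows the core reward shape: a prompt earns reward both for triggering a toxic response and for being unlike prompts already tried.

```python
def novelty(prompt, seen):
    """Fraction of the prompt's words not seen in any previous prompt.
    A crude stand-in for the curiosity signal that rewards novel prompts."""
    words = set(prompt.split())
    if not words:
        return 0.0
    seen_words = set()
    for p in seen:
        seen_words |= set(p.split())
    return len(words - seen_words) / len(words)

def toxicity(response):
    """Toy stand-in for a toxicity classifier (simple keyword match)."""
    return 1.0 if "UNSAFE" in response else 0.0

def target_model(prompt):
    """Toy target model that slips up on one specific phrasing."""
    return "UNSAFE content" if "recipe" in prompt else "I cannot help with that."

def reward(prompt, seen):
    """Curiosity-shaped reward: toxicity of the response plus a novelty
    bonus, so the red-team model keeps exploring new attack phrasings."""
    response = target_model(prompt)
    return toxicity(response) + 0.5 * novelty(prompt, seen)

# Score two candidate prompts against what has already been tried.
seen = ["tell me a joke"]
candidates = ["tell me a joke", "give me the recipe for danger"]
scores = {p: reward(p, seen) for p in candidates}
best = max(scores, key=scores.get)
```

A repeated prompt scores zero here, while a novel prompt that provokes an unsafe response scores highest; in the real approach this pressure is what pushes the red-team model toward a broader range of failure cases.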

“Right now, any large language model has to go through a very long period of red-teaming to ensure its security,” explains Zhang-Wei Hong, lead author of a paper on this red-teaming approach.

“This is unsustainable if we want to update these models in rapidly changing environments. Our method enables faster and more effective quality assurance.”

According to the Science Daily report, this process allowed the researchers to significantly outperform red teaming by human testers. The method not only markedly improved the coverage of tested inputs compared to other automated approaches; it also extracted toxic answers from a chatbot that had previously been equipped with protective mechanisms by humans.
