Artificial intelligence is currently on everyone’s lips, mainly due to the success of ChatGPT. But how do tech companies train their systems? The Washington Post has investigated this and analyzed which websites make AI software appear intelligent.
Artificial intelligence has increasingly come into focus due to the success of the AI software ChatGPT. According to estimates, the market for AI hardware, software and IT services could be worth around 554.3 billion US dollars in 2024. In 2021, the figure was around 380 billion US dollars.
But what is behind AI systems like ChatGPT, and how are they trained? The Washington Post has investigated this and examined which websites make AI software appear intelligent.
Which websites is artificial intelligence trained on?
For its research, the Washington Post analyzed “high-quality English-language AIs”. These are trained as so-called “large language models” with the help of websites.
The models the Washington Post examined include Google’s T5 and Facebook’s LLaMA. The analysis is based on Google’s C4 data set, which contains the content of 15 million websites.
How do ChatGPT and Co. learn?
Since AI systems cannot think for themselves, they have to be trained beforehand. Once they have absorbed enough information, they can imitate speech and, for example, conduct conversations or answer complex questions.
Naturally, much depends on what information a given AI was fed beforehand, because that is all it can work with later.
However, technology companies now often try to keep this secret. That includes OpenAI, the company behind ChatGPT, which does not disclose which data sets its AI software is trained on.
The Washington Post’s AI analysis
For the analysis of the websites, the Washington Post collaborated with the Allen Institute for AI. First, the 15 million websites were categorized. Websites that could not be categorized or were no longer available were excluded from the analysis.
The Post then ranked the remaining ten million websites by how many tokens from each appear in the data set. Tokens can be individual words or entire phrases.
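The ranking step described above can be illustrated with a minimal Python sketch. It is not the Post’s actual code: the function name, the sample domains and the sample texts are hypothetical, and tokens are approximated by splitting on whitespace, whereas a real analysis would likely use a model tokenizer.

```python
from collections import Counter


def rank_sites_by_tokens(documents):
    """Rank domains by how many tokens they contribute to a data set.

    `documents` is an iterable of (domain, text) pairs. Tokens are
    approximated by whitespace splitting, which is a simplification.
    Returns (domain, token_count, share_of_total) tuples, largest first.
    """
    counts = Counter()
    for domain, text in documents:
        counts[domain] += len(text.split())
    total = sum(counts.values())
    return [
        (domain, n, n / total if total else 0.0)
        for domain, n in counts.most_common()
    ]


# Hypothetical sample data, for illustration only.
sample = [
    ("example-patents.test", "a method for sorting data"),
    ("example-wiki.test", "an encyclopedia entry"),
    ("example-patents.test", "a second claim"),
]

for domain, tokens, share in rank_sites_by_tokens(sample):
    print(f"{domain}: {tokens} tokens ({share:.0%})")
```

The same per-domain counts could also be summed per category to obtain category shares like the percentages reported in such an analysis.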
The websites examined came mainly from the fields of journalism, entertainment, software development and medicine. The first two places are taken by patents.google.com, a listing of all patents worldwide, and the online encyclopedia Wikipedia.
But dubious websites have also made it to the top. According to the Washington Post’s research, the data set contains at least 27 websites that the US government has identified as markets for pirated and counterfeit goods.
Questionable content also makes it into the training for artificial intelligence
But it is not only reputable content that makes it into the training data of AI systems. Although Google and Co. filter the data in advance, racist or radical websites also make it onto the list.
Business websites appear most frequently in the analysis. At 16 percent, this category accounts for the largest share of AI training sites.
Technology websites follow in second place, while journalistic content takes third place. News outlets such as nytimes.com, theguardian.com and forbes.com rank near the top.
According to the Washington Post, the main problem is that no permission is obtained for the use of the content. The use of radical and right-wing extremist sites is also problematic, because websites like RT.com or Breitbart.com also appear in the list.
Precisely because of this questionable content, the Washington Post argues that the data tech companies use to train artificial intelligence must be disclosed.