More than half of news summaries produced by AI chatbots contain “significant errors”, according to a new BBC study, calling into question the reliability of generative AI.
AI assistants from OpenAI, Microsoft, Google and Perplexity AI were given content from the BBC and asked questions about the news.
The results were then analysed by BBC journalists familiar with the topics who found that 51% of the summaries contained “significant errors”.
Almost all of the responses, 91%, “had at least some errors”, while 19% contained factual errors and 13% included quotes sourced from BBC articles that had been altered from the original or were not present in the cited article.
Examples of the errors included Google’s Gemini incorrectly stating that the NHS warns against using vaping as a method to help smokers quit. Additionally, Microsoft’s Copilot incorrectly stated that Gisèle Pelicot uncovered the crimes against her when she began having blackouts and memory loss. In fact, she found out about the crimes when police showed her videos confiscated from her husband.
The report said that as well as containing factual inaccuracies, the chatbots “struggled to differentiate between opinion and fact, editorialised, and often failed to include essential context”.
Overall, Google’s Gemini performed worst, with significant issues in 34% of its responses, followed by Microsoft’s Copilot at 27% and Perplexity AI at 17%. OpenAI’s ChatGPT performed best, with significant issues in only 15% of its responses.
“The scale and scope of errors and the distortion of trusted content is unknown,” warned Peter Archer, Programme Director of Generative AI at the BBC, in his foreword to the study.
“This is because AI assistants can provide answers on a very broad range of questions and users can receive different answers to the same or similar question. Audiences, media companies and regulators do not know the extent of the issue. It may be that AI companies do not know either.”
This is not the first time in recent months that generative AI’s ability to handle the news accurately has come under the microscope.
In January, Apple was forced to pause its error-strewn AI-generated news alerts following mounting pressure from news organisations. The BBC was among those to complain about the alerts, which were part of the company’s Apple Intelligence features.
As well as pulling the AI-generated notifications for news and entertainment headlines, Apple confirmed that all of its AI-generated notifications are now shown in italics to differentiate them from regular notifications.
Large language models (LLMs) that power generative AI chatbots like ChatGPT are trained on vast quantities of information from the internet, including news content from publishers.
In the past, OpenAI has signed deals with the Associated Press and News Corp, whose titles include the Wall Street Journal and The Times, allowing it to train its AI models on content from those publications.
However, not everyone has been so accommodating. In September, the New York Times filed a lawsuit against OpenAI and Microsoft for allegedly violating copyright law by using the NYT’s content to train AI models. More recently, Indian book publishers have brought their own suit against OpenAI.
Reflecting on the results of the study, Archer called on AI companies to “hear our concerns and work constructively with us” to understand how to rectify the issues identified and to establish a long-term approach to “ensuring accuracy and trustworthiness in AI assistants”.
Additionally, he urged AI companies, public service broadcasters such as the BBC, Ofcom (the UK’s communications regulator) and the government to establish an “effective regulatory regime” for AI.