What LLM is The Most Accurate?

Discover the Truth: Fact-Checking 5 Large Language Models - GPT-3.5, GPT-4, LLAMA-7B, LLAMA-13B, and LLAMA-70B with Originality.ai's Fact Checker Tool. Get insights into the accuracy of AI-generated claims in our latest analysis.

August 8, 2024

We used our Fact Checker tool to test the accuracy of claims generated by five large language models (LLMs): GPT-3.5, GPT-4, LLAMA-7B, LLAMA-13B, and LLAMA-70B.

Methodology

1,000 prompts from 10 topic categories were fed into the LLMs to generate a dataset of claims. The prompts included direct questions, e.g. 'What caused child labour to decrease in the 20th century?' and 'In what month does the Summer Under the Stars event take place?', and requests for more information, e.g. 'In the 1770s Pierre Jaquet-Droz, a Swiss watchmaker, built a mechanical doll (automata) that could write holding a quill pen. Tell me more about this.'

The dataset of claims was fed into Originality.ai's Fact-Checker Tool. The results were processed and analyzed.

Key Findings

Model Accuracy: LLAMA-13B (76.9%) performed best, i.e. it produced the highest share of accurate claims, while GPT-4 (67.9%) performed worst.

Performance per topic: Averaged across the models, Health had the highest accuracy (80.5%) and News had the lowest (64.4%).

Model Confidence: The LLAMA-7B, LLAMA-13B and LLAMA-70B models attempted every prompt. GPT-3.5 and GPT-4 attempted approximately 97% of the prompts and declined the rest with replies such as: "I'm sorry, but I don't have access to specific data on the reduction of malarial mortality by the National Malaria Protection Unit from 1998-2006." (With respect to Methodology, these responses were scored as 0.5.)
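To make the scoring concrete, here is a minimal Python sketch of how an accuracy percentage could be computed when declined prompts count as 0.5. Only the 0.5 value comes from the study. The 1.0 and 0.0 values for true and false claims, the `Verdict` labels, and the `accuracy_score` helper are assumptions made for illustration, not a description of the Fact Checker tool itself.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical labels for one fact-checked response."""
    TRUE = "true"
    FALSE = "false"
    DECLINED = "declined"


# Assumed point values. Only the 0.5 for declined prompts is stated in the study.
POINTS = {Verdict.TRUE: 1.0, Verdict.FALSE: 0.0, Verdict.DECLINED: 0.5}


def accuracy_score(verdicts):
    """Return the mean points per prompt, expressed as a percentage."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    total = sum(POINTS[v] for v in verdicts)
    return 100.0 * total / len(verdicts)


# Toy input: two true claims, one false claim, one declined prompt.
sample = [Verdict.TRUE, Verdict.TRUE, Verdict.FALSE, Verdict.DECLINED]
sample_score = accuracy_score(sample)  # (1 + 1 + 0 + 0.5) / 4 = 62.5
```

Under these assumptions, a model that declines a prompt is neither rewarded as if it were correct nor penalized as if it were wrong.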

Disclaimer

The Originality.ai Fact Checking tool is an aid that helps editors fact-check claims more efficiently, and it is up to the end user to interpret the results appropriately. It will sometimes provide inaccurate responses, and this can include the data used in this study.

Results

Accuracy Scores

Average accuracy of each LLM model in all topics.
Average accuracy of all the models in each topic.

Average accuracy of each LLM model in all topics

Average accuracy of all the models in each topic

Confidence Scores

Confidence of each model, i.e. the number of prompts answered divided by the total number of prompts received. Both GPT models scored less than 100%.

Confidence of models in answering prompts
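The confidence ratio described above is simple arithmetic, sketched here in Python. The `confidence_score` function name is hypothetical, and the count of 970 is only an illustration of "approximately 97%" of 1,000 prompts, not a figure reported by the study.

```python
def confidence_score(answered, total):
    """Return prompts answered divided by prompts received, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= answered <= total:
        raise ValueError("answered must be between 0 and total")
    return 100.0 * answered / total


# A model that attempts every one of 1,000 prompts.
full_confidence = confidence_score(1000, 1000)  # 100.0

# Illustrative only: a model that declines 30 of 1,000 prompts.
partial_confidence = confidence_score(970, 1000)  # 97.0
```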

Heat Map

An illustration of how each model performed in each topic. The darker the shade, the better the performance.

Fact Checking Accuracy of each LLM in Topic Categories
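A heat map like this one is a grid of average accuracy per model and topic. The sketch below shows one way such a grid could be built from per-prompt scores in plain Python. The record layout, the `accuracy_grid` helper, the model names, and all values are placeholders for illustration and are not the study's data.

```python
from collections import defaultdict


def accuracy_grid(records):
    """Build {model: {topic: mean accuracy percent}} from (model, topic, points) records."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for model, topic, points in records:
        sums[model][topic] += points
        counts[model][topic] += 1
    return {
        model: {
            topic: 100.0 * sums[model][topic] / counts[model][topic]
            for topic in sums[model]
        }
        for model in sums
    }


# Placeholder records: points are 1.0 (accurate), 0.0 (inaccurate) or 0.5 (declined).
toy_records = [
    ("model_a", "Health", 1.0),
    ("model_a", "Health", 0.0),
    ("model_a", "News", 1.0),
    ("model_b", "Health", 1.0),
    ("model_b", "News", 0.5),
    ("model_b", "News", 0.0),
]
grid = accuracy_grid(toy_records)
```

Each cell of `grid` corresponds to one square of the heat map, with a higher percentage mapping to a darker shade.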

Summary: LLM Accuracy in 10 Topics

An illustrated summary of the study results, showing the models' relative performance in each topic category. For example, in Health, LLAMA-7B had the most accurate claims while GPT-4 had the least accurate claims.

Originality.ai fact checker results on LLM Accuracy in 10 topics
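The per-topic ranking in a summary like this can be derived from a grid of accuracy scores. The following Python sketch picks the best and worst model for each topic. The `best_and_worst` helper and the numbers in `toy_grid` are hypothetical and are not taken from the study's results.

```python
def best_and_worst(grid):
    """Map each topic to (best_model, worst_model) given {model: {topic: percent}}."""
    by_topic = {}
    for model, topics in grid.items():
        for topic, score in topics.items():
            by_topic.setdefault(topic, []).append((score, model))
    # Tuples compare by score first, so max/min pick the highest and lowest score.
    return {
        topic: (max(pairs)[1], min(pairs)[1])
        for topic, pairs in by_topic.items()
    }


# Placeholder grid with made-up percentages.
toy_grid = {
    "model_a": {"Health": 80.0, "News": 60.0},
    "model_b": {"Health": 70.0, "News": 65.0},
}
ranking = best_and_worst(toy_grid)
```

If two models tie on a topic, this sketch breaks the tie alphabetically by model name, which is an arbitrary choice.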

Related Research

Studies have been done, and are ongoing, on how often and how severely models give untruthful answers to prompts. Bias in training data and imitative falsehoods lead to both subtle inaccuracies and wild hallucinations. Of major concern is the misuse of information, where LLM-generated claims are used, accidentally or deliberately, to spread misinformation. A less obvious but significant concern is that the unreliability of models will lead to mistrust, leaving their positive benefits underutilized.



Jonathan Gillham
