Talk Commerce
Which AI Writing Tools Sound Most Human? A New Study Has Answers

By Brent W. Peterson


Half of all articles online are now estimated to be AI-generated. That number keeps climbing, and platforms are paying attention. Content that reads like it came from a machine gets flagged, downranked, or filtered out entirely.

A new study from Open Resource Applications put 12 AI writing tools to the test. Each tool received the same instructions: write a 1,000- to 1,500-word article that sounds as human as possible. The output was then run through three AI detection tools (Grammarly, QuillBot, and GPTZero) and compared against human-authored text analyzed by the same detectors.

The Results

Here is how the five top-ranked tools performed, ordered by the average percentage of content flagged as AI-generated across the three detectors:

AI Tool      Grammarly   QuillBot   GPTZero   Average Detected
Gemini           17%         0%       100%        39.00%
Claude AI         0%        30%        93%        41.00%
Grok AI          39%         0%       100%        46.33%
DeepSeek          6%        36%        98%        46.67%
PI               17%        43%       100%        53.33%
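The ranking is a simple unweighted mean of the three detector scores. A minimal sketch reproducing the averages from the figures in the table above (the numbers are the study's; the averaging method is an unweighted mean, which matches the published results):

```python
# Per-detector flag rates (Grammarly, QuillBot, GPTZero), in percent,
# taken from the table above.
scores = {
    "Gemini":    [17, 0, 100],
    "Claude AI": [0, 30, 93],
    "Grok AI":   [39, 0, 100],
    "DeepSeek":  [6, 36, 98],
    "PI":        [17, 43, 100],
}

# Rank tools by mean flag rate, lowest (most human-sounding) first.
for tool, pcts in sorted(scores.items(), key=lambda kv: sum(kv[1]) / len(kv[1])):
    avg = sum(pcts) / len(pcts)
    print(f"{tool:10s} {avg:6.2f}% flagged on average")
```

Running this reproduces the Average Detected column, with Gemini at the top of the list.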

Gemini came in first with only 39% of its writing flagged on average. Claude AI was a close second at 41%. Both tools fooled at least one detector completely, with Grammarly finding zero AI indicators in Claude’s output and QuillBot finding none in Gemini’s.

ChatGPT’s Surprising Ranking

Despite being the most popular AI tool with 800 to 900 million monthly users, ChatGPT ranked 9th out of 12. Grammarly flagged half of its text. QuillBot and GPTZero both caught 90% to 100%.

The explanation is straightforward. ChatGPT was the first major AI writing tool. Everyone knows what it sounds like. The models that launched after it initially mimicked that same style before developing their own voice. Detection tools have had years to learn ChatGPT’s patterns.

How the Top 3 Differ

Gemini, Claude, and Grok each take a different approach to natural-sounding writing.

Claude leans toward a professional tone and tends to weave in source material and insights. Grok takes a more casual approach. Gemini strikes a balance that, at least by these metrics, proves hardest for detectors to flag.

Technical architecture plays a role too. Grok offers custom instruction storage that can hold up to 12,000 words, and its long-term memory is stronger than ChatGPT’s. These capabilities influence not just what gets written but how it reads.

The Detection Tools Tell Their Own Story

The study reveals as much about the detectors as it does about the AI writers.

GPTZero was the toughest to beat, correctly identifying AI writing at a 98.8% rate. Only Claude and Meta AI managed to confuse it, and even then GPTZero only missed 5% to 8% of their AI-generated content.

Grammarly was the least effective, correctly spotting only 43.5% of the generated content across all tools tested.

One encouraging finding: these detection tools rarely mistake actual human writing for AI. Both Grammarly and QuillBot found zero AI artifacts in properly written human text. That adds a layer of reliability for content teams using these tools to vet submissions.

What This Means for Content Teams

The gap between AI writing tools is already wide enough that the same prompt produces completely different results depending on which tool you use. Choosing the most popular option is not always the best strategy, especially when that popularity is exactly what makes the output easier to detect.

Tools like GPTZero look beyond simple word patterns. They flag predictability and structural patterns too. An AI model that actually reasons through ideas rather than recycling familiar phrases will be harder to catch.
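One structural signal detectors are often described as using is "burstiness," or how much sentence length varies; uniformly sized sentences read as machine-like. A toy illustration of that idea (not GPTZero's actual method, just an assumed proxy using the coefficient of variation of sentence lengths):

```python
import re
import statistics

def sentence_length_variation(text: str) -> float:
    """Coefficient of variation of sentence lengths, in words.

    Higher values mean more varied, "burstier" sentences; a toy proxy
    for one signal detectors are commonly said to look at.
    """
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

# Three identically shaped sentences: zero variation.
flat = "The cat sat down. The dog ran off. The bird flew away."
# Mixed short and long sentences: high variation.
varied = ("Stop. The dog, startled by the noise, bolted across the "
          "yard before anyone could react. Silence followed.")
```

On these samples, the flat text scores 0.0 and the varied text scores well above 1, illustrating why rigidly uniform prose is easier to flag.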

The full study is available from Open Resource Applications.

Data credit: Open Resource Applications (openresources.co.uk)