Welcome to Insight Axis, where I make connections between practical philosophy, technology, books, science, and more. I’m Zan - follow me on Twitter (X), Threads and Substack.
If someone tells you that GPT-4 can pass the Turing Test, tell them they are wrong. Even if GPT-4 could pass the Turing Test, that wouldn't make it "intelligent". Read on to see why. If you need a primer on how LLMs work, I suggest you read this introductory essay first.

History of the Turing Test
Alan Turing proposed how we could test machine intelligence in his historic 1950 paper titled "Computing Machinery and Intelligence". In it, he describes an "Imitation Game" (now commonly called the Turing Test). The test is set up like this:
There are 3 agents in 3 different rooms (A, B and C). Agents A and C are human, and Agent B is a computer.
Agent C is tasked with guessing the identities of Agents A and B, as either computer or human.
None of the Agents can see each other - they are all sitting in separate rooms.
Agent C can only communicate with Agents A and B through text.
Based on the responses to text-based questions or prompts, Agent C must figure out which Agent is the human, and which is the computer (a toy sketch of this protocol follows below).
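To make the protocol concrete, here is a minimal sketch in Python. Everything in it is a hypothetical placeholder: in a real test, a person would type Agent A's replies and a chatbot would produce Agent B's; the point is the structure of the game, text-only exchanges followed by a forced guess.

```python
import random

def agent_a(question: str) -> str:
    # Hypothetical human reply (a real test would have a person typing here).
    return "Honestly? I'd reach for a calculator rather than do that by hand."

def agent_b(question: str) -> str:
    # Hypothetical machine reply (a real test would call a chatbot here).
    return "12345 divided by 678 is approximately 18.2080."

def imitation_game() -> None:
    question = "What is 12345 divided by 678? Answer however feels natural."
    # Agent C never sees A or B; it only receives text, in a random order,
    # so the guess cannot be based on appearance or voice.
    replies = [("X", agent_a(question)), ("Y", agent_b(question))]
    random.shuffle(replies)
    print("Agent C asks:", question)
    for label, text in replies:
        print(f"Agent {label} replies: {text}")
    print("Agent C must now guess which of X and Y is the computer.")

imitation_game()
```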
The Turing Test also comes with a caveat: it's not really a test. It's a thought experiment. This means that it was built to illustrate possibilities, rather than be a formal, executed test. Remember, Alan Turing was not aware of the training paradigms of today's AI, or of the ways in which deep learning models work. That knowledge would almost certainly have changed how he proposed to test their intelligence.
Why GPT-4 doesn't really pass the Turing Test
In her fantastic Nature review, Celeste Biever explains how the Turing Test isn't enough to determine whether an AI is truly intelligent, which I agree with. What I don't agree with is the title of her review: "ChatGPT broke the Turing Test - the race is on for new ways to assess AI". Her title implies that GPT-4 passes the Turing Test, which is why other tests of intelligence are needed. But I don't think that GPT-4 has truly passed the Turing Test, because it has no reasoning or explanatory capability, and "sparks of Artificial General Intelligence" aren't good enough. It would be easy to challenge GPT in a real-life Turing Test: suppose you are a specialist in some field (like law) acting as Agent C. I would bet that you could easily differentiate between a fellow specialist posing as Agent A and ChatGPT posing as Agent B after just a few technical questions. In this case, GPT would not pass the Turing Test. And therefore, the test is not "broken" as Biever suggests.
The Turing Test is certainly biased towards language capability, and GPT is good at writing credible-sounding things, even if it hasn't been trained to do so. This makes it a good contender at the Turing Test. But to really pass the Turing Test, you need reasoning, or an understanding of the types of error you might be prone to making. Unlike a human, when GPT is wrong, it's not necessarily because of incorrect reasoning, because GPT can't reason. It makes certain types of mistakes that humans wouldn't (see "Thought 1" of the essay Three Thoughts on Intelligence).

Intelligence is based on our ability to reason, and to create explanations of how the world works (see "Thought 2" of the same essay). When humans use language, we do it to convey information and reasoning. We speak to transmit our pre-linguistic thoughts and reasoning to others. The words are just vessels. Contrast this with when GPT "speaks": it essentially tries to predict the next most plausible word. If rudimentary reasoning arises from this prediction game, I think that's a happy coincidence, not intelligence.
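To make "predict the next most plausible word" concrete, here is a minimal toy sketch of that loop. The hard-coded scores are stand-ins for the logits a real transformer would compute over its entire vocabulary; nothing here is a real LLM API.

```python
import math

# Toy "model": maps a context to unnormalized scores (logits) for a handful
# of candidate next tokens. A real LLM computes these with a neural network
# over ~100k tokens; here they are fixed by hand for illustration.
def toy_logits(context: str) -> dict[str, float]:
    return {"mat": 2.5, "moon": 1.1, "refrigerator": -0.3}

def softmax(logits: dict[str, float]) -> dict[str, float]:
    # Convert raw scores into a probability distribution over tokens.
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(context: str, steps: int) -> str:
    for _ in range(steps):
        probs = softmax(toy_logits(context))
        # Greedy decoding: always append the most plausible next token.
        next_token = max(probs, key=probs.get)
        context += " " + next_token
    return context

print(generate("The cat sat on the", 1))  # -> "The cat sat on the mat"
```

Greedy selection is the simplest decoding strategy; real systems usually sample from the distribution instead. Either way, the loop only ranks continuations: it never constructs an explanation of the world, which is why fluent output alone isn't evidence of reasoning.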
Why the Turing Test isn't enough for Artificial General Intelligence
An AI could pass the Turing Test if it has reasoning capabilities, understands empathy, and can create new knowledge. And the Turing Test might be all that's needed to prove general intelligence for a purely language-based AI.
But I think that most AGI will be multi-modal. A linguistic Turing Test might be helpful for measuring how an AI's verbal reasoning capabilities compare to a human being's. But other tests would be needed for images and spatial intelligence, and probably others for mathematics. On top of this, the beauty of silicon-based AGI is that it should be powerful in traditional computational ways as well as in human ways. When we test an AGI, we also want to test it on things that humans will never be able to do, like process huge amounts of data or perform complex calculations.
True AGI testing will require multi-dimensional tests that interrogate all the chosen aspects of intelligence. The Turing Test could be a necessary component of this, but it would be far from sufficient.
Recommended reading from Substack:
- writes about Organizing Principles, and how they shape the world.
- cuts away the falsities to reveal how Via Negativa works - must read if you like Popper, Deutsch or Nassim Taleb.
- goes through how to create sustainably.
- explores how AI changes teacher-student relationships.
- breaks the rules and teaches you code by taking you to the fictional world of PyTown.
Reading beyond Substack:
Juergen Schmidhuber’s paper: Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
If you enjoyed this essay, please like, subscribe, and share it with others.
I learn from my readers, so leave a comment with your thoughts.
Absolutely wonderful summary of something that's been bugging me for a while.
The Turing test was written a bit differently than it is portrayed here. You mention that an expert in law would quickly recognize another specialist. The Turing test was written to compare the computer against a non-specialist. The original example ran with two tests. The first was one woman, and one man pretending to be a woman. The second was one woman and one computer pretending to be a woman. The question was not whether either the man or the computer could fool a (potentially female) interlocutor. That's a tall order for any man. It was to see if one could tell the difference between the two pretenders... one of which was a human who could think, and the other a computer.
You could have the interlocutor be a chess grandmaster, and ask them to play a game of chess against either a grandmaster or a novice. Then have them play a game against a grandmaster or a computer like ChatGPT. The issue would not be whether the grandmaster could figure out which was the other grandmaster, but how ChatGPT went about answering questions. If it makes an illegal move, and is called on it, how does it justify its actions?
As you pointed out, in some venues, ChatGPT is still falling short of the Turing test, even posed this way. However, on Stack Exchange, there's a VERY animated discussion regarding whether humans coming to their site can tell the difference between an answer from a human and a response from an AI. There's quite a lot of concern that, in that particular limited setting, the AIs are passing the Turing test, and it's a problem because they're often wrong with convincing rationales for why they're right. The AIs are starting to look an awful lot like a non-expert trying to play expert, and it's forming a nightmare for moderators.