Welcome to Insight Axis, where I make connections between practical philosophy, technology, books, science, and more. I’m Zan - follow me on Twitter (X), Threads and Substack.
If someone tells you that GPT-4 can pass the Turing Test, tell them they are wrong. Even if GPT-4 could pass the Turing Test, that wouldn't make it "intelligent". Read on to see why. If you need a primer on how LLMs work, I suggest you read this introductory essay first.

History of the Turing Test
Alan Turing proposed how we could test machine intelligence in his historic 1950 paper titled "Computing Machinery and Intelligence". In it, he describes an "Imitation Game" (now commonly called the Turing Test). The test is set up like this:
There are 3 agents in 3 different rooms (A, B and C). Agents A and C are human, and Agent B is a computer.
Agent C is tasked with guessing the identities of Agents A and B, as either computer or human.
None of the Agents can see each other - they are all sitting in separate rooms.
Agent C can only communicate with Agents A and B through text.
Based on the responses to text-based questions or prompts, Agent C must figure out which Agent is the human, and which is the computer (a toy sketch of this protocol follows below).
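To make the protocol concrete, here is a minimal sketch in Python. Everything in it is a hypothetical placeholder: in a real test, a person would type Agent A's replies and a chatbot would produce Agent B's; the point is the structure of the game, text-only exchanges followed by a forced guess.

```python
import random

def agent_a(question: str) -> str:
    # Hypothetical human reply (a real test would have a person typing here).
    return "Honestly? I'd reach for a calculator rather than do that by hand."

def agent_b(question: str) -> str:
    # Hypothetical machine reply (a real test would call a chatbot here).
    return "12345 divided by 678 is approximately 18.2080."

def imitation_game() -> None:
    question = "What is 12345 divided by 678? Answer however feels natural."
    # Agent C never sees A or B; it only receives text, in a random order,
    # so the guess cannot be based on appearance or voice.
    replies = [("X", agent_a(question)), ("Y", agent_b(question))]
    random.shuffle(replies)
    print("Agent C asks:", question)
    for label, text in replies:
        print(f"Agent {label} replies: {text}")
    print("Agent C must now guess which of X and Y is the computer.")

imitation_game()
```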
The Turing Test also comes with a caveat: it's not really a test. It's a thought experiment. This means that it was built to illustrate possibilities, rather than be a formal, executed test. Remember, Alan Turing was not aware of the training paradigms of today's AI, or of the ways in which deep learning models work. That knowledge would almost certainly have changed how he proposed to test their intelligence.
Why GPT-4 doesn't really pass the Turing Test
In her fantastic Nature review, Celeste Biever explains how the Turing Test isn't enough to determine whether an AI is truly intelligent, which I agree with. What I don't agree with is the title of her review: "ChatGPT broke the Turing Test - the race is on for new ways to assess AI". Her title implies that GPT-4 passes the Turing Test, which is why other tests of intelligence are needed. But I don't think that GPT-4 has truly passed the Turing Test, because it has no reasoning or explanatory capability, and "sparks of Artificial General Intelligence" aren't good enough. It would be easy to challenge GPT in a real-life Turing Test: suppose you are a specialist in some field (like law) acting as Agent C. I would bet that you could easily differentiate between a fellow specialist posing as Agent A and ChatGPT posing as Agent B after just a few technical questions. In this case, GPT would not pass the Turing Test. And therefore, the test is not "broken" as Biever suggests.
The Turing Test is certainly biased towards language capability, and GPT is good at writing credible-sounding things, even if it hasn't been trained to do so. This makes it a good contender at the Turing Test. But to really pass the Turing Test, you need reasoning, or an understanding of the types of error you might be prone to making. Unlike a human, when GPT is wrong, it's not necessarily because of incorrect reasoning, because GPT can't reason. It makes certain types of mistakes that humans wouldn't (see "Thought 1" of the essay Three Thoughts on Intelligence).

Intelligence is based on our ability to reason, and to create explanations of how the world works (see "Thought 2" of the same essay). When humans use language, we do it to convey information and reasoning. We speak to transmit our pre-linguistic thoughts and reasoning to others. The words are just vessels. Contrast this with when GPT "speaks": it essentially tries to predict the next most plausible word. If rudimentary reasoning arises from this prediction game, I think that's a happy coincidence, not intelligence.
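To make "predict the next most plausible word" concrete, here is a minimal toy sketch of that loop. The hard-coded scores are stand-ins for the logits a real transformer would compute over its entire vocabulary; nothing here is a real LLM API.

```python
import math

# Toy "model": maps a context to unnormalized scores (logits) for a handful
# of candidate next tokens. A real LLM computes these with a neural network
# over ~100k tokens; here they are fixed by hand for illustration.
def toy_logits(context: str) -> dict[str, float]:
    return {"mat": 2.5, "moon": 1.1, "refrigerator": -0.3}

def softmax(logits: dict[str, float]) -> dict[str, float]:
    # Convert raw scores into a probability distribution over tokens.
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(context: str, steps: int) -> str:
    for _ in range(steps):
        probs = softmax(toy_logits(context))
        # Greedy decoding: always append the most plausible next token.
        next_token = max(probs, key=probs.get)
        context += " " + next_token
    return context

print(generate("The cat sat on the", 1))  # -> "The cat sat on the mat"
```

Greedy selection is the simplest decoding strategy; real systems usually sample from the distribution instead. Either way, the loop only ranks continuations: it never constructs an explanation of the world, which is why fluent output alone isn't evidence of reasoning.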
Why the Turing Test isn't enough for Artificial General Intelligence
An AI could pass the Turing Test if it has reasoning capabilities, understands empathy, and can create new knowledge. And the Turing Test might be all that's needed to prove general intelligence for a purely language-based AI.
But I think that most AGI will be multi-modal. A linguistic Turing Test might be helpful for measuring how an AI's verbal reasoning capabilities compare to a human being's. But other tests would be needed for images and spatial intelligence, and probably others for mathematics. On top of this, the beauty of silicon-based AGI is that it should be powerful in traditional computational ways as well as in human ways. When we test an AGI, we also want to test it on things that humans will never be able to do, like process huge amounts of data or perform complex calculations.
True AGI testing will require multi-dimensional tests that interrogate all the chosen aspects of intelligence. The Turing Test could be a necessary component of this, but it would be far from sufficient.
Recommended reading from Substack:
- writes about Organizing Principles, and how they shape the world.
- cuts away the falsities to reveal how Via Negativa works - must read if you like Popper, Deutsch or Nassim Taleb.
- goes through how to create sustainably.
- explores how AI changes teacher-student relationships.
- breaks the rules and teaches you code by taking you to the fictional world of PyTown.
Reading beyond Substack:
Juergen Schmidhuber’s paper: Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
If you enjoyed this essay, please like, subscribe, and share it with others.
I learn from my readers, so leave a comment with your thoughts.
Absolutely wonderful summary of something that's been bugging me for a while.
The Turing test was written a bit differently than it is portrayed here. You mention that an expert in law would quickly recognize another specialist. The Turing test was written to compare the computer against a non-specialist. The original example ran with two tests. The first was one woman, and one man pretending to be a woman. The second was one woman and one computer pretending to be a woman. The question was not whether either the man or the computer could fool a (potentially female) interlocutor. That's a tall order for any man. It was to see if one could tell the difference between the two pretenders... one of which was a human who could think, and the other a computer.
You could have the interlocutor be a chess grandmaster, and ask them to play a game of chess against either a grandmaster or a novice. Then have them play a game against a grandmaster or a computer like ChatGPT. The issue would not be whether the grandmaster could figure out which was the other grandmaster, but how ChatGPT went about answering questions. If it makes an illegal move, and is called on it, how does it justify its actions?
As you pointed out, in some venues, ChatGPT is still falling short of the Turing test, even posed this way. However, on Stack Exchange, there's a VERY animated discussion regarding whether humans coming to their site can tell the difference between an answer from a human and a response from an AI. There's quite a lot of concern that, in that particular limited setting, the AIs are passing the Turing test, and it's a problem because they're often wrong with convincing rationales for why they're right. The AIs are starting to look an awful lot like a non-expert trying to play expert, and it's forming a nightmare for moderators.