I first encountered the science fiction writing of Isaac Asimov as a teenager in the late 1970s and early 1980s. If the name R. Daneel Olivaw means anything to you, then you probably had a similar adolescence.
Asimov, I think, did more for the anthropomorphism of technology than anyone else, through a series of novels and short stories written in the mid-20th century. R. Daneel Olivaw (the R stands for Robot) was a character introduced in the 1953 novel The Caves of Steel who appeared in many subsequent stories and novels. Olivaw had the distinction of being the first robot in Asimov’s imaginary world that was indistinguishable from a human—thus presaging the advent of Artificial General Intelligence, or AGI. Asimov put the date of Olivaw’s construction at 4920 AD.
I’m also a big fan of Alan Turing, the British mathematician and computer scientist. I don’t know if Turing and Asimov ever met, but I think it might not be a coincidence that Turing devised his famous test for AI sentience only a few years before the publication of Asimov’s novel. Scientists and artists alike at the time seemed obsessed with technology, AI, androids and their relationship to humanity.
Fast-forward to the present day, and the hype around generative AI shows no signs of abating. Some believe that ChatGPT has already passed the Turing Test, others suggest that an AGI more intelligent than humans is only a few years away, and no less a celebrity than recent Nobel laureate Geoffrey Hinton claims that there is a 20 per cent probability that humanity will go extinct due to rogue AI within 30 years.
Personally, I’m not so sure. When I started writing about AI in early 2023, I laid out my misgivings about anthropomorphizing the technology, and I revisited those concerns in July of 2024. I continue to worry about this, and I think that the current state of AI, impressive as it is in many ways, still falls far short of what’s required to demonstrate human-level sentience. Further, I’m beginning to think that we’re not well served by the term AI itself, in large part because the anthropomorphism it implies is unjustified. I believe it’s time for a reset: step back from the apocalyptic hype and recognize AI for what it is—a collection of technical tools that can help us with a variety of tasks but that is not in any meaningful way human or sentient.
History lesson
Generative AI is not, in itself, a brand-new invention from late 2022. A decade before ChatGPT, IBM built a system that became known as Watson. It included natural language processing capabilities that enabled it to understand English prompts and respond appropriately. It was trained on a large, optimized database of as many facts as could be gathered from the internet and other sources available at the time. For publicity, IBM entered Watson in the television game show Jeopardy!, which it won handily,[1] but few mistook it for human-level intelligence.
Long before that, in the mid-1960s, computer scientist Joseph Weizenbaum, working at MIT, created a program called ELIZA, which would analyze a user’s sentence, looking for keywords that it could slot into a templated response. Given the limitations of the technology at the time, ELIZA couldn’t always find a keyword, in which case it would simply repeat back one of the user’s prior sentences. It worked well enough to convince some people that they were communicating with a human being, even a psychotherapist. Many similar systems followed ELIZA in subsequent decades.
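To get a feel for how simple the underlying mechanism was, here is a minimal Python sketch of ELIZA-style keyword matching. The patterns and canned responses are my own invention for illustration; they are not Weizenbaum’s actual DOCTOR script.

```python
import random
import re

# Illustrative keyword patterns and response templates in the spirit of ELIZA;
# invented for this sketch, not taken from Weizenbaum's program.
RULES = [
    (r"\bI am (.*)", ["Why do you say you are {0}?", "How long have you been {0}?"]),
    (r"\bI feel (.*)", ["Why do you feel {0}?", "Do you often feel {0}?"]),
    (r"\b(mother|father|family)\b", ["Tell me more about your family."]),
]
FALLBACKS = ["Please go on.", "Earlier you said: \"{last}\""]

def eliza_reply(sentence, last_sentence=""):
    for pattern, templates in RULES:
        match = re.search(pattern, sentence, re.IGNORECASE)
        if match:
            # Slot whatever followed the keyword into a canned template.
            return random.choice(templates).format(*match.groups())
    # No keyword found: echo a stock phrase or one of the user's prior sentences.
    return random.choice(FALLBACKS).format(last=last_sentence or sentence)

print(eliza_reply("I am worried about my exams"))
print(eliza_reply("It rained all day", last_sentence="I am worried about my exams"))
```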
Today’s generative AI systems differ from Watson and ELIZA mainly in scale. They work in more or less the same way, by breaking text down into tokens—small chunks of text such as words or word fragments—that can be matched against patterns in the training data and recombined into plausible responses. The predictive algorithms are more sophisticated and the training data is vastly larger, but the basic approach is still the same.
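To illustrate the principle (though emphatically not the implementation: real LLMs use subword tokenizers and neural networks, not lookup tables), here is a toy next-token predictor in Python. It counts which token tends to follow which in a scrap of training text, then samples plausible continuations from those counts.

```python
from collections import Counter, defaultdict
import random

# Toy next-token prediction: a bigram model over whitespace tokens.
# Real LLMs use subword tokens and neural networks, but the task is the same:
# given the context so far, predict a plausible next token.
training_text = "the cat sat on the mat . the dog sat on the rug ."
tokens = training_text.split()

next_counts = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    next_counts[current][nxt] += 1

def generate(start, length=8):
    out = [start]
    for _ in range(length):
        candidates = next_counts.get(out[-1])
        if not candidates:
            break
        # Sample the next token in proportion to how often it followed this one.
        choices, weights = zip(*candidates.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)

print(generate("the"))  # e.g. "the cat sat on the rug . the dog"
```

The output looks grammatical, but it is driven entirely by frequencies in the training text, which is the point: plausibility without understanding.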
Imitation is the sincerest form
Ask yourself this: When’s the last time ChatGPT initiated a conversation with you? It sounds like a trivial question, but the fact that it needed to be asked and answered—in the negative of course—should tell you something. Infants and even pets can do better. I would suggest that any system that needs prompting before coming up with anything to say—regardless of whether what it says is original, or even interesting—is on a path that diverges pretty sharply from true human intelligence.
It’s extremely telling that Alan Turing’s original name for his eponymous test was “The Imitation Game.” I’ve touched on this in a previous post, and I think it’s worth some deeper consideration: the Turing Test only evaluates whether a computer can imitate human conversational behaviour. The more I think about it, the more I realize that there are a lot of problems with this approach—beginning with the subjective nature of the human interrogator. Should the test be repeated with multiple interrogators? Does the computer pass if 50 per cent of the interrogators believe it to be human? Ultimately, is conversational ability the right test at all? What does it even mean to be human in the first place? It’s hard, I think, to define an objective test when you don’t really know what you’re testing for.
This is what philosopher Eric Schwitzgebel was getting at in an article which I quoted last summer—he insists that mimicry is not sufficient to impute intelligence, but some extra argument is required. What that extra argument is—well, that was left as an open question. However, we might now have enough information to start to formulate an answer.
Number crunch
If we accept that mimicking human speech is not a sufficient test for intelligence, then what else can we test for that might provide Schwitzgebel’s extra argument? Looking for some deeper reasoning capability, AI researchers turned to mathematics. If AI systems could pass standard mathematics exams, then perhaps that, combined with their linguistic abilities, would get us somewhere.
A VentureBeat article from November 2024 succinctly describes what happened next.
Datasets of math problems already exist online, and the article references a couple of them—GSM-8K and MATH—each of which contains thousands of word problems, ranging from primary school arithmetic to high school competition questions. It didn’t take long for generative AI systems to start scoring well over 90 per cent on these benchmarks. Impressive? Not really. VentureBeat notes that data contamination has been a big factor in degrading the reliability of these math tests.
What does that mean? Imagine you’re a high school student studying for your final math exam, and your teacher hands out practice exams as study aids. You go to the effort of memorizing every question and answer in the practice exams and, when you sit your final exam, you find that the questions are drawn exactly from the ones you have already seen. Naturally, you ace your final, but you never really learned anything. So it was with AI—the training data fed into GPT-4, Claude and many other generative AI systems included virtually all the word problems that also appear in the GSM-8K and MATH benchmarks. It’s a large-scale but classic case of overfitting—the AI systems already had the answers and didn’t actually have to solve the problems. The only surprise, maybe, is that they didn’t score 100 per cent—possibly one or two problems were missing from the training data, or an occasional hallucination crept in.
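As a rough illustration of what a contamination check looks like, here is a sketch that counts how many benchmark questions appear verbatim in a training corpus. The data and function names are hypothetical, and real audits also have to catch near-duplicates and paraphrases that simple string matching would miss.

```python
def normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences
    # don't hide an exact duplicate.
    return " ".join(text.lower().split())

def contamination_rate(benchmark_questions, training_corpus):
    """Fraction of benchmark questions found verbatim in the training corpus."""
    corpus = normalize(training_corpus)
    hits = sum(1 for q in benchmark_questions if normalize(q) in corpus)
    return hits / len(benchmark_questions)

# Hypothetical example data, for illustration only.
benchmark = ["A train travels 60 km in 1.5 hours. What is its average speed?"]
corpus = "... A train travels 60 km in 1.5 hours. What is its average speed? ..."
print(contamination_rate(benchmark, corpus))  # 1.0 means every question was seen in training
```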
At any rate, testing AI on easy problems with known, published solutions turned out to be meaningless. So Epoch AI, a research organization working with dozens of professional mathematicians, stepped in with a new benchmark called FrontierMath. This is a collection of new, unpublished problems at university and research level, each with a definite, verifiable answer, and each would take a competent mathematician hours or even days to solve. Brute-force methods and memorization would not work; these problems require deep domain knowledge in mathematics along with the ability to reason by inference.
Epoch AI described the benchmark in a very good paper published on arXiv in November 2024. The test was set up to see if generative AI could reason through a problem in a way similar to a human mathematician. When prompted with a problem, the AI system would come up with a possible solution and then generate Python code that, when executed, could validate that solution. Output and errors from the Python program could be fed back into the prompt, allowing the system to refine its solution or decide that it had reached a final answer. This loop could be repeated as many times as necessary, though the test designers did set a limit of 10,000 tokens per problem as a measure of the computational efficiency of the AI systems.
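Based on that description, the evaluation loop might be sketched roughly as follows. The callables ask_model, run_python and count_tokens are placeholders I have invented to show the shape of the protocol; they are not Epoch AI’s actual harness or any real API.

```python
# Rough sketch of the loop described above: the model proposes an answer plus
# Python code to verify it, the code is executed, and the output is fed back
# so the model can refine its answer within a fixed token budget.
TOKEN_BUDGET = 10_000  # per-problem limit, as described in the paper

def evaluate(problem, ask_model, run_python, count_tokens):
    prompt = f"Problem: {problem}\nPropose an answer and Python code to verify it."
    tokens_used = 0
    while tokens_used < TOKEN_BUDGET:
        reply = ask_model(prompt)            # placeholder: returns {"text", "code", "answer", "final"}
        tokens_used += count_tokens(reply["text"])
        output = run_python(reply["code"])   # placeholder: runs the verification code in a sandbox
        if reply["final"]:
            return reply["answer"]           # model has committed to a final answer
        # Otherwise feed the program's output (or errors) back for another pass.
        prompt += f"\n{reply['text']}\nExecution output:\n{output}"
    return None  # token budget exhausted without a final answer
```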
Well, how did ChatGPT and its siblings do? None of the commercial generative AI systems, including ChatGPT 4 and Gemini 1.5, scored better than two per cent. Interestingly, the AI systems tested did not take much advantage of the ability to iterate through their solutions. Most used only one or two iterations before presenting a final answer, and few reached or exceeded the 10,000 token limit.
Epoch AI seemed surprised at the low scores and repeated the test five times, focusing on the answers that had been correct in the first round. The repeated runs did not always yield the same correct answers, suggesting that in those cases the AI systems had essentially been making “lucky guesses.” Nor was there any correlation between initial scores and consistency—Gemini 1.5, which had the highest initial score at close to two per cent, provided wildly different answers in subsequent tests, while a preview version of ChatGPT o1, whose initial score was closer to one per cent, was the most consistent in getting the same questions right on subsequent attempts.
The above results, although not ironclad proof, are to me strongly suggestive that generative AI does not have reasoning ability or anything close to human-level intelligence. When prompted, it continues to just regurgitate data points from its training and dress up the response in natural language, thus presenting the illusion of conversational ability but nothing more.
Epoch AI asked leading mathematicians including Fields medalists Terence Tao, Timothy Gowers and Richard Borcherds to weigh in on the benchmark. These experts agreed that FrontierMath did indeed pose extremely challenging problems with very little publicly available data, therefore requiring deep mathematical knowledge to solve. They expect that the problems will be resistant to AI for several years at least.
Further, they note that the role AI will play in mathematics will be as a tool or assistant, but not as an autonomous professional mathematician. They envision handing off to a future, FrontierMath-capable AI system “slightly boring bits of research” or “technically demanding calculations” that would form components of a larger proof the human mathematician is working on. They emphasize that AI systems are currently limited by the lack of publicly available training data for FrontierMath problems, which I take to mean that they do not expect AI systems to eventually solve problems via reasoning—rather, it will be a matter of higher levels of brute-force memorization.
A thumb on the scale?
OpenAI, clearly, was not satisfied with the sub-two-per-cent scores achieved by ChatGPT 4 and ChatGPT o1. So, when its latest LLM, ChatGPT o3, was released, the company re-ran the FrontierMath tests and reported a score of 25 per cent—a large improvement, though still not hugely impressive. Nonetheless, the question arose as to how OpenAI achieved such an improvement in such a short period of time.
Here's where you might excuse me for wanting to put on a tin-foil hat. Epoch AI quietly re-issued its paper in late December, with one line in a footer on the first page acknowledging OpenAI’s support for the FrontierMath project. Not only did OpenAI provide financial support, but it turns out that, unbeknownst to the mathematicians designing the tests, the company had access to the FrontierMath dataset. Epoch AI quickly pointed out that OpenAI didn’t get all the test questions; there was still a ‘holdout’ set of questions known only to Epoch AI’s team. Both organizations also insist that OpenAI did not train ChatGPT o3 on the data it had access to, although Epoch AI has launched an investigation.
What do we make of this? Given ChatGPT 4’s dismal results in the first round, I’d say it’s probable that OpenAI had not trained it on the test data. But I can’t help being suspicious that the company might have tipped the scales a little for o3—I am very eager to see the results of the investigation and will continue to follow this story.
Waiting for Olivaw
We’ve seen that passing the Turing Test is at best an exercise in successful mimicry, and it probably doesn’t provide a meaningful measurement of intelligent behaviour. The FrontierMath story further tells us that generative AI systems can’t reason their way through unfamiliar problems the way humans can, and have shown no ability to learn how. Matching patterns of tokens, even at large scale, can look impressive, but I’m not convinced that it is the right technical architecture to deliver true human-level intelligence. The term AI seems to me to be half-right. Artificial, yes; intelligent, no.
So, let’s drop AI and all its attendant anthropomorphism from our vocabulary. There are many other forms of technology, like machine learning, neural networks and deep learning systems, that can be useful now. Generative LLMs will eventually find their place but will have to overcome quite a few obstacles before they become seriously useful and trustworthy.
Will we ever meet R. Daneel Olivaw? I think Isaac Asimov was on to something, and his estimate of almost three millennia sounds pretty much right to me.
[1] Watson’s opponent was the show’s highest-earning human winner at the time, Ken Jennings. He is now the host of Jeopardy!