The Turing test
 
 
 



This month, we’re celebrating the 75th anniversary of a landmark paper by the mathematician Alan Turing. Published in October 1950, his famous article, ‘Computing Machinery and Intelligence’, proposed an “imitation game” that became known as the ‘Turing test’.

In it, an examiner converses freely via a form of chat (at the time, a teleprinter was envisaged) with a computer and a human being. If the examiner cannot distinguish which of the two is the computer and which is the human being, then it can be concluded, Turing argued, that the computer “thinks” or at least is capable of perfectly imitating human thought and is therefore as intelligent as a human being. He said:
“I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10⁹ [binary digits, about a gigabit] (!), to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning...”
There are many questions that can be raised about Turing’s test and his predictions about digital intelligence.

The main question for me, however, is: even if the computer’s responses are very similar to what one would expect from a human, does that actually have any significance? The test explicitly makes no attempt to verify that actual intelligence has been used in response to the questions. Indeed, it cannot. It can only test for what may simply be a matter of appearance, rather than actual reasoning ability.

When Alan Turing first put forward the test, his idea was obviously that the computer would be programmed with numerous rules from which it could make decisions, having first been given the relevant background information.

As late as the 1990s, computers could not actually learn from experience. Indeed, the ultimate triumph at that time was the ability of a computer, ‘Deep Blue’, to play chess to a high level. This ability was built on heuristic rules of a kind Alan Turing had sketched many years previously.

Obviously, the number of moves available at the start of a game is limited. If, though, you try to explore all the possible continuations after that, the number of permutations soon becomes completely unmanageable. Experienced players have learned which possibilities are worth exploring and which to ignore.

For this reason, Alan Turing had proposed 10 rules of thumb to give the computer a series of guidelines: for example, if you can capture an opponent's piece without putting yourself in peril, then go ahead and do it. In the light of experience, researchers added further heuristics and, with the increase in computing power available, a computer finally reached grandmaster strength.
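For the technically curious, here is a minimal sketch, in Python, of both halves of that idea. It assumes an average of roughly 35 legal moves per chess position (a commonly quoted figure) to show how quickly exhaustive look-ahead becomes unmanageable, and then orders a few invented candidate moves using a toy ‘safe capture’ rule of thumb of the sort just described. It is an illustration only, not a reconstruction of Turing’s rules or of Deep Blue’s program.

    from dataclasses import dataclass

    # Roughly 35 legal moves are available in a typical chess position
    # (a commonly quoted average branching factor).
    BRANCHING_FACTOR = 35

    def positions_to_examine(depth: int) -> int:
        """Positions a brute-force search would visit if it explored
        every move to the given depth (in half-moves)."""
        return BRANCHING_FACTOR ** depth

    for depth in (2, 4, 6, 8):
        print(f"{depth} half-moves ahead: about {positions_to_examine(depth):,} positions")
    # Eight half-moves ahead is already over two million million positions.

    # A toy version of the kind of rule of thumb described above:
    # prefer moves that capture material without leaving the capturing
    # piece in danger, so those branches are explored first.
    @dataclass
    class Move:
        captures_value: int          # value of any piece taken (0 if none)
        leaves_piece_attacked: bool  # would our piece be in danger afterwards?

    def heuristic_score(move: Move) -> int:
        if move.captures_value and not move.leaves_piece_attacked:
            return move.captures_value   # safe capture: look at it early
        if move.leaves_piece_attacked:
            return -1                    # risky move: look at it last
        return 0

    candidate_moves = [Move(0, False), Move(3, True), Move(5, False)]
    for move in sorted(candidate_moves, key=heuristic_score, reverse=True):
        print(move)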

But it was only a traditional computer and so without the flexibility of a brain. And it wasn’t the computer which had learned from experience, but the researchers.

Now, we have computers capable of discerning patterns, even ones that we ourselves cannot see. They can produce predictions from those patterns, such as how protein molecules will fold and so how they will react with other chemicals. But they still do not mimic human beings, with their error-prone reasoning, their emotions and their general intelligence.

Turing could have had no concept of Large Language Models (LLMs). But now we have LLMs, machines which seem to use human reasoning and so appear to be human substitutes.

As we know, they actually scoop up masses of text from the internet and, on a probabilistic basis, give it back to us as answers to the questions we put to them. An LLM makes no overt pretence of understanding what it is doing. But it can nonetheless be convincing, very convincing.
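Again for the technically curious, the sketch below shows what “on a probabilistic basis” amounts to: a model assigns a probability to each word (token) that might come next, and an answer is built by repeatedly drawing from those probabilities. The tiny hand-written probability table merely stands in for a real model’s billions of learned parameters; the words and numbers are invented purely for illustration.

    import random

    # A toy "model": for each current word, a hand-made probability
    # distribution over possible next words. A real LLM computes such
    # probabilities with a neural network over its whole context.
    NEXT_WORD_PROBS = {
        "<start>":  {"The": 0.6, "A": 0.4},
        "The":      {"computer": 0.5, "answer": 0.5},
        "A":        {"computer": 0.7, "machine": 0.3},
        "computer": {"thinks": 0.3, "responds": 0.7},
        "machine":  {"responds": 1.0},
        "answer":   {"appears": 1.0},
    }

    def sample_next(word: str) -> str:
        """Draw the next word at random, weighted by its probability."""
        options = NEXT_WORD_PROBS.get(word)
        if not options:
            return "<end>"
        words, weights = list(options), list(options.values())
        return random.choices(words, weights=weights, k=1)[0]

    def generate(max_words: int = 6) -> str:
        word, output = "<start>", []
        for _ in range(max_words):
            word = sample_next(word)
            if word == "<end>":
                break
            output.append(word)
        return " ".join(output)

    print(generate())  # e.g. "The computer responds": plausible, but no understanding involved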

A recently published paper, however, looks at their reliability, not just in the sense of factual accuracy but also of bias. The researchers found that deep research agents and the search engines powered by them frequently make unsupported and biased claims, ones that aren’t backed up by the sources they cite.

That’s according to an analysis which found that about one-third of answers provided by the AI tools aren’t backed up by reliable sources.

The AI engines were given 303 queries to answer, and their responses were assessed against eight different metrics, designed to test whether an answer was one-sided or overconfident, how relevant it was to the question, what sources, if any, it cited, and how much support those citations actually gave for the claims made in the answers.

The AI-powered search engines performed poorly. Many models provided one-sided answers. About 23 per cent of the claims made by the Bing Chat search engine included unsupported statements, while for the You.com and Perplexity AI search engines the figure was about 31 per cent. OpenAI’s GPT-4.5 produced even more unsupported claims: 47 per cent.

The AI answers were, however, themselves evaluated by an LLM. So then, AI marking its own homework? Sort of. The LLM used was in fact one claimed to have been trained to judge answers of that sort, the training being based on a comparison with how two human annotators assessed answers to more than 100 questions similar to those used in the study. A source, then, of potentially even more bias, which leaves us with a piece of research that is not the strongest of studies.

But then, we humans are filled to the brim with biases and fail to provide sources to justify our assertions.
 
Many people find the answers from LLMs to be convincing and indistinguishable from answers produced by actual human beings. They even ascribe personality and self-awareness to them, sometimes accepting their advice even where it suggests self-harm or suicide as the answer to life’s problems.

But in that sense an LLM reflects us and our abilities very well indeed and so passes the Turing test with flying colours.

If, though, a computer process which has no actual intelligence or reasoning ability can convince us that it is a person, then that in turn tells us that the Turing test is not of any obvious value. It is merely a test of the appearance of rationality or of human personality. It does not tell us that the computer “is therefore as intelligent as a human being”, as Turing claimed.

Indeed, I would suggest that the very ability of LLMs to produce answers instantly should tell us that they are not rational beings. After all, except in the simplest of cases, we usually have to pause to allow ourselves to reflect on what we are saying.

Which in turn sheds an interesting light on a much-debated question: free will, a concept which I consider to be an illusion.

As explanations of how we make decisions, we have determinism and/or randomness. The randomness would necessarily be at the quantum level, but would be able to trigger new thoughts. Of course, the new thoughts are then subject to our reasoning processes (and biases).

We are, though, convinced that there is somehow a third way. This is, I now suspect, partly down to our learned ability to pause in order to gather extra information and reflect on what we are saying, using our reasoning for quality control.

The very ability to pause gives the impression of choice outside the deterministic chain.

Typically, then, the expression of our opinions is very different from the smoothly flowing, and so very obviously completely determined, unthinking and unchecked output of an LLM.

This is also why we should be wary of populist speakers, those capable of talking for hours on end without pause. They are utterly convinced of their opinions, opinions so deeply rooted and devoid of any rational justification that they could be the product of an LLM.

Who might I be referring to?

Paul Buckingham

5th October 2025



