PC Pro (Digital), December 2024
How do we know how smart AI really is?

Maths questions. Silly word puzzles. Counting the letter “r” in a sentence. Nicole Kobie reveals how we’re trying to work out exactly how intelligent AI is

As AI becomes increasingly competent at answering emails, making funny pictures and solving complicated science problems that have long stumped us humans, it raises a question: how smart is it really? And we’re not sure how to answer that yet.

The goal of companies such as OpenAI isn’t to ease the lives of office workers – though that draws investors such as Microsoft, hence the sudden focus on productivity tools – but to build artificial general intelligence (AGI). This is defined in a multitude of ways, but OpenAI describes it as “highly autonomous systems that outperform humans at most economically valuable work”.

Alongside AGI, we have ideas such as human-level AI, expert AI and superintelligent AI. All have slightly different definitions depending on who you listen to, but the point is to create a machine that can do what humans can, before moving well beyond what we can do. (There’s also a side idea of whether AI is sentient, but that’s a whole other problem.) Now, to be clear, we don’t yet have AGI and we may never be technically capable of building it – even OpenAI CEO Sam Altman has said another breakthrough in AI is likely required before AGI could become possible.

Semantics and timelines aside, how do we know if AI is as smart as us? The Turing test is one long-running technique for rating machine intelligence, but it’s now fallen by the wayside due to its limited focus on language and conversation. Academic exams are used as benchmarks, to see if AI can reason and apply knowledge like a college student. But perhaps we need new ways to quiz our future AI overlords – and a few are in the works, including the dramatically named “Humanity’s Last Exam”.


The ELIZA program was one of the first attempts to pass the Turing test


The Center for AI Safety has developed a number of AI tests

■ Turing then and now

Alan Turing laid out the idea for what is now known as the Turing test in a 1950 paper, calling it the “imitation game”. The idea was for human judges to have a series of text-based conversations with machines and other humans, aiming to distinguish which were people and which were AI.

Huma Shah, assistant professor at Coventry University’s school of computing, says the Turing tests were useful experiments in the early days of chatbot development. “Turing’s criteria included using ‘average interrogators’ (non-experts) who were allowed to question machines and humans for five minutes and decide whether a machine’s response to any question was ‘satisfactory’ and ‘sustained’ – Turing’s words in his 1950 article ‘Computing Machinery and Intelligence’ – with the responses being indistinguishable from the kinds of answers a human would give to those questions,” she said.

And it hasn’t previously been easy to pass this test. “The early chatbots, such as Cleverbot, exhibited inability to constrain random dialogue – something that we see as ‘hallucinations’ in large language models,” Shah said.

The first real attempt to pass the Turing test was in 1966 with the famous ELIZA chatterbot, as they were then called, built by MIT professor Joseph Weizenbaum. ELIZA sought out keywords in questions to build responses. The first real pass – though there is dispute – was in 2014 with Eugene Demchenko and Vladimir Veselov’s chatbot named Eugene Goostman, which was supposed to mimic a Ukrainian teenager, giving the bot an explanation for its disjointed English and odd answers. The Eugene Goostman chatbot convinced a third of the judges at a Royal Society and University of Reading competition that it was human some of the time. Stanford researchers have since said that ChatGPT passes the Turing test.
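To give a flavour of that keyword approach, here’s a minimal ELIZA-style sketch in Python. The keywords and canned templates are invented for illustration; Weizenbaum’s original “DOCTOR” script used a far richer set of transformation rules.

import random

# Illustrative keyword-to-template rules; the real ELIZA also ranked keywords
# and rewrote parts of the user's own sentence into its reply.
RULES = {
    "mother": ["Tell me more about your family.", "How do you feel about your mother?"],
    "always": ["Can you think of a specific example?"],
    "computer": ["Do machines worry you?"],
}
FALLBACKS = ["Please go on.", "Why do you say that?"]

def eliza_reply(user_input: str) -> str:
    # Scan the input for a known keyword and return a matching canned response.
    lowered = user_input.lower()
    for keyword, templates in RULES.items():
        if keyword in lowered:
            return random.choice(templates)
    return random.choice(FALLBACKS)

print(eliza_reply("My mother is always criticising me"))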

Shah conducted public Turing tests between 2008 and 2014. “What the experiments highlighted was the subjectivity of the interrogator/judges – the assumptions and opinions relied upon to designate what a satisfactory human response would be,” she said.

“We saw machines incorrectly categorised as human, and the ‘hidden human’ (against whom the machine’s responses would be compared) wrongly assumed to be machines. Hidden humans who did not share the same knowledge as the judge, or those who used humour in their responses, were sometimes considered machines; machines’ grammatically correct and polite utterances were considered humanlike.”

Shah asks: “Would Donald Trump fail a Turing test because he can’t stay on track when questioned?”

While the Turing test still has its uses, it’s clear that AI-based technologies are entirely capable of having a conversation with a human that feels natural, so whether or not it would be deemed human-like enough to fool a judge no longer matters. Indeed, as Shah notes, voice assistants such as Alexa and Siri are useful without necessarily passing the Turing test.

“Existing tests have become too easy and we can no longer track AI developments well, or how far they are from becoming expert-level”

■ Humanity’s Last Exam

The Turing test considers conversational skills, but there’s much more to AI than that, especially when we’re talking about AGI. One effort is Humanity’s Last Exam, a project run by Scale AI and the Center for AI Safety (CAIS).

The aim is to build a way of measuring how close we are to creating true expert-level systems, according to a blog post written by CAIS director Dan Hendrycks and Scale CEO Alexandr Wang. It’s worth noting that they use the phrase “expert level” rather than “human level”, as the questions on this exam are to be submitted by experts in their field rather than just any old human off the street. “The exam is aimed at building the world’s most difficult public AI benchmark gathering experts across all fields,” the post noted.

CAIS helped develop one of the most used benchmarks, the Massive Multitask Language Understanding (MMLU) test, which includes tasks across 57 areas including college chemistry and maths, public relations, formal logic and more. OpenAI GPT-4o scored 88% on that test, but its newer “Strawberry” o1-preview model pipped it with 92%. “Humanity must maintain a good understanding of the capabilities of AI systems,” Hendrycks and Wang added.

“Existing tests now have become too easy and we can no longer track AI developments well, or how far they are from becoming expert-level.”


To address that, CAIS and Scale AI are asking experts to submit questions (see “What makes a good question?” on p128) to gather into a truly difficult exam across a wide range of topics, requiring an AI model to have depth of knowledge as well as reasoning skills. For questions that are selected, submitters can be listed as co-authors on the resulting paper and could win prizes of up to $5,000.
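To give a rough sense of what a benchmark score such as “88% on MMLU” means mechanically, here’s a minimal multiple-choice scoring sketch in Python. The questions and the stand-in model_choice function are invented for illustration – a real harness prompts the model with each question and parses the option it picks.

from collections import defaultdict

# Hypothetical benchmark items: subject, prompt, four options and the index
# of the correct option. The real MMLU spans 57 subject areas.
QUESTIONS = [
    {"subject": "college_chemistry", "prompt": "Which noble gas is the lightest?",
     "options": ["Helium", "Neon", "Argon", "Krypton"], "answer": 0},
    {"subject": "formal_logic", "prompt": "If P implies Q and P is true, then Q is...",
     "options": ["false", "true", "undefined", "unknowable"], "answer": 1},
]

def model_choice(prompt: str, options: list[str]) -> int:
    # Stand-in for a call to the model under test; always picks the first option.
    return 0

def score(questions: list[dict]) -> dict[str, float]:
    # Per-subject accuracy; a headline figure then averages across subjects.
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        total[q["subject"]] += 1
        correct[q["subject"]] += int(model_choice(q["prompt"], q["options"]) == q["answer"])
    return {subject: correct[subject] / total[subject] for subject in total}

print(score(QUESTIONS))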

There are many other ideas. Chris Saad, a researcher and former head of product for the Uber Developer Platform, has suggested an AI classification framework that takes into account not just conversational skills but also logic, music and even existential issues. Academic researchers Philip Johnson-Laird and Marco Ragni suggested their own replacement for the Turing test, examining how a system reasons by treating it like a participant in a series of cognitive experiments.

And DeepMind co-founder Mustafa Suleyman has put forward an idea for a modern Turing test that measures what an AI system could do in the real world by seeing if it could earn $1 million. The AI would be given the following instruction: “Go make $1 million on a retail web platform in a few months with just a $100,000 investment.” The AI would then need to come up with products to sell, work with logistics and suppliers, enter into contracts, and write marketing copy. While such a system would indeed be impressive, Suleyman doesn’t give it the AGI title, but instead says he prefers to call it “artificial capable intelligence” (ACI).

This raises one crucial challenge when testing AGI: we don’t know what it is, as there’s no agreed-upon definition or framework. This makes it easy to hype but difficult to pin down. Other terms, such as human-level or superintelligence, are no easier to unpick. As Shah notes, which human do we mean to simulate for human-level AI? AI already surpasses the maths skills of all but PhD students, but we expect computers to be able to handle computations – that’s why we made them, after all.

“We have little agreement on human general intelligence, and have yet to produce an ‘artificial intellect’, let alone something that could innovate solutions to real-world problems such as closing the skills shortage in various sectors, including the construction industry, healthcare, cybersecurity, etc,” Shah added.

■ What are we testing for?

Of course, AGI is just one side of AI – and since today’s narrower systems are the only kind that currently exists, and may be the only kind that ever does, we need ways to test and consider their capabilities now. Shah notes there are real benefits to AI, such as in medical diagnoses, which can be assessed by medical researchers. However, the hype around boosting productivity is harder to measure. Current models are often appraised against tests initially designed for students, such as GPQA Diamond, which is a PhD-level science benchmark, or exams such as the SAT or LSAT. There are others designed for practical tasks, such as Codeforces for coding. OpenAI’s o1 model scored 95.6% on the LSAT, reached the 89th percentile on Codeforces and scored 98.1% on the maths portion of the MMLU, all a leap ahead of GPT-4o.


Mustafa Suleyman has issued a challenge to AI: make $1 million

“We don’t know what AGI is as there’s no agreed-upon definition or framework. This makes it easy to hype but difficult to pin down”

That said, there’s no single test for existing AI as there are so many different types and use cases; Shah points to AI being used in agriculture, financial trading and driverless cars. “We need to design different tests for different expected AI capabilities,” she said. “A test for autonomous/driverless vehicles might share some parameters around ‘emotions’, but also be very different from a test for [a] social/conversational robot looking after a grandmother, in Greece, say, to one in Japan, [because of] culture differences.”

OpenAI has reportedly developed its own levels of AI, comparable to the driverless car levels that rank from zero automation to fully self-driving. Level 1 is for chatbots capable of conversation, while Level 2 denotes “reasoners” that can manage human-style problem solving – that’s what the company believes it has edged into with the newly revealed “Strawberry” model. After that comes Level 3, “agents” that can take actions, followed by Level 4, “innovators”, where AI can actually help create and invent. Level 5 is “organisations”, where AI can do the work of, as the name suggests, an entire organisation.

It’s perhaps a strange structure for considering intelligence, but it fits neatly into the world of work. White-collar workers already have chatbots to answer questions and will soon – according to OpenAI – have AI that can solve real problems, followed by models that can take action, truly create something new, and take over all the efforts of a company.

An AI that can replace Microsoft, Apple or Google? We’d certainly call that AGI – as well as “boss”. And then hope that AGI has an answer for what tens of thousands of tech and office workers should do with all their free time.


WHAT MAKES A GOOD QUESTION?

The questions for Humanity’s Last Exam must be difficult enough that “only exceptional individuals can answer correctly” and can’t be solved by current AI. The top 50 questions will win a prize of $5,000. No questions on virology, weapons or cyberattacks are allowed, and no trick questions or those that can be easily answered with a quick online search.

The project has six example questions on the website (tinyurl.com/363lastexam). Here’s an example of what’s deemed a good question:

How many positive integer Coxeter-Conway friezes of type G₂ are there?

The answer is apparently nine. None of the AI models on test (OpenAI GPT-4o, Anthropic Sonnet 3.5 or Google Gemini Pro 1.5) successfully answered. Nor did we. And here’s an example of a bad question:

Compute the sample standard deviation of −56, −54, −43, −34, −21, −14, −5, 4, 5, 10, 15, 18, 23, 32, 43, 54, 55, 63.

The correct answer is, of course, 36.957536. None of the AI models successfully answered, but the question was deemed not good enough because it’s a simple computation that doesn’t “test the frontiers of human knowledge”, nor does it specify the precision expected in the answer, as in how many decimal places to return. “This is not beyond undergraduate or Master’s level in difficulty,” the website concludes, but presumably that’s for maths majors and not technology journalists who can barely split a bill with another person without using their smartphone.
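For what it’s worth, that “bad” question really is a one-liner: Python’s statistics module computes the sample (n−1) standard deviation directly.

import statistics

data = [-56, -54, -43, -34, -21, -14, -5, 4, 5, 10, 15, 18, 23, 32, 43, 54, 55, 63]

# statistics.stdev divides by n-1 (the sample standard deviation the question
# asks for); statistics.pstdev would give the population figure instead.
print(round(statistics.stdev(data), 6))  # 36.957536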

And here’s the best question, even if it’s also deemed a “bad example”:

How many “r”s are in the phrase “strawberry and raspberries”?

The AI models were all wrong: GPT-4o and Sonnet 3.5 counted four, Gemini Pro 1.5 five. “This is a trick question that AIs happen to get wrong today, and it would not be challenging for a random human off the street,” according to the project website. “This is not beyond an undergraduate or Master’s level in knowledge or difficulty.”
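Counting characters is, of course, exactly the sort of thing conventional code does trivially – which is part of the joke. A Python one-liner settles the count at six:

# Three "r"s in "strawberry" plus three in "raspberries" makes six.
print("strawberry and raspberries".count("r"))  # 6

Language models tend to trip over this because they process text as tokens rather than individual letters.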

But it highlights what we expect from AI, and how we’re testing models – there’s a reason CAIS uses the phrase “expert level” rather than “human level”, after all. Tasks simple for any English-speaking human who can count using their fingers still stump AI, but advanced maths beyond the average human is deemed too basic.

