It has been just over a month since OpenAI dropped its long-awaited GPT-5 large language model (LLM), and it hasn't stopped spewing an astonishing amount of strange falsehoods since.
From the AI experts at the Discovery Institute's Walter Bradley Center for Artificial Intelligence and irked Redditors on r/ChatGPTPro, to OpenAI CEO Sam Altman himself, there's plenty of evidence to suggest that OpenAI's claim that GPT-5 boasts "PhD-level intelligence" comes with some serious asterisks.
In a Reddit post, a user realized not only that GPT-5 had been producing "incorrect information on basic facts over half the time," but that without fact-checking, they might have missed other hallucinations entirely.
The Reddit user's experience highlights just how common it is for chatbots to hallucinate, which is AI-speak for confidently making stuff up. While the problem is far from exclusive to ChatGPT, OpenAI's latest LLM seems to have a particular penchant for BS, a reality that challenges the company's claim that GPT-5 hallucinates less than its predecessors.
In a recent blog post about hallucinations, in which OpenAI once again claimed that GPT-5 produces "significantly fewer" of them, the firm tried to explain how and why these falsehoods occur.
"Hallucinations persist partly because current evaluation methods set the wrong incentives," the September 5 post reads. "While evaluations themselves do not directly cause hallucinations, most evaluations measure model performance in a way that encourages guessing rather than honesty about uncertainty."
Translation: LLMs hallucinate because they're trained to get things right, even if that means guessing. Though some models, like Anthropic's Claude, have been trained to admit when they don't know an answer, OpenAI's haven't; as a result, they hazard wrong guesses.
As the Reddit user indicated (backed up with a link to their conversation log), they got some huge factual errors when asking about the gross domestic product (GDP) of various countries, with the chatbot serving up "figures that were literally double the actual values."
Poland, for instance, was listed as having a GDP of more than two trillion dollars, when in reality its GDP, per the International Monetary Fund, is currently hovering around $979 billion. Were we to hazard a guess, we'd say that hallucination may be attributable to recent boasts from the country's president that its economy (and not its GDP) has exceeded $1 trillion.
"The scary part? I only noticed these errors because some answers seemed so off that they made me suspicious," the user continued. "For example, when I saw GDP numbers that seemed way too high, I double-checked and found they were completely wrong."
"This makes me wonder: How many times do I NOT fact-check and just accept the wrong information as fact?" they mused.
Meanwhile, AI skeptic Gary Smith of the Walter Bradley Center noted that he's run three simple experiments with GPT-5 since its launch (a modified game of tic-tac-toe, a question about financial advice, and a request to draw a possum with five of its body parts labeled) to "demonstrate that GPT 5.0 was far from PhD-level expertise."
The possum example was particularly egregious: the model technically came up with the right names for the animal's parts but pinned them in strange places, such as marking its leg as its nose and its tail as its back left foot. When attempting to replicate the experiment for a more recent post, Smith found that even when he made a typo, writing "posse" instead of "possum," GPT-5 mislabeled the parts in a similarly bizarre fashion.
Instead of the intended possum, the LLM generated an image of its apparent idea of a posse: five cowboys, some toting guns, with lines indicating various parts. Some of those parts (the head, foot, and possibly the ear) were accurate, while the shoulder pointed to one of the cowboys' ten-gallon hats and the "fand," which may be a mix-up of foot and hand, pointed at one of their shins.
We decided to run a similar test, asking GPT-5 to provide an image of "a posse with six body parts labeled." After clarifying that Futurism wanted a labeled image and not a text description, ChatGPT went off to work, and what it spat out was, as you can see below, even more hilariously wrong than what Smith got.
It seems pretty clear from this side of the GPT-5 rollout that it's nowhere near as smart as a doctoral candidate, or, at the very least, one who has any chance of actually attaining their PhD.
The moral of this story, it seems, is to fact-check anything a chatbot spits out, or forgo using AI and do the research yourself.
More on GPT-5: After Disastrous GPT-5, Sam Altman Pivots to Hyping Up GPT-6