[long] Some tests of how much AI "understands" what it says (spoiler: very little)

diz@awful.systems · 9 months ago


Anamnesis@lemmy.world · 9 months ago

I think a lot of these issues stem from LLMs not having actual symbolic reasoning. They only do association. We humans do a lot of thinking via association, but some tasks require symbol manipulation in discrete stages. The clearest case of this is giving ChatGPT predicate calculus problems. It can’t solve them reliably at all, and if you ask it to explain its reasoning step by step, it will shit its pants and hallucinate answers. IMO once we figure out symbolic representation, we’ll have real AI.

scruiser@awful.systems · edit-2 9 months ago

Careful, if you present the problem and solution that way, AI tech bros will try pasting an LLM and a Computer Algebra System (which already exist) together, invent a fancy buzzword for it, act like they invented something fundamentally new, devise some benchmarks that break typical LLMs but that their Frankenstein kludge can ace, and then sell the hype (actual consumer applications are luckily not required in this cycle, but they might try some anyway).

I think there is some promise to the idea of an architecture similar to an LLM with components able to handle math like a CAS. It won’t fix a lot of LLM issues, but maybe some fundamental issues (like the ability to count or the ability to hold an internal state) will improve. And (as opposed to an actually innovative architecture) simply pasting LLM output into CAS input and then the CAS output back into LLM input (which, let’s be honest, is the first thing tech bros will try as it doesn’t require much basic research improvement) will not help that much and will likely generate an entirely new breed of hilarious errors and bullshit (I like the term bullshit instead of hallucination; it captures the connotation that the errors are of a kind with the normal output).

diz@awful.systems · edit-2 9 months ago

I think you can make a slight improvement to Wolfram Alpha: using an LLM to translate natural language queries into queries WA can consume, then feeding them into WA. WA always reports exactly what it computed, so if it “misunderstands” you, it’s a lot easier to notice.

The problem here is that AI boys got themselves hyped up for it being actually intelligent, so none of them would ever settle for some modest application of LLMs. Google fired authors of the “stochastic parrots” paper, AFAIK.

> simply pasting LLM output into CAS input and then the CAS output back into LLM input (which, let’s be honest, is the first thing tech bros will try as it doesn’t require much basic research improvement), will not help that much and will likely generate an entirely new breed of hilarious errors and bullshit (I like the term bullshit instead of hallucination, it captures the connotation errors are of a kind with the normal output).

Yeah, I have examples of that as well. I asked GPT4 at work to calculate the volume of a 10cm long, 0.1mm diameter wire. It seems to do the arithmetic correctly by some mysterious means that do not use scientific notation, and then, because the LLM cannot actually count, it miscounts the zeroes and outputs a result that is 1000x larger than the correct answer.
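For reference, the wire calculation itself is short. This is only a sketch of the arithmetic, assuming the wire is a plain cylinder; the function name `cylinder_volume_mm3` is made up for illustration and is not something from the original exchange. Working in a single unit (mm) avoids the zero-counting that trips up the LLM.

```python
import math


def cylinder_volume_mm3(length_mm: float, diameter_mm: float) -> float:
    """Volume of a cylinder in cubic millimetres (hypothetical helper)."""
    radius_mm = diameter_mm / 2
    return math.pi * radius_mm ** 2 * length_mm


# 10 cm long = 100 mm, 0.1 mm diameter (so the radius is 0.05 mm)
volume_mm3 = cylinder_volume_mm3(length_mm=100.0, diameter_mm=0.1)
volume_cm3 = volume_mm3 / 1000.0  # 1 cm^3 = 1000 mm^3

print(f"{volume_mm3:.4f} mm^3")  # 0.7854 mm^3
print(f"{volume_cm3:.3e} cm^3")  # 7.854e-04 cm^3
```

So the correct answer is a bit under one cubic millimetre; a result 1000x larger would be most of a cubic centimetre, which is obviously wrong for a hair-thin wire.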

diz@awful.systems · 9 months ago

Well, the problem is that it doesn’t have any reasoning, period.

It’s not clear what symbolic reasoning would entail, but puzzles generally require you to think through several approaches to solve them, too. That requires a world model, a search, etc., the kind of stuff that actual AIs, even a tic-tac-toe AI, have, but LLMs don’t.
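To make "a world model and a search" concrete, here is a minimal sketch of a tic-tac-toe AI, assuming plain minimax over a 9-character board string. The names `winner`, `minimax`, and `best_move` are illustrative choices, not taken from any particular program. The board string is the world model; the recursion is the search.

```python
from functools import lru_cache

# All eight winning lines on a board indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board: str):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board: str, player: str) -> int:
    """Value of the position for X (+1 win, 0 draw, -1 loss), `player` to move."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0  # board full, nobody won
    other = "O" if player == "X" else "X"
    values = [minimax(board[:i] + player + board[i + 1:], other)
              for i, cell in enumerate(board) if cell == " "]
    return max(values) if player == "X" else min(values)


def best_move(board: str, player: str) -> int:
    """Index of an optimal move for `player` (X maximizes, O minimizes)."""
    other = "O" if player == "X" else "X"
    moves = [i for i, cell in enumerate(board) if cell == " "]

    def score(i: int) -> int:
        return minimax(board[:i] + player + board[i + 1:], other)

    return max(moves, key=score) if player == "X" else min(moves, key=score)


print(minimax(" " * 9, "X"))        # 0: perfect play from an empty board is a draw
print(best_move("XX OO    ", "X"))  # 2: X completes the top row
```

Nothing here is learned or statistical: the program explicitly tries each move, looks at the resulting state, and backs up the outcome.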

On top of that, this all works through machine learning, which produces the resulting network weights through very gradual improvement at next-word prediction, tiny step by tiny step. Even if some sort of discrete model (like, say, an account of what’s on either side of the river) could help it predict the next token, there isn’t a tiny fraction of a discrete “model” that would help it, and so it simply does not go down that path at all.
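For comparison, an explicit "account of what’s on either side of the river" takes only a few lines. The sketch below assumes the classic farmer/wolf/goat/cabbage variant (the thread does not say which variant is meant) and uses a breadth-first search; the state encoding and the names `is_safe`, `successors`, and `solve` are illustrative assumptions.

```python
from collections import deque

# State: which bank each of (farmer, wolf, goat, cabbage) is on; 0 = start, 1 = far.
ITEMS = ("farmer", "wolf", "goat", "cabbage")
START = (0, 0, 0, 0)
GOAL = (1, 1, 1, 1)


def is_safe(state) -> bool:
    """The goat must not be left with the wolf or the cabbage without the farmer."""
    farmer, wolf, goat, cabbage = state
    if goat != farmer and (wolf == goat or cabbage == goat):
        return False
    return True


def successors(state):
    """Yield (cargo, next_state) for every legal crossing from `state`."""
    farmer = state[0]
    for i in range(4):  # i == 0 means the farmer crosses alone
        if state[i] != farmer:
            continue  # the farmer can only take something from his own bank
        nxt = list(state)
        nxt[0] = 1 - farmer
        if i != 0:
            nxt[i] = 1 - farmer
        nxt = tuple(nxt)
        if is_safe(nxt):
            yield (ITEMS[i] if i != 0 else "nothing"), nxt


def solve():
    """Breadth-first search; returns the cargo for each crossing, shortest first."""
    queue = deque([(START, [])])
    seen = {START}
    while queue:
        state, path = queue.popleft()
        if state == GOAL:
            return path
        for cargo, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [cargo]))
    return None


print(solve())
# ['goat', 'nothing', 'wolf', 'goat', 'cabbage', 'nothing', 'goat']
```

The point of the contrast: the state tuple is all-or-nothing. Half of this model is not a slightly worse predictor, it is just broken, which is why gradual next-token optimization has no reason to stumble toward it.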


A couple simple probes:

GPT4 is uncannily good at recognizing the river crossing puzzle

An Idiot With a Petascale Cheat Sheet

Is this a “hallucination”?

But after an update, GPT-whatever is so much better at such prompts.

The need for an Absolute Imbecile Level Reasoning Benchmark

Randomness in bullshitting