[IDEA] Scaling inference-time with complexity

Smorty [she/her] · 7 months ago

hendrik@palaver.p3x.de · edit-2 7 months ago

As I said in a comment on one of your previous posts, you might want to read the papers on “chain of thought” prompting. This has already been studied, and you’ll find more ideas there, along with estimates of what it can do. It is a good approach to making LLMs a bit smarter, and it was recently popularized by OpenAI.

Smorty [she/her] · 7 months ago

I actually tried, but the papers are written in a really technical way. They talk a lot about complex LaTeX stuff and give only very few examples…

laitalaj@lemm.ee · 7 months ago

How about adding a mechanism for storing the raw, embedding-dimensional vectors as part of the sequence instead of introducing a set of additional discrete “invisible” tokens? So basically something like checking the final element of each vector in the sequence before the final linear layer and, if that element is larger than, say, 0, giving the vector as-is as the output instead of passing it through the de-embedding process. Then, when generating the next token, one could just interleave the thought vectors between the embedded “real” tokens after the embedding step. This would allow the “thoughts” of the LLM to be continuous and thus more nuanced - a transformer doesn’t need the sequence to be discrete, that’s something imposed on LLMs by the nature of natural language. Could be an advantage over traditional CoT!
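To make the gating idea above a bit more concrete, here is a minimal toy sketch in plain Python. It is only an illustration under my own assumptions, not how any real model does it: the names `step_output` and `build_input`, the tiny 3-dimensional vectors, and the threshold of 0 are all made up for the example, and the "linear layer" is just a list of rows.

```python
from typing import List, Tuple, Union

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def step_output(hidden: Vector, unembed: List[Vector],
                threshold: float = 0.0) -> Tuple[str, Union[Vector, int]]:
    """Decide whether a final hidden vector is emitted as a thought or a token.

    Assumption: the last element of the hidden vector acts as the gate.
    If it exceeds `threshold`, the vector is returned unchanged as a
    continuous "thought"; otherwise it goes through the (toy) linear
    de-embedding layer and the highest-scoring token id is returned.
    """
    if hidden[-1] > threshold:
        return ("thought", list(hidden))
    logits = [dot(row, hidden) for row in unembed]
    token_id = max(range(len(logits)), key=logits.__getitem__)
    return ("token", token_id)


def build_input(items: List[Tuple[str, Union[Vector, int]]],
                embed: List[Vector]) -> List[Vector]:
    """Build the next input sequence from mixed tokens and thoughts.

    Real tokens are looked up in the embedding table; thought vectors
    are interleaved as-is, skipping the embedding step.
    """
    sequence = []
    for kind, value in items:
        if kind == "thought":
            sequence.append(list(value))
        else:
            sequence.append(list(embed[value]))
    return sequence


# Toy vocabulary of two tokens with 3-dimensional embeddings.
table = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
history = [step_output([0.2, 0.9, -1.0], table),  # gate <= 0, so a token
           step_output([0.2, 0.9, 0.5], table)]   # gate > 0, so a thought
next_input = build_input(history, table)
```

The point of the sketch is just that the sequence fed back into the transformer can mix embedded tokens and raw vectors, since both live in the same space.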

One other reason as to why something like this might beat o1’s thought document (at least for some tasks) is the way the attention mechanism works: it’s much more natural to attend to nearby tokens than to far away ones.

Training thought tokens like this is pretty simple in principle: one could construct a loss for them based on whether they increase the odds of producing the correct token next. Probably should pair that with some minimum increase threshold (below which we actually penalize for thought token generation) and an increasing penalty for outputting multiple thought tokens in a row (in addition to the hard constraint suggested in the OP). The training does pose one major challenge, though: it would need to be done autoregressively instead of pushing the whole sequence through at once, as we don’t have ground truth for these thought tokens. So this would slow things down quite a bit!
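As a rough sketch of what such a loss could look like (again just an illustration: the function name `thought_loss`, the use of log-probability gain as the measure of "increased odds", the default `min_gain` and `run_penalty` values, and the linear run penalty are all assumptions of mine, not something from the OP or a paper):

```python
import math


def thought_loss(p_with: float, p_without: float, run_length: int,
                 min_gain: float = 0.05, run_penalty: float = 0.1) -> float:
    """Toy loss for a single thought vector.

    p_with     -- probability of the correct next token when the thought
                  is in the context
    p_without  -- the same probability without the thought
    run_length -- how many thoughts in a row this one makes (1 = first)

    The gain is measured as a difference in log-probability. A gain
    above `min_gain` yields a negative loss (a reward); a gain below it
    yields a positive loss (a penalty). Each additional consecutive
    thought adds `run_penalty` on top.
    """
    gain = math.log(p_with) - math.log(p_without)
    return (min_gain - gain) + run_penalty * (run_length - 1)


helpful = thought_loss(0.8, 0.4, run_length=1)   # thought doubled the odds
useless = thought_loss(0.4, 0.4, run_length=1)   # thought changed nothing
rambling = thought_loss(0.8, 0.4, run_length=3)  # helpful, but third in a row
```

Note that computing `p_with` already requires having generated the thought, which is exactly why the training would have to run autoregressively.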

The Hobbyist@lemmy.zip · 7 months ago

What you are describing, at a high level, is what o1 does. But where you are mistaken is when you say:

This thought is not human-interpretable, but it is much more efficient than the pre-output reasoning tokens of o1, which uses human language to fill its own context window with.

What makes those reasoning tokens more efficient? They are just tokens, like all the others, and equally complex or simple to generate. Yes, they allow for more reflection before the final output is presented, but the process is the same.

Also, they would all need to fit in the same context because otherwise you will prevent the model from actually reasoning on it while it iterates its thoughts.

Smorty [she/her] · 7 months ago

I imagine that a model would be held back by the format of human readable text.

Human text relies on some concepts that are mostly unimportant to an AI, sentence syntax and grammar rules being examples. I think that letting the AI “define its own way of thinking” instead of telling it to think in human language would lead to more efficient thought processes. It would be similar to embeddings: a bunch of numbers in these tokens representing a specific topic. Not human readable, but useful for the model.

As far as I know, o1 writes a big document on what it will do and how it will do it, plus some reflection as well. My approach, however, would allow the model to think of things on the fly, while it is writing the text.

You are right in that it would have to fit into the context window. As far as I can tell, the output from the o1 model doesn’t remember what the big thought document says. With my approach, the model would keep all its thoughts in mind while it is writing, as they are literally part of its message, just unreadable by humans.

Am I missing something here? If so, please point it out.
