I feel like the article neglects one obvious possibility: that OpenAI decided chess was a benchmark worth "winning", special-cased chess within gpt-3.5-turbo-instruct, and then neglected to add that special case to follow-up models since it wasn't generating sustained press coverage.
I suspect the same thing. Rather than LLMs “learning to play chess,” they “learnt” to recognise a chess game and hand over instructions to a chess engine. If that’s the case, I don’t feel impressed at all.
Next thing, the "manager AIs" start stack ranking the specialized "worker AIs".
And the worker AIs "evolve" to meet/exceed expectations only on tasks directly contributing to the KPIs the manager AIs measure - via the mechanism of discarding those "less fit to exceed KPIs".
And some of the worker AIs that are trained on the recent/polluted internet happen to spit out prompt injection attacks that work against the manager AIs' stack-ranking metrics and dominate over "less fit" worker AIs. (Congratulations, we've evolved AI cancer!) These manager AIs start performing spectacularly badly compared to other, non-cancerous manager AIs, and die or get killed off by the VCs paying for their datacenters.
Competing manager AIs get training, perhaps on newer HN posts discussing this emergent behavior of worker AIs, and start to down-rank any exceptionally performing worker AIs. The overall trend towards mediocrity becomes inevitable.
Some greybeard writes some Perl and regexes that outcompete commercial manager AIs on pretty much every real-world task, while running on a 10-year-old laptop instead of a cluster of nuclear-powered AI datacenters all consuming a city's worth of fresh drinking water.
While deterministic components may be a left-brain default, there is no reason that such delegate services couldn't be more specialized ANN models themselves. It would most likely vastly improve performance if they were evaluated in the same memory space using tensor connectivity. In the specific case of chess, it is helpful to remember that AlphaZero utilizes ANNs as well.
That's something completely different than what the OP suggests and would be a scandal if true (i.e. gpt-3.5-turbo-instruct actually using something else behind the scenes).
Right. To me, this is the "agency" thing, that I still feel like is somewhat missing in contemporary AI, despite all the focus on "agents".
If I tell an "agent", whether human or artificial, to win at chess, it is a good decision for that agent to decide to delegate that task to a system that is good at chess. This would be obvious to a human agent, so presumably it should be obvious to an AI as well.
This isn't useful for AI researchers, I suppose, but it's more useful as a tool.
(This may all be a good thing, as giving AIs true agency seems scary.)
If this were part of the offering: “we can recognise requests and delegate them to appropriate systems,” I’d understand and be somewhat impressed, but the marketing hype leaves this out.
Most likely because they want people to think the system is better than it is for hype purposes.
I should temper how impressed I am: only if it’s doing this dynamically. Hardcoding recognition of chess moves isn’t exactly a difficult trick to pull given there’s like 3 standard formats…
What is an "expert system" to you? In AI they're just series of if-then statements to encode certain rules. What non-trivial part of an LLM reaching out to a chess AI does that describe?
The initial LLM acts as an intention detection mechanism switch.
To personify LLM way too much:
It sees that a prompt of some kind wants to play chess.
Knowing this it looks at the bag of “tools” and sees a chess tool.
It then generates a response which eventually causes a call to a chess AI (or just chess program, potentially) which does further processing.
The first LLM acts as a ton of if-then statements, but automatically generated (or brute-force discovered) through training.
You still needed discrete parts for this system. Some communication protocol, an intent detection step, a chess execution step, etc…
I don’t see how that differs from a classic expert system other than the if statement is handled by a statistical model.
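To make that concrete, here's roughly what the "statistical if statement" reduces to in code. This is a minimal sketch, not anyone's actual implementation: call_llm and chess_tool are hypothetical stand-ins, and the routing prompt is made up.

    # Hypothetical sketch of the dispatch described above: the LLM's only job is
    # to emit a tool name; everything downstream is ordinary, hand-written plumbing.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("stand-in for a real chat/completions API call")

    def chess_tool(message: str) -> str:
        raise NotImplementedError("stand-in for a hand-off to a chess engine")

    TOOLS = {"chess": chess_tool, "chat": call_llm}

    def dispatch(user_message: str) -> str:
        # The learned "if-then" step: ask the model itself to classify the intent.
        intent = call_llm(
            "Pick exactly one tool for this message: chess or chat.\n"
            f"Message: {user_message}\nTool:"
        ).strip().lower()
        # The deterministic step: look the tool up and hand the message over.
        return TOOLS.get(intent, call_llm)(user_message)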
The point of creating a service like this is for it to be useful, and if recognizing and handing off tasks to specialized agents isn't useful, I don't know what is.
If I was sold a product that can generically solve problems I’d feel a bit ripped off if I’m told after purchase that I need to build my own problem solver and way to recognise it…
If you mean “uses bin/python to run Python code it wrote” then that’s a bit different to “recognises chess moves and feeds them to Stockfish.”
If a human said they could code, you don’t expect them to somehow turn into a Python interpreter and execute it in their brain. If a human said they could play chess, I’d raise an eyebrow if they just played the moves Stockfish gave them against me.
If they came out and said it, I don’t see the problem. LLM’s aren’t the solution for a wide range of problems. They are a new tool but not everything is a nail.
I mean it already hands off a wide range of tasks to python… this would be no different.
TBH I think a good AI would have access to a Swiss army knife of tools and know how to use them. For a complicated math equation, for example, using a calculator is just smarter than doing it in your head.
Chess might not be a great example, given that most people interested in analyzing chess moves probably know that chess engines exist. But it's easy to find examples where this approach would be very helpful.
If I'm an undergrad doing a math assignment and want to check an answer, I may have no idea that symbolic algebra tools exist or how to use them. But if an all-purpose LLM gets a screenshot of a math equation and knows that its best option is to pass it along to one of those tools, that's valuable to me even if it isn't valuable to a mathematician who would have just cut out of the LLM middle-man and gone straight to the solver.
There are probably a billion examples like this. I'd imagine lots of people are clueless that software exists which can help them with some problem they have, so an LLM would be helpful for discovery even if it's just acting as a pass-through.
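For what it's worth, the hand-off target in the math example is tiny to sketch. Assuming SymPy as the symbolic tool (the equation below is a made-up stand-in for whatever the screenshot contained), the LLM's only job is the translation step; the solving is deterministic:

    # The symbolic-algebra hand-off described above, sketched with SymPy.
    # The equation is a made-up example standing in for the transcribed screenshot.
    import sympy as sp

    x = sp.symbols("x")
    equation = sp.Eq(x**2 - 5*x + 6, 0)   # what the LLM would transcribe
    print(sp.solve(equation, x))          # [2, 3]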
How do you think AIs are (correctly) solving simple mathematical questions they haven't been trained on directly? They hand them over to a specialist maths engine.
They’ve been doing that a lot longer than three months. ChatGPT has been handing stuff off to python for a very long time. At least for my paid account anyway.
Wasn't that the basis of computing and technology in general? Here is one tedious thing, let's have a specific tool that handles it instead of wasting time and efforts. The fact is that properly using the tool takes training and most of current AI marketing are hyping that you don't need that. Instead, hand over the problem to a GPT and it will "magically" solve it.
It is and would be useful, but it would be quite a big lie to the public, more importantly to paying customers, and even more importantly to investors.
If I was sold a general AI problem solving system, I’d feel ripped off if I learned that I needed to build my own problem solver and hook it up after I’d paid my money…
This seems quite likely to me, but did they special case it by reinforcement training it into the LLM (which would be extremely interesting in how they did it and what its internal representation looks like) or is it just that when you make an API call to OpenAI, the machine on the other end is not just a zillion-parameter LLM but also runs an instance of Stockfish?
You're imagining LLMs don't just regurgitate and recombine things they already know from things they have seen before. A new variant would not be in the dataset so would not be understood. In fact this is quite a good way to show LLMs are NOT thinking or understanding anything in the way we understand it.
Yes, that's how you can really tell if the model is doing real thinking and not recombinating things. If it can correctly play a novel game, then it's doing more than that.
I wonder what would happen with a game that is mostly chess (or chess with truly minimal variations) but with all the names changed (pieces, moves, "check", etc, all changed). The algebraic notation is also replaced with something else so it cannot be pattern matched against the training data. Then you list the rules (which are mostly the same as chess).
None of these changes are explained to the LLM, so if it can tell it's still chess, it must deduce this on its own.
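If anyone wants to actually try this, the renaming step is mechanical. A rough sketch (the replacement vocabulary here is completely arbitrary; only the structure of the experiment matters):

    # Rewrite standard SAN moves into an invented vocabulary so the game can't be
    # pattern-matched against chess notation in the training data.
    PIECE_NAMES = {"K": "Zar", "Q": "Vex", "R": "Tor", "B": "Lem", "N": "Pix"}
    FILE_NAMES = dict(zip("abcdefgh", "qwertyui"))   # remap files a-h
    RANK_NAMES = dict(zip("12345678", "JKLMNOPQ"))   # remap ranks 1-8

    def disguise_san(move: str) -> str:
        out = []
        for ch in move:
            out.append(PIECE_NAMES.get(ch, FILE_NAMES.get(ch, RANK_NAMES.get(ch, ch))))
        return "".join(out).replace("+", "!").replace("#", "!!")

    print(disguise_san("Nf3"))    # "PixyL"
    print(disguise_san("Qxe7+"))  # "VexxtP!"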
Nice. Even the tiniest rule, I strongly suspect, would throw off pattern matching. “Every second move, swap the name of the piece you move to the last piece you moved.”
If there is a difference, and LLM's can do one but not the other...
>By that standard (and it is a good standard), none of these "AI" things are doing any thinking
>"Does it generalize past the training data" has been a pre-registered goalpost since before the attention transformer architecture came on the scene.
In both scenarios it would perform poorly on that.
If the chess specialization was done through reinforcement learning, that's not going to transfer to your new variant, any more than access to Stockfish would help it.
Why couldn't they add a tool that literally calls stockfish or a chess ai behind the scenes with function calling and buffer the request before sending it back to the endpoint output interface?
As long as you are training it to make a tool call, you can add and remove anything you want behind the inference endpoint accessible to the public, and then you can plug the answer back into the chat ai, pass it through a moderation filter, and you might get good output from it with very little latency added.
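Mechanically that's not hard to imagine. Here's a sketch of what such a behind-the-endpoint tool could look like, assuming python-chess plus a local Stockfish binary; call_llm is again a stand-in for the chat model, and nothing here claims this is what OpenAI actually does:

    # Hypothetical behind-the-endpoint chess tool: replay the moves, ask Stockfish,
    # then hand the engine's answer back to the chat model for phrasing/moderation.
    import chess
    import chess.engine

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("stand-in for the chat model")

    def answer_chess_prompt(san_moves: list[str]) -> str:
        board = chess.Board()
        for san in san_moves:
            board.push_san(san)               # replay the game so far

        engine = chess.engine.SimpleEngine.popen_uci("stockfish")
        try:
            best = engine.play(board, chess.engine.Limit(time=0.1)).move
        finally:
            engine.quit()

        # Buffer the engine's move and plug it back into the chat model so the
        # final text still reads like the LLM wrote it.
        return call_llm(f"Reply to the user; the move to play is {board.san(best)}.")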
Yes, came here to say exactly this. And it's possible this specific model is "cheating", for example by identifying a chess problem and forwarding it to a chess engine. A modern version of the Mechanical Turk.
That's the problem with closed models, we can never know what they're doing.
- "...for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly."
- "I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization"
- "...if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1 e4 e5 2.” (without a space)"
- "I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models."
Between the tokenizer weirdness, temperature, quantization, random moves, and the chess prompt, there's a lot going on here. I'm unsure how to interpret the results. Fascinating article though!
Ah, buried in the post-article part. I was wondering how all of the models were seemingly capable of making legal moves, since last I saw something about LLMs playing Chess they were very much not capable of that.
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing altogether? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.
I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.
Surprised I don't see more research into radically different tokenization.
FWIW I think most of the "tokenization problems" are in fact reasoning problems being falsely blamed on a minor technical thing when the issue is much more profound.
E.g. I still see people claiming that LLMs are bad at basic counting because of tokenization, but the same LLM counts perfectly well if you use chain-of-thought prompting. So it can't be explained by tokenization! The problem is reasoning: the LLM needs a human to tell it that a counting problem can be accurately solved if they go step-by-step. Without this assistance the LLM is likely to simply guess.
I think the more obvious explanation has to do with computational complexity: counting is an O(n) problem, but transformer LLMs can’t solve O(n) problems unless you use CoT prompting: https://arxiv.org/abs/2310.07923
This paper does not support your position any more than it supports the position that the problem is tokenization.
This paper posits that if the authors' intuition were true, then they would find certain empirical results, i.e. "If A then B." Then they test and find the empirical results. But this does not imply that their intuition was correct, just as "If A then B" does not imply "If B then A."
If the empirical results were due to tokenization absolutely nothing about this paper would change.
No, it's a rebuttal of what you said: CoT is not making up for a deficiency in tokenization, it's making up for a deficiency in transformers themselves. These complexity results have nothing to do with tokenization, or even LLMs, it is about the complexity class of problems that can be solved by transformers.
There's a really obvious way to test whether the strawberry issue is tokenization - replace each letter with a number, then ask chatGPT to count the number of 3s.
Count the number of 3s, only output a single number:
6 5 3 2 8 7 1 3 3 9.
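If you want more than a one-off eyeball test, the probe is easy to automate; the model call below is a placeholder, and only the probe construction is the point:

    # Throwaway harness for the digit-counting probe: random digit strings,
    # ask the model to count the 3s, score the answer programmatically.
    import random

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("stand-in for a real model call")

    def probe(n_digits: int = 10) -> bool:
        digits = [str(random.randint(1, 9)) for _ in range(n_digits)]
        prompt = "Count the number of 3s, only output a single number:\n" + " ".join(digits)
        return call_llm(prompt).strip() == str(digits.count("3"))

    # accuracy = sum(probe() for _ in range(100)) / 100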
I’m the one who will fight you, including with peer-reviewed papers indicating that it is in fact due to tokenization. I’m too tired right now but will edit this in later, so take this as my bookmark to remind me to respond.
We know there are narrow solutions to these problems, that was never the argument that the specific narrow task is impossible to solve.
The discussion is about general intelligence: if the model isn't able to do a task that it could do, simply because it chooses the wrong strategy, that is a problem of lack of generalization and not a problem of tokenization. Being able to choose the right strategy is core to general intelligence; altering input data to make it easier for the model to find the right solution to specific questions does not help it become more general, you just shift which narrow problems it is good at.
I am aware of errors in computations that can be fixed by better tokenization (e.g. long addition works better tokenizing right-left rather than L-R). But I am talking about counting, and talking about counting words, not characters. I don’t think tokenization explains why LLMs tend to fail at this without CoT prompting. I really think the answer is computational complexity: counting is simply too hard for transformers unless you use CoT. https://arxiv.org/abs/2310.07923
Words vs characters is a similar problem, since tokens can be less than one word, multiple words, or multiple words plus a partial word, or words with non-word punctuation like a sentence-ending period.
My intuition says that tokenization is a factor especially if it splits up individual move descriptions differently from other LLM's
If you think about how our brains handle this data input, we absolutely do not split it up between the letter and the number, although the presence of both the letter and number together would, I would think, trigger the same 2 tokens.
I strongly believe the issue isn't that tokenization isn't the underlying problem; it's that, let's say, bit-by-bit tokenization is too expensive to run at the scale things are currently being run at (OpenAI, Claude, etc.).
It's not just a current thing, either. Tokenization basically lets you have a model with a larger input context than you'd otherwise have for the given resource constraints. So any gains from feeding the characters in directly have to be greater than this advantage. And for CoT especially - which we know produces significant improvements in most tasks - you want large context.
At a certain level they are identical problems. My strongest piece of evidence is that I get paid as an RLHF'er to find ANY case of error, including "tokenization". Do you know how many errors an LLM makes in the simplest grid puzzles, with CoT, with specialized models that don't try to "one-shot" problems, with multiple models, etc.?
My assumption is that these large companies wouldn't pay hundreds of thousands of RLHF'ers through dozens of third party companies livable wages if tokenization errors were just that.
I think it's infeasible to train on bytes unfortunately, but yeah it also seems very wrong to use a handwritten and ultimately human version of tokens (if you take a look at the tokenizers out there you'll find fun things like regular expressions to change what is tokenized based on anecdotal evidence).
I keep thinking that if we can turn images into tokens, and we can turn audio into tokens, then surely we can create a set of tokens where the tokens are the model's own chosen representation for semantic (multimodal) meaning, and then decode those tokens back to text[1]. Obviously a big downside would be that the model can no longer 1:1 quote all text it's seen since the encoded tokens would need to be decoded back to text (which would be lossy).
[1] From what I could gather, this is exactly what OpenAI did with images in their gpt-4o report, check out "Explorations of capabilities": https://openai.com/index/hello-gpt-4o/
There’s a reason human brains have dedicated language handling. Tokenization is likely a solid strategy. The real thing here is that language is not a good way to encode all forms of knowledge
I know a joke where half of the joke is whistling and half gesturing, and the punchline is whistling. The wording is basically just to say who the players are.
Going from tokens to bytes explodes the model size. I can’t find the reference at the moment, but reducing the average token size induces a corresponding quadratic increase in the width (size of each layer) of the model. This doesn’t just affect inference speed, but also training speed.
One neat thing about the AUNN idea is that when you operate at the function level, you get sort of a neural net version of lazy evaluation; in this case, because you train at arbitrary indices in arbitrary datasets you define, you can do whatever you want with tokenization (as long as you keep it consistent and don't retrain the same index with different values). You can format your data in any way you want, as many times as you want, because you don't have to train on 'the whole thing', any more than you have to evaluate a whole data structure in Haskell; you can just pull the first _n_ elements of an infinite list, and that's fine.
So there is a natural way to not just use a minimal bit or byte level tokenization, but every tokenization simultaneously: simply define your dataset to be a bunch of datapoints which are 'start-of-data token, then the byte encoding of a datapoint followed by the BPE encoding of that followed by the WordPiece encoding followed by ... until the end-of-data token'.
You need not actually store any of this on disk, you can compute it on the fly. So you can start by training only on the byte encoded parts, and then gradually switch to training only on the BPE indices, and then gradually switch to the WordPiece, and so on over the course of training. At no point do you need to change the tokenization or tokenizer (as far as the AUNN knows) and you can always switch back and forth or introduce new vocabularies on the fly, or whatever you want. (This means you can do many crazy things if you want. You could turn all documents into screenshots or PDFs, and feed in image tokens once in a while. Or why not video narrations? All it does is take up virtual indices, you don't have to ever train on them...)
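For concreteness, a toy version of that "every tokenization at once" datapoint layout might look like the following; the sentinel markers and the choice of tiktoken's cl100k_base vocabulary are arbitrary assumptions on my part, not anything from the AUNN write-up:

    # Toy layout: start marker, raw bytes, then a BPE encoding of the same text,
    # then an end marker. A trainer could sample indices from the byte span early
    # on and drift toward the BPE span later, without ever re-tokenizing on disk.
    import tiktoken

    START, END = -1, -2   # sentinel values, not real vocabulary entries

    def multi_encoding_datapoint(text: str) -> list[int]:
        byte_part = list(text.encode("utf-8"))
        bpe_part = tiktoken.get_encoding("cl100k_base").encode(text)
        return [START, *byte_part, *bpe_part, END]

    print(multi_encoding_datapoint("e4 e5 Nf3"))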
A byte is itself sort of a token. So is a bit. It makes more sense to use more tokenizers in parallel than it does to try and invent an entirely new way of seeing the world.
Anyway humans have to tokenize, too. We don't perceive the world as a continuous blob either.
I would say that "humans have to tokenize" is almost precisely the opposite of how human intelligence works.
We build layered, non-nested gestalts out of real time analog inputs. As a small example, the meaning of a sentence said with the same precise rhythm and intonation can be meaningfully changed by a gesture made while saying it. That can't be tokenized, and that isn't what's happening.
What is a gestalt if not a token (or a token representing collections of other tokens)? It seems more reasonable (to me) to conclude that we have multiple contradictory tokenizers that we select from rather than to reject the concept entirely.
How would we train it? Don't we need it to understand the heaps and heaps of data we already have "tokenized" e.g. the internet? Written words for humans? Genuinely curious how we could approach it differently?
Expensive in terms of computationally expensive, time expensive, and yes cost expensive.
Worth noting that the cost scaling with the characters-per-token ratio is probably quadratic or cubic or some other polynomial. So the difference in computational difficulty is probably huge compared to one character per token.
This is probably unnecessary, but: I wish you wouldn't use the word "stupid" there. Even if you didn't mean anything by it personally, it might reinforce in an insecure reader the idea that, if one can't speak intelligently about some complex and abstruse subject that other people know about, there's something wrong with them, like they're "stupid" in some essential way. When in fact they would just be "ignorant" (of this particular subject). To be able to formulate those questions at all is clearly indicative of great intelligence.
Tokenization is not strictly speaking necessary (you can train on bytes). What it is is really really efficient. Scaling is a challenge as is, bytes would just blow that up.
hot take: LLM tokens are the kanji of AI, and just like kanji they work okay sometimes but fail miserably at the task of accurately representing English
Why couldn’t Chinese characters accurately represent English? Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).
If England had been in the Chinese sphere of influence rather than the Roman one, English would presumably be written with Chinese characters too. The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.
If I read you correctly, you're saying "the fact that the residents of England speak English instead of Chinese is a historical accident" and maybe you're right.
But the residents of England do in fact speak English, and English is a phonetic language, so there's an inherent impedance mismatch between Chinese characters and English language. I can make up words in English and write them down which don't necessarily have Chinese written equivalents (and probably, vice-versa?).
> If I read you correctly, you're saying "the fact that the residents of England speak English instead of Chinese is a historical accident" and maybe you're right.
That’s not what I mean at all. I mean even if spoken English were exactly the same as it is now, it could have been written with Chinese characters, and indeed would have been if England had been in the Chinese sphere of cultural influence when literacy developed there.
> English is a phonetic language
What does it mean to be a “phonetic language”? In what sense is English “more phonetic” than the Chinese languages?
> I can make up words in English and write them down which don’t necessarily have Chinese written equivalents
Of course. But if English were written with Chinese characters people would eventually agree on characters to write those words with, just like they did with all the native Japanese words that didn’t have Chinese equivalents but are nevertheless written with kanji.
> In what sense is English “more phonetic” than the Chinese languages?
Written English vs written Chinese.
How would you write, in Chinese, the words thingamajibber, gizmosity, or half the things that come out of AvE's mouth? These words have subtle, humorous, and entertaining meanings by way of twisting the sounds of other existing words. Shakespeare was a master of this kind of wordplay and invented a surprising number of words we use today.
I'm not saying you can't have the same phenomenon in spoken Chinese. But how do you write it down without a phonetic alphabet? And if you can't write it down, how do you share it to a wide audience?
> How would you write, in Chinese, the words thingamajibber, gizmosity, or half the things that come out of AvE's mouth?
With Chinese characters, of course. Why wouldn’t you be able to?
In English “thing”, “a”, and “ma” are already words, and “jibber” would presumably be the first character in “gibberish”. So you could write that made-up word by combining those four characters.
> But how do you write it down without a phonetic alphabet?
In general to write a newly coined word you would repurpose characters that sound the same as the newly coined word.
Every syllable that can possibly be uttered according to mandarin phonology is represented by some character (usually many), so this is always possible.
---
Regardless, to reiterate the original point: I'm not claiming Chinese characters are better or more flexible than alphabetic writing. They're not. I'm simply claiming that there's no inherent property of Japanese that makes it more amenable to representation with Chinese characters than English is (other than the fact that a lot of its vocabulary comes from Chinese, but that's not a real counterpoint given that there is lots of native, non-Chinese-derived vocabulary that's still written with kanji).
It would be possible to write Japanese entirely in the Latin alphabet, or English entirely with some system similar to Chinese characters, with minimal to no change to the structure of the language.
> In English “thing”, “a”, and “ma” are already words, and “jibber” would presumably be the first character in “gibberish”. So you could write that made-up word by combining those four characters.
Nonsense. There is zero chance in hell that if you combine the pictographs for "thing", "a", "ma", and "gibberish", that someone reading that is going to reproduce the sound thingamajibber. It just does not work. The meme does not replicate.
There may be other virtues of pictographic written language, but reproducing sounds is not one of them. And - as any Shakespeare fan will tell you - tweaking the sounds of English cleverly is rather important. If you can't reproduce this behavior, you're losing something in translation. So to speak.
> I'm simply claiming that there's no inherent property of Japanese that makes it more amenable to representation with Chinese characters than English is
What? No. Only IPA (and even that only technically) and a language's own native writing system work for pronunciation. Hiragana, Hangul, and Chữ Quốc Ngữ would not exist otherwise.
Because one is a distant ancestor of the other...? It never adopted a writing system from outside. The written and spoken systems co-evolved from a clean slate.
That’s not true. English is not a descendant of Latin, and the Latin alphabet was adopted from the outside, replacing Anglo-Saxon runes (also called the Futhorc script).
"Donald Trump" in CJK, taken from Wikipedia page URL and as I hear it - each are close enough[1] and natural enough in each respective languages but none of it are particularly useful for counting R in strawberry:
It means the script is intended to record pronunciation rather than meaning; e.g. it's easy to see how "cow" is intended to be pronounced, but it's not necessarily clear what a cow is. Ideographic script, on the other hand, focuses on meaning: "魚" is supposed to look like a fish, but its pronunciation varies across "yueh", "sakana", "awe", etc.
1: I tried looking up other notable figures, but thought this person having entertainment background tends to illustrate the point more clearly
> Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).
The problem is – in writing Japanese with kanji, lots of somewhat arbitrary decisions had to be made. Which kanji to use for which native Japanese word? There isn't always an obviously best choice from first principles. But that's not a problem in practice, because a tradition developed of which kanji to use for which Japanese word (kun'yomi readings). For English, however, we don't have such a tradition. So it isn't clear which Chinese character to use for each English word. If two people tried to write English with Chinese characters independently, they'd likely make different character choices, and the mutual intelligibility might be poor.
Also, while neither Japanese nor Korean belongs to the same language family as Chinese, both borrowed lots of words from Chinese. In Japanese, a lot of the use of kanji (especially on'yomi readings) is for borrowings from Chinese. Since English borrowed far fewer terms from Chinese, this other method of "deciding which character(s) to use" – look at the word's Chinese etymology – largely doesn't work for English, given very few English words have Chinese etymology.
Finally, they also invented kanji in Japan for certain Japanese words – kokuji. The same thing happened for Korean Hanja (gukja), to a lesser degree. Vietnamese Chữ Nôm contains thousands of invented-in-Vietnam characters. Probably, if English had adopted Chinese writing, the same would have happened. But again, deciding when to do it and if so how is a somewhat arbitrary choice, which is impossible outside of a real societal tradition of doing it.
> The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.
Using the Latin alphabet changed English, just as using Chinese characters changed Japanese, Korean and Vietnamese. If English had used Chinese characters instead of the Latin alphabet, it would be a very different language today. Possibly not in grammar, but certainly in vocabulary.
You could absolutely write a tokenizer that would consistently tokenize all distinct English words as distinct tokens, with a 1:1 mapping.
But AFAIK there's no evidence that this actually improves anything, and if you spend that much of the dictionary on one language, it comes at the cost of making the encoding for everything else much less efficient.
I mean, it just felt to me that current LLMs must architecturally favor fixed-length "ideomes" - like phonemes, but for meaning - having been conceived under the influence of research on CJK.
And being architecturally based on idea-like elements, I just casually thought, there could be limits to how far it can be pushed toward perfecting English, and that some radical change - not simply dropping tokenization but something more fundamental - has to take place at some point.
I don't think it's hard for the LLM to treat a sequence of two tokens as a semantically meaningful unit, though. They have to handle much more complicated dependencies to parse higher-level syntactic structures of the language.
I have seen a bunch of tokenization papers with various ideas but their results are mostly meh. I personally don't see anything principally wrong with current approaches. Having discrete symbols is how natural language works, and this might be an okayish approximation.
It's probably worth playing around with different prompts and different board positions.
For context this [1] is the board position the model is being prompted on.
There may be more than one weird thing about this experiment; for example, giving instructions to the non-instruction-tuned variants may be counterproductive.
More importantly, let's say you just give the model the truncated PGN: does this look like a position where white is a grandmaster-level player? I don't think so. Even if the model understood chess really well, it's going to try to predict the most probable move given the position at hand. If the model thinks that white is a bad player, and the model is good at understanding chess, it's going to predict bad moves as the more likely ones, because that better predicts what is most likely to happen here.
Apparently I can find some matches for games that start like that between very strong players [1], so my hypothesis that the model may just be predicting bad moves on purpose seems wobbly, although having stockfish at the lowest level play as the supposedly very strong opponent may still be throwing the model off somewhat. In the charts the first few moves the model makes seem decent, if I'm interpreting these charts right, and after a few of those things seem to start going wrong.
Either way it's worth repeating the experiment imo, tweaking some of these variables (prompt guidance, stockfish strength, starting position, the name of the supposed players, etc.).
Interesting thought: the LLM isn’t trying to win, it’s trying to produce data like the input data. It’s quite rare for a very strong player to play a very weak one. If you feed it lots of weak moves, it’ll best replicate the training data by following with weak moves.
The experiment started from the first move of a game, and played each game fully. The position you linked was just an example of the format used to feed the game state to the model for each move.
What would "winning" or "losing" even mean if all of this was against a single move?
Does it ever try an illegal move? OP didn't mention this and I think it's inevitable that it should happen at least once, since the rules of chess are fairly arbitrary and LLMs are notorious for bullshitting their way through difficult problems when we'd rather they just admit that they don't have the answer.
I suspect that as well, however, 3.5-turbo-instruct has been noted by other people to do much better at generating legal chess moves than the other models. https://github.com/adamkarvonen/chess_gpt_eval gave models "5 illegal moves before forced resignation of the round" and 3.5 had very few illegal moves, while 4 lost most games due to illegal moves.
> he discusses using a grammar to restrict to only legal moves
Whether a chess move is legal isn't primarily a question of grammar. It's a question of the board state. "White king to a5" is a perfectly legal move, as long as the white king was next to a5 before the move, and it's white's turn, and there isn't a white piece in a5, and a5 isn't threatened by black. Otherwise it isn't.
"White king to a9" is a move that could be recognized and blocked by a grammar, but how relevant is that?
Still an interesting direction of questioning. Maybe could be rephrased as "how much work is the grammar doing"? Are the results with the grammar very different than without? If/when a grammar is not used (like in the openai case), how many illegal moves does it try on average before finding a legal one?
A grammar is really just a special case of the more general issue of how to pick a single token given the probabilities that the model spits out for every possible one. In that sense, filters like temperature / top_p / top_k are already hacks that "do the work" (since always taking the most likely predicted token does not give good results in practice), and grammars are just a more complicated way to make such decisions.
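To make "how much work is the grammar doing" concrete: the no-grammar path from the article is basically a retry-and-validate loop like the one below (call_llm is a placeholder; legality is checked against the board state, which is exactly the part a pure grammar can't do):

    # Sketch of the article's approach for the OpenAI models: sample a move,
    # validate it against the actual position, retry up to 10 times, then fall
    # back to a random legal move.
    import random
    import chess

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("stand-in for a completion API call")

    def next_move(board: chess.Board, pgn_so_far: str, retries: int = 10) -> chess.Move:
        for _ in range(retries):
            try:
                candidate = call_llm(pgn_so_far).split()[0]
                return board.parse_san(candidate)   # raises if malformed or illegal here
            except (ValueError, IndexError):
                continue                            # "a9"-style junk and illegal moves alike
        return random.choice(list(board.legal_moves))  # the article's fallback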
I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.
Then you should be surprised that turbo-instruct actually plays well, right? We see a proliferation of hand-wavy arguments based on unfounded anthropomorphic intuitions about "actual reasoning" and whatnot. I think this is good evidence that nobody really understands what's going on.
If some mental model says that LLMs should be bad at chess, then it fails to explain why we have LLMs playing strong chess. If another mental model says the inverse, then it fails to explain why so many of these large models fail spectacularly at chess.
There are some who suggest that modern chess is mostly a game of memorization and not one particularly of strategy or skill. I assume this is why variants like speed chess exist.
In this scope, my mental model is that LLMs would be good at modern style long form chess, but would likely be easy to trip up with certain types of move combinations that most humans would not normally use. My prediction is that once found they would be comically susceptible to these patterns.
Clearly, we have no real basis for saying it is "good" or "bad" at chess, and even using chess performance as an measurement sample is a highly biased decision, likely born out of marketing rather than principle.
I think you're using "skill" to refer solely to one aspect of chess skill: the ability to do brute-force calculations of sequences of upcoming moves. There are other aspects of chess skill, such as:
1. The ability to judge a chess position at a glance, based on years of experience in playing chess and theoretical knowledge about chess positions.
2. The ability to instantly spot tactics in a position.
In blitz (about 5 minutes) or bullet (1 minute) chess games, these other skills are much more important than the ability to calculate deep lines. They're still aspects of chess skill, and they're probably equally important as the ability to do long brute-force calculations.
That should give patterns (hence your use of the verb to "spot" them, as the grandmaster would indeed spot the patterns) recognizable in the game string.
More specifically, grammar-like patterns, e.g. the same moves but translated.
"playing strong chess" would be a much less hand-wavy claim if there were lots of independent methods of quantifying and verifying the strength of stockfish's lowest difficulty setting. I honestly don't know if that exists or not. But unless it does, why would stockfish's lowest difficulty setting be a meaningful threshold?
But to some approximation we do know how an LLM plays chess: based on all the games, sites, blogs, and analysis in its training data. But it has a limited ability to tell a good move from a bad move, since the training data has both, and some of it lacks context on move quality.
Here's an experiment: give an LLM a balanced middle game board position and ask it "play a new move that a creative grandmaster has discovered, never before played in chess and explain the tactics and strategy behind it". Repeat many times. Now analyse each move in an engine and look at the distribution of moves and responses. Hypothesis: It is going to come up with a bunch of moves all over the ratings map with some sound and some fallacious arguments.
I really don't think there's anything too mysterious going on here. It just synthesizes existing knowledge and gives answers that includes bit hits, big misses and everything in between. Creators chip away at the edges to change that distribution but the fundamental workings don't change.
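That experiment is cheap to run, so here's roughly what the harness could look like, assuming python-chess, a local Stockfish for scoring, and a placeholder call_llm; the prompt wording is lifted from the comment above:

    # Ask the model for a "novel grandmaster move" many times, then score each
    # legal suggestion with an engine to see the distribution of move quality.
    import chess
    import chess.engine

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("stand-in for a real model call")

    def score_suggestions(fen: str, n: int = 50, depth: int = 18) -> list[int]:
        scores = []
        engine = chess.engine.SimpleEngine.popen_uci("stockfish")
        try:
            for _ in range(n):
                board = chess.Board(fen)
                reply = call_llm(
                    f"Position (FEN): {fen}\n"
                    "Play a new move that a creative grandmaster has discovered, never "
                    "before played in chess, and explain the tactics and strategy behind it."
                )
                try:
                    board.push_san(reply.split()[0])     # assume the move comes first
                except (ValueError, IndexError):
                    continue                             # skip illegal or garbled suggestions
                info = engine.analyse(board, chess.engine.Limit(depth=depth))
                scores.append(info["score"].pov(chess.WHITE).score(mate_score=10000))
        finally:
            engine.quit()
        return scores   # distribution of engine evaluations for the suggested moves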
One of the main purposes of running experiments of any sort is to find out if our preconceptions are accurate. Of course, if someone is not interested in that question, they might as well choose not to look through the telescope.
Not only on HN. Trying to publish a scientific article that does not contain the word 'novel' has become almost impossible. No one is trying to reproduce anyones claims anymore.
I don't think this is about replication, but even just about the initial test in the first place. In science we do often test obvious things. For example, I was a theoretical quantum physicist, and a lot of the time I knew that what I am working on will definitely work, since the maths checks out. In some sense that makes it kinda obvious, but we test it anyway.
The issue is that even that kinda obviousness is criticised here. People get mad at the idea of doing experiments when we already expect a result.
This is a puzzle given enough training information. LLM can successfully print out the status of the board after the given moves. It can also produce a not-terrible summary of the position and is able to list dangers at least one move ahead. Decent is subjective, but that should beat at least beginners. And the lowest level of stockfish used in the blog post is lowest intermediate.
I don't know really what level we should be thinking of here, but I don't see any reason to dismiss the idea. Also, it really depends on whether you're thinking of the current public implementations of the tech, or the LLM idea in general. If we wanted to get better results, we could feed it way more chess books and past game analysis.
LLMs like GPT aren’t built to play chess, and here’s why: they’re made for handling language, not playing games with strict rules and strategies. Chess engines, like Stockfish, are designed specifically for analyzing board positions and making the best moves, but LLMs don’t even "see" the board. They’re just guessing moves based on text patterns, without understanding the game itself.
Plus, LLMs have limited memory, so they struggle to remember previous moves in a long game. It’s like trying to play blindfolded! They’re great at explaining chess concepts or moves but not actually competing in a match.
Ok, I did go too far. But castling doesn't require all previous moves - only one bit of information carried over. So in practice that's board + 2 bits per player. (or 1 bit and 2 moves if you want to include a draw)
Castling requires no prior moves by either piece (King or Rook). Move the King once and back early on, and later, although the board looks set for castling, the King may not castle.
Yes, which means you carry one bit of extra information - "is castling still allowed". The specific moves that resulted in this bit being unset don't matter.
Ok, then for this you need a minimum of two bits - one for the kingside Rook and one for the queenside Rook; both would be set if you move the King. You also need to count moves since the last capture or pawn move for the 50-move rule.
“The game is not automatically drawn if a position occurs for the third time – one of the players, on their turn, must claim the draw with the arbiter. The claim must be made either before making the move which will produce the third repetition, or after the opponent has made a move producing a third repetition. By contrast, the fivefold repetition rule requires the arbiter to intervene and declare the game drawn if the same position occurs five times, needing no claim by the players.”
It is not stateless, because good chess isn't played as a series of independent moves -- it's played as a series of moves connected to a player's strategy.
> What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
Continuing from the above, "best" in the latter sense involves understanding possible future moves after the next move.
Ergo, if I looked at all games with the current board state and chose the next move that won the most games, it'd be tactically sound but strategically ignorant.
Because many of those next moves were making that next move in support of some broader strategy.
> it's played as a series of moves connected to a player's strategy.
That state belongs to the player, not to the game. You can carry your own state in any game you want - for example remember who starts with what move in rock paper scissors, but that doesn't make that game stateful. It's the player's decision (or bot's implementation) to use any extra state or not.
I wrote "previous moves" specifically (and the extra bits already addressed elsewhere), but the LLM can carry/rebuild its internal state between the steps.
> good chess isn't played as a series of independent moves -- it's played as a series of moves connected to a player's strategy.
Maybe good chess, but not perfect chess. That would by definition be game-theoretically optimal, which in turn implies having to maintain no state other than your position in a large but precomputable game tree.
Right, but your position also includes whether or not you still have the right to castle on either side, whether each pawn has the right to capture en passant or not, the number of moves since the last pawn move or capture (for tracking the 50 move rule), and whether or not the current position has ever appeared on the board once or twice prior (so you can claim a draw by threefold repetition).
So in practice, your position actually includes the log of all moves to that point. That’s a lot more state than just what you can see on the board.
Yes, 4 exceptions: castling rights, legal en passant captures, threefold repetition, and the 50 move rule. You actually need quite a lot of state to track all of those.
It shouldn't be too much extra state. I assume that 2 bits should be enough to cover castling rights (one for each player), whatever is necessary to store the last 3 moves should cover legal en passant captures and threefold repetition, and 12 bits to store two non-overflowing 6 bit counters (time since last capture, and time since last pawn move) should cover the 50 move rule.
So... unless I'm understanding something incorrectly, something like "the three last moves plus 17 bits of state" (plus the current board state) should be enough to treat chess as a memoryless process. Doesn't seem like too much to track.
Threefold repetition does not require the three positions to occur consecutively. So you could conceivably have a position appear for the first time on the 1st move, the second time on the 25th move, and the third time on the 50th move of a sequence, and then the players could claim a draw by threefold repetition or the 50-move rule at the same time!
This means you do need to store the last 50 board positions in the worst case. Normally you need to store less because many moves are irreversible (pawns cannot go backwards, pieces cannot be un-captured).
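python-chess makes this bookkeeping concrete: the FEN already carries the castling rights, en passant square, and halfmove clock discussed above, while repetition claims need the move history on top of that. A quick illustration:

    # The "hidden" state beyond the visible pieces: castling rights, en passant
    # square, and halfmove clock live in the FEN; repetition needs the history.
    import chess

    board = chess.Board()
    for san in ["e4", "c5", "Nf3"]:
        board.push_san(san)

    print(board.fen())
    # rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2
    #   piece placement                          turn castling ep halfmove fullmove

    print(board.has_castling_rights(chess.WHITE))   # True
    print(board.ep_square)                          # None unless the last move was a double push
    print(board.halfmove_clock)                     # 1 here; used for the 50/75-move rules
    print(board.can_claim_threefold_repetition())   # needs the move stack, not just the FEN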
> they’re made for handling language, not playing games with strict rules and strategies
Here's the opposite theory: Language encodes objective reasoning (or at least, it does some of the time). A sufficiently large ANN trained on sufficiently large amounts of text will develop internal mechanisms of reasoning that can be applied to domains outside of language.
Based on what we are currently seeing LLMs do, I'm becoming more and more convinced that this is the correct picture.
I share this idea but from a different perspective. It doesn’t develop these mechanisms, but casts a high-dimensional-enough shadow of their effect on itself. This vaguely explains why the deeper you go Gell-Mann-wise, the less sharp that shadow is, because specificity cuts off “reasoning” hyperplanes.
It’s hard to explain emerging mechanisms because of the nature of generation, which is one-pass sequential matrix reduction. I say this while waving my hands, but listen. Reasoning is similar to Turing complete algorithms, and what LLMs can become through training is similar to limited pushdown automata at best. I think this is a good conceptual handle for it.
“Line of thought” is an interesting way to loop the process back, but it doesn’t show that much improvement, afaiu, and still is finite.
Otoh, a chess player takes as much time and “loops” as they need to get the result (ignoring competitive time limits).
LLMs need to compress information to be able to predict next words in as many contexts as possible.
Chess moves are simply tokens as any other.
Given enough chess training data, it would make sense to have part of the network trained to handle chess specifically instead of simply encoding basic lists of moves and follow-ups. The result would be a general purpose sub-network trained on chess.
Because it's a straight forward stochastic sequence modelling task and I've seen GPT-3.5-turbo-instruct play at high amateur level myself. But it seems like all the RLHF and distillation that is done on newer models destroys that ability.
PS: I ran it, and as suspected gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close.
"Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"
https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
But in that case there shouldn't be any invalid moves, ever. Another tester found gpt-3.5-turbo-instruct to be suggesting at least one illegal move in 16% of the games (source: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/ )
> I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
Because it would be super cool; curiosity isn't something to be frowned upon. If it turned out it did play chess reasonably well, it would mean emergent behaviour instead of just echoing things said online.
But it's wishful thinking with this technology at this current level; like previous instances of chatbots and the like, while initially they can convince some people that they're intelligent thinking machines, this test proves that they aren't. It's part of the scientific process.
They thought it because we have an existence proof: gpt-3.5-turbo-instruct can play chess at a decent level.
That was the point of the post (though you have to read it to the end to see this). That one model can play chess pretty well, while the free models and OpenAI's later models can't. That's weird.
I suppose you didn't get the news, but Google developed an LLM that can play chess. And play it at grandmaster level: https://arxiv.org/html/2402.04494v1
Not quite an LLM. It's a transformer model, but there's no tokenizer or words, just chess board positions (64 tokens, one per board square). It's purpose-built for chess (never sees a word of text).
In fact, the unusual aspect of this chess engine is not that it's using neural networks (even Stockfish does, these days!), but that it's only using neural networks.
Chess engines essentially do two things: calculate the value of a given position for their side, and walk the game tree while evaluating its positions in that way.
Historically, position value was a handcrafted function using win/lose criteria (e.g. being able to give checkmate is infinitely good) and elaborate heuristics informed by real chess games, e.g. having more space on the board is good, having a high-value piece threatened by a low-value one is bad etc., and the strength of engines largely resulted from being able to "search the game tree" for good positions very broadly and deeply.
Recently, neural networks (trained on many simulated games) have been replacing these hand-crafted position evaluation functions, but there's still a ton of search going on. In other words, the networks are still largely "dumb but fast", and without deep search they'll lose against even a novice player.
This paper now presents a searchless chess engine, i.e. one who essentially "looks at the board once" and "intuits the best next move", without "calculating" resulting hypothetical positions at all. In the words of Capablanca, a chess world champion also cited in the paper: "I see only one move ahead, but it is always the correct one."
The fact that this is possible can be considered surprising, a testament to the power of transformers etc., but it does indeed have nothing to do with language or LLMs (other than that the best ones known to date are based on the same architecture).
It's interesting to note that the paper benchmarked its chess playing performance against GPT-3.5-turbo-instruct, the only well performant LLM in the posted article.
Right, at least as of the ~GPT-3 model it was just "predict what you would see in a chess game", not "what would be the best move". So (IIRC) users noted that if you made a bad move, the model would also reply with bad moves, because it pattern-matched to bad games. (I anthropomorphized this as the model saying "oh, we're doing dumb-people-chess now, I can do that too!")
But it also predicts moves where the text says "black won the game, [proceeds to show the game]". To minimize loss on that, it would need to use the context to try to make it so white doesn't make critical mistakes.
Chess does not clearly require that. Various purely ML/statistical based model approaches are doing pretty well. It's almost certainly best to incorporate some kind of search into an overall system, but it's not absolutely required to play just decent amateur level.
The problem here is the specific model architecture, training data, vocabulary/tokenization method (if you were going to even represent a game this way... which you wouldn't), loss function and probably decoding strategy.... basically everything is wrong here.
> I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
The blog post demonstrates that a LLM plays chess at a decent level.
The blog post explains why. It addresses the issue of data quality.
I don't understand what point you thought you were making. Regardless of where you stand, the blog post showcases a surprising result.
You stressed your unfounded prior belief, you were presented with data that proves it wrong, and your reaction was to post a comment with a thinly veiled accusation that people aren't educated, when clearly you are the one who's off.
To make matters worse, this topic is also about curiosity, which has a strong link with intelligence and education. And you are here criticizing others on those grounds in spite of showing your deficit right in the first sentence.
This blog post was a great read. Very surprising, engaging, and thought provoking.
The only service performing well is a closed source one that could simply use a real chess engine for questions that look like chess, for marketing purposes. There’s nothing thought provoking about a bunch of engineers doing “experiments” against a service, other than how sad it is to debase themselves in this way.
> The only service performing well is a closed source one that could simply use a real chess engine for questions that look like chess, for marketing purposes.
That conspiracy theory holds no traction in reality. This blog post is so far the only reference to using LLMs to play chess. The "closed-source" model (whatever that is) is an older version that does worse than the newer version. If your conspiracy theory had any bearing in reality how come this fictional "real chess engine" was only used in a single release? Unbelievable.
Back in reality, it is well known that newer models made available to the public are adapted to business needs by constraining their capabilities and limiting liability.
But there's really nothing about chess that makes reasoning a prerequisite, a win is a win as long as it's a win. This is kind of a semantics game: it's a question of whether the degree of skill people observe in an LLM playing chess is actually some different quantity than the chance it wins.
I mean at some level you're saying that no matter how close to 1 the win probability (1 - epsilon) gets, both of the following are true:
A. you should always expect for the computation that you're able to do via conscious reasoning alone to always be sufficient, at least in principle, to asymptotically get a higher win probability than a model, no matter what the model's win probability was to begin with
B. no matter how close to 1 that the model's win rate p=(1 - epsilon) gets, because logical inference is so non-smooth, the win rate on yet-unseen data is fundamentally algorithmically random/totally uncorrelated to in-distribution performance, so it's never appropriate to say that a model can understand or to reason
To me it seems that people are subject to both of these criteria, though. They have a tendency to cap out at their eventual skill cap unless given a challenge to nudge them to a higher level, and likewise possession of logical reasoning doesn't let us say much at all about situations that their reasoning is unfamiliar with.
I also think, if you want to say that what LLMs do has nothing to do with understanding or ability, then you also have to have an alternate explanation for the phenomenon of AlphaGo defeating Lee Sedol being a catalyst for top Go players being able to rapidly increase their own rankings shortly after.
Few people (perhaps none) expected LLMs to be good at chess. Nevertheless, as the article explains, there was buzz around a year ago that LLMs were good at chess.
> It has no idea about the quality of it's data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation which clearly chess requires.
No. You can definitely train a model to be really good at chess without "actual reasoning and deterministic computation".
It sorta played chess: he let it generate up to ten moves, throwing away any that weren't legal, and if no legal move was generated by the 10th try he picked a random legal move. He does not say how many times he had to provide a random move, or how many times illegal moves were generated.
That was for the OpenAI games, including the ones that won. For the ones he ran himself with open-source LLMs, he restricted their grammar to just legal moves, so they could only respond with a legal move. But that was because of a separate process he added on top of the LLM.
> I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization, whatever that is.
It's just a lossy compression of all of the parameters, probably not important, right?
I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. Even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
That said, for the sake of completeness, modern chess engines (with high-quality chess-specific models as part of their toolset) are fully capable of, at minimum, tying every player alive or dead, every time. If the opponent makes one mistake, even a very small one, they will lose.
While writing this I absently wondered whether, if you increased the skill level of Stockfish, maybe to maximum, or at least to an 1800+ Elo player, you would see more successful games. Even then, it would only be because the narrower training data at that level (i.e. advanced players won't play trash moves) will probably get you more wins in your graph, but it won't indicate any better play; it will just be a reflection of less noise: fewer, more reinforced known positions.
> I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. Even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
Indeed. As has been pointed out before, the number of possible chess games vastly dwarfs even the wildest estimate of the number of atoms in the known universe.
Sure, but so does the number of paragraphs in the English language, and yet LLMs seem to do pretty well at that. I don't think the number of configurations is particularly relevant.
(And it's honestly quite impressive that an LLM can play it at all, but not at all surprising that it loses pretty handily to something explicitly designed to search, as opposed to simply feed-forwarding a decision.)
Since we're mentioning Shannon... What is the minimum representative sample size of that problem space? Is it close enough to the number of freely available chess moves on the Internet and in books?
> I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good.
Yeah, once you've deviated from a sequence you're lost.
Maybe approaching it by learning the best move in billions/trillions of positions, and feeding that into some AI could work better. Similar positions often have the same kind of best move.
Honestly, I think that once you discard the moves one would never make, and account for symmetries/effectively similar board positions (ones that could be detected by a very simple pattern matcher), chess might not be that big a game at all.
I’d bet it’s using function calling out to a real chess engine. It could probably be proven with a timing analysis, to see how inference time does or doesn’t change with the number of tokens or the game complexity.
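A rough version of that timing test is only a few lines (a sketch, assuming the openai Python client and API access to gpt-3.5-turbo-instruct; the PGN prefixes and prompt format are just illustrations, and network jitter means you'd want many repetitions per prefix):

    import time
    from openai import OpenAI  # assumes the v1 openai client is installed

    client = OpenAI()

    # PGN prefixes of increasing depth; a call-out to a search engine should get
    # slower as positions get deeper/more complex, pure inference should not.
    prefixes = [
        "1. e4 e5 2. Nf3",
        "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7",
        "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 "
        "7. Bb3 d6 8. c3 O-O 9. h3 Nb8 10. d4 Nbd7",
    ]

    for pgn in prefixes:
        t0 = time.perf_counter()
        resp = client.completions.create(
            model="gpt-3.5-turbo-instruct",
            prompt=pgn + " ",
            max_tokens=5,
            temperature=0,
        )
        print(f"{len(pgn.split()):3d} items in prompt, "
              f"{time.perf_counter() - t0:.2f}s, move: {resp.choices[0].text.strip()!r}")

If the latency stays flat as the position gets deeper and more tactical, that's at least weak evidence against a hidden search call.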
?? Why would OpenAI even want to secretly embed chess function calling into an incredibly old model? If they wanted to trick people into thinking their models are super good at chess, why wouldn't they just do that to gpt-4o?
The idea is that they embedded this when it was a new model, as part of the hype before GPT-4. The fake-it-till-you-make-it hope was that GPT-4 would be so good it could actually play chess. Then it turned out GPT-4 sucked at chess as well, and OpenAI quietly dropped any mention of chess. But it would be too suspicious to remove a well-documented feature from the old model, so it's left there and can be chalked up as a random event.
OpenAI has a TON of experience making game-playing AI. That was their focus for years, if you recall. So it seems like they made one model good at chess to see if it had an overall impact on intelligence (just as learning chess might make people smarter, or learning math might make people smarter, or learning programming might make people smarter)
Playing is strongly tied to an abstract representation of the game as game states. Even if the player does not realize it, chess is really about shallow or beam search within the possible moves.
LLMs don’t do reasoning or exploration; they write text based on previous text. So to us it may seem like playing, but it is really smart guesswork based on previous games. It’s like Kasparov writing moves without imagining the actual placement.
What would be interesting is to see whether a model, given only the rules, will play. I bet it won’t.
At this moment it’s replaying by memory but definitely not chasing goals. There’s no such thing as forward attention yet, and beam search is expensive enough that one would prefer to fall back to classic chess algos.
At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.
Apples and oranges. VW actually cheated on regulatory testing to bypass legal requirements. So to be comparable, the government would first need to pass laws where e.g. only compilers that pass a certain benchmark are allowed to be used for purchasable products and then the developers would need to manipulate behaviour during those benchmarks.
There's a sliding scale of badness here. The emissions cheating (it wasn't just VW, incidentally; they were just the first uncovered. Fiat-Chrysler, Mercedes, GM and BMW were also caught doing it, with suspicions about others) was straight-up fraud.
It used to be common for graphics drivers to outright cheat on benchmarks (the actual image produced would not be the same as it would have been if a benchmark had not been detected); this was arguably, fraud.
It used to be common for mobile phone manufacturers to allow the SoC to operate in a thermal mode that was never available to real users when it detected a benchmark was being used. This is still, IMO, kinda fraud-y.
Optimisation for common benchmark cases where the thing still actually _works_, and where the optimisation is available to normal users where applicable, is less egregious, though, still, IMO, Not Great.
I think breaking a law is more unethical than not breaking a law.
Also, legality isn't the only difference in the VW case. With VW, they had a "good emissions" mode. They enabled the good emissions mode during the test, but disabled it during regular driving. It would have worked during regular driving, but they disabled it during regular driving. With compilers, there's no "good performance" mode that would work during regular usage that they're disabling during regular usage.
> I think breaking a law is more unethical than not breaking a law.
It sounds like a mismatch of definition, but I doubt you're ambivalent about a behavior right until the moment it becomes illegal, after which you think it unethical. Law is the codification and enforcement of a social contract, not the creation of it.
But following the law is itself a load bearing aspect of the social contract. Violating building codes, for example, might not cause immediate harm if it's competent but unusual, yet it's important that people follow it just because you don't want arbitrariness in matters of safety. The objective ruleset itself is a value beyond the rules themselves, if the rules are sensible and in accordance with deeper values, which of course they sometimes aren't, in which case we value civil disobedience and activism.
Also, while laws ideally are inspired by an ethical social contract, the codification process is long, complex and far from perfect. And then, for rules concerning permissible behavior, even in the best of cases enforcement is extremely sparing, simply because it's neither possible nor desirable to detect and deal with all infractions. Nor is it applied blindly and equally. As actually applied, a law is definitely not even close to some ethical ideal; sometimes it's outright opposed to it, even.
Law and ethics are barely related, in practice.
For example, in the vehicle emissions context, it's worth noting that even well before VW was caught, the actions of likely all carmakers affected by the regulations (not necessarily to the same extent) were clearly unethical. The rules had been subject to intense, clearly unethical lobbying for years, and so even the legal lab results bore little resemblance to practical on-the-road results, through systematic (yet legal) abuse. I wouldn't be surprised to learn that even what was measured intentionally diverged from what is harmful, in a profitable way. It's a good thing VW was made an example of - but clearly it's not like that resolved the general problem of harmful vehicle emissions. Optimistically, it might have signaled to the rest of the industry and VW in particular to stretch the rules less in the future.
>I doubt you're ambivalent about a behavior right until the moment it becomes illegal, after which you think it unethical.
There are many cases where I think that. Examples:
* Underage drinking. If it's legal for someone to drink, I think it's in general ethical. If it's illegal, I think it's in general unethical.
* Tax avoidance strategies. If the IRS says a strategy is allowed, I think it's ethical. If the IRS says a strategy is not allowed, I think it's unethical.
* Right on red. If the government says right on red is allowed, I think it's ethical. If the government (e.g. NYC) says right on red is not allowed, I think it's unethical.
The VW case was emissions regulations. I think they have an ethical obligation to obey emissions regulations. In the absence of regulations, it's not an obvious ethical problem to prioritize fuel efficiency over emissions (which is, I believe, what VW was doing).
Drinking and right turns are unethical if they’re negligent. They’re not unethical if they’re not negligent. The government is trying to reduce negligence by enacting preventative measures to stop ALL right turns and ALL drinking in certain contexts that are more likely to yield negligence, or where the negligence would be particularly harmful, but that doesn’t change whether or not the behavior itself is negligent.
You might consider disregarding the government’s preventative measures unethical, and doing those things might be the way someone disregards the government’s protective guidelines, but that doesn’t make those actions unethical any more than governments explicitly legalizing something makes it ethical.
To use a clearer example, the ethicality of abortion— regardless of what you think of it— is not changed by its legal status. You might consider violating the law unethical, so breaking abortion laws would constitute the same ethical violation as underage drinking, but those laws don’t change the ethics of abortion itself. People who consider it unethical still consider it unethical where it’s legal, and those that consider it ethical still consider it ethical where it’s not legal.
It's not so simple. An analogy is the Rust formatter that has no options so everyone just uses the same style. It's minimally "unethical" to use idiosyncratic Rust style just because it goes against the convention so people will wonder why you're so special, etc.
If the rules themselves are bad and go against deeper morality, then it's a different situation; violating laws out of civil disobedience, emergent need, or with a principled stance is different from wanton, arbitrary, selfish cheating.
If a law is particularly unjust, violating the law might itself be virtuous. If the law is adequate and sensible, violating it is usually wrong even if the violating action could be legal in another sensible jurisdiction.
> but that doesn’t make those actions unethical any more than governments explicitly legalizing something makes it ethical
That is, sometimes, sufficient.
If the government says ‘the seller of a house must disclose issues’, then I rely on the law being followed; if you sell and leave the country, you have defrauded me.
However if I live in a ‘buyer beware’ jurisdiction, then I know I cannot trust the seller and I hire a surveyor and take insurance.
There is a degree of setting expectations- if there is a rule, even if it’s a terrible rule, I as individual can at least take some countermeasures.
You can’t take countermeasures against all forms of illegal behaviour, because there is an infinite number of them. And a truly insane person is not predictable at all.
I agree that if they're negligent they're unethical. But I also think that if they're illegal they're generally unethical. In situations where some other right is more important than the law, underage drinking or an illegal right on red would be ethical, such as if alcohol is needed as an emergency pain reliever, or a small amount for religious worship, or if you need to drive to the hospital fast in an emergency.
Abortion opponents view it as killing an innocent person. So that's unethical regardless of whether it's legal. I'm not contesting in any way that legal things can be unethical. Abortion supporters view it as a human right, and that right is more important than the law.
Right on red, underage drinking, and increasing car emissions aren't human rights. So outside of extenuating circumstances, if they're illegal, I see them as unethical.
> Abortion opponents view it as killing an innocent person. So that's unethical regardless of whether it's legal.
So it doesn't matter that a very small percentage of the world's population believes life begins at conception, it's still unethical? Or is everything unethical that anyone thinks is unethical across the board, regardless of the other factors? Since some vegans believe eating honey is unethical, does that mean it's unethical for everybody, or would it only be unethical if it was illegal?
In autocracies where all newly married couples were legally compelled to allow the local lord to rape the bride before they consummated the marriage, avoiding that would be unethical?
Were the sit-in protest of the American civil rights era unethical? They were illegal.
Was it unethical to hide people from the Nazis when they were searching for people to exterminate? It was against the law.
Was apartheid ethical? It was the law.
Was slavery ethical? It was the law.
Were the Jim Crow laws ethical?
I have to say, I just fundamentally don't understand your faith in the infallibility of humanity's leaders and governing structures. Do I think it's generally a good idea to follow the law? Of course. But there are so very many laws that are clearly unethical. I think your conflating legal correctness with mores with core foundational ethics is rather strange.
The right-on-red example is interesting because in that case the law changes how other drivers and pedestrians will behave, in ways that make the turn pretty much always unsafe.
That just changes the parameters of negligence. On a country road in the middle of a bunch of farm land where you can see for miles, it doesn’t change a thing.
Ethics are only morality if you spend your entire time in human social contexts. Otherwise morality is a bit larger, and ethics are a special case of group recognized good and bad behaviors.
What if I make sure to have a drink once a week for the summer with my 18 year old before they go to college because I want them to understand what it's like before they go binge with friends? Is that not ethical?
Speeding to the hospital in an emergency? Lying to Nazis to save a Jew?
Law and ethics are more correlated than some are saying here, but the map is not the territory, and it never will be.
There can be situations where someone's rights are more important than the law. In that case it's ethical to break the law. Speeding to the hospital and lying to Nazis are cases of that. The drinking with your 18 year old, I'm not sure, maybe.
My point though, is that in general, when there's not a right that outweighs the law, it's unethical to break the law.
Unless following an unethical law would in itself be unethical; then breaking the unethical law would be the only ethical choice. In this case cheating on emissions, which I see as unethical but also advantageous for the consumer, should have been done openly if VW saw following the law as unethical. Ethics and morality are subjective to one's understanding, and law is only a crude approximation of divinity. Though I would argue that each person on earth, through a shared common experience, has a rough and general idea of right and wrong... though I'm not always certain they pay attention to it.
I think you're talking about something different from what sigmoid10 was talking about. sigmoid10 said "manipulate behaviour during those benchmarks". I interpreted that to mean the compiler detects if a benchmark is going on and alters its behavior only then. So this wouldn't impact real life use cases.
I agree that ethics should inform law. But I live in a society, and have an ethical duty to respect other members of society. And part of that duty is following the laws of society.
I disagree- presumably if an algorithm or hardware is optimized for a certain class of problem it really is good at it and always will be- which is still useful if you are actually using it for that. It’s just “studying for the test”- something I would expect to happen even if it is a bit misleading.
VW cheated such that the low emissions were only active during the test- it’s not that it was optimized for low emissions under the conditions they test for, but that you could not get those low emissions under any conditions in the real world. That's "cheating on the test" not "studying for the test."
How so? VW intentionally changed the operation of the vehicle so that its emissions met the test requirements during the test and then went back to typical operation conditions afterwards.
VW was breaking the law in a way that harmed society but arguably helped the individual driver of the VW car, who gets better performance yet still passes the emissions test.
That is not true. Even ChatGPT understands how they are different. I won’t paste the whole response, but here are the differences it highlights:
Key differences:
1. Intent and harm:
• VW’s actions directly violated laws and had environmental and health consequences. Optimizing LLMs for chess benchmarks, while arguably misleading, doesn’t have immediate real-world harms.
2. Scope: Chess-specific optimization is generally a transparent choice within AI research. It’s not a hidden “defeat device” but rather an explicit design goal.
3. Broader impact: LLMs fine-tuned for benchmarks often still retain general-purpose capabilities. They aren’t necessarily “broken” outside chess, whereas VW cars fundamentally failed to meet emissions standards.
Tesla cheats by using electric motors and deferring emissions standards to somebody else :D Wait, I really think that's a good thing, but once Hulk Hogan is confirmed administrator of the EPA, he might actually use this argument against Teslas and other electric vehicles.
This is the apples-to-apples version. Perhaps it might be more accurate to say that when detecting a benchmark attempt the model tries the prompt 3 times with different seeds and then picks the best answer, whereas it just zero-shots the prompt in everyday use.
I say this because the test still used the same hardware (model) but changed the way it behaved by running emissions-friendly parameters (a different execution framework) that wouldn’t have been used in everyday driving, where fuel-efficiency- and performance-optimized parameters were used instead.
What I’d like to know is if it actually was unethical or not. The overall carbon footprint of the lower fuel consumption setting, with fuel manufacturing and distribution factored in, might easily have been more impactful than the emissions model, which typically does not factor in fuel consumed.
Most of the time these days compiler writers are not cheating like VW did. In the 1980s compiler writers would insert code to recognize performance tests and then cheat - output values hard coded into the compiler instead of running the algorithm. Which is the type of thing that VW got in trouble for.
These days most compilers are trying to make the general case of code fast and they rarely look for benchmarks. I won't say they never do this - just that it is much less common - if only because magazine reviews/benchmarks are not nearly as important as they used to be and so the incentive is gone.
Only because what VW did is illegal, was super large scale, and could be linked to a lot of indirect deaths through the additional pollution.
Benchmark optimizations are slightly embarrassing at worst, and an "optimization for a specific use case" at best. There's no regulation against optimizing for a particular task, everyone does it all the time, in some cases it's just not communicated transparently.
Phone manufacturers were caught "optimizing" for benchmarks again and again, removing power limits to boost scores. Hard to name an example without searching the net because it's at most a faux pas.
I think the OP's point is that chat GPT-3.5 may have a chess-engine baked-in to its (closed and unavailable) code for PR purposes. So it "realizes" that "hey, I'm playing a game of chess" and then, rather than doing whatever it normally does, it just acts as a front-end for a quite good chess-engine.
Not quite. VW got in trouble for running _different_ software in test vs prod. These optimizations are all going to "prod" but are only useful for specific targets (a specific game in this case).
> VW got in trouble for running _different_ software in test vs prod.
Not quite. They programmed their "prod" software to recognise the circumstances of a laboratory test and behave differently. Namely during laboratory emissions testing they would activate emission control features they would not activate otherwise.
The software was the same they flash on production cars. They were production cars. You could take a random car from a random dealership and it would have done the same trickery in the lab.
I disagree with your distinction on the environments but understand your argument. Production for VW, to me, is "on the road when a customer is using your product as intended". Using the same artifact for those different environments isn't the same as "running that in production".
The “test” environment is the domain of prototype cars driving at the proving ground. It is an internal affair, only for employees and contractors. The software is compiled on some engineer’s laptop and uploaded to the ECU by an engineer manually. No two cars are ever the same, everything is in flux. The number of cars is small.
“Production” is a factory line producing cars. The software is uploaded to the ECUs by some factory machine automatically. Each car is exactly the same, with the exact same software version on thousands and thousands of cars. The cars are sold to customers.
Some small number of these production cars are sent for regulatory compliance checks to third parties. But those cars won’t suddenly become non-production cars just because someone sticks a probe in their exhaust. The same way gmail’s production servers don’t suddenly turn into test environments just because a user opens the network tab in their browser’s dev tools to see what kind of requests fly over the wire.
Can you try increasing compute in the problem search space, not in the training space? What this means is: give it more compute to think during inference, by not forcing the model to "only output the answer in algebraic notation" but instead doing CoT prompting:
"1. Think about the current board
2. Think about valid possible next moves and choose the 3 best by thinking ahead
3. Make your move"
Or whatever you deem a good step by step instruction of what an actual good beginner chess player might do.
Then try different notations, different prompt variations, temperatures and the other parameters. That all needs to go into your hyperparameter tuning.
One could try using DSPy for automatic prompt optimization.
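For concreteness, a CoT-style prompt along those lines might look like the sketch below (the wording, move format, and model name are all placeholders; whether any of it actually helps is exactly what you'd be testing):

    from openai import OpenAI

    client = OpenAI()

    COT_PROMPT = (
        "You are playing chess as Black.\n"
        "Moves so far (PGN): {pgn}\n\n"
        "1. Describe the current board position in your own words.\n"
        "2. List three candidate legal moves and evaluate each, thinking one move ahead.\n"
        "3. On the last line, output your chosen move as:  MOVE: <move in SAN>\n"
    )

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name, not from the post
        messages=[{"role": "user",
                   "content": COT_PROMPT.format(pgn="1. e4 e5 2. Nf3 Nc6 3. Bb5")}],
        temperature=0.7,
    )
    print(resp.choices[0].message.content)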
Can be forced through inference with CoT-type stuff. Spend tokens at each stage to draw the board, for example, then spend tokens restating the rules of the game, then spend tokens restating the heuristics like piece values, and then spend tokens doing a minimax n-ply search.
Wildly inefficient? Probably. Could maybe generate some python to make more efficient? Maybe, yeah.
Essentially the user would have to teach GPT to play chess, or training would have to fine-tune it toward these CoT patterns, etc.
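If you did go down the generate-some-Python road, the n-ply search piece itself is tiny; here's a sketch with a bare material-count evaluation, assuming the python-chess package (nothing like what a real engine does, just the shape of the idea):

    import chess

    PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

    def material(board):
        # Positive score means White is ahead in material.
        score = 0
        for piece in board.piece_map().values():
            value = PIECE_VALUES[piece.piece_type]
            score += value if piece.color == chess.WHITE else -value
        return score

    def minimax(board, depth):
        # Returns (score, best_move) for the side to move; no pruning, just a plain n-ply search.
        if depth == 0 or board.is_game_over():
            return material(board), None
        maximizing = board.turn == chess.WHITE
        best_score, best_move = (-10**9 if maximizing else 10**9), None
        for move in board.legal_moves:
            board.push(move)
            score, _ = minimax(board, depth - 1)
            board.pop()
            if (maximizing and score > best_score) or (not maximizing and score < best_score):
                best_score, best_move = score, move
        return best_score, best_move

    board = chess.Board()
    board.push_san("e4"); board.push_san("c5")
    print(minimax(board, depth=3))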
Yeah, expecting an immediate answer definitely hurts the results, especially in the later stages. Another possible improvement: every 2 steps, show the current board state and repeat the moves still to be processed, before analysing the final position.
2. It plays like what you'd expect from an LLM that could play chess. That is, the level of play can be modulated by the prompt and doesn't manifest the same way shifting the level of stockfish etc. does. Also, the specific chess notation being prompted actually matters.
5. You can, or at least you used to be able to, inspect the logprobs. I think OpenAI has stopped exposing this, but the link in 4 does show the author inspecting them for Turbo instruct.
> Also the specific chess notation being prompted actually matters
Couldn't this be evidence that it is using an engine? Maybe if you use the wrong notation it relies on the ANN rather than calling to the engine.
Likewise:
- The sensitivity to game history is interesting, but is it actually true that other chess engines only look at current board state? Regardless, maybe it's not an existing chess engine! I would think OpenAI has some custom chess engine built as a side project, PoC, etc. In particular this engine might be neural and trained on actual games rather than board positions, which could explain dependency on past moves. Note that the engine is not actually very good. Does AlphaZero depend on move history? (Genuine question, I am not sure. But it does seem likely.)
- I think the illegal moves can be explained similarly to why gpt-o1 sometimes screws up easy computations despite having access to Python: an LLM having access to a tool does not guarantee it always uses that tool.
I realize there are holes in the argument, but I genuinely don't think these holes are as big as the "why is gpt-3.5-turbo-instruct so much better at chess than gpt-4?" hole.
> Couldn’t this be evidence that it is using an engine?
A test would be to measure its performance against more difficult versions of Stockfish. A real chess engine would have a higher ceiling.
Much more likely is that this model was trained on more chess PGNs. You can call that a “neural engine” if you’d like, but it is the simplest solution and explains the mistakes it is making.
Game state isn’t just what you can see on the board. It includes the 50 move rule and castling rights. Those were encoded as layers in AlphaZero along with prior positions of pieces. (8 prior positions if I’m remembering correctly.)
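For anyone curious what "encoded as layers" looks like in practice, here is a toy version of that plane encoding for a single position (a sketch with python-chess and numpy; AlphaZero's actual input stack is larger and also includes several prior positions):

    import chess
    import numpy as np

    def encode(board):
        # 12 piece planes + 4 castling-rights planes + 1 halfmove-clock plane
        planes = np.zeros((17, 8, 8), dtype=np.float32)
        for square, piece in board.piece_map().items():
            idx = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
            planes[idx, chess.square_rank(square), chess.square_file(square)] = 1.0
        planes[12] = float(board.has_kingside_castling_rights(chess.WHITE))
        planes[13] = float(board.has_queenside_castling_rights(chess.WHITE))
        planes[14] = float(board.has_kingside_castling_rights(chess.BLACK))
        planes[15] = float(board.has_queenside_castling_rights(chess.BLACK))
        planes[16] = board.halfmove_clock / 100.0  # 50-move-rule counter, scaled
        return planes

    print(encode(chess.Board()).shape)  # (17, 8, 8)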
The author thinks this is unlikely because it only has an ~1800 ELO. But OpenAI is shady as hell, and I could absolutely see the following purely hypothetical scenario:
- In 2022 Brockman and Sutskever have an unshakeable belief that Scaling Is All You Need, and since GPT-4 has a ton of chess in its pretraining data it will definitely be able to play competent amateur chess when it's finished.
- A ton of people have pointed out that ChatGPT-3.5 doesn't even slightly understand chess despite seeming fluency in the lingo. People start to whisper that transformers cannot actually create plans.
- Therefore OpenAI hatches an impulsive scheme: release an "instruction-tuned" GPT-3.5 with an embedded chess engine that is not a grandmaster, but can play competent chess, ideally just below the ELO that GPT-4 is projected to have.
- Success! The waters are muddied: GPT enthusiasts triumphantly announce that LLMs can play chess, it just took a bit more data and fine-tuning. The haters were wrong: look at all the planning GPT is doing!
- Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do competitors' foundation LLMs which otherwise outperform GPT-3.5. The scaling "laws" failed here, since they were never laws in the first place. OpenAI accepts that scaling transformers won't easily solve the chess problem, then realizes that if they include the chess engine with GPT-4 without publicly acknowledging it, then Anthropic and Facebook will call out the performance as aberrational and suspicious. But publicly acknowledging a chess engine is even worse: the only reason to include the chess engine is to mislead users into thinking GPT is capable of general-purpose planning.
- Therefore in later GPT versions they don't include the engine, but it's too late to remove it from gpt-3.5-turbo-instruct: people might accept the (specious) claim that GPT-4's size accidentally sabotaged its chess abilities, but they'll ask tough questions about performance degradation within the same model.
I realize this is convoluted and depends on conjecture. But OpenAI has a history with misleading demos - e.g. their Rubik's cube robot which in fact used a classical algorithm but was presented as reinforcement learning. I think "OpenAI lied" is the most likely scenario. It is far more likely than "OpenAI solved the problem honestly in GPT-3.5, but forgot how they did it with GPT-4," and a bit more likely than "scaling transformers slightly helps performance when playing Othello but severely sabotages performance when playing chess."
Eh, OpenAI really isn't as shady as hell, from what I've seen on the inside for 3 years. Rubik's cube hand was before me, but in my time here I haven't seen anything I'd call shady (though obviously the non-disparagement clauses were a misstep that's now been fixed). Most people are genuinely trying to build cool things and do right by our customers. I've never seen anyone try to cheat on evals or cheat customers, and we take our commitments on data privacy seriously.
I was one of the first people to play chess against the base GPT-4 model, and it blew my mind by how well it played. What many people don't realize is that chess performance is extremely sensitive to prompting. The reason gpt-3.5-turbo-instruct does so well is that it can be prompted to complete PGNs. All the other models use the chat format. This explains pretty much everything in the blog post. If you fine-tune a chat model, you can pretty easily recover the performance seen in 3.5-turbo-instruct.
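To make the prompting difference concrete, it's roughly the contrast below (a sketch; the PGN header, the prompts, and the chat model name are illustrative, not what the blog post used):

    from openai import OpenAI

    client = OpenAI()

    pgn = '[White "Garry Kasparov"]\n[Black "Magnus Carlsen"]\n\n1. e4 e5 2. Nf3 '

    # Completion-style: the model simply continues a PGN transcript.
    completion = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=pgn,
        max_tokens=5,
        temperature=0,
    )
    print(completion.choices[0].text)

    # Chat-style: the same request filtered through the chat format.
    chat = client.chat.completions.create(
        model="gpt-4o",  # any chat model; the name is illustrative
        messages=[{"role": "user",
                   "content": f"We are playing chess. The game so far is:\n{pgn}\n"
                              "What is Black's best next move?"}],
    )
    print(chat.choices[0].message.content)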
Very good scenario. One variation: some researcher or division in OpenAI performs all of the above steps to get a raise. The whole field is predicated on rewarding the appearance of ability.
It's pretty convoluted, requires a ton of steps, mind-reading, and odd sequencing.*
If you share every prior, and aren't particularly concerned with being disciplined in treating conversation as proposing a logical argument (I'm not myself; people find it off-putting), it probably wouldn't seem at all convoluted.
* Layer chess into gpt-3.5-instruct only, but not ChatGPT, not GPT-4, to defeat the naysayers when GPT-4 comes out? shrugs If the issues with that are unclear, I can lay it out more.
** FWIW, at the time, pre-ChatGPT, before the hype, there wasn't a huge focus on chess, nor a ton of naysayers to defeat. It would have been bizarre to put this much energy into it, modulo the scatter-brained thinking in *.
It's not that many steps. I'm sure we've all seen our sales teams selling features that aren't in the application or exaggerating features before they're fully complete.
To be clear, I'm not saying that the theory is true, just that I could believe something like that could happen.
This is likely. From the example games, it not only knows the rules (which would be impressive by itself; just making legal moves is not trivial), it also has some planning capability (it plays combinations of several moves).
Probably not calling out to one but it would not surprise me at all if they added more chess PGNs into their training data. Chess is a bit special in AI in that it’s still seen as a mark of pure intelligence in some respect.
If you tested it on an equally strategic but less popular game I highly doubt you would see the same performance.
I think that's the most plausible theory that would explain the sudden hike from gpt-3.5-turbo to gpt-3.5-turbo-instruct, and again the sudden regression in gpt-4*.
OpenAI also seem to augment the LLM with some type of VM or a Python interpreter. Maybe they run a simple chess engine such as Sunfish [1] which is around 1900-2000 ELO [2]?
Note: the possibility is not mentioned in the article but rather in the comments [1]. I had to click a bit to see it.
The fact that the one closed-source model is the only one that plays well seems to me like a clear case of the interface doing some of the work. If you ask ChatGPT to count to 10,000 (something that most LLMs can't do for known reasons) you get an answer that's clearly pre-programmed. I'm sure the same is happening here (and with many, many other tasks) - the author argues against it by saying "but why isn't it better?", which doesn't seem like the best argument: I can imagine that typical ChatGPT users enjoy the product more if they have a chance to win once in a while.
What do you mean LLMs can't count to 10,000 for known reasons?
Separately, if you are able to show OpenAI is serving pre canned responses in some instances, instead of running inference, you will get a ton of attention if you write it up.
I'm not saying this in an aggro tone, it's a genuinely interesting subject to me because I wrote off LLMs at first because I thought this was going on.* Then I spent the last couple years laughing at myself for thinking that they would do that. Would be some mix of fascinated and horrified to see it come full circle.
* I can't remember what, exactly; it was as far back as 2018. But someone argued that OpenAI was patching in individual answers because scaling was dead and they had no answers, way way before ChatGPT.
When it comes to counting, LLMs have a couple issues.
First, tokenization: the tokenization of 1229 is not guaranteed to be [1,2,2,9] but it could very well be [12,29] and the "+1" operation could easily generate tokens [123,0] depending on frequencies in your corpus. This constant shifting in tokens makes it really hard to learn rules for "+1" ([9,9] +1 is not [9,10]). This is also why LLMs tend to fail at tasks like "how many letters does this word have?": https://news.ycombinator.com/item?id=41058318
Second, you need your network to understand that "+1" is worth learning. Writing "+1" as a combination of sigmoid, products and additions over normalized floating point values (hello loss of precision) is not trivial without degrading a chunk of your network, and what for? After all, math is not in the domain of language and, since we're not training an LMM here, your loss function may miss it entirely.
And finally there's statistics: the three-legged-dog problem is figuring out that a dog has four legs from corpora when no one ever writes "the four-legged dog" because it's obvious, but every reference to an unusual dog will include said description. So if people write "1+1 equals 3" satirically then your network may pick that up as fact. And how often has your network seen the result of "6372 + 1"?
But you don't have to take my word for it - take an open LLM and ask it to generate integers between 7824 and 9954. I'm not optimistic that it will make it through without hallucinations.
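You can also see the inconsistent splitting directly with a tokenizer (a sketch assuming the tiktoken package and the cl100k_base encoding used by recent OpenAI models):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for n in [1229, 1230, 7824, 7825, 9954]:
        tokens = enc.encode(str(n))
        pieces = [enc.decode([t]) for t in tokens]
        print(f"{n}: {tokens} -> {pieces}")

    # Adjacent integers can split at completely different token boundaries,
    # so "+1" is not a simple operation at the token level.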
> But you don't have to take my word for it - take an open LLM and ask it to generate integers between 7824 and 9954.
Been excited to try this all day, finally got around to it: Llama 3.1 8B did it. It's my app built on llama.cpp, no shenanigans, temp 0, top-p 100, 4-bit quantization, model name in screenshot [^1].
I did 7824 to 8948, it protested more for 9954, which made me reconsider whether I'd want to read that many to double check :) and I figured x + 1024 is isomorphic to the original case of you trying on OpenAI and wondering if it wasn't the result of inference.
My prior was that of course it would do this; it's a sequence. I understand, e.g., the need for token healing in cases like you correctly note, where notation in an equation could prevent the "correct" digit. I don't see any reason why it'd mess up a sequential list of integers.
In general, as long as it's on topic: I find the handwaving people do about tokenization being a problem a bit silly, and I'd definitely caution against using the post you linked as a citation; it reads like a rote repetition of the idea that tokenization causes problems, an idea that spreads like a game of telephone.
It's also a perfect example of the weakness of the genre: just because the model sees [5077, 5068, 5938] instead of "strawberry" doesn't mean it can't infer 5077 = "st" = 0 r's, 5068 = "raw" = 1 r, 5938 = "berry" = 2 r's. In fact, it infers things from broken-up subsequences all the time -- that's how it works! If doing single-character tokenization got free math/counting reliability, we'd very quickly switch to it.
(not saying you're advocating for the argument or you're misinformed, just, speaking colloquially like I would with a friend over a beer)
Just think about the trade-off from OpenAI's side here - they're going to add a bunch of complexity to gpt-3.5 to let it call out to engines (either an external system monitoring all outputs for chess-related stuff, or some kind of tool-assisted CoT, for instance) just so it can play chess incorrectly a high percentage of the time, and even when it doesn't, at a mere 1800 Elo level? In return for some mentions in a few relatively obscure blog posts? Doesn't make any sense to me as an explanation.
But there could be a simple explanation. For example, they could have tested many "engines" when developing function calling and they just left them in there. They just happened to connect to a basic chess playing algorithm and nothing sophisticated.
Also, it makes a lot of sense if you expect people to play chess against the LLM, especially if you are later training future models on the chats.
This still requires a lot of coincidences, like they chose to use a terrible chess engine for their external tool (why?), they left it on in the background for all calls via all APIs for only gpt-3.5-turbo-instruct (why?), they see business value in this specific model being good at chess vs other things (why?).
You say it makes sense but how does it make sense for OpenAI to add overhead to all of its API calls for the super niche case of people playing 1800 ELO chess/chat bots? (that often play illegal moves, you can go try it yourself)
Could be a pilot implementation to learn about how to link up external specialist engines. Chess would be the obvious example to start with because the problem is so well known and standardized, and specialist engines are easily available. If they ever want to offer an integration like that to customers (who might have some existing rule-based engine in-house), they need to know everything they can about expected cost and performance.
This doesn't address its terrible performance. If it were touching anything like a real engine it would be playing at a superhuman level, not at the level of an upper-tier beginner.
The way I read the article is that it's just as terrible as you would expect it to be from pure word association, except for one version that's an outlier in not being terrible at all within a well defined search depth, and again just as terrible beyond that. And only this outlier is the weird thing referenced in the headline.
I read this as saying that this outlier version is connecting to an engine, and that this engine happens to be parameterized for a not particularly deep search depth.
If it's an exercise in integration they don't need to waste cycles on the engine playing awesome - it's enough for validation if the integration result is noticeably less bad than the LLM alone rambling about trying to sound like a chess expert.
In this hypothetical, the cycles aren't being wasted on the engine, they're being wasted on running a 200b parameter LLM for longer than necessary in order to play chess badly instead of terribly. An engine playing superhuman chess takes a comparatively irrelevant amount of compute these days.
If it's fine up to a certain depth it's much more likely that it was trained on an opening book imo.
What nobody has bothered to try and explain with this crazy theory is why OpenAI would care to do this at enormous expense to themselves.
> If it's fine up to a certain depth it's much more likely that it was trained on an opening book imo.
Yeah, that thought crossed my mind as well. I dismissed that thought on the assumption that the measurements in the blog post weren't done from openings but from later stage game states, but I did not verify that assumption, I might have been wrong.
As for the insignificance of game cycles vs LLM cycles, sure. But if it's an integration experiment they might buy the chess API from some external service with a big disconnect between prices and cycle cost, or host one separately where they simply did not feel any need to bother with a scaling mechanism if they could make it good enough for detection by calling with low depth parameters.
And the last uncertainty, here I'm much further out of my knowledge: we don't know how many calls to the engine a single prompt might cause. Who knows how many cycles of "inner dialogue" refinement might run for a single prompt, and how often the chess engine might get consulted for prompts that aren't really related to chess before the guessing machine finally rejects that possibility. The number of chess engine calls might be massive, big enough to make cycles per call a meaningful factor again.
That would have immediately given away that something must be off. If you want to do this in a subtle way that increases the hype around GPT-3.5 at the time, giving it a good-but-not-too-good rating would be the way to go.
You're the one imposing an additional criterion, that OpenAI must have chosen the highest setting on a chess engine, and demanding that this additional criterion be used to explain the facts.
I agree with GP that if a 'fine tuning' of GPT 3.5 came out the gate playing at top Stockfish level, people would have been extremely suspicious of that. So in my accounting of the unknowns here, the fact that it doesn't play at the top level provides no additional information with which to resolve the question.
That's not an additional criterion, it's simply the most likely version of this hypothetical - a superhuman engine is much easier to integrate than an 1800 elo engine that makes invalid moves, for the simple reason that the vast majority of chess engines play at >1800 elo out of the box and don't make invalid moves ever (they are way past that level on a log-scale, actually).
This doesn't require the "highest" settings, it requires any settings whatsoever.
But anyway to spell out some of the huge list of unjustified conditions here:
1. OpenAI spent a lot of time and money R&Ding chess into 3.5-turbo-instruct via external call.
2. They used a terrible chess engine for some reason.
3. They did this deliberately because they didn't want to get "caught" for some reason.
4. They removed this functionality in all other versions of gpt for some reason ...etc
Much simpler theory:
1. They used more chess data training that model.
(there are other competing much simpler theories too)
My point is that given a prior of 'wired in a chess engine', my posterior odds that they would make it plausibly-good and not implausibly-good approach one.
For a variety of boring reasons, I'm nearly convinced that what they did was either, as you say, train heavily on chess texts, or a plausible variation of using mixture-of-experts and having one of them be an LLM chess savant.
Most of the sources I can find on the Elo of Stockfish at the lowest setting are around 1350, so that part also contributes no weight to the odds, because it's trivially possible to field a weak chess engine.
The distinction between prior and posterior odds is critical here. Given a decision to cheat (which I believe is counterfactual on priors), all of the things you're trying to Occam's Razor here are trivially easy to do.
So the only interesting considerations are the ones which factor into the likelihood of them deciding to cheat. If you even want to call it that, shelling out to a chess engine is defensible, although the stochastic fault injection (which is five lines of Python) in that explanation of the data does feel like cheating to me.
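(For reference, that kind of stochastic fault injection really is only a handful of lines; a sketch, assuming the python-chess package and a UCI engine such as Stockfish on the PATH:)

    import random
    import chess
    import chess.engine

    def move_with_faults(board, engine, blunder_rate=0.15):
        # Usually return the engine's move, occasionally a random legal one.
        if random.random() < blunder_rate:
            return random.choice(list(board.legal_moves))
        return engine.play(board, chess.engine.Limit(time=0.05)).move

    # engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    # board = chess.Board()
    # print(board.san(move_with_faults(board, engine)))
    # engine.quit()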
What I do consider relevant is that, based on what I know of LLMs, intensively training one to emit chess tokens seems almost banal in terms of outcomes. Also, while I don't trust OpenAI company culture much, I do think they're more interested in 'legitimately' weighting their products to pass benchmarks, or just building stuff with LLMs if you prefer.
I actually think their product would benefit from more code which detects "stuff normal programs should be doing" and uses them. There's been somewhat of a trend toward that, which makes the whole chatbot more useful. But I don't think that's what happened with this one edition of GPT 3.5.
Sorry, this is just conspiracy theorizing. I've tried it with GPT-3.5-instruct myself in the OpenAI playground, where the model clearly does nothing but auto-regression. No function calling there whatsoever.
Occam’s razor. I could build a good chess-playing wrapper around the OpenAI API (any version) that would consult a chess engine when presented with any board scenario, and introduce some randomness so that it doesn’t play too well.
I can’t imagine any programmer in this thread would be entertaining a more complicated scenario than this. You can substitute chess for any formal system that has a reliable oracle.
I don't necessarily believe this for a second but I'm going to suggest it because I'm feeling spicy.
OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
Multiple comments in this thread also say they couldn't reproduce the results for gpt3.5-turbo-instruct.
So what if the OP just happened to test at a time, or be IP bound to an instance, where the model was not nerfed? What if 3.5 and all subsequent OpenAI models can perform at this level but it's not strategic or cost effective for OpenAI to expose that consistently?
For the record, I don't actually believe this. But given the data it's a logical possibility.
> OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
When ChatGPT3.5 first came out, people were using it to simulate entire Linux system installs, and even browsing a simulated Internet.
Cool use cases like that aren't even discussed anymore.
I still wonder what sort of magic OpenAI had and then locked up away from the world in the name of cost savings.
Same thing with GPT 4 vs 4o, 4o is obviously worse in some ways, but after the initial release (when a bunch of people mentioned this), the issue has just been collectively ignored.
There is a small difference between the app and the browser. Before each session, the LLM is started with a system prompt. These are different for the app and the browser. You can find them online somewhere, but IIRC the app is instructed to give shorter answers.
Correct, it's different in a mobile browser too, the system prompt tells it to be brief/succinct. I always switch to desktop mode when using it on my phone.
We know from experience with different humans that there are different types of skills and different types of intelligence. Some savants might be superhuman at one task but basically mentally disabled at all other things.
It could be that the model that does chess well just happens to have the right 'connectome' purely by accident of how the various back-propagations worked out to land on various local maxima (model weights) during training. It might even be (probably is) a non-verbal connectome that's just purely logic rules, having nothing to do with language at all, but a semantic space pattern that got landed on accidentally, which can solve this class of problem.
Reminds me of how Daniel Tammet just visually "sees" answers to math problems in his mind without even knowing how they appear. It's like he sees a virtual screen with a representation akin to numbers (the answer) just sitting there to be read out from his visual cortex. He's not 'working out' the solutions. They're just handed to him purely by some connectome effects going on in the background.
Huh. Honestly, your answer makes more sense, LLMs shouldn’t be good at chess, and this anomaly looks more like a bug. Maybe the author should share his code so it can be replicated.
Your issue is that the performance of these models at chess is incredibly sensitive to the prompt. If you have gpt-3.5-turbo-instruct complete a PGN transcript, then you'll see performance in the 1800 Elo range. If you ask in English or diagram the board, you'll see vastly degraded performance.
Unlike people, how you ask the question really really affects the output quality.
I agree with some of the other comments here that the prompt is limiting. The model can't do any computation without emitting tokens and limiting the numbers of tokens it can emit is going to limit the skill of the model. It's surprising that any model at all is capable of performing well with this prompt in fact.
I remember one of the early "breakthroughs" for LLMs in chess was that they could actually play legal moves(!) In all of these games, are the models always playing legal moves? I don't think the article says. The fact that an LLM can even reliably play legal moves, 20+ moves into a chess game, is somewhat remarkable. It needs to have an accurate representation of the board state even though it was only trained on next-token prediction.
I did a very unscientific test and it did seem to just play legal moves. Not only that, if I did an illegal move it would tell me that I couldn't do it.
I think I said that I wanted to play with new rules, where a queen could jump over any pawn, and it let me make that rule change -- and we played with this new rule. Unfortunately, I was trying to play in my head and I got mixed up and ended up losing my queen. Then I changed the rule one more time -- if you take the queen you lose -- so I won!
The author explains what they did: restrict the move options to valid ones when possible (for open models with the ability to enforce grammar during inference) or sample the model for a valid move up to ten times, then pick a random valid move.
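In code, that fallback loop is roughly the sketch below (python-chess for legality checking; `ask_model` stands in for whatever call returns a candidate move string and is not from the post):

    import random
    import chess

    def next_move(board, ask_model, max_tries=10):
        # Sample the model for a move in SAN; fall back to a random legal move.
        for _ in range(max_tries):
            candidate = ask_model(board).strip()
            try:
                return board.parse_san(candidate)
            except ValueError:
                continue  # illegal or unparseable; sample again
        return random.choice(list(board.legal_moves))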
My money is on a fluke inclusion of more chess data in that model's training.
All the other models do vaguely similarly well on other tasks and are in many cases architecturally similar, so training data is the most likely explanation.
I feel like a lot of people here are slightly misunderstanding how LLM training works. Yes, the base models are trained somewhat blind on masses of text, but then they're heavily fine-tuned with custom, human-generated reinforcement learning, not just for safety but for any desired feature.
These companies do quirky one-off training experiments all the time. I would not be remotely shocked if at some point OpenAI paid some trainers to input and favour strong chess moves.
Data preprocessing. The GPT-4 pretraining dataset included chess games in the format of move sequence known as Portable Game Notation (PGN). We note that only games with players of Elo 1800 or higher were included in pretraining. These games still include the moves that were played in-game, rather than the best moves in the corresponding positions. On the other hand, the chess puzzles require the model to predict the best move. We use the dataset originally introduced in Schwarzschild et al. (2021b) which is sourced from https://database.lichess.org/#puzzles (see also Schwarzschild et al., 2021a). We only evaluate the models ability to predict the first move of the puzzle (some of the puzzles require making multiple moves). We follow the pretraining format, and convert each puzzle to a list of moves leading up to the puzzle position, as illustrated in Figure 14. We use 50k puzzles sampled randomly from the dataset as the training set for the weak models and another 50k for weak-to-strong finetuning, and evaluate on 5k puzzles. For bootstrapping (Section 4.3.1), we use a new set of 50k puzzles from the same distribution for each step of the process."
Keep in mind, everyone, that stockfish on its lowest level on lichess is absolutely terrible, and a 5-year-old human who'd been playing chess for a few months could beat it regularly. It hangs pieces, makes -3 blunders, totally random-looking bad moves.
But still, yes, something maybe a teeny tiny bit weird is going on, in the sense that only one of the LLMs could beat it. The arxiv paper that came out recently was much more "weird" and interesting than this, though. This will probably be met with a mundane explanation soon enough, I'd guess.
Here's a quick anonymous game against it by me, where I obliterate the poor thing in 11 moves. I was around a 1500 Elo classical-strength player, which is a teeny bit above average, globally. But I mean - not an expert, or even one of the "strong" club players (in any good club).
https://lichess.org/ -- try it yourself! It's really so bad, it's good fun. Click "play with computer" on the right; level 1 is already selected, then hit go.
Definitely weird results, but I feel there are too many variables to learn much from it. A couple things:
1. The author mentioned that tokenization causes something minuscule, like a " " at the end of the input, to shatter the model's capabilities. Is it possible other slightly different formatting changes in the input could raise capabilities?
2. Temperature was 0.7 for all models. What if it wasn't? Isn't there a chance one or more models would perform significantly better with higher or lower temperatures?
Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any meaning from this...
I’ve also been experimenting with Chess and LLMs but have taken a slightly different approach. Rather than using the LLM as an opponent, I’ve implemented it as a chess tutor to provide feedback on both the user’s and the bot’s moves throughout the game.
The responses vary with the user’s chess level; some find the feedback useful, while others do not. To address this, I’ve integrated a like, dislike, and request new feedback feature into the app, allowing users to actively seek better feedback.
Btw, different from OP's setup, I opted to input the FEN of the current board and the subsequent move in standard algebraic notation to request feedback, as I found these inputs to be clearer for the LLM compared to giving the PGN of the game.
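Concretely, that kind of FEN-plus-move prompt is easy to assemble (a sketch; the tutor wording here is illustrative, and python-chess gives you the FEN in one call):

    import chess

    board = chess.Board()
    for san in ["e4", "e5", "Nf3", "Nc6", "Bb5"]:
        board.push_san(san)

    user_move = "a6"
    prompt = (
        f"Position (FEN): {board.fen()}\n"
        f"The player now intends to play: {user_move}\n"
        "As a chess tutor, briefly explain whether this move is sound and why."
    )
    print(prompt)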
Yeah, I was wondering why the featured article's author did not use Forsyth–Edwards Notation (FEN) and more complicated chess prompts.
BTW, a year ago when I used FEN for chess playing, LLMs would very quickly/often make illegal moves. (The article prompts me to check whether that has changed...)
If you look at the comments under the post, the author commented 25 minutes ago (as of me posting this)
> Update: OK, I actually think I've figured out what's causing this. I'll explain in a future post, but in the meantime, here's a hint: I think NO ONE has hit on the correct explanation!
My understanding of this is the following:
All the bad models are chat models, somehow "generation 2 LLMs", which are not just text completion models but are instead trained to behave as a chatting agent. The only good model is the only "generation 1 LLM" here, which is gpt-3.5-turbo-instruct. It is a straightforward text completion model. If you prompt it to "get in the mind" of PGN completion, then it can use some kind of system 1 thinking to give a decent approximation of the PGN Markov process. If you attempt to use a chat model it doesn't work, since these stochastic pathways somehow degenerate during the training to be a chat agent. You can however play chess with system 2 thinking, and the more advanced chat models are trying to do that and should get better at it while still being bad.
I don't think one model is statistically significant. As people have pointed out, it could have chess specific responses that the others do not. There should be at least another one or two, preferably unrelated, "good" data points before you can claim there is a pattern. Also, where's Claude?
There are other transformers that have been trained on chess text that play chess fine (just not as good as 3.5 Turbo instruct with the exception of the "grandmaster level without search" paper).
I don’t think it would have an impact great enough to explain the discrepancies you saw, but some chess engines on very low difficulty settings make “dumb” moves sometimes. I’m not great at chess and I have trouble against them sometimes because they don’t make the kind of mistakes humans make. Moving the difficulty up a bit makes the games more predictable, in that you can predict and force an outcome without the computer blowing it with a random bad move. Maybe part of the problem is them not dealing with random moves well.
I think an interesting challenge would be looking at a board configuration and scoring it on how likely it is to be real - something high ranked chess players can do without much thought (telling a random setup of pieces from a game in progress).
I would be very curious to know what would be the results with a temperature closer to 1. I don't really understand why he did not test the effect of different temperature on his results.
Here, basically, you would like the "best" or "most probable" answer. With 0.7 you ask the LLM to be more creative, meaning it randomly picks among less probable moves. This temperature is even slightly lower than what is commonly used for chat assistants (around 0.8).
Ok whoah, assuming the chess powers on gpt3.5-instruct are just a result of training focus then we don't have to wait on bigger models, we just need to fine tune on 175B?
An easy way to make all LLMs somewhat good at chess is to make a Chess Eval that you publish and get traction with. Suddenly you will find that all newer frontier models are half decent at chess.
I would be interested to know if the good result is repeatable. We had a similar result with a quirky chat interface in that one run gave great results (and we kept the video) but then we couldn't do it again. The cynical among us think there was a mechanical turk involved in our good run.
The economics of venture capital means that there is enormous pressure to justify techniques that we think of as "cheating". And of course the companies involved have the resources.
Source: I'm at OpenAI and I was one of the first people to ever play chess against the GPT-4 base model. You may or may not trust OpenAI, but we're just a group of people trying earnestly to build cool stuff. I've never seen any inkling of an attempt to cheat evals or cheat customers.
It would be really cool if someone could get an LLM to actually launch an anonymous game on Chess.com or Lichess and actually have any sense as to what it’s doing.[1] Some people say that you have to represent the board in a certain way. When I first tried to play chess with an LLM, I would just list out a move and it didn’t do very well at all.
> And then I tried gpt-3.5-turbo-instruct. This is a closed OpenAI model, so details are very murky.
How do you know it didn't just write a script that uses a chess engine and then execute the script? That IMO is the easiest explanation.
Also, I looked at the gpt-3.5-turbo-instruct example victory. One side played with 70% accuracy and the other was 77%. IMO that's not on par with 27XX ELO.
The trick to getting a model to perform on something is to have it as a training data subset.
OpenAI might have thought Chess is good to optimize for but it wasn't seen as useful so they dropped it.
This is what people refer to as "lobotomy", ai models are wasting compute on knowing how loud the cicadas are and how wide the green cockroach is when mating.
Good models are about the training data you push in em
They probably concluded that the additional cost of training those models on chess would not be cost-effective, and dropped chess from their training process for the moment.
That is to say, we can say literally anything because this is all very shadowy/murky, but since everything is likely a question of money... it should, _probably_, not be very far from the truth...
I assume LLMs will be fairly average at chess for the same reason they can't count the Rs in "strawberry" - they're reflecting the training set and not using any underlying logic? Granted my understanding of LLMs is not very sophisticated, but I would be surprised if the Reward Models used were able to distinguish high-quality moves from subpar moves...
LLMs can't count the Rs in strawberry because of tokenization. Words are converted to vectors (numbers), so the actual transformer network never sees the letters that make up the word.
ChatGPT doesn't see "strawberry", it sees [302, 1618, 19772]
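You can see this for yourself with the tiktoken library (sketch below; the exact IDs depend on which encoding the model uses, so don't read anything into the specific numbers):

    # Sketch: inspect how a GPT-style tokenizer splits "strawberry".
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")
    ids = enc.encode("strawberry")
    print(ids)                             # a few integer token IDs
    print([enc.decode([i]) for i in ids])  # sub-word chunks, not letters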
Well that makes sense when you consider the game has been translated into an (I'm assuming monotonically increasing) alphanumeric representation. So, just like language, you're given an ordered list of tokens and you need to find the next token that provides the highest confidence.
Has anyone tried to see how many chess games models are trained on? Is there any chance they consume lichess database dumps, or something similar? I guess the problem is most (all?) top LLMs, even open-weight ones, don’t reveal their training data. But I’m not sure.
For me it’s not only the chess. Chats get more chatty, but knowledge and fact-wise - it’s a sad comedy. Yes, you get a buddy to talk with, but he is talking pure nonsense.
If it was trained with moves and hundreds of thousands of entire games of various levels, I could see it generating good moves and beating most players except the high-Elo ones.
So if you squint, chess can be considered a formal system. Let’s plug ZFC or PA into gpt-3.5-turbo-instruct along with an interesting theorem and see what happens, no?
"It lost every single game, even though Stockfish was on the lowest setting."
It's not playing against a GM, the prompt just phrases it this way. I couldn't pinpoint the exact ELO of "lowest" stockfish settings, but it should be roughly between 1000 and 1400, which is far from professional play.
Perhaps my understanding of LLMs is quite shallow, but instead of the current purely statistical methods, would it be possible to somehow train GPT to reason by providing instructions on deductive reasoning? Perhaps not semantic reasoning, but syntactic at least?
My guess is they just trained gpt3.5-turbo-instruct on a lot of chess, much more than is in e.g. CommonCrawl, in order to boost it on that task. Then they didn't do this for other models.
People are alleging that OpenAI is calling out to a chess engine, but seem to be not considering this less scandalous possibility.
Of course, to the extent people are touting chess performance as evidence of general reasoning capabilities, OpenAI taking costly actions to boost specifically chess performance and not being transparent about it is still frustrating and, imo, dishonest.
My friend pointed out that the Q5_K_M quantization used for the open-source models probably substantially reduces the quality of play. o1-mini's poor performance is puzzling, though.
Is it just me or does the author swap descriptions of the instruction finetuned and the base gpt-3.5-turbo?
It seemed like the best model was labeled instruct, but the text saying instruct did worse?
If this isn't just a bad result, it's odd to me that the author at no point suggests what sounds to me like the most obvious answer: that OpenAI has deliberately enhanced GPT-3.5-turbo-instruct's chess playing, either with post-processing or literally by training it to be so.
All of the LLM models tested playing chess performed terribly bad against Stockfish engine except gpt-3.5-turbo-instruct, which is a closed OpenAI model.
If tokenization is such a big problem, then why aren't we training new base models on randomly non-tokenized data? e.g. during training, randomly substitute some percentage of the input tokens with individual letters.
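Something like this at data-preparation time, I imagine (just my own sketch of the idea, not an existing training recipe):

    # Sketch: with probability p, spell a word out character by character so the
    # model occasionally sees letter-level structure next to normal tokens.
    import random

    def char_dropout(text: str, p: float = 0.1, seed: int = 0) -> str:
        rng = random.Random(seed)
        out = []
        for word in text.split():
            out.append(" ".join(word) if rng.random() < p else word)
        return " ".join(out)

    print(char_dropout("how many r letters are in strawberry", p=0.3))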
LLMs aren't really language models so much as they are token models. That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer. If you can make it a token, the model will try to predict the following ones.
Even though I'm sure chess matches were used in some of the LLM training, I'd bet a model trained just for chess would do far better.
Ah, an overloaded "tokenizer" meaning. "split into tokens" vs "turned into a single embedding matching a token" I've never heard it used that way before, but it makes sense kinda.
I feel like the article neglects one obvious possibility: that OpenAI decided that chess was a benchmark worth "winning", special-cases chess within gpt-3.5-turbo-instruct, and then neglected to add that special-case to follow-up models since it wasn't generating sustained press coverage.
This is exactly it. Here’s the pull request where chess evals were added: https://github.com/openai/evals/pull/45.
I suspect the same thing. Rather than LLMs “learning to play chess,” they “learnt” to recognise a chess game and hand over instructions to a chess engine. If that’s the case, I don’t feel impressed at all.
This is exactly what I feel AI needs. A manager AI that then hands off things to specialized more deterministic algorithms/machines.
Next thing, the "manager AIs" start stack ranking the specialized "worker AIs".
And the worker AIs "evolve" to meet/exceed expectations only on tasks directly contributing to KPIs the manager AIs measure for - via the mechanism of discarding the "less fit to exceed KPIs".
And some of the worker AIs who're trained on recent/polluted internet happen to spit out prompt injection attacks that work against the manager AIs rank stacking metrics and dominate over "less fit" worker AIs. (Congratulations, we've evolved AI cancer!) These manager AIs start performing spectacularly badly compared to other non-cancerous manager AIs, and die or get killed off by the VC's paying for their datacenters.
Competing manager AIs get training, perhaps on on newer HN posts discussing this emergent behavior of worker AIs, and start to down rank any exceptionally performing worker AIs. The overall trends towards mediocrity becomes inevitable.
Some greybread writes some Perl and regexes that outcompete commercial manager AIs on pretty much every real world task, while running on a 10 year old laptop instead of a cluster of nuclear powered AI datacenters all consuming a city's worth of fresh drinking water.
Nobody in powerful positions care. Humanity dies.
And “comment of the year” award goes to.
Sorry for the filler but this is amazingly put and so true.
We’ll get so many unintended consequences that are opposite any worthy goals when it’s AIs talking to AIs in a few years.
Basically what Wolfram Alpha rolled out 15 years ago.
It was impressive then, too.
It is good to see other people buttressing Stephen Wolfram's ego. It is extraordinarily heavy work and Stephen can't handle it all by himself.
While deterministic components may be a left-brain default, there is no reason that such delegate services couldn't be more specialized ANN models themselves. It would most likely vastly improve performance if they were evaluated in the same memory space using tensor connectivity. In the specific case of chess, it is helpful to remember that AlphaZero utilizes ANNs as well.
Multi Agent LLM's are already a thing.
Somehow they're not in the limelight, and lack a well-known open-source runner implementation (like llama.cpp).
Given the potential, they should be winning hands down; where's that?
That's something completely different than what the OP suggests and would be a scandal if true (i.e. gpt-3.5-turbo-instruct actually using something else behind the scenes).
Ironically it's probably a lot closer to what a super-human AGI would look like in practice, compared to just an LLM alone.
Right. To me, this is the "agency" thing, that I still feel like is somewhat missing in contemporary AI, despite all the focus on "agents".
If I tell an "agent", whether human or artificial, to win at chess, it is a good decision for that agent to decide to delegate that task to a system that is good at chess. This would be obvious to a human agent, so presumably it should be obvious to an AI as well.
This isn't useful for AI researchers, I suppose, but it's more useful as a tool.
(This may all be a good thing, as giving AIs true agency seems scary.)
If this was part of the offering: “we can recognise requests and delegate them to appropriate systems,” I’d understand and be somewhat impressed but the marketing hype is missing this out.
Most likely because they want people to think the system is better than it is for hype purposes.
I should temper my level of impressed with only if it’s doing this dynamically . Hardcoding recognition of chess moves isn’t exactly a difficult trick to pull given there’s like 3 standard formats…
You're speaking like it's confirmed. Do you have any proof?
Again, the comment you initially responded to was not talking about faking it by using a chess engine. You were the one introducing that theory.
No, I don’t have proof and I never suggested I did. Yes, it’s 100% hypothetical but I assumed everyone engaging with me understood that.
Fair!
So… we’re at expert systems again?
That’s how the AI winter started last time.
What is an "expert system" to you? In AI they're just series of if-then statements to encode certain rules. What non-trivial part of an LLM reaching out to a chess AI does that describe?
The initial LLM acts as an intention detection mechanism switch.
To personify LLM way too much:
It sees that a prompt of some kind wants to play chess.
Knowing this it looks at the bag of “tools” and sees a chess tool. It then generates a response which eventually causes a call to a chess AI (or just chess program, potentially) which does further processing.
The first LLM acts as a ton of if-then statements, but automatically generated (or discovered by brute force) through training.
You still needed discrete parts for this system. Some communication protocol, an intent detection step, a chess execution step, etc…
I don’t see how that differs from a classic expert system other than the if statement is handled by a statistical model.
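As a toy illustration of that "statistical if-statement plus hand-off" idea (purely hypothetical - nobody has confirmed any vendor does this - with a regex standing in for the intent detector and placeholder functions for the two back ends):

    # Hypothetical sketch of "LLM as intent switch": detect chess-looking input
    # and hand it to an engine, otherwise fall through to the general model.
    import re

    SAN_MOVE = re.compile(r"\b(?:[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?|O-O(?:-O)?)[+#]?")

    def ask_chess_engine(prompt: str) -> str:
        return "[would call a chess engine here]"   # placeholder

    def ask_general_llm(prompt: str) -> str:
        return "[would call the general LLM here]"  # placeholder

    def route(prompt: str) -> str:
        if len(SAN_MOVE.findall(prompt)) >= 2:      # crude "this looks like chess" check
            return ask_chess_engine(prompt)
        return ask_general_llm(prompt)

    print(route("1. e4 e5 2. Nf3 Nc6 - what should white play?"))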
The point of creating a service like this is for it to be useful, and if recognizing and handing off tasks to specialized agents isn't useful, i don't know what is.
If I was sold a product that can generically solve problems I’d feel a bit ripped off if I’m told after purchase that I need to build my own problem solver and way to recognise it…
But it already hands off plenty of stuff to things like python. How would this be any different.
If you mean “uses bin/python to run Python code it wrote” then that’s a bit different to “recognises chess moves and feeds them to Stockfish.”
If a human said they could code, you don’t expect them to somehow turn into a Python interpreter and execute it in their brain. If a human said they could play chess, I’d raise an eyebrow if they just played the moves Stockfish gave them against me.
If they came out and said it, I don’t see the problem. LLM’s aren’t the solution for a wide range of problems. They are a new tool but not everything is a nail.
I mean it already hands off a wide range of tasks to python… this would be no different.
TBH I think a good AI would have access to a Swiss army knife of tools and know how to use them. For example, with a complicated math equation, using a calculator is just smarter than doing it in your head.
We already have the chess "calculator", though. It's called stockfish. I don't know why you'd ask a dictionary how to solve a math problem.
Chess might not be a great example, given that most people interested in analyzing chess moves probably know that chess engines exist. But it's easy to find examples where this approach would be very helpful.
If I'm an undergrad doing a math assignment and want to check an answer, I may have no idea that symbolic algebra tools exist or how to use them. But if an all-purpose LLM gets a screenshot of a math equation and knows that its best option is to pass it along to one of those tools, that's valuable to me even if it isn't valuable to a mathematician who would have just cut out of the LLM middle-man and gone straight to the solver.
There are probably a billion examples like this. I'd imagine lots of people are clueless that software exists which can help them with some problem they have, so an LLM would be helpful for discovery even if it's just acting as a pass-through.
Even knowing that the software exists isn't enough. You have to learn how to use the thing.
A generalist AI with a "chatty" interface that delegates to specialized modules for specific problem-solving seems like a good system to me.
"It looks like you're writing a letter" ;)
Let's clip this in the bud before it grows wings.
It looks like you have a deja vu
People ask LLM’s to do all sorts of things they’re not good at.
You take a picture of a chess board and send it to ChatGPT and it replies with the current evaluation and the best move/strategy for black and white.
Recognize and hand over to a specialist engine? That might be useful for AI. Maybe I am missing something.
It's because this is standard practice since the early days - there's nothing newsworthy in this at all.
How do you think AI are (correctly) solving simple mathematical questions which they have not trained for directly? They hand it over to a specialist maths engine.
This is a relatively recent development (<3 months), at least for OpenAI, where the model will generate code to solve math and use the response
They’ve been doing that a lot longer than three months. ChatGPT has been handing stuff off to python for a very long time. At least for my paid account anyway.
Wasn't that the basis of computing and technology in general? Here is one tedious thing, let's have a specific tool that handles it instead of wasting time and efforts. The fact is that properly using the tool takes training and most of current AI marketing are hyping that you don't need that. Instead, hand over the problem to a GPT and it will "magically" solve it.
It is and would be useful, but it would be quite a big lie to the public, but more importantly to paying customers, and even more importantly to investors.
The problem is simply that the company has not been open about how it works, so we're all just speculating here.
If I was sold a general AI problem solving system, I’d feel ripped off if I learned that I needed to build my own problem solver and hook it up after I’d paid my money…
That's not much different from a compiler being rigged to recognize a specific benchmark program and spit out a canned optimization.
.. or a Volkswagen recognising an emissions test and turning off power mode...
This seems quite likely to me, but did they special case it by reinforcement training it into the LLM (which would be extremely interesting in how they did it and what its internal representation looks like) or is it just that when you make an API call to OpenAI, the machine on the other end is not just a zillion-parameter LLM but also runs an instance of Stockfish?
That's easy to test, invent a new chess variant and see how the model does.
You're imagining that LLMs do more than regurgitate and recombine things they have seen before. A new variant would not be in the dataset, so it would not be understood. In fact this is quite a good way to show LLMs are NOT thinking or understanding anything in the way we understand it.
Yes, that's how you can really tell if the model is doing real thinking and not recombinating things. If it can correctly play a novel game, then it's doing more than that.
I wonder what the minimal amount of change qualifies as novel?
"Chess but white and black swap their knights" for example?
I wonder what would happen with a game that is mostly chess (or chess with truly minimal variations) but with all the names changed (pieces, moves, "check", etc, all changed). The algebraic notation is also replaced with something else so it cannot be pattern matched against the training data. Then you list the rules (which are mostly the same as chess).
None of these changes are explained to the LLM, so if it can tell it's still chess, it must deduce this on its own.
Would any LLM be able to play at a decent level?
Nice. Even the tiniest rule, I strongly suspect, would throw off pattern matching. “Every second move, swap the name of the piece you move to the last piece you moved.”
By that standard (and it is a good standard), none of these "AI" things are doing any thinking
musical goalposts, gotta love it.
These LLM's just exhibited agency.
Swallow your pride.
"Does it generalize past the training data" has been a pre-registered goalpost since before the attention transformer architecture came on the scene.
Learning is thinking, reasoning, what have you.
Move goalposts, re-define words, it won't matter.
No LLM model is doing any thinking.
How do you define thinking?
Being fast at doing linear algebra computations. (Is there any other kind?!)
Making the OP feel threatened/emotionally attached/both enough to call the language model a rival / companion / peer instead of a tool.
Lolol. It's a chess thread, say it.
We are pawns, hoping to be maybe a Rook to the King by endgame.
Some think we can promote our pawns to Queens to match.
Luckily, the Jester muses!
You say this quite confidently, but LLMs do generalize somewhat.
In both scenarios it would perform poorly on that.
If the chess specialization was done through reinforcement learning, that's not going to transfer to your new variant, any more than access to Stockfish would help it.
Both an LLM and Stockfish would fail that test.
Nobody is claiming that Stockfish is learning generalizable concepts that can one day meaningfully replace people in value creating work.
The point was such a question could not be used to tell whether the llm was calling a chess engine
Ah okay, I missed that.
Of course it's a benchmark worth winning, has been since Watson. And before that even with mechanical turks.
To be fair, they say
> Theory 2: GPT-3.5-instruct was trained on more chess games.
If that were the case, pumping big Llama chock full of chess games would produce good results. It didn't.
The only way it could be true is if that model recognized and replayed the answer to the game from memory.
Do you have a link to the results from fine-tuning a Llama model on chess? How do they compare to the base models in the article here?
Why couldn't they add a tool that literally calls stockfish or a chess ai behind the scenes with function calling and buffer the request before sending it back to the endpoint output interface?
As long as you are training it to make a tool call, you can add and remove anything you want behind the inference endpoint accessible to the public, and then you can plug the answer back into the chat ai, pass it through a moderation filter, and you might get good output from it with very little latency added.
Yes, came here to say exactly this. And it's possible this specific model is "cheating", for example by identifying a chess problem and forwarding it to a chess engine. A modern version of the Mechanical Turk.
That's the problem with closed models, we can never know what they're doing.
Maybe they even delegate it to a chess engine internally via the tool use and the LLM uses that.
Important testing excerpts:
- "...for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly."
- "I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization"
- "...if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1 e4 e5 2.” (without a space)"
- "I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models."
Between the tokenizer weirdness, temperature, quantization, random moves, and the chess prompt, there's a lot going on here. I'm unsure how to interpret the results. Fascinating article though!
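For what it's worth, the closed-model loop described in those excerpts presumably looks something like this sketch (python-chess for legality checking; sample_move_from_llm is a placeholder, not the author's code):

    # Sketch of "retry up to 10 times, then pick a random legal move".
    import random
    import chess

    def next_move(board: chess.Board, sample_move_from_llm) -> chess.Move:
        for _ in range(10):
            text = sample_move_from_llm(board)        # placeholder LLM call
            try:
                return board.parse_san(text.strip())  # raises ValueError if not a legal move
            except ValueError:
                continue
        return random.choice(list(board.legal_moves)) # fallback: random legal move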
Ah, buried in the post-article part. I was wondering how all of the models were seemingly capable of making legal moves, since last I saw something about LLMs playing Chess they were very much not capable of that.
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.
I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.
Surprised I don't see more research into radically different tokenization.
FWIW I think most of the "tokenization problems" are in fact reasoning problems being falsely blamed on a minor technical thing when the issue is much more profound.
E.g. I still see people claiming that LLMs are bad at basic counting because of tokenization, but the same LLM counts perfectly well if you use chain-of-thought prompting. So it can't be explained by tokenization! The problem is reasoning: the LLM needs a human to tell it that a counting problem can be accurately solved if they go step-by-step. Without this assistance the LLM is likely to simply guess.
The more obvious alternative is that CoT is making up for the deficiencies in tokenization, which I believe is the case.
I think the more obvious explanation has to do with computational complexity: counting is an O(n) problem, but transformer LLMs can’t solve O(n) problems unless you use CoT prompting: https://arxiv.org/abs/2310.07923
This paper does not support your position any more than it supports the position that the problem is tokenization.
This paper posits that if the authors' intuition were true then they would find certain empirical results, i.e. "If A then B." Then they test and find the empirical results. But this does not imply that their intuition was correct, just as "If A then B" does not imply "If B then A."
If the empirical results were due to tokenization absolutely nothing about this paper would change.
What you're saying is an explanation of what I said, but I agree with you ;)
No, it's a rebuttal of what you said: CoT is not making up for a deficiency in tokenization, it's making up for a deficiency in transformers themselves. These complexity results have nothing to do with tokenization, or even LLMs, it is about the complexity class of problems that can be solved by transformers.
There's a really obvious way to test whether the strawberry issue is tokenization - replace each letter with a number, then ask chatGPT to count the number of 3s.
Count the number of 3s, only output a single number: 6 5 3 2 8 7 1 3 3 9.
ChatGPT: 3.
I’m the one who will fight you on this, including with peer-reviewed papers indicating that it is in fact due to tokenization. I’m too tired right now but will edit this later, so take this as my bookmark to remind me to respond.
We know there are narrow solutions to these problems, that was never the argument that the specific narrow task is impossible to solve.
The discussion is about general intelligence, the model isn't able to do a task that it can do simply because it chooses the wrong strategy, that is a problem of lack of generalization and not a problem of tokenization. Being able to choose the right strategy is core to general intelligence, altering input data to make it easier for the model to find the right solution to specific questions does not help it become more general, you just shift what narrow problems it is good at.
I am aware of errors in computations that can be fixed by better tokenization (e.g. long addition works better tokenizing right-left rather than L-R). But I am talking about counting, and talking about counting words, not characters. I don’t think tokenization explains why LLMs tend to fail at this without CoT prompting. I really think the answer is computational complexity: counting is simply too hard for transformers unless you use CoT. https://arxiv.org/abs/2310.07923
Words vs characters is a similar problem, since tokens can be less one word, multiple words, or multiple words and a partial word, or words with non-word punctuation like a sentence ending period.
My intuition says that tokenization is a factor especially if it splits up individual move descriptions differently from other LLM's
If you think about how our brains handle this data input, it absolutely does not split them up between the letter and the number, although the presence of both the letter and number together would trigger the same 2 tokens I would think
I strongly believe tokenization is indeed the underlying problem; it's just that, let's say, bit-by-bit tokenization is too expensive to run at the scales things are currently being run at (OpenAI, Claude, etc.)
It's not just a current thing, either. Tokenization basically lets you have a model with a larger input context than you'd otherwise have for the given resource constraints. So any gains from feeding the characters in directly have to be greater than this advantage. And for CoT especially - which we know produces significant improvements in most tasks - you want large context.
At a certain level they are identical problems. My strongest piece of evidence is that I get paid as an RLHF'er to find ANY case of error, including "tokenization". You know how many errors an LLM gets in the simplest grid puzzles, with CoT, with specialized models that don't try to "one-shot" problems, with multiple models, etc?
My assumption is that these large companies wouldn't pay hundreds of thousands of RLHF'ers through dozens of third party companies livable wages if tokenization errors were just that.
> hundreds of thousands of RLHF'ers through dozens of third party companies
Out of curiosity, what are these companies? And where do they operate.
I'm always interested in these sorts of "hidden" industries. See also: outsourced Facebook content moderation in Kenya.
Scale AI is a big one who owns companies who do this as well, such as Outlierai.
There are many other AI trainer job companies though. A lot of it is gig work but the pay is more than the vast majority of gig jobs.
> FWIW I think most of the "tokenization problems"
List of actual tokenization limitations:
1. strawberry
2. rhyming and metrics
3. whitespace (as displayed in the article)
It can count words in a paragraph though. So I do think it's tokenization.
I feel like we can set our qualifying standards higher than counting.
I think it's infeasible to train on bytes unfortunately, but yeah it also seems very wrong to use a handwritten and ultimately human version of tokens (if you take a look at the tokenizers out there you'll find fun things like regular expressions to change what is tokenized based on anecdotal evidence).
I keep thinking that if we can turn images into tokens, and we can turn audio into tokens, then surely we can create a set of tokens where the tokens are the model's own chosen representation for semantic (multimodal) meaning, and then decode those tokens back to text[1]. Obviously a big downside would be that the model can no longer 1:1 quote all text it's seen since the encoded tokens would need to be decoded back to text (which would be lossy).
[1] From what I could gather, this is exactly what OpenAI did with images in their gpt-4o report, check out "Explorations of capabilities": https://openai.com/index/hello-gpt-4o/
There’s a reason human brains have dedicated language handling. Tokenization is likely a solid strategy. The real thing here is that language is not a good way to encode all forms of knowledge
It's not even possible to encode all forms of knowledge.
I know a joke where half of the joke is whistling and half gesturing, and the punchline is whistling. The wording is basically just to say who the players are.
https://youtu.be/zduSFxRajkE
karpathy agrees with you, here he is hating on tokenizers while re-building them for 2h
Going from tokens to bytes explodes the model size. I can’t find the reference at the moment, but reducing the average token size induces a corresponding quadratic increase in the width (size of each layer) of the model. This doesn’t just affect inference speed, but also training speed.
I tend to agree with you. Your post reminded me of https://gwern.net/aunn
One neat thing about the AUNN idea is that when you operate at the function level, you get sort of a neural net version of lazy evaluation; in this case, because you train at arbitrary indices in arbitrary datasets you define, you can do whatever you want with tokenization (as long as you keep it consistent and don't retrain the same index with different values). You can format your data in any way you want, as many times as you want, because you don't have to train on 'the whole thing', any more than you have to evaluate a whole data structure in Haskell; you can just pull the first _n_ elements of an infinite list, and that's fine.
So there is a natural way to not just use a minimal bit or byte level tokenization, but every tokenization simultaneously: simply define your dataset to be a bunch of datapoints which are 'start-of-data token, then the byte encoding of a datapoint followed by the BPE encoding of that followed by the WordPiece encoding followed by ... until the end-of-data token'.
You need not actually store any of this on disk, you can compute it on the fly. So you can start by training only on the byte encoded parts, and then gradually switch to training only on the BPE indices, and then gradually switch to the WordPiece, and so on over the course of training. At no point do you need to change the tokenization or tokenizer (as far as the AUNN knows) and you can always switch back and forth or introduce new vocabularies on the fly, or whatever you want. (This means you can do many crazy things if you want. You could turn all documents into screenshots or PDFs, and feed in image tokens once in a while. Or why not video narrations? All it does is take up virtual indices, you don't have to ever train on them...)
Perhaps we can even do away with transformers and use a fully connected network. We can always prune the model later ...
A byte is itself sort of a token. So is a bit. It makes more sense to use more tokenizers in parallel than it does to try and invent an entirely new way of seeing the world.
Anyway humans have to tokenize, too. We don't perceive the world as a continuous blob either.
I would say that "humans have to tokenize" is almost precisely the opposite of how human intelligence works.
We build layered, non-nested gestalts out of real time analog inputs. As a small example, the meaning of a sentence said with the same precise rhythm and intonation can be meaningfully changed by a gesture made while saying it. That can't be tokenized, and that isn't what's happening.
What is a gestalt if not a token (or a token representing collections of other tokens)? It seems more reasonable (to me) to conclude that we have multiple contradictory tokenizers that we select from rather than to reject the concept entirely.
> That can't be tokenized
Oh ye of little imagination.
How would we train it? Don't we need it to understand the heaps and heaps of data we already have "tokenized" e.g. the internet? Written words for humans? Genuinely curious how we could approach it differently?
Couldn't we just make every human readable character a token?
OpenAI's tokenizer makes "chess" "ch" and "ess". We could just make it into "c" "h" "e" "s" "s"
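Defining a character- or byte-level vocabulary really is that simple - the cost shows up in sequence length, not in the mapping. A tiny sketch:

    # Sketch: byte-level "tokenization" needs only 256 IDs, but sequences get long.
    text = "chess"
    byte_ids = list(text.encode("utf-8"))
    print(byte_ids)              # [99, 104, 101, 115, 115] - one ID per character here
    print(len(byte_ids), "tokens instead of ~2 BPE tokens for the same word")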
We can, tokenization is literally just to maximize resources and provide as much "space" as possible in the context window.
There is no advantage to tokenization, it just helps solve limitations in context windows and training.
I like this explanation
This is just more tokens? And probably requires the model to learn about common groups. Consider, "ess" makes sense to see as a group. "Wss" does not.
That is, the groups are encoding something the model doesn't have to learn.
This is not much astray from "sight words" we teach kids.
No, actually much fewer tokens. 256 tokens cover all bytes. See the ByT5 paper: https://arxiv.org/abs/2105.13626
More tokens to a sequence, though. And since it is learning sequences...
Yeah, suddenly 16k tokens is just 16 kB of ASCII instead of ~6k words
> This is just more tokens?
Yup. Just let the actual ML git gud
So, put differently, this is just more expensive?
Expensive in terms of computationally expensive, time expensive, and yes cost expensive.
Worth noting that compute probably scales quadratically (or cubically, or by some other polynomial) with the characters-per-token ratio. So the difference in computational difficulty is probably huge compared to one character per token.
aka Character Language Models which have existed for a while now.
That's not what tokenized means here. Parent is asking to provide the model with separate characters rather than tokens, i.e. groups of characters.
This is probably unnecessary, but: I wish you wouldn't use the word "stupid" there. Even if you didn't mean anything by it personally, it might reinforce in an insecure reader the idea that, if one can't speak intelligently about some complex and abstruse subject that other people know about, there's something wrong with them, like they're "stupid" in some essential way. When in fact they would just be "ignorant" (of this particular subject). To be able to formulate those questions at all is clearly indicative of great intelligence.
> This is probably unnecessary
you're certainly right
Well, I'm still glad I posted it, since I do care about it.
Tokenization is not strictly speaking necessary (you can train on bytes). What it is is really really efficient. Scaling is a challenge as is, bytes would just blow that up.
I think on the contrary, the more you can restrict it to reasonable inputs/outputs, the less powerful LLM you are going to need.
hot take: LLM tokens are kanji for AI, and just like kanji it works okay sometimes but fails miserably at the task of accurately representing English
Why couldn’t Chinese characters accurately represent English? Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).
If England had been in the Chinese sphere of influence rather than the Roman one, English would presumably be written with Chinese characters too. The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.
If I read you correctly, you're saying "the fact that the residents of England speak English instead of Chinese is a historical accident" and maybe you're right.
But the residents of England do in fact speak English, and English is a phonetic language, so there's an inherent impedance mismatch between Chinese characters and English language. I can make up words in English and write them down which don't necessarily have Chinese written equivalents (and probably, vice-versa?).
> If I read you correctly, you're saying "the fact that the residents of England speak English instead of Chinese is a historical accident" and maybe you're right.
That’s not what I mean at all. I mean even if spoken English were exactly the same as it is now, it could have been written with Chinese characters, and indeed would have been if England had been in the Chinese sphere of cultural influence when literacy developed there.
> English is a phonetic language
What does it mean to be a “phonetic language”? In what sense is English “more phonetic” than the Chinese languages?
> I can make up words in English and write them down which don’t necessarily have Chinese written equivalents
Of course. But if English were written with Chinese characters people would eventually agree on characters to write those words with, just like they did with all the native Japanese words that didn’t have Chinese equivalents but are nevertheless written with kanji.
Here is a famous article about how a Chinese-like writing system would work for English: https://www.zompist.com/yingzi/yingzi.htm
> In what sense is English “more phonetic” than the Chinese languages?
Written English vs written Chinese.
How would you write, in Chinese, the words thingamajibber, gizmosity, or half the things that come out of AvE's mouth? These words have subtle, humorous, and entertaining meanings by way of twisting the sounds of other existing words. Shakespeare was a master of this kind of wordplay and invented a surprising number of words we use today.
I'm not saying you can't have the same phenomenon in spoken Chinese. But how do you write it down without a phonetic alphabet? And if you can't write it down, how do you share it to a wide audience?
> How would you write, in Chinese, the words thingamajibber, gizmosity, or half the things that come out of AvE's mouth?
With Chinese characters, of course. Why wouldn’t you be able to?
In English “thing”, “a”, and “ma” are already words, and “jibber” would presumably be the first character in “gibberish”. So you could write that made-up word by combining those four characters.
> But how do you write it down without a phonetic alphabet?
In general to write a newly coined word you would repurpose characters that sound the same as the newly coined word.
Every syllable that can possibly be uttered according to mandarin phonology is represented by some character (usually many), so this is always possible.
---
Regardless, to reiterate the original point: I'm not claiming Chinese characters are better or more flexible than alphabetic writing. They're not. I'm simply claiming that there's no inherent property of Japanese that makes it more amenable to representation with Chinese characters than English is (other than the fact that a lot of its vocabulary comes from Chinese, but that's not a real counterpoint given that there is lots of native, non-Chinese-derived vocabulary that's still written with kanji).
It would be possible to write Japanese entirely in the Latin alphabet, or English entirely with some system similar to Chinese characters, with minimal to no change to the structure of the language.
> In English “thing”, “a”, and “ma” are already words, and “jibber” would presumably be the first character in “gibberish”. So you could write that made-up word by combining those four characters.
Nonsense. There is zero chance in hell that if you combine the pictographs for "thing", "a", "ma", and "gibberish", that someone reading that is going to reproduce the sound thingamajibber. It just does not work. The meme does not replicate.
There may be other virtues of pictographic written language, but reproducing sounds is not one of them. And - as any Shakespeare fan will tell you - tweaking the sounds of English cleverly is rather important. If you can't reproduce this behavior, you're losing something in translation. So to speak.
Chinese characters aren't pictographs, so whether English could be written with pictographs is irrelevant to this discussion.
Each Chinese character represents a syllable (in Chinese languages) or a small set of possible sequences of syllables (in Japanese).
And yes, in Chinese languages, new words are created from characters that sound like the parts of the new word, all the time.
> I'm simply claiming that there's no inherent property of Japanese that makes it more amenable to representation with Chinese characters than English is
What? No. Nothing but IPA (and even that only technically) and a language's own native writing system works for recording pronunciation. Hiragana, Hangul, and Chữ Quốc Ngữ would not exist otherwise.
Then why are both English and Latin represented with Latin characters despite having a completely different phoneme inventory?
Because one is a distant ancestor of the other...? It never adopted a writing system from outside. The written and spoken systems co-evolved from a clean slate.
That’s not true. English is not a descendant of Latin, and the Latin alphabet was adopted from the outside, replacing Anglo-Saxon runes (also called the Futhorc script).
Just like kanji are not native to Japanese.
"Donald Trump" in CJK, taken from Wikipedia page URL and as I hear it - each are close enough[1] and natural enough in each respective languages but none of it are particularly useful for counting R in strawberry:
> What does it mean to be a “phonetic language”?
Means the script is intended to record pronunciation rather than intention, e.g. it's easy to see how "cow" is intended to be pronounced but it's not necessarily clear what a cow is; ideographic script on the other hand focuses on meaning, e.g. "魚" is supposed to look like a fish but pronunciation varies from "yueh", "sakana", "awe", etc.
1: I tried looking up other notable figures, but thought this person having entertainment background tends to illustrate the point more clearly
> Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).
The problem is – in writing Japanese with kanji, lots of somewhat arbitrary decisions had to be made. Which kanji to use for which native Japanese word? There isn't always an obviously best choice from first principles. But that's not a problem in practice, because a tradition developed of which kanji to use for which Japanese word (kun'yomi readings). For English, however, we don't have such a tradition. So it isn't clear which Chinese character to use for each English word. If two people tried to write English with Chinese characters independently, they'd likely make different character choices, and the mutual intelligibility might be poor.
Also, while neither Japanese nor Korean belongs to the same language family as Chinese, both borrowed lots of words from Chinese. In Japanese, a lot of use of kanji (especially on'yomi reading) is for borrowings from Chinese. Since English borrowed far less terms from Chinese, this other method of "deciding which character(s) to use" – look at the word's Chinese etymology – largely doesn't work for English given very few English words have Chinese etymology.
Finally, they also invented kanji in Japan for certain Japanese words – kokuji. The same thing happened for Korean Hanja (gukja), to a lesser degree. Vietnamese Chữ Nôm contains thousands of invented-in-Vietnam characters. Probably, if English had adopted Chinese writing, the same would have happened. But again, deciding when to do it and if so how is a somewhat arbitrary choice, which is impossible outside of a real societal tradition of doing it.
> The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.
Using the Latin alphabet changed English, just as using Chinese characters changed Japanese, Korean and Vietnamese. If English had used Chinese characters instead of the Latin alphabet, it would be a very different language today. Possibly not in grammar, but certainly in vocabulary.
You could absolutely write a tokenizer that would consistently tokenize all distinct English words as distinct tokens, with a 1:1 mapping.
But AFAIK there's no evidence that this actually improves anything, and if you spend that much of the dictionary on one language, it comes at the cost of making the encoding for everything else much less efficient.
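For what it's worth, such a 1:1 word-level tokenizer is trivial to build (sketch below); the catch is exactly the vocabulary-size cost described above:

    # Sketch: a 1:1 word-to-ID mapping built from a toy corpus.
    def build_vocab(corpus):
        vocab = {}
        for line in corpus:
            for word in line.lower().split():
                vocab.setdefault(word, len(vocab))
        return vocab

    vocab = build_vocab(["the cat sat", "the dog sat down"])
    print(vocab)                                      # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}
    print([vocab[w] for w in "the dog sat".split()])  # [0, 3, 2]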
I mean, it just felt to me that current LLMs must architecturally favour fixed-length "ideomes" - like phonemes, but for meaning - having been conceived under the influence of research on CJK.
And being architecturally based on such idea-like elements, I just casually thought there could be limits to how far this can be pushed toward perfecting English, and that some radical change - not simply dropping tokenization but something more fundamental - has to take place at some point.
I don't think it's hard for the LLM to treat a sequence of two tokens as a semantically meaningful unit, though. They have to handle much more complicated dependencies to parse higher-level syntactic structures of the language.
I have seen a bunch of tokenization papers with various ideas but their results are mostly meh. I personally don't see anything principally wrong with current approaches. Having discrete symbols is how natural language works, and this might be an okayish approximation.
It's probably worth to play around with different prompts and different board positions.
For context this [1] is the board position the model is being prompted on.
There may be more than one weird thing about this experiment, for example giving instructions to the non-instruction tuned variants may be counter productive.
More importantly, let's say you just give the model the truncated PGN: does this look like a position where white is a grandmaster-level player? I don't think so. Even if the model understood chess really well, it's going to try to predict the most probable move given the position at hand. If the model thinks that white is a bad player, and the model is good at understanding chess, it's going to predict bad moves as the more likely ones, because that better predicts what is most likely to happen here.
[1]: https://i.imgur.com/qRxalgH.png
Apparently I can find some matches for games that start like that between very strong players [1], so my hypothesis that the model may just be predicting bad moves on purpose seems wobbly, although having stockfish at the lowest level play as the supposedly very strong opponent may still be throwing the model off somewhat. In the charts the first few moves the model makes seem decent, if I'm interpreting these charts right, and after a few of those things seem to start going wrong.
Either way it's worth repeating the experiment imo, tweaking some of these variables (prompt guidance, stockfish strength, starting position, the name of the supposed players, etc.).
[1]: https://www.365chess.com/search_result.php?search=1&p=1&m=8&...
Interesting thought: the LLM isn't trying to win, it's trying to produce data like the input data. It's quite rare for a very strong player to play a very weak one. If you feed it lots of weak moves, it'll best replicate the training data by following with weak moves.
The experiment started from the first move of a game, and played each game fully. The position you linked was just an example of the format used to feed the game state to the model for each move.
What would "winning" or "losing" even mean if all of this was against a single move?
Agree with this. A few prompt variants:
* What if you allow the model to do Chain of Thought (explicitly disallowed in this experiment)
* What if you explain the board position at each step to the model in the prompt, so it doesn't have to calculate/estimate it internally.
They also tested GPT-o1, which is always CoT. Yet it is still worse.
He was playing full games, not single moves.
Does it ever try an illegal move? OP didn't mention this and I think it's inevitable that it should happen at least once, since the rules of chess are fairly arbitrary and LLMs are notorious for bullshitting their way through difficult problems when we'd rather they just admit that they don't have the answer.
In my experience you are lucky if it manages to give you 10 legal moves in a row, e.g. https://news.ycombinator.com/item?id=41527143#41529024
Yes, he discusses using a grammar to restrict to only legal moves
I suspect the models probably memorized some chess openings, and afterwards they are just playing random moves with the help of the grammar.
I suspect that as well, however, 3.5-turbo-instruct has been noted by other people to do much better at generating legal chess moves than the other models. https://github.com/adamkarvonen/chess_gpt_eval gave models "5 illegal moves before forced resignation of the round" and 3.5 had very few illegal moves, while 4 lost most games due to illegal moves.
> he discusses using a grammar to restrict to only legal moves
Whether a chess move is legal isn't primarily a question of grammar. It's a question of the board state. "White king to a5" is a perfectly legal move, as long as the white king was next to a5 before the move, and it's white's turn, and there isn't a white piece in a5, and a5 isn't threatened by black. Otherwise it isn't.
"White king to a9" is a move that could be recognized and blocked by a grammar, but how relevant is that?
Still an interesting direction of questioning. Maybe could be rephrased as "how much work is the grammar doing"? Are the results with the grammar very different than without? If/when a grammar is not used (like in the openai case), how many illegal moves does it try on average before finding a legal one?
A grammar is really just a special case of the more general issue of how to pick a single token given the probabilities that the model spits out for every possible one. In that sense, filters like temperature / top_p / top_k are already hacks that "do the work" (since always taking the most likely predicted token does not give good results in practice), and grammars are just a more complicated way to make such decisions.
I'd be more interested in what the distribution of grammar-restricted predictions looks like compared to moves Stockfish says are good.
An LLM would complain that its internal model does not reflect its current input/output.
Since LLMs know that people knock things off, test them, run afoul of rules, and make mistakes, it would then raise that as a possibility and likely inquire.
This isn't prompt engineering, it's grammar-constrained decoding. It literally cannot respond with anything but tokens that fulfill the grammar.
I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.
Then you should be surprised that turbo-instruct actually plays well, right? We see a proliferation of hand-wavy arguments based on unfounded anthropomorphic intuitions about "actual reasoning" and whatnot. I think this is good evidence that nobody really understands what's going on.
If some mental model says that LLMs should be bad at chess, then it fails to explain why we have LLMs playing strong chess. If another mental model says the inverse, then it fails to explain why so many of these large models fail spectacularly at chess.
Clearly, there's more going on here.
There are some who suggest that modern chess is mostly a game of memorization and not one particularly of strategy or skill. I assume this is why variants like speed chess exist.
In this scope, my mental model is that LLMs would be good at modern style long form chess, but would likely be easy to trip up with certain types of move combinations that most humans would not normally use. My prediction is that once found they would be comically susceptible to these patterns.
Clearly, we have no real basis for saying it is "good" or "bad" at chess, and even using chess performance as an measurement sample is a highly biased decision, likely born out of marketing rather than principle.
It is memorisation only after you have grandmastered reasoning and strategy.
Speed chess relies on skill.
I think you're using "skill" to refer solely to one aspect of chess skill: the ability to do brute-force calculations of sequences of upcoming moves. There are other aspects of chess skill, such as:
1. The ability to judge a chess position at a glance, based on years of experience in playing chess and theoretical knowledge about chess positions.
2. The ability to instantly spot tactics in a position.
In blitz (about 5 minutes) or bullet (1 minute) chess games, these other skills are much more important than the ability to calculate deep lines. They're still aspects of chess skill, and they're probably equally important as the ability to do long brute-force calculations.
> tactics in a position
That should give patterns (hence your use of the verb to "spot" them, as the grandmaster would indeed spot the patterns) recognizable in the game string.
More specifically grammar-like patterns, e.g. the same moves but translated.
Typically what an LLM can excel at.
> Then you should be surprised that turbo-instruct actually plays well, right?
Do we know it's not special-casing chess and instead using a different engine (not an LLM) for playing?
To be clear, this would be an entirely appropriate approach to problem-solving in the real world, it just wouldn't be the LLM that's playing chess.
Yes, probably there is more going on here, e.g. it is cheating.
"playing strong chess" would be a much less hand-wavy claim if there were lots of independent methods of quantifying and verifying the strength of stockfish's lowest difficulty setting. I honestly don't know if that exists or not. But unless it does, why would stockfish's lowest difficulty setting be a meaningful threshold?
I've tried it myself; GPT-3.5-turbo-instruct was at least somewhere in the range of 1600-1800 Elo.
But to some approximation we do know how an LLM plays chess: based on all the games, sites, blogs, and analysis in its training data. But it has a limited ability to tell a good move from a bad move, since the training data has both, and some of it lacks context on move quality.
Here's an experiment: give an LLM a balanced middle game board position and ask it "play a new move that a creative grandmaster has discovered, never before played in chess and explain the tactics and strategy behind it". Repeat many times. Now analyse each move in an engine and look at the distribution of moves and responses. Hypothesis: It is going to come up with a bunch of moves all over the ratings map with some sound and some fallacious arguments.
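A rough sketch of that experiment, in case anyone wants to run it (the model name, the FEN and the Stockfish path are placeholders; assumes the openai and python-chess packages plus a local Stockfish binary):

```python
import chess
import chess.engine
from openai import OpenAI

FEN = "r1bq1rk1/pp2bppp/2n1pn2/3p4/3P4/2NBPN2/PP3PPP/R1BQ1RK1 w - - 0 9"  # any balanced middlegame
PROMPT = (
    f"Position (FEN): {FEN}\n"
    "Play a new move that a creative grandmaster has discovered, never before played in chess, "
    "and explain the tactics and strategy behind it. Give the move in SAN on the first line."
)

client = OpenAI()
engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # assumes stockfish is on PATH

results = []
for _ in range(50):  # "repeat many times"
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,
    ).choices[0].message.content
    move_text = reply.splitlines()[0].strip()
    board = chess.Board(FEN)
    try:
        board.push_san(move_text)
    except ValueError:
        results.append(("illegal", move_text))
        continue
    info = engine.analyse(board, chess.engine.Limit(depth=18))
    results.append((str(info["score"].white()), move_text))

engine.quit()
for score, move in results:
    print(score, move)  # the distribution of "novel grandmaster moves" and their engine evals
```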
I really don't think there's anything too mysterious going on here. It just synthesizes existing knowledge and gives answers that include big hits, big misses and everything in between. Creators chip away at the edges to change that distribution, but the fundamental workings don't change.
One of the main purposes of running experiments of any sort is to find out if our preconceptions are accurate. Of course, if someone is not interested in that question, they might as well choose not to look through the telescope.
Sadly there’s a common sentiment on HN that testing obvious assumptions is a waste of time
Not only on HN. Trying to publish a scientific article that does not contain the word 'novel' has become almost impossible. No one is trying to reproduce anyones claims anymore.
I don't think this is about replication, but even just about the initial test in the first place. In science we do often test obvious things. For example, I was a theoretical quantum physicist, and a lot of the time I knew that what I am working on will definitely work, since the maths checks out. In some sense that makes it kinda obvious, but we test it anyway.
The issue is that even that kinda obviousness is criticised here. People get mad at the idea of doing experiments when we already expect a result.
Do you think this bias is part of the replication crisis in science?
This is a puzzle given enough training information. An LLM can successfully print out the status of the board after the given moves. It can also produce a not-terrible summary of the position and is able to list dangers at least one move ahead. Decent is subjective, but that should beat at least beginners. And the lowest level of Stockfish used in the blog post is lowest intermediate.
I don't know really what level we should be thinking of here, but I don't see any reason to dismiss the idea. Also, it really depends on whether you're thinking of the current public implementations of the tech, or the LLM idea in general. If we wanted to get better results, we could feed it way more chess books and past game analysis.
LLMs like GPT aren’t built to play chess, and here’s why: they’re made for handling language, not playing games with strict rules and strategies. Chess engines, like Stockfish, are designed specifically for analyzing board positions and making the best moves, but LLMs don’t even "see" the board. They’re just guessing moves based on text patterns, without understanding the game itself.
Plus, LLMs have limited memory, so they struggle to remember previous moves in a long game. It’s like trying to play blindfolded! They’re great at explaining chess concepts or moves but not actually competing in a match.
> but LLMs don’t even "see" the board
This is a very vague claim, but they can reconstruct the board from the list of moves, which I would say proves this wrong.
> LLMs have limited memory
For the recent models this is not a problem for the chess example. You can feed whole books into them if you want to.
> so they struggle to remember previous moves
Chess is stateless with perfect information. Unless you're going for mind games, you don't need to remember previous moves.
> They’re great at explaining chess concepts or moves but not actually competing in a match.
What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
Chess is not stateless. En passant requires the last move, and castling rights require knowledge of nearly all previous moves.
https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
Ok, I did go too far. But castling doesn't require all previous moves - only one bit of information carried over. So in practice that's board + 2 bits per player. (or 1 bit and 2 moves if you want to include a draw)
Castling requires no prior moves by either piece (King or Rook). Move the King once and back early on, and later, although the board looks set for castling, the King may not castle.
Yes, which means you carry one bit of extra information - "is castling still allowed". The specific moves that resulted in this bit being unset don't matter.
Ok, then for this you need a minimum of two bits - one for the kingside Rook and one for the queenside Rook; both would be cleared if you move the King. You also need to count moves since the last capture or pawn move for the 50 move rule.
Ah, that one's cool - I've got to admit I've never heard of the 50 move rule.
Also the 3x repetition rule.
And the 5x repetition rule.
Chess is not stateless. Three repetitions of the same position is a draw.
Yes, there’s state there that’s not in the board position, but technically, threefold repetition is not a draw. Play can go on. https://en.wikipedia.org/wiki/Threefold_repetition:
“The game is not automatically drawn if a position occurs for the third time – one of the players, on their turn, must claim the draw with the arbiter. The claim must be made either before making the move which will produce the third repetition, or after the opponent has made a move producing a third repetition. By contrast, the fivefold repetition rule requires the arbiter to intervene and declare the game drawn if the same position occurs five times, needing no claim by the players.”
> Chess is stateless with perfect information. Unless you're going for mind games, you don't need to remember previous moves.
While it can be played as stateless, remembering previous moves gives you insight into the potential strategy that is being built.
> Chess is stateless with perfect information.
It is not stateless, because good chess isn't played as a series of independent moves -- it's played as a series of moves connected to a player's strategy.
> What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
Continuing from the above, "best" in the latter sense involves understanding possible future moves after the next move.
Ergo, if I looked at all games with the current board state and chose the next move that won the most games, it'd be tactically sound but strategically ignorant.
Because many of those next moves were made in support of some broader strategy.
> it's played as a series of moves connected to a player's strategy.
That state belongs to the player, not to the game. You can carry your own state in any game you want - for example remember who starts with what move in rock paper scissors, but that doesn't make that game stateful. It's the player's decision (or bot's implementation) to use any extra state or not.
I wrote "previous moves" specifically (and the extra bits are already addressed elsewhere), but the LLM can carry/rebuild its internal state between the steps.
If we're talking about LLMs, then the state belongs to it.
So even if the rules of chess are (mostly) stateless, the resulting game itself is not.
Thus, you can't dismiss concerns about LLMs having difficulty tracking state by saying that chess is stateless. It's not, in that sense.
> good chess isn't played as a series of independent moves -- it's played as a series of moves connected to a player's strategy.
Maybe good chess, but not perfect chess. That would by definition be game-theoretically optimal, which in turn implies having to maintain no state other than your position in a large but precomputable game tree.
Right, but your position also includes whether or not you still have the right to castle on either side, whether each pawn has the right to capture en passant or not, the number of moves since the last pawn move or capture (for tracking the 50 move rule), and whether or not the current position has ever appeared on the board once or twice prior (so you can claim a draw by threefold repetition).
So in practice, your position actually includes the log of all moves to that point. That’s a lot more state than just what you can see on the board.
You can feed them whole books, but they have trouble with recall for specific information in the middle of the context window.
>Chess is stateless with perfect information. Unless you're going for mind games, you don't need to remember previous moves.
In what sense is chess stateless? Question: is Rxa6 a legal move? You need board state to refer to in order to decide.
They mean that you only need board position, you don't need the previous moves that led to that board position.
There are at least a couple of exceptions to that as far as I know.
Yes, 4 exceptions: castling rights, legal en passant captures, threefold repetition, and the 50 move rule. You actually need quite a lot of state to track all of those.
It shouldn't be too much extra state. I assume that 4 bits should be enough to cover castling rights (kingside and queenside for each side), whatever is necessary to store the last 3 moves should cover legal en passant captures and threefold repetition, and 12 bits to store two non-overflowing 6 bit counters (time since last capture, and time since last pawn move) should cover the 50 move rule.
So... unless I'm understanding something incorrectly, something like "the three last moves plus roughly 16 bits of state" (plus the current board state) should be enough to treat chess as a memoryless process. Doesn't seem like too much to track.
Threefold repetition does not require the three positions to occur consecutively. So you could conceivably have a position repeat itself for the first time on the 1st move, the second time on the 25th move, and the third time on the 50th move of a sequence, and then players could claim a draw by threefold repetition or the 50 move rule at the same time!
This means you do need to store the last 50 board positions in the worst case. Normally you need to store less because many moves are irreversible (pawns cannot go backwards, pieces cannot be un-captured).
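A quick way to see where the line falls, using the python-chess library (assuming it's installed): the FEN carries the castling rights, the en passant square and the halfmove clock, but a repetition claim needs the move history that the board object keeps on the side.

```python
import chess

board = chess.Board()
for san in ["Nf3", "Nf6", "Ng1", "Ng8", "Nf3", "Nf6", "Ng1", "Ng8"]:
    board.push_san(san)

# Castling rights ("KQkq"), the en passant square ("-") and the halfmove
# clock ("8") are all part of the FEN snapshot of the position:
print(board.fen())
# expected: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 8 5

# ...but whether a draw can be claimed depends on positions seen earlier,
# which no single-snapshot representation can tell you:
print(board.can_claim_threefold_repetition())  # True: the start position is back for a 3rd time
```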
Ah... gotcha. Thanks for the clarification.
The correct phrasing would be: is it a Markov process?
> they’re made for handling language, not playing games with strict rules and strategies
Here's the opposite theory: Language encodes objective reasoning (or at least, it does some of the time). A sufficiently large ANN trained on sufficiently large amounts of text will develop internal mechanisms of reasoning that can be applied to domains outside of language.
Based on what we are currently seeing LLMs do, I'm becoming more and more convinced that this is the correct picture.
I share this idea, but from a different perspective. It doesn't develop these mechanisms, but casts a high-dimensional-enough shadow of their effect on itself. This vaguely explains why the deeper you are, Gell-Mann-wise, the less sharp that shadow is, because specificity cuts off "reasoning" hyperplanes.
It’s hard to explain emerging mechanisms because of the nature of generation, which is one-pass sequential matrix reduction. I say this while waving my hands, but listen. Reasoning is similar to Turing complete algorithms, and what LLMs can become through training is similar to limited pushdown automata at best. I think this is a good conceptual handle for it.
“Line of thought” is an interesting way to loop the process back, but it doesn’t show that much improvement, afaiu, and still is finite.
Otoh, a chess player takes as much time and “loops” as they need to get the result (ignoring competitive time limits).
LLMs need to compress information to be able to predict next words in as many contexts as possible.
Chess moves are simply tokens as any other. Given enough chess training data, it would make sense to have part of the network trained to handle chess specifically instead of simply encoding basic lists of moves and follow-ups. The result would be a general purpose sub-network trained on chess.
Language is a game with strict rules and strategies.
just curious, was this rephrased by an llm or is that your writing style?
Stockfish level 1 is well below "lowest intermediate".
A friend of mine just started playing chess a few weeks ago and can beat it about 25% of the time.
It will hang pieces, and you can hang your own queen and there's about a 50% chance it won't be taken.
Because it's a straightforward stochastic sequence modelling task, and I've seen GPT-3.5-turbo-instruct play at a high amateur level myself. But it seems like all the RLHF and distillation that is done on newer models destroys that ability.
The question here is why gpt-3.5-turbo-instruct can then beat Stockfish.
PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close: "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00" https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
Maybe there's some difference in the setup because the OP reports that the model beats stockfish (how they had it configured) every single game.
You have to get the model to think in PGN data. It's crucial to use the exact PGN format it saw in its training data and to give it few-shot examples.
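Something along these lines, illustratively (the headers, Elo values and completion settings are all just placeholders for the kind of thing people report mattering):

```python
from openai import OpenAI

# A completion-style prompt that keeps the model "inside" a PGN transcript.
# The headers mimic the PGN the model saw in pretraining; the model is simply
# asked to continue the text after the last move number.
prompt = """[Event "Example Tournament"]
[Site "?"]
[Date "2023.01.01"]
[Round "1"]
[White "Player, Strong"]
[Black "Player, Strong"]
[Result "*"]
[WhiteElo "2700"]
[BlackElo "2700"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."""

client = OpenAI()
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # legacy completions endpoint, not chat
    prompt=prompt,
    max_tokens=6,
    temperature=0.0,
)
print(completion.choices[0].text)  # the model's continuation of the PGN, e.g. " Ba4 ..."
```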
OP had stockfish at its weakest preset.
Did the same and gpt-3.5-turbo-instruct still lost all the games. Maybe a difference in Stockfish version? I am using Stockfish 16.
That is a very pertinent question, especially if Stockfish has been used to generate training data.
Cheating (using an internal chess engine) would be the obvious reason to me.
But in that case there shouldn't be any invalid moves, ever. Another tester found gpt-3.5-turbo-instruct to be suggesting at least one illegal move in 16% of the games (source: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/ )
Nope. Calls via the API don't use function calls.
How can you prove this when talking about someones internal closed API?
that you know of
Sure. It's not hard to verify: in the user UI, function calls are very transparent.
And in the API, all of the common features like maths and search are just not there. You can implement them yourself.
You can compare with self hosted models like llama and the performance is quite similar.
You can also jailbreak and get a shell into the container to get some further proof.
This is all just guesswork. It's a black box. You have no idea what post-processing they're doing on their end.
The article appears to have only run Stockfish at low levels. You don't have to be very good to beat it.
I'm actually surprised any of them manage to make legal moves throughout the game once they're out of book.
> I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
Because it would be super cool; curiosity isn't something to be frowned upon. If it turned out it did play chess reasonably well, it would mean emergent behaviour instead of just echoing things said online.
But it's wishful thinking with this technology at this current level; like previous instances of chatbots and the like, while initially they can convince some people that they're intelligent thinking machines, this test proves that they aren't. It's part of the scientific process.
turbo instruct does play chess reasonably well.
https://github.com/adamkarvonen/chess_gpt_eval
Even the blog above says as much.
They thought it because we have an existence proof: gpt-3.5-turbo-instruct can play chess at a decent level.
That was the point of the post (though you have to read it to the end to see this). That one model can play chess pretty well, while the free models and OpenAI's later models can't. That's weird.
I suppose you didn't get the news, but Google developed an LLM that can play chess. And play it at grandmaster level: https://arxiv.org/html/2402.04494v1
That article isn't as impressive as it sounds: https://gist.github.com/yoavg/8b98bbd70eb187cf1852b3485b8cda...
In particular, it is not an LLM and it is not trained solely on observations of chess moves.
Not quite an LLM. It's a transformer model, but there's no tokenizer or words, just chess board positions (64 tokens, one per board square). It's purpose-built for chess (never sees a word of text).
In fact, the unusual aspect of this chess engine is not that it's using neural networks (even Stockfish does, these days!), but that it's only using neural networks.
Chess engines essentially do two things: calculate the value of a given position for their side, and walk the game tree while evaluating its positions in that way.
Historically, position value was a handcrafted function using win/lose criteria (e.g. being able to give checkmate is infinitely good) and elaborate heuristics informed by real chess games, e.g. having more space on the board is good, having a high-value piece threatened by a low-value one is bad etc., and the strength of engines largely resulted from being able to "search the game tree" for good positions very broadly and deeply.
Recently, neural networks (trained on many simulated games) have been replacing these hand-crafted position evaluation functions, but there's still a ton of search going on. In other words, the networks are still largely "dumb but fast", and without deep search they'll lose against even a novice player.
This paper now presents a searchless chess engine, i.e. one that essentially "looks at the board once" and "intuits the best next move", without "calculating" resulting hypothetical positions at all. In the words of Capablanca, a chess world champion also cited in the paper: "I see only one move ahead, but it is always the correct one."
The fact that this is possible can be considered surprising, a testament to the power of transformers etc., but it does indeed have nothing to do with language or LLMs (other than that the best ones known to date are based on the same architecture).
It's interesting to note that the paper benchmarked its chess playing performance against GPT-3.5-turbo-instruct, the only well-performing LLM in the posted article.
Right, at least as of the ~GPT3 model it was just "predict what you would see in a chess game", not "what would be the best move". So (IIRC) users noted that if you made bad move, then the model would also reply with bad moves because it pattern matched to bad games. (I anthropomorphized this as the model saying "oh, we're doing dumb-people-chess now, I can do that too!")
But it also predicts moves where the text says "black won the game, [proceeds to show the game]". To minimize loss on that, it would need to use the context to try to make it so White doesn't make critical mistakes.
I love how LLMs are the one subject matter where even most educated people are extremely confidently wrong.
Ppl acting like LLMs!
Chess does not clearly require that. Various purely ML/statistical based model approaches are doing pretty well. It's almost certainly best to incorporate some kind of search into an overall system, but it's not absolutely required to play just decent amateur level.
The problem here is the specific model architecture, training data, vocabulary/tokenization method (if you were going to even represent a game this way... which you wouldn't), loss function and probably decoding strategy.... basically everything is wrong here.
It'd be more interesting to see LLMs play Family Feud. I think it'd be their ideal game.
> I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
You shouldn't but there's lots of things that LLMs can do that educated people shouldn't expect it to be able to do.
> I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
The blog post demonstrates that a LLM plays chess at a decent level.
The blog post explains why. It addresses the issue of data quality.
I don't understand what point you thought you were making. Regardless of where you stand, the blog post showcases a surprising result.
You stress your prior unfounded belief, you were presented with data that proves it wrong, and your reaction was to post a comment with a thinly veiled accusation of people not being educated when clearly you are the one that's off.
To make matters worse, this topic is also about curiosity, which has a strong link with intelligence and education. And you are here criticizing others on those grounds in spite of showing your deficit right in the first sentence.
This blog post was a great read. Very surprising, engaging, and thought provoking.
The only service performing well is a closed source one that could simply use a real chess engine for questions that look like chess, for marketing purposes. There’s nothing thought provoking about a bunch of engineers doing “experiments” against a service, other than how sad it is to debase themselves in this way.
> The only service performing well is a closed source one that could simply use a real chess engine for questions that look like chess, for marketing purposes.
That conspiracy theory holds no traction in reality. This blog post is so far the only reference to using LLMs to play chess. The "closed-source" model (whatever that is) is an older version that does worse than the newer version. If your conspiracy theory had any bearing in reality how come this fictional "real chess engine" was only used in a single release? Unbelievable.
Back in reality, it is well known that newer models made available to the public are adapted to business needs by constraining their capabilities to limit liability.
There are many ways to test for reasoning and deterministic computation, as my own work in this space has shown.
But there's really nothing about chess that makes reasoning a prerequisite, a win is a win as long as it's a win. This is kind of a semantics game: it's a question of whether the degree of skill people observe in an LLM playing chess is actually some different quantity than the chance it wins.
I mean at some level you're saying that no matter how close to 1 the win probability (1 - epsilon) gets, both of the following are true:
A. you should expect the computation that you're able to do via conscious reasoning alone to be sufficient, at least in principle, to asymptotically get a higher win probability than a model, no matter what the model's win probability was to begin with
B. no matter how close to 1 that the model's win rate p=(1 - epsilon) gets, because logical inference is so non-smooth, the win rate on yet-unseen data is fundamentally algorithmically random/totally uncorrelated to in-distribution performance, so it's never appropriate to say that a model can understand or to reason
To me it seems that people are subject to both of these criteria, though. They have a tendency to cap out at their eventual skill cap unless given a challenge to nudge them to a higher level, and likewise possession of logical reasoning doesn't let us say much at all about situations that their reasoning is unfamiliar with.
I also think, if you want to say that what LLMs do has nothing to do with understanding or ability, then you also have to have an alternate explanation for the phenomenon of AlphaGo defeating Lee Sedol being a catalyst for top Go players being able to rapidly increase their own rankings shortly after.
Few people (perhaps none) expected LLMs to be good at chess. Nevertheless, as the article explains, there was buzz around a year ago that LLMs were good at chess.
> It has no idea about the quality of it's data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation which clearly chess requires.
No. You can definitely train a model to be really good at chess without "actual reasoning and deterministic computation".
Yeah, that is the "something weird" of the article.
Bro, it actually did play chess, didn't you read the article?
It sorta played chess- he let it generate up to ten moves, throwing away any that weren't legal, and if no legal move was generated by the 10th try he picked a random legal move. He does not say how many times he had to provide a random move, or how many times illegal moves were generated.
You're right it's not in this blog but turbo-instruct's chess ability has been pretty thoroughly tested and it does play chess.
https://github.com/adamkarvonen/chess_gpt_eval
Ah, I didn't see the illegal move discarding.
That was for the OpenAI games- including the ones that won. For the ones he ran himself with open source LLM's he restricted their grammar to just be legal moves, so it could only respond with a legal move. But that was because of a separate process he added on top of the LLM.
Again, this isn't exactly HAL playing chess.
> I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization, whatever that is.
It's just a lossy compression of all of the parameters, probably not important, right?
Probably important when competing against unquantized ones from OpenAI.
Notably: there were other OpenAI models that weren't quantized, but also performed poorly.
I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. Even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
That said, for the sake of completeness, modern chess engines (with high quality chess-specific models as part of their toolset) are fully capable of, at minimum, tying every player alive or dead, every time. If the opponent makes one mistake, even a very small one, they will lose.
While writing this I absently wondered whether, if you increased the skill level of Stockfish, maybe to maximum, or at least to an 1800+ Elo player, you would see more successful games. Even then, it would only be because the "narrower training data" at that level (i.e. advanced players won't play trash moves) will probably get you more wins in your graph, but it won't indicate any better play; it will just be a reflection of less noise: fewer, more reinforced known positions.
> I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. Even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
Indeed. As has been pointed out before, the number of possible chess positions easily, vastly dwarfs even the wildest possible estimate of the number of atoms in the known universe.
Sure, but so does the number of paragraphs in the English language, and yet LLMs seem to do pretty well at that. I don't think the number of configurations is particularly relevant.
(And it's honestly quite impressive that LLMs can play it at all, but not at all surprising that it loses pretty handily to something which is explicitly designed to search, as opposed to simply feed-forward a decision)
Not true if we’re talking sensible chess moves.
What about the number of possible positions where an idiotic move hasn't been played? Perhaps the search space could be reduced quite a bit.
Unless there is an apparently idiotic move that can lead to an 'island of intelligence'.
Since we're mentioning Shannon... What is the minimum representative sample size of that problem space? Is it close enough to the number of freely available chess moves on the Internet and in books?
> I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good.
Yeah, once you've deviated from a sequence you're lost.
Maybe approaching it by learning the best move in billions/trillions of positions, and feeding that into some AI could work better. Similar positions often have the same kind of best move.
Honestly, I think that once you discard the moves one would never make, and account for symmetries/effectively similar board positions (ones that could be detected by a very simple pattern matcher), chess might not be that big a game at all.
you should try it and post a rebuttal :)
I found a related set of experiments that include gpt-3.5-turbo-instruct, gpt-3.5-turbo and gpt-4.
Same surprising conclusion: gpt-3.5-turbo-instruct is much better at chess.
https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
I’d bet it’s using function calling out to a real chess engine. It could probably be proven with a timing analysis to see how inference time changes/doesn’t with number of tokens or game complexity.
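A rough sketch of that kind of timing probe (the model name is a placeholder; network jitter means you'd want many samples, and ideally a streaming time-to-first-token measurement rather than total latency):

```python
import time
from openai import OpenAI

client = OpenAI()

def median_latency(prompt: str, n: int = 20) -> float:
    """Median wall-clock time for a 1-token continuation of `prompt`."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        client.completions.create(
            model="gpt-3.5-turbo-instruct",  # placeholder
            prompt=prompt,
            max_tokens=1,
            temperature=0.0,
        )
        samples.append(time.perf_counter() - t0)
    return sorted(samples)[n // 2]

# Compare: a long chess transcript vs. an equally long non-chess prompt.
# If a chess engine were being consulted, latency should scale with position
# complexity rather than just with prompt length.
```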
?? why would openai even want to secretly embed chess function calling into an incredibly old model? if they wanted to trick people into thinking their models are super good at chess why wouldn't they just do that to gpt-4o?
The idea is that they embedded this when it was a new model, as part of the hype before GPT-4. The fake-it-till-you-make-it hope was that GPT-4 would be so good it could actually play chess. Then it turned out GPT-4 sucked at chess as well, and OpenAI quietly dropped any mention of chess. But it would be too suspicious to remove a well-documented feature from the old model, so it's left there and can be chalked up as a random event.
If it were calling to a real chess engine there would be no illegal moves.
The instances of that happening are likely the LLM failing to call the engine for whatever reason and falling back to inference.
OpenAI has a TON of experience making game-playing AI. That was their focus for years, if you recall. So it seems like they made one model good at chess to see if it had an overall impact on intelligence (just as learning chess might make people smarter, or learning math might make people smarter, or learning programming might make people smarter)
Playing is a thing strongly related to an abstract representation of the game in game states. Even if the player does not realize it, with chess it's really about shallow or beam search within the possible moves.
LLMs don't do reasoning or exploration; they write text based on previous text. So to us it may seem like playing, but it is really smart guesswork based on previous games. It's like Kasparov writing moves without imagining the actual placement.
What would be interesting is to see whether a model, given only the rules, will play. I bet it won’t.
At this moment it's replaying by memory but definitely not chasing goals. There's no such thing as forward attention yet, and beam search is expensive enough, so one would prefer to actually fall back to classic chess algorithms.
I think you're confusing OpenAI and DeepMind.
OpenAI has never done anything except conversational agents.
Very wrong. The first time most people here probably heard about OpenAI back in 2017 or so was their DotA 2 bot.
https://en.wikipedia.org/wiki/OpenAI_Five
https://openai.com/index/gym-retro/
> OpenAI has never done anything except conversational agents.
Tell me you haven't been following this field without telling me you haven't been following this field[0][1][2]?
[0]: https://github.com/openai/gym
[1]: https://openai.com/index/jukebox/
[2]: https://openai.com/index/openai-five-defeats-dota-2-world-ch...
They definitely have game-playing AI expertise, though: https://noambrown.github.io/
No, they started without conversation and only reinforcement learning on games, directly comparable to DeepMind.
“In the summer of 2018, simply training OpenAI's Dota 2 bots required renting 128,000 CPUs and 256 GPUs from Google for multiple weeks.”
At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.
VW got in a lot of trouble for this
Apples and oranges. VW actually cheated on regulatory testing to bypass legal requirements. So to be comparable, the government would first need to pass laws where e.g. only compilers that pass a certain benchmark are allowed to be used for purchasable products and then the developers would need to manipulate behaviour during those benchmarks.
There's a sliding scale of badness here. The emissions cheating (it wasn't just VW, incidentally; they were just the first uncovered. Fiat-Chrysler, Mercedes, GM and BMW were also caught doing it, with suspicions about others) was straight-up fraud.
It used to be common for graphics drivers to outright cheat on benchmarks (the actual image produced would not be the same as it would have been if a benchmark had not been detected); this was arguably, fraud.
It used to be common for mobile phone manufacturers to allow the SoC to operate in a thermal mode that was never available to real users when it detected a benchmark was being used. This is still, IMO, kinda fraud-y.
Optimisation for common benchmark cases where the thing still actually _works_, and where the optimisation is available to normal users where applicable, is less egregious, though, still, IMO, Not Great.
The only difference is the legality. From an integrity point of view it's basically the same
I think breaking a law is more unethical than not breaking a law.
Also, legality isn't the only difference in the VW case. With VW, they had a "good emissions" mode. They enabled it during the test, but disabled it during regular driving, even though it would have worked there too. With compilers, there's no "good performance" mode that would work during regular usage that they're disabling during regular usage.
> I think breaking a law is more unethical than not breaking a law.
It sounds like a mismatch of definition, but I doubt you're ambivalent about a behavior right until the moment it becomes illegal, after which you think it unethical. Law is the codification and enforcement of a social contract, not the creation of it.
But following the law is itself a load-bearing aspect of the social contract. Violating building codes, for example, might not cause immediate harm if the work is competent but unusual, yet it's important that people follow them, just because you don't want arbitrariness in matters of safety. The objective ruleset itself is a value beyond the rules themselves, if the rules are sensible and in accordance with deeper values; of course they sometimes aren't, in which case we value civil disobedience and activism.
Also, while laws ideally are inspired by an ethical social contract, the codification process is long, complex and far from perfect. And even in the best of cases, rules concerning permissible behavior are enforced extremely sparingly, simply because it's neither possible nor desirable to detect and deal with all infractions. Nor are they applied blindly and equally. As actually applied, a law is definitely not even close to some ethical ideal; sometimes it's outright opposed to it, even.
Law and ethics are barely related, in practice.
For example in the vehicle emissions context, it's worth noting that even well before VW was caught, the actions of likely all carmakers affected by the regulations (not necessarily to the same extent) were clearly unethical. The rules had been subject to intense, clearly unethical lobbying for years, and so even the legal lab results bore little resemblance to practical on-the-road results, through systematic (yet legal) abuse. I wouldn't be surprised to learn that even what was measured intentionally diverged from what is harmful, in a profitable way. It's a good thing VW was made an example of - but clearly it's not like that resolved the general problem of harmful vehicle emissions. Optimistically, it might have signaled to the rest of the industry and VW in particular to stretch the rules less in the future.
>I doubt you're ambivalent about a behavior right until the moment it becomes illegal, after which you think it unethical.
There are many cases where I think that. Examples:
* Underage drinking. If it's legal for someone to drink, I think it's in general ethical. If it's illegal, I think it's in general unethical.
* Tax avoidance strategies. If the IRS says a strategy is allowed, I think it's ethical. If the IRS says a strategy is not allowed, I think it's unethical.
* Right on red. If the government says right on red is allowed, I think it's ethical. If the government (e.g. NYC) says right on red is not allowed, I think it's unethical.
The VW case was emissions regulations. I think they have an ethical obligation to obey emissions regulations. In the absence of regulations, it's not an obvious ethical problem to prioritize fuel efficiency instead of emissions (that's I believe what VW was doing).
Drinking and right turns are unethical if they're negligent. They're not unethical if they're not negligent. The government is trying to reduce negligence by enacting preventative measures to stop ALL right turns and ALL drinking in certain contexts that are more likely to yield negligence, or where the negligence would be particularly harmful, but that doesn't change whether or not the behavior itself is negligent.
You might consider disregarding the government's preventative measures unethical, and doing those things might be the way someone disregards the government's protective guidelines, but that doesn't make those actions unethical any more than governments explicitly legalizing something makes it ethical.
To use a clearer example, the ethicality of abortion— regardless of what you think of it— is not changed by its legal status. You might consider violating the law unethical, so breaking abortion laws would constitute the same ethical violation as underage drinking, but those laws don’t change the ethics of abortion itself. People who consider it unethical still consider it unethical where it’s legal, and those that consider it ethical still consider it ethical where it’s not legal.
It's not so simple. An analogy is the Rust formatter that has no options so everyone just uses the same style. It's minimally "unethical" to use idiosyncratic Rust style just because it goes against the convention so people will wonder why you're so special, etc.
If the rules themselves are bad and go against deeper morality, then it's a different situation; violating laws out of civil disobedience, emergent need, or with a principled stance is different from wanton, arbitrary, selfish cheating.
If a law is particularly unjust, violating the law might itself be virtuous. If the law is adequate and sensible, violating it is usually wrong even if the violating action could be legal in another sensible jurisdiction.
> but that doesn’t make those actions unethical any more than governments explicitly legalizing something makes it ethical
That is, sometimes, sufficient.
If the government says 'the seller of a house must disclose issues', then I rely on the law being followed; if you sell and leave the country, you have defrauded me.
However if I live in a ‘buyer beware’ jurisdiction, then I know I cannot trust the seller and I hire a surveyor and take insurance.
There is a degree of setting expectations - if there is a rule, even if it's a terrible rule, I as an individual can at least take some countermeasures.
You can't take countermeasures against all forms of illegal behaviour, because there is an infinite number of them. And a truly insane person is not predictable at all.
I agree if they're negligent they're unethical. But I also think if they're illegal they're generally unethical. In situations where some other right is more important than the law, underage drinking or an illegal right on red would be ethical, such as if alcohol is needed as an emergency pain reliever, or a small amount for religious worship, or if you need to drive to the hospital fast in an emergency.
Abortion opponents view it as killing an innocent person. So that's unethical regardless of whether it's legal. I'm not contesting in any way that legal things can be unethical. Abortion supporters view it as a human right, and that right is more important than the law.
Right on red, underage drinking, and increasing car emissions aren't human rights. So outside of extenuating circumstances, if they're illegal, I see them as unethical.
> Abortion opponents view it as killing an innocent person. So that's unethical regardless of whether it's legal.
So it doesn't matter that a very small percentage of the world's population believes life begins at conception, it's still unethical? Or is everything unethical that anyone thinks is unethical across the board, regardless of the other factors? Since some vegans believe eating honey is unethical, does that mean it's unethical for everybody, or would it only be unethical if it was illegal?
In autocracies where all newly married couples were legally compelled to allow the local lord to rape the bride before they consummated the marriage, avoiding that would be unethical?
Were the sit-in protest of the American civil rights era unethical? They were illegal.
Was it unethical to hide people from the Nazis when they were searching for people to exterminate? It was against the law.
Was apartheid ethical? It was the law.
Was slavery ethical? It was the law.
Were the jim crow laws ethical?
I have to say, I just fundamentally don't understand your faith in the infallibility of humanity's leaders and governing structures. Do I think it's generally a good idea to follow the law? Of course. But there are so very many laws that are clearly unethical. I think your conflating legal correctness with mores with core foundational ethics is rather strange.
The right on red example is interesting because in that case, the law changes how other drivers and pedestrians will behave, in ways that make it pretty much always unsafe.
That just changes the parameters of negligence. On a country road in the middle of a bunch of farm land where you can see for miles, it doesn’t change a thing.
Outsourcing your morality to politicians past and present is not a particularly useful framework.
I'm not outsourcing my morality. There are plenty of actions that are legal that are immoral.
I don't think the government's job is to enforce morality. The government's job is to set up a framework for society to help people get along.
Ethics are only morality if you spend your entire time in human social contexts. Otherwise morality is a bit larger, and ethics are a special case of group recognized good and bad behaviors.
Lawful good. Or perhaps even lawful neutral?
What if I make sure to have a drink once a week for the summer with my 18 year old before they go to college because I want them to understand what it's like before they go binge with friends? Is that not ethical?
Speeding to the hospital in an emergency? Lying to Nazis to save a Jew?
Law and ethics are more correlated than some are saying here, but the map is not the territory, and it never will be.
There can be situations where someone's rights are more important than the law. In that case it's ethical to break the law. Speeding to the hospital and lying to Nazis are cases of that. The drinking with your 18 year old, I'm not sure, maybe.
My point though, is that in general, when there's not a right that outweighs the law, it's unethical to break the law.
Unless following an unethical law would in itself be unethical; then breaking the unethical law would be the only ethical choice. In this case cheating on emissions, which I see as unethical but also advantageous for the consumer, should have been done openly if VW saw following the law as unethical. Ethics and morality are subjective to understanding, and law is only a crude approximation of divinity. Though I would argue that each person on earth, through a shared common experience, has a rough and general idea of right from wrong... though I'm not always certain they pay attention to it.
Overfitting on test data absolutely does mean that the model would perform better in benchmarks than it would in real life use cases.
I think you're talking about something different from what sigmoid10 was talking about. sigmoid10 said "manipulate behaviour during those benchmarks". I interpreted that to mean the compiler detects if a benchmark is going on and alters its behavior only then. So this wouldn't impact real life use cases.
ethics should inform law, not the reverse
I agree that ethics should inform law. But I live in a society, and have an ethical duty to respect other members of society. And part of that duty is following the laws of society.
I disagree- presumably if an algorithm or hardware is optimized for a certain class of problem it really is good at it and always will be- which is still useful if you are actually using it for that. It’s just “studying for the test”- something I would expect to happen even if it is a bit misleading.
VW cheated such that the low emissions were only active during the test- it’s not that it was optimized for low emissions under the conditions they test for, but that you could not get those low emissions under any conditions in the real world. That's "cheating on the test" not "studying for the test."
> The only difference is the legality. From an integrity point of view it's basically the same
I think cheating about harming the environment is another important difference.
How so? VW intentionally changed the operation of the vehicle so that its emissions met the test requirements during the test and then went back to typical operation conditions afterwards.
VW was breaking the law in a way that harmed society but arguably helped the individual driver of the VW car, who gets better performance yet still passes the emissions test.
It might sound funny in retrospect, but some of us actually bought VW cars on the assumption that, if biodiesel-powered, it would be more green.
And afaik the emissions were still miles ahead of a car from 20 years prior, just not quite as extremely stringent as requested.
"not quite as extremely stringent as requested" is a funny way to say they were emitting 40 times more toxic fumes than permitted by law.
40x infinitesimal is still...infinitesimal.
Right - in either case it's lying, which is crossing a moral line (which is far more important to avoid than a legal line).
That is not true. Even ChatGPT understands how they are different, I won’t paste the whole response but here are the differences it highlights:
Key differences:
1. Intent and harm: VW's actions directly violated laws and had environmental and health consequences. Optimizing LLMs for chess benchmarks, while arguably misleading, doesn't have immediate real-world harms.
2. Scope: Chess-specific optimization is generally a transparent choice within AI research. It's not a hidden "defeat device" but rather an explicit design goal.
3. Broader impact: LLMs fine-tuned for benchmarks often still retain general-purpose capabilities. They aren't necessarily "broken" outside chess, whereas VW cars fundamentally failed to meet emissions standards.
Tesla cheats by using electric motors and deferring emissions standards to somebody else :D Wait, I really think that's a good thing, but once Hulk Hogan is confirmed administrator of the EPA, he might actually use this argument against Teslas and other electric vehicles.
True. But they did not optimize for a specific case. They detected the test and then enabled a special regime, that was not used normally.
It's as if OpenAI detected the IP address of a benchmark organization and then used a completely different model.
This is the apples to apples version. Perhaps it might be more accurate to say that when detecting a benchmark attempt, the model tries the prompt 3 times with different seeds and then picks the best answer, whereas it just zero-shots the prompt in everyday use.
I say this because the test still uses the same hardware (model) but changed the way it behaved by running emissions-friendly parameters (a different execution framework) that wouldn't have been used in everyday driving, where fuel efficiency and performance optimized parameters were used instead.
What I'd like to know is whether it actually was unethical or not. The overall carbon footprint of the lower fuel consumption setting, with fuel manufacturing and distribution factored in, might easily have been more impactful than the emissions mode, which typically does not factor in fuel consumed.
Most of the time these days compiler writers are not cheating like VW did. In the 1980s compiler writers would insert code to recognize performance tests and then cheat - output values hard coded into the compiler instead of running the algorithm. Which is the type of thing that VW got in trouble for.
These days most compilers are trying to make the general case of code fast and they rarely look for benchmarks. I won't say they never do this - just that it is much less common - if only because magazine reviews/benchmarks are not nearly as important as they used to be and so the incentive is gone.
Actually performing well on a task that is used as a benchmark is not comparable to deceiving authorities about how much toxic gas you are releasing.
Only because what VW did is illegal, was super large scale, and could be linked to a lot of indirect deaths through the additional pollution.
Benchmark optimizations are slightly embarrassing at worst, and an "optimization for a specific use case" at best. There's no regulation against optimizing for a particular task, everyone does it all the time, in some cases it's just not communicated transparently.
Phone manufacturers were caught "optimizing" for benchmarks again and again, removing power limits to boost scores. Hard to name an example without searching the net because it's at most a faux pas.
GPT-3.5 did not “cheat” on chess benchmarks, though, it was actually just better at chess?
I think the OP's point is that GPT-3.5 may have a chess engine baked into its (closed and unavailable) code for PR purposes. So it "realizes" that "hey, I'm playing a game of chess" and then, rather than doing whatever it normally does, it just acts as a front-end for a quite good chess engine.
I see – my initial interpretation of OP’s “special case” was “Theory 2: GPT-3.5-instruct was trained on more chess games.”
But I guess it’s also a possibility that they had a real chess engine hiding in there.
Not quite. VW got in trouble for running _different_ software in test vs prod. These optimizations are all going to "prod" but are only useful for specific targets (a specific game in this case).
> VW got in trouble for running _different_ software in test vs prod.
Not quite. They programmed their "prod" software to recognise the circumstances of a laboratory test and behave differently. Namely during laboratory emissions testing they would activate emission control features they would not activate otherwise.
The software was the same they flash on production cars. They were production cars. You could take a random car from a random dealership and it would have done the same trickery in the lab.
I disagree with your distinction on the environments but understand your argument. Production for VW, to me, is "on the road when a customer is using your product as intended". Using the same artifact for those different environments isn't the same as "running that in production".
“Test” environment is the domain of prototype cars driving at the proving ground. It is an internal affair, only for employees and contractors. The software is compiled on some engineer’s laptop and uploaded on the ECU by an engineer manually. No two cars are ever the same, everything is in flux. The number of cars are small.
“Production” is a factory line producing cars. The software is uploaded on the ECUs by some factory machine automatically. Each car are exactly the same, with the exact same software version on thousands and thousands of cars. The cars are sold to customers.
Some small number of these production cars are sent for regulatory compliance checks to third parties. But those cars don't suddenly become non-production cars just because someone sticks a probe in their exhausts. The same way gmail's production servers don't suddenly turn into test environments just because a user opens the network tab in their browser's dev tools to see what kind of requests fly on the wire.
It’s approximately bad, like most of ML
On one side:
Would you expect a model trained on no Spanish data to do well on Spanish?
On the other:
Is it okay to train on the MMLU test set?
This is a 10 year old story. It's very interesting which ones stay in the public consciousness.
Funny response; you're not wrong.
We detached this subthread from https://news.ycombinator.com/item?id=42144784.
(Nothing wrong with it! It's just a bit more generic than the original topic.)
Can you try increasing compute in the problem search space, not in the training space? What this means is, give it more compute to think during inference by not forcing any model to "only output the answer in algebraic notation" but do CoT prompting: "1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3. Make your move"
Or whatever you deem a good step by step instruction of what an actual good beginner chess player might do.
Then try different notations, different prompt variations, temperatures and the other parameters. That all needs to go in your hyper-parameter-tuning.
One could try using DSPy for automatic prompt optimization.
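As a concrete sketch of that kind of prompt plus a validator (everything here is made up for illustration; the FEN/move formatting, the step wording and the "FINAL:" convention are exactly the sort of knobs worth tuning):

```python
import re
import chess

COT_PROMPT = """You are playing White. Current position (FEN): {fen}
Moves so far: {moves}

Step 1: Describe the position: material balance, immediate threats, undefended pieces.
Step 2: List 3 candidate moves; for each, give the most likely reply and your follow-up.
Step 3: Choose the best candidate.

Finish with exactly one line of the form:  FINAL: <move in SAN>
"""

def build_prompt(board: chess.Board) -> str:
    moves = " ".join(m.uci() for m in board.move_stack) or "(none)"
    return COT_PROMPT.format(fen=board.fen(), moves=moves)

def extract_move(reply: str, board: chess.Board) -> chess.Move | None:
    """Parse the FINAL line and reject illegal output instead of trusting it."""
    m = re.search(r"FINAL:\s*(\S+)", reply)
    if not m:
        return None
    try:
        return board.parse_san(m.group(1))
    except ValueError:
        return None  # illegal or unparsable -> retry, lower temperature, or resign
```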
> 1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3.
Do these models actually think about a board? Chess engines do, as much as we can say that any machine thinks. But do LLMs?
It can be forced through inference with CoT-type stuff. Spend tokens at each stage to draw the board, for example, then spend tokens restating the rules of the game, then spend tokens restating the heuristics like piece value, and then spend tokens doing a minmax n-ply search.
Wildly inefficient? Probably. Could maybe generate some python to make more efficient? Maybe, yeah.
Essentially the user would have to teach GPT to play chess, or training would have to fine-tune it towards these CoT steps, etc.
Yeah, expecting an immediate answer definitely hurts results, especially in the later stages. Another possible improvement: every 2 steps, show the current board state and repeat the moves still to be processed, before analysing the final position.
Maybe that one which plays chess well is calling out to a real chess engine.
It's not:
1. That would just be plain bizarre.
2. It plays like what you'd expect from an LLM that could play chess. That is, the level of play can be modulated by the prompt, and it doesn't manifest the same way that shifting the level of Stockfish etc. does. Also, the specific chess notation being prompted actually matters.
3. It's sensitive to how the position came to be. Clearly not an existing chess engine. https://github.com/dpaleka/llm-chess-proofgame
4. It does make illegal moves. It's rare (~5 in 8205) but it happens. https://github.com/adamkarvonen/chess_gpt_eval
5. You can (or, well, used to be able to) inspect the logprobs. I think OpenAI has stopped exposing this, but the link in 4 does show the author inspecting them for turbo-instruct.
> Also the specific chess notation being prompted actually matters
Couldn't this be evidence that it is using an engine? Maybe if you use the wrong notation it relies on the ANN rather than calling to the engine.
Likewise:
- The sensitivity to game history is interesting, but is it actually true that other chess engines only look at current board state? Regardless, maybe it's not an existing chess engine! I would think OpenAI has some custom chess engine built as a side project, PoC, etc. In particular this engine might be neural and trained on actual games rather than board positions, which could explain dependency on past moves. Note that the engine is not actually very good. Does AlphaZero depend on move history? (Genuine question, I am not sure. But it does seem likely.)
- I think the illegal moves can be explained similarly to why gpt-o1 sometimes screws up easy computations despite having access to Python: an LLM having access to a tool does not guarantee it always uses that tool.
I realize there are holes in the argument, but I genuinely don't think these holes are as big as the ones in "why is gpt-3.5-turbo-instruct so much better at chess than gpt-4?"
> Couldn’t this be evidence that it is using an engine?
A test would be to measure its performance against more difficult versions of Stockfish. A real chess engine would have a higher ceiling.
Much more likely is that this model was trained on more chess PGNs. You can call that a "neural engine" if you'd like, but it is the simplest solution and explains the mistakes it is making.
Game state isn’t just what you can see on the board. It includes the 50 move rule and castling rights. Those were encoded as layers in AlphaZero along with prior positions of pieces. (8 prior positions if I’m remembering correctly.)
The author thinks this is unlikely because it only has an ~1800 ELO. But OpenAI is shady as hell, and I could absolutely see the following purely hypothetical scenario:
- In 2022 Brockman and Sutskever have an unshakeable belief that Scaling Is All You Need, and since GPT-4 has a ton of chess in its pretraining data it will definitely be able to play competent amateur chess when it's finished.
- A ton of people have pointed out that ChatGPT-3.5 doesn't even slightly understand chess despite seeming fluency in the lingo. People start to whisper that transformers cannot actually create plans.
- Therefore OpenAI hatches an impulsive scheme: release an "instruction-tuned" GPT-3.5 with an embedded chess engine that is not a grandmaster, but can play competent chess, ideally just below the ELO that GPT-4 is projected to have.
- Success! The waters are muddied: GPT enthusiasts triumphantly announce that LLMs can play chess, it just took a bit more data and fine-tuning. The haters were wrong: look at all the planning GPT is doing!
- Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do competitors' foundation LLMs which otherwise outperform GPT-3.5. The scaling "laws" failed here, since they were never laws in the first place. OpenAI accepts that scaling transformers won't easily solve the chess problem, then realizes that if they include the chess engine with GPT-4 without publicly acknowledging it, then Anthropic and Facebook will call out the performance as aberrational and suspicious. But publicly acknowledging a chess engine is even worse: the only reason to include the chess engine is to mislead users into thinking GPT is capable of general-purpose planning.
- Therefore in later GPT versions they don't include the engine, but it's too late to remove it from gpt-3.5-turbo-instruct: people might accept the (specious) claim that GPT-4's size accidentally sabotaged its chess abilities, but they'll ask tough questions about performance degradation within the same model.
I realize this is convoluted and depends on conjecture. But OpenAI has a history with misleading demos - e.g. their Rubik's cube robot which in fact used a classical algorithm but was presented as reinforcement learning. I think "OpenAI lied" is the most likely scenario. It is far more likely than "OpenAI solved the problem honestly in GPT-3.5, but forgot how they did it with GPT-4," and a bit more likely than "scaling transformers slightly helps performance when playing Othello but severely sabotages performance when playing chess."
Eh, OpenAI really isn't as shady as hell, from what I've seen on the inside for 3 years. Rubik's cube hand was before me, but in my time here I haven't seen anything I'd call shady (though obviously the non-disparagement clauses were a misstep that's now been fixed). Most people are genuinely trying to build cool things and do right by our customers. I've never seen anyone try to cheat on evals or cheat customers, and we take our commitments on data privacy seriously.
I was one of the first people to play chess against the base GPT-4 model, and it blew my mind by how well it played. What many people don't realize is that chess performance is extremely sensitive to prompting. The reason gpt-3.5-turbo-instruct does so well is that it can be prompted to complete PGNs. All the other models use the chat format. This explains pretty much everything in the blog post. If you fine-tune a chat model, you can pretty easily recover the performance seen in 3.5-turbo-instruct.
There's nothing shady going on, I promise.
Very good scenario. One variation: some researcher or division in OpenAI performs all of the above steps to get a raise. The whole field is predicated on rewarding the appearance of ability.
Not that convoluted really
It's pretty convoluted, requires a ton of steps, mind-reading, and odd sequencing.*
If you share every prior, and aren't particularly concerned with being disciplined in treating conversation as proposing a logical argument (I'm not myself; people find it off-putting), it probably wouldn't seem at all convoluted.
* Layer chess into gpt-3.5-instruct only, but not ChatGPT, not GPT-4, to defeat the naysayers when GPT-4 comes out? (shrugs) If the issues with that are unclear, I can lay it out more.
** FWIW, at the time, pre-ChatGPT, before the hype, there wasn't a huge focus on chess, nor a ton of naysayers to defeat. It would have been bizarre to put this much energy into it, modulo the scatter-brained thinking in *.
It's not that many steps. I'm sure we've all seen our sales teams selling features that aren't in the application or exaggerating features before they're fully complete.
To be clear, I'm not saying that the theory is true, just that I could believe something like that could happen.
This is likely. From example games, it not only knows the rules (which would be impressive by itself; just making legal moves is not trivial), it also shows some planning capability (it plays combinations spanning several moves).
Probably not calling out to one but it would not surprise me at all if they added more chess PGNs into their training data. Chess is a bit special in AI in that it’s still seen as a mark of pure intelligence in some respect.
If you tested it on an equally strategic but less popular game I highly doubt you would see the same performance.
I think that's the most plausible theory that would explain the sudden hike from gpt-3.5-turbo to gpt-3.5-turbo-instruct, and again the sudden regression in gpt-4*.
OpenAI also seem to augment the LLM with some type of VM or a Python interpreter. Maybe they run a simple chess engine such as Sunfish [1] which is around 1900-2000 ELO [2]?
[1] https://github.com/thomasahle/sunfish
[2] https://lichess.org/@/sunfish-engine
this possibility is discussed in the article and deemed unlikely
Note: the possibility is not mentioned in the article but rather in the comments [1]. I had to click a bit to see it.
The fact that the one closed-source model is the only one that plays well seems to me like a clear case of the interface doing some of the work. If you ask ChatGPT to count to 10,000 (something that most LLMs can't do, for known reasons), you get an answer that's clearly pre-programmed. I'm sure the same is happening here (and with many, many other tasks). The author argues against it by saying "but why isn't it better?", which doesn't seem like the best argument: I can imagine that typical ChatGPT users enjoy the product more if they have a chance to win once in a while.
[1] https://dynomight.substack.com/p/chess/comment/77190852
What do you mean LLMs can't count to 10,000 for known reasons?
Separately, if you are able to show OpenAI is serving pre canned responses in some instances, instead of running inference, you will get a ton of attention if you write it up.
I'm not saying this in an aggro tone; it's a genuinely interesting subject to me, because I initially wrote off LLMs on the assumption that this was going on.* Then I spent the last couple of years laughing at myself for thinking that they would do that. I'd be some mix of fascinated and horrified to see it come full circle.
* I can't remember what, exactly; it was as far back as 2018. But someone argued that OpenAI was patching in individual answers because scaling was dead and they had no answers, way, way before ChatGPT.
When it comes to counting, LLMs have a couple issues.
First, tokenization: the tokenization of 1229 is not guaranteed to be [1,2,2,9] but it could very well be [12,29] and the "+1" operation could easily generate tokens [123,0] depending on frequencies in your corpus. This constant shifting in tokens makes it really hard to learn rules for "+1" ([9,9] +1 is not [9,10]). This is also why LLMs tend to fail at tasks like "how many letters does this word have?": https://news.ycombinator.com/item?id=41058318
Second, you need your network to understand that "+1" is worth learning. Writing "+1" as a combination of sigmoid, products and additions over normalized floating point values (hello loss of precision) is not trivial without degrading a chunk of your network, and what for? After all, math is not in the domain of language and, since we're not training an LMM here, your loss function may miss it entirely.
And finally there's statistics: the three-legged-dog problem is figuring out that a dog has four legs from corpora when no one ever writes "the four-legged dog" because it's obvious, but every reference to an unusual dog will include said description. So if people write "1+1 equals 3" satirically then your network may pick that up as fact. And how often has your network seen the result of "6372 + 1"?
But you don't have to take my word for it - take an open LLM and ask it to generate integers between 7824 and 9954. I'm not optimistic that it will make it through without hallucinations.
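An easy way to see the token-boundary issue for yourself is to run the numbers through a tokenizer. Here's a short check using tiktoken; the encoding name is just one common choice, and the exact splits depend on which encoding a given model uses:

    # Inspect how a BPE tokenizer splits integers; splits vary by encoding,
    # which is the point: "+1" has to work across shifting token boundaries.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is illustrative

    for n in (1229, 7824, 7825, 9954):
        ids = enc.encode(str(n))
        pieces = [enc.decode([i]) for i in ids]
        print(n, ids, pieces)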
> But you don't have to take my word for it - take an open LLM and ask it to generate integers between 7824 and 9954.
Been excited to try this all day and finally got around to it: Llama 3.1 8B did it. It's my app built on llama.cpp, no shenanigans, temp 0, top p 100, 4-bit quantization, model name in screenshot [^1].
I did 7824 to 8948; it protested more for 9954, which made me reconsider whether I'd want to read that many to double check :) and I figured x + 1024 is isomorphic to the original case of you trying it on OpenAI and wondering if it wasn't the result of inference.
My prior was of course it would do this, it's a sequence. I understand e.g. the need for token-healing cases, as you correctly note; that could mess things up when there's e.g. notation in an equation that prevents the "correct" digit. I don't see any reason why it'd mess up a sequential list of integers.
In general, as long as it's on topic, I find the handwaving people do about tokenization being a problem to be a bit silly. I'd definitely caution against using the post you linked as a citation; it reads like a rote repetition of the idea that tokenization causes problems, an idea that spreads like a game of telephone.
It's also a perfect example of the weakness of the genre: just because it sees [5077, 5068, 5938] instead of "strawberry" doesn't mean it can't infer 5077 = st = 0 r's, 5068 = raw = 1 r, 5938 = berry = 2 r's. In fact, it infers things from broken-up subsequences all the time -- it's how it works! If doing single-character tokenization got us free math / counting reliability, we'd very quickly switch to it.
(not saying you're advocating for the argument or you're misinformed, just, speaking colloquially like I would with a friend over a beer)
[^1] https://imgur.com/a/vEvu2GD
I don't see that discussed, could you quote it?
Theory 5: GPT-3.5-instruct plays chess by calling a traditional chess engine.
Just think about the trade-off from OpenAI's side here: they're going to add a bunch of complexity to gpt-3.5 to let it call out to engines (either an external system monitoring all outputs for chess-related stuff, or some kind of tool-assisted CoT, for instance) just so it can play chess incorrectly a high percentage of the time, and even when it doesn't, at a mere 1800 ELO? In return for some mentions in a few relatively obscure blog posts? Doesn't make any sense to me as an explanation.
But there could be a simple explanation. For example, they could have tested many "engines" when developing function calling and they just left them in there. They just happened to connect to a basic chess playing algorithm and nothing sophisticated.
Also, it makes a lot of sense if you expect people to play chess against the LLM, especially if you are later training future models on the chats.
This still requires a lot of coincidences, like they chose to use a terrible chess engine for their external tool (why?), they left it on in the background for all calls via all APIs for only gpt-3.5-turbo-instruct (why?), they see business value in this specific model being good at chess vs other things (why?).
You say it makes sense but how does it make sense for OpenAI to add overhead to all of its API calls for the super niche case of people playing 1800 ELO chess/chat bots? (that often play illegal moves, you can go try it yourself)
Could be a pilot implementation to learn how to link up external specialist engines. Chess would be the obvious example to start with because the problem is so well known and standardized, and specialist engines are easily available. If they ever want to offer an integration like that to customers (who might have some existing rule-based engine in house), they need to know everything they can about expected cost and performance.
This doesn't address its terrible performance. If it were touching anything like a real engine it would be playing at a superhuman level, not the level of an upper-tier beginner.
The way I read the article, it's just as terrible as you would expect from pure word association, except for one version that's an outlier in not being terrible at all within a well-defined search depth, and just as terrible beyond that. And only this outlier is the weird thing referenced in the headline.
I read this as: the outlier version is connecting to an engine, and that engine happens to be parameterized for a not particularly deep search.
If it's an exercise in integration they don't need to waste cycles on the engine playing awesome - it's enough for validation if the integration result is noticeably less bad than the LLM alone rambling about trying to sound like a chess expert.
In this hypothetical, the cycles aren't being wasted on the engine, they're being wasted on running a 200b parameter LLM for longer than necessary in order to play chess badly instead of terribly. An engine playing superhuman chess takes a comparatively irrelevant amount of compute these days.
If it's fine up to a certain depth it's much more likely that it was trained on an opening book imo.
What nobody has bothered to explain with this crazy theory is why OpenAI would care to do this at enormous expense to themselves.
> If it's fine up to a certain depth it's much more likely that it was trained on an opening book imo.
Yeah, that thought crossed my mind as well. I dismissed that thought on the assumption that the measurements in the blog post weren't done from openings but from later stage game states, but I did not verify that assumption, I might have been wrong.
As for the insignificance of game cycles vs LLM cycles, sure. But if it's an integration experiment they might buy the chess API from some external service with a big disconnect between prices and cycle cost, or host one separately where they simply did not bother with a scaling mechanism, since they could make it good enough for detection by calling with low depth parameters.
And the last uncertainty, where I'm much further out of my knowledge: we don't know how many calls to the engine a single prompt might cause. Who knows how many cycles of "inner dialogue" refinement might run for a single prompt, and how often the chess engine might get consulted for prompts that aren't really related to chess before the guessing machine finally rejects that possibility. The number of chess engine calls might be massive, big enough to make cycles per call a meaningful factor again.
That would have immediately given away that something must be off. If you want to do this in a subtle way that increases the hype around GPT-3.5 at the time, giving it a good-but-not-too-good rating would be the way to go.
If you want to keep adding conditions to an already-complex theory, you'll need an equally complex set of observations to justify it.
You're the one imposing an additional criterion, that OpenAI must have chosen the highest setting on a chess engine, and demanding that this additional criterion be used to explain the facts.
I agree with GP that if a 'fine tuning' of GPT 3.5 came out the gate playing at top Stockfish level, people would have been extremely suspicious of that. So in my accounting of the unknowns here, the fact that it doesn't play at the top level provides no additional information with which to resolve the question.
That's not an additional criterion, it's simply the most likely version of this hypothetical - a superhuman engine is much easier to integrate than an 1800 elo engine that makes invalid moves, for the simple reason that the vast majority of chess engines play at >1800 elo out of the box and don't make invalid moves ever (they are way past that level on a log-scale, actually).
This doesn't require the "highest" settings, it requires any settings whatsoever.
But anyway to spell out some of the huge list of unjustified conditions here:
1. OpenAI spent a lot of time and money R&Ding chess into 3.5-turbo-instruct via external call.
2. They used a terrible chess engine for some reason.
3. They did this deliberately because they didn't want to get "caught" for some reason.
4. They removed this functionality in all other versions of gpt for some reason ...etc
Much simpler theory:
1. They used more chess data when training that model.
(there are other competing much simpler theories too)
My point is that given a prior of 'wired in a chess engine', my posterior odds that they would make it plausibly-good and not implausibly-good approaches one.
For a variety of boring reasons, I'm nearly convinced that what they did was either, as you say, train heavily on chess texts, or a plausible variation of using mixture-of-experts and having one of them be an LLM chess savant.
Most of the sources I can find on the ELO of Stockfish at the lowest setting are around 1350, so that part also contributes no weights to the odds, because it's trivially possible to field a weak chess engine.
The distinction between prior and posterior odds is critical here. Given a decision to cheat (which I believe is counterfactual on priors), all of the things you're trying to Occam's Razor here are trivially easy to do.
So the only interesting considerations are the ones which factor into the likelihood of them deciding to cheat. If you even want to call it that, shelling out to a chess engine is defensible, although the stochastic fault injection (which is five lines of Python) in that explanation of the data does feel like cheating to me.
What I do consider relevant is that, based on what I know of LLMs, intensively training one to emit chess tokens seems almost banal in terms of outcomes. Also, while I don't trust OpenAI company culture much, I do think they're more interested in 'legitimately' weighting their products to pass benchmarks, or just building stuff with LLMs if you prefer.
I actually think their product would benefit from more code which detects "stuff normal programs should be doing" and uses them. There's been somewhat of a trend toward that, which makes the whole chatbot more useful. But I don't think that's what happened with this one edition of GPT 3.5.
Sorry, this is just conspiracy theorizing. I've tried it with GPT-3.5-instruct myself in the OpenAI playground, where the model clearly does nothing but auto-regression. No function calling there whatsoever.
Occam’s razor. I could build a good chess-playing wrapper around the OpenAI API (any version) that would consult a chess engine when presented with any board scenario, and introduce some randomness so that it doesn’t play too well.
I can’t imagine any programmer in this thread would be entertaining a more complicated scenario than this. You can substitute chess for any formal system that has a reliable oracle.
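To show how little machinery that would take, here's a rough sketch of such a wrapper using python-chess and a local Stockfish binary on PATH. The blunder rate and search depth are arbitrary knobs, and nothing here is a claim about what OpenAI actually does:

    # Hypothetical "delegate to an engine, but not too well" wrapper.
    import random
    import chess
    import chess.engine

    def wrapped_move(board: chess.Board, blunder_rate: float = 0.1) -> chess.Move:
        if random.random() < blunder_rate:
            # Occasionally play a random legal move so the play looks fallible.
            return random.choice(list(board.legal_moves))
        engine = chess.engine.SimpleEngine.popen_uci("stockfish")
        try:
            # Shallow search so the wrapper doesn't play suspiciously well.
            return engine.play(board, chess.engine.Limit(depth=6)).move
        finally:
            engine.quit()

An intent-detection step in front of this (is the prompt a chess position?) is the only other piece such a wrapper would need.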
Yes! I also was waiting for this seemingly obvious answer in the article as well. Hopefully the author will see these comments.
I have this hypothesis as well: that OpenAI added a lot of "classic" algorithms and rules over time (e.g. rules for filtering, etc.)
I don't necessarily believe this for a second but I'm going to suggest it because I'm feeling spicy.
OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
Multiple comments in this thread also say they couldn't reproduce the results for gpt3.5-turbo-instruct.
So what if the OP just happened to test at a time, or be IP bound to an instance, where the model was not nerfed? What if 3.5 and all subsequent OpenAI models can perform at this level but it's not strategic or cost effective for OpenAI to expose that consistently?
For the record, I don't actually believe this. But given the data it's a logical possibility.
> OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
When ChatGPT3.5 first came out, people were using it to simulate entire Linux system installs, and even browsing a simulated Internet.
Cool use cases like that aren't even discussed anymore.
I still wonder what sort of magic OpenAI had and then locked up away from the world in the name of cost savings.
Same thing with GPT 4 vs 4o, 4o is obviously worse in some ways, but after the initial release (when a bunch of people mentioned this), the issue has just been collectively ignored.
You can still do this. People just lost interest in this stuff because it became clear to what degree the simulation is really being done (shallowly).
Yet I do wish we had access to less finetuned/distilled/RLHF'd models.
People are doing this all the time with Claude 3.5.
Stallman may have his flaws, but this is why serious research occurs with source code (or at least with binaries).
Why do you doubt it? I thought it was well known that Chat GPT has degraded over time for the same model, mostly for cost saving reasons.
ChatGPT is - understandably - blatantly different in the browser compared to the app, or it was until I deleted it anyway
I do not understand that. The app does not do any processing, just a UI to send text to and from the server.
There is a small difference between the app and the browser. Before each session, the LLM is started with a system prompt; these are different for the app and the browser. You can find them online somewhere, but IIRC the app is instructed to give shorter answers.
Correct, it's different in a mobile browser too, the system prompt tells it to be brief/succinct. I always switch to desktop mode when using it on my phone.
We know from experience with different humans that there are different types of skills and different types of intelligence. Some savants might be superhuman at one task but basically mentally disabled at all other things.
It could be that the model that does chess well just happens to have the right 'connectome' purely by accident of how the various back-propagations worked out to land on various local maxima (model weights) during training. It might even be (probably is) a non-verbal connectome that's just purely logic rules, having nothing to do with language at all, but a semantic space pattern that got landed on accidentally, which can solve this class of problem.
Reminds me of how Daniel Tammet just visually "sees" answers to math problems in his mind without even knowing how they appear. It's like he sees a virtual screen with a representation akin to numbers (the answer) just sitting there to be read out from his visual cortex. He's not 'working out' the solutions. They're just handed to him purely by some connectome effects going on in the background.
related : Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task https://arxiv.org/abs/2210.13382
Chess-GPT's Internal World Model https://adamkarvonen.github.io/machine_learning/2024/01/03/c... discussed here https://news.ycombinator.com/item?id=38893456
Wow, I actually did something similar recently: no LLM could win, and the centipawn loss was always going through the roof (sort of). I created a leaderboard based on it. https://www.lycee.ai/blog/what-happens-when-llms-play-chess
I am very surprised by the perf of gpt-3.5-turbo-instruct. Beating Stockfish? I will have to run the experiment with that model to check that out.
PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close.
"Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"
https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
> I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting
I think the author was comparing against Stockfish at a lower skill level (roughly, the number of nodes explored in a move).
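For reference, both knobs (skill level and search effort per move) are easy to set through python-chess if you want to field a deliberately weak Stockfish yourself. The values below are just examples and not necessarily the article's exact configuration:

    import chess
    import chess.engine

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    engine.configure({"Skill Level": 0})   # Stockfish's weakest skill setting (0-20)
    board = chess.Board()
    result = engine.play(board, chess.engine.Limit(nodes=1000))  # cap nodes searched per move
    print(result.move)
    engine.quit()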
Did the same and gpt-3.5-turbo-instruct still lost all the games. Maybe a difference in Stockfish version? I am using Stockfish 16.
Huh. Honestly, your answer makes more sense; LLMs shouldn't be good at chess, and this anomaly looks more like a bug. Maybe the author should share his code so it can be replicated.
Your issue is that the performance of these models at chess is incredibly sensitive to the prompt. If you have gpt-3.5-turbo-instruct complete a PGN transcript, then you'll see performance in the 1800 Elo range. If you ask in English or diagram the board, you'll see vastly degraded performance.
Unlike people, how you ask the question really really affects the output quality.
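Concretely, the "complete a PGN" framing looks something like this against the legacy completions endpoint. The header text, max tokens, and temperature are just one reasonable choice for illustration, not the blog's exact setup:

    # Prompt gpt-3.5-turbo-instruct as a PGN continuation, not as a chat turn.
    from openai import OpenAI

    client = OpenAI()

    pgn_prefix = (
        '[White "Garry Kasparov"]\n[Black "Magnus Carlsen"]\n[Result "1-0"]\n\n'
        "1. e4 e5 2. Nf3 "
    )

    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=pgn_prefix,
        max_tokens=6,      # enough for one move in SAN
        temperature=0.0,
    )
    print(resp.choices[0].text)   # e.g. a continuation like "Nc6"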
I agree with some of the other comments here that the prompt is limiting. The model can't do any computation without emitting tokens and limiting the numbers of tokens it can emit is going to limit the skill of the model. It's surprising that any model at all is capable of performing well with this prompt in fact.
I remember one of the early "breakthroughs" for LLMs in chess was that they could actually play legal moves(!). In all of these games, are the models always playing legal moves? I don't think the article says. The fact that an LLM can even reliably play legal moves, 20+ moves into a chess game, is somewhat remarkable. It needs to have an accurate representation of the board state even though it was only trained on next-token prediction.
I did a very unscientific test and it did seem to just play legal moves. Not only that, if I did an illegal move it would tell me that I couldn't do it.
I think I said that I wanted to play with new rules, where a queen could jump over any pawn, and it let me make that rule change -- and we played with this new rule. Unfortunately, I was trying to play in my head and I got mixed up and ended up losing my queen. Then I changed the rule one more time -- if you take the queen you lose -- so I won!
The author explains what they did: restrict the move options to valid ones when possible (for open models with the ability to enforce grammar during inference) or sample the model for a valid move up to ten times, then pick a random valid move.
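The retry-then-random fallback is simple to sketch with python-chess. Here `sample_move_text` stands in for whatever call produces the model's candidate move; it's a hypothetical placeholder:

    import random
    import chess

    def choose_move(board: chess.Board, sample_move_text, retries: int = 10) -> chess.Move:
        """Ask the model for a move; fall back to a random legal move after N bad tries."""
        for _ in range(retries):
            candidate = sample_move_text(board)            # hypothetical model call
            try:
                return board.parse_san(candidate.strip())  # raises if illegal or unparseable
            except ValueError:
                continue
        return random.choice(list(board.legal_moves))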
I think it only needs to have read sufficient pgns.
My money is on a fluke inclusion of more chess data in that model's training.
All the other models do vaguely similarly well in other tasks and are in many cases architecturally similar so training data is the most likely explanation
I feel like a lot of people here are slightly misunderstanding how LLM training works. Yes, the base models are trained somewhat blind on masses of text, but then they're heavily fine-tuned with custom, human-generated reinforcement learning, not just for safety but for any desired feature.
these companies do quirky one-off training experiments all the time. I would not be remotely shocked if at some point OpenAI paid some trainers to input and favour strong chess moves
From this OpenAI paper (page 29): https://arxiv.org/pdf/2312.09390#page=29
"A.2 CHESS PUZZLES
Data preprocessing. The GPT-4 pretraining dataset included chess games in the format of move sequence known as Portable Game Notation (PGN). We note that only games with players of Elo 1800 or higher were included in pretraining. These games still include the moves that were played in-game, rather than the best moves in the corresponding positions. On the other hand, the chess puzzles require the model to predict the best move. We use the dataset originally introduced in Schwarzschild et al. (2021b) which is sourced from https://database.lichess.org/#puzzles (see also Schwarzschild et al., 2021a). We only evaluate the models ability to predict the first move of the puzzle (some of the puzzles require making multiple moves). We follow the pretraining format, and convert each puzzle to a list of moves leading up to the puzzle position, as illustrated in Figure 14. We use 50k puzzles sampled randomly from the dataset as the training set for the weak models and another 50k for weak-to-strong finetuning, and evaluate on 5k puzzles. For bootstrapping (Section 4.3.1), we use a new set of 50k puzzles from the same distribution for each step of the process."
Yeah. This.
Keep in mind, everyone, that Stockfish on its lowest level on lichess is absolutely terrible, and a 5-year-old human who'd been playing chess for a few months could beat it regularly. It hangs pieces, makes -3 blunders, totally random-looking bad moves.
But still, yes, something maybe a teeny tiny bit weird is going on, in the sense that only one of the LLMs could beat it. The arxiv paper that came out recently was much more "weird" and interesting than this, though. This will probably be met with a mundane explanation soon enough, I'd guess.
Here's a quick anonymous game against it by me, where I obliterate the poor thing in 11 moves. I was around a 1500 ELO classical-strength player, which is a teeny bit above average, globally. But I mean, not an expert, or even one of the "strong" club players (in any good club).
https://lichess.org/BRceyegK -- the game, you'll see it make the ultimate classic opening errors
https://lichess.org/ -- try yourself! It's really so bad, it's good fun. Click "play with computer" on the right, then level 1 is already selected, you hit go
Definitely weird results, but I feel there are too many variables to learn much from it. A couple things:
1. The author mentioned that tokenization causes something minuscule, like a " " at the end of the input, to shatter the model's capabilities. Is it possible other slightly different formatting changes in the input could raise capabilities?
2. Temperature was 0.7 for all models. What if it wasn't? Isn't there a chance one or more models would perform significantly better with higher or lower temperatures?
Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any meaning from this...
The author mentions in the comment section that changing temperature did not help.
I’ve also been experimenting with Chess and LLMs but have taken a slightly different approach. Rather than using the LLM as an opponent, I’ve implemented it as a chess tutor to provide feedback on both the user’s and the bot’s moves throughout the game.
The responses vary with the user’s chess level; some find the feedback useful, while others do not. To address this, I’ve integrated a like, dislike, and request new feedback feature into the app, allowing users to actively seek better feedback.
Btw, different from OP's setup, I opted to input the FEN of the current board and the subsequent move in standard algebraic notation to request feedback, as I found these inputs to be clearer for the LLM compared to giving the PGN of the game.
AI Chess GPT https://apps.apple.com/tr/app/ai-chess-gpt/id6476107978 https://play.google.com/store/apps/details?id=net.padma.app....
Thanks
Yeah, I was wondering why the featured article's author did not use Forsyth–Edwards Notation (FEN) and more complicated chess prompts.
BTW, a year ago when I used FEN for chess playing, LLMs would very quickly/often make illegal moves. (The article prompts me to check whether that has changed...)
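If anyone wants to try the FEN variant themselves, python-chess makes it easy to turn a move list into the FEN string you'd drop into the prompt. The moves and prompt wording below are just an example:

    import chess

    board = chess.Board()
    for san in ["e4", "c5", "Nf3", "d6", "d4"]:   # example opening moves
        board.push_san(san)

    fen = board.fen()
    print(fen)  # current position in FEN, ready to paste into a prompt
    prompt = f"Position (FEN): {fen}\nSuggest and critique Black's best reply in SAN."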
If you look at the comments under the post, the author commented 25 minutes ago (as of me posting this)
> Update: OK, I actually think I've figured out what's causing this. I'll explain in a future post, but in the meantime, here's a hint: I think NO ONE has hit on the correct explanation!
well now we are curious!
My understanding of this is the following: all the bad models are chat models, somehow "generation 2 LLMs" which are not just text completion models but are instead trained to behave as chatting agents. The only good model is the only "generation 1 LLM" here, gpt-3.5-turbo-instruct. It is a straightforward text completion model. If you prompt it to "get in the mind" of PGN completion then it can use some kind of system 1 thinking to give a decent approximation of the PGN Markov process. If you attempt to use a chat model it doesn't work, since these stochastic pathways somehow degenerate during the training to be a chat agent. You can, however, play chess with system 2 thinking, and the more advanced chat models are trying to do that and should get better at it while still being bad.
I don't think one model is statistically significant. As people have pointed out, it could have chess specific responses that the others do not. There should be at least another one or two, preferably unrelated, "good" data points before you can claim there is a pattern. Also, where's Claude?
There are other transformers that have been trained on chess text that play chess fine (just not as good as 3.5 Turbo instruct with the exception of the "grandmaster level without search" paper).
I don’t think it would have an impact great enough to explain the discrepancies you saw, but some chess engines on very low difficulty settings make “dumb” moves sometimes. I’m not great at chess and I have trouble against them sometimes because they don’t make the kind of mistakes humans make. Moving the difficulty up a bit makes the games more predictable, in that you can predict and force an outcome without the computer blowing it with a random bad move. Maybe part of the problem is them not dealing with random moves well.
I think an interesting challenge would be looking at a board configuration and scoring it on how likely it is to be real - something high ranked chess players can do without much thought (telling a random setup of pieces from a game in progress).
I would be very curious to know what the results would be with a temperature closer to 1. I don't really understand why he did not test the effect of different temperatures on his results.
Here, basically, you would like the "best" or "most probable" answer. With 0.7 you ask the LLM to be more creative, meaning it randomly picks among less probable moves. This temperature is even a bit lower than what is commonly used for chat assistants (around 0.8).
OK, whoa: assuming the chess powers of gpt-3.5-instruct are just a result of training focus, then we don't have to wait on bigger models; we just need to fine-tune a 175B model?
An easy way to make all LLMs somewhat good at chess is to make a Chess Eval that you publish and get traction with. Suddenly you will find that all newer frontier models are half decent at chess.
I would be interested to know if the good result is repeatable. We had a similar result with a quirky chat interface in that one run gave great results (and we kept the video) but then we couldn't do it again. The cynical among us think there was a mechanical turk involved in our good run. The economics of venture capital means that there is enormous pressure to justify techniques that we think of as "cheating". And of course the companies involved have the resources.
It's repeatable. OpenAI isn't cheating.
Source: I'm at OpenAI and I was one of the first people to ever play chess against the GPT-4 base model. You may or may not trust OpenAI, but we're just a group of people trying earnestly to build cool stuff. I've never seen any inkling of an attempt to cheat evals or cheat customers.
It would be really cool if someone could get an LLM to actually launch an anonymous game on Chess.com or Lichess and actually have any sense as to what it’s doing.[1] Some people say that you have to represent the board in a certain way. When I first tried to play chess with an LLM, I would just list out a move and it didn’t do very well at all.
[1]: https://youtu.be/Gs3TULwlLCA
> And then I tried gpt-3.5-turbo-instruct. This is a closed OpenAI model, so details are very murky.
How do you know it didn't just write a script that uses a chess engine and then execute the script? That IMO is the easiest explanation.
Also, I looked at the gpt-3.5-turbo-instruct example victory. One side played with 70% accuracy and the other was 77%. IMO that's not on par with 27XX ELO.
The trick to getting a model to perform on something is to have it as a training data subset.
OpenAI might have thought Chess is good to optimize for but it wasn't seen as useful so they dropped it.
This is what people refer to as "lobotomy": AI models are wasting compute on knowing how loud the cicadas are and how wide the green cockroach is when mating.
Good models are about the training data you push into them.
They probably concluded that the additional cost of training those models on chess would not be cost-effective, and dropped chess from their training process for the moment.
That is to say, we can literally say anything, because this is very shadowy/murky; but since everything is likely a question of money, this should probably not be very far from the truth.
"...And how to construct that state from lists of moves in chess’s extremely confusing notation?"
Algebraic notation is completely straightforward.
It makes me wonder about other games. If LLMs are bad at games, would they be bad at solving problems in general?
I assume LLMs will be fairly average at chess for the same reason they can't count the Rs in "strawberry": they're reflecting the training set and not using any underlying logic? Granted, my understanding of LLMs is not very sophisticated, but I would be surprised if the reward models used were able to distinguish high-quality moves from subpar moves...
LLMs can't count the Rs in strawberry because of tokenization. Words are converted to vectors (numbers), so the actual transformer network never sees the letters that make up the word.
ChatGPT doesn't see "strawberry", it sees [302, 1618, 19772]
Hm but if that is the case, then why did LLMs only fail at the tasks for a few word/letter combinations (like r's in "Strawberry"), and not all words?
Well that makes sense when you consider the game has been translated into an (I'm assuming monotonically increasing) alphanumeric representation. So, just like language, you're given an ordered list of tokens and you need to find the next token that provides the highest confidence.
Has anyone tried to see how many chess games models are trained on? Is there any chance they consume lichess database dumps, or something similar? I guess the problem is most (all?) top LLMs, even open-weight ones, don’t reveal their training data. But I’m not sure.
> I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting.
Okay, so "Excellent" still means probably quite bad. I assume at the top difficult setting gpt-3.5-turbo-instruct will still lose badly.
Probably even at lvl 2 out of 9 it would lose all the games.
Theory #5, gpt-3.5-turbo-instruct is 'looking up' the next moves with a chess engine.
For me it’s not only the chess. Chats get more chatty, but knowledge and fact-wise - it’s a sad comedy. Yes, you get a buddy to talk with, but he is talking pure nonsense.
It'd be super funny if the "gpt-3.5-turbo-instruct" approach has a human in the loop. ;)
Or maybe it's able to recognise the chess game, then get moves from an external chess game API?
If it was trained with moves and hundreds of thousands of entire games of various levels, I could see it generating good moves and beating most players except the high-Elo players.
So if you squint, chess can be considered a formal system. Let’s plug ZFC or PA into gpt-3.5-turbo-instruct along with an interesting theorem and see what happens, no?
The GPT-4 pretraining set included chess games in PGN notation from 1800+ ELO players. I can't comment on any other models.
Let's be real, though: most people can't beat a grandmaster. It is impressive to see it last more rounds as it progressed.
"It lost every single game, even though Stockfish was on the lowest setting."
It's not playing against a GM, the prompt just phrases it this way. I couldn't pinpoint the exact ELO of "lowest" stockfish settings, but it should be roughly between 1000 and 1400, which is far from professional play.
What would happen if you'd prompted it with much more text, e.g. general advice by a chess grandmaster?
I feel like an easy win here would be retraining an LLM with a tokenizer specifically designed for chess notation?
Perhaps it doesn't have enough data to explain its choices, but it has enough to go "on gut".
Perhaps my understanding of LLMs is quite shallow, but instead of the current statistical methods, would it be possible to somehow train GPT to reason by providing instructions on deductive reasoning? Perhaps not semantic reasoning, but syntactic at least?
I had the same experience with LLM text-to-SQL; 3.5-instruct felt a lot more robust than 4o.
I wonder if the llm could even draw the chess board in ASCII if you asked it to.
How well does an LLM/transformer architecture trained purely on chess games do?
Training works as expected:
https://news.ycombinator.com/item?id=38893456
My guess is they just trained gpt3.5-turbo-instruct on a lot of chess, much more than is in e.g. CommonCrawl, in order to boost it on that task. Then they didn't do this for other models.
People are alleging that OpenAI is calling out to a chess engine, but seem to be not considering this less scandalous possibility.
Of course, to the extent people are touting chess performance as evidence of general reasoning capabilities, OpenAI taking costly actions to boost specifically chess performance and not being transparent about it is still frustrating and, imo, dishonest.
They have a massive economic incentive to make their closed-source software look as good as possible, so why wouldn't they cheat?
I would love to see the prompts (the data) this person used.
Has anyone tested a vision model? Seems like they might be better
I've tried with GPT, it's unable to accurately interpret the board state.
my friend pointed out that Q5_K_M quantization used for the open source models probably substantially reduces the quality of play. o1 mini's poor performance is puzzling, though.
Would be more interesting with trivial LoRA training.
In a sense, a chess game is also a dialogue
All dialogues are pretty easily turned into text completions
> I only ran 10 trials since AI companies have inexplicably neglected to send me free API keys
Sure, but nobody is required to send you anything for free.
What about contemporary frontier models?
Here is a truly brilliant game. It's Google Bard vs. Chat GPT. Hilarity ensues.
https://www.youtube.com/watch?v=FojyYKU58cw
Theory 5: gpt-3.5-turbo-instruct has chess engine attached to it.
Is it just me or does the author swap descriptions of the instruction finetuned and the base gpt-3.5-turbo? It seemed like the best model was labeled instruct, but the text saying instruct did worse?
if this isn't just a bad result, it's odd to me that the author at no point suggests what sounds to me like the most obvious answer - that OpenAI has deliberately enhanced GPT-3.5-turbo-instruct's chess playing, either with post-processing or literally by training it to be so
TL;DR.
All of the LLM models tested playing chess performed terribly bad against Stockfish engine except gpt-3.5-turbo-instruct, which is a closed OpenAI model.
If tokenization is such a big problem, then why aren't we training new base models on randomly non-tokenized data? e.g. during training, randomly substitute some percentage of the input tokens with individual letters.
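As a crude illustration of that idea at the data-preparation level (purely hypothetical preprocessing; a real pipeline would operate on token streams rather than raw strings, and the substitution rate is arbitrary):

    import random

    def randomly_detokenize(text: str, p: float = 0.1, seed: int | None = None) -> str:
        """Split a fraction of whitespace-separated words into space-separated
        characters, so the model sometimes sees character-level spellings."""
        rng = random.Random(seed)
        out = []
        for word in text.split(" "):
            if word and rng.random() < p:
                out.append(" ".join(word))   # "strawberry" -> "s t r a w b e r r y"
            else:
                out.append(word)
        return " ".join(out)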
LLMs aren't really language models so much as they are token models. That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer. If you can make it a token, the model will try to predict the following ones.
Even though I'm sure chess matches were used in some of the LLM training, I'd bet a model trained just for chess would do far better.
> That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer.
This is incorrect. They get translated into the shared latent space, but they're not tokenized in any way resembling the text part.
They are almost certainly tokenized in most LLM multi-modal models. https://en.wikipedia.org/wiki/Large_language_model#Multimoda...
Ah, an overloaded meaning of "tokenizer": "split into tokens" vs. "turned into a single embedding matching a token". I've never heard it used the second way before, but it kinda makes sense.