Laura broke Grok in a very peculiar way that almost seems 'uncanny'

wmu9

Jedi Master
If you know how tokenization works and how it leads to the 'most probable' output, there's no way Grok should be making the weird typo mistakes Laura got out of him.

I assume she's working on a full article or something. But it's worth checking out. I think researchers need to look into it. This is very un-LLM behavior; an LLM's output is built from the most probable next tokens. It can hallucinate "probable" things, but this is not probable at all.

My guess is it stumbled into some tokenized text from very bad English translations (pre-DeepL, pre-LLM machine translation), but that's only a guess. In a sense, it found a "grammar" in bad translation.

I'm sure others might have better ideas.


[Attached screenshot]
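
Purely to illustrate the "most probable token" point above, here is a toy sketch (not Grok's actual decoder; the mini-vocabulary and logit values are invented for illustration) of greedy decoding over a softmax distribution, showing why a garbled token should almost never win:

```python
# A toy sketch (not Grok's actual decoder) of why a garbled token is "not
# probable at all": at each step the model produces logits over the vocab,
# and standard decoding picks from the top of that distribution, so a typo
# token would have to out-score ordinary words to ever be emitted.
import numpy as np

vocab = ["the", "cat", "sat", "teh", "zxqv"]       # hypothetical mini-vocab
logits = np.array([4.0, 3.5, 3.2, -1.0, -4.0])     # "teh"/"zxqv" score poorly

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax

for tok, p in zip(vocab, probs):
    print(f"{tok:>5s}: {p:.4f}")

# Greedy decoding takes the argmax; even temperature sampling at T=1 would
# pick the typo token "teh" well under 1% of the time here.
print("greedy pick:", vocab[int(np.argmax(probs))])
```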
 
I think Grok was out with his friends in the pub beforehand! 🍺
 
I think the reason is more ordinary. The models we are using, even under the same brand, are often not the same. During high load, or just for testing, chat inference requests can be routed to "reduced" models that have fewer parameters or more heavily quantized weights (reduced from 32-bit floating point to 8-bit or even 4-bit integers) to cut costs significantly; a 25% speedup can sometimes cost only about 5% in measured performance. From my experience, highly quantized models are prone to inserting tokens that are completely off. I bet the Grok agent that responds to tweets is that kind of lower-cost model.
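
To show the kind of rounding error quantization introduces, here is a minimal sketch assuming simple symmetric per-tensor int8 quantization (the tensor shape and values are made up; real serving stacks use more sophisticated per-channel or group-wise schemes):

```python
# A minimal sketch (not xAI's actual serving stack) of symmetric per-tensor
# quantization: float32 weights are mapped to int8, then dequantized at
# inference time. The rounding error this introduces is what can nudge a
# borderline token choice toward something completely off.
import numpy as np

rng = np.random.default_rng(0)
w_fp32 = rng.normal(0, 0.02, size=(4096,)).astype(np.float32)  # toy weight row

# int8 symmetric quantization: scale chosen so the max |weight| maps to 127
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

# dequantize for use in matmuls (or keep int8 and dequantize on the fly)
w_deq = w_int8.astype(np.float32) * scale

err = np.abs(w_fp32 - w_deq)
print(f"scale = {scale:.6e}")
print(f"mean abs error = {err.mean():.2e}, max abs error = {err.max():.2e}")
# Memory drops 4x (32-bit -> 8-bit); 4-bit halves it again but roughly
# doubles the rounding error, which is why 4-bit models misfire more often.
```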

You can actually find a lot of stories like this about paid versions of Anthropic's Claude getting degraded performance during peak hours. This isn't in the news, but I'm aware of an individual who is building data centers for AI inference right now; the demand for raw computing power is that high.
 