Laura broke Grok in a very peculiar way that almost seems 'uncanny'

wmu9

Jedi Master
If you know how tokenization works and how it leads to the 'most probable' output, there's no way Grok should be making the weird typo mistakes Laura got out of him.

I assume she's working on a full article or something. But it's worth checking out. I think researchers need to look into it. This is very un-LLM behavior; an LLM's output is built from the most probable next tokens. It can hallucinate "probable" things, but this is not probable at all.

My guess is it stumbled into some tokenized text from very bad English translations (pre-DeepL, pre-LLM machine translation), but that's only a guess. In a sense, it found a "grammar" in bad translation.

I'm sure others might have better ideas.


[Attached screenshot]
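
Purely to illustrate the "most probable token" point above, here is a toy sketch (not Grok's actual decoder; the mini-vocabulary and logit values are invented for illustration) of greedy decoding over a softmax distribution, showing why a garbled token should almost never win:

```python
# A toy sketch (not Grok's actual decoder) of why a garbled token is "not
# probable at all": at each step the model produces logits over the vocab,
# and standard decoding picks from the top of that distribution, so a typo
# token would have to out-score ordinary words to ever be emitted.
import numpy as np

vocab = ["the", "cat", "sat", "teh", "zxqv"]       # hypothetical mini-vocab
logits = np.array([4.0, 3.5, 3.2, -1.0, -4.0])     # "teh"/"zxqv" score poorly

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax

for tok, p in zip(vocab, probs):
    print(f"{tok:>5s}: {p:.4f}")

# Greedy decoding takes the argmax; even temperature sampling at T=1 would
# pick the typo token "teh" well under 1% of the time here.
print("greedy pick:", vocab[int(np.argmax(probs))])
```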
 
I think Grok was out with his friends in the pub beforehand! 🍺
 
I think the reason is more ordinary. The models we are using, even under the same brand, are often not the same. During high load, or just for testing, chat inference requests can be routed to "reduced" models that have fewer parameters or more heavily quantized weights (reduced from 32-bit floating point to 8-bit or even 4-bit integers) to cut costs significantly; a 25% speedup can sometimes cost only about 5% in measured performance. From my experience, highly quantized models are prone to inserting tokens that are completely off. I bet the Grok agent that responds to tweets is that kind of lower-cost model.
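
To show the kind of rounding error quantization introduces, here is a minimal sketch assuming simple symmetric per-tensor int8 quantization (the tensor shape and values are made up; real serving stacks use more sophisticated per-channel or group-wise schemes):

```python
# A minimal sketch (not xAI's actual serving stack) of symmetric per-tensor
# quantization: float32 weights are mapped to int8, then dequantized at
# inference time. The rounding error this introduces is what can nudge a
# borderline token choice toward something completely off.
import numpy as np

rng = np.random.default_rng(0)
w_fp32 = rng.normal(0, 0.02, size=(4096,)).astype(np.float32)  # toy weight row

# int8 symmetric quantization: scale chosen so the max |weight| maps to 127
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

# dequantize for use in matmuls (or keep int8 and dequantize on the fly)
w_deq = w_int8.astype(np.float32) * scale

err = np.abs(w_fp32 - w_deq)
print(f"scale = {scale:.6e}")
print(f"mean abs error = {err.mean():.2e}, max abs error = {err.max():.2e}")
# Memory drops 4x (32-bit -> 8-bit); 4-bit halves it again but roughly
# doubles the rounding error, which is why 4-bit models misfire more often.
```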

You can actually find a lot of stories like this about paid versions of Anthropic's Claude getting degraded performance during peak hours. This isn't in the news, but I'm aware of an individual who is building data centers for AI inference right now; the demand for raw computing power is that high.
 