Their testing involved posing hundreds of questions to multiple LLMs, questions that had previously been used to benchmark LLM abilities, but with a bit of non-pertinent information added. That, they found, was enough to confuse the LLMs into giving wrong or even nonsensical answers to questions they had previously answered correctly.
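To make the idea concrete, here is a minimal sketch, in Python, of the kind of perturbation test described above. It is not the researchers' code: the `probe` helper and the `ask_model` hook are hypothetical stand-ins for whatever model is being tested, and the sample question is simply an illustration of a benchmark-style arithmetic problem with one irrelevant clause inserted.

```python
# Minimal sketch of a perturbation test (illustrative only, not the study's code).
# `ask_model` is a hypothetical hook: plug in a call to whatever LLM you want to probe.

from typing import Callable

BASELINE = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "How many kiwis does Oliver have in total?"
)

# Same arithmetic, plus one clause that has no bearing on the answer.
PERTURBED = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "Five of Saturday's kiwis were a bit smaller than average. "
    "How many kiwis does Oliver have in total?"
)

def probe(ask_model: Callable[[str], str]) -> bool:
    """Return True if the model answers the same with and without the
    irrelevant detail; the study found it frequently does not."""
    baseline_answer = ask_model(BASELINE).strip()
    perturbed_answer = ask_model(PERTURBED).strip()
    print("baseline :", baseline_answer)
    print("perturbed:", perturbed_answer)
    return baseline_answer == perturbed_answer

if __name__ == "__main__":
    # Stand-in model for demonstration only; replace with a real LLM call.
    fake_model = lambda prompt: "102"
    print("consistent:", probe(fake_model))
```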
This, the researchers suggest, shows that the LLMs do not really understand what they are being asked. Instead, they recognize the structure of a sentence and spit out an answer based on patterns picked up through machine learning.
They also note that most of the LLMs they tested often give answers that seem correct but, on closer review, are not, such as when a model is asked how it "feels" about something and responds as if it were capable of such feelings.