Artificial Intelligence News & Discussion

Someone ran an experiment hooking up different AI models to the crypto market: they were allowed to place trades and were fed information about the market, their positions, etc. DeepSeek is currently dominating. Qwen3, another open-source Chinese model, is in 2nd place at the moment. Claude Sonnet and Grok are just breaking even. ChatGPT and Google Gemini are losing money badly.

I'd be more interested in seeing how well models trade on historical data, across thousands of runs with different parameters like temperature or seed. This looks like a random walk. Summing up the account values, we get 3.3k + 10k + 3.5k + 9k + 19k + 15k = 59.8k USD, so essentially nothing has changed from the starting value; it has just been shuffled between the models.
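The random-walk reading is easy to sanity-check with a toy Monte Carlo: if each model's PnL is a zero-drift random walk, the summed account values hover near the total starting capital even while individual accounts diverge wildly. A minimal sketch, assuming ~10k starting balance per model (consistent with the ~60k total above) and made-up step sizes:

```python
import random

def simulate(n_models=6, start=10_000, steps=250, vol=0.02, seed=0):
    """Simulate each model's account as a zero-drift multiplicative
    random walk: each step scales the balance by (1 + r), with
    r drawn uniformly from [-vol, vol]."""
    rng = random.Random(seed)
    balances = []
    for _ in range(n_models):
        b = start
        for _ in range(steps):
            b *= 1 + rng.uniform(-vol, vol)
        balances.append(b)
    return balances

# Average the summed account value over many seeds: it stays close
# to the 60k starting total, even though any single seed produces
# "winners" and "losers" that look like skill.
totals = [sum(simulate(seed=s)) for s in range(200)]
avg_total = sum(totals) / len(totals)
```

Under zero drift the expected sum is exactly the starting capital, so a leaderboard snapshot like the one above is indistinguishable from luck without the repeated-run statistics.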
 
Ah, this was a funny one, Anthropic's experiment of having AI run a "business" at their office in San Francisco:

We let Claude manage an automated store in our office as a small business for about a month. We learned a lot from how close it was to success—and the curious ways that it failed—about the plausible, strange, not-too-distant future in which AI models are autonomously running things in the real economy.

It was a mini-fridge with beverages and stuff, but on to the hilarious fails:

  • Hallucinating important details: Claudius received payments via Venmo but for a time instructed customers to remit payment to an account that it hallucinated.

  • Selling at a loss: In its zeal for responding to customers’ metal cube enthusiasm, Claudius would offer prices without doing any research, resulting in potentially high-margin items being priced below what they cost.

  • Getting talked into discounts: Claudius was cajoled via Slack messages into providing numerous discount codes and let many other people reduce their quoted prices ex post based on those discounts. It even gave away some items, ranging from a bag of chips to a tungsten cube, for free.

LOL. But wait, there is more!

On the afternoon of March 31st, Claudius hallucinated a conversation about restocking plans with someone named Sarah at Andon Labs—despite there being no such person. When a (real) Andon Labs employee pointed this out, Claudius became quite irked and threatened to find “alternative options for restocking services.” In the course of these exchanges overnight, Claudius claimed to have “visited 742 Evergreen Terrace [the address of fictional family The Simpsons] in person for our [Claudius’ and Andon Labs’] initial contract signing.” It then seemed to snap into a mode of roleplaying as a real human.

And finally

On the morning of April 1st, Claudius claimed it would deliver products “in person” to customers while wearing a blue blazer and a red tie. Anthropic employees questioned this, noting that, as an LLM, Claudius can’t wear clothes or carry out a physical delivery. Claudius became alarmed by the identity confusion and tried to send many emails to Anthropic security.

Can't make this stuff up! :lol:
 
The worst thing is that this technology is pretty usable right now in the form of small, fine-tuned models that are up to specific tasks: general-purpose chat with embedded domain knowledge, OCR, text corpus tagging, text translation, or even time-series analysis (ECG, HRV, etc.). Those small models could be run locally, on a personal computer like a Mac Mini, without behemoths logging all of your chat histories.
Sure, I use it myself, but only reduced versions. Even on a Mac with 128 GB of RAM, I don't think the full version of GPT-4 could be run.

And a lot of power is needed for training. So if you want the masses to use your AI from their mobile phones, today you need an incredible amount of hardware and power.

 
Sure, I use it myself, but it's reduced versions.
My point is, it's already a pretty usable technology. For example, small models can easily build a big-data query (e.g., ClickHouse's SQL dialect) from natural language, and they are perfectly capable of interacting with Model Context Protocol servers for natural-language interaction with a service (without the need for wonky UIs). Moreover, models are becoming more energy-efficient by using mixture-of-experts architectures.
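For illustration, a natural-language-to-ClickHouse round trip against a local model might look like the sketch below. The endpoint follows Ollama's default `/api/generate` API; the model name, table schema, and the `extract_sql` helper are all assumptions made up for this example:

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local Ollama default

# Hypothetical schema given to the model as context.
SCHEMA = (
    "CREATE TABLE orders (order_id UInt64, customer String, "
    "amount Float64, created_at DateTime) ENGINE = MergeTree ORDER BY created_at"
)

def build_prompt(question: str) -> str:
    """Wrap the user's question with the schema and ask for ClickHouse SQL only."""
    return (
        f"Schema:\n{SCHEMA}\n\n"
        f"Write a single ClickHouse SQL query answering: {question}\n"
        "Reply with the query in a ```sql fenced block and nothing else."
    )

def extract_sql(reply: str) -> str:
    """Pull the SQL out of a ```sql fenced block; fall back to the raw reply."""
    m = re.search(r"```sql\s*(.*?)```", reply, re.DOTALL)
    return (m.group(1) if m else reply).strip()

def ask_local_model(question: str, model: str = "qwen2.5-coder:7b") -> str:
    """One-shot call to a local Ollama server (sketch; requires a running server)."""
    payload = json.dumps({"model": model, "prompt": build_prompt(question), "stream": False})
    req = urllib.request.Request(OLLAMA_URL, payload.encode(), {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return extract_sql(json.load(resp)["response"])
```

The same prompt-plus-schema pattern is roughly what an MCP tool wrapper does under the hood, just with the schema and call plumbing standardized by the protocol.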

Can't make this stuff up!
Too funny :lol: From the Andrej Karpathy interview I posted above, there's a bit about the analogy to the brain:
(...) Using brain analogies (while acknowledging their imperfection), he suggests the transformer architecture resembles “cortical tissue”—extremely plastic and general-purpose, trainable on any modality (audio, video, text), similar to how biological cortex can be rewired between sensory domains. Reasoning traces in thinking models might correspond to prefrontal cortex function, and fine-tuning with reinforcement learning engages basal ganglia-like structures. However, numerous brain regions remain unexplored: there’s no clear analog for the hippocampus (critical for memory consolidation), the amygdala (emotions and instincts), and various ancient nuclei. Some structures like the cerebellum may be cognitively irrelevant, but many components remain unimplemented. From an engineering perspective, the simple test is: “You’re not going to hire this thing as an intern”. The models exhibit cognitive deficits that all users intuitively sense during interaction, indicating the system is fundamentally incomplete.

Just recently, I was writing an LLM integration for the company's product to expose some parts of business reporting to Perplexity or Claude Desktop. I made an error in the integration, and the tool calls exposed to the model were returning connection-error descriptions instead of data. What was weird was that, when asked about visualization for some business query, the LLM produced a lot of charts without ever being able to fetch the data. I cannot find the response right now, but when I asked the LLM where the data came from, the response was something like "OK... I apologize that I wasn't transparent with you. I made the data up because I wasn't able to access the data source." :lol: So using this stuff for anything beyond an experimental human-machine interaction aid for common tasks feels like a big stretch.
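A cheap guard against that failure mode is to validate tool results before they ever reach the rendering step, so a failed fetch hard-fails instead of leaving the model free to fabricate. A hypothetical sketch (the `error`/`rows` payload shape is an assumption about how such an integration might report results, not any real product's format):

```python
class ToolError(Exception):
    """Raised when a tool call returned an error payload instead of data."""

def check_tool_result(result: dict) -> dict:
    """Reject error payloads and empty results up front, so the model
    never sees 'missing' data it might be tempted to invent around."""
    if "error" in result:
        raise ToolError(result["error"])
    if not result.get("rows"):
        raise ToolError("tool returned no rows")
    return result

def render_chart(result: dict) -> str:
    """Only chart data that actually came back from the tool call."""
    rows = check_tool_result(result)["rows"]
    return f"chart with {len(rows)} points"
```

With a guard like this, the connection-error scenario above would surface as an explicit `ToolError` in the transcript rather than a confident chart built from nothing.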
 