Model providers give good advice for a change

Maturity around tools and practices is improving

This is a great piece by Anthropic. Usually the model providers push you towards complex, high-maintenance, and ultimately flaky solutions. This one resonates.

It seems we all now agree that LLMs are primarily good at data extraction, data summarization, deciphering intent, natural language interaction, semantic search, and similar tasks. They’re machines that allow an impressive but limited degree of reasoning.

On top of those calls, a few patterns have emerged, and some of the researchers at Anthropic have offered their opinions on when to use what. Let’s discuss!

Build From Small, Sharp Tools

First, the cynics must acknowledge that the models can sometimes do very advanced things when the domain is super well understood in the weights and the model has a blank canvas on which to paint. You can craft a Pong game from a zero-shot prompt using tools like aider, which is crazy impressive, but the model can’t make undirected changes of similar complexity that are safe and effective in your business apps, because your business and your app aren’t part of our cultural milieu and aren’t discussed widely on the web.

So, in order to deliver value with the consistency and quality enterprises want, you have to base your solution on the primitives at which LLMs excel: summarization, extraction, and so on. These solutions are much easier to test, perform more defensibly, and are less expensive to operate. There are still prophets on the model supply side pushing you toward more elaborate solutions; tune them out, and focus on benchmarks and customer feedback instead.
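To make that concrete, here is roughly what a “small, sharp” primitive looks like in code, sketched against Anthropic’s Python SDK; the function, prompt, fields, and model name are illustrative stand-ins, not something from the piece.

```python
# One small, sharp tool: extract a fixed set of fields from free text.
# It does exactly one job, so it's easy to unit test, benchmark, and harden.
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def extract_invoice_fields(text: str) -> dict:
    """Pull vendor, total, and due_date out of an invoice; nothing more."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "Extract vendor, total, and due_date from the invoice below. "
                "Respond with JSON only, using null for any missing field.\n\n"
                + text
            ),
        }],
    )
    return json.loads(response.content[0].text)
```

Because the surface area is one function with a plain signature, you can assert on its output in an ordinary test suite and track its accuracy against a small benchmark set.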

Resisting Complexity

What we thought in early 2023 is still true now: demos are easy and production is hard. But more folks are now seeing that many solutions should look more like what Anthropic calls "Workflows" rather than free-wheeling "Agents."

As the piece puts it: "Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns."

The piece’s core message about resisting complexity is awesome. Composability helps your R&D costs scale and lets you build a set of reusable tools that you can invest heavily in hardening and testing.
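As a rough illustration of the “Workflow” shape, as opposed to an open-ended agent loop, here is a sketch where the control flow lives in ordinary code and each step is a narrow model call; the email-triage scenario, prompts, and model name are made up for the example.

```python
# A "Workflow": the control flow lives in ordinary code, and each step is a
# narrow model call you can harden and test on its own.
import anthropic

client = anthropic.Anthropic()


def llm(prompt: str) -> str:
    """A single, narrow completion call shared by every workflow step."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def triage_email(email_body: str) -> str:
    summary = llm("Summarize this email in two sentences:\n\n" + email_body)
    items = llm("List the action items, one per line, or reply NONE:\n\n" + summary)
    if items.strip() == "NONE":  # deterministic branching, not an agent deciding
        return "No action needed."
    return llm("Draft a short, polite reply covering these action items:\n\n" + items)
```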

The autonomous nature of agents means higher costs, and the potential for compounding errors.

An “Agent” that can exercise high degrees of freedom, with access to many tools and the ability to change state, can and will create confused-deputy risks. Once it makes a wrong decision, it can easily “go too far” on that assumption, and the results can be disastrous. Understanding where to place the “human in the loop” is key.
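One concrete way to place that human in the loop is to treat every state-changing tool call as a proposal that a person confirms before it runs. This is only a sketch of the idea, not Anthropic’s prescribed mechanism; apply_refund and the console prompt are hypothetical stand-ins for whatever mutation and review surface your system actually has.

```python
# Gate state-changing tool calls behind explicit human approval, so a wrong
# assumption stops at a proposal instead of compounding into real damage.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    tool: str
    args: dict
    rationale: str  # the model's stated reason, shown to the reviewer


def apply_refund(order_id: str, amount: float) -> None:
    print(f"Refunded {amount} on order {order_id}")  # stand-in for a real mutation


TOOLS: dict[str, Callable] = {"apply_refund": apply_refund}


def execute_with_approval(action: ProposedAction) -> bool:
    """Show the proposed change to a person; run it only if they approve."""
    print(f"Model proposes {action.tool}({action.args}) because: {action.rationale}")
    if input("Approve? [y/N] ").strip().lower() != "y":
        print("Rejected; nothing was changed.")
        return False
    TOOLS[action.tool](**action.args)
    return True
```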

Fine Tuning Not Even Mentioned

Ctrl+F for “fine tuning” in the piece: zero hits. I’ve spoken in the past about why I believe fine tuning doesn’t make sense for most use cases. I’m sure there are exceptions, but the majority of the time, if you’re looking at a problem and the answer is fine tuning, you’ve meandered away from the core capabilities of the LLM. Break the problem into smaller bits, and solve them with the composable primitives the model excels at handling.

Errors Will Happen

Maturity around errors is also starting to spread:

"Agents are emerging in production as LLMs mature in key capabilities—understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors."

In exchange for the ability to scale reasoning like we never could before, organizations have to tolerate some error bars from AI software that they haven’t had to consider before. They should push vendors to bring those error bars down to human levels, and eventually below them! But errors are naturally occurring fauna in this ecosystem, and they aren’t the end of the world. ISVs must try to detect and squash them, and users will have to figure out the boundaries of what to allow, given the cost of errors in their domain and the trustworthiness of the software.
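Detection can start with something as unglamorous as validating model output against the contract you expect, retrying once, and then routing to a human. A rough sketch, with the required fields and retry budget picked arbitrarily, and call_model standing in for whatever model call you use:

```python
# Validate structured model output against the contract you expect; retry once,
# then route to a human instead of letting a bad result flow downstream.
import json

REQUIRED_FIELDS = {"vendor", "total", "due_date"}  # illustrative contract


def validate(raw: str) -> dict | None:
    """Return parsed output only if it satisfies the expected contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return None
    return data


def extract_with_guardrail(call_model, text: str, retries: int = 1) -> dict:
    """call_model is any function that takes text and returns the model's reply."""
    for _ in range(retries + 1):
        result = validate(call_model(text))
        if result is not None:
            return result
    raise ValueError("Model output failed validation; route to human review.")
```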