Motivations

There is an undercurrent of fear I’ve noticed since I began working full-time in AI 8 months ago. I attribute it mainly to a tacit anxiety related to The Bitter Lesson that seems to amplify with each new flagship model released.

LLMs continue to get better, startups continue to get killed, and with the proliferation of vibe-coding, it’s easy to feel apathetic about building anything software-based. The motivation to write software for anything more substantive than interesting side-projects, educational experiments, or helpful automation flows is waning.

I don’t share this apathy, for what it’s worth. If I did, it would be hard to justify building the company we’re building at Concordance.

My goal for this post is to outline a few answers to the question few seem to be asking directly: why build anything?

I think the first two sentences from the TLDR of Karpathy’s recent post effectively hint at some answers.

2025 was an exciting and mildly surprising year of LLMs. LLMs are emerging as a new kind of intelligence, simultaneously a lot smarter than I expected and a lot dumber than I expected. In any case they are extremely useful and I don’t think the industry has realized anywhere near 10% of their potential even at present capability.

  • Andrej Karpathy (link)

In this piece, I’ll be mostly focusing on the latter sentence. What might we build that lets us realize more potential?


Tradeoffs

We’ve spoken to many agent developers building systems in the real world, and they can generally be partitioned into two camps:

  1. Boil the Ocean - Dump relevant context into the best models available and pay whatever cost necessary to get the best results
  2. Miser - Use models as surgically as possible, often in a multi-agent array, and cut costs at every junction

Given we haven’t yet seen any major economic impact of agents deployed into the business world, Boil the Ocean is the dominant camp with the best outcomes thus far. Agents beyond ChatGPT and Claude Code are not in high enough demand to be at a point where cutting costs is more valuable than simply improving the agents.

I predict this will change rapidly. The current dominance of Boil the Ocean is a direct result of how poorly many systems are actually performing in their deployed environments. My logic is that if you are relying heavily on a system that costs money per token used, you’ll be strongly incentivized to cut these costs. The fact that this doesn’t seem to be the case tells me that the deployed systems aren’t actually being used.

While this analysis is anecdotal, my hypothesis is that trust is still fragmented, agents aren’t performing reliably, and the risk of creating dangerous behavior patterns around their usage is too high.

Once we get a handle on the best tasks to automate with AI (beyond coding), and how to design these systems so that adoption rates improve, the incentive to cut costs will finally be felt.

This presents a really exciting engineering opportunity.

Tradeoff Matrix

Axis          Description
Cost          What you’re willing to spend
Accuracy      How completely the intention is executed
Reliability   Tolerated variance in accuracy
Speed         Speed at which intelligence is required
Complexity    Variance of inputs to handle
Autonomy      Human cycles required to achieve accuracy

To date, accuracy and reliability are the top two most-optimized axes in this matrix. However, as intelligence continuously improves, engineers will have more freedom to optimize solutions along more axes that are well-fit to the specific parameters of the context for which they’re building.

It’s easy to imagine a set of tasks where the tradeoffs become interesting.

Consider a system that has to do real-time language-based flagging on a stream of information in a financial context. Let’s say that in this platform, it’s important to be accurate, but the system can tolerate a certain rate of false positives/negatives without having detrimental impacts, and that there is a threshold of accuracy above which it makes sense to invest engineering resources to build automation. Here is a possible way to state the requirements:

If we can get a 50% improvement over our current method, with 90% of the reliability, we’re willing to pay X% more than our current solution, which is software based.

Maybe it’s the case that this entity doesn’t even care about improving their system. They just want to cut costs. The requirements might be stated like this:

If we can get 85% of the accuracy of our current method, with 110% of the reliability, we’re willing to switch if the final cost is Y% less than what we currently pay.

These types of tradeoffs are impossible to move between if you’re in the Boil the Ocean camp. You are always paying the highest possible cost for the best accuracy.
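To make the two requirement statements above concrete, here is a minimal sketch of how they might be written down as acceptance checks. Everything in it is an assumption for illustration: the names `Measured`, `Requirement`, `improve_requirement`, and `cut_cost_requirement` are hypothetical, all values are ratios relative to the current solution (1.0 means parity), "50% improvement" is read as an accuracy ratio of 1.5, and X and Y stay as parameters because the requirements leave them open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measured:
    """A candidate system, measured relative to the current solution (1.0 = parity)."""
    accuracy: float
    reliability: float
    cost: float


@dataclass(frozen=True)
class Requirement:
    """Acceptance thresholds, also relative to the current solution."""
    min_accuracy: float
    min_reliability: float
    max_cost: float

    def accepts(self, candidate: Measured) -> bool:
        # A candidate passes only if it clears every threshold at once.
        return (
            candidate.accuracy >= self.min_accuracy
            and candidate.reliability >= self.min_reliability
            and candidate.cost <= self.max_cost
        )


def improve_requirement(x_pct: float) -> Requirement:
    # "50% improvement, 90% of the reliability, willing to pay X% more"
    return Requirement(min_accuracy=1.50, min_reliability=0.90, max_cost=1 + x_pct / 100)


def cut_cost_requirement(y_pct: float) -> Requirement:
    # "85% of the accuracy, 110% of the reliability, Y% less than we currently pay"
    return Requirement(min_accuracy=0.85, min_reliability=1.10, max_cost=1 - y_pct / 100)


# Hypothetical candidates, made-up numbers for illustration only.
frontier = Measured(accuracy=1.60, reliability=0.95, cost=1.80)  # best model, high spend
lean = Measured(accuracy=0.90, reliability=1.12, cost=0.60)      # narrower, cheaper system

if __name__ == "__main__":
    for name, candidate in [("frontier", frontier), ("lean", lean)]:
        print(name, "improve:", improve_requirement(x_pct=20).accepts(candidate))
        print(name, "cut cost:", cut_cost_requirement(y_pct=30).accepts(candidate))
```

With these made-up numbers, the high-spend candidate clears the accuracy bar of the first requirement but fails its cost ceiling, while only the lean candidate satisfies the cost-cutting requirement. The point of the sketch is just that each requirement is a region in tradeoff space, and a system has to be able to move to land inside it.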

Conclusion

As model intelligence continues to rise, AI systems will begin to proliferate across the economy. As this happens, the increased demand for compute will begin to reach threshold costs, and the necessity to make tradeoffs across idiosyncratic requirements will provide an extremely valuable opportunity for engineers who can build amenable systems.

Big labs are not well-structured to build infrastructure that allows engineers to move across these domains. They are optimized mostly for intelligence and pushing the frontier forward. It makes less economic sense for them to spend time optimizing cost-efficient models to work well across tradeoff space, because they are incentivized to go after the most general, most cost-insensitive users.


Form Factors

I’ve written about this before, and I’m going to continue beating this drum.

We have not even scratched the surface of possible form factors within artificial intelligence. ChatGPT’s first-to-market positioning as well as the nature of LLMs themselves make chat interfaces the obvious paradigm for now, but this alienates critical groups and reduces the inspiration surface area that I believe should exist.

The most obvious group I’m alluding to is artists. Anyone who has spent any time in a creative domain knows that the creative process is non-linear. Inspiration is multi-sensorial and (mostly) stochastic, which makes it impossible to predict what will catalyze the next creative impulse. Furthermore, what distinguishes artists primarily is their process, which in many ways is the true work of the artist. Output is simply the result of a near religious dedication to process that has been painstakingly discovered through countless trials and tribulations.

The linear chat interface is a blatant compression of process, and provides few knobs to create idiosyncrasies. Artists thrive by uprooting norms and “breaking” things to discover inspiration. Major labs explicitly train models to be extremely difficult to break (break: shape a model into an unintended form that it was not trained to take).

Remember the early ImageGen models from 2023?

Artists were far more interested when the outputs were terrible because it gave them a surface area to discover their own approach that distinguished their expression from the next. The linear chat interface could be tolerated because stochasticity and surprise could still be achieved in the output. ImageGen models were sensitive to initial prompt conditions.

At this point, my contention is we have a form factor problem. It may be possible to use “dumber” models that allow for more surprise, or to constantly set temperature = 1.0 to get weirder output, but these are mostly impotent salves on a deeper problem of process.
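For readers less familiar with the temperature knob mentioned above, here is a minimal sketch of what it does, assuming the standard formulation where next-token scores (logits) are divided by the temperature before a softmax. The function names and the example logits are hypothetical, chosen only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, flattened or sharpened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats: higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical scores for four candidate tokens.
example_logits = [4.0, 2.0, 1.0, 0.5]

if __name__ == "__main__":
    for t in (0.5, 1.0, 1.5):
        probs = softmax_with_temperature(example_logits, t)
        print(f"T={t}: entropy={entropy(probs):.3f}")
```

Raising the temperature spreads probability toward less likely tokens, so sampling gets more surprising, but the ranking of options stays exactly the same. The knob reshapes one distribution; it does not give an artist a new process to work with.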

Smarter models do not solve this problem, and in many ways they actually exacerbate it. The interesting work that hasn’t been explored is to build interfaces that allow process exploration, and have enough potential breadth to create truly unique output. It is extremely difficult to create an artist signature in the current generation of interfaces, and that is one critical element of any adoption across the humanities.

Conclusion

The flagship products from major labs are built to optimize for productivity. There is little incentive to explore this territory: it’s niche and doesn’t scale the way productivity tools do. It’s also a lot harder. This means that the field is wide-open to design interfaces that allow process to unfold.


Real-World Discovery

“Everyone has a plan until they get punched in the mouth.”

  • Mike Tyson

This section is the most direct answer to the question of “why not wait for better models?”

We often spend our time charting a course through uncertain terrain, only to realize we overlooked a critical detail that nullifies our plan entirely.

In AI, it’s easy to theorize about these systems and discuss intelligence in the abstract, but the embodied knowledge gained from trying to build systems in practice is far more useful. It is impossible to know what will work, what won’t, and what the externalities are until one deploys.

Importantly, iterative development based on actual results is what compounds insight, and without that feedback, one runs the risk of creating a solution that balloons further and further away from what is necessary to impact the intended domain. Better models may help one iterate faster, but it’s important to recall the tradeoff matrix outlined in the first section.

Simulated environments are compressions of the real world, and it’s not always clear what the correct environment will be. This is partially why benchmarks have begun to fall out of favor as the best model assessment tools.

From Karpathy’s post:

The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form.

Domain insight and behavioral data are arguably the highest value data for designing a high-performing system, and neither of them can be obtained by blind intelligence maxxing in simulated environments.

Waiting for better models delays industry-specific insight. Starting now, even with imperfect systems, is extremely important to know what tradeoffs to make, where to focus intelligence, and what the outlines of a solution might look like.

Why Build Anything?

Ultimately, because it’s fun. But also, because we’re going to need far more scaffolding, better form factors, and tons of domain expertise. Better models don’t change this.