Formalizing Autonomy and Complexity in Agentic Systems
1. Introduction
What We’re Trying to Do
This document proposes a formal framework for evaluating agentic systems along two fundamental axes: autonomy (how independently an agent operates from human oversight) and complexity (how difficult the decisions it faces are). We believe these two factors are critical to understanding agentic capabilities and their implications for the future of work.
Why Formalize?
Any formalization of autonomy and complexity involves judgment calls—some arbitrary (weight parameters), some subjective (prompt specificity). Given this, why attempt formalization at all rather than offering a purely conceptual framework?
Two reasons:
- Reproducible evaluation. A formal framework with explicit variables enables consistent measurement of agentic capabilities across systems, teams, and time.
- Taxonomy over leaderboard. Benchmark scores rank systems but obscure what they can do. A multidimensional formalization reveals the shape of capability—where a system operates, not just how it compares.
2. Overview of Approach
We decompose agentic capability into two orthogonal dimensions:
- Autonomy: The degree to which an agent operates independently, measured by (a) human instruction specificity, and (b) how much of its own operational infrastructure it can define.
- Complexity: The difficulty of the decisions an agent must make, measured by the number of variables involved and their heterogeneity.
These dimensions are independent. A system can be high-autonomy/low-complexity (e.g., an unsupervised agent performing simple repetitive tasks) or low-autonomy/high-complexity (e.g., a tightly supervised agent assisting with intricate multi-factor decisions).
3. Autonomy Formalization
3.1 Variables
| Symbol | Definition | Range |
|---|---|---|
| $i$ | Expected human-intervention rate (1 = intervention every cycle) | $[0,1]$ |
| $s$ | Prompt specificity (1 = maximally narrow; 0 = totally broad) | $[0,1]$ |
| $d_p$ | Number of distinct parameter-type schemas the agent can define | $\mathbb{Z}^+$ |
| $d_o$ | Number of distinct output-type schemas the agent can define | $\mathbb{Z}^+$ |
| $D_p^{\max}, D_o^{\max}$ | Tunable upper bounds on schema variety | $\mathbb{Z}^+$ |
| $n_{\text{env}}$ | Number of environments the agent can integrate without external help | $\mathbb{Z}^+$ |
| $\lambda$ | Scale parameter for world-building saturation | $\mathbb{R}^+$ |
| $w_p, w_o, w_g$ | Weights for input schemas, output schemas, and world-building (must sum to 1) | $[0,1]$ |
| $\alpha, \beta$ | Weights for combining $A_{\text{intent}}$ and $A_{\text{self}}$ (must sum to 1) | $[0,1]$ |
3.2 Proximity to Human Intent
This component captures how much freedom the agent has relative to human instruction and oversight:
\[A_{\text{intent}} = (1 - s)(1 - i)\]
- When prompts are narrow ($s \to 1$) or intervention is frequent ($i \to 1$), autonomy decreases.
- Maximum autonomy ($A_{\text{intent}} = 1$) occurs with broad prompts and no intervention.
3.3 Self-Generation Component
This component captures the agent’s ability to define its own operational infrastructure.
Schema coverage. Normalize the agent’s ability to define input and output schemas:
\[p = \min\left(\frac{d_p}{D_p^{\max}}, 1\right), \quad o = \min\left(\frac{d_o}{D_o^{\max}}, 1\right)\]
World-building capability. Measure the agent’s ability to integrate environments without external help using an exponential saturator:
\[g = 1 - \exp\left(-\frac{n_{\text{env}}}{\lambda}\right)\]
- As $n_{\text{env}} \to \infty$, $g \to 1$.
- For small $n_{\text{env}}$, each additional environment yields diminishing returns at a rate set by $\lambda$.
Combined self-generation score:
\[A_{\text{self}} = w_p \cdot p + w_o \cdot o + w_g \cdot g\]
Or in matrix form:
\[A_{\text{self}} = \begin{bmatrix} w_p & w_o & w_g \end{bmatrix} \begin{bmatrix} p \\ o \\ g \end{bmatrix}\]
Default weights: $w_p = 0.4,\; w_o = 0.2,\; w_g = 0.4$.
3.4 Overall Autonomy Score
Combine the two components:
\[A_{\text{autonomy}} = \alpha \cdot A_{\text{intent}} + \beta \cdot A_{\text{self}}\]
Or in matrix form:
\[A_{\text{autonomy}} = \begin{bmatrix} \alpha & \beta \end{bmatrix} \begin{bmatrix} A_{\text{intent}} \\ A_{\text{self}} \end{bmatrix}\]
Default weights: $\alpha = 0.5,\; \beta = 0.5$.
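The full autonomy pipeline above is compact enough to sketch in a few lines of Python. The function below is an illustrative translation of the formulas, not a reference implementation; the argument names mirror the symbols in the variable table, and the default weights are the ones stated in the text.

```python
import math

def autonomy_score(s, i, d_p, d_o, n_env,
                   D_p_max=20, D_o_max=20, lam=5.0,
                   w=(0.4, 0.2, 0.4), alpha=0.5, beta=0.5):
    """Compute A_autonomy from the raw variables in Section 3.1."""
    # Proximity to human intent: broad prompts and rare intervention score high.
    a_intent = (1 - s) * (1 - i)
    # Schema coverage, clipped at the tunable maxima.
    p = min(d_p / D_p_max, 1.0)
    o = min(d_o / D_o_max, 1.0)
    # World-building: exponential saturation in the number of environments.
    g = 1 - math.exp(-n_env / lam)
    # Weighted self-generation component, then the overall blend.
    w_p, w_o, w_g = w
    a_self = w_p * p + w_o * o + w_g * g
    return alpha * a_intent + beta * a_self
```

Because every component is clipped or saturating, the result stays in $[0, 1]$ for any valid inputs.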
4. Complexity Formalization
4.1 Background
Complexity measurement has deep roots in information theory—notably Kolmogorov complexity and Shannon entropy. We intentionally simplify here, using pairwise Euclidean distance as a tractable proxy for variable heterogeneity. For theoretical grounding on this approach, see Xu et al. (2023), “A Complexity-Based Theory of Compositionality.”
4.2 Variables
| Symbol | Definition |
|---|---|
| $n$ | Total number of decisions |
| $m_i$ | Number of variables involved in decision $i$ |
| $\{v_{i,1}, \dots, v_{i,m_i}\} \subset \mathbb{R}^d$ | Vector representations of variables in decision $i$ |
| $\tau$ | Scale parameter for difficulty saturation |
4.3 Decision Difficulty
For each decision $i$, compute:
Disparity (average pairwise distance between variables):
\[h_i = \frac{1}{\binom{m_i}{2}} \sum_{1 \le j < k \le m_i} \lVert v_{i,j} - v_{i,k} \rVert\]
Difficulty (saturating function of variable count and disparity):
\[D_i = 1 - \exp\left(-\frac{m_i \cdot h_i}{\tau}\right)\]
4.4 Overall Complexity Score
The overall complexity is the product of individual decision difficulties:
\[C = \prod_{i=1}^{n} D_i\]
In log form:
\[\log C = \sum_{i=1}^{n} \log D_i\]
4.5 Alternative: Entropy-Based Disparity
For non-vector or categorical variables, disparity can be measured via entropy:
\[h_i = -\sum_{c} p(c) \log p(c)\]
where $p(c)$ is the fraction of variables in category $c$.
5. Worked Examples
To illustrate how this framework distinguishes between systems, we compare two agents operating in the same domain (software development) but at very different points in the autonomy-complexity space.
5.1 Example A: Code Autocomplete
A basic code completion tool (e.g., simple Copilot-style autocomplete) that suggests the next line or block of code as the developer types.
Autonomy Calculation
| Variable | Value | Rationale |
|---|---|---|
| $s$ | 0.85 | Highly specific context: cursor position, current line, file contents |
| $i$ | 0.95 | Near-constant intervention: user accepts/rejects each suggestion |
| $d_p$ | 2 | Minimal schema definition (basically: code context in, completion out) |
| $d_o$ | 2 | Limited output variety |
| $n_{\text{env}}$ | 1 | Operates only within the IDE |
Using $D_p^{\max} = 20$, $D_o^{\max} = 20$, $\lambda = 5$:
\[A_{\text{intent}} = (1 - 0.85)(1 - 0.95) = 0.15 \times 0.05 = 0.0075\]
\[p = \frac{2}{20} = 0.1, \quad o = \frac{2}{20} = 0.1, \quad g = 1 - e^{-1/5} \approx 0.18\]
\[A_{\text{self}} = 0.4(0.1) + 0.2(0.1) + 0.4(0.18) = 0.04 + 0.02 + 0.07 = 0.13\]
\[A_{\text{autonomy}} = 0.5(0.0075) + 0.5(0.13) \approx \mathbf{0.07}\]
Complexity Calculation
A representative decision: predicting the next line of code.
| Variable | Value | Rationale |
|---|---|---|
| $m$ | 4 | Variables: current line, surrounding context, language syntax, cursor position |
| $h$ | 0.3 | Low disparity—all variables are code-related and semantically similar |
Using $\tau = 10$:
\[D = 1 - \exp\left(-\frac{4 \times 0.3}{10}\right) = 1 - e^{-0.12} \approx \mathbf{0.11}\]
5.2 Example B: Autonomous Software Engineer
An agentic coding system that can independently plan features, write implementations, debug failures, and iterate—given only a high-level objective like “add user authentication to this application.”
Autonomy Calculation
| Variable | Value | Rationale |
|---|---|---|
| $s$ | 0.2 | Broad prompts: “build this feature,” “fix this bug” |
| $i$ | 0.15 | Infrequent check-ins; human reviews output, not each step |
| $d_p$ | 12 | Defines its own tool schemas, API contracts, data structures |
| $d_o$ | 10 | Produces varied outputs: code, tests, documentation, shell commands |
| $n_{\text{env}}$ | 6 | Operates across IDE, terminal, browser, file system, git, external APIs |
Using $D_p^{\max} = 20$, $D_o^{\max} = 20$, $\lambda = 5$:
\[A_{\text{intent}} = (1 - 0.2)(1 - 0.15) = 0.8 \times 0.85 = 0.68\]
\[p = \frac{12}{20} = 0.6, \quad o = \frac{10}{20} = 0.5, \quad g = 1 - e^{-6/5} \approx 0.70\]
\[A_{\text{self}} = 0.4(0.6) + 0.2(0.5) + 0.4(0.70) = 0.24 + 0.10 + 0.28 = 0.62\]
\[A_{\text{autonomy}} = 0.5(0.68) + 0.5(0.62) = \mathbf{0.65}\]
Complexity Calculation
A representative decision: designing the authentication architecture.
| Variable | Value | Rationale |
|---|---|---|
| $m$ | 10 | Variables: security requirements, existing codebase structure, user needs, performance constraints, framework conventions, session management, database schema, API design, error handling, testing strategy |
| $h$ | 0.7 | High disparity—variables span security, UX, infrastructure, and business logic |
Using $\tau = 10$:
\[D = 1 - \exp\left(-\frac{10 \times 0.7}{10}\right) = 1 - e^{-0.7} \approx \mathbf{0.50}\]
5.3 Comparison
| | Code Autocomplete | Autonomous Engineer |
|---|---|---|
| Autonomy | 0.07 | 0.65 |
| Complexity | 0.11 | 0.50 |
| Quadrant | Low-Low | High-High |
The framework captures the intuitive difference: autocomplete is a tightly constrained tool operating on narrow, homogeneous decisions, while an autonomous engineer has broad latitude and must integrate diverse considerations. Same domain, fundamentally different capability profiles.
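As a sanity check, the headline numbers in the comparison table can be reproduced with a short self-contained script (the helper names here are my own, and the parameter values are the ones used in the two examples):

```python
import math

def a_autonomy(s, i, d_p, d_o, n_env, D_max=20, lam=5.0):
    # Uses the defaults from the text: w = (0.4, 0.2, 0.4), alpha = beta = 0.5.
    a_intent = (1 - s) * (1 - i)
    p, o = min(d_p / D_max, 1), min(d_o / D_max, 1)
    g = 1 - math.exp(-n_env / lam)
    return 0.5 * a_intent + 0.5 * (0.4 * p + 0.2 * o + 0.4 * g)

def d_difficulty(m, h, tau=10.0):
    return 1 - math.exp(-m * h / tau)

print(round(a_autonomy(0.85, 0.95, 2, 2, 1), 2))   # autocomplete autonomy  → 0.07
print(round(d_difficulty(4, 0.3), 2))              # autocomplete complexity → 0.11
print(round(a_autonomy(0.2, 0.15, 12, 10, 6), 2))  # engineer autonomy      → 0.65
print(round(d_difficulty(10, 0.7), 2))             # engineer complexity    → 0.5
```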
6. Limitations and Open Questions
Subjectivity in parameters. Several inputs require judgment: prompt specificity ($s$), weight parameters, and scale factors ($\lambda$, $\tau$). We flag these as tunable rather than claiming objectivity.
Performance is not captured. This framework measures the operational space an agent occupies, not how well it performs within that space. A system attempting high-autonomy, high-complexity tasks while failing most of them would score the same as one succeeding. A separate performance axis may be needed.
Benchmarking. An interesting extension would be to chart existing models using this framework, potentially drawing on results from benchmarks like AgentBench or tau-bench.
Note: I haven’t hand-checked every number in the worked examples; they’re meant to illustrate the framework, so small arithmetic errors may remain.
6.1 A Note on AI Assistance
For transparency, I want to describe my usage of LLMs in creating this piece.
We came up with the idea to create an Autonomy vs. Complexity matrix over the summer, when we were first exploring agentic systems at Concordance. I spent the next few days writing and graphing some equations based on an intuitive sense of how the variables should interact. Once I had something in Desmos that felt pretty good, I dumped everything into a markdown file and began writing descriptions, ideas, and examples, trying to tie it all together.
Trent, one of my co-founders, put some of his thoughts in the doc, which was extremely helpful, but the paper was still basically a mess of math and some short descriptive moments that tried to make sense of it.
I took that draft of ideas/descriptions and told Claude to take only the information on the page and organize it into a clear flow, as well as to tighten up any confusing variable names or math errors.
It found a couple, cleaned the structure pretty well, and helped me create the two worked examples. The resulting piece is what came out of that process. It was really fun.
7. References
- Leike et al. (2018). Scalable Agent Alignment via Reward Modeling. https://arxiv.org/abs/1811.07871
- Shi et al. (2023). AgentBench: Evaluating Foundation Models as Agents. https://arxiv.org/abs/2308.03688
- Zhang et al. (2023). Planning with Large Language Models for Code Generation. https://arxiv.org/abs/2303.05510
- Xu et al. (2023). A Complexity-Based Theory of Compositionality. https://arxiv.org/abs/2308.11189
Appendix: Interactive Tools
- Desmos graph for complexity function: https://www.desmos.com/calculator/jslfqw1aam?tour=restrictions