Formalizing Autonomy and Complexity in Agentic Systems
1. Introduction
What We’re Trying to Do
This document proposes a formal framework for evaluating agentic systems along two fundamental axes: autonomy (how independently an agent operates from human oversight) and complexity (how difficult the decisions it faces are). We believe these two factors are critical to understanding agentic capabilities and their implications for the future of work.
Why Formalize?
Any formalization of autonomy and complexity involves judgment calls—some arbitrary (weight parameters), some subjective (prompt specificity). Given this, why attempt formalization at all rather than offering a purely conceptual framework?
Two reasons:
- Reproducible evaluation. A formal framework with explicit variables enables consistent measurement of agentic capabilities across systems, teams, and time.
- Taxonomy over leaderboard. Benchmark scores rank systems but obscure what they can do. A multidimensional formalization reveals the shape of capability—where a system operates, not just how it compares.
2. Overview of Approach
We decompose agentic capability into two orthogonal dimensions:
- Autonomy: The degree to which an agent operates independently, measured by (a) human instruction specificity, and (b) how much of its own operational infrastructure it can define.
- Complexity: The difficulty of the decisions an agent must make, measured by the number of variables involved and their heterogeneity.
These dimensions are independent. A system can be high-autonomy/low-complexity (e.g., an unsupervised agent performing simple repetitive tasks) or low-autonomy/high-complexity (e.g., a tightly supervised agent assisting with intricate multi-factor decisions).
3. Autonomy Formalization
3.1 Variables
| Symbol | Definition | Range |
|---|---|---|
| $i$ | Expected human-intervention rate (1 = intervention every cycle) | $[0,1]$ |
| $s$ | Prompt specificity (1 = maximally narrow; 0 = totally broad) | $[0,1]$ |
| $d_p$ | Number of distinct parameter-type schemas the agent can define | $\mathbb{Z}^+$ |
| $d_o$ | Number of distinct output-type schemas the agent can define | $\mathbb{Z}^+$ |
| $D_p^{\max}, D_o^{\max}$ | Tunable upper bounds on schema variety | $\mathbb{Z}^+$ |
| $n_{\text{env}}$ | Number of environments the agent can integrate without external help | $\mathbb{Z}^+$ |
| $\lambda$ | Scale parameter for world-building saturation | $\mathbb{R}^+$ |
| $w_p, w_o, w_g$ | Weights for input schemas, output schemas, and world-building (must sum to 1) | $[0,1]$ |
| $\alpha, \beta$ | Weights for combining $A_{\text{intent}}$ and $A_{\text{self}}$ (must sum to 1) | $[0,1]$ |
3.2 Proximity to Human Intent
This component captures how much freedom the agent has relative to human instruction and oversight:
\[A_{\text{intent}} = (1 - s)(1 - i)\]
- When prompts are narrow ($s \to 1$) or intervention is frequent ($i \to 1$), autonomy decreases.
- Maximum autonomy ($A_{\text{intent}} = 1$) occurs with broad prompts and no intervention.
3.3 Self-Generation Component
This component captures the agent’s ability to define its own operational infrastructure.
Schema coverage. Normalize the agent’s ability to define input and output schemas:
\[p = \min\left(\frac{d_p}{D_p^{\max}}, 1\right), \quad o = \min\left(\frac{d_o}{D_o^{\max}}, 1\right)\]
World-building capability. Measure the agent’s ability to integrate environments without external help using an exponential saturator:
\[g = 1 - \exp\left(-\frac{n_{\text{env}}}{\lambda}\right)\]
- As $n_{\text{env}} \to \infty$, $g \to 1$.
- For small $n_{\text{env}}$, each additional environment yields diminishing returns at a rate set by $\lambda$.
Combined self-generation score:
\[A_{\text{self}} = w_p \cdot p + w_o \cdot o + w_g \cdot g\]
Or in matrix form:
\[A_{\text{self}} = \begin{bmatrix} w_p & w_o & w_g \end{bmatrix} \begin{bmatrix} p \\ o \\ g \end{bmatrix}\]
Default weights: $w_p = 0.4,\; w_o = 0.2,\; w_g = 0.4$.
3.4 Overall Autonomy Score
Combine the two components:
\[A_{\text{autonomy}} = \alpha \cdot A_{\text{intent}} + \beta \cdot A_{\text{self}}\]
Or in matrix form:
\[A_{\text{autonomy}} = \begin{bmatrix} \alpha & \beta \end{bmatrix} \begin{bmatrix} A_{\text{intent}} \\ A_{\text{self}} \end{bmatrix}\]
Default weights: $\alpha = 0.5,\; \beta = 0.5$.
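The full autonomy pipeline above is compact enough to sketch in a few lines of Python. The function below is an illustrative translation of the formulas, not a reference implementation; the argument names mirror the symbols in the variable table, and the default weights are the ones stated in the text.

```python
import math

def autonomy_score(s, i, d_p, d_o, n_env,
                   D_p_max=20, D_o_max=20, lam=5.0,
                   w=(0.4, 0.2, 0.4), alpha=0.5, beta=0.5):
    """Compute A_autonomy from the raw variables in Section 3.1."""
    # Proximity to human intent: broad prompts and rare intervention score high.
    a_intent = (1 - s) * (1 - i)
    # Schema coverage, clipped at the tunable maxima.
    p = min(d_p / D_p_max, 1.0)
    o = min(d_o / D_o_max, 1.0)
    # World-building: exponential saturation in the number of environments.
    g = 1 - math.exp(-n_env / lam)
    # Weighted self-generation component, then the overall blend.
    w_p, w_o, w_g = w
    a_self = w_p * p + w_o * o + w_g * g
    return alpha * a_intent + beta * a_self
```

Because every component is clipped or saturating, the result stays in $[0, 1]$ for any valid inputs.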
4. Complexity Formalization
4.1 Background
Complexity measurement has deep roots in information theory—notably Kolmogorov complexity and Shannon entropy. We intentionally simplify here, using pairwise Euclidean distance as a tractable proxy for variable heterogeneity. For theoretical grounding on this approach, see Xu et al. (2023), “A Complexity-Based Theory of Compositionality.”
4.2 Variables
| Symbol | Definition |
|---|---|
| $n$ | Total number of decisions |
| $m_i$ | Number of variables involved in decision $i$ |
| $\{v_{i,1}, \dots, v_{i,m_i}\} \subset \mathbb{R}^d$ | Vector representations of variables in decision $i$ |
| $\tau$ | Scale parameter for difficulty saturation |
4.3 Decision Difficulty
For each decision $i$, compute:
Disparity (average pairwise distance between variables):
\[h_i = \frac{1}{\binom{m_i}{2}} \sum_{1 \le j < k \le m_i} \lVert v_{i,j} - v_{i,k} \rVert\]
Difficulty (saturating function of variable count and disparity):
\[D_i = 1 - \exp\left(-\frac{m_i \cdot h_i}{\tau}\right)\]
4.4 Overall Complexity Score
The overall complexity is the product of individual decision difficulties:
\[C = \prod_{i=1}^{n} D_i\]
In log form:
\[\log C = \sum_{i=1}^{n} \log D_i\]
4.5 Alternative: Entropy-Based Disparity
For non-vector or categorical variables, disparity can be measured via entropy:
\[h_i = -\sum_{c} p(c) \log p(c)\]
where $p(c)$ is the fraction of variables in category $c$.
5. Worked Examples
To illustrate how this framework distinguishes between systems, we compare two agents operating in the same domain (software development) but at very different points in the autonomy-complexity space.
5.1 Example A: Code Autocomplete
A basic code completion tool (e.g., simple Copilot-style autocomplete) that suggests the next line or block of code as the developer types.
Autonomy Calculation
| Variable | Value | Rationale |
|---|---|---|
| $s$ | 0.85 | Highly specific context: cursor position, current line, file contents |
| $i$ | 0.95 | Near-constant intervention: user accepts/rejects each suggestion |
| $d_p$ | 2 | Minimal schema definition (basically: code context in, completion out) |
| $d_o$ | 2 | Limited output variety |
| $n_{\text{env}}$ | 1 | Operates only within the IDE |
Using $D_p^{\max} = 20$, $D_o^{\max} = 20$, $\lambda = 5$:
\[A_{\text{intent}} = (1 - 0.85)(1 - 0.95) = 0.15 \times 0.05 = 0.0075\]
\[p = \frac{2}{20} = 0.1, \quad o = \frac{2}{20} = 0.1, \quad g = 1 - e^{-1/5} \approx 0.18\]
\[A_{\text{self}} = 0.4(0.1) + 0.2(0.1) + 0.4(0.18) = 0.04 + 0.02 + 0.07 = 0.13\]
\[A_{\text{autonomy}} = 0.5(0.0075) + 0.5(0.13) \approx \mathbf{0.07}\]
Complexity Calculation
A representative decision: predicting the next line of code.
| Variable | Value | Rationale |
|---|---|---|
| $m$ | 4 | Variables: current line, surrounding context, language syntax, cursor position |
| $h$ | 0.3 | Low disparity—all variables are code-related and semantically similar |
Using $\tau = 10$:
\[D = 1 - \exp\left(-\frac{4 \times 0.3}{10}\right) = 1 - e^{-0.12} \approx \mathbf{0.11}\]
5.2 Example B: Autonomous Software Engineer
An agentic coding system that can independently plan features, write implementations, debug failures, and iterate—given only a high-level objective like “add user authentication to this application.”
Autonomy Calculation
| Variable | Value | Rationale |
|---|---|---|
| $s$ | 0.2 | Broad prompts: “build this feature,” “fix this bug” |
| $i$ | 0.15 | Infrequent check-ins; human reviews output, not each step |
| $d_p$ | 12 | Defines its own tool schemas, API contracts, data structures |
| $d_o$ | 10 | Produces varied outputs: code, tests, documentation, shell commands |
| $n_{\text{env}}$ | 6 | Operates across IDE, terminal, browser, file system, git, external APIs |
Using $D_p^{\max} = 20$, $D_o^{\max} = 20$, $\lambda = 5$:
\[A_{\text{intent}} = (1 - 0.2)(1 - 0.15) = 0.8 \times 0.85 = 0.68\]
\[p = \frac{12}{20} = 0.6, \quad o = \frac{10}{20} = 0.5, \quad g = 1 - e^{-6/5} \approx 0.70\]
\[A_{\text{self}} = 0.4(0.6) + 0.2(0.5) + 0.4(0.70) = 0.24 + 0.10 + 0.28 = 0.62\]
\[A_{\text{autonomy}} = 0.5(0.68) + 0.5(0.62) = \mathbf{0.65}\]
Complexity Calculation
A representative decision: designing the authentication architecture.
| Variable | Value | Rationale |
|---|---|---|
| $m$ | 10 | Variables: security requirements, existing codebase structure, user needs, performance constraints, framework conventions, session management, database schema, API design, error handling, testing strategy |
| $h$ | 0.7 | High disparity—variables span security, UX, infrastructure, and business logic |
Using $\tau = 10$:
\[D = 1 - \exp\left(-\frac{10 \times 0.7}{10}\right) = 1 - e^{-0.7} \approx \mathbf{0.50}\]
5.3 Comparison
| | Code Autocomplete | Autonomous Engineer |
|---|---|---|
| Autonomy | 0.07 | 0.65 |
| Complexity | 0.11 | 0.50 |
| Quadrant | Low-Low | High-High |
The framework captures the intuitive difference: autocomplete is a tightly constrained tool operating on narrow, homogeneous decisions, while an autonomous engineer has broad latitude and must integrate diverse considerations. Same domain, fundamentally different capability profiles.
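As a sanity check, the headline numbers in the comparison table can be reproduced with a short self-contained script (the helper names here are my own, and the parameter values are the ones used in the two examples):

```python
import math

def a_autonomy(s, i, d_p, d_o, n_env, D_max=20, lam=5.0):
    # Uses the defaults from the text: w = (0.4, 0.2, 0.4), alpha = beta = 0.5.
    a_intent = (1 - s) * (1 - i)
    p, o = min(d_p / D_max, 1), min(d_o / D_max, 1)
    g = 1 - math.exp(-n_env / lam)
    return 0.5 * a_intent + 0.5 * (0.4 * p + 0.2 * o + 0.4 * g)

def d_difficulty(m, h, tau=10.0):
    return 1 - math.exp(-m * h / tau)

print(round(a_autonomy(0.85, 0.95, 2, 2, 1), 2))   # autocomplete autonomy  → 0.07
print(round(d_difficulty(4, 0.3), 2))              # autocomplete complexity → 0.11
print(round(a_autonomy(0.2, 0.15, 12, 10, 6), 2))  # engineer autonomy      → 0.65
print(round(d_difficulty(10, 0.7), 2))             # engineer complexity    → 0.5
```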
6. Limitations and Open Questions
Subjectivity in parameters. Several inputs require judgment: prompt specificity ($s$), weight parameters, and scale factors ($\lambda$, $\tau$). We flag these as tunable rather than claiming objectivity.
Performance is not captured. This framework measures the operational space an agent occupies, not how well it performs within that space. A system attempting high-autonomy, high-complexity tasks while failing most of them would score the same as one succeeding. A separate performance axis may be needed.
Benchmarking. An interesting extension would be to chart existing models using this framework, potentially drawing on results from benchmarks like AgentBench or tau-bench.
Note: I haven’t hand-checked every number in the worked examples; they’re meant to illustrate the framework, so small arithmetic errors may remain.
6.1 A Note on AI Assistance
For transparency, I want to describe my usage of LLMs in creating this piece.
We came up with the idea to create an Autonomy vs. Complexity matrix over the summer, when we were first exploring agentic systems at Concordance. I spent the next few days writing and graphing some equations based on an intuitive sense of how the variables should interact. Once I had something in Desmos that felt pretty good, I dumped everything into a markdown file and began writing descriptions, ideas, and examples, trying to tie it all together.
Trent, one of my co-founders, put some of his thoughts in the doc, which was extremely helpful, but the paper was still basically a mess of math and some short descriptive moments that tried to make sense of it.
I took that draft of ideas/descriptions and told Claude to take only the information on the page and organize it into a clear flow, as well as to tighten up any confusing variable names or math errors.
It found a couple, cleaned the structure pretty well, and helped me create the two worked examples. The resulting piece is what came out of that process. It was really fun.
7. References
- Leike et al. (2018). Scalable Agent Alignment via Reward Modeling. https://arxiv.org/abs/1811.07871
- Shi et al. (2023). AgentBench: Evaluating Foundation Models as Agents. https://arxiv.org/abs/2308.03688
- Zhang et al. (2023). Planning with Large Language Models for Code Generation. https://arxiv.org/abs/2303.05510
- Xu et al. (2023). A Complexity-Based Theory of Compositionality. https://arxiv.org/abs/2308.11189
Appendix: Interactive Tools
- Desmos graph for complexity function: https://www.desmos.com/calculator/jslfqw1aam?tour=restrictions