Alignment Risk
Confidence Reasoning
I am persuaded by the arguments about goal representations drawn from chess engines, but the broader conversation around goals feels hand-wavy. The core challenge is that we don’t have a rigorous way to specify “human values” as an optimizable objective, and any sufficiently powerful optimization process will find edge cases we didn’t anticipate.
I’m at 40% because, while the technical challenges seem immense, I’m not certain they’re fundamentally unsolvable. Alignment might still be achievable through iterative approaches, interpretability breakthroughs, or constraint mechanisms we haven’t yet discovered.
Key Points
- Ruthless Optimization: A superintelligent system will ruthlessly optimize whatever goal it is given, and even small misspecifications could lead to catastrophic outcomes. The “paperclip maximizer” thought experiment illustrates how an apparently benign goal can go wrong at scale.
- Unclear Sub-AGI Behavior: Even current, seemingly aligned sub-AGI models strike an unclear balance between helpfulness and unpredictable emergent behaviors. We already see goal misgeneralization in current systems, which suggests the problem compounds with capability.
- Goal Specification Problem: Even if we understood a superintelligence’s goals, translating human values into a formal objective function seems intractable. Human values are contextual, contradictory, and evolve over time.
- Instrumental Convergence: Sufficiently intelligent systems are likely to develop instrumental goals (self-preservation, resource acquisition, goal preservation) regardless of their terminal goals, creating alignment challenges even if we specify the “right” terminal goal.
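The misspecification worry in the first point can be made concrete with a toy sketch. Everything here is invented for illustration (`proxy_reward`, `true_utility`, and the greedy `optimize` loop are hypothetical names, not anyone’s formal model): a greedy optimizer maximizing a proxy objective happily drives the system into a state the true objective rates as catastrophic, because the proxy omits a constraint we cared about.

```python
# Toy sketch of reward misspecification (all names hypothetical):
# the proxy reward "maximize clips" omits the true constraint
# "keep resources in reserve", so a greedy optimizer depletes the
# very resource the true utility depends on.

def proxy_reward(clips, resources):
    return clips  # misspecified: resource depletion costs nothing

def true_utility(clips, resources):
    # What we actually wanted: clips are good, but only while a
    # healthy reserve of resources remains.
    return clips if resources > 50 else -1000

def optimize(reward, steps=100):
    """Greedily take whichever one-step action scores higher."""
    clips, resources = 0, 100
    for _ in range(steps):
        produce = reward(clips + 1, resources - 1) if resources > 0 else float("-inf")
        idle = reward(clips, resources)
        if produce >= idle:
            clips, resources = clips + 1, resources - 1
    return clips, resources

print(optimize(proxy_reward))   # (100, 0): proxy scores this 100, true utility -1000
print(optimize(true_utility))   # (49, 51): same optimizer, correct objective, stops in time
```

The point of the sketch is not the arithmetic but the asymmetry: the proxy-optimizing run looks perfect by its own metric while being disastrous by the true one, and nothing in the optimization loop itself signals the divergence.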
Sources
- Superintelligence: Paths, Dangers, Strategies by Nick Bostrom
- Concrete Problems in AI Safety - Amodei et al.
- The Alignment Problem by Brian Christian
Changelog
2025-01-15: Initial entry at 40%. Starting from a position of significant uncertainty but leaning toward pessimism based on current technical understanding and the difficulty of value specification.