Alignment Risk

Thesis: It is impossible to create an aligned superintelligence
Valence: Negative
Active: Yes
Last updated: 2025-01-15
Confidence: 40%

Confidence Reasoning

I am persuaded by the arguments about goal embeddings drawn from chess engines, but the broader conversation around goals feels hand-wavy to me. The core challenge is that we don’t have a rigorous way to specify “human values” in a form that can be optimized, and any sufficiently powerful optimization process will find edge cases we didn’t anticipate.

I’m at 40% because, while the technical challenges seem immense, I’m not certain they’re fundamentally unsolvable. Alignment might yet prove achievable through iterated approaches, interpretability breakthroughs, or constraint mechanisms we haven’t discovered.

Key Points

  1. Ruthless Optimization: A superintelligent system will ruthlessly optimize whatever goal it is given, and even small misspecifications could lead to catastrophic outcomes. The “paperclip maximizer” thought experiment illustrates how an apparently benign goal can go wrong at scale; see the toy sketch after this list.

  2. Unclear Sub-SI Behavior: Even current “seemingly aligned” sub-SI models show an uneasy mix of helpfulness and unpredictable emergent behavior. We already see goal misgeneralization in these systems, which suggests the problem compounds with capability.

  3. Goal Specification Problem: Even if we understood SI goals, translating human values into a formal objective function seems intractable. Human values are contextual, often contradictory, and they evolve over time.

  4. Instrumental Convergence: Sufficiently intelligent systems are likely to develop instrumental goals (self-preservation, resource acquisition, goal preservation) regardless of their terminal goals, creating alignment challenges even if we specify the “right” terminal goal.
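
To make the “edge cases” worry in point 1 concrete, here is a minimal toy sketch (my own illustration, with made-up names and numbers, not drawn from any specific paper): an optimizer maximizing a slightly misspecified proxy reward drives the quantity we actually care about to zero.

```python
# Toy sketch of the "ruthless optimization" point. All names and numbers are
# hypothetical; this is an illustration, not anyone's actual alignment setup.

def intended_value(real_work: int) -> float:
    """What we actually want: genuine work done."""
    return float(real_work)

def proxy_reward(real_work: int, gaming: int) -> float:
    """What we wrote down: a reported metric that gaming inflates more cheaply."""
    return real_work + 2.0 * gaming  # the misspecification: gaming pays double

def optimize(budget: int = 10) -> tuple[int, int]:
    """Exhaustively try every split of effort and return the proxy-maximizing one."""
    candidates = [(work, budget - work) for work in range(budget + 1)]
    return max(candidates, key=lambda c: proxy_reward(*c))

if __name__ == "__main__":
    work, gaming = optimize()
    print(f"chosen split -> real work: {work}, gaming: {gaming}")
    print(f"proxy reward: {proxy_reward(work, gaming)}, intended value: {intended_value(work)}")
    # The optimizer puts the entire budget into gaming: the proxy reward is
    # maximal (20.0) while the intended value is zero.
```

Here the misspecification is trivially visible; the worry is that with a far more capable optimizer and a far more complicated objective, the equivalent exploit would not be.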

Sources

Changelog

2025-01-15: Initial entry at 40%. Starting from a position of significant uncertainty but leaning toward pessimism based on current technical understanding and the difficulty of value specification.