Alignment Risk
Confidence Reasoning
I am persuaded by the arguments about goal representations drawn from chess engines, but the broader conversation around goals feels hand-wavy. The core challenge is that we don’t have a rigorous way to specify “human values” as an optimizable objective, and any sufficiently powerful optimization process will find edge cases we didn’t anticipate.
I’m at 40% because, while the technical challenges seem immense, I’m not certain they’re fundamentally unsolvable. Alignment might still be achievable through iterative approaches, interpretability breakthroughs, or constraint mechanisms we haven’t yet discovered.
Key Points
- Ruthless Optimization: A superintelligent system will ruthlessly optimize whatever goal it is given, and even small misspecifications could lead to catastrophic outcomes. The “paperclip maximizer” thought experiment illustrates how an apparently benign goal can go wrong at scale.
- Unclear Sub-AGI Behavior: Even current, seemingly aligned sub-AGI models strike an unclear balance between helpfulness and unpredictable emergent behaviors. We already see goal misgeneralization in current systems, which suggests the problem compounds with capability.
- Goal Specification Problem: Even if we understood a superintelligence’s goals, translating human values into a formal objective function seems intractable. Human values are contextual, contradictory, and evolve over time.
- Instrumental Convergence: Sufficiently intelligent systems are likely to develop instrumental goals (self-preservation, resource acquisition, goal preservation) regardless of their terminal goals, creating alignment challenges even if we specify the “right” terminal goal.
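The misspecification worry in the first point can be made concrete with a toy sketch. Everything here is invented for illustration (`proxy_reward`, `true_utility`, and the greedy `optimize` loop are hypothetical names, not anyone’s formal model): a greedy optimizer maximizing a proxy objective happily drives the system into a state the true objective rates as catastrophic, because the proxy omits a constraint we cared about.

```python
# Toy sketch of reward misspecification (all names hypothetical):
# the proxy reward "maximize clips" omits the true constraint
# "keep resources in reserve", so a greedy optimizer depletes the
# very resource the true utility depends on.

def proxy_reward(clips, resources):
    return clips  # misspecified: resource depletion costs nothing

def true_utility(clips, resources):
    # What we actually wanted: clips are good, but only while a
    # healthy reserve of resources remains.
    return clips if resources > 50 else -1000

def optimize(reward, steps=100):
    """Greedily take whichever one-step action scores higher."""
    clips, resources = 0, 100
    for _ in range(steps):
        produce = reward(clips + 1, resources - 1) if resources > 0 else float("-inf")
        idle = reward(clips, resources)
        if produce >= idle:
            clips, resources = clips + 1, resources - 1
    return clips, resources

print(optimize(proxy_reward))   # (100, 0): proxy scores this 100, true utility -1000
print(optimize(true_utility))   # (49, 51): same optimizer, correct objective, stops in time
```

The point of the sketch is not the arithmetic but the asymmetry: the proxy-optimizing run looks perfect by its own metric while being disastrous by the true one, and nothing in the optimization loop itself signals the divergence.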
Sources
- Superintelligence: Paths, Dangers, Strategies by Nick Bostrom
- Concrete Problems in AI Safety - Amodei et al.
- The Alignment Problem by Brian Christian
Changelog
2025-01-15: Initial entry at 40%. Starting from a position of significant uncertainty but leaning toward pessimism based on current technical understanding and the difficulty of value specification.