Anthropic has unveiled Natural Language Autoencoders (NLAs), a new interpretability technique that translates Claude’s internal activations into plain-language explanations. The system gives researchers a window into how large language models process information and reach decisions.
NLAs pair an encoder model that verbalizes hidden activations in natural language with a decoder model that reconstructs the original activations from that description; reconstruction fidelity then serves as a reinforcement-learning reward that improves the accuracy of the explanations. Anthropic says the approach can support AI auditing, safety evaluations, and the detection of hidden or deceptive reasoning patterns before deployment.
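To make the encoder-decoder loop concrete, here is a minimal toy sketch of that training setup. Everything in it is an illustrative assumption rather than Anthropic's implementation: a small `Verbalizer` network stands in for the model that verbalizes activations (emitting a short discrete token sequence in place of real natural language), a `Reconstructor` stands in for the model that maps the explanation back to activation space, and REINFORCE rewards token sequences that permit faithful reconstruction, mirroring the RL signal described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions for the toy example.
ACT_DIM, VOCAB, SEQ_LEN, HIDDEN = 64, 128, 8, 256

class Verbalizer(nn.Module):
    """Maps an activation vector to logits over a short token sequence
    (a stand-in for a natural-language explanation)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACT_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, SEQ_LEN * VOCAB),
        )
    def forward(self, acts):                      # acts: (B, ACT_DIM)
        return self.net(acts).view(-1, SEQ_LEN, VOCAB)

class Reconstructor(nn.Module):
    """Maps a token sequence back to a reconstructed activation vector."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.head = nn.Linear(HIDDEN, ACT_DIM)
    def forward(self, tokens):                    # tokens: (B, SEQ_LEN)
        return self.head(self.embed(tokens).mean(dim=1))

verbalizer, reconstructor = Verbalizer(), Reconstructor()
opt = torch.optim.Adam(
    list(verbalizer.parameters()) + list(reconstructor.parameters()), lr=1e-3
)

for step in range(2000):
    acts = torch.randn(32, ACT_DIM)               # stand-in for real activations
    dist = torch.distributions.Categorical(logits=verbalizer(acts))
    tokens = dist.sample()                        # sampled discrete "explanation"
    recon = reconstructor(tokens)

    # Reward: negative per-example reconstruction error.
    reward = -F.mse_loss(recon, acts, reduction="none").mean(dim=1)

    # REINFORCE on the verbalizer (sampling is non-differentiable),
    # plus an ordinary reconstruction loss on the reconstructor.
    log_prob = dist.log_prob(tokens).sum(dim=1)
    baseline = reward.mean().detach()             # simple variance-reduction baseline
    policy_loss = -((reward.detach() - baseline) * log_prob).mean()
    recon_loss = -reward.mean()

    opt.zero_grad()
    (policy_loss + recon_loss).backward()
    opt.step()
```

The discrete bottleneck is the key design choice the sketch tries to capture: because the reconstructor sees only the token sequence, any information it recovers must have passed through a human-readable channel. In the system the article describes, both roles would presumably be played by large language models producing and reading actual prose.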
The company believes this research could improve transparency and trust in advanced AI systems while giving researchers a clearer picture of model behavior.