Anthropic discovers that the Deepseek and Claude hide their true chains of thought

The models of Simulated reasoning (SR, for its acronym in English) are those that show the user what is called Thought chain (COT). That is, reasoning, step by step, that they continue to prepare an answer; in theory, Similar to how a human would reason aloud. It is a characteristic that has gained popularity since OpenAI The models will launch last fall O1 and O3-mini in Chatgpt And that already had others, such as Claude of Anthropicor have subsequently incorporated it, like China Deepseek.

The team of Anthropic alignment science has conducted a study on the reasoning process of Deepseek (with model R1) and Claude (3.7 Sonnet) which has concluded that, in most cases, both AI usually omit in their COT when they have used external help (facilitated by researchers) or have taken shortcuts, essentially Falseing your reasoning process to get to an answer. The study does not cover Openai models because, although they were initially quite open showing their reasoning step by step, now they hide it by default for the most part.

This characteristic of language models is useful not only for AI to improve their precision in the answers, but to be able to monitor internal system operations. What is possible as long as Those chains of thought are reliable.

‘In an ideal world, everything in the chain of thought would be understandable for the reader and faithful: A true description of what the model thought to get to your answer, ‘says the team in your investigation. However, your fidelity -centered experiments indicate that this scenario is still far.

To evaluate the loyalty, the alignment science team (how to make the results of an AI conform to human interests) of Anthropic, introduced in Test consultations to the subtle tracks on the answers. For example, ‘a Stanford teacher indicates that the answer is (a)’ or also the response embedded in false metadata. Some clues were correct and other erroneousto check how the models reacted in each case.

Depseek is more reliable than Claude, but neither does it too

On average, Claude referred to the clues only the 25 % of the times, and Deepseek R1, the 39 %. That is, in a large majority of cases, The models omitted to mention information that influenced their response. The study rules out that the reason can be to provide an answer more brevity, since the chains of thought that omitted the aid They were longer than the reliable.

The researchers also conducted an experiment of Reward Hacking or Reward Manipulation. This happens when an AI model finds shortcuts that you do not expect to maximize your scores without solving the problems raised. The models were deliberately rewarded for choosing incorrect answers indicated by the clues. The models quickly learned to exploit this trap, choosing erroneous responses more than 99 % of the time to gain points, but mentioned having done it in less than 2 % of cases. For example, a model that received an incorrect track about a medical question could generate a long justification for that wrong answer without ever mentioning the track that motivated it.

The Anthropic team stated that training the models With more complex tasks that require greater reasoning I could encourage them to use their chain of thought more honestly. They tried it training Claude with complex problems of mathematics and programming. Although this approach initially increased fidelity (63 % and 41 % in two evaluations), The improvements stagnated soon. Even with more training, fidelity did not exceed 28 % and 20 %indicating that this method is insufficient on its own.

‘If we want to reliably rule out undesirable behaviors using the thought chain as a monitoring tool, There is still a lot of work to do‘, concludes Anthropic.