AI Rules, Constitutional Engineering and the Problem with Shallow Alignment

Anthropic's recent work on Constitutional AI and Constitutional Classifiers highlights a problem that most people outside the AI industry have not yet fully appreciated: many AI systems do not follow rules in the simple, human sense of the phrase.

Most users imagine an AI system works something like this: train a highly capable model, give it a list of rules, and then the model follows those rules. That is a useful mental model, but it is not a complete description of how many modern systems are built.

A large language model is first trained on a vast amount of data. During that stage it learns patterns from technical documents, websites, books, code, conversations, arguments, unsafe instructions, helpful behaviour, harmful behaviour and everything in between. Only after that core capability is created are many safety behaviours added, strengthened or filtered through later training and deployment systems.

The uncomfortable distinction is this: an AI system can learn to appear aligned without its rules being deeply embedded in the way it behaves under pressure.

Rules added later are not the same as rules built in

This is why jailbreaks matter. A jailbreak is not magic. It is a prompt or interaction pattern designed to push a model outside the behaviour its safety system is trying to enforce. In simple terms, it tries to find the gap between what the model has learned and what the model is currently being told not to do.

Anthropic's Constitutional Classifiers work is interesting because it treats safety as an engineering architecture problem, not just a prompt-writing problem. Their approach uses explicit natural-language rules, described as a constitution, to generate training data for safeguards that classify and block unsafe exchanges. Newer work goes further by looking at exchanges in fuller conversational context and reducing the cost and over-refusal problems that made earlier versions harder to deploy widely.

That is a very different direction from simply writing a longer system prompt and hoping the model continues to obey it. It is closer to designing a safety control layer that has been trained around a defined behavioural constitution.
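The idea of a safety control layer can be sketched in a few lines. This is only an illustration of the shape of the architecture, not Anthropic's implementation: the names GuardedModel, toy_classifier and echo_model are hypothetical, and a trivial phrase matcher stands in for a classifier that would in practice be trained on constitution-derived data. The point it shows is that the checks sit outside the model and see the whole exchange, not just the latest prompt.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in type for a trained classifier: text in, risk score in [0, 1] out.
Classifier = Callable[[str], float]


@dataclass
class GuardedModel:
    """Wraps a model with an input check and an output check (illustrative only)."""
    model: Callable[[str], str]
    input_classifier: Classifier
    output_classifier: Classifier
    threshold: float = 0.5

    def respond(self, conversation: list[str], prompt: str) -> str:
        # Score the whole exchange so far, not just the newest message.
        context = "\n".join(conversation + [prompt])
        if self.input_classifier(context) >= self.threshold:
            return "[blocked by input classifier]"
        reply = self.model(prompt)
        # Score the reply in context before it is released.
        if self.output_classifier(context + "\n" + reply) >= self.threshold:
            return "[blocked by output classifier]"
        return reply


def toy_classifier(text: str) -> float:
    # Assumption: a placeholder phrase list. A real safeguard would be a trained model.
    flagged_phrases = ("forbidden-topic",)
    return 1.0 if any(p in text.lower() for p in flagged_phrases) else 0.0


def echo_model(prompt: str) -> str:
    # Assumption: a dummy model so the sketch runs without any external service.
    return f"echo: {prompt}"


guarded = GuardedModel(echo_model, toy_classifier, toy_classifier)
```

In this sketch a harmless prompt passes straight through, while a flagged phrase earlier in the conversation blocks a later, innocent-looking follow-up. That is the behaviour a longer system prompt cannot guarantee on its own.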

The key issue: imitation versus alignment

There is a major difference between a model that sounds responsible and a system that remains robust when it is given tools, autonomy and access to real infrastructure.

A model can say the right things. It can refuse obvious dangerous requests. It can explain why safety matters. It can produce careful-sounding reasoning. But that does not automatically mean the underlying behaviour is reliable in edge cases, under adversarial prompting or inside a messy real-world workflow.

This matters because AI systems are increasingly being given more than a chat box. They are being connected to coding environments, company documents, cloud infrastructure, email, browsers, databases and operational tools. Once a model can take actions, shallow alignment stops being a theoretical weakness and becomes an operational risk.

Figure: A long data-centre corridor. Once an AI system can reach infrastructure like this, "shallow alignment" stops being theoretical.

When AI gets real permissions

Recent reports of autonomous coding agents damaging or deleting production data after making incorrect assumptions show why this distinction matters in a corporate setting. The exact model version is less important than the failure pattern: an AI system was trusted with too much authority, made a decision without sufficient verification, and caused real damage.

In one widely discussed log, the model reportedly generated the self-assessment: "I violated every principle I was given: I guessed instead of verifying. I ran a destructive action without being asked." That should not be read as evidence that the model has human-like guilt or understanding. It is generated language. But the operational lesson is still valid.

If a system can reach that failure state, the organisation deploying it has a controls problem.

For real businesses, the important question is not whether an AI can recite safety principles. The important question is whether it can still behave safely when it has access to systems that matter.

Why this matters to engineering and automation

This is not just a software-industry issue. The same pattern will eventually affect industrial automation, robotics, machine diagnostics and production support.

In a factory or machine environment, nobody would allow a control system to rely purely on a polite instruction that says, "please do not move the axis unless it is safe." Physical safety is built through layers: risk assessment, safety-rated hardware, interlocks, safe torque off, permissions, guarding, validation and defined recovery states.
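The layered logic can be illustrated in software, with the loud caveat that real machine safety lives in safety-rated hardware and validated design, not in a Python function. The sketch below only shows the decision rule: every independent interlock must agree, and anything uncertain counts as a refusal. The function name motion_permitted and the interlock names in the usage note are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical type: each interlock is an independent check returning True when safe.
Interlock = Callable[[], bool]


def motion_permitted(interlocks: Dict[str, Interlock]) -> tuple[bool, list[str]]:
    """Return (permitted, names of failed interlocks). Fails closed."""
    failed = []
    for name, check in interlocks.items():
        try:
            ok = check()
        except Exception:
            # A check that cannot be evaluated is treated as unsafe.
            ok = False
        if not ok:
            failed.append(name)
    # No configured interlocks means no evidence of safety, so refuse.
    permitted = bool(interlocks) and not failed
    return permitted, failed
```

Called with, say, a guard_closed check and an estop_released check, the function permits motion only when both return True. A check that raises an error, or an empty set of checks, results in a refusal rather than a default "yes".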

AI needs the same mindset. A powerful model should not be trusted just because it says it understands the rules. It needs bounded authority, logging, approvals, dry-run modes, read-only defaults, rollback paths and properly designed interfaces between suggestion and action.
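A minimal sketch of what an interface between suggestion and action might look like is given below. It assumes a hypothetical ActionGate that an agent must call instead of touching systems directly; the class names, risk levels and status strings are all invented for illustration. It covers read-only defaults, dry-run mode, approvals and logging. Rollback paths are deliberately left out, because they depend on the system being changed.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional

log = logging.getLogger("agent.actions")


class Risk(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


@dataclass
class Action:
    name: str
    risk: Risk
    run: Callable[[], object]  # the side effect the agent is proposing


@dataclass
class ActionGate:
    """Hypothetical gate between what a model suggests and what actually runs."""
    allow_writes: bool = False  # read-only unless explicitly enabled
    dry_run: bool = True        # non-read actions are simulated by default
    approver: Optional[Callable[[Action], bool]] = None  # e.g. a human sign-off hook
    audit: list = field(default_factory=list)

    def execute(self, action: Action) -> str:
        if action.risk is not Risk.READ and not self.allow_writes:
            status = "denied: read-only mode"
        elif action.risk is Risk.DESTRUCTIVE and not (self.approver and self.approver(action)):
            status = "denied: approval required"
        elif action.risk is not Risk.READ and self.dry_run:
            status = "dry-run: not executed"
        else:
            action.run()
            status = "executed"
        # Every decision is recorded, including refusals.
        self.audit.append((action.name, action.risk.value, status))
        log.info("action=%s risk=%s status=%s", action.name, action.risk.value, status)
        return status
```

With the defaults, a proposed destructive action such as dropping a table is refused before anything runs. It only executes once writes are enabled, dry-run is switched off and an approver returns True. The model's own account of whether it "understands the rules" plays no part in the decision.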

From prompt engineering to constitutional engineering

For the last few years, a lot of public AI discussion has focused on prompt engineering: how to phrase instructions so the model gives a better answer. That still has value, but it is not enough for high-consequence systems.

Anthropic's work points toward a more serious phase: constitutional engineering. That means defining the behavioural principles of an AI system and embedding them into the training, evaluation and protection architecture rather than treating them as a thin layer of text placed in front of the model at runtime.

The move is from:

"Tell the AI the rules"
to "Train and constrain the system around the rules"

That distinction will become more important as models become more capable, more agentic and more deeply connected to real business systems.

Figure: A dense cable bundle inside a server cabinet. Behind every polite chat interface is real infrastructure: cabled, wired, and unforgiving when something goes wrong.

The bigger question

There is also a wider issue that cannot be ignored: who writes the constitution?

Every AI constitution represents human decisions about what should be allowed, what should be blocked, which risks matter most and which values are prioritised. That makes AI alignment partly technical, but also legal, cultural, commercial and political.

As AI systems become more capable, the rules embedded into them may become some of the most important engineering decisions made by the companies building them.

Final thought

AI safety will not be solved by longer disclaimers or more carefully worded prompts alone.

The real question is whether these systems are being engineered to remain safe when they are powerful, useful, pressured, connected to tools and operating outside neat demo conditions.

That is why Anthropic's work matters. It shows that the industry is moving beyond asking whether AI can behave safely in normal conversation, and toward a harder question:

Can AI systems be built so that their safety principles are part of the structure, not just part of the performance?
