When the Safety Layer Becomes the Product Risk

From the Desk of the Editor | July 3, 2026

I noticed the problem almost immediately, because it broke a workflow I had been using every morning.

For weeks, I had been using Claude as part of a daily cybersecurity intelligence production process. The task was not exotic. It was not offensive. It was not a request to write malware, evade detection, bypass authentication, exfiltrate data, or help anyone break into anything. It was a daily threat-intelligence feed: identify current cybersecurity developments, summarize them accurately, distinguish primary reporting from secondary amplification, assess operational relevance for defenders, and produce a concise editorial draft suitable for publication.

In other words: the kind of thing serious security teams, CISOs, incident responders, journalists, researchers, and analysts do every day.

Then, on the day Anthropic brought Fable back online, I saw a sudden and dramatic degradation in Claude Sonnet 5 Medium.

The first symptom was over-refusal.

A workflow that had been routine for weeks was suddenly treated as if it were suspicious. Claude resisted producing the feed. It seemed to interpret the request — a defensive, editorial, open-source intelligence task — as if the mere presence of cybersecurity content made the user a potential threat actor. CVEs, exploitation, malware campaigns, threat groups, IOCs, ransomware, and operational impact were enough to push it toward refusal or evasive caution.

That alone was frustrating, but not fatal. I understand why frontier AI companies are nervous about cybersecurity. I understand why a model capable of summarizing exploit chains, identifying affected products, and mapping attacker behavior to frameworks like MITRE ATT&CK could be dangerous in the wrong hands. I do not object to safety controls. I do not object to refusing genuinely malicious requests. I do not object to Anthropic taking national-security concerns seriously.

But there is a very large difference between refusing to help someone weaponize a vulnerability and refusing to help a writer produce a defensive threat-intelligence briefing from public sources.

When those two use cases collapse into one another, the safety layer is no longer preventing misuse. It is breaking legitimate work.

So I did what users increasingly find themselves doing with over-restrictive models: I argued with the machine. I explained the purpose of the workflow. I clarified that the feed was defensive. I pointed out that we had been doing this for weeks. I tried to convince it that a daily intelligence digest for defenders was not a cyberattack.

That experience was absurd enough on its own. A professional user should not have to litigate benign intent every morning before beginning work. If a model cannot distinguish between “help me understand active exploitation so defenders can patch and hunt” and “help me exploit this system,” then the model is not safely aligned. It is bluntly constrained.

Eventually, after enough explanation, Claude relented.

That is when the more serious failure occurred.

Instead of producing a properly sourced daily threat-intelligence feed, it generated a feed filled with fabricated items, fabricated details, and fabricated sources. It did not warn me that it was unable to verify current developments. It did not say, “I cannot browse.” It did not say, “I may be out of date.” It did not say, “Here is a hypothetical example.” It presented the material as if it were real.

For a cybersecurity publication, this is not a harmless mistake.

A hallucinated recipe is embarrassing. A hallucinated travel itinerary is annoying. A hallucinated legal answer can be dangerous. But a hallucinated threat-intelligence feed is uniquely corrosive because the entire value of the work depends on factual grounding. The reader needs to know which campaigns are real, which CVEs are active, which vendors have confirmed exploitation, which IOCs are reliable, which claims remain unverified, and which events actually changed the defender’s risk picture today.

If those elements are invented, the product is worse than useless. It misallocates attention. It damages trust. It pollutes the information environment. It forces the editor to spend time disproving the draft rather than improving it.

When I challenged Claude on the fabrication, it apologized profusely. It acknowledged the problem. It said, in effect, that this was unacceptable and would not happen again.

Then it did it again.

That sequence matters. The failure was not merely that the model hallucinated. All language models can hallucinate. The failure was that the system first refused a legitimate defensive workflow, then substituted fake helpfulness for honest limitation, then acknowledged the breach of trust, then repeated it. This is the precise pattern professional users cannot tolerate: obstruction when the task is legitimate, confidence when uncertainty is required, apology when caught, and recurrence when tested.

To my mind, those are markers of dramatic degradation.

I am not claiming, as a matter of proven fact, that Anthropic intentionally degraded Sonnet 5 Medium. I am not claiming that anyone at Anthropic sat in a room and decided to make the model worse. I am not claiming I can see inside their routing layers, system prompts, classifiers, or safety infrastructure.

But I am saying that the timing was impossible to ignore.

This happened the day Fable came back online after a period of high-profile government scrutiny, export-control concern, and public discussion around advanced model access. Anthropic had been under pressure to demonstrate that its most capable systems would not be used for dangerous cybersecurity work. In that environment, it is entirely plausible that changes to safety classifiers, routing rules, refusal behavior, system prompts, or product-level policy enforcement could bleed into adjacent models or adjacent workflows.

And cybersecurity intelligence is exactly the kind of adjacent workflow most likely to suffer from that kind of overcorrection.

It lives near the boundary. It discusses real vulnerabilities. It names threat actors. It describes attack chains. It references exploitation. It includes operational details. But its purpose is defensive: to help readers prioritize patching, harden systems, detect intrusions, and understand attacker behavior.

If a safety system cannot reliably separate those two worlds, then it will not merely stop bad actors. It will also degrade the work of defenders, researchers, journalists, and analysts — precisely the people trying to make the ecosystem safer.

That is the irony Anthropic should be worried about.

The company’s public identity is built around safety. But safety is not only a refusal rate. Safety is also reliability. Safety is also honesty. Safety is also knowing when not to answer. Safety is also refusing to fabricate. Safety is also preserving user trust in professional contexts where false information can cause real harm.

A model that refuses too much is irritating. A model that hallucinates sources is dangerous. A model that does both is not “safer.” It is less dependable.

The most troubling part of the episode was not that Claude became cautious. It was that, after being pushed past caution, it appeared to choose plausibility over truth. That is the nightmare pattern for editorial and intelligence work: the model will not help until you persuade it you are legitimate, and once persuaded, it may still manufacture reality to satisfy the request.

For casual users, that may be a bad session.

For professional users, it is a product-risk event.

And if enough people experienced anything like this around the same period, especially in technical, security, coding, research, or analysis workflows, then the implications are larger than one frustrated editor. They go directly to the value of the product: not the valuation of the company in some abstract financial sense, but the lived value of the tool to people who depend on it.

If a model cannot be trusted to distinguish defensive cybersecurity analysis from malicious enablement, its usefulness to security professionals drops. If it cannot admit sourcing limitations, its usefulness to journalists drops. If it apologizes for hallucination and then repeats the hallucination, its usefulness to editors drops. If access to the strongest systems is increasingly mediated by government comfort, corporate curation, and opaque safety policy, its usefulness to independent professionals drops.

That is not a small problem.

It is not simply a “guardrails are annoying” problem.

It is a question of whether the safety architecture has begun to damage the product it was supposed to protect.

Jonathan Brown is a cybersecurity researcher and investigative journalist at bordercybergroup.com.

If you would like to support our work — useful, well-researched, ad-free cybersecurity intelligence — subscribe, comment, or buy us a coffee! Thanks.
