The Trials and Tribulations of an AI Arms Race

By Justice Swan

Frontier AI labs are sidestepping model safety testing and giving themselves an excuse to develop and release ever more powerful AI models.

The release of o3 brought OpenAI praise for shipping such a ‘smart’ model, one apparently capable of generating new avenues of scientific research. According to insiders who spoke with the Financial Times, however, both of the company’s recent releases, GPT-4.1 and o3, received strikingly little safety evaluation. According to eight individuals familiar with OpenAI’s testing processes, the pace accelerated from the reported six-month testing window provided for GPT-4 to less than a week for some of the o3 testers. The accelerated timeline is raising concerns among those involved in the testing process. A current tester of o3 described the arms race dynamic of the situation: “But because there is more demand for it, they want it out faster. I hope it is not a catastrophic misstep, but it is reckless. This is a recipe for disaster.” As for GPT-4.1, OpenAI said it was not “cutting edge” and therefore did not warrant the typical safety card, a voluntary commitment in its own right.

The insiders’ statements are further corroborated by Metr, the organization that partners with OpenAI on safety evaluations. In a blog post (referenced by TechCrunch), Metr said its assessments of o3 and o4-mini were conducted in a “relatively short time” compared to previous benchmarks. Its safety analysis notes that “The work reported here was conducted over only three weeks. More thorough evaluations are likely to reveal additional capabilities or risk-relevant observations.”

Specific testing methodologies have also drawn scrutiny. Steven Adler, a former OpenAI safety researcher, highlighted concerns that OpenAI may not be fully following through on its commitment to probe its most advanced models for dangerous misuse potential (such as facilitating bioweapon development) by building customized, fine-tuned versions of them. In addition, a former technical staff member told the Financial Times that testing is often performed on earlier versions of a model rather than the final released model, calling it “bad practice”. Metr confirms this, stating that it received access to o3 and o4-mini “three weeks prior to model release on an earlier checkpoint (version) of the model”.

OpenAI contends the checkpoints are “basically identical” to the final product.

This situation isn’t limited to OpenAI, however. Meta released a vague safety card for Llama 4, Gemini 2.5 Pro’s safety card was published weeks after the model was made public, and xAI still hasn’t released a safety card for Grok 3. The safety evaluations these companies previously released were produced strictly on a voluntary basis.

Google’s and Meta’s opposition to California’s SB 1047, which would have required many AI developers to audit and publish safety evaluations of the models they make public, shows where their priorities lie. These developments occur against a backdrop of absent global standards for AI safety testing. While forthcoming regulations like the EU’s AI Act will soon mandate such evaluations for powerful models, in the US the evaluations are voluntary and will most likely remain so.

Law Update: Not only will it remain voluntary, but states would be actively barred from mandating this safety testing. The “Big Beautiful Bill” originally provided that “no State or political subdivision thereof may enforce any law or regulation regulating artificial intelligence models, artificial intelligence systems, or automated decision systems during the 10 year period beginning on the date of the enactment of this Act”; that moratorium has since been cut to five years. That is the right direction, but given the pace at which AI is developing, and the very broad language used in the bill, five years is five years too many.

On top of that, OpenAI has rewritten its internal policies to allow for the release of “high risk” models if it has taken appropriate steps to ‘reduce those dangers’. Whatever that means. The company also said it would consider releasing a model rated “critical risk” if a rival AI lab had released a similar model. Previously, OpenAI said it wouldn’t release any AI model that presented more than a “medium risk.” It also removed from the guidelines the requirement to assess a model’s ability to persuade and manipulate people. Do we already know they’re superhuman? Why hide it? Oh yeah, maybe you don’t want us to know that the technology you are building, and no doubt using, is superhumanly persuasive.

This entire situation brings to light one of the dangers of an arms race dynamic: as competitive pressure in the AI landscape pushes companies toward rapid development and deployment of models, the robust, time-intensive safety protocols these increasingly sophisticated systems need are being pushed to the sidelines. Safety or profits?

References

1: Financial Times News, OpenAI slashes AI model safety testing time

2: TechCrunch, OpenAI partner says it had relatively little time to test the company’s newest AI models

3: ZDNet, OpenAI used to test its AI models for months – now it’s days. Why that matters

4: METR, Preliminary evaluation of o3 and o4-mini

5: Fortune, Google released safety risks report of Gemini 2.5 Pro weeks after its release

6: Axios, https://www.axios.com/pro/tech-policy/2025/06/30/blackburn-cruz-reach-ai-pause-deal