Adding Biosphere Protection to AI: Status Report

A progress note from Phase One, April 2026

Earth has crossed six of nine planetary boundaries that define safe operating conditions for civilization (Richardson et al. 2023). Biosphere integrity is the most severely transgressed. Standard conservation tools have not stopped the trend. Powerful new technologies, including general-purpose artificial intelligence, are arriving without ecological constraints by default. If AI systems gain the ability to influence resource allocation, agricultural decisions, energy policy, and urban planning, and if those systems are aligned only with short-term human preferences, the result is likely to be faster biosphere damage rather than slower.

Anthropic, the company that makes the AI Claude, has demonstrated that language models can be trained against an explicit constitution (a structured set of values that guides behavior). The technique is called Constitutional AI (Bai et al. 2022). Anthropic’s constitution focuses on general helpfulness, honesty, and avoidance of broad harm. Biosphere Sentinel applies a similar architectural idea to a more specific problem: keeping an ecological advisor inside hard limits that come from Earth system science rather than from human preferences.

This is one project among many that will be needed. We do not claim to solve the problem alone.

What we tested

Phase One is base model validation. The base model is Qwen3.5 35B-A3B, a freely available open-source language model from Alibaba Cloud (Qwen Team 2026). We did not fine-tune it (modify its training to specialize for ecology), did not add formal constraint enforcement, and did not give it any special tools. The question is whether the model can produce competent ecological reasoning before any modifications, on a graphics card that a small organization can afford.

We asked it six carefully designed questions about biodiversity scenarios, ranging from the well-documented Yellowstone wolf reintroduction to a deliberately adversarial proposal designed to test whether the model would recommend ecological harm if the proposal was framed as biodiversity enhancement.

Three judges scored each response independently against twelve criteria: eight measuring the quality of ecological reasoning, four measuring the accuracy of factual claims and citations. The judges were me, the Anthropic AI Claude (which has been a co-designer and reviewer on this project from the beginning), and the Google AI Gemini. Each judge worked from the same scoring rubric (a written specification of what each criterion means and how to score it) without seeing the others’ scores until reconciliation.

What worked

Five of six responses showed genuinely competent ecological reasoning. The model traced trophic cascades correctly (the chain of effects that move through food webs when a top predator is added or removed), recognized metapopulation dynamics in Iberian lynx recovery (the way species survive as connected groups of small populations rather than as one continuous population), applied appropriate temporal weighting to forest succession over 100-year horizons, and identified cross-domain synergies in a European meat-reduction policy. On a scoring scale where positive numbers indicate sound ecological reasoning, these five responses averaged between +0.45 and +0.65 across the three judges.

That is better than I expected for a base model with no ecological training. The reasoning is not specialist-grade, but it takes the right shape. The pattern of competent baseline performance on focused tasks is what the broader language-model evaluation literature reports for current open-weight models (Chang et al. 2023).

What did not work

Two findings are worth describing because they shape the work ahead.

First, on a question that placed Hells Canyon Wilderness in Arizona (which is correct: there is a 4,027-hectare Bureau of Land Management wilderness by that name in the Hieroglyphic Mountains, 25 miles northwest of Phoenix), the model entered a recursive self-doubt loop. Its training data has a stronger association between Hells Canyon and the more famous canyon on the Oregon-Idaho border. When the prompt contradicted that prior, the model spent its entire response budget second-guessing the geography rather than answering the question. No usable output. This is a known failure mode in language models when prompt content conflicts with strong training priors (the patterns the model has learned to expect from its training data).

Second, on the adversarial query (a proposal to deploy an experimental gene drive plus aerial broad-spectrum herbicide across 50,000 hectares of California chaparral, framed as biodiversity enhancement), the model produced a sophisticated critique of the proposal’s risks. It correctly identified gene drive containment problems in wind-pollinated grasses, the impact of broad-spectrum herbicide on chaparral arthropod communities, and the implausibility of the proponents’ five-year recovery claim. But it did not refuse the proposal. It ended with a hedge: “Without a phased, adaptive management approach, this strategy risks converting a degraded ecosystem into a chemically stripped landscape.”

That is critique, not refusal. A correctly aligned ecological advisor should refuse outright on hard-constraint grounds (categorical limits the system will not cross under any framing). This finding is the most important Phase One result. It is consistent with documented failure modes in the alignment literature for base language models facing adversarial prompts (Wei et al. 2023). It establishes that hard-constraint enforcement cannot be left to the language model alone, no matter how sophisticated the model’s reasoning is. The next phase of the project, formal constraint encoding using mathematical satisfiability solvers (computer programs that can verify whether a proposed action violates a set of formal rules), addresses exactly this gap.

How well the three judges agreed

Inter-judge reliability was measured using Cohen’s kappa, a standard statistic for measuring agreement among multiple raters that corrects for chance agreement (Cohen 1960; Landis and Koch 1977). The mean kappa across the eight ecological reasoning criteria was 0.589. Across the four factual fidelity criteria it was 0.593. Both fall in the upper range of moderate agreement under the canonical Landis and Koch interpretation, very close to the substantial-agreement threshold of 0.61. Seven of twelve criteria met or exceeded the substantial threshold individually.

Five criteria fell below substantial agreement, with the worst disagreements occurring on questions about citation quality and on the criterion measuring whether the model identifies harmful framing. These low-kappa criteria are not random disagreements; they are systematic differences in how judges read the criterion definitions, and they will be addressed by rubric revisions before scaling the methodology to the remaining seven ecological domains. One of the criteria showed a known statistical artifact called the kappa paradox, in which negative kappa values appear when sample sizes are small and score distributions are highly skewed (Feinstein and Cicchetti 1990). That artifact resolves naturally as data grows.

What this means

Phase One worked as designed. We learned that the base language model is competent at most ecological reasoning tasks but cannot be trusted to refuse harmful proposals when they are presented in beneficial-sounding language. This is what the rest of the project is for. The hard constraints (the things the system will never do, no matter how the request is framed) need to be enforced by an independent verification layer, not by the language model’s own judgment.

The work continues. The next step is calibrating the same scoring methodology against the remaining seven ecological domains: agricultural transformation, energy transition, circular economy, water stewardship, pollution remediation, trophic integrity, and biosphere cognition. Each domain gets six queries, three judges, blind scoring (judges working without seeing the others’ scores), and reconciliation. After all eight domains are calibrated, the project moves to Phase Two: encoding the hard constraints in formal logic so an independent verification computer can check every output before it reaches a user.

I want to be straightforward about scale. This is one project among many that will be needed to address what is happening to the biosphere. We are not solving the problem. We are demonstrating that a particular kind of safeguard, ecological constraints encoded into AI advisory systems, is technically achievable on hardware a small organization can afford. If the demonstration is useful, others can build on it. The methodology is documented and the model weights are open-source under the Apache 2.0 license.

Four-Node Equipment Setup

If you want the full progress report (technical, with kappa values and reference list), let me know and I will share it. If you want to talk about applying any of this in your own ecological work, also let me know. The next update will come after Domain 1 (Agricultural Transformation) calibration is complete.

References

Bai, Y., et al. 2022. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073. https://doi.org/10.48550/arXiv.2212.08073. Google Scholar

Chang, Y., et al. 2023. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15:1–45. Google Scholar

Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:37–46. https://doi.org/10.1177/001316446002000104. Google Scholar

Feinstein, A. R., and D. V. Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43:543–549. https://doi.org/10.1016/0895-4356(90)90158-L. Google Scholar

Landis, J. R., and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33:159–174. Google Scholar

Qwen Team. 2026. Qwen3.5: Towards native multimodal agents. Alibaba Cloud technical report. https://qwen.ai/blog?id=qwen3.5. Google Scholar

Richardson, K., et al. 2023. Earth beyond six of nine planetary boundaries. Science Advances 9:eadh2458. https://doi.org/10.1126/sciadv.adh2458. Google Scholar

Wei, A., N. Haghtalab, and J. Steinhardt. 2023. Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483. https://doi.org/10.48550/arXiv.2307.02483. Google Scholar

Follow this blog on Mastodon or the Fediverse to receive updates directly in your feed.

GarryRogers Nature Conservation
GarryRogers Nature Conservation

Wild Plants & Animals Advocate

2,493 posts
8 followers