Assessed Intelligence | Your Secure and Responsible Technology Partner

Managing Performance Metrics with Governance

Tokenmaxxing and the Controls It Breaks

Author: Alan McCay

In the previous article, “When Performance Becomes Governance Risk,” I argued that tokenmaxxing conditions humans to behave like the models they operate: optimise for the reward signal, disregard what the metric doesn’t measure. That piece focused on the psychological and ethical dimensions.

This article focuses on the specific governance controls that tokenmaxxing puts at risk. If your organisation is running consumption leaderboards, these are the controls your auditor will test. The question is whether they will still be functioning when that audit happens.

ARISE – M.CM: Change Management

The ARISE Change Management control (M.CM) requires a formal request-for-change process for all production modifications, mandatory pre-deployment testing with rollback verification, and segregation of duties to prevent self-approval. These are P1 requirements, the baseline before an organisation can claim its systems are governed. The control also requires model lifecycle tracking: version control, dataset hashes, and lineage metadata for AI systems.
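The model lifecycle tracking requirement can be made concrete with a small sketch. The record shape and field names below are illustrative assumptions, not part of the ARISE specification; the point is that a dataset hash and a parent-version pointer are cheap to capture at training time and nearly impossible to reconstruct after the fact.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class LineageRecord:
    """Illustrative lineage metadata for one model release (field names are assumptions)."""
    model_version: str             # version tag kept under normal version control
    dataset_path: str
    dataset_sha256: str            # content hash pinning exactly which data was used
    parent_version: Optional[str]  # previous release, so lineage can be walked back

def hash_dataset(path: str) -> str:
    """Stream the file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def to_audit_json(record: LineageRecord) -> str:
    """Serialise the record as the evidence artefact an auditor would request."""
    return json.dumps(asdict(record), sort_keys=True)
```

Producing this record takes seconds per release; an auditor asked to verify lineage without it has nothing to test.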

ForHumanity’s cybersecurity certification criteria reinforce the same principle through specific change management controls requiring documented approval chains and separation between builders and approvers.

Under tokenmaxxing, the RFC queue becomes dead time. Testing cycles are throughput lost. The colleague whose approval you need is chasing their own consumption target. The P1 requirement for segregation of duties, the rule that the person who wrote the code cannot push it to production, becomes the thing standing between the employee and their performance metric.

M.CM also requires post-implementation review for failed or emergency changes. Under consumption pressure, failed changes don’t get retrospectives. They get re-attempted faster.

Separation of Duties and Identity Management

ARISE’s Identity Management control (P.IM) enforces role-based access provisioning, least-privilege enforcement, and explicit segregation of duties between developer, deployer, and auditor roles. At P2, it requires human-AI identity integration: every AI agent must have a unique identifier traceable to a human sponsor, with no unsupervised autonomous privilege escalation.
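A minimal sketch of the P2 requirement, assuming nothing about ARISE’s actual data model: every agent identity carries a human sponsor, and any privilege escalation without a named human approver is rejected outright rather than logged after the fact.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class AgentIdentity:
    """Illustrative identity record: one AI agent, traceable to one accountable human."""
    agent_id: str
    human_sponsor: str
    roles: FrozenSet[str]  # least-privilege role set

def escalate(identity: AgentIdentity, new_role: str,
             approver: Optional[str]) -> AgentIdentity:
    """Return a new identity with the extra role, but only under human supervision."""
    if approver is None:
        # No unsupervised autonomous privilege escalation.
        raise PermissionError(f"escalation of {identity.agent_id} blocked: no human approver")
    if approver == identity.human_sponsor:
        # Segregation of duties: the sponsor cannot approve their own agent's escalation.
        raise PermissionError("sponsor cannot self-approve escalation")
    return AgentIdentity(identity.agent_id, identity.human_sponsor,
                         identity.roles | {new_role})
```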

ForHumanity’s independence and objectivity principles, drawn from the same tradition as financial audit independence, require separation between those who produce and those who evaluate. Their ethics framework requires a standing, empowered ethics committee with the authority to halt deployments.

Tokenmaxxing makes everyone a producer. The reviewer role becomes a bottleneck the incentive structure penalises. The five recognised threats to auditor independence include self-review and intimidation. Tokenmaxxing activates both: the person who generated output has an interest in not flagging problems with it, and the person who raises concerns risks being labelled a low performer.

P.IM’s requirement that AI agent actions are traceable to human sponsors becomes particularly relevant when agentic tools are burning through tokens autonomously. If nobody reviewed what the agent produced because review time was consumption time lost, the traceability exists on paper but the human judgement it was supposed to evidence does not.

Ethics Oversight Committee

ARISE’s Ethics Oversight Committee control (M.OV-01) requires a chartered body with documented mission, authority, and decision rights. Its P1 escalation authority requirement is explicit: the committee must be empowered to halt deployments pending resolution of ethical risk. At P2, it requires cross-functional membership including legal, HR, engineering, and external advisors, with minutes and records of all decisions.

ForHumanity’s ethical choice framework requires organisations to identify decision points where competing values must be adjudicated, and to ensure those decisions go through a deliberative process. Their cognitive bias mitigation requirements specifically mandate active measures to prevent humans from over-relying on AI outputs.

Tokenmaxxing is itself an ethical choice that most organisations have not treated as one. The decision to evaluate employees by consumption volume, rather than by the quality or governance of their output, is a values trade-off. It prioritises throughput over judgement. M.OV-01 exists so that a body with standing can challenge exactly this kind of decision.

The question is direct: did your ethics committee review the decision to implement token consumption leaderboards? If it did, what was the outcome? If it didn’t, the committee is not performing the function the control requires.

Organizational Values

ARISE’s Organizational Values and Governance control (G.OV) requires that organisations define ethical tenets, including accountability, fairness, and transparency, and use them as decision filters. At P2, it requires that when speed conflicts with values, the conflict is escalated and documented. At P3, it requires anonymous feedback mechanisms to detect cultural drift.

Tokenmaxxing is cultural drift operating in plain sight. The organisation’s stated values almost certainly include accountability, quality, and responsible use of technology. The consumption leaderboard measures none of these. G.OV’s requirement to use tenets as decision filters means that every operational decision, including how AI adoption is measured, should be tested against the organisation’s own values.

In neuro-linguistic programming there is a presupposition: the map is not the territory [1]. The leaderboard is a map. The values statement is a description of the territory the organisation claims to occupy. When the two diverge, people follow the map. G.OV exists to close that gap.

Testing and Secure Development

ARISE’s Testing control (V.TE-01) requires separation of builder and tester responsibilities for high-risk releases, risk-based test strategies covering negative tests and reproducibility verification, golden baselines for validating AI outputs, and change governance requiring test evidence for all approvals.
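The golden-baseline idea can be sketched in a few lines. The digest-comparison approach below is one common way to implement it, not V.TE-01’s prescribed mechanism: a release candidate’s outputs on a fixed input set are hashed and compared against a stored known-good digest, which also doubles as a reproducibility check when the same run is executed twice.

```python
import hashlib

def output_digest(outputs: list) -> str:
    """Hash a sequence of model outputs into one comparable digest."""
    digest = hashlib.sha256()
    for item in outputs:
        digest.update(repr(item).encode())
    return digest.hexdigest()

def matches_golden(outputs: list, golden_digest: str) -> bool:
    """Binary pass/fail check against the stored golden baseline."""
    return output_digest(outputs) == golden_digest
```

A failed comparison is the test evidence the change approval requires; skipping the run means the approval has no evidence behind it.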

The Secure Development control (P.SD) requires threat modelling at design, automated scanning in CI/CD pipelines, and release gates with documented sign-off. At P2, it requires provenance and integrity controls: signed builds, reproducible pipelines, and a model bill of materials attached to every release.
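A signed release manifest can be as small as the sketch below. The manifest fields and the HMAC scheme are illustrative assumptions standing in for whatever signing infrastructure the organisation actually runs; the point is that verification is mechanical, so skipping it saves almost no time.

```python
import hashlib
import hmac
import json

def sign_manifest(manifest: dict, key: bytes) -> str:
    """Sign a canonicalised release manifest (model BOM, test evidence refs, sign-off)."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_manifest(manifest: dict, key: bytes, signature: str) -> bool:
    """Constant-time check that the manifest was not altered after sign-off."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)
```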

ForHumanity’s audit criteria include binary pass/fail criteria for AI systems and require evidence of validation under production-like conditions.

Under consumption pressure, testing becomes the most expensive governance activity in the organisation. Every hour spent writing tests, running validation, or conducting security assessments is an hour that produced no visible output on the leaderboard. The team that insists on V.TE-01’s validation cycles before release consumes fewer tokens than the team that pushes straight to production. In cybersecurity terms, the result is technical debt. The difference here is that tokenmaxxing doesn’t just accumulate technical debt through neglect; it rewards its accumulation.

Continuous Monitoring and the Evidence Gap

ARISE’s Continuous Monitoring control (D.CM) requires telemetry collection across all systems, defined alerting thresholds with escalation criteria, and governance reporting of monitoring outcomes to committees. At P2, it requires baseline and drift detection for AI model performance, and downstream impact monitoring to catch real-world harms that internal metrics miss.
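Baseline-and-drift detection at its simplest is a z-score check against a stored baseline window. The three-standard-deviation threshold below is an arbitrary illustrative choice; what D.CM requires is that thresholds be defined and that alerts escalate, not this particular statistic.

```python
from statistics import mean, stdev

def drift_alert(baseline: list, current: list, z_threshold: float = 3.0) -> bool:
    """Alert when the current window's mean drifts beyond z_threshold
    baseline standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return mean(current) != mu  # any movement off a flat baseline counts
    return abs(mean(current) - mu) / sigma > z_threshold
```

The alert itself is the easy half; the control is only operating if a True result reaches a governance committee with authority to act on it.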

ForHumanity’s documentation and logging controls require the evidence trail an auditor needs to verify that oversight was operating during a given period.

This is where the cumulative effect of tokenmaxxing becomes visible. If change management gates were skipped (M.CM), if peer review was deprioritised (P.IM), if testing was cut short (V.TE-01), and if the ethics committee wasn’t consulted (M.OV-01), then the monitoring layer is the last place where the absence of governance might be detected. But D.CM requires that monitoring outcomes reach governance committees, not just technical dashboards. If the governance reporting was itself deprioritised because it consumed time rather than tokens, the evidence gap is complete.

The auditor will not ask whether these controls exist in the policy library. They will ask for evidence that each one was operating during the period under review. If tokenmaxxing has eroded the willingness to follow them, that evidence will be thin, inconsistent, or absent.

The Operational Question

ARISE was designed to close the gap between having governance documentation and having operational governance. Each control carries structured requirements across three priority tiers: P1 establishes the baseline, P2 builds the day-to-day processes, P3 adds maturity mechanisms like drift detection and public transparency reporting. ForHumanity’s independent audit criteria, which map to ARISE across the GOVERN, MANAGE, IDENTIFY, PROTECT, DETECT, and VALIDATE pillars, provide the external verification layer. Assessed Intelligence certified auditors test whether controls are operating, not whether they exist.

An organisation that has implemented these controls and can demonstrate their operation has a defence against the risks tokenmaxxing creates. An organisation that has the controls on paper but has allowed consumption pressure to hollow out their execution does not.

If you are preparing for any AI governance audit, whether against ISO 42001, the EU AI Act, ForHumanity’s criteria, or SOC 2, trace the evidence chain for a single AI-generated output from the past quarter. Can you show who requested the change? Who approved it? Who tested it? Who reviewed it for bias, accuracy, and downstream impact? Was the reviewer independent of the producer?

If the answer to any of those questions is no, the leaderboard did not cause the governance failure, but it did create the conditions under which the failure became rational.
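That evidence trace can be run mechanically. The field names below are hypothetical, but the logic mirrors the questions above: each returned string is a question the change record cannot answer, including the independence check between producer and reviewer.

```python
def evidence_gaps(record: dict) -> list:
    """Return the unanswered audit questions for one AI-generated change record."""
    required = ("requested_by", "approved_by", "tested_by", "reviewed_by")
    gaps = [field for field in required if not record.get(field)]
    # Independence check: the reviewer must not be the producer.
    reviewer = record.get("reviewed_by")
    if reviewer and reviewer == record.get("requested_by"):
        gaps.append("independent_review")
    return gaps
```

An empty list is the defensible position; anything else is the evidence gap the auditor will document.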

Measure what matters. Govern what you measure.

References

[1] Hoag, J.D. NLP Presupposition: The Map is Not the Territory. NLP Life Skills. http://nlpls.com/articles/mapTerritory.php

[2] Roose, K. (2026, March 20). Tokenmaxxing: The New AI Arms Race Inside Companies. The New York Times. https://www.nytimes.com/2026/03/20/technology/tokenmaxxing-ai-agents.html

[3] Pearl, M. (2026, March 22). Tech Employees Are Reportedly Being Evaluated by How Fast They Burn Through LLM Tokens. Gizmodo. https://gizmodo.com/tech-employees-are-reportedly-being-evaluated-by-how-fast-they-burn-through-llm-tokens-2000736627


Assessed Intelligence delivers vCISO and vCRAIO leadership, ARISE Framework™ implementation, and continuous assurance through the OPERATE retainer. If your organization is deploying agentic AI and needs governance that operates at the speed of your systems, speak with an advisor.