Artificial Intelligence (AI) tools are increasingly used to produce written deliverables in a wide variety of domains. In some cases, the recipient wants to ensure that the content was written by a human rather than an AI tool, e.g., that assignments were completed by students or that product reviews were written by actual customers. This creates demand for AI detection tools that minimize two key statistics: the False Negative Rate (FNR), the proportion of AI-generated text that is falsely classified as human, and the False Positive Rate (FPR), the proportion of human-written text that is falsely classified as AI-generated. We evaluate four commercial and open-source AI-text detectors (Pangram, OriginalityAI, GPTZero, and RoBERTa) on these dimensions using a large corpus of human and AI-generated text that spans topics, lengths, and AI models. First, we find that detectors vary in their capacity to minimize FNR and FPR, with the commercial detectors outperforming open-source. Second, most commercial AI detectors perform remarkably well, with Pangram in particular achieving near-zero FPR and FNR within our set of stimuli; these results are stable across AI models. Third, while Pangram’s performance largely holds up on very short passages (< 50 words) and is robust to “humanizer” tools (e.g., StealthGPT), the performance of the other detectors becomes case-dependent. Finally, we consider the implementation of detectors as policy, noting that a policy designer faces a trade-off between maximizing the probability of detecting true AI-generated text and minimizing the risk of false accusations. Given this trade-off, we propose an evaluation metric that uses policy caps, a scale-free, detector-independent measure that corresponds to the designer’s tolerance for false positives or negatives, to compare detectors. Using this metric, we show that Pangram is the only detector that meets a stringent policy cap (FPR ≤ 0.005) without compromising the ability to accurately detect AI text.
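
As an illustration of the two error rates and of a cap on false positives, the following is a minimal sketch and not the paper's implementation. It assumes a detector returns a numeric score where higher means "more likely AI" and that a text is flagged when its score reaches a threshold. The function names, the toy scores, and the selection rule (lowest FNR among thresholds whose FPR stays within the cap) are all hypothetical choices made for this example.

```python
from typing import Optional, Sequence, Tuple


def error_rates(scores: Sequence[float], labels: Sequence[int],
                threshold: float) -> Tuple[float, float]:
    """Return (FPR, FNR) when texts with score >= threshold are flagged as AI.

    labels: 1 = AI-generated, 0 = human-written (assumed encoding).
    Both classes must be present, otherwise a rate is undefined.
    """
    n_human = sum(1 for y in labels if y == 0)
    n_ai = sum(1 for y in labels if y == 1)
    # Human text flagged as AI.
    false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    # AI text that slips through as human.
    false_neg = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return false_pos / n_human, false_neg / n_ai


def best_threshold_under_cap(scores: Sequence[float], labels: Sequence[int],
                             fpr_cap: float) -> Optional[Tuple[float, float, float]]:
    """Among thresholds with FPR <= fpr_cap, pick the one with the lowest FNR.

    Returns (threshold, FPR, FNR). This selection rule is an assumption
    made for illustration, not the metric defined in the paper.
    """
    best = None
    # Every observed score is a candidate; +inf means "never flag anything".
    for t in sorted(set(scores)) + [float("inf")]:
        fpr, fnr = error_rates(scores, labels, t)
        if fpr <= fpr_cap and (best is None or fnr < best[2]):
            best = (t, fpr, fnr)
    return best


# Toy data (made up): four human texts, then four AI texts.
toy_scores = [0.02, 0.10, 0.30, 0.80, 0.70, 0.85, 0.95, 0.99]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(error_rates(toy_scores, toy_labels, threshold=0.5))
print(best_threshold_under_cap(toy_scores, toy_labels, fpr_cap=0.0))
```

In this toy setup, tightening the cap forces a higher threshold, which keeps human text from being flagged at the cost of missing some AI text. Comparing detectors at the same cap, rather than at each detector's own default threshold, keeps the comparison independent of how each detector scales its scores.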
