Artificial Intelligence (AI) tools are increasingly used to produce written deliverables across a wide variety of domains. In some cases, the recipient wants to ensure that the content was written by a human rather than an AI tool, e.g., that assignments were completed by students or that product reviews were written by actual customers. This creates demand for AI detection tools that minimize two key statistics: the False Negative Rate (FNR), the proportion of AI-generated text that is falsely classified as human, and the False Positive Rate (FPR), the proportion of human-written text that is falsely classified as AI-generated. We evaluate four commercial and open-source AI-text detectors (Pangram, OriginalityAI, GPTZero, and RoBERTa) on these dimensions using a large corpus of human and AI-generated text that spans topics, lengths, and AI models. First, we find that detectors vary in their capacity to minimize FNR and FPR, with the commercial detectors outperforming open-source. Second, most commercial AI detectors perform remarkably well, with Pangram in particular achieving near-zero FPR and FNR within our set of stimuli; these results are stable across AI models. Third, while Pangram’s performance largely holds up on very short passages (< 50 words) and is robust to “humanizer” tools (e.g., StealthGPT), the performance of other detectors becomes case-dependent. Finally, we consider the implementation of detectors as policy, noting that a policy designer faces a trade-off between maximizing the probability of detecting true AI-generated text and minimizing the risk of false accusations. Given this trade-off, we propose an evaluation metric that uses policy caps (a scale-free, detector-independent measure that corresponds to the designer’s tolerance for false positives or negatives) to compare detectors. Using this metric, we show that Pangram is the only detector that meets a stringent policy cap (FPR ≤ 0.005) without compromising the ability to accurately detect AI text.
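
To make the FPR, FNR, and policy-cap ideas concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a detector emits a score in [0, 1] where higher means "more likely AI", and that text is flagged as AI when the score meets a threshold. The function names (`error_rates`, `best_threshold_under_cap`), the labeling convention (1 = AI-generated, 0 = human), and the toy data are all hypothetical; the abstract does not specify how the authors compute their metric.

```python
from typing import List, Optional, Tuple


def error_rates(scores: List[float], labels: List[int],
                threshold: float) -> Tuple[float, float]:
    """Return (FPR, FNR) when text is flagged as AI if score >= threshold.

    Assumed convention: label 1 = AI-generated, label 0 = human-written.
    Both classes must be present, otherwise a rate is undefined.
    """
    n_human = sum(1 for y in labels if y == 0)
    n_ai = sum(1 for y in labels if y == 1)
    # False positive: human text flagged as AI.
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    # False negative: AI text that escapes the flag.
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return fp / n_human, fn / n_ai


def best_threshold_under_cap(scores: List[float], labels: List[int],
                             fpr_cap: float
                             ) -> Optional[Tuple[float, float, float]]:
    """Among thresholds whose FPR stays within the cap, pick the lowest FNR.

    Returns (threshold, FPR, FNR). The infinite threshold (flag nothing)
    always satisfies the cap, so a result is always found on valid input.
    """
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        fpr, fnr = error_rates(scores, labels, t)
        if fpr <= fpr_cap and (best is None or fnr < best[2]):
            best = (t, fpr, fnr)
    return best


# Toy data (hypothetical): four human texts, then four AI texts.
toy_scores = [0.02, 0.10, 0.30, 0.70, 0.60, 0.90, 0.95, 0.99]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
strict = best_threshold_under_cap(toy_scores, toy_labels, fpr_cap=0.0)
lenient = best_threshold_under_cap(toy_scores, toy_labels, fpr_cap=0.25)
```

On this toy data, the strict cap forces a higher threshold and lets one AI text slip through, while the lenient cap catches every AI text at the cost of flagging one human text. This mirrors the trade-off a policy designer faces, and comparing detectors by their FNR at a fixed FPR cap is one plausible reading of the cap-based evaluation described above.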
