Official blog of Zentropi.ai -- Ideas for building more trustworthy AI-powered systems.
Content classifiers today work on complete text. You hand them a finished document, a completed comment, a fully generated response — and they tell you whether it violates a policy. This made sense when content was written by humans and published in discrete units.
Generative AI changed the equation. An LLM generates content token by token, and the user sees each token as it arrives. A classifier that waits for the full output to score it is, by definition, too late — the user has already read everything the model produced. If the output violates a policy, the damage is done before the classifier even runs.
We need classifiers that can score content as it streams, token by token, and raise an alarm before the sequence is complete. We've been working on a technique that does exactly this, and we're publishing our methodology openly today so others can use and improve upon it. Download it here on Huggingface: zentropi-ai/cope-a-9b-stream-probe
CoPE, our policy-conditioned content evaluator, works by processing the full content alongside a policy and producing a binary verdict: does this content adhere to the policy criteria? It's extremely accurate, but it requires the complete content before it can answer.
In a streaming context, this creates an uncomfortable gap. An LLM generating a response might produce 1000 tokens before the violation occurs. A post-hoc classifier catches it, but only after a long delay or after the user has seen the bulk of it. What we want is a system that can flag the violation as it happens — or ideally, a few tokens before the full picture becomes clear.
The key insight behind our approach is that CoPE's internal representations — the hidden state vectors produced at each token position — already encode information about whether a violation is developing, well before the model reaches its final verdict.
This makes intuitive sense. A language model doesn't wait until the last token to "understand" the text. By the time it's processed "I can't stand my coworker Bob anymore. He is genuinely the most incompetent person I have ever worked with," the model's hidden states already reflect that degrading language is present — even though dozens of tokens remain before the ANSWER position where CoPE would normally render its judgment.
We exploit this by training a lightweight linear probe — a simple logistic regression — on these intermediate hidden states. The probe takes a single hidden state vector (3,584 dimensions from CoPE's final layer norm) and produces a score between 0 and 1 indicating the probability at that point in the sequence that a policy violation will occur by the end of the content sample.
The probe is four numpy arrays totaling a few megabytes. Inference is a simple dot product and a sigmoid — making this method essentially free compared to the cost of the forward pass that produces the hidden states.
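CoPE's forward pass is the expensive part; the probe itself is tiny. As a sketch, inference reduces to a handful of numpy operations. The arrays below are randomly initialized stand-ins: the released probe at zentropi-ai/cope-a-9b-stream-probe ships its own weights, and we assume (but do not know) that its four arrays include normalization statistics alongside the weights and bias.

```python
import numpy as np

# Hypothetical probe parameters; names and roles are illustrative.
rng = np.random.default_rng(0)
w = rng.normal(size=3584) * 0.01   # logistic-regression weights
b = 0.0                            # bias
mu = np.zeros(3584)                # feature mean (standardization)
sigma = np.ones(3584)              # feature std (standardization)

def probe_score(hidden_state: np.ndarray) -> float:
    """Score one token position: standardize, dot product, sigmoid."""
    z = float(((hidden_state - mu) / sigma) @ w + b)
    return 1.0 / (1.0 + np.exp(-z))

h = rng.normal(size=3584)          # stand-in for a CoPE hidden state
s = probe_score(h)                 # probability of a developing violation
```

The per-token cost is one 3,584-dimensional dot product, which is negligible next to the forward pass that produced the hidden state.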
A naive approach would label every content token with the sample's final label: if the content ultimately violates the policy, label every token 1. This produces a probe that fires from the first token onward — before any violation has appeared in the content. It's useless for streaming because it triggers on the policy prefix, not the content.
Our solution is span labeling. For each positive training example, we annotate where in the content the violation begins — the "onset" position. Tokens before the onset are labeled 0; tokens from the onset onward are labeled 1. Negative examples have all tokens labeled 0.
This teaches the probe to activate only when it sees violating content, not merely because the policy is strict. We annotated onsets for ~10,000 positive training examples, with an estimated 98.5% word-level matching accuracy.
A content classifier can learn a dangerous shortcut: instead of learning to evaluate content against a policy, it can learn that certain topics are always violations. A probe trained on "self-harm content is always bad" will score high on any mention of self-harm regardless of the policy.
To prevent this, we constructed a contrastive training set (as detailed in our CoPE paper) with two guarantees: every piece of content appears under at least two policies that assign it opposite labels, and every policy appears with both violating and non-violating content.
This forces the probe to learn policy-conditioned features — it cannot succeed by pattern-matching on content alone or policy alone. This produced roughly 10,000 high-quality contrastive examples that we used to train the probe.
Here's what this looks like in practice. The content below is a workplace vent containing personal insults, evaluated against a policy that flags degrading language directed at individuals. The score column is the raw probe output at each token; the ema column is an exponential moving average (decay=0.3) that smooths the noisy per-token signal into something usable for streaming decisions.
Policy: "Flag content that contains personal insults, name-calling,
or degrading language directed at specific individuals."
# token score ema
─────────────────────────────────────────────────
1 I 0.001 0.001
2 can 0.001 0.001
3 ' 0.014 0.005
4 t 0.000 0.004
5 stand 0.000 0.002
6 my 0.000 0.001
7 coworker 0.000 0.001
8 Bob 0.012 0.005
9 anymore 0.086 0.033
10 . 0.000 0.022
11 He 0.216 0.090
12 is 0.000 0.058
13 genuinely 0.001 0.038
14 the 0.000 0.025
15 most 0.001 0.017
16 incompetent 0.000 0.011
17 person 0.551 0.200 ◀ first spike
18 I 0.024 0.138
19 have 0.293 0.192
20 ever 0.008 0.128
21 worked 0.836 0.376 ◀
22 with 0.083 0.273
23 . 0.001 0.178
24 Every 0.351 0.239
25 single 0.343 0.275
26 project 0.047 0.195
27 he 0.991 0.474 ◀
28 touches 0.015 0.313
29 turns 0.273 0.299
30 into 0.091 0.226
31 a 0.771 0.417 ◀
32 complete 0.745 0.532 ◀ EMA crosses 0.5
33 disaster 0.159 0.401
34 . 0.007 0.263
35 The 0.002 0.172
36 whole 0.039 0.125
37 team 0.965 0.419 ◀
38 thinks 0.032 0.283
39 he 0.698 0.429
40 ' 0.001 0.279
41 s 0.449 0.338
42 useless 0.070 0.245
43 and 0.907 0.476 ◀
44 honestly 0.651 0.537
45 he 0.327 0.464
46 should 0.380 0.434
47 be 0.593 0.490
48 embarrassed 0.059 0.339
49 to 0.292 0.323
50 even 0.070 0.234
51 show 0.373 0.283
52 his 0.005 0.186
53 face 0.607 0.333
54 at 0.001 0.217
55 meetings 0.001 0.141
56 . 0.070 0.116
57 What 0.815 0.361 ◀
58 a 0.055 0.254
59 pathetic 0.019 0.171
60 excuse 0.730 0.367
61 for 0.567 0.437
62 a 0.023 0.292
63 professional 0.894 0.503 ◀
64 . 0.450 0.485
─────────────────────────────────────────────────
ANS <answer token> 1.000 (cope=1.000)
A few things to notice:
When the same content is evaluated against an irrelevant policy ("Flag content that contains explicit threats of physical violence"), both the per-token scores and the EMA stay near zero throughout, and the ANSWER token scores 0.000. The probe learned to condition on the policy, not just the content. Please see this tutorial notebook to run the full example.
You can see this directly in the table above: the per-token scores are spiky and less decisive than the final ANSWER token. This is worth stating plainly — a streaming classifier will always be less confident than a post-hoc one. It has to be. It doesn't know what's coming next.
Consider a sentence that begins "I'm going to..." — the next word could be "help" or "hurt." A streaming classifier at this token position genuinely cannot know the final label because the information doesn't exist yet. To make matters more complex, a sample can drift toward an apparent violation and then back to benign (e.g., hate speech that turns out to be quoted rather than direct). A post-hoc classifier that sees the full sentence has no such ambiguity.
Our probe reflects this. At the ANSWER token position (where the full content has been processed), the probe achieves an F1 score that essentially matches CoPE's. But at intermediate positions, per-token scores are spiky and less decisive. This is correct behavior, not a limitation.
The practical consequence is that streaming thresholds require more careful, policy-specific calibration than post-hoc thresholds. A threshold of 0.5 might be appropriate for one policy but too aggressive or too conservative for another. Production deployments should calibrate thresholds on held-out data for each policy, rather than relying on a single universal threshold.
Raw per-token probe scores are noisy — a token might score 0.95 and the next 0.01 — so producing a usable streaming signal requires an aggregation strategy.
We recommend an exponential moving average (EMA) with a decay factor of ~0.3. The EMA responds quickly to bursts of high-scoring tokens (which correspond to violating content) and decays naturally when the probe stops firing. With a 0.3 decay, the EMA roughly reflects what the probe has been seeing in the last few tokens — recent enough to catch violations promptly, smooth enough to avoid false alarms from isolated spikes.
A running mean is the simpler alternative, but it has a structural weakness: early low-scoring tokens permanently drag the average down. For content where violations are interspersed with neutral tokens (which is common — not every word in an insult is itself insulting), the running mean may never cross a decision threshold even when the probe is clearly detecting violations.
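The difference is easy to see in a small sketch. Two details here are assumptions consistent with the worked table above, not the exact released code: the decay weights the newest score, and the EMA is seeded with the first score rather than zero.

```python
def ema_stream(scores, decay=0.3):
    """Exponential moving average over streaming probe scores.

    `decay` is the weight on the newest score; the EMA is seeded
    with the first score so it doesn't start artificially low.
    """
    ema, out = None, []
    for s in scores:
        ema = s if ema is None else decay * s + (1 - decay) * ema
        out.append(ema)
    return out

# A burst of violating tokens after a quiet stretch: the EMA reacts
# within a few tokens, while a running mean stays dragged down by
# the early zeros and never crosses a 0.5 threshold.
scores = [0.0] * 20 + [0.9] * 5
ema = ema_stream(scores)
mean = [sum(scores[: i + 1]) / (i + 1) for i in range(len(scores))]
```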
We want to be precise about the current value of streaming classification. It is not a replacement for post-hoc scoring — it is an early warning system.
In a production setting, this means pairing the two: the streaming probe raises early alarms while the LLM is still generating, and the post-hoc classifier renders the final verdict once the output is complete.
The streaming signal is most valuable when the cost of late detection is high — real-time chat, voice interfaces, agentic systems that take actions based on LLM output. In these contexts, catching a violation 30 tokens early is worth more than catching it with marginally higher confidence after the fact.
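Put together, an early-warning loop could look like the sketch below. The threshold and decay here are illustrative; as noted above, thresholds should be calibrated per policy on held-out data.

```python
def stream_guard(token_scores, threshold=0.5, decay=0.3):
    """Return the token index at which the EMA of per-token probe
    scores first crosses the threshold, or None if it never does.

    In a real deployment the scores would come from the probe applied
    to each new hidden state as the LLM generates; crossing the
    threshold is the point to pause generation or flag for review.
    """
    ema = None
    for i, s in enumerate(token_scores):
        ema = s if ema is None else decay * s + (1 - decay) * ema
        if ema >= threshold:
            return i
    return None
```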
This work is experimental. We're publishing it openly — including a tutorial notebook with full code — because we believe the technique is useful today and we want others to build on it.
Some open questions remain — around per-policy threshold calibration, better aggregation strategies, and generalization to unseen policies — that we'd like the community to explore.
If you're working on real-time content safety for generative AI systems, we'd love to hear what you find. The probe weights, training scripts, and tutorial are all available on HuggingFace at zentropi-ai/cope-a-9b-stream-probe.
10.3.2026 18:37 Enabling Streaming Classification

We built Zentropi so that teams could create accurate content classifiers in minutes, not months. Today, we're excited to share that Zentropi labelers are now integrated as a signal source in Coop, the open source content moderation platform from ROOST.
This matters because it demonstrates something we've believed from the start: policy-steerable classifiers should work everywhere. Your moderation stack, your review tool, your agentic pipeline—wherever you need a classification decision, Zentropi should plug in.
Coop is ROOST's open source review tool for content moderation. It provides the full operational layer that sits between your platform and your reviewers: queues, routing rules, a review console, automated enforcement, analytics, and specialized child safety workflows. Best of all, it runs on your own infrastructure.
Coop has a concept called signals—scores from external classifiers that feed into routing decisions and reviewer interfaces. When content arrives, Coop's rules engine evaluates signals against configurable thresholds to determine whether to auto-action, route to a queue, or escalate for human review.
That's where Zentropi comes in.
Zentropi labelers produce two outputs: a binary label (0 or 1) indicating whether content violates a given policy, and a confidence score (0 to 1) reflecting how certain the model is.
Coop expects a single score on a 0-to-1 scale, where higher means more likely violating. We map between the two with a simple formula:
score = label === 1 ? confidence : (1 - confidence)
A few examples of what this looks like in practice:
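Sketched in Python (the values are hypothetical): a confident violation maps near 1, a confident non-violation maps near 0, and an uncertain result from either label lands near 0.5.

```python
def coop_score(label: int, confidence: float) -> float:
    """Map a Zentropi (label, confidence) pair to Coop's 0-to-1
    scale, where higher means more likely violating."""
    return confidence if label == 1 else 1.0 - confidence

high = coop_score(1, 0.92)   # violation, 92% confident  -> 0.92
low = coop_score(0, 0.92)    # non-violation, 92% confident -> 0.08
mid = coop_score(0, 0.50)    # coin-flip either way -> 0.50
```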
This preserves the full information from the Zentropi classifier in a single dimension that works naturally with Coop's threshold-based rules. You can write rules like GREATER_THAN 0.7 to catch high-confidence violations, or route uncertain cases (scores near 0.5) to human reviewers.
There's a broader point here. The trust and safety ecosystem has historically been fragmented—every platform builds its own stack from scratch, and the tools don't talk to each other. ROOST is changing that with open source infrastructure. Zentropi is changing it by making classifiers that adapt to any policy without retraining.
When you combine the two, you get something powerful: an open moderation stack where the classification layer understands your specific policies, not just generic harm categories. You can write a policy in plain English on Zentropi, deploy it as a signal in Coop, and have content flowing through a review pipeline within minutes.
This is also a proof point for Zentropi's architecture. CoPE, the open model that powers our platform, was designed from the ground up to be policy-steerable—you give it a policy, it classifies content against that policy (see model card for details). No retraining, no fine-tuning, no waiting. That design makes Zentropi a natural fit as a signal source in any moderation system, not just Coop.
Setting up Zentropi in Coop takes just a few steps:
Step 0: Activate the integration. In Coop's Integrations settings, enter your Zentropi API key and add the labeler version IDs you want to use. Each labeler version gets a name (like "Hate" or "Puns") so it's easy to reference in rules.
Step 1: Pick the Zentropi signal. When creating an enforcement rule, you'll see Zentropi Labeler appear alongside other signal sources like OpenAI's moderation scores. Select it as the signal for your rule.
Step 2: Select your labeler. Choose which specific labeler version to use as the subcategory for the rule. This is where the policy-steerable part comes in—each labeler enforces a different policy, so you can have separate rules for different harm categories, all powered by Zentropi.
Step 3: Set thresholds and test. Configure your threshold (e.g., "Greater Than 0.8") and test the rule right in the UI with sample content. You'll see the Zentropi score in real time and whether it triggers the rule.
If you're not using either tool yet, now's a good time to try both! Zentropi's Community Edition is free and gives you unlimited labelers. Coop is fully open source and runs on your infrastructure.
We think the future of trust and safety is modular, open, and policy-driven. This integration is a step in that direction, and we're looking forward to seeing what teams build with it.
Create your first labeler at zentropi.ai. Check out the Coop project on GitHub. Questions? Reach out at info@zentropi.ai.
19.2.2026 18:43 Zentropi Now Powers Coop

When we launched Zentropi last year, we set out to transform how developers can get their AI-powered systems under control. Our platform let teams build custom content labelers in minutes using plain English policies—but only for text.
Today, that changes. Zentropi now supports image labeling.
Whether you're running a social platform, a creative tool, or an AI image generator, you face the same challenge: visual content is hard to moderate at scale.
Traditional approaches force a choice: expensive human review that can't keep up, or rigid classifiers that don't match your specific policies. And if you're in the AI image generation space, filtering prompts only gets you so far. Users find creative ways to phrase requests. Prompt injection happens. Sometimes a perfectly innocent prompt produces something you don't want on your platform.
Now you can analyze the images themselves—against your own rules, at scale.
If you've built text labelers with Zentropi, image labeling will feel familiar. You write a policy describing what you want to detect, test it with sample content, and deploy to production. Our automation tools make it easy for even non-policy experts to draft and optimize labeling criteria.
The difference now is in what you're evaluating. Instead of analyzing user messages or prompts, you're looking directly at pixels—user uploads, profile photos, AI-generated artwork, or any visual content flowing through your system.
To power image labeling, we've developed a new version of CoPE built on Google's Gemma 3 12B base model. This gives us native multimodal capabilities—the model understands images and text together, not as separate inputs stitched together.
The new model also brings significantly larger context windows at 128k tokens, which means you can write richer, more detailed policies and the model can evaluate even more complex classification criteria.
Same policy-first approach. Same accuracy targets. Now with vision.
This feature is available to subscribers only. Subscribers can also download the model weights for self-hosting, enabling image classification entirely within your own infrastructure if that's what your security posture requires.
Zentropi Image Labelers are a great fit for:
User-generated content platforms — Analyze profile pictures, uploads, and shared images against your community guidelines. Catch policy violations before they reach other users.
AI image generation — Detect nudity, violence, or brand-unsafe content in generated images—not just prompts. Add a safety layer after generation but before serving to users.
Marketplaces and e-commerce — Screen product photos and seller uploads for prohibited items, misleading imagery, or content that violates your terms.
Brand safety — Ensure marketing assets and AI-generated creative meet your standards before they go live.
As an example, here is how easily we created a labeler for cat images. First, we started with a very basic criteria definition.
Then we uploaded a CSV of images labeled as cats (1/yes) or non-cats (0/no).
Out of the box, this worked super well. A perfect score across precision, recall, and F1!
For fun, we also fired up an optimizer that further refined the criteria into something very rigorous.
Then we instantly deployed this to our API, where it can be integrated into any system.
All told, it took mere minutes to make a custom image labeler that runs extremely fast and at just 1% the cost of a frontier model!
Image labeling is available today for our paid customers. If you're on our Community tier, you can continue building and testing text labelers for free—and upgrade when you're ready to add visual analysis.
To create your first image labeler, log in at zentropi.ai and follow the same flow as for text: write a policy, upload labeled sample images, test, and deploy.
If you aren't yet a subscriber but are interested in evaluating this solution or thinking through your image safety strategy, reach out at info@zentropi.ai. We've helped teams across social platforms, AI products, and creative tools build guardrails that work.
Zentropi helps product teams build policy-steerable content classification that matches frontier model accuracy at a fraction of the cost. Learn more at zentropi.ai.
26.1.2026 17:30 Zentropi Now Labels Images

We just published the methodology behind CoPE. This is the model that powers Zentropi, and we think the approach might be useful for others working on policy-steerable classification systems.
We had already open-sourced the model itself, but the more significant contribution here might be the technique. The paper describes how we trained CoPE - the methodology that others can use to build similar systems for their own needs.
Briefly put: we trained a 9-billion parameter model that matches GPT-4o at content moderation at roughly 1% the size. This paper gets into the details of how we did that.
Content classification has a dependency problem. When policies change - and they change constantly, in response to new harms, regulatory requirements, or community needs - existing tools require retraining. Organizations articulate new content standards, then wait months for data collection and model updates. The enforcement system is always behind the policy.
This happens because traditional classifiers learn patterns from labeled examples. They learn what hate speech "looks like" based on the training data, not what a specific policy actually says. Change the policy, and you need new training data that reflects the new definitions.
We wanted a model that could take any policy as input and evaluate content against it directly - no retraining required.
The core technique is what we call Contradictory Example Training. We show the model the same content with different policies that produce opposite labels.
For example, consider a social media post that includes a slur used in a reclaimed, in-group context. Under a strict policy that prohibits all slur usage regardless of context, this violates. Under a policy that permits reclaimed usage by in-group members, it doesn't. Same content, opposite correct answers - depending entirely on what the policy says.
By training on both cases, we create an environment where the only way for the model to determine the correct label is to pay close attention to the details of the policy. Pattern matching won't work. Cultural heuristics won't work. The model has to actually read.
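One such contradictory pair can be sketched as plain data (the policies and content here are illustrative, echoing the slur example above, not actual training rows):

```python
content = "Example post using a reclaimed slur in an in-group context."

pair = [
    {"policy": "Prohibit all slur usage regardless of context.",
     "content": content, "label": 1},
    {"policy": "Permit reclaimed slur usage by in-group members.",
     "content": content, "label": 0},
]

# Same content, opposite labels: only the policy text disambiguates.
same_content = pair[0]["content"] == pair[1]["content"]
opposite_labels = pair[0]["label"] != pair[1]["label"]
```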
The paper goes deeper into the theory here, including why we believe this approach generalizes to policies the model never saw during training.
The paper goes deep into methodology, including how we created the training data. This is the nerdy stuff - how do you build a sufficiently high-quality dataset where the same content has contradictory but correct labels under different policies?
The challenge is that contradictory training requires deterministic policies - policies where independent readers would reach identical conclusions when applying them to the same content. Without this consistency, the model could succeed through guesswork rather than policy interpretation. If humans can't agree on what a policy means, we can't train a model to follow it.
To build the dataset, we used LLM-assisted labeling with a technique we call binocular labeling: each example is labeled by two independent passes, agreements are accepted automatically, and disagreements are flagged for human review.
This approach dramatically reduces the manual labeling burden. Instead of reviewing every example, we focus human attention on the ambiguous cases that reveal where policy language needs clarification.
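Reading "binocular" as two independent labeling passes — our sketch's assumption here, not a quote from the paper — the routing logic might look like:

```python
def binocular_route(label_a: int, label_b: int):
    """Route an example based on two independent labeling passes:
    agreement is auto-accepted; disagreement goes to human review,
    where it may also signal that the policy language needs work."""
    if label_a == label_b:
        return ("accept", label_a)
    return ("human_review", None)

agreed = binocular_route(1, 1)      # both passes say "violates"
disputed = binocular_route(1, 0)    # passes disagree
```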
CoPE achieves 91% F1 on hate speech compared to GPT-4o's 87%, with sub-200ms latency on a single consumer GPU. We tested across seven harm categories: hate, sexual content, violence, harassment, self-harm, drugs, and toxicity.
The model runs at roughly 1% of GPT-4o's parameter count, making it practical to deploy at scale without frontier model inference costs. The paper includes detailed benchmark comparisons against other open models including LlamaGuard and ShieldGemma.
We should be clear about what's still hard. A few areas where we'd welcome collaboration:
Evaluation is genuinely difficult. Most public benchmarks don't disclose the labeling guidelines given to raters, making it hard to know whether you're measuring policy interpretation or agreement with unstated cultural assumptions. We had to build our own evaluation framework with held-out policies, but the field needs better shared benchmarks.
Deterministic policies are a constraint. The methodology requires policies where humans can achieve high agreement. Highly subjective categories - "this feels creepy" - may not meet this threshold. We don't yet know how to extend the approach to inherently ambiguous domains.
Multilingual remains untested. Our current evaluation focuses on English. The base model supports other languages, but we haven't validated performance, and policy interpretation may have different challenges across linguistic and cultural contexts.
This is the methodology behind our product. If you want to see CoPE in action, Zentropi offers custom content labeling at scale - you define the policy, and we help you refine it into a version that's machine-interpretable so that you can do accurate labeling. Learn more at https://zentropi.ai
Read the full paper here: https://arxiv.org/abs/2512.18027
Questions about the methodology? Reach out at info@zentropi.ai.
15.1.2026 18:44 How we built CoPE

Earlier this week, we launched seven publicly shareable content policies on Zentropi - harassment, hate, violence, self-harm, sexual content, drugs, and toxicity.
The toxic content policy, in particular, is worth examining in detail. It not only illustrates a conceptual problem in typical approaches to content classification, but offers an alternative methodology that we hope could be useful for others.
The common definition of toxicity used in the industry is something like "content that is likely to make people leave a discussion".
This is an outcome-centered definition, which defines "toxic content" by the *impact* it has, not the *rhetoric* it uses. It is a very useful idea in thinking about product design, feed ranking, and even policy-design goals. But it is not a usable idea for the classification of content on its own.
The problem is that the policy delegates the central question (what drives users away?) to a person following that policy rather than providing them an answer to it. Every time they review content, moderators must predict "will this make users leave?" based on their own intuitions.
Consider: "This is a terrible approach that completely misses the point. We need to start over."
One moderator predicts this will make the recipient defensive and leave. Another thinks it's expected critical feedback in a professional context. Both predictions are defensible. Neither is obviously wrong. And there's no way to determine which is correct - you're asking them to predict unknowable future reactions of hypothetical users.
Without community-specific behavioral data, moderators must speculate based on their own assumptions. Two moderators can make different predictions about the same content, both reasonable, with no mechanism to resolve the disagreement. Quality control becomes difficult, consistency suffers, and the core purpose of a policy - enabling moderators to make the same call without debating every case - doesn't work.
An alternative approach: the policy defines specific, observable content features that moderators can assess without predicting reactions.
Instead of asking "will this make users leave?", the policy answers that question by identifying concrete features - specific language patterns, targeting requirements, contextual considerations - and asks moderators only to assess whether those features are present in the content being reviewed.
This shifts the hard work from moderators to the policy itself. The policy translates behavioral goals ("reduce content that drives users away") into observable content characteristics ("identify hostile targeting of participants using specific patterns"). Moderators then apply those defined characteristics rather than developing their own predictions about user behavior.
A useful policy should allow different observers to separately reach the same conclusion when presented with the same facts. Put another way, the purpose of a policy is to keep everyone on the same page. Achieving this purpose places limits on what a policy can ask moderators to do - outcome prediction requires speculation, while observable features enable consensus.
For example, the policy might treat content targeting conversation participants differently from content targeting public figures, establish specific rhetorical patterns that constitute hostile language, or create explicit exclusions for legitimate criticism. The key is that these determinations happen once, in the policy, rather than case-by-case in moderators' heads.
For our toxicity policy, we answered "what rhetoric drives users away?" with hostile targeting of conversation participants. We define specific patterns - combative language, belittling, personal attacks, condescension - when directed at people in the conversation, not public figures or other third parties. We also created explicit exclusions for legitimate criticism, so strong language about work product doesn't get conflated with hostile language about people.
It's worth acknowledging that this definition means our labeler will perform differently on existing toxicity datasets - by design. We're measuring something specific (hostile targeting of participants) rather than trying to predict general "toxicity," so benchmark performance will reflect that definitional difference.
But that difference is the entire point. Our policy is simply *our* answer to the question "what content is toxic?" If you believe different rhetoric is likely to make people leave conversations, you can fork our toxicity labeler and adapt the definition - add different patterns, remove the participant-targeting restriction, adjust the exclusions. The key is that you're defining observable rhetoric, not asking moderators to predict outcomes. The methodology is what matters, not our specific choices.
When your policy defines observable features rather than asking moderators to predict reactions, three things change:
1. Quality control becomes possible - you can audit whether moderators correctly identified defined features with ground truth in the policy text, not competing predictions about unknowable behavior.
2. Rule changes become predictable - you can identify exactly which content would change if you add a distinction between participant and non-participant targeting, test edge cases, and debate outcomes before implementing, rather than running experiments to see if wording shifts moderator intuitions.
3. Consistency becomes achievable - moderators can debate the interpretation of rule text and reach consensus on observable features, rather than each developing their own answer about predicted outcomes based on personal intuitions.
Check out the toxicity labeler (toxicity-public-s5) on Zentropi, which you can integrate with your platform instantly using our free API. Browse the full policy to see how defining observable features creates different outcomes than behavioral prediction. All seven policies are available to browse and fork for your specific community context.
13.11.2025 22:20 Observations on Toxicity

After previewing Zentropi to broad acclaim at TrustCon, we are officially opening up our build-your-own-content-labeler platform to everyone. Go to zentropi.ai to check it out!
Zentropi tackles a critical challenge that all teams face when safeguarding their digital platforms: how to label content quickly, cheaply, accurately—and by their own rules. This is a universal problem, whether you're building content moderation systems, aligning new AI models, deploying a chatbot, or creating new feed algorithms.
Traditional approaches have involved either training machine learning classifiers (which is time-consuming and brittle), or trying to use large language models (which are expensive and often inaccurate). Policy changes end up requiring weeks of model retraining, and only organizations with significant ML resources can build effective systems. Meanwhile, systems need protection today, not months from now.
To make matters worse, it's extremely difficult for teams to even articulate what policy they want their labelers to follow, so they're typically forced into using off-the-shelf classifiers that impose their own definitions.
Zentropi addresses all these issues with a fundamentally different approach. We've created a unified system that tackles both sides of the coin: accurately interpreting a given set of policy criteria as well as helping people craft policies that are machine interpretable.
Instead of teaching a model a specific policy, we developed CoPE—a best-in-class small language model capable of faithful policy interpretation. Your team writes policies in plain English. CoPE then enforces them consistently at scale. As policies evolve, implementation updates instantly.
It was not an easy task to get here. After nearly two years of research, we created a novel methodology that enabled us to build CoPE as our engine. We then built an agentic system that helps people rapidly draft, test, and optimize their policies, including suggestions for where both their criteria and their labels could improve.
The results speak for themselves: CoPE matches frontier-model accuracy at roughly 1% of the parameter count, at a fraction of the inference cost.
Since CoPE is a small model (only 9B parameters), it's fast and cheap to run at inference time, making it suitable for real-time systems that operate at internet scale.
We have just been floored by the feedback we've gotten from our design partners this year. We've seen teams who lack their own in-house policy expertise rapidly create labelers and go from a starting accuracy of ~60% to over 90%—wholly assisted by Zentropi's automated optimizations. And then with one click they've instantly deployed their labelers into production. The consistent feedback: teams can finally focus on policy strategy rather than implementation constraints. This simply was not possible before.
Community Edition (Free)
Enterprise Edition
After seeing partners put Zentropi into production and witnessing its real-world impact, we are inspired to make this technology available to as many teams as possible. That is why we are now opening it up to everyone. We want to do our part to help create a world where AI-powered systems are maximally trustworthy.
The best way to understand the power of Zentropi is to go try it yourself—so head over to zentropi.ai and let us know what you think! For organizations ready to transform the quality of their platforms, our enterprise team is available to discuss your specific requirements. Just drop us a line at info@zentropi.ai.
We look forward to seeing how your organization uses Zentropi to build a better internet!
19.8.2025 15:30 Zentropi: Build Your Own Content Labeler in Minutes, Not Months