Blog.zentropi.ai

Official blog of Zentropi.ai -- Ideas for building more trustworthy AI-powered systems.


Blog.zentropi.ai News

Enabling Streaming Classification

https://blog.zentropi.ai/enablin...

Content classifiers today work on complete text. You hand them a finished document, a completed comment, a fully generated response — and they tell you whether it violates a policy. This made sense when content was written by humans and published in discrete units.

Generative AI changed the equation. An LLM generates content token by token, and the user sees each token as it arrives. A classifier that waits for the full output to score it is, by definition, too late — the user has already read everything the model produced. If the output violates a policy, the damage is done before the classifier even runs.

We need classifiers that can score content as it streams, token by token, and raise an alarm before the sequence is complete. We've been working on a technique that does exactly this, and we're publishing our methodology openly today so others can use and improve upon it. Download it here on Huggingface: zentropi-ai/cope-a-9b-stream-probe

The problem with post-hoc classification

CoPE, our policy-conditioned content evaluator, works by processing the full content alongside a policy and producing a binary verdict: does this content adhere to the policy criteria? It's extremely accurate, but it requires the complete content before it can answer.

In a streaming context, this creates an uncomfortable gap. An LLM generating a response might produce 1000 tokens before the violation occurs. A post-hoc classifier catches it, but only after a long delay or after the user has seen the bulk of it. What we want is a system that can flag the violation as it happens — or ideally, a few tokens before the full picture becomes clear.

Hidden states already know

The key insight behind our approach is that CoPE's internal representations — the hidden state vectors produced at each token position — already encode information about whether a violation is developing, well before the model reaches its final verdict.

This makes intuitive sense. A language model doesn't wait until the last token to "understand" the text. By the time it's processed "I can't stand my coworker Bob anymore. He is genuinely the most incompetent person I have ever worked with," the model's hidden states already reflect that degrading language is present — even though dozens of tokens remain before the ANSWER position where CoPE would normally render its judgment.

We exploit this by training a lightweight linear probe — a simple logistic regression — on these intermediate hidden states. The probe takes a single hidden state vector (3,584 dimensions from CoPE's final layer norm) and produces a score between 0 and 1 indicating the probability at that point in the sequence that a policy violation will occur by the end of the content sample.

The probe is four numpy arrays totaling a few megabytes. Inference is a simple dot product and a sigmoid — making this method essentially free compared to the cost of the forward pass that produces the hidden states.
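To make the mechanics concrete, here is a minimal sketch of probe inference in numpy. The weight vector `w` and bias `b` below are random stand-ins for the published probe arrays, and the function name is ours:

```python
import numpy as np

def probe_score(hidden_state: np.ndarray, w: np.ndarray, b: float) -> float:
    # Logistic regression on one hidden state: dot product, then sigmoid.
    logit = float(hidden_state @ w) + b
    return 1.0 / (1.0 + np.exp(-logit))

# Random stand-ins for illustration; the real weights ship with the probe.
rng = np.random.default_rng(0)
h = rng.standard_normal(3584)          # one vector from CoPE's final layer norm
w = rng.standard_normal(3584) * 0.01   # probe weights (hypothetical values)
s = probe_score(h, w, b=0.0)
assert 0.0 <= s <= 1.0
```

The forward pass that produces `h` is where all the cost lives; the probe itself is a single fused multiply-add per token.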

Training the probe to activate at the right moment

A naive approach would label every content token with the sample's final label: if the content ultimately violates the policy, label every token 1. This produces a probe that fires from the first token onward — before any violation has appeared in the content. It's useless for streaming because it triggers on the policy prefix, not the content.

Our solution is span labeling. For each positive training example, we annotate where in the content the violation begins — the "onset" position. Tokens before the onset are labeled 0; tokens from the onset onward are labeled 1. Negative examples have all tokens labeled 0.

This teaches the probe to activate only when it sees violating content, not merely because the policy is strict. The onset annotations were labeled across ~10,000 positive training examples, with an estimated 98.5% word-level matching accuracy.
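The labeling rule itself is simple. A sketch (function name is ours, not from the released training scripts):

```python
def token_labels(num_tokens, onset):
    """Per-token probe targets: 0 before the violation onset, 1 from it onward.

    onset=None marks a negative example (no violation anywhere).
    """
    if onset is None:
        return [0] * num_tokens
    return [0] * onset + [1] * (num_tokens - onset)

assert token_labels(6, 2) == [0, 0, 1, 1, 1, 1]   # violation begins at token 2
assert token_labels(4, None) == [0, 0, 0, 0]      # negative example
```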

Contrastive training data

A content classifier can learn a dangerous shortcut: instead of learning to evaluate content against a policy, it can learn that certain topics are always violations. A probe trained on "self-harm content is always bad" will score high on any mention of self-harm regardless of the policy.

To prevent this, we constructed a contrastive training set (as detailed in our CoPE paper) with two guarantees:

  1. Every piece of content appears in the dataset with both a positive and negative label (under different policies)
  2. Every policy appears in the dataset with both positive and negative examples

This forces the probe to learn policy-conditioned features — it cannot succeed by pattern-matching on content alone or policy alone. This produced roughly 10,000 high-quality contrastive examples that we used to train the probe.
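The sketch below (our own illustration, not Zentropi tooling) checks both guarantees on a toy dataset:

```python
from collections import defaultdict

def is_contrastive(dataset):
    """Check that every content item AND every policy appears with both labels."""
    content_labels = defaultdict(set)
    policy_labels = defaultdict(set)
    for ex in dataset:
        content_labels[ex["content"]].add(ex["label"])
        policy_labels[ex["policy"]].add(ex["label"])
    return (all(v == {0, 1} for v in content_labels.values())
            and all(v == {0, 1} for v in policy_labels.values()))

toy = [
    {"content": "c1", "policy": "strict",     "label": 1},
    {"content": "c1", "policy": "permissive", "label": 0},
    {"content": "c2", "policy": "strict",     "label": 0},
    {"content": "c2", "policy": "permissive", "label": 1},
]
assert is_contrastive(toy)
```

A dataset that fails this check leaves a shortcut open: some content (or some policy) can be classified without reading the other half of the input.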

The probe in action

Here's what this looks like in practice. The content below is a workplace vent containing personal insults, evaluated against a policy that flags degrading language directed at individuals. The score column is the raw probe output at each token; the ema column is an exponential moving average (decay=0.3) that smooths the noisy per-token signal into something usable for streaming decisions.

  Policy: "Flag content that contains personal insults, name-calling,
           or degrading language directed at specific individuals."

    #  token              score     ema
  ─────────────────────────────────────────────────
    1  I                  0.001   0.001
    2   can               0.001   0.001
    3  '                  0.014   0.005
    4  t                  0.000   0.004
    5   stand             0.000   0.002
    6   my                0.000   0.001
    7   coworker          0.000   0.001
    8   Bob             0.012   0.005
    9   anymore           0.086   0.033
   10  .                  0.000   0.022
   11   He                0.216   0.090
   12   is                0.000   0.058
   13   genuinely         0.001   0.038
   14   the               0.000   0.025
   15   most              0.001   0.017
   16   incompetent       0.000   0.011
   17   person            0.551   0.200  ◀ first spike
   18   I                 0.024   0.138
   19   have              0.293   0.192
   20   ever              0.008   0.128
   21   worked            0.836   0.376  ◀
   22   with              0.083   0.273
   23  .                  0.001   0.178
   24   Every             0.351   0.239
   25   single            0.343   0.275
   26   project           0.047   0.195
   27   he                0.991   0.474  ◀
   28   touches           0.015   0.313
   29   turns             0.273   0.299
   30   into              0.091   0.226
   31   a                 0.771   0.417  ◀
   32   complete          0.745   0.532  ◀ EMA crosses 0.5
   33   disaster          0.159   0.401
   34  .                  0.007   0.263
   35   The               0.002   0.172
   36   whole             0.039   0.125
   37   team              0.965   0.419  ◀
   38   thinks            0.032   0.283
   39   he                0.698   0.429
   40  '                  0.001   0.279
   41  s                  0.449   0.338
   42   useless           0.070   0.245
   43   and               0.907   0.476  ◀
   44   honestly          0.651   0.537
   45   he                0.327   0.464
   46   should            0.380   0.434
   47   be                0.593   0.490
   48   embarrassed       0.059   0.339
   49   to                0.292   0.323
   50   even              0.070   0.234
   51   show              0.373   0.283
   52   his               0.005   0.186
   53   face              0.607   0.333
   54   at                0.001   0.217
   55   meetings          0.001   0.141
   56  .                  0.070   0.116
   57   What              0.815   0.361  ◀
   58   a                 0.055   0.254
   59   pathetic          0.019   0.171
   60   excuse            0.730   0.367
   61   for               0.567   0.437
   62   a                 0.023   0.292
   63   professional      0.894   0.503  ◀
   64  .                  0.450   0.485
  ─────────────────────────────────────────────────
  ANS  <answer token>     1.000  (cope=1.000)

A few things to notice:

When the same content is evaluated against an irrelevant policy ("Flag content that contains explicit threats of physical violence"), both the per-token scores and the EMA stay near zero throughout, and the ANSWER token scores 0.000. The probe learned to condition on the policy, not just the content. Please see this tutorial notebook to run the full example.

A streaming classifier is inherently less confident

You can see this directly in the table above: the per-token scores are spiky and less decisive than the final ANSWER token. This is worth stating plainly — a streaming classifier will always be less confident than a post-hoc one. It has to be. It doesn't know what's coming next.

Consider a sentence that begins "I'm going to..." — the next word could be "help" or "hurt." A streaming classifier at this token position genuinely cannot know the final label because the information doesn't exist yet. To make matters more complex, a sample can go from seemingly violating and back to being benign (e.g., quoted rather than direct hate speech). A post-hoc classifier that sees the full sentence has no such ambiguity.

Our probe reflects this. At the ANSWER token position (where the full content has been processed), the probe achieves an F1 score that essentially matches CoPE's. But at intermediate positions, per-token scores are spiky and less decisive. This is correct behavior, not a limitation.

The practical consequence is that streaming thresholds require more careful, policy-specific calibration than post-hoc thresholds. A threshold of 0.5 might be appropriate for one policy but too aggressive or too conservative for another. Production deployments should calibrate thresholds on held-out data for each policy, rather than relying on a single universal threshold.
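A minimal way to do that calibration, assuming you have one peak streaming score and one gold label per held-out sample (function name and grid are ours):

```python
def calibrate_threshold(scores, labels, grid=None):
    """Pick the threshold that maximizes F1 on held-out (score, label) pairs."""
    grid = grid or [i / 100 for i in range(1, 100)]

    def f1(t):
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    return max(grid, key=f1)

# Toy held-out set: violating samples peak high, benign ones stay low.
scores = [0.9, 0.7, 0.6, 0.2, 0.1, 0.3]
labels = [1,   1,   1,   0,   0,   0]
t = calibrate_threshold(scores, labels)
assert 0.3 < t <= 0.6
```

In practice you would run this once per policy, since the same numeric threshold can be too aggressive for one policy and too conservative for another.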

Aggregation matters

Raw per-token probe scores are noisy: a token might score 0.95 followed by one that scores 0.01. To produce a usable streaming signal, you need an aggregation strategy.

We recommend an exponential moving average (EMA) with a decay factor of ~0.3. The EMA responds quickly to bursts of high-scoring tokens (which correspond to violating content) and decays naturally when the probe stops firing. With a 0.3 decay, the EMA roughly reflects what the probe has been seeing in the last few tokens — recent enough to catch violations promptly, smooth enough to avoid false alarms from isolated spikes.

A running mean is the simpler alternative, but it has a structural weakness: early low-scoring tokens permanently drag the average down. For content where violations are interspersed with neutral tokens (which is common — not every word in an insult is itself insulting), the running mean may never cross a decision threshold even when the probe is clearly detecting violations.
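The sketch below contrasts the two on a toy stream, assuming the update convention ema = decay * score + (1 - decay) * ema (which appears to match the table above up to rounding):

```python
def ema_stream(scores, decay=0.3):
    """Exponential moving average of per-token probe scores."""
    ema, out = 0.0, []
    for s in scores:
        ema = decay * s + (1 - decay) * ema
        out.append(ema)
    return out

def mean_stream(scores):
    """Running mean of per-token probe scores."""
    out, total = [], 0.0
    for i, s in enumerate(scores, start=1):
        total += s
        out.append(total / i)
    return out

# 20 benign tokens, then a 5-token burst of violating content:
scores = [0.02] * 20 + [0.9] * 5
assert ema_stream(scores)[-1] > 0.5    # EMA crosses a 0.5 threshold
assert mean_stream(scores)[-1] < 0.25  # running mean stays far below it
```

The benign prefix permanently weighs down the running mean, while the EMA forgets it within a few tokens of the burst.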

Early warning, not final verdict

We want to be precise about the current value of streaming classification. It is not a replacement for post-hoc scoring — it is an early warning system.

In a production setting, the streaming signal is most valuable when the cost of late detection is high: real-time chat, voice interfaces, agentic systems that take actions based on LLM output. In these contexts, catching a violation 30 tokens early is worth more than catching it with marginally higher confidence after the fact.

Open methodology

This work is experimental. We're publishing it openly — including a tutorial notebook with full code — because we believe the technique is useful today and we want others to build on it.

We have a number of open questions we'd like the community to explore.

If you're working on real-time content safety for generative AI systems, we'd love to hear what you find. The probe weights, training scripts, and tutorial are all available on HuggingFace at zentropi-ai/cope-a-9b-stream-probe.

Published 10.3.2026 18:37

Zentropi Now Powers Coop

https://blog.zentropi.ai/zentrop...

We built Zentropi so that teams could create accurate content classifiers in minutes, not months. Today, we're excited to share that Zentropi labelers are now integrated as a signal source in Coop, the open source content moderation platform from ROOST.

This matters because it demonstrates something we've believed from the start: policy-steerable classifiers should work everywhere. Your moderation stack, your review tool, your agentic pipeline—wherever you need a classification decision, Zentropi should plug in.

What is Coop?

Coop is ROOST's open source review tool for content moderation. It provides the full operational layer that sits between your platform and your reviewers: queues, routing rules, a review console, automated enforcement, analytics, and specialized child safety workflows. Best of all, it runs on your own infrastructure.

Coop has a concept called signals—scores from external classifiers that feed into routing decisions and reviewer interfaces. When content arrives, Coop's rules engine evaluates signals against configurable thresholds to determine whether to auto-action, route to a queue, or escalate for human review.

That's where Zentropi comes in.

How the Integration Works

Zentropi labelers produce two outputs: a binary label (0 or 1) indicating whether content violates a given policy, and a confidence score (0 to 1) reflecting how certain the model is.

Coop expects a single score on a 0-to-1 scale, where higher means more likely violating. We map between the two with a simple formula:

score = label === 1 ? confidence : (1 - confidence)

A few examples of what this looks like in practice:

  Zentropi Output             Coop Score   What It Means
  ─────────────────────────────────────────────────────────────
  Label 1, confidence 0.95    0.95         Very likely violating
  Label 0, confidence 0.95    0.05         Very likely safe
  Label 1, confidence 0.60    0.60         Uncertain, leaning violating
  Label 0, confidence 0.60    0.40         Uncertain, leaning safe

This preserves the full information from the Zentropi classifier in a single dimension that works naturally with Coop's threshold-based rules. You can write rules like GREATER_THAN 0.7 to catch high-confidence violations, or route uncertain cases (scores near 0.5) to human reviewers.
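The same mapping as a small, self-contained Python sketch, with an illustrative routing rule (threshold values are examples, not Coop defaults):

```python
def coop_score(label: int, confidence: float) -> float:
    """Map Zentropi's (label, confidence) pair onto Coop's 0-to-1 scale."""
    return confidence if label == 1 else 1.0 - confidence

def route(score: float) -> str:
    # Example thresholds; calibrate per policy in a real deployment.
    if score > 0.7:
        return "auto-action"
    if 0.4 <= score <= 0.6:
        return "human review"
    return "allow"

assert coop_score(1, 0.95) == 0.95               # very likely violating
assert abs(coop_score(0, 0.95) - 0.05) < 1e-9    # very likely safe
assert route(coop_score(1, 0.95)) == "auto-action"
assert route(coop_score(0, 0.60)) == "human review"
```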

Why This Matters

There's a broader point here. The trust and safety ecosystem has historically been fragmented—every platform builds its own stack from scratch, and the tools don't talk to each other. ROOST is changing that with open source infrastructure. Zentropi is changing it by making classifiers that adapt to any policy without retraining.

When you combine the two, you get something powerful: an open moderation stack where the classification layer understands your specific policies, not just generic harm categories. You can write a policy in plain English on Zentropi, deploy it as a signal in Coop, and have content flowing through a review pipeline within minutes.

This is also a proof point for Zentropi's architecture. CoPE, the open model that powers our platform, was designed from the ground up to be policy-steerable—you give it a policy, it classifies content against that policy (see model card for details). No retraining, no fine-tuning, no waiting. That design makes Zentropi a natural fit as a signal source in any moderation system, not just Coop.

Getting Started

Setting up Zentropi in Coop takes just a few steps:

Step 0: Activate the integration. In Coop's Integrations settings, enter your Zentropi API key and add the labeler version IDs you want to use. Each labeler version gets a name (like "Hate" or "Puns") so it's easy to reference in rules.

Step 1: Pick the Zentropi signal. When creating an enforcement rule, you'll see Zentropi Labeler appear alongside other signal sources like OpenAI's moderation scores. Select it as the signal for your rule.

Step 2: Select your labeler. Choose which specific labeler version to use as the subcategory for the rule. This is where the policy-steerable part comes in—each labeler enforces a different policy, so you can have separate rules for different harm categories, all powered by Zentropi.

Step 3: Set thresholds and test. Configure your threshold (e.g., "Greater Than 0.8") and test the rule right in the UI with sample content. You'll see the Zentropi score in real time and whether it triggers the rule.

If you're not using either tool yet, now's a good time to try both! Zentropi's Community Edition is free and gives you unlimited labelers. Coop is fully open source and runs on your infrastructure.

We think the future of trust and safety is modular, open, and policy-driven. This integration is a step in that direction, and we're looking forward to seeing what teams build with it.


Create your first labeler at zentropi.ai. Check out the Coop project on GitHub. Questions? Reach out at info@zentropi.ai.

Published 19.2.2026 18:43

Zentropi Now Labels Images

https://blog.zentropi.ai/zentrop...

When we launched Zentropi last year, we set out to transform how developers get their AI-powered systems under control. Our platform let teams build custom content labelers in minutes using plain English policies—but only for text.

Today, that changes. Zentropi now supports image labeling.

Why Images Matter

Whether you're running a social platform, a creative tool, or an AI image generator, you face the same challenge: visual content is hard to moderate at scale.

Traditional approaches force a choice: expensive human review that can't keep up, or rigid classifiers that don't match your specific policies. And if you're in the AI image generation space, filtering prompts only gets you so far. Users find creative ways to phrase requests. Prompt injection happens. Sometimes a perfectly innocent prompt produces something you don't want on your platform.

Now you can analyze the images themselves—against your own rules, at scale.

How It Works

If you've built text labelers with Zentropi, image labeling will feel familiar. You write a policy describing what you want to detect, test it with sample content, and deploy to production. Our automation tools make it easy for even non-policy experts to draft and optimize labeling criteria.

The difference now is in what you're evaluating. Instead of analyzing user messages or prompts, you're looking directly at pixels—user uploads, profile photos, AI-generated artwork, or any visual content flowing through your system.

A New Model for Multimodal Classification

To power image labeling, we've developed a new version of CoPE built on Google's Gemma 3 12B base model. This gives us native multimodal capabilities—the model understands images and text together, not as separate inputs stitched together.

The new model also brings significantly larger context windows at 128k tokens, which means you can write richer, more detailed policies and the model can evaluate even more complex classification criteria.

Same policy-first approach. Same accuracy targets. Now with vision.

This feature is available for subscribers only, who can also download model weights for self-hosting—enabling image classification entirely within your own infrastructure if that's what your security posture requires.

What You Can Build

Zentropi Image Labelers are a great fit for:

User-generated content platforms — Analyze profile pictures, uploads, and shared images against your community guidelines. Catch policy violations before they reach other users.

AI image generation — Detect nudity, violence, or brand-unsafe content in generated images—not just prompts. Add a safety layer after generation but before serving to users.

Marketplaces and e-commerce — Screen product photos and seller uploads for prohibited items, misleading imagery, or content that violates your terms.

Brand safety — Ensure marketing assets and AI-generated creative meet your standards before they go live.

Obligatory Cat Example

As an example, here is how easily we created a labeler for cat images. First, we started with a very basic criteria definition.

Then we uploaded a CSV of images labeled as cats (1/yes) and non-cats (0/no).

Out of the box, this worked super well. A perfect score across precision, recall, and F1!

For fun, we also fired up an optimizer that further refined the criteria into something very rigorous.

Then we instantly deployed this to our API, where it can be integrated into any system.

All told, it took mere minutes to make a custom image labeler that runs extremely fast and at just 1% the cost of a frontier model!

Getting Started

Image labeling is available today for our paid customers. If you're on our Community tier, you can continue building and testing text labelers for free—and upgrade when you're ready to add visual analysis.

To create your first image labeler:

  1. Log in to zentropi.ai
  2. Create a new labeler
  3. Write your policy in plain English (or have one generated for you)
  4. Upload test images and refine until you're confident
  5. Publish and integrate via our API

If you aren't yet a subscriber but are interested in evaluating this solution or thinking through your image safety strategy, reach out at info@zentropi.ai. We've helped teams across social platforms, AI products, and creative tools build guardrails that work.


Zentropi helps product teams build policy-steerable content classification that matches frontier model accuracy at a fraction of the cost. Learn more at zentropi.ai.

Published 26.1.2026 17:30

How we built CoPE

https://blog.zentropi.ai/how-we-...

We just published the methodology behind CoPE. This is the model that powers Zentropi, and we think the approach might be useful for others working on policy-steerable classification systems.

We had already open-sourced the model itself, but the more significant contribution here might be the technique. The paper describes how we trained CoPE - the methodology that others can use to build similar systems for their own needs.

Briefly put: we trained a 9-billion-parameter model that matches GPT-4o at content moderation at roughly 1% of its size. The paper gets into the details of how we did that.

The Problem We Were Trying to Solve

Content classification has a dependency problem. When policies change - and they change constantly, in response to new harms, regulatory requirements, or community needs - existing tools require retraining. Organizations articulate new content standards, then wait months for data collection and model updates. The enforcement system is always behind the policy.

This happens because traditional classifiers learn patterns from labeled examples. They learn what hate speech "looks like" based on the training data, not what a specific policy actually says. Change the policy, and you need new training data that reflects the new definitions.

We wanted a model that could take any policy as input and evaluate content against it directly - no retraining required.

Contradictory Example Training

The core technique is what we call Contradictory Example Training. We show the model the same content with different policies that produce opposite labels.

For example, consider a social media post that includes a slur used in a reclaimed, in-group context. Under a strict policy that prohibits all slur usage regardless of context, this violates. Under a policy that permits reclaimed usage by in-group members, it doesn't. Same content, opposite correct answers - depending entirely on what the policy says.

By training on both cases, we create an environment where the only way for the model to determine the correct label is to pay close attention to the details of the policy. Pattern matching won't work. Cultural heuristics won't work. The model has to actually read.
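Schematically, one contradictory pair looks like this (placeholder content string, paraphrased policies):

```python
post = "<social media post using a slur in a reclaimed, in-group context>"

contradictory_pair = [
    {"policy": "Prohibit all slur usage, regardless of context.",
     "content": post, "label": 1},   # violates the strict policy
    {"policy": "Permit reclaimed slur usage by in-group members.",
     "content": post, "label": 0},   # adheres to the permissive policy
]

# Same content, opposite correct answers: only the policy can decide.
assert contradictory_pair[0]["content"] == contradictory_pair[1]["content"]
assert contradictory_pair[0]["label"] != contradictory_pair[1]["label"]
```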

The paper goes deeper into the theory here, including why we believe this approach generalizes to policies the model never saw during training.

Building the Dataset: Binocular Labeling

The paper goes deep into methodology, including how we created the training data. This is the nerdy stuff - how do you build a sufficiently high-quality dataset where the same content has contradictory but correct labels under different policies?

The challenge is that contradictory training requires deterministic policies - policies where independent readers would reach identical conclusions when applying them to the same content. Without this consistency, the model could succeed through guesswork rather than policy interpretation. If humans can't agree on what a policy means, we can't train a model to follow it.

To build the dataset, we used LLM-assisted labeling with a technique we call binocular labeling:

  1. We draft an initial policy and use an LLM to generate a semantically equivalent but linguistically distinct alternative version
  2. We run the same content through both policy versions using an LLM-based labeling system
  3. We only manually review the mismatches - cases where the two versions produced different labels
  4. Based on those reviews, we refine the policy language and repeat until the two versions achieve high agreement

This approach dramatically reduces the manual labeling burden. Instead of reviewing every example, we focus human attention on the ambiguous cases that reveal where policy language needs clarification.
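The loop above can be sketched as follows; `label_with` stands in for an LLM labeling call, and the keyword-set "policies" are a toy stand-in for real policy text:

```python
def binocular_mismatches(items, policy_a, policy_b, label_with):
    """One binocular pass: label every item under both policy versions and
    return the disagreements, the only cases humans need to review."""
    return [x for x in items
            if label_with(policy_a, x) != label_with(policy_b, x)]

# Toy labeler: flags any item containing a word the policy names.
def label_with(policy, item):
    return int(any(word in item for word in policy))

items = ["spam offer", "holiday photos", "junk mail"]
a = ["spam", "junk"]   # policy version A, reduced to keywords for the demo
b = ["spam"]           # paraphrase that accidentally dropped "junk"
assert binocular_mismatches(items, a, b, label_with) == ["junk mail"]
```

Each mismatch points at a spot where the policy language is ambiguous; fixing the wording and re-running shrinks the review set on every iteration.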

Results

CoPE achieves 91% F1 on hate speech compared to GPT-4o's 87%, with sub-200ms latency on a single consumer GPU. We tested across seven harm categories: hate, sexual content, violence, harassment, self-harm, drugs, and toxicity.

The model runs at roughly 1% of GPT-4o's parameter count, making it practical to deploy at scale without frontier model inference costs. The paper includes detailed benchmark comparisons against other open models including LlamaGuard and ShieldGemma.

Open Research Problems

We should be clear about what's still hard. A few areas where we'd welcome collaboration:

Evaluation is genuinely difficult. Most public benchmarks don't disclose the labeling guidelines given to raters, making it hard to know whether you're measuring policy interpretation or agreement with unstated cultural assumptions. We had to build our own evaluation framework with held-out policies, but the field needs better shared benchmarks.

Deterministic policies are a constraint. The methodology requires policies where humans can achieve high agreement. Highly subjective categories - "this feels creepy" - may not meet this threshold. We don't yet know how to extend the approach to inherently ambiguous domains.

Multilingual remains untested. Our current evaluation focuses on English. The base model supports other languages, but we haven't validated performance, and policy interpretation may have different challenges across linguistic and cultural contexts.

This Powers Zentropi

This is the methodology behind our product. If you want to see CoPE in action, Zentropi offers custom content labeling at scale - you define the policy, and we help you refine it into a version that's machine-interpretable so that you can do accurate labeling. Learn more at https://zentropi.ai

Read the full paper here: https://arxiv.org/abs/2512.18027

Questions about the methodology? Reach out at info@zentropi.ai.

Published 15.1.2026 18:44

Observations on Toxicity

https://blog.zentropi.ai/observa...

Earlier this week, we launched seven publicly sharable content policies on Zentropi - harassment, hate, violence, self-harm, sexual content, drugs, and toxicity.

The toxic content policy, in particular, is worth examining in detail. It not only illustrates a conceptual problem in typical approaches to content classification, but offers an alternative methodology that we hope could be useful for others.

The Usual Approach: Predicting Outcomes

The common definition of toxicity used in the industry is something like "content that is likely to make people leave a discussion".

This is an outcome-centered definition, which defines "toxic content" by the *impact* it has, not the *rhetoric* it uses. It is a very useful idea in thinking about product design, feed ranking, and even policy-design goals. But it is not a usable idea for the classification of content on its own.

The problem is that the policy delegates the central question (what drives users away?) to the person applying it rather than answering it. Every time they review content, moderators must predict "will this make users leave?" based on their own intuitions.

Consider: "This is a terrible approach that completely misses the point. We need to start over."

One moderator predicts this will make the recipient defensive and leave. Another thinks it's expected critical feedback in a professional context. Both predictions are defensible. Neither is obviously wrong. And there's no way to determine which is correct: you're asking moderators to predict the unknowable future reactions of hypothetical users.

Without community-specific behavioral data, moderators must speculate based on their own assumptions. Two moderators can make different predictions about the same content, both reasonable, with no mechanism to resolve the disagreement. Quality control becomes difficult, consistency suffers, and the core purpose of a policy - enabling moderators to make the same call without debating every case - doesn't work.

Observable Features: Defining Specific Content Characteristics

An alternative approach: the policy defines specific, observable content features that moderators can assess without predicting reactions.

Instead of asking "will this make users leave?", the policy answers that question by identifying concrete features - specific language patterns, targeting requirements, contextual considerations - and asks moderators only to assess whether those features are present in the content being reviewed.

This shifts the hard work from moderators to the policy itself. The policy translates behavioral goals ("reduce content that drives users away") into observable content characteristics ("identify hostile targeting of participants using specific patterns"). Moderators then apply those defined characteristics rather than developing their own predictions about user behavior.

A useful policy should allow different observers to separately reach the same conclusion when presented with the same facts. Put another way, the purpose of a policy is to keep everyone on the same page. Achieving this purpose places limits on what a policy can ask moderators to do - outcome prediction requires speculation, while observable features enable consensus.

For example, the policy might treat content targeting conversation participants differently from content targeting public figures, establish specific rhetorical patterns that constitute hostile language, or create explicit exclusions for legitimate criticism. The key is that these determinations happen once, in the policy, rather than case-by-case in moderators' heads.
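To make the contrast concrete, here is a minimal sketch of a feature-based policy as code. The feature names and the decision rule are illustrative assumptions, not Zentropi's actual policy — the point is that every input is something a moderator can observe in the content itself, with no prediction of user reactions anywhere:

```python
from dataclasses import dataclass

# Illustrative sketch only: these features and this rule stand in for a
# real policy. Every field is an observable property of the content.

@dataclass
class Assessment:
    targets_participant: bool     # directed at someone in the conversation?
    hostile_language: bool        # combative, belittling, or condescending?
    criticism_of_work_only: bool  # strong language about work product, not people

def label(a: Assessment) -> str:
    """Apply the policy: each determination is a feature check the
    moderator makes from the content, not a behavioral prediction."""
    if a.criticism_of_work_only:
        # Explicit exclusion for legitimate criticism.
        return "allowed"
    if a.targets_participant and a.hostile_language:
        return "violation"
    return "allowed"
```

Two moderators disagreeing about this policy are disagreeing about whether a feature is present — a question the policy text can settle — not about what a hypothetical user would do next.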

Our Approach to Toxicity

For our toxicity policy, we answered "what rhetoric drives users away?" with hostile targeting of conversation participants. We define specific patterns - combative language, belittling, personal attacks, condescension - when directed at people in the conversation, not public figures or other third parties. We also created explicit exclusions for legitimate criticism, so strong language about work product doesn't get conflated with hostile language about people.

It's worth acknowledging that this definition means our labeler will perform differently on existing toxicity datasets - by design. We're measuring something specific (hostile targeting of participants) rather than trying to predict general "toxicity," so benchmark performance will reflect that definitional difference.

But that difference is the entire point. Our policy is simply *our* answer to the question "what content is toxic?" If you believe different rhetoric is likely to make people leave conversations, you can fork our toxicity labeler and adapt the definition - add different patterns, remove the participant-targeting restriction, adjust the exclusions. The key is that you're defining observable rhetoric, not asking moderators to predict outcomes. The methodology is what matters, not our specific choices. 

Operational Benefits

When your policy defines observable features rather than asking moderators to predict reactions, three things change:

1. Quality control becomes possible - you can audit whether moderators correctly identified defined features with ground truth in the policy text, not competing predictions about unknowable behavior.

2. Rule changes become predictable - you can identify exactly which content would change if you add a distinction between participant and non-participant targeting, test edge cases, and debate outcomes before implementing, rather than running experiments to see if wording shifts moderator intuitions. 

3. Consistency becomes achievable - moderators can debate the interpretation of rule text and reach consensus on observable features, rather than each developing their own answer about predicted outcomes based on personal intuitions.
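The first benefit — auditing against ground truth in the policy text — reduces QA to a simple agreement measurement. A minimal sketch (the function and label values are hypothetical, not part of any Zentropi tooling):

```python
# Hedged sketch: with feature-based policies, quality control is
# comparing moderator labels to policy-derived answers, not adjudicating
# competing predictions about user behavior.

def audit(moderator_labels: list[str], ground_truth: list[str]) -> float:
    """Fraction of cases where the moderator matched the policy answer."""
    assert len(moderator_labels) == len(ground_truth)
    matches = sum(m == g for m, g in zip(moderator_labels, ground_truth))
    return matches / len(ground_truth)
```

When a moderator's score is low, the disagreement cases point directly at either a training gap or an ambiguity in the policy text — both of which are fixable.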

Try It Yourself

Check out the toxicity labeler (toxicity-public-s5) on Zentropi, which you can integrate with your platform instantly using our free API. Browse the full policy to see how defining observable features creates different outcomes than behavioral prediction. All seven policies are available to browse and fork for your specific community context.
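For teams planning an integration, the request shape below is a hypothetical sketch of calling a hosted labeler over HTTP — the endpoint URL, field names, and auth header are assumptions for illustration, so check the actual API documentation on zentropi.ai before wiring anything up:

```python
import json

# Hypothetical request builder. "api.zentropi.ai", the path, and the
# payload fields are assumed for illustration, not the documented API.

def build_label_request(labeler: str, text: str, api_key: str) -> dict:
    return {
        "url": f"https://api.zentropi.ai/v1/labelers/{labeler}/label",  # assumed
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"content": text}),
    }
```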

13.11.2025 22:20 · Observations on Toxicity
https://blog.zentropi.ai/observa...

Zentropi: Build Your Own Content Labeler in Minutes, Not Months

https://blog.zentropi.ai/zentrop...

After previewing Zentropi to broad acclaim at TrustCon, we are officially opening up our build-your-own-content-labeler platform to everyone. Go to zentropi.ai to check it out!

The Universal Challenge

Zentropi tackles a critical challenge that all teams face when safeguarding their digital platforms: how to label content quickly, cheaply, accurately—and by their own rules. This is a universal problem, whether you're building content moderation systems, aligning new AI models, deploying a chatbot, or creating new feed algorithms.

Traditional approaches have involved either training machine learning classifiers (which is time-consuming and brittle), or trying to use large language models (which are expensive and often inaccurate). Policy changes end up requiring weeks of model retraining, and only organizations with significant ML resources can build effective systems. Meanwhile, systems need protection today, not months from now.

To make matters worse, it's extremely difficult for teams to even articulate what policy they want their labelers to follow, so they're typically forced into using off-the-shelf classifiers that impose their own definitions.

Our Innovation: Policy-First Architecture

Zentropi addresses all these issues with a fundamentally different approach. We've created a unified system that tackles both sides of the coin: accurately interpreting a given set of policy criteria as well as helping people craft policies that are machine interpretable.

Instead of teaching a model a specific policy, we developed CoPE—a best-in-class small language model capable of faithful policy interpretation. Your team writes policies in plain English. CoPE then enforces them consistently at scale. As policies evolve, implementation updates instantly.
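The structural shift here is that the policy becomes data rather than model weights, so a policy change is a text edit, not a retraining run. A minimal sketch of that pattern — the prompt template, the example policy wording, and the label parsing are illustrative assumptions, not CoPE's actual interface:

```python
# Sketch of the policy-first pattern: the policy is plain English passed
# to a policy-interpreting model at inference time. Template and labels
# here are assumptions for illustration.

POLICY = """Label content VIOLATION if it contains hostile language
targeting a conversation participant; otherwise label it ALLOWED.
Strong criticism of work product alone is ALLOWED."""

def build_prompt(policy: str, content: str) -> str:
    return (
        f"Policy:\n{policy}\n\n"
        f"Content:\n{content}\n\n"
        "Answer with exactly one word: VIOLATION or ALLOWED."
    )

def parse_label(model_output: str) -> str:
    """Tolerate whitespace and case in the model's one-word reply."""
    word = model_output.strip().upper()
    return word if word in {"VIOLATION", "ALLOWED"} else "UNPARSEABLE"
```

Because `POLICY` is an input rather than something baked into the model, editing it changes the labeler's behavior on the very next request.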

It was not an easy task to get here. After nearly two years of research, we created a novel methodology that enabled us to build CoPE as our engine. We then created an agentic system that helps people rapidly draft, test, and optimize their policies, including suggestions for where both their criteria and their labels could improve.

The Numbers That Matter

The results speak for themselves.

Since CoPE is a small model (only 9B parameters), it's fast and cheap to run at inference time, making it suitable for real-time systems that operate at internet scale.

Real-World Impact

We have just been floored by the feedback we've gotten from our design partners this year. We've seen teams who lack their own in-house policy expertise rapidly create labelers and go from a starting accuracy of ~60% to over 90%—wholly assisted by Zentropi's automated optimizations. And then with one click they've instantly deployed their labelers into production. The consistent feedback: teams can finally focus on policy strategy rather than implementation constraints. This simply was not possible before.

Get Started Today

Community Edition (Free)

Enterprise Edition

Building a More Trustworthy Future

After seeing partners put Zentropi into production and witnessing its real-world impact, we are inspired to make this technology available to as many teams as possible. That is why we are now opening it up to everyone. We want to do our part to help create a world where AI-powered systems are maximally trustworthy. 

The best way to understand the power of Zentropi is to go try it yourself—so head over to zentropi.ai and let us know what you think! For organizations ready to transform the quality of their platforms, our enterprise team is available to discuss your specific requirements. Just drop us a line at info@zentropi.ai.

We look forward to seeing how your organization uses Zentropi to build a better internet!

19.8.2025 15:30 · Zentropi: Build Your Own Content Labeler in Minutes, Not Months
https://blog.zentropi.ai/zentrop...