Toxicity detection: Why we still need human-labeled data
Toxicity detection remains one of the most persistent challenges in NLP for keeping digital spaces and AI technologies safe. The problem becomes even more complex when you're working with a language for which no standard classification benchmarks exist yet.
In 2025, we’re fortunate to have a broader choice of tools than ever before for building language-specific toxicity detectors. You can try translating high-quality datasets from resource-rich languages like English. Or you can tap into the capabilities of large language models (LLMs) to generate or annotate data via prompting. But here’s the real question: how well do these strategies actually perform in practice? Can synthetic data stand in for real, annotated examples? How much training data do you need before your classifier becomes robust enough for real-world use?
In this post, we dive into these questions through the lens of Ukrainian – a language with increasing digital presence but limited NLP infrastructure. We explore practical baselines, test the limits of synthetic data, and share insights into what it takes to build a reliable toxicity detector from scratch. If you are working on non-mainstream languages or just curious about scaling language tech responsibly, this one’s for you.
What do we mean by “toxic”?
Toxicity in language can take many forms, ranging from sarcasm and hate speech to direct personal attacks. In our work, we focus on a more explicit case: instances containing words or phrases commonly recognized as vulgar, obscene or profane. It’s important to note that even if a message contains such language, its overall tone might still be neutral. For simplicity and clarity, we frame this as a binary classification task, labeling each example as either toxic or non-toxic.
Here’s an example of how we’d like our Ukrainian classifier to label sentences as toxic or non-toxic:

How to build a toxicity classifier without labeled Ukrainian data
There are several options for building a toxicity classifier without obtaining a comprehensive labeled dataset in Ukrainian, including methods for synthetic data acquisition and cross-lingual approaches.
Let’s break them down.

Backtranslation
A simple and effective baseline starts with translation. If a solid toxicity classifier already exists in English (or any other resource-rich language), you can translate Ukrainian text to English and run it through the English model. This backtranslation pipeline avoids the need for retraining but depends heavily on external tools—namely, a machine translation system and a robust English classifier—to consistently deliver reliable results.
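A minimal sketch of this pipeline, assuming an off-the-shelf Ukrainian-to-English MT model and an English toxicity classifier from the Hugging Face Hub (the model names below are illustrative, not the exact systems we used):

```python
from transformers import pipeline

# Translate Ukrainian text to English, then score it with an English toxicity
# classifier. Both model names are illustrative placeholders.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-uk-en")
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def classify_ukrainian(text: str) -> dict:
    english = translator(text, max_length=256)[0]["translation_text"]
    # Returns something like {"label": "toxic", "score": 0.97}
    return toxicity(english)[0]
```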
Synthetic training data via translation
Another strategy to reduce long-term reliance on translation systems is to translate the entire English dataset upfront. This gives you a fully synthetic training set in Ukrainian, which can then be used to fine-tune a model directly for the downstream classification task.
The key advantage here? No translation is needed at inference time, meaning the final classifier runs independently and efficiently. However, this method does require additional compute for training, and there is a trade-off: some of the nuanced label signals may get lost or altered in translation, and the resulting data may not always align perfectly with the linguistic specifics of the target language.
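As a rough sketch, assuming an English-to-Ukrainian MT model is available (the checkpoint name and the toy dataset below are placeholders), the whole dataset is translated once and the labels are simply carried over:

```python
from transformers import pipeline

# One-off translation of an English toxicity dataset into Ukrainian.
# The MT checkpoint name is an assumption; labels are copied unchanged.
en_to_uk = pipeline("translation", model="Helsinki-NLP/opus-mt-en-uk")

english_rows = [
    {"text": "Have a nice day!", "label": 0},           # non-toxic
    {"text": "You are a complete idiot.", "label": 1},  # toxic
]

synthetic_ukrainian = [
    {
        "text": en_to_uk(row["text"], max_length=256)[0]["translation_text"],
        "label": row["label"],
    }
    for row in english_rows
]
```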
LLM prompting
Another no-fine-tuning strategy is prompting large language models (LLMs). Thanks to recent progress in generative AI, many classification tasks can now be reframed as text generation. With the right prompt, LLMs can produce direct predictions, either in a zero-shot setup (with no examples) or few-shot (with a couple of examples). While LLMs have been explored for multilingual hate speech detection, Ukrainian remains largely underrepresented in such evaluations.
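Here is what a zero-shot prompt might look like; the client, model name, and prompt wording are illustrative, since the exact prompts behind our Mistral and LLaMA runs are not reproduced here:

```python
from openai import OpenAI

# Zero-shot toxicity classification via prompting. The provider and model
# name are placeholders; any instruction-tuned LLM could be used instead.
client = OpenAI()

PROMPT = (
    "You are a content moderator. Classify the following Ukrainian sentence "
    "as 'toxic' or 'non-toxic'. Answer with a single word.\n\nSentence: {text}"
)

def classify_with_llm(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```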
Adapter-based training
For a lightweight fine-tuning option, adapters offer a smart middle ground. You attach a small task adapter to a frozen multilingual language model and train it on English data, with an English language adapter plugged in. At inference time, you simply swap in a Ukrainian language adapter and apply the model to Ukrainian inputs. The solution is modular, efficient, and tailored to the language without retraining the whole model.
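A MAD-X-style sketch with the AdapterHub `adapters` library looks roughly like this; the adapter identifiers and hyperparameters are assumptions rather than our exact setup:

```python
from adapters import AutoAdapterModel
from adapters.composition import Stack

# Frozen multilingual backbone with swappable language adapters.
model = AutoAdapterModel.from_pretrained("xlm-roberta-base")

# Pretrained language adapters (identifiers on AdapterHub may differ).
en_lang = model.load_adapter("en/wiki@ukp")
uk_lang = model.load_adapter("uk/wiki@ukp")

# Task adapter and classification head, trained on English toxicity data
# while the backbone and language adapters stay frozen.
model.add_adapter("toxicity")
model.add_classification_head("toxicity", num_labels=2)
model.train_adapter("toxicity")
model.active_adapters = Stack(en_lang, "toxicity")
# ... run the usual Trainer loop on the English dataset here ...

# Zero-shot transfer: keep the task adapter, swap the language adapter.
model.active_adapters = Stack(uk_lang, "toxicity")
```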
Semi-synthetic dataset via toxic keyword filtering
Another way to obtain synthetic data for toxicity classification is to filter samples by toxic keywords. We applied keyword-based filtering to a large Ukrainian Twitter corpus, using a curated list of known toxic terms.
For the non-toxic side of the dataset, we selected tweets that contained none of the flagged keywords, along with additional clean text samples sourced from Ukrainian news and fiction texts included in the UD Ukrainian IU dataset. This gave us a semi-synthetic, balanced dataset suitable for initial training and evaluation.
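In essence, the filtering boils down to a keyword match over the corpus; the lexicon below is a tiny placeholder for the curated list we actually used:

```python
import re

# Split tweets into toxic / non-toxic candidates based on a toxic lexicon.
TOXIC_KEYWORDS = {"badword1", "badword2"}  # placeholder for the curated list

keyword_re = re.compile(
    r"\b(" + "|".join(map(re.escape, TOXIC_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)

def split_by_keywords(tweets):
    toxic, non_toxic = [], []
    for tweet in tweets:
        (toxic if keyword_re.search(tweet) else non_toxic).append(tweet)
    return toxic, non_toxic
```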
Collecting real-world toxicity labeled data via Toloka
Finally, to gather real human-labeled toxicity data, we turned to crowdsourcing using the Toloka platform. Starting with a Ukrainian Twitter corpus, we preprocessed the texts by removing URLs and usernames, then filtered out very short (under 5 words) or overly long (over 20 words) tweets.
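The preprocessing itself is straightforward; a minimal version of the cleaning and length filter might look like this:

```python
import re
from typing import Optional

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")

def preprocess(tweet: str) -> Optional[str]:
    # Strip URLs and @usernames, then keep only tweets of 5 to 20 words.
    text = USER_RE.sub("", URL_RE.sub("", tweet)).strip()
    return text if 5 <= len(text.split()) <= 20 else None
```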
Toloka provided high-quality real-world toxicity labels from Ukrainian-speaking contributors.

Each annotation task included 9 real tweets, 2 control questions with known answers, and 1 training example with both a solution and an explanation. We also applied several quality checks to monitor annotator performance.
So, which is better? Translation vs. LLMs vs. fine-tuning with real-world data
We compared the three types of datasets we obtained for training classifiers: translated, semi-synthetic, and crowdsourced.
A comparison of the obtained synthetic, semi-synthetic, and real-world dataset sizes:

Comparing dataset sizes and class balance
Our datasets vary significantly in size depending on the data source. The synthetically translated dataset (from English to Ukrainian) contains several tens of thousands of examples, offering a large-scale training set. The semi-synthetic dataset, created via keyword filtering, is more moderate in size, with roughly 12,000 instances. In contrast, our crowdsourced dataset is considerably smaller, totaling around 5,000 human-annotated samples.
Despite these differences, we ensured that all datasets—regardless of their origin—are carefully balanced across toxic and non-toxic labels within their respective train, validation, and test splits. This design choice helps maintain fairness and comparability in our evaluations.
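One simple way to enforce this balance, sketched here with scikit-learn, is stratified splitting so that each split keeps the same toxic / non-toxic ratio:

```python
from sklearn.model_selection import train_test_split

def balanced_splits(texts, labels, seed=42):
    # 80% train, 10% validation, 10% test, stratified by label.
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed
    )
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```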
However, with such vast differences in size, which data produces the most robust results?
Results:

The results are in: Real annotations still win
Among the approaches that don’t require fine-tuning, backtranslation and adapter training emerge as strong baselines. Interestingly, Mistral outperforms LLaMA on the semi-synthetic test set, yet struggles on both the translated and crowdsourced datasets, especially the latter, which best reflects real-world Ukrainian toxicity. LLMs, it seems, still struggle with culture-dependent tasks in non-English languages.
When fine-tuned on the translated data, the XLM-R model still performs well, although its scores drop on the crowdsourced test set. This could be due to a loss of nuance during translation, with some originally toxic content becoming neutral in Ukrainian. Still, training on translated data proves to be a reliable fallback, showing resilience across datasets and maintaining decent out-of-domain performance.
The clearest takeaway comes from fine-tuning on human-annotated data. A model like XLM-R, trained on just a few thousand crowdsourced examples, achieves near-perfect performance on both in-domain and out-of-domain test sets. This highlights a key point: real-world, manually labeled data—no matter the scale—captures contemporary language use, cultural nuance, and subjective cues far better than synthetic alternatives. This is especially vital for tasks like toxicity detection, where context and societal understanding matter.
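For reference, a minimal fine-tuning recipe for XLM-R on the crowdsourced dataset might look like the sketch below; the hyperparameters and column names are assumptions, not the exact configuration behind the reported numbers:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Crowdsourced Ukrainian toxicity labels (column names assumed: "text", "label").
dataset = load_dataset("ukr-detect/ukr-toxicity-dataset")

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="xlmr-uk-toxicity",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```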
Conclusion
While strong baselines are achievable, even a few thousand real, crowd-labeled examples significantly enhance the robustness and cultural relevance of toxicity classifiers.
If you're curating multilingual datasets and need high-quality, human-annotated data, connect with Toloka.
You can also explore the datasets and resources from this project.
Open-source data and models on Hugging Face
Space for the project: https://huggingface.co/ukr-detect
Synthetic translated data: ukr-detect/ukr-toxicity-dataset-translated-jigsaw
Semi-synthetic filtered data: ukr-detect/ukr-toxicity-dataset-seminatural
Crowdsourced data: https://huggingface.co/datasets/ukr-detect/ukr-toxicity-dataset
Best Ukrainian Toxicity Classifier: https://huggingface.co/ukr-detect/ukr-toxicity-classifier
Corresponding publications
Daryna Dementieva, Valeriia Khylenko, Nikolay Babakov, and Georg Groh. 2024. Toxicity Classification in Ukrainian. In Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024), pages 244–255, Mexico City, Mexico. Association for Computational Linguistics.
Cross-lingual Text Classification Transfer: The Case of Ukrainian