Building production-ready RAG systems requires more than great retrieval; it requires data you can trust.
Haystack has become a leading framework for building retrieval-augmented generation (RAG) pipelines, giving teams the tools to ingest, index, and query unstructured data at scale. But as organizations move from prototypes to production, a critical gap emerges: the data flowing into those pipelines is often sensitive, unstructured, and not ready for safe use in AI systems.
That’s where Tonic Textual comes in.
With this integration, teams can now combine Haystack’s flexible RAG orchestration with Textual’s ability to detect, redact, and generate privacy-safe text—ensuring that the data entering your pipeline is both usable and compliant from the start.
Every RAG pipeline starts with ingestion. Documents come in as PDFs, support tickets, clinical notes, and customer records, then get chunked, embedded, and stored in a vector database.
This creates two problems that are hard to solve after the fact:
The first is compliance. Once PII is embedded in a vector store, you can't easily remove it. GDPR right-to-erasure requests, HIPAA de-identification requirements, PCI cardholder data rules — these all require that sensitive data be handled before it becomes entangled in your retrieval infrastructure. Redacting a source document after its chunks are already embedded doesn't fix the embeddings. You need to clean the data before ingestion.
The second is retrieval quality. Dense embeddings encode everything in the text, including PII that's irrelevant to the semantic content you actually want to retrieve on. A clinical note about a specific diagnosis will produce different embeddings depending on which patient name appears in it. Entity extraction gives you structured metadata — names, organizations, locations, dates — that you can use to build hybrid retrieval strategies, filter results, or enrich chunks with faceted metadata that improves precision without relying solely on semantic similarity.
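To make the retrieval-quality point concrete, here is a minimal sketch, in plain Python with no Haystack dependency, of how entity metadata attached at ingestion time can narrow a candidate set before or after semantic ranking. The chunk dicts and entity values are illustrative, not output from any real pipeline:

```python
# Illustrative chunks with entity metadata attached at ingestion time.
chunks = [
    {"text": "Note about hypertension follow-up.", "entities": {"ORGANIZATION": ["Mercy Clinic"]}},
    {"text": "Invoice for consulting services.", "entities": {"ORGANIZATION": ["Acme Corp"]}},
    {"text": "Note about hypertension medication.", "entities": {"ORGANIZATION": ["Mercy Clinic"]}},
]

def filter_by_entity(chunks, entity_type, value):
    """Keep only chunks whose metadata lists `value` under `entity_type`."""
    return [c for c in chunks if value in c["entities"].get(entity_type, [])]

# Exact-match filtering on an extracted entity, independent of embeddings.
matches = filter_by_entity(chunks, "ORGANIZATION", "Mercy Clinic")
print(len(matches))  # 2
```

The same pattern generalizes to any entity type: the filter gives you exact-match precision, while the embeddings handle semantic similarity.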
These aren't niche concerns. They show up in every organization building RAG over sensitive data, from clinical notes and support tickets to customer records and internal documents.
Regex and rule-based approaches can catch an email that looks like user@example.com, but they fall apart on context-dependent PII. Is "Jordan" a person's name or a country? Is "April" a name or a month? Is "1600 Pennsylvania Avenue" a location that needs redaction or a well-known reference? You need NER models that understand context, not pattern matching.
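A toy example makes the limitation obvious. The regex below is illustrative, not a production pattern: it finds the email reliably, but it has no way to classify the ambiguous words around it:

```python
import re

# A simple email pattern (illustrative, not exhaustive).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

text = "Email Jordan at jordan@example.com before April."

# The regex reliably finds the email address...
print(EMAIL_RE.findall(text))  # ['jordan@example.com']

# ...but no pattern can tell us whether "Jordan" is a person or a country,
# or whether "April" is a name or a month. That decision requires context,
# which is what a transformer-based NER model provides.
```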
Tonic Textual provides exactly this — transformer-based PII detection and transformation across text, JSON, HTML, PDFs, images, and tabular data. With textual-haystack, those capabilities are now available as native Haystack pipeline components.
Textual's NER model identifies 46+ entity types across 50+ languages. These aren't just the obvious ones like email addresses and phone numbers. The model detects names (given and family, separately), dates of birth, occupations, healthcare IDs, routing numbers, IP addresses, and more — with a confidence score for each detection.
What happens after detection is where Textual differentiates itself. There are three capabilities:
Synthesis replaces PII with realistic fake data that preserves the structure and statistical properties of the original:
```
Input:  "Patient John Smith, DOB 03/15/1982, MRN 12345678"
Output: "Patient Maria Chen, DOB 07/22/1975, MRN 87654321"
```

Synthesized data keeps downstream analytics, embeddings, and retrieval valid. A synthesized name is still a plausible name in the right position in the sentence. A synthesized date has the right format. The semantic structure of the document is preserved while the identifying information is replaced. This matters for RAG — placeholder tokens like [NAME_GIVEN_xxxx] distort the embeddings your retriever learns from, while synthesized replacements preserve the natural language distribution.
Tokenization replaces detected PII with labeled placeholders:
```
Input:  "Patient John Smith, DOB 03/15/1982, MRN 12345678"
Output: "Patient [NAME_GIVEN_a1b2] [NAME_FAMILY_c3d4], DOB [DOB_e5f6], MRN [HEALTHCARE_ID_g7h8]"
```

The placeholders are tagged with their entity type and a consistent identifier, so you can track which replacements correspond to the same original entity across a document. This is the safest option when you need to guarantee that no real PII appears in the output.
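The consistency property can be illustrated with a toy tokenizer (not Textual's actual algorithm): the same original value always maps to the same placeholder, so co-references survive redaction:

```python
import hashlib

def placeholder(entity_type: str, value: str) -> str:
    """Toy deterministic placeholder: the same value always gets the same tag."""
    tag = hashlib.sha256(value.encode()).hexdigest()[:4]
    return f"[{entity_type}_{tag}]"

a = placeholder("NAME_GIVEN", "John")
b = placeholder("NAME_GIVEN", "John")  # same entity later in the document
c = placeholder("NAME_GIVEN", "Jane")  # different entity, different tag

print(a == b, a == c)  # True False
```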
Entity extraction returns the raw detections without modifying the text:
```
Input: "Patient John Smith, DOB 03/15/1982, MRN 12345678"
Output: [
  {"entity": "NAME_GIVEN", "text": "John", "start": 8, "end": 12, "score": 0.95},
  {"entity": "NAME_FAMILY", "text": "Smith", "start": 13, "end": 18, "score": 0.95},
  {"entity": "DOB", "text": "03/15/1982", "start": 24, "end": 34, "score": 0.90},
  {"entity": "HEALTHCARE_ID", "text": "12345678", "start": 40, "end": 48, "score": 0.85}
]
```

Entity extraction is useful when you need to know what PII is present — for auditing, for building structured metadata that improves retrieval, or for making per-document decisions about how to handle the data.
textual-haystack provides two Haystack components:

- TonicTextualDocumentCleaner, which redacts, synthesizes, or tokenizes PII in document content
- TonicTextualEntityExtractor, which detects PII and stores the results as document metadata

Both are native Haystack @component classes. They accept list[Document] as input and return list[Document] as output, so they slot directly into any Haystack pipeline. Installation is a single line:
```
pip install textual-haystack
```
The document cleaner transforms PII in document content before it reaches downstream components like splitters, embedders, and document stores. It produces new Document instances with cleaned content — the originals are never mutated.
```python
from haystack.dataclasses import Document
from haystack_integrations.components.tonic_textual import TonicTextualDocumentCleaner

cleaner = TonicTextualDocumentCleaner(generator_default="Synthesis")
result = cleaner.run(documents=[
    Document(content="Patient John Smith, DOB 03/15/1982, was admitted for chest pain.")
])
print(result["documents"][0].content)
# "Patient Maria Chen, DOB 07/22/1975, was admitted for chest pain."
```

The clinical content — "admitted for chest pain" — is preserved. The identifying information is replaced with synthetic data that maintains the same structure. When this document is chunked and embedded, the resulting vectors capture the medical semantics without encoding any real patient data.
In practice, you often want different handling for different entity types. Names might be safe to synthesize, but SSNs should always be tokenized. Organization names might be left alone entirely.
generator_config provides this control:
```python
cleaner = TonicTextualDocumentCleaner(
    generator_default="Off",
    generator_config={
        "NAME_GIVEN": "Synthesis",
        "NAME_FAMILY": "Synthesis",
        "DOB": "Synthesis",
        "US_SSN": "Redaction",
        "EMAIL_ADDRESS": "Redaction",
    },
)
```

This configuration synthesizes names and dates of birth (preserving natural language flow for better embeddings), tokenizes SSNs and emails (for maximum safety), and leaves everything else untouched. The generator_default of "Off" means any entity type not listed in generator_config passes through unchanged.
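The resolution rule is worth internalizing; conceptually it is just a dictionary lookup with a default. This sketch is illustrative, not Textual's source:

```python
generator_default = "Off"
generator_config = {
    "NAME_GIVEN": "Synthesis",
    "US_SSN": "Redaction",
}

def handling_for(entity_type: str) -> str:
    """Per-entity handling falls back to generator_default when unlisted."""
    return generator_config.get(entity_type, generator_default)

print(handling_for("US_SSN"))        # Redaction
print(handling_for("ORGANIZATION"))  # Off (not listed, so the default applies)
```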
The entity extractor detects PII in document content and stores the results as structured metadata in doc.meta["named_entities"]. The document content itself is not modified.
```python
from haystack.dataclasses import Document
from haystack_integrations.components.tonic_textual import TonicTextualEntityExtractor

extractor = TonicTextualEntityExtractor()
result = extractor.run(documents=[
    Document(content="Contact Jane Doe at jane@example.com or (555) 867-5309.")
])

for entity in TonicTextualEntityExtractor.get_stored_annotations(result["documents"][0]):
    print(f"{entity.entity}: {entity.text} (confidence: {entity.score:.2f})")
# NAME_GIVEN: Jane (confidence: 0.95)
# NAME_FAMILY: Doe (confidence: 0.95)
# EMAIL_ADDRESS: jane@example.com (confidence: 0.90)
# PHONE_NUMBER: (555) 867-5309 (confidence: 0.90)
```

Each annotation is a PiiEntityAnnotation dataclass with entity (the PII type label), text (the detected value), start and end (character offsets), and score (confidence).
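Annotations stored in metadata can be flattened into simple fields for downstream filtering. A plain-Python sketch, using (entity type, text) pairs that mirror the example detections above:

```python
from collections import defaultdict

# (entity_type, detected_value) pairs, mirroring the extraction output above.
annotations = [
    ("NAME_GIVEN", "Jane"),
    ("NAME_FAMILY", "Doe"),
    ("EMAIL_ADDRESS", "jane@example.com"),
    ("PHONE_NUMBER", "(555) 867-5309"),
]

def to_meta(annotations):
    """Group detected values by entity type, ready for faceted filtering."""
    meta = defaultdict(list)
    for entity_type, value in annotations:
        meta[entity_type].append(value)
    return dict(meta)

meta = to_meta(annotations)
print(meta["EMAIL_ADDRESS"])  # ['jane@example.com']
```

Attached to a document's metadata, a structure like this lets a retriever filter on exact entity values rather than relying on embeddings alone.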
Dense retrieval works well for semantic similarity, but it struggles with precision when you need to find documents about a specific person, organization, or date. Entity extraction gives you structured metadata you can use to filter results, build faceted indexes, or combine semantic and entity-based retrieval in hybrid strategies.
Here's a complete pipeline that cleans documents and then extracts entities from the cleaned text — a common pattern when you want both PII-safe content and structured metadata:
```python
from haystack import Pipeline
from haystack.dataclasses import Document
from haystack_integrations.components.tonic_textual import (
    TonicTextualDocumentCleaner,
    TonicTextualEntityExtractor,
)

pipeline = Pipeline()
pipeline.add_component(
    "cleaner",
    TonicTextualDocumentCleaner(generator_default="Synthesis"),
)
pipeline.add_component("extractor", TonicTextualEntityExtractor())
pipeline.connect("cleaner", "extractor")

documents = [
    Document(content="Patient John Smith, DOB 03/15/1982, MRN 12345678."),
    Document(content="Invoice for Acme Corp, attn: Bob Jones, bob@acme.com."),
]

result = pipeline.run({"cleaner": {"documents": documents}})

for doc in result["extractor"]["documents"]:
    entities = TonicTextualEntityExtractor.get_stored_annotations(doc)
    print(f"\nCleaned: {doc.content}")
    print(f"Entities: {[(e.entity, e.text) for e in entities]}")
```

The cleaner runs first, replacing real PII with synthesized data. The extractor then runs on the cleaned text, producing entity metadata from the synthetic values. The result is documents with PII-safe content and structured entity metadata — ready for chunking, embedding, and storage.
You can also use either component independently. The cleaner alone is sufficient for ingestion pipelines where you just need PII-safe content. The extractor alone is useful when you want entity metadata without modifying the source documents — for example, in an analytics pipeline where you need to catalog what PII exists across a document corpus.
Both components follow Haystack's conventions:

- @component decorated: no base class inheritance, just the decorator and a run() method.
- Standard signature: run(documents: list[Document]) -> {"documents": list[Document]}.
- Immutable by default: both return new Document instances rather than modifying inputs. The cleaner uses dataclasses.replace() to produce documents with transformed content; the extractor uses replace() to produce documents with enriched metadata.
- Lazy initialization: the warm_up() method initializes the Tonic Textual client once, on first use.
- Serializable: to_dict() and from_dict() using Haystack's default_to_dict/default_from_dict, with Secret for API key handling.

The Tonic Textual client is shared across calls within a component instance. The API key is read from the TONIC_TEXTUAL_API_KEY environment variable by default, or can be passed explicitly via Haystack's Secret:
```python
from haystack.utils.auth import Secret

cleaner = TonicTextualDocumentCleaner(
    api_key=Secret.from_token("your-api-key"),
    base_url="https://textual.your-company.com",  # for self-hosted deployments
)
```

This initial release covers the two highest-impact use cases for Haystack pipelines: document cleaning before ingestion and entity extraction for metadata enrichment. We're continuing to expand capabilities.
The package is open source under the MIT license and available to all Tonic Textual users (start a free trial here).
Self-contained examples for both components live in the examples/ directory of the repo.
To get started:
```
pip install textual-haystack
export TONIC_TEXTUAL_API_KEY="your-api-key"
```

```python
from haystack.dataclasses import Document
from haystack_integrations.components.tonic_textual import (
    TonicTextualDocumentCleaner,
    TonicTextualEntityExtractor,
)

# Clean PII before ingestion
cleaner = TonicTextualDocumentCleaner(generator_default="Synthesis")

# Or extract entities for metadata enrichment
extractor = TonicTextualEntityExtractor()
```

For the full Tonic Textual platform — including the web UI, dataset management, custom entity training, and enterprise deployment — learn more or start a free trial.
Adam Kamor, Co-founder and Head of Engineering at Tonic.ai, leads the development of synthetic data solutions that enable AI and development teams to unlock data safely, efficiently, and at scale. With a Ph.D. in Physics from Georgia Tech, Adam has dedicated his career to the intersection of data privacy, AI, and software engineering, having built developer tools, analytics platforms, and AI validation frameworks at companies such as Microsoft, Kabbage, and Tableau. He thrives on solving complex data challenges, transforming raw, unstructured enterprise data into high-quality fuel for AI & ML model training, to ultimately make life easier for developers, analysts, and AI teams.