Unified Search & Discovery

DNXT Publisher Suite — Search & Discovery

Find Any Document in Your Regulatory Library in Under 5 Seconds

DNXT combines Apache Lucene full-text search with AI-powered semantic retrieval so regulatory teams stop hunting through folders and start finding the right document, version, and context — instantly, across every dossier, submission, and client tenant.

Example search: "drug substance stability data ICH Q1A" (AI + Full-Text, 0.38s)
Filters: All Dossiers · Module 3 · CTD Section · Approved · 2022–2024
Showing 847 results across 23 active dossiers, sorted by relevance
NDA-204892 / Module 3 / 3.2.S.7 / Stability
AI Match 98% relevance
3.2.S.7.1 Stability Summary — Drug Substance Long-Term Data (24-Month)
ICH Q1A-compliant stability study showing drug substance degradation profiles under accelerated and long-term conditions. Data meets shelf-life specification criteria...
IND-2023-187 / Module 3 / 3.2.S.7
94% relevance
Drug Substance Stability Protocol — Forced Degradation Studies
Thermal and photolytic stability data for drug substance API batch DS-047. Validated per ICH Q1B guidance with photostability outcomes documented...
BLA-125742 / Module 3 / 3.2.P.8
AI Match 87% relevance
Stability Summary Report — Drug Product Formulation Batch P-003
Comparative stability analysis between drug product batches under ICH-aligned protocols. Shelf-life specification met at 36-month interval...
<500ms: average search response across full dossier libraries
All: dossiers, submissions, TOC nodes, and document content searched simultaneously
2-in-1: AI semantic search + Apache Lucene full-text running in parallel on every query
<10s: time from document upload to fully searchable, with background auto-indexing

Who This Is Built For

Every regulatory role has a different search problem. DNXT Unified Search is engineered to solve all of them — without compromising on precision, permissions, or compliance.

🗂️
Regulatory Affairs Director
Mid-Size Pharma / Biotech
You manage 12 active INDs and 3 NDA programs across two therapeutic areas. When a reviewer asks for "all stability data referenced in our NDA submissions from 2022 onward," you spend 45 minutes asking three team members to compile the list — and still aren't sure you got everything. Audit-readiness in this environment is reactive, not proactive.
  • Instantly surface all documents matching a substance, a Module section, or a regulatory agency across every active program
  • Audit-ready search history: every query, filter, and result is logged and traceable
  • Faceted filters let you narrow by program, submission type, date range, approval status, and document type in seconds
  • AI semantic matching catches documents that reference the same concept under different terminology — no more missed precedents
🔬
Senior Regulatory Scientist
Biotech — NDA / BLA Pipeline
You're authoring Module 3.2.S.7 and need to locate the stability study that was used in a previous IND for the same compound two years ago — a study you didn't author yourself. You know it exists. You don't know which folder version it's in, and the colleague who managed it has left the company. You end up re-requesting the study from CMC, burning a week you didn't have.
  • Full document content search — finds text inside PDFs, DOCX, and XML files, not just file names
  • TOC node search locates content by its position in the eCTD structure, even across multiple submission sequences
  • Semantic search understands intent — searching "API degradation under heat" finds documents titled "Thermal Forced Degradation Study" even without keyword overlap
  • Version-aware results show the most current approved document, with prior versions one click away
🌐
VP Regulatory Operations
Contract Research Organization (CRO)
You manage regulatory work for 18 pharma and biotech clients. Each client is a separate tenant. When your team wants to reuse a precedent document or a validated analytical method developed for Client A on Client B's submission, they have to log into a separate account, find the document manually, re-download it, and re-upload it, every time. Cross-client search is operationally impossible, so reuse doesn't happen, and margins suffer.
  • Global CRO search spans all client tenants in a single query — permission-enforced so confidential data stays isolated
  • Find reusable analytical methods, validated procedures, and prior submissions across your entire client portfolio without switching accounts
  • Reduce duplicated work across clients: see at a glance whether a Module 3 component already exists before authoring from scratch
  • Search results tagged by client, program, and jurisdiction — no cross-contamination of sensitive sponsor data

How It Works

Unified Search isn't a simple database query. It is a multi-layer retrieval system engineered for the specific characteristics of regulatory documents — structured eCTD metadata, unstructured clinical text, and everything in between.

1
Document Ingestion

Every Uploaded Document Triggers Immediate Background Indexing

The moment a document is uploaded or modified in DNXT — whether it's a PDF study report, an XML eCTD leaf file, or a DOCX module section — an asynchronous indexing job is queued automatically. No manual trigger, no batch window, no waiting until the next business day. The indexing pipeline extracts raw text using layout-aware PDF parsing (preserving tables, headers, and section structure) and strips eCTD XML into its constituent text nodes. This ensures that within 10 seconds of upload, the document is fully searchable.
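For readers who think in code, the upload-then-index flow can be pictured as a background job queue. The following is a minimal sketch, not DNXT's implementation: the names `index_jobs`, `searchable`, `extract_text`, and `upload` are hypothetical, the queue is in-memory rather than durable, and the extractor is a placeholder for real format-specific parsing.

```python
import queue
import threading

# Hypothetical in-memory stand-ins; a production pipeline would use a durable
# job queue and a real, format-aware text extractor.
index_jobs = queue.Queue()
searchable = {}  # doc_id -> extracted text


def extract_text(raw: bytes) -> str:
    """Placeholder extractor: real parsing would depend on the file format."""
    return raw.decode("utf-8", errors="replace")


def index_worker() -> None:
    """Drain the queue in the background so uploads never wait on indexing."""
    while True:
        job = index_jobs.get()
        if job is None:  # sentinel value: shut the worker down
            index_jobs.task_done()
            break
        doc_id, raw = job
        searchable[doc_id] = extract_text(raw)
        index_jobs.task_done()


def upload(doc_id: str, raw: bytes) -> None:
    """Enqueue an indexing job and return immediately."""
    index_jobs.put((doc_id, raw))


worker = threading.Thread(target=index_worker, daemon=True)
worker.start()

upload("stability-summary", b"Long-term stability data, 24-month")
index_jobs.join()  # demo only: wait until the background job has finished
index_jobs.put(None)
worker.join()
```

The point of the design is that `upload` returns as soon as the job is queued; the document becomes searchable when the worker finishes, independent of the user's session.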

2
Full-Text Indexing — Apache Lucene

Apache Lucene Indexes Every Word, Section, and Metadata Field

Extracted document text is passed to an Apache Lucene index. Lucene is the battle-tested search library that Elasticsearch and Solr are built on. It tokenizes content, applies stemming (so "stability" matches "stabilize" and "stabilization"), and builds an inverted index across all document content, filenames, CTD section codes, submission identifiers, substance names, and custom metadata fields. When you execute a keyword search, Lucene returns results in milliseconds using BM25 relevance scoring, the industry-standard probabilistic ranking model that weighs term frequency, how rare a term is across the library, and document length to surface the most relevant matches first.
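To make the inverted index and BM25 concrete, here is a small pure-Python sketch of the textbook formula. It is an illustration only: it does not call Lucene, the three sample documents are invented, tokenization is a plain whitespace split with no stemming, and `k1=1.2` and `b=0.75` are commonly used defaults rather than confirmed DNXT settings.

```python
import math
from collections import Counter, defaultdict

# Invented sample documents, for illustration only.
docs = {
    "d1": "drug substance stability summary long term stability data",
    "d2": "drug product dissolution method validation",
    "d3": "forced degradation study drug substance",
}

# Whitespace tokenization only; a real analyzer would also lowercase and stem.
tokens = {doc_id: text.split() for doc_id, text in docs.items()}
avg_len = sum(len(t) for t in tokens.values()) / len(tokens)

# Inverted index: term -> {doc_id: term frequency in that document}
inverted = defaultdict(dict)
for doc_id, toks in tokens.items():
    for term, tf in Counter(toks).items():
        inverted[term][doc_id] = tf


def bm25(query: str, k1: float = 1.2, b: float = 0.75):
    """Score every document containing at least one query term."""
    scores = defaultdict(float)
    n_docs = len(docs)
    for term in query.split():
        postings = inverted.get(term, {})
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Longer-than-average documents are penalized through `norm`.
            norm = k1 * (1 - b + b * len(tokens[doc_id]) / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranked = [doc_id for doc_id, _ in bm25("drug substance stability")]
```

In this toy library `d1` ranks first because it contains the rarest query term ("stability") twice, while `d2` ranks last because it only shares the very common term "drug".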

3
Semantic Indexing — RAG Embeddings

AI Encodes Document Meaning as Vector Embeddings for Semantic Retrieval

In parallel with Lucene indexing, each document chunk is passed through an embedding model that converts the text into a high-dimensional vector embedding, a mathematical representation of its semantic meaning. These vectors are stored in a dedicated vector database. When you search "drug substance degradation under thermal stress," the system encodes your query into the same embedding space and retrieves documents whose meaning is geometrically closest, regardless of whether they use your exact words. This is the retrieval stage of Retrieval-Augmented Generation (RAG), applied specifically to the regulatory domain: the model has been fine-tuned on pharmaceutical and regulatory text so it understands that "forced degradation" and "stress testing" are conceptually equivalent in this context.
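The "geometrically closest" idea can be shown with a few lines of Python. This is a toy sketch under loud assumptions: the three-dimensional vectors below are hand-written stand-ins for real embeddings (which have hundreds of dimensions and come from an encoder model), and the linear scan replaces whatever nearest-neighbor index a vector database would use.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-written for illustration only.
chunk_vectors = {
    "Thermal Forced Degradation Study": [0.9, 0.1, 0.2],
    "Dissolution Method Validation": [0.1, 0.9, 0.1],
    "Cover Letter": [0.0, 0.2, 0.9],
}

# Pretend encoding of "drug substance degradation under thermal stress".
query_vector = [0.8, 0.2, 0.1]


def nearest(query, vectors, k=2):
    """Return the k chunk titles most similar to the query vector."""
    scored = [(title, cosine(query, vec)) for title, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


top = nearest(query_vector, chunk_vectors)
```

The query shares no words with "Thermal Forced Degradation Study", yet that chunk ranks first because its vector points in nearly the same direction as the query's.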

4
Query Execution — Hybrid Retrieval

Each Search Runs Both Engines Simultaneously and Merges Results

When a user submits a query, DNXT fires it against both the Lucene index and the vector database concurrently. Lucene returns exact and stemmed matches ranked by BM25 score. The vector database returns semantically similar documents ranked by cosine similarity. A reciprocal rank fusion algorithm merges both result sets into a single ranked list, giving appropriate weight to both precision (exact matches) and recall (semantic matches). Documents that score highly on both systems are elevated — indicating that they are both literally and semantically relevant to the query. The merged result is returned to the UI in under 500 milliseconds.
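Reciprocal rank fusion itself is compact enough to show in full. The sketch below is the standard form of the algorithm, not DNXT's production code: the document IDs are placeholders, and the constant `k=60` is a commonly used value, assumed here because the page does not state how the two engines are weighted.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists; each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Placeholder result lists, best match first.
lucene_hits = ["A", "B", "C"]  # keyword engine, ranked by BM25
vector_hits = ["B", "D", "A"]  # semantic engine, ranked by cosine similarity

merged = reciprocal_rank_fusion([lucene_hits, vector_hits])
```

Documents `A` and `B` appear in both lists, so they rise above `C` and `D`, which each appear in only one. Because the fusion uses ranks rather than raw scores, BM25 values and cosine similarities never need to be put on a common scale.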

5
Faceted Filtering & Scoping

Results Are Refined by Regulatory Metadata Without Re-Running the Core Query

Every result carries structured metadata: dossier ID, submission sequence, CTD module and section code, document type (study report, certificate, protocol, SOP), jurisdiction, regulatory status (approved, under review, superseded), document version, author, and modification date. Users apply faceted filters in the sidebar, narrowing results by any combination of these fields. Filtering happens at the index level, not in the application layer, so a filter over 50,000 documents feels as fast as one over 500. Regulatory directors frequently use this to scope a request quickly, for example: "show me all Module 3.2.S.4.2 Analytical Procedures documents across all active INDs, filed since 2023."
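One way to picture index-level filtering is as set intersection over precomputed facet postings. The sketch below assumes that model for illustration; the field names, document IDs, and the `facet_index` structure are hypothetical and say nothing definitive about how DNXT stores its facets.

```python
from collections import defaultdict

# Hypothetical document metadata, for illustration only.
documents = {
    "doc-1": {"module": "3.2.S.4", "status": "approved", "submission": "IND"},
    "doc-2": {"module": "3.2.S.4", "status": "superseded", "submission": "IND"},
    "doc-3": {"module": "3.2.S.7", "status": "approved", "submission": "NDA"},
    "doc-4": {"module": "3.2.S.4", "status": "approved", "submission": "NDA"},
}

# Facet index built once at indexing time: (field, value) -> set of doc ids.
facet_index = defaultdict(set)
for doc_id, meta in documents.items():
    for field, value in meta.items():
        facet_index[(field, value)].add(doc_id)


def apply_facets(hits, selected):
    """Intersect the hit set with one posting set per selected facet value."""
    for field, value in selected.items():
        hits = hits & facet_index.get((field, value), set())
    return hits


def facet_counts(hits, field):
    """Live counts: how many of the current hits carry each value of `field`."""
    return {
        value: len(hits & ids)
        for (f, value), ids in facet_index.items()
        if f == field and hits & ids
    }


query_hits = set(documents)  # pretend the core query matched every document
narrowed = apply_facets(query_hits, {"module": "3.2.S.4", "submission": "IND"})
counts = facet_counts(narrowed, "status")
```

Because the core query's hit set is reused, adding or removing a filter only costs a few set intersections; the query itself is not re-run.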

6
Global CRO Search — Permission-Enforced

CRO Tenants Can Search Across All Client Portfolios With Granular Access Control

For CRO accounts, DNXT's global search federation queries across all client tenant indexes simultaneously, but applies permission gates at the index level before any results are returned. Each result is scoped to what the requesting user is authorized to view, based on their role assignments within each client tenant. A regulatory project manager at a CRO can search "validated HPLC analytical method for peptide API" and see relevant documents across all client programs they're assigned to, without seeing confidential data from programs they're not assigned to and without switching user accounts or tenant contexts. This is architected as a parallel index federation, not a data merge, so client data is never commingled at the storage layer.
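The federation pattern, with the permission check applied before a tenant index is read, can be sketched as follows. Everything here is hypothetical: the tenant names, programs, user, and the `assignments` structure are invented for illustration, and a simple substring match stands in for the real search engines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-tenant indexes, kept separate to mirror "federation, not merge".
tenant_indexes = {
    "client_a": [
        {"id": "a-1", "program": "P1", "title": "Validated HPLC method for peptide API"},
        {"id": "a-2", "program": "P2", "title": "HPLC method transfer report"},
    ],
    "client_b": [
        {"id": "b-1", "program": "P7", "title": "HPLC assay validation, peptide API"},
    ],
    "client_c": [
        {"id": "c-1", "program": "P9", "title": "HPLC impurity method"},
    ],
}

# Hypothetical role assignments: user -> tenant -> programs the user may see.
assignments = {"pm_jane": {"client_a": {"P1"}, "client_b": {"P7"}}}


def search_tenant(tenant, user, term):
    """Search one tenant, restricted to the programs the user is assigned to."""
    allowed = assignments.get(user, {}).get(tenant, set())
    if not allowed:
        return []  # the tenant's index is never read for unassigned users
    return [
        {**doc, "tenant": tenant}  # tag each hit with its tenant of origin
        for doc in tenant_indexes[tenant]
        if doc["program"] in allowed and term.lower() in doc["title"].lower()
    ]


def global_search(user, term):
    """Query every tenant index in parallel and concatenate the permitted hits."""
    with ThreadPoolExecutor() as pool:
        per_tenant = list(pool.map(lambda t: search_tenant(t, user, term), tenant_indexes))
    return [hit for hits in per_tenant for hit in hits]


hits = global_search("pm_jane", "hplc")
```

In this example the user sees `a-1` and `b-1` only: `a-2` belongs to a program she is not assigned to, and `client_c` is skipped entirely.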

7
Result Presentation & Audit Trail

Results Display Source Context, Relevance Reasoning, and Full Audit Log

Search results display the document's CTD path, version, regulatory status, matched text snippet with keyword highlighting, and relevance score. AI-matched results also carry a badge indicating the match was semantic rather than literal, helping users understand why a document was surfaced. Clicking a result opens the document with the matched passage highlighted in context. Every search query, the result set returned, and any documents opened are recorded in the platform's immutable audit log. This provides a complete, timestamped record of who searched for what and what they found, supporting regulatory inspection readiness and GxP compliance requirements.
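This page does not describe how the audit log is made immutable. One common technique for tamper-evident logs is hash chaining, sketched below purely as an assumption: the `AuditLog` class, its field names, and the use of SHA-256 are illustrative choices, not a description of DNXT's implementation.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body: dict) -> str:
    """Stable SHA-256 over the entry body (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the hash of the previous one."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def record(self, user: str, action: str, detail: str) -> dict:
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "detail": detail,
            "prev": prev,
        }
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.record("pm_jane", "search", "drug substance stability data ICH Q1A")
log.record("pm_jane", "open", "NDA-204892 / Module 3 / 3.2.S.7")
chain_ok = log.verify()
```

Under this scheme, altering any recorded query after the fact changes its hash and invalidates every later entry, which is what makes the record checkable during an inspection.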

Search Capabilities Built for Regulatory Complexity

Six distinct search and discovery capabilities that address the specific retrieval challenges of eCTD submissions, dossier management, and cross-program regulatory work.

🔍

Full-Text Search (Apache Lucene)

DNXT's Lucene-powered full-text engine indexes every word inside every document — including the content of uploaded PDFs, DOCX files, and eCTD XML leaves — not just file names and metadata. Lucene's BM25 scoring algorithm ranks results by statistical relevance, accounting for term frequency and document length, so the most contextually dense results appear first. Linguistic stemming ensures "impurity characterization" also matches "impurities characterized" and "characterize impurity" — capturing the vocabulary variation inherent in multi-author regulatory documents. Results are returned within 200 milliseconds for even the largest dossier libraries.

🧠

AI Semantic Search (RAG)

DNXT's semantic layer uses the retrieval techniques behind Retrieval-Augmented Generation to understand the meaning of your query, not just its words. When you search "bioequivalence study for small molecule oral solid," you'll find documents titled "Comparative Pharmacokinetic Evaluation" or "In Vivo BA/BE Assessment" that never use the phrase "bioequivalence study" but are precisely what you're looking for. The embedding model has been fine-tuned on pharmaceutical and regulatory corpus data, so it understands domain-specific synonyms, ICH guideline concepts, and regulatory terminology equivalencies that a general-purpose search engine would miss entirely. This dramatically improves recall for senior scientists who know what they're looking for but not what it might be named.

🌐

Global Search for CROs

CROs managing multiple sponsor clients face a search problem no single-tenant tool can solve: how to leverage prior work across clients without violating confidentiality. DNXT's global search federates across all client tenant indexes in a single query, with permission gates enforced at the index retrieval layer — not the display layer — so unauthorized documents are never fetched, cached, or processed for non-authorized users. A CRO regulatory scientist can search for a validated dissolution method across their entire client portfolio, see which client programs have it, and request reuse — all without switching accounts or obtaining separate credentials. Results are clearly tagged with client identifiers so there is zero ambiguity about data provenance.

🗂️

TOC Node Search

In eCTD submissions, finding content by its structural location — not just its file name — is essential. DNXT's TOC Node Search indexes the complete eCTD table of contents hierarchy: module, section, subsection, leaf title, and sequence number. You can search "3.2.P.5.6 Justification of Specifications" and instantly retrieve every document filed at that exact TOC node across all submissions, including the full version history at that position. For teams writing responses to agency deficiency letters, this means you can immediately locate every document that occupied a specific CTD section across all prior sequences — without opening individual submissions or navigating folder structures manually.
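A TOC node lookup can be sketched as a match on the dotted section code. The records below are hypothetical: the submission IDs echo the examples shown on this page, but the sequence numbers and titles are invented, and treating child sections (for example 3.2.S.7.1 under 3.2.S.7) as matches is an assumption about the desired behavior.

```python
# Hypothetical leaf records; sequence numbers and titles are invented.
leaves = [
    {"submission": "NDA-204892", "sequence": "0000", "node": "3.2.S.7.1", "title": "Stability Summary"},
    {"submission": "NDA-204892", "sequence": "0003", "node": "3.2.S.7.1", "title": "Stability Summary (updated)"},
    {"submission": "IND-2023-187", "sequence": "0001", "node": "3.2.S.7", "title": "Stability Protocol"},
    {"submission": "BLA-125742", "sequence": "0000", "node": "3.2.P.8", "title": "Stability Summary Report"},
]


def find_by_node(node: str, include_children: bool = True):
    """Return leaves filed at `node`, optionally including its child sections."""

    def matches(candidate: str) -> bool:
        if candidate == node:
            return True
        # Require a dot boundary so "3.2.S.7" never matches "3.2.S.70".
        return include_children and candidate.startswith(node + ".")

    hits = [leaf for leaf in leaves if matches(leaf["node"])]
    # Group by submission, then list sequences in filing order.
    return sorted(hits, key=lambda leaf: (leaf["submission"], leaf["sequence"]))


stability = find_by_node("3.2.S.7")
exact_only = find_by_node("3.2.S.7", include_children=False)
```

Sorting by submission and sequence gives the version history at a node in one pass, which is the view needed when answering a deficiency letter.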

🏷️

Metadata Search

Beyond document content, DNXT indexes every metadata field attached to documents and submissions: substance name, drug product name, therapeutic area, INN, CAS number, regulatory agency, submission type (IND, NDA, ANDA, BLA, MAA), sequence number, lifecycle status, author, reviewer, approval date, and all custom metadata fields configured by your organization. This makes DNXT a de facto regulatory asset registry: you can query "all submissions referencing CAS 123456-78-9 filed with FDA between 2020 and 2024" and get a precise, auditable answer in seconds. For regulatory intelligence and due diligence activities, this metadata searchability is transformative.

Faceted Filtering

Faceted filtering in DNXT is not an afterthought — it's a structured refinement layer that operates directly at the index level, making it instantaneous regardless of result set size. Every search result can be filtered across multiple simultaneous dimensions: regulatory jurisdiction (FDA, EMA, PMDA, Health Canada, TGA, etc.), document status (draft, under review, approved, superseded), CTD module, document type, date range, program, and submission sequence. Filters are additive and displayed with live document counts so you can see exactly how many results each filter will return before you apply it. Power users build saved search queries with pre-applied filters that they run as standing reports — for example, "all Module 1 cover letters across active EU MAA submissions" checked weekly.

DNXT vs. The Alternative

Regulatory teams deserve an honest comparison. Here is how DNXT Unified Search measures up against the tools teams are currently using — or are told to use.

| Capability | DNXT Publisher Suite | Veeva Vault RIM | LORENZ docuBridge | Manual / SharePoint |
| --- | --- | --- | --- | --- |
| Full-text document content search | Lucene-powered, indexes inside every PDF, DOCX, XML file content | Metadata and title search only in most configurations; full-text requires additional Vault configuration and is limited in scope | Primarily filename and metadata search; document content indexing not natively supported | SharePoint full-text search is inconsistent, fails on heavily formatted PDFs, and has no regulatory awareness |
| AI semantic / intent-based search | RAG-based vector search with regulatory domain fine-tuning; finds conceptually matching documents regardless of exact wording | No AI semantic search capability as of 2024; roadmap items announced but not delivered | Keyword-only retrieval; no semantic understanding of regulatory concepts | No semantic capability; users rely on knowing exact document names or maintaining manual spreadsheet indexes |
| Background auto-indexing on upload | Documents searchable within 10 seconds of upload; fully automatic, no configuration required | Indexing occurs but can lag by minutes to hours depending on vault size and configuration; manual re-index sometimes required | Indexing tied to scheduled processes; newly uploaded documents may not be searchable until next index cycle | No automated indexing; content is found only if previously indexed by SharePoint crawl, which can take hours |
| Cross-client CRO global search | Federated global search across all client tenants with permission-enforced isolation; single query, single UI | Vault is strictly single-tenant per customer; CROs must manage separate Vault instances per sponsor and switch between them manually | No multi-tenant cross-client search; each project environment is isolated with no federation capability | Completely impossible; separate SharePoint sites or network drives per client with no cross-site search |
| eCTD TOC node search | Search by CTD section code (e.g., 3.2.S.7) returns all documents filed at that TOC position across all submissions and sequences | Structured browsing by CTD section is available, but cross-submission TOC search requires manual filtering and is slow at scale | TOC navigation available within a submission; cross-submission TOC search not supported as a unified query | No TOC awareness; folder structure may approximate CTD hierarchy but cannot be queried as structured data |
| Faceted filtering (multi-dimension) | Simultaneous multi-dimensional faceting on jurisdiction, status, module, date, type, program, with live document counts per facet value |