Unified Search & Discovery
Find Any Document in Your Regulatory Library in Under 5 Seconds
DNXT combines Apache Lucene full-text search with AI-powered semantic retrieval so regulatory teams stop hunting through folders and start finding the right document, version, and context — instantly, across every dossier, submission, and client tenant.
Who This Is Built For
Every regulatory role has a different search problem. DNXT Unified Search is engineered to solve all of them — without compromising on precision, permissions, or compliance.
- Instantly surface all documents matching a substance, a Module section, or a regulatory agency across every active program
- Audit-ready search history: every query, filter, and result is logged and traceable
- Faceted filters let you narrow by program, submission type, date range, approval status, and document type in seconds
- AI semantic matching catches documents that reference the same concept under different terminology — no more missed precedents
- Full document content search — finds text inside PDFs, DOCX, and XML files, not just file names
- TOC node search locates content by its position in the eCTD structure, even across multiple submission sequences
- Semantic search understands intent — searching "API degradation under heat" finds documents titled "Thermal Forced Degradation Study" even without keyword overlap
- Version-aware results show the most current approved document, with prior versions one click away
- Global CRO search spans all client tenants in a single query — permission-enforced so confidential data stays isolated
- Find reusable analytical methods, validated procedures, and prior submissions across your entire client portfolio without switching accounts
- Reduce duplicated work across clients: see at a glance whether a Module 3 component already exists before authoring from scratch
- Search results tagged by client, program, and jurisdiction — no cross-contamination of sensitive sponsor data
How It Works
Unified Search isn't a simple database query. It is a multi-layer retrieval system engineered for the specific characteristics of regulatory documents — structured eCTD metadata, unstructured clinical text, and everything in between.
Every Uploaded Document Triggers Immediate Background Indexing
The moment a document is uploaded or modified in DNXT (whether it's a PDF study report, an XML eCTD leaf file, or a DOCX module section), an asynchronous indexing job is queued automatically. No manual trigger, no batch window, no waiting until the next business day. The indexing pipeline extracts raw text using layout-aware PDF parsing (preserving tables, headers, and section structure) and breaks eCTD XML down into its constituent text nodes. Because indexing starts immediately, the document is fully searchable within 10 seconds of upload.
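The upload-triggered flow described above can be sketched as a small producer/consumer queue. This is only an illustration under stated assumptions: an in-process queue and a single worker thread stand in for whatever job infrastructure DNXT actually uses, and the names `IndexJob`, `on_upload`, `extract_text`, and `search_index` are hypothetical, not DNXT APIs.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class IndexJob:
    doc_id: str
    content: bytes
    fmt: str  # "pdf", "docx", or "xml"


jobs = queue.Queue()   # pending indexing jobs
search_index = {}      # doc_id -> extracted text (stand-in for the real index)


def extract_text(job):
    # Placeholder: a real pipeline would dispatch to format-specific parsers here.
    return job.content.decode("utf-8", errors="ignore")


def worker():
    # Consume jobs forever; a None job is the shutdown signal.
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        search_index[job.doc_id] = extract_text(job)
        jobs.task_done()


def on_upload(doc_id, content, fmt):
    # Called by the upload handler: enqueue and return immediately.
    jobs.put(IndexJob(doc_id, content, fmt))


threading.Thread(target=worker, daemon=True).start()
on_upload("doc-1", b"Thermal Forced Degradation Study", "pdf")
jobs.join()  # block until the background worker has indexed everything queued
```

The upload call returns as soon as the job is queued, which is what keeps the user from waiting on text extraction.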
Apache Lucene Indexes Every Word, Section, and Metadata Field
Extracted document text is passed to an Apache Lucene index. Lucene is the same battle-tested search library that underpins Elasticsearch and Solr. It tokenizes content, applies stemming (so "stability" matches "stabilize" and "stabilization"), and builds an inverted index across all document content, filenames, CTD section codes, submission identifiers, substance names, and custom metadata fields. When you execute a keyword search, Lucene returns results in milliseconds using BM25 relevance scoring, the industry-standard probabilistic ranking model that weighs term frequency, term rarity across the library, and document length to surface the most contextually relevant matches first.
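To make the inverted index and BM25 ranking concrete, here is a minimal sketch in plain Python. It uses the classic BM25 formula with commonly used default parameters (`k1=1.2`, `b=0.75`) and whitespace tokenization with no stemming. It is not Lucene's implementation, and the class name `TinyIndex` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; a real analyzer would also stem.
    return text.lower().split()


class TinyIndex:
    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> token count

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        n = len(self.doc_len)
        if n == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                # Longer-than-average documents are penalized via `norm`.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = TinyIndex()
index.add("a", "stability study of drug substance")
index.add("b", "impurity profile report")
index.add("c", "stability stability data")
ranked = index.search("stability")  # "c" ranks above "a"; "b" does not match
```

Document "c" wins because it mentions the term twice in a shorter text, which is the term-frequency and length-normalization behavior described above.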
AI Encodes Document Meaning as Vector Embeddings for Semantic Retrieval
In parallel with Lucene indexing, each document chunk is passed through a large language model encoder that converts the text into a high-dimensional vector embedding: a mathematical representation of its semantic meaning. These vectors are stored in a dedicated vector database. When you search "drug substance degradation under thermal stress," the system encodes your query into the same embedding space and retrieves documents whose meaning is geometrically closest, regardless of whether they use your exact words. This is the retrieval stage of Retrieval Augmented Generation (RAG), applied specifically to the regulatory domain: the model has been fine-tuned on pharmaceutical and regulatory text so it understands that "forced degradation" and "stress testing" are conceptually equivalent in this context.
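The "geometrically closest" idea can be illustrated with cosine similarity over toy vectors. The three-dimensional vectors below are hand-made stand-ins for real embeddings (which have hundreds of dimensions and come from an encoder model), and the `nearest` helper is a hypothetical brute-force scan rather than a vector database query.

```python
import math


def cosine(u, v):
    # Cosine of the angle between two vectors; 1.0 means same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(query_vec, doc_vecs, top_k=3):
    # Score every document against the query and keep the best top_k.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]


# Hand-made vectors standing in for encoder output (assumption for illustration).
doc_vecs = {
    "Thermal Forced Degradation Study": [0.9, 0.1, 0.0],
    "Comparative Pharmacokinetic Evaluation": [0.0, 0.2, 0.9],
    "Container Closure Specification": [0.1, 0.9, 0.1],
}
# Pretend encoding of "drug substance degradation under thermal stress".
query_vec = [0.8, 0.2, 0.1]

best_title, best_score = nearest(query_vec, doc_vecs, top_k=1)[0]
```

With these toy values the degradation study is the closest match even though the query and the title share almost no words.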
Each Search Runs Both Engines Simultaneously and Merges Results
When a user submits a query, DNXT fires it against both the Lucene index and the vector database concurrently. Lucene returns exact and stemmed matches ranked by BM25 score. The vector database returns semantically similar documents ranked by cosine similarity. A reciprocal rank fusion algorithm merges both result sets into a single ranked list, giving appropriate weight to both precision (exact matches) and recall (semantic matches). Documents that score highly on both systems are elevated — indicating that they are both literally and semantically relevant to the query. The merged result is returned to the UI in under 500 milliseconds.
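Reciprocal rank fusion itself is compact enough to show in full. The sketch below assumes each engine hands back an ordered list of document ids; the constant `k=60` is the value commonly used in the literature, not a documented DNXT setting, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists; each appearance adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["doc-A", "doc-B", "doc-C"]   # e.g. BM25 order
semantic = ["doc-B", "doc-D", "doc-A"]  # e.g. cosine-similarity order
fused = reciprocal_rank_fusion([lexical, semantic])
# fused == ["doc-B", "doc-A", "doc-D", "doc-C"]
```

Documents that appear in both lists (`doc-B`, `doc-A`) rise to the top, which matches the behavior described above. Because the method only uses ranks, it never has to compare BM25 scores with cosine similarities directly.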
Results Are Refined by Regulatory Metadata Without Re-Running the Core Query
Every result carries structured metadata: dossier ID, submission sequence, CTD module and section code, document type (study report, certificate, protocol, SOP), jurisdiction, regulatory status (approved, under review, superseded), document version, author, and modification date. Users apply faceted filters in the sidebar, narrowing results by any combination of these fields instantly. Filtering happens at the index level, not in the application layer, so applying a filter to 50,000 documents feels as fast as applying it to 500. Regulatory directors frequently use this to scope a question quickly, for example: "show me all Module 3.2.S.4.2 Analytical Procedures documents across all active INDs, filed since 2023."
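One way to picture filtering "at the index level" is a lookup table from each metadata value to the set of documents that carry it, so a filter becomes a set intersection instead of a scan over document records. The sketch below is an assumption about the general technique, not DNXT's implementation; `FacetIndex`, `refine`, and the sample metadata are hypothetical.

```python
from collections import defaultdict


class FacetIndex:
    """Maps each (field, value) pair to the set of document ids carrying it."""

    def __init__(self):
        self.by_value = defaultdict(set)

    def add(self, doc_id, metadata):
        for field, value in metadata.items():
            self.by_value[(field, value)].add(doc_id)

    def refine(self, ranked_hits, **filters):
        # Intersect the hit set with the precomputed set for every filter.
        allowed = set(ranked_hits)
        for field, value in filters.items():
            allowed &= self.by_value.get((field, value), set())
        # Preserve the ranking from the core query; no re-scoring needed.
        return [doc_id for doc_id in ranked_hits if doc_id in allowed]


facet_index = FacetIndex()
facet_index.add("d1", {"module": "3.2.S.4", "status": "approved", "type": "IND"})
facet_index.add("d2", {"module": "3.2.S.4", "status": "superseded", "type": "IND"})
facet_index.add("d3", {"module": "3.2.P.5", "status": "approved", "type": "NDA"})

ranked_hits = ["d3", "d1", "d2"]  # order produced earlier by the core query
refined = facet_index.refine(ranked_hits, module="3.2.S.4", status="approved")
# refined == ["d1"]
```

The core query's ranking is reused as-is, which is why refining does not require running the search again.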
CRO Tenants Can Search Across All Client Portfolios With Granular Access Control
For CRO accounts, DNXT's global search federation queries across all client tenant indexes simultaneously, but applies permission gates at the index level before any results are returned. Each result is scoped to what the requesting user is authorized to view, based on their role assignments within each client tenant. A regulatory project manager at a CRO can search "validated HPLC analytical method for peptide API" and see relevant documents across all client programs they're assigned to, while confidential data from programs they're not assigned to stays hidden, all without switching user accounts or tenant contexts. This is architected as a parallel index federation, not a data merge, ensuring client data is never co-mingled at the storage layer.
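The permission-gated federation can be sketched as a loop over separate per-tenant indexes in which the access check runs before any document text is examined. Everything here is a simplified assumption: the tenants are queried sequentially rather than in parallel, matching is a plain substring test, and `federated_search`, `assignments`, and the sample data are hypothetical.

```python
def federated_search(user, query, tenant_indexes, assignments):
    """Search every tenant index, returning only documents the user may see."""
    results = []
    needle = query.lower()
    for tenant, index in tenant_indexes.items():
        allowed_programs = assignments.get(user, {}).get(tenant, set())
        if not allowed_programs:
            continue  # no role in this tenant: its index is never touched
        for doc in index:
            # Permission gate first, content match second.
            if doc["program"] in allowed_programs and needle in doc["text"].lower():
                results.append({**doc, "tenant": tenant})  # tag provenance
    return results


# Each tenant keeps its own index; nothing is merged at the storage layer.
tenant_indexes = {
    "sponsor-a": [
        {"id": "a1", "program": "P-100", "text": "Validated HPLC analytical method for peptide API"},
    ],
    "sponsor-b": [
        {"id": "b1", "program": "P-200", "text": "Validated HPLC method, small molecule"},
        {"id": "b2", "program": "P-201", "text": "HPLC method transfer report"},
    ],
}
assignments = {"pm-1": {"sponsor-a": {"P-100"}, "sponsor-b": {"P-201"}}}

hits = federated_search("pm-1", "hplc", tenant_indexes, assignments)
# hits contains a1 and b2; b1 belongs to a program pm-1 is not assigned to
```

Each hit carries a `tenant` tag, mirroring the client-tagged results described in the text.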
Results Display Source Context, Relevance Reasoning, and Full Audit Log
Search results display the document's CTD path, version, regulatory status, matched text snippet with keyword highlighting, relevance score, and (for AI-matched results) a badge indicating the match was semantic rather than literal, helping users understand why a document was surfaced. Clicking a result opens the document with the matched passage highlighted in context. Every search query, the result set returned, and any documents opened are recorded in the platform's immutable audit log, providing a complete, timestamped record of who searched for what and what they found, supporting regulatory inspection readiness and GxP compliance requirements.
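The document does not say how the audit log is made immutable. One common technique is a hash chain, where each entry stores the hash of the previous one so that any later edit is detectable. The sketch below shows that technique as an assumption, with a hypothetical `AuditLog` class and an in-memory list in place of durable storage.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(body):
    # Deterministic hash of an entry body (keys sorted for stability).
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    def __init__(self):
        self._entries = []

    def record(self, user, query, result_ids, opened=()):
        body = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "query": query,
            "result_ids": list(result_ids),
            "opened": list(opened),
            "prev_hash": self._entries[-1]["hash"] if self._entries else GENESIS,
        }
        self._entries.append({**body, "hash": _digest(body)})

    def verify(self):
        # Recompute every hash and check each link to the previous entry.
        prev = GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.record("jsmith", "forced degradation", ["d1", "d7"], opened=["d1"])
log.record("jsmith", "3.2.P.5.6", ["d4"])
chain_ok = log.verify()  # True while no entry has been altered
```

Changing any stored field afterwards makes `verify()` return `False`, which is the property an inspector would rely on.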
Search Capabilities Built for Regulatory Complexity
Six distinct search and discovery capabilities that address the specific retrieval challenges of eCTD submissions, dossier management, and cross-program regulatory work.
Full-Text Search (Apache Lucene)
DNXT's Lucene-powered full-text engine indexes every word inside every document — including the content of uploaded PDFs, DOCX files, and eCTD XML leaves — not just file names and metadata. Lucene's BM25 scoring algorithm ranks results by statistical relevance, accounting for term frequency and document length, so the most contextually dense results appear first. Linguistic stemming ensures "impurity characterization" also matches "impurities characterized" and "characterize impurity" — capturing the vocabulary variation inherent in multi-author regulatory documents. Results are returned within 200 milliseconds for even the largest dossier libraries.
AI Semantic Search (RAG)
Built on the retrieval techniques behind Retrieval Augmented Generation, DNXT's semantic layer understands the meaning of your query, not just its words. When you search "bioequivalence study for small molecule oral solid," you'll find documents titled "Comparative Pharmacokinetic Evaluation" or "In Vivo BA/BE Assessment" that never use the phrase "bioequivalence study" but are precisely what you're looking for. The embedding model has been fine-tuned on pharmaceutical and regulatory text, so it understands domain-specific synonyms, ICH guideline concepts, and regulatory terminology equivalencies that a general-purpose search engine would miss entirely. This dramatically improves recall for senior scientists who know what they're looking for but not what the document might be titled.
Global Search for CROs
CROs managing multiple sponsor clients face a search problem no single-tenant tool can solve: how to leverage prior work across clients without violating confidentiality. DNXT's global search federates across all client tenant indexes in a single query, with permission gates enforced at the index retrieval layer — not the display layer — so unauthorized documents are never fetched, cached, or processed for non-authorized users. A CRO regulatory scientist can search for a validated dissolution method across their entire client portfolio, see which client programs have it, and request reuse — all without switching accounts or obtaining separate credentials. Results are clearly tagged with client identifiers so there is zero ambiguity about data provenance.
TOC Node Search
In eCTD submissions, finding content by its structural location — not just its file name — is essential. DNXT's TOC Node Search indexes the complete eCTD table of contents hierarchy: module, section, subsection, leaf title, and sequence number. You can search "3.2.P.5.6 Justification of Specifications" and instantly retrieve every document filed at that exact TOC node across all submissions, including the full version history at that position. For teams writing responses to agency deficiency letters, this means you can immediately locate every document that occupied a specific CTD section across all prior sequences — without opening individual submissions or navigating folder structures manually.
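A TOC node lookup can be pictured as an index keyed by CTD section code, with each entry recording the submission and sequence where a leaf was filed. This is a minimal sketch under assumptions: `TocIndex`, `add_leaf`, `at_node`, and the submission identifiers are hypothetical, and sequence numbers are kept as zero-padded strings so they sort correctly.

```python
from collections import defaultdict


class TocIndex:
    def __init__(self):
        self.nodes = defaultdict(list)  # section code -> list of leaf records

    def add_leaf(self, section, submission, sequence, title):
        self.nodes[section].append(
            {"section": section, "submission": submission,
             "sequence": sequence, "title": title}
        )

    def at_node(self, section, include_children=False):
        """Return every leaf filed at a section, across all submissions."""
        hits = []
        for code, leaves in self.nodes.items():
            is_child = include_children and code.startswith(section + ".")
            if code == section or is_child:
                hits.extend(leaves)
        # Group by submission, then chronological by sequence number.
        return sorted(hits, key=lambda leaf: (leaf["submission"], leaf["sequence"]))


toc = TocIndex()
toc.add_leaf("3.2.P.5.6", "NDA-001", "0000", "Justification of Specifications")
toc.add_leaf("3.2.P.5.6", "NDA-001", "0003", "Justification of Specifications (revised)")
toc.add_leaf("3.2.P.5.1", "NDA-001", "0000", "Specifications")
toc.add_leaf("3.2.P.5.6", "MAA-002", "0001", "Justification of Specifications")

history = toc.at_node("3.2.P.5.6")
```

Here `history` lists the three leaves filed at 3.2.P.5.6 in submission and sequence order, which is the kind of per-node version history the paragraph describes.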
Metadata Search
Beyond document content, DNXT indexes every metadata field attached to documents and submissions: substance name, drug product name, therapeutic area, INN, CAS number, regulatory agency, submission type (IND, NDA, ANDA, MAA), sequence number, lifecycle status, author, reviewer, approval date, and all custom metadata fields configured by your organization. This makes DNXT a de facto regulatory asset registry: you can query "all submissions referencing CAS 123456-78-9 filed with FDA between 2020 and 2024" and get a precise, auditable answer in seconds. For regulatory intelligence and due diligence activities, this metadata searchability is transformative.
Faceted Filtering
Faceted filtering in DNXT is not an afterthought — it's a structured refinement layer that operates directly at the index level, making it instantaneous regardless of result set size. Every search result can be filtered across multiple simultaneous dimensions: regulatory jurisdiction (FDA, EMA, PMDA, Health Canada, TGA, etc.), document status (draft, under review, approved, superseded), CTD module, document type, date range, program, and submission sequence. Filters are additive and displayed with live document counts so you can see exactly how many results each filter will return before you apply it. Power users build saved search queries with pre-applied filters that they run as standing reports — for example, "all Module 1 cover letters across active EU MAA submissions" checked weekly.
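Live facet counts and saved searches are simple to illustrate. The sketch below works on a small in-memory list purely for readability (the text states that the real filtering runs at the index level), and `facet_counts`, `apply_filters`, and the `saved_search` structure are hypothetical.

```python
from collections import Counter


def facet_counts(hits, field):
    """Count how many of the current hits carry each value of `field`."""
    return Counter(doc[field] for doc in hits if field in doc)


def apply_filters(hits, filters):
    """Keep only hits that match every field/value pair in `filters`."""
    return [doc for doc in hits
            if all(doc.get(field) == value for field, value in filters.items())]


hits = [
    {"id": "d1", "jurisdiction": "EMA", "module": "1", "status": "approved"},
    {"id": "d2", "jurisdiction": "EMA", "module": "3", "status": "draft"},
    {"id": "d3", "jurisdiction": "FDA", "module": "1", "status": "approved"},
]

# Counts shown next to each facet value before the user clicks it.
jurisdiction_counts = facet_counts(hits, "jurisdiction")

# A saved search modeled as a name plus pre-applied filters.
saved_search = {
    "name": "EU Module 1 documents",
    "filters": {"jurisdiction": "EMA", "module": "1"},
}
standing_report = apply_filters(hits, saved_search["filters"])
```

Running the saved filters again on a schedule gives the "standing report" behavior mentioned above.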
DNXT vs. The Alternative
Regulatory teams deserve an honest comparison. Here is how DNXT Unified Search measures up against the tools teams are currently using — or are told to use.
| Capability | DNXT Publisher Suite | Veeva Vault RIM | LORENZ docuBridge | Manual / SharePoint |
|---|---|---|---|---|
| Full-text document content search | ✓ Lucene-powered, indexes inside every PDF, DOCX, XML file content | ◑ Metadata and title search only in most configurations; full-text requires additional Vault configuration and is limited in scope | ✗ Primarily filename and metadata search; document content indexing not natively supported | ✗ SharePoint full-text search is inconsistent, fails on heavily formatted PDFs, and has no regulatory awareness |
| AI semantic / intent-based search | ✓ RAG-based vector search with regulatory domain fine-tuning; finds conceptually matching documents regardless of exact wording | ✗ No AI semantic search capability as of 2024; roadmap items announced but not delivered | ✗ Keyword-only retrieval; no semantic understanding of regulatory concepts | ✗ No semantic capability; users rely on knowing exact document names or maintaining manual spreadsheet indexes |
| Background auto-indexing on upload | ✓ Documents searchable within 10 seconds of upload — fully automatic, no configuration required | ◑ Indexing occurs but can lag by minutes to hours depending on vault size and configuration; manual re-index sometimes required | ◑ Indexing tied to scheduled processes; newly uploaded documents may not be searchable until next index cycle | ✗ No automated indexing; content is found only if previously indexed by SharePoint crawl, which can take hours |
| Cross-client CRO global search | ✓ Federated global search across all client tenants with permission-enforced isolation — single query, single UI | ✗ Vault is strictly single-tenant per customer; CROs must manage separate Vault instances per sponsor and switch between them manually | ✗ No multi-tenant cross-client search; each project environment is isolated with no federation capability | ✗ Completely impossible; separate SharePoint sites or network drives per client with no cross-site search |
| eCTD TOC node search | ✓ Search by CTD section code (e.g., 3.2.S.7) returns all documents filed at that TOC position across all submissions and sequences | ◑ Structured browsing by CTD section is available, but cross-submission TOC search requires manual filtering and is slow at scale | ◑ TOC navigation available within a submission; cross-submission TOC search not supported as a unified query | ✗ No TOC awareness; folder structure may approximate CTD hierarchy but cannot be queried as structured data |
| Faceted filtering (multi-dimension) | ✓ Simultaneous multi-dimensional faceting on jurisdiction, status, module, date, type, program — with live document counts per facet value |
|