Project

Trial Document Intelligence

Document intelligence architecture for eTMF classification in multi-tenant clinical trial operations. Without AI, document filing depends on manual review and ongoing taxonomy maintenance. The most direct AI approach is to put the full classification taxonomy plus the document content into one long LLM prompt, but at scale that becomes expensive, slow, and hard to maintain, because each tenant has hundreds of evolving rules. This system instead combines generative semantic projection, retrieval over enriched tenant-specific rule vectors, and an LLM-as-a-judge decision step to make classification cheaper, more scalable, and easier to inspect.

Semantic projection, Vector retrieval, Rerank, Zero-shot adaptation, LLM-as-a-judge

At A Glance

Domain: eTMF document classification for multi-tenant clinical trial operations
Core idea: Project noisy document instances into definition-style semantic descriptions before rule retrieval
Inputs: OCR text, key first-page and signature-page text streams, extracted metadata, tenant filing rules, rule definitions, and enriched rule features
Outputs: Final filing classification, Top-K candidate rules, supporting metadata evidence, and a reviewable decision basis
Production concerns: Tenant isolation, zero-shot onboarding, token cost, latency, recall quality, and human review for ambiguous cases

Problem Framing

eTMF management is a high-volume, rule-heavy document workflow. Without AI, classification is mostly a manual operation: someone reads the uploaded trial document, understands the customer's filing taxonomy, checks document evidence such as title, date, signature, and study context, and chooses the right category. That work becomes expensive because each sponsor or CRO may maintain hundreds of rules, and the taxonomy itself needs constant human upkeep as customer conventions change.

The most direct AI version is tempting but weak: put the entire classification taxonomy into the prompt, add the full document text, and ask a large model to predict the category in one shot. In practice, context length, token cost, and latency all grow with the number of rules and pages, so the approach does not scale operationally. It also makes the taxonomy hard to maintain, because every rule update changes a large prompt surface, and the model attends less reliably to any single rule when hundreds of rules sit in the same context.
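To make the scaling argument concrete, the following is a back-of-the-envelope sketch of input tokens per document under each approach. Every number in it (rule count, tokens per rule, document length, Top-K size) is an illustrative assumption, not a measurement from this system, and the function names are hypothetical.

```python
def full_prompt_tokens(num_rules: int, tokens_per_rule: int, document_tokens: int) -> int:
    """Input tokens when every rule and the whole document share one prompt."""
    return num_rules * tokens_per_rule + document_tokens


def judge_prompt_tokens(top_k: int, tokens_per_rule: int,
                        description_tokens: int, metadata_tokens: int) -> int:
    """Input tokens when the decision model sees only Top-K rules plus compact evidence."""
    return top_k * tokens_per_rule + description_tokens + metadata_tokens


# Illustrative assumptions only: 300 rules of ~80 tokens, a ~6000-token document,
# a ~150-token standardized description, ~50 tokens of metadata, and K = 5.
full = full_prompt_tokens(num_rules=300, tokens_per_rule=80, document_tokens=6000)
compact = judge_prompt_tokens(top_k=5, tokens_per_rule=80,
                              description_tokens=150, metadata_tokens=50)
print(full, compact, full / compact)
```

The compact figure covers only the decision call. The projection step adds its own input tokens for the selected text, but that call runs on a lighter model and does not grow with the number of rules.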

This system uses a more structured approach. A lightweight generative model first converts the raw document into a standardized, definition-style description. Retrieval compares that description against enriched tenant-specific rule definitions and returns a compact candidate set. A stronger decision model then reviews only the standardized description, extracted metadata, and Top-K rules, making the final classification cheaper, faster, and easier to inspect.

Input to Output

Raw eTMF document
The workflow receives an uploaded clinical trial document, often as a PDF with OCR-derived text, signatures, dates, page headers, tables, and tenant-specific filing context.
Key text and metadata
The system extracts the most informative text streams, especially first-page and signature-page content, and captures structured hints such as signature presence, dates, document title signals, and other classification-relevant metadata.
Standardized description
A generative semantic projection step filters out instance-specific details and rewrites the document as a normalized description in the style of TMF rule definitions.
Candidate rule set
The standardized description is embedded and searched against the tenant's isolated rule namespace, returning a small Top-K set of enriched candidate rules rather than loading all rules into the prompt.
Final classification
A second LLM compares the standardized description, extracted metadata, and candidate rules to select the best filing classification with a reviewable evidence basis.
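The "Key text and metadata" step above can be sketched as a small page selector. This is a minimal illustration that assumes OCR output arrives as a list of page strings; the regular expressions, the ISO date format, and the `SelectedText` type are hypothetical stand-ins for whatever extraction the real system uses.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical heuristics: a keyword hint for signature pages and ISO-style dates.
SIGNATURE_HINT = re.compile(r"\b(signature|signed by|signatory)\b", re.IGNORECASE)
DATE_HINT = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


@dataclass
class SelectedText:
    first_page: str
    signature_page: Optional[str]
    has_signature: bool
    dates: List[str]


def select_key_text(pages: List[str]) -> SelectedText:
    """Keep the first page and the last page that looks like a signature page."""
    if not pages:
        raise ValueError("document has no pages")
    first_page = pages[0]
    # Search from the end, since signature blocks usually sit near the tail.
    signature_page = next((p for p in reversed(pages) if SIGNATURE_HINT.search(p)), None)
    found = set(DATE_HINT.findall(first_page))
    if signature_page is not None:
        found.update(DATE_HINT.findall(signature_page))
    return SelectedText(first_page, signature_page, signature_page is not None, sorted(found))
```

Middle pages are dropped on purpose here, so the later steps see only the selected streams and the structured hints.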

Architecture Decision

The key architecture decision was to avoid treating the LLM as a giant prompt-based classifier. Loading every rule plus the full document into one model call is simple to prototype, but it becomes costly in production: every added tenant rule and document page raises inference cost, and every taxonomy update means reworking a large prompt.

Instead, the system introduces a generative semantic projection layer between raw document understanding and retrieval. The document is converted from a noisy instance into a rule-like definition, so retrieval becomes definition-to-definition matching. The second model only receives the compact candidate set and supporting metadata, keeping cost and latency lower while preserving enough reasoning capacity for edge cases.
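The two-model split described here can be sketched as a thin orchestration function. This is a sketch under stated assumptions, not the production implementation: `project`, `retrieve`, and `judge` are hypothetical callables standing in for the lightweight projection model, the vector search, and the stronger decision model, and the prompt wording is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    rule_id: str
    enriched_text: str


def build_judge_prompt(description: str, metadata: Dict[str, str],
                       candidates: List[Rule]) -> str:
    """Assemble the compact decision prompt: description, metadata, Top-K rules only."""
    meta_lines = "\n".join(f"- {key}: {value}" for key, value in sorted(metadata.items()))
    rule_lines = "\n".join(f"[{rule.rule_id}] {rule.enriched_text}" for rule in candidates)
    sections = [
        f"Standardized description:\n{description}",
        f"Extracted metadata:\n{meta_lines}",
        f"Candidate rules:\n{rule_lines}",
        "Answer with the single best rule id.",
    ]
    return "\n\n".join(sections)


def classify(selected_text: str,
             metadata: Dict[str, str],
             project: Callable[[str], str],
             retrieve: Callable[[str, int], List[Rule]],
             judge: Callable[[str], str],
             k: int = 5) -> Tuple[Optional[str], List[Rule]]:
    """Project, retrieve Top-K, then let the judge choose among the candidates."""
    description = project(selected_text)       # lightweight model
    candidates = retrieve(description, k)      # definition-to-definition search
    answer = judge(build_judge_prompt(description, metadata, candidates)).strip()
    valid_ids = {rule.rule_id for rule in candidates}
    # An answer outside the candidate set is treated as "no decision" for review.
    return (answer if answer in valid_ids else None), candidates
```

Returning the candidates alongside the answer keeps the decision basis available for human review, and rejecting ids outside the candidate set guards against the judge inventing a rule.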

Core Workflow

Rule enrichment
For each tenant, the system expands rule names and definitions with explicit features, implicit domain signals, synonyms, lifecycle context, and regulatory cues, then stores the resulting rich rule text as vectors in a tenant-isolated rule database.
Document text selection
For each uploaded document, OCR output is reduced to the highest-signal regions, such as the header, first page, tail section, and signature page, so the model avoids unnecessary middle-body noise.
Generative projection
A fast generative model performs feature generalization and style transfer: it removes instance-specific names or incidental values while preserving category-level evidence, producing a standardized rule-like description.
Homogeneous retrieval
The standardized description is embedded and compared with enriched rule vectors in the tenant namespace, turning retrieval from an instance-vs-definition problem into a definition-vs-definition problem.
Evidence decision
A second model receives the standardized description, extracted metadata, and Top-K candidate rules, then performs contextual evidence comparison to produce the final classification.
Low-confidence refinement
When retrieval confidence is low, the system can inspect missing metadata, trigger targeted extraction such as date or signature lookup, revise the query description, and retrieve again before final decision.
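The low-confidence refinement step can be sketched as a bounded retry loop around retrieval. The confidence threshold, the round limit, and the shape of `retrieve` and `extractors` are all assumptions made for illustration; the source does not specify how confidence is scored or how the query is revised.

```python
from typing import Callable, Dict, List, Optional, Tuple

Hit = Tuple[str, float]  # (rule_id, similarity score), best first


def retrieve_with_refinement(
    description: str,
    metadata: Dict[str, Optional[str]],
    retrieve: Callable[[str], List[Hit]],
    extractors: Dict[str, Callable[[], Optional[str]]],
    threshold: float = 0.6,   # hypothetical confidence cut-off
    max_rounds: int = 2,      # hypothetical retry budget
) -> Tuple[List[Hit], str, Dict[str, Optional[str]]]:
    """Retry retrieval with targeted extraction while the top score stays low."""
    known = dict(metadata)    # copy so the caller's metadata is not mutated
    query = description
    hits = retrieve(query)
    for _ in range(max_rounds):
        if hits and hits[0][1] >= threshold:
            break             # confident enough, stop refining
        missing = [name for name in extractors if name not in known]
        if not missing:
            break             # nothing left to look up
        name = missing[0]
        value = extractors[name]()   # e.g. a date or signature lookup
        known[name] = value          # record even None so it is not retried
        if value is not None:
            query = f"{query} {name}: {value}"
            hits = retrieve(query)
    return hits, query, known
```

If the top score is still below the threshold after the loop, the caller can route the document to human review instead of forcing a decision.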

Knowledge and Retrieval Layer

Tenant-specific rule namespaces keep each customer's filing taxonomy isolated while sharing the same projection and decision model parameters.
Rule enrichment expands sparse rule text with explicit features, implicit business-stage signals, domain synonyms, and regulatory associations.
OCR and document intelligence extraction prioritize classification-relevant text regions and structured evidence rather than sending full documents through every step.
The retrieval layer returns a compact Top-K rule set so the decision model avoids long-context attention issues and repetitive rule-token cost.
Alternative paths include reverse generation of synthetic rule examples, hierarchical coarse-to-fine classification, and iterative agentic extraction when the first retrieval pass is uncertain.
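Tenant-isolated Top-K retrieval can be illustrated with a small in-memory index. This sketch swaps in a toy bag-of-words vector and plain cosine similarity in place of a real embedding model and vector database, so only the namespace isolation and Top-K behaviour carry over; `TenantRuleIndex` and its methods are hypothetical names.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TenantRuleIndex:
    """Keeps one rule namespace per tenant; queries never cross namespaces."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, Dict[str, Counter]] = defaultdict(dict)

    def upsert(self, tenant_id: str, rule_id: str, enriched_text: str) -> None:
        self._namespaces[tenant_id][rule_id] = embed(enriched_text)

    def top_k(self, tenant_id: str, description: str, k: int = 5) -> List[Tuple[str, float]]:
        query = embed(description)
        rules = self._namespaces.get(tenant_id, {})  # unknown tenant -> no rules
        scored = [(rule_id, cosine(query, vec)) for rule_id, vec in rules.items()]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

Because the tenant id selects the namespace before any scoring happens, a rule that belongs to another tenant can never appear in the candidate set, while the embedding function stays shared across customers.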

Design Highlights

Generative semantic alignment converts documents from instance space into the same definition space used by classification rules.
Asymmetric retrieval-and-decision architecture uses a lightweight model for high-volume abstraction and a stronger model for focused evidence judgment.
Zero-shot tenant adaptation lets a new tenant onboard by loading and enriching rules, without collecting labeled documents, fine-tuning a tenant-specific classifier, or rewriting a massive prompt by hand.
Tenant-isolated vector namespaces reduce cross-tenant interference while allowing shared model parameters across customers.
Top-K retrieval reduces long-prompt cost and mitigates attention failure when tenants maintain hundreds of filing rules.
Metadata-aware decision making combines text semantics with structured evidence such as signature and date status.
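Zero-shot onboarding rests on turning each tenant rule into rich text that can be embedded. The sketch below only shows how the enriched pieces might be assembled into one string. It assumes the enrichment values have already been produced (for example by a generative model or a domain expert), and the field names and labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnrichedRule:
    name: str
    definition: str
    explicit_features: List[str] = field(default_factory=list)
    implicit_signals: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)
    regulatory_cues: List[str] = field(default_factory=list)


def to_rich_text(rule: EnrichedRule) -> str:
    """Flatten a rule and its enrichment into the text that gets embedded."""
    parts = [f"{rule.name}: {rule.definition}"]
    labelled = (
        ("Explicit features", rule.explicit_features),
        ("Implicit signals", rule.implicit_signals),
        ("Synonyms", rule.synonyms),
        ("Regulatory cues", rule.regulatory_cues),
    )
    for label, values in labelled:
        if values:  # skip empty sections so sparse rules stay compact
            parts.append(f"{label}: {', '.join(values)}")
    return "\n".join(parts)
```

Under this sketch, onboarding a new tenant means building one such text per rule and indexing it in that tenant's namespace, with no labeled documents and no tenant-specific model.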

Evaluation, Validation, and Feedback Loop

The architecture is designed to be evaluated at multiple points rather than only on final classification accuracy. Retrieval quality can be measured by whether the correct rule appears in the Top-K candidate set after semantic projection. Decision quality can be measured by whether the second model selects the correct rule when given the standardized description, metadata, and candidates.
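The two measurement points can be written down as simple metrics. This is a minimal sketch assuming a labeled evaluation set with one gold rule per document; the function names are hypothetical. The second metric scores the decision model only on documents where retrieval already surfaced the correct rule, which separates judge errors from retrieval misses.

```python
from typing import Optional, Sequence


def recall_at_k(candidates: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Share of documents whose gold rule appears in the Top-K candidates."""
    if not gold or len(candidates) != len(gold):
        raise ValueError("need one non-empty, aligned candidate list per gold label")
    hits = sum(1 for ranked, answer in zip(candidates, gold) if answer in ranked[:k])
    return hits / len(gold)


def judge_accuracy_when_recalled(candidates: Sequence[Sequence[str]],
                                 predictions: Sequence[str],
                                 gold: Sequence[str],
                                 k: int) -> Optional[float]:
    """Decision accuracy restricted to documents where the gold rule was retrieved."""
    recalled = [
        (prediction, answer)
        for ranked, prediction, answer in zip(candidates, predictions, gold)
        if answer in ranked[:k]
    ]
    if not recalled:
        return None  # retrieval never surfaced a gold rule, so the judge cannot be scored
    return sum(prediction == answer for prediction, answer in recalled) / len(recalled)
```

End-to-end accuracy is then roughly the product of the two: a document is classified correctly only if retrieval surfaces the right rule and the judge picks it.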

Operational validation focuses on the main production failure modes: manual taxonomy upkeep, direct embedding mismatch, OCR noise sensitivity, long-context attention loss, tenant onboarding cost, and low-confidence edge cases. This gives the system a clearer reliability surface than treating classification as a single prompt outcome.

My Role

Led the technical framing for the generative semantic projection classification method, which later became the basis for a patent filing.
Defined the shift from manual classification and brute-force long-context prompting toward semantic abstraction, retrieval, and focused LLM decision making.
Mapped eTMF operational constraints into a multi-tenant rule architecture with tenant isolation and shared model parameters.
Designed the rule enrichment, document projection, candidate retrieval, metadata evidence, and low-confidence refinement flow.
Worked with engineering and domain experts to translate the concept into an implementable product architecture for iCTA eTMF workflows.

Impact

Improved the expected recall quality of eTMF classification by aligning document and rule representations before vector search.
Reduced manual classification and taxonomy-maintenance burden by turning customer-specific rules into a searchable, reusable AI layer.
Reduced onboarding friction for new tenants by avoiding tenant-specific fine-tuning, historical labeled-data requirements, and large handcrafted classification prompts.
Lowered prompt cost and latency by replacing full-taxonomy long-context prompting with compact Top-K candidate review.
Made ambiguous classification decisions more inspectable by preserving standardized descriptions, metadata evidence, and candidate rules.
Created a reusable document intelligence pattern for dynamic-rule environments beyond a single customer's taxonomy.