
Project

TLF Reporting Automation Agents

LangGraph-based agentic system for clinical study reporting, designed to turn table, figure, and listing shell specifications into reviewable SAS programming workflows. The system connects shell-level reporting intent with ADaM data structures, retrieves reusable macros and reference programs, coordinates coding agents, and uses benchmark-driven evaluation plus runtime feedback from SAS execution.

LangGraph · Agentic workflow · Stateful orchestration · Progressive context loading · Execution feedback loop

At A Glance

  • Domain: Clinical study TLF reporting and statistical programming automation
  • Core stack: LangGraph, FastAPI, SSE, PostgreSQL checkpointing, Docker, SAS generation
  • Inputs: PDF-derived shell JSON, ADaM specifications, macro profiles, implementation guidance, reference SAS programs, benchmark cases
  • Outputs: Generated SAS code and a reviewable reasoning trace, with supporting intermediate state available when domain experts need to inspect or correct the workflow
  • Production concerns: Traceability, human review, resumable workflows, benchmark evaluation, SAS runtime feedback

Problem Framing

Clinical study reporting requires statistical programmers to translate table, figure, and listing shells into executable SAS programs. That translation is not just a coding task: programmers must interpret visual table structure, infer reporting intent, map rows and columns to ADaM datasets and variables, select reusable display macros, handle gaps with custom logic, and validate outputs against study conventions.

This system takes shell specifications, ADaM metadata, macro knowledge, reference programs, and study context as input, then turns them into a structured LangGraph workflow for generating reviewable SAS code. Instead of asking one agent to write a complete program in a single pass, the workflow separates shell understanding, ADaM matching, implementation planning, code assembly, human review, and post-review copilot correction.

The design goal is production-oriented clinical reporting automation: preserve domain control, expose intermediate decisions, reuse existing programming assets, and connect agentic code generation with benchmark-driven evaluation and execution-feedback-based optimization.

Scale and Review Burden

A clinical study reporting package can contain tens to hundreds of TLF outputs. Each table, listing, or figure requires shell interpretation, ADaM data mapping, SAS implementation, output formatting, and review against the original reporting intent.

The review burden is also substantial. Many outputs require independent QC or dual-programming cross-checks, where a primary program and a QC implementation are compared. When results differ, teams must trace the issue back through shell interpretation, data filters, statistical logic, macro behavior, or output formatting.

The value of this system is therefore not simply faster code generation. It turns repetitive interpretation, implementation planning, code assembly, quality review, human feedback, and runtime signals into a traceable workflow that can scale across many reporting outputs with more consistency.

Input to Output

  1. Shell specification

    The workflow starts from PDF- or Word-derived TLF shell structures represented as JSON, preserving row hierarchy, column hierarchy, placeholders, titles, footnotes, and programming notes.

  2. Parsed table model

    Preprocessing converts the shell into a structured table model with blocks, groups, columns, constraints, page-by variables, and table-level scope that downstream agents can inspect and revise.

  3. ADaM and macro decisions

    The system maps reporting requirements to ADaM datasets and variables, retrieves candidate macros and reference programs, and selects an implementation approach across reusable macro calls, preprocessing, composition, or custom code.

  4. SAS code and reasoning trace

    The core output is generated SAS code plus a reviewable reasoning trace that explains how shell intent, data assumptions, macro choices, and code assembly decisions led to the final program. Intermediate state remains available for expert inspection, but it is supporting context rather than the primary deliverable.
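
As an illustration of the parsed table model in step 2, the following is a minimal sketch of how shell JSON might be turned into a structured model. The class names, field names, JSON keys, and sample shell are all hypothetical; the actual schema used by the system is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One group of shell rows sharing a reporting intent (hypothetical shape)."""
    block_id: str
    label: str
    rows: list[str]
    constraints: list[str] = field(default_factory=list)


@dataclass
class TableModel:
    """Parsed shell that downstream agents could inspect and revise."""
    title: str
    columns: list[str]
    blocks: list[Block]
    page_by: list[str] = field(default_factory=list)
    footnotes: list[str] = field(default_factory=list)


def parse_shell(shell: dict) -> TableModel:
    """Convert shell JSON (assumed keys) into a TableModel."""
    blocks = [
        Block(
            block_id=f"b{i}",
            label=raw["label"],
            rows=list(raw.get("rows", [])),
            constraints=list(raw.get("constraints", [])),
        )
        for i, raw in enumerate(shell.get("blocks", []), start=1)
    ]
    return TableModel(
        title=shell["title"],
        columns=list(shell.get("columns", [])),
        blocks=blocks,
        page_by=list(shell.get("page_by", [])),
        footnotes=list(shell.get("footnotes", [])),
    )


# Illustrative shell only; not taken from a real study.
shell_json = {
    "title": "Demographics and Baseline Characteristics",
    "columns": ["Placebo", "Active", "Total"],
    "blocks": [
        {"label": "Age (years)", "rows": ["n", "Mean (SD)", "Median"]},
        {"label": "Sex", "rows": ["Male", "Female"], "constraints": ["SAFFL = 'Y'"]},
    ],
    "footnotes": ["Percentages are based on the safety population."],
}
model = parse_shell(shell_json)
```

Giving each block a stable identifier is what would let later stages and reviewers refer to a single block when they patch or regenerate it.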

Architecture Decision

The key architecture decision was to use explicit graph-based orchestration rather than a single unconstrained ReAct agent. TLF generation has long business logic chains, intermediate decisions that domain experts must inspect, and correction points that should not require rerunning the entire process.

A LangGraph workflow gives the system node-level traceability, structured parallelism, checkpoint persistence, human interrupt, targeted resume, and clear ownership for each decision stage. The LLM still has autonomy inside nodes, but the overall control flow remains explicit and auditable.
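
The checkpoint, interrupt, and targeted-resume behavior can be illustrated without LangGraph itself. The sketch below is plain Python, not the LangGraph API. The node names follow the stages described in this document, while `PIPELINE`, `PAUSE_BEFORE`, `run`, the state keys, and the node bodies are hypothetical stand-ins.

```python
import copy
from typing import Callable

State = dict
Node = Callable[[State], dict]


def preprocess(state: State) -> dict:
    return {"table_model": {"blocks": ["demographics"]}}


def adam_matcher(state: State) -> dict:
    # Deliberately wrong guess so the review step has something to correct.
    return {"mapping": {"demographics": "ADAE"}}


def code_generator(state: State) -> dict:
    dataset = state["mapping"]["demographics"]
    return {"sas_code": f"data work.demo; set adam.{dataset}; run;"}


PIPELINE: list[tuple[str, Node]] = [
    ("preprocess", preprocess),
    ("adam_matcher", adam_matcher),
    ("code_generator", code_generator),
]
PAUSE_BEFORE = {"code_generator"}  # hypothetical human-review point


def run(state: State, checkpoints: list[State], resume: bool = False) -> State:
    """Run nodes in order, checkpoint after each, and pause before review points."""
    state = copy.deepcopy(state)
    start = state.get("next", 0)
    for index in range(start, len(PIPELINE)):
        name, node = PIPELINE[index]
        if name in PAUSE_BEFORE and not (resume and index == start):
            state["next"] = index  # remember where to resume
            state["status"] = "paused"
            return state
        state.update(node(state))
        state["trace"] = state.get("trace", []) + [name]
        state["next"] = index + 1
        checkpoints.append(copy.deepcopy(state))
    state["status"] = "done"
    return state


checkpoints: list[State] = []
paused = run({"shell": "t_demo.json"}, checkpoints)

# A reviewer corrects the mapping; only the remaining node runs on resume.
paused["mapping"]["demographics"] = "ADSL"
final = run(paused, checkpoints, resume=True)
```

In this sketch the correction is applied to saved state and execution continues from the paused node, so the earlier stages are not rerun.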

Core Workflow

  1. Preprocess

    The workflow analyzes structured shell JSON derived from TLF templates, including row hierarchy, column structure, placeholders, titles, footnotes, grouping variables, page-by variables, and programming annotations.

  2. AdamMatcher

    The system matches table blocks to ADaM datasets, variables, treatment groups, table-level where clauses, and page-by constraints so code generation is grounded in the actual analysis data model.

  3. CodeGenerator

    A parent coding agent decides how to cluster table blocks, which display or low-level macros to use, and what implementation plan best handles the gap between shell requirements and reusable macro capabilities.

  4. Parallel exec_block agents

    Child coding agents generate cluster-level SAS components in parallel, then the parent workflow assembles local setup, macro calls, post-processing logic, and final table code into a reviewable program.

  5. HumanReview and Copilot

    A human review checkpoint exposes intermediate decisions and generated code. Post-review feedback flows through a copilot path that can answer questions, patch state, modify groups or blocks, and trigger local regeneration without rerunning the full workflow.
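
The parent and child coding steps (3 and 4) follow a fan-out and fan-in pattern. The sketch below shows one way that pattern could look, using a thread pool in place of real agent calls. The macro names `m_summary` and `m_freq`, the block fields, and all function names are hypothetical, and `exec_block` here is a deterministic stand-in for an LLM-backed child agent.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_clusters(blocks: list[dict]) -> dict[str, list[dict]]:
    """Parent step: group blocks by the macro assumed to implement them."""
    clusters: dict[str, list[dict]] = {}
    for block in blocks:
        clusters.setdefault(block["macro"], []).append(block)
    return clusters


def exec_block(macro: str, blocks: list[dict]) -> str:
    """Child step: emit one macro call per block in the cluster."""
    calls = [f"%{macro}(data=adam.{b['dataset']}, var={b['var']});" for b in blocks]
    return "\n".join(calls)


def assemble(blocks: list[dict]) -> str:
    """Fan out one child per cluster, then join fragments in planning order."""
    clusters = plan_clusters(blocks)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            macro: pool.submit(exec_block, macro, group)
            for macro, group in clusters.items()
        }
        # Read results in cluster order so the output does not depend on
        # which child finishes first.
        fragments = [futures[macro].result() for macro in clusters]
    return "\n".join(["/* local setup */", *fragments, "/* final table assembly */"])


blocks = [
    {"dataset": "ADSL", "var": "AGE", "macro": "m_summary"},
    {"dataset": "ADSL", "var": "SEX", "macro": "m_freq"},
    {"dataset": "ADSL", "var": "RACE", "macro": "m_freq"},
]
program = assemble(blocks)
```

Collecting results in planning order rather than completion order keeps the assembled program reproducible across runs, which matters when reviewers compare regenerated code.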

Knowledge and Retrieval Layer

  • ADaM specifications ground dataset, variable, treatment group, page-by, and table-scope decisions.
  • Macro profiles describe reusable display and low-level macros, including capabilities, constraints, parameters, and examples.
  • Implementation guidance helps the agent decide when to use direct macro invocation, preprocessing, macro composition, or custom SAS assembly when macro coverage is incomplete.
  • Reference programs and benchmark cases provide concrete implementation patterns for retrieval and comparison.
  • Skills and staged context loading keep the model focused by loading judgment context (what each macro or approach is suited for) before detailed interface context (parameters and examples).
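
Staged context loading can be sketched as two prompt-building steps: a compact summary of every macro first, then full interface detail only for the macros selected. The profiles, macro names, keyword-based selection, and function names below are hypothetical; in the real system the selection would be a model judgment rather than a keyword match.

```python
# Hypothetical macro profiles: a short summary for judgment, and a longer
# interface description that is only loaded once a macro is selected.
MACRO_PROFILES = {
    "m_freq": {
        "summary": "Counts and percentages for categorical variables",
        "keywords": {"categorical", "count", "percentage"},
        "interface": "%m_freq(data=, var=, by=, denom=)",
    },
    "m_summary": {
        "summary": "Descriptive statistics for continuous variables",
        "keywords": {"continuous", "mean", "median"},
        "interface": "%m_summary(data=, var=, by=, stats=)",
    },
}


def judgment_context(profiles: dict) -> str:
    """Stage 1: one line per macro, enough to decide which ones apply."""
    return "\n".join(f"{name}: {p['summary']}" for name, p in sorted(profiles.items()))


def select_macros(requirement: str, profiles: dict) -> list[str]:
    """Stand-in for the model's judgment: match requirement words to keywords."""
    words = set(requirement.lower().split())
    return [name for name, p in sorted(profiles.items()) if words & p["keywords"]]


def interface_context(profiles: dict, selected: list[str]) -> str:
    """Stage 2: detailed interfaces for the selected macros only."""
    return "\n".join(profiles[name]["interface"] for name in selected)


requirement = "n (%) by treatment for categorical variable SEX"
stage1 = judgment_context(MACRO_PROFILES)
selected = select_macros(requirement, MACRO_PROFILES)
stage2 = interface_context(MACRO_PROFILES, selected)
```

The second stage never includes interfaces for macros that were ruled out, which keeps the detailed context proportional to the decision already made.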

Reliability and Productionization

  • Checkpoint-backed graph execution supports long-running workflows, interruption, review, resume, and targeted regeneration.
  • FastAPI exposes the production business API, while LangGraph Studio remains a separate debugging surface.
  • SSE streaming supports post-review copilot interaction and makes agent progress visible to the frontend.
  • Deterministic completeness gates, quality scoring, and validation tests reduce reliance on subjective inspection alone.
  • Docker deployment, configurable providers, and service-layer boundaries make the system easier to integrate into enterprise environments.
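
A deterministic completeness gate can be as simple as a function that returns a list of findings, where an empty list means the program passes. The sketch below assumes a hypothetical convention in which generated code tags each table block with a `/* block: <id> */` comment; the function name, the tag format, and the two checks are illustrative and not the system's actual gate.

```python
import re

BLOCK_TAG = re.compile(r"/\*\s*block:\s*(\w+)\s*\*/")
ADAM_REF = re.compile(r"\badam\.(\w+)", flags=re.IGNORECASE)


def completeness_gate(
    block_ids: list[str], sas_code: str, known_datasets: set[str]
) -> list[str]:
    """Return findings; an empty list means the gate passes."""
    findings = []

    # Check 1: every planned block has a tagged section in the program.
    tagged = set(BLOCK_TAG.findall(sas_code))
    for block_id in block_ids:
        if block_id not in tagged:
            findings.append(f"missing code for block {block_id}")

    # Check 2: every referenced ADaM dataset exists in the specification.
    for dataset in ADAM_REF.findall(sas_code):
        if dataset.upper() not in known_datasets:
            findings.append(f"unknown ADaM dataset {dataset.upper()}")

    return findings


code = (
    "/* block: b1 */\n"
    "data a; set adam.ADSL; run;\n"
    "/* block: b2 */\n"
    "data b; set adam.ADXX; run;"
)
findings = completeness_gate(["b1", "b2", "b3"], code, {"ADSL", "ADAE"})
```

Because the checks are pure functions of the plan and the generated text, the same input always yields the same findings, which makes the gate suitable as a regression test.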

Evaluation, Validation, and Feedback Loop

The current project includes a Benchmark module for repeatable evaluation across study scenarios, shell structures, ADaM specifications, reference SAS programs, and generated outputs. This gives the system a stronger evaluation surface than one-off demo inspection: generated code can be compared against curated cases, known shell patterns, expected mappings, and reference implementation behavior.

The next step is to make evaluation itself agentic. An evaluation agent can review generated SAS code against shell intent, ADaM dataset and variable mappings, macro decisions, implementation choices, code completeness, and benchmark expectations. This shifts validation from a single quality check into a structured review workflow that can explain what is wrong, where the mismatch happened, and which upstream decision should be corrected.

The production loop connects the workflow to a SAS runtime environment. Execution logs, syntax errors, warnings, output dataset status, and output differences can flow back into the graph as concrete feedback signals. That creates a closed loop: generate code, run it, evaluate the result, repair the implementation, and rerun until the program converges toward a valid and reviewable reporting output.
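
The generate, run, evaluate, repair cycle can be sketched as a bounded loop around a log parser. The example below relies on the SAS log convention of prefixing problems with `ERROR:` or `WARNING:` (optionally with a numeric code, as in `ERROR 180-322:`). Everything else is an assumption: `run_sas` and `repair` are injected callables, and the fake runtime and fake repair step exist only so the sketch runs without a SAS installation.

```python
import re

# Matches "ERROR: ...", "WARNING: ...", and coded forms such as "ERROR 180-322: ...".
LOG_ISSUE = re.compile(r"^(ERROR|WARNING)(?: \d+-\d+)?: (.*)$")


def parse_log(log: str) -> dict[str, list[str]]:
    """Collect error and warning messages from a SAS log."""
    issues: dict[str, list[str]] = {"ERROR": [], "WARNING": []}
    for line in log.splitlines():
        match = LOG_ISSUE.match(line.strip())
        if match:
            issues[match.group(1)].append(match.group(2))
    return issues


def repair_loop(code, run_sas, repair, max_rounds=3):
    """Run, inspect the log, repair, and rerun until clean or out of rounds."""
    history = []
    for _ in range(max_rounds):
        issues = parse_log(run_sas(code))
        history.append(issues)
        if not issues["ERROR"]:
            return code, history, True
        code = repair(code, issues["ERROR"])
    return code, history, False


# Fake runtime and fake repair step, for illustration only.
def fake_run_sas(code: str) -> str:
    if "adam.ADSLL" in code:
        return "ERROR: File ADAM.ADSLL.DATA does not exist."
    return "NOTE: DATA statement used (Total process time):"


def fake_repair(code: str, errors: list[str]) -> str:
    return code.replace("ADSLL", "ADSL")


final_code, history, converged = repair_loop(
    "data work.demo; set adam.ADSLL; run;", fake_run_sas, fake_repair
)
```

The round limit and the returned history are the two design points worth keeping in any real version: the limit prevents an unbounded repair cycle, and the history gives reviewers the sequence of runtime signals behind the final program. This parser ignores multi-line log messages, which a production version would need to handle.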

My Role

  • Led product framing and system design for the clinical reporting automation workflow.
  • Defined the graph-based agent architecture, including shell understanding, ADaM matching, code generation, human review, and copilot correction paths.
  • Shaped the knowledge strategy for macro retrieval, reference program usage, skill loading, benchmark cases, and staged context management.
  • Worked with engineering and statistical programming experts to translate TLF shell interpretation and SAS implementation requirements into production-ready workflow components.
  • Designed the validation and evaluation direction, including benchmark-based assessment, review checkpoints, and SAS-runtime feedback loops.
  • Guided cross-functional execution from exploratory prototype toward API-backed deployment and regulated workflow readiness.

Impact

  • Improved consistency and reviewability in complex clinical reporting workflows by making shell interpretation, ADaM mapping, macro selection, and generated code visible as intermediate artifacts.
  • Improved scalability across reporting packages with tens to hundreds of outputs, where small reductions in interpretation, QC comparison, and repair effort compound across the study.
  • Reduced repetitive effort in shell interpretation, implementation planning, macro lookup, and SAS program assembly.
  • Created reusable agentic automation patterns across studies, table types, macro libraries, and reporting outputs.
  • Established a benchmark-oriented path for evaluating generated code beyond subjective review.
  • Set up a scalable direction for closing the loop between generation, SAS runtime feedback, evaluation, and code repair.