Agentic Pipeline for Document Review
TL;DR – An uploaded term sheet flows through a five-stage async pipeline that decomposes it into a scored clause graph. A copilot explains the deterministic analysis in plain language. A LangGraph state machine generates term sheets from natural-language deal descriptions. The LLM never picks the band; it explains what the engine already decided.
For links to all the posts in this series, see Babel.
The Decomposition Pipeline
PARSE_DOC → CHUNK_EMBED → EXTRACT_NORMALIZE → BAND_MAP_GRAPH → ANALYZE
parsed chunked extracted graphed analyzed
Each stage is a job in a PostgreSQL-backed queue, idempotent and retriable. The pipeline runs to completion without human intervention once the upload fires.
Stage 1 — Parse. Purpose: turn arbitrary document formats into structured blocks the rest of the pipeline can use. Downloads the blob from Supabase Storage and attempts structured parsing via Docling (IBM), falling back to PyMuPDF for PDFs or Mammoth/python-docx for DOCX files. Each fallback strategy cascades until one succeeds. Output: a blocks[] array (headings, paragraphs, tables with row/column metadata) and plaintext stored on the document record.
Stage 2 — Chunk & Embed. Purpose: create searchable units and optional embeddings for semantic retrieval. Each block becomes a chunk with block_id, page, kind, and text. If embeddings are enabled, 1536-dim vectors are generated and stored as pgvector columns with an IVFFlat index on cosine distance for approximate nearest-neighbour search.
Stage 3 — Extract & Normalize. Purpose: identify clause boundaries and extract structured attributes for banding. Two-phase clause detection. First, regex extraction classifies sections by heading lookup (CANONICAL_HEADING_MAP) and body-pattern matching (BODY_HINTS), and pulls structured attributes (days, percentages, multiples, participation type) from clause text. Second, LLM normalisation coerces the output at temperature 0.0; the LLM only normalises what the regex already extracted (e.g. standardising phrasing), it does not override clause keys or band choice, so determinism is preserved. Clauses are inserted into the database and linked to their source chunks.
Stage 4 — Band, Map, Graph. Purpose: build a navigable graph of clauses with band/posture and cross-clause links. Builds the clause decomposition graph. Three node types: a document root, category nodes (nine negotiation buckets), and clause nodes enriched with band, badge, and tilt from the scoring engine. Hierarchical edges connect doc → category → clause. Second-order edges encode cross-clause trade relationships:
SECOND_ORDER_LINKS = {
"rofr": ["tag_along", "rofo"],
"tag_along": ["rofr", "drag_along"],
"drag_along": ["exit"],
"exclusivity": ["rofo", "rofr", "tag_along"],
"anti_dilution": ["pay_to_play"],
"liquidation_preference": ["reserved_matters", "board"],
}
Links are symmetric in the sense that if A links to B, B typically links back (e.g. rofr ↔ tag_along). They are only added when both clause nodes exist in the current document, so the graph adapts to each term sheet’s content. Output: a Cytoscape.js-compatible graph_json.
Stage 5 — Analyse. Purpose: run the consensus engine on every clause and persist posture, score, and trades. Runs the deterministic analysis per clause: extract attributes → match band → compute composite score → determine posture (±0.2 threshold) → upsert into the analyses table. The document is now fully processed.
The Graph as Negotiation Map
The frontend renders the graph with Cytoscape.js using a force-directed layout. Document node at the centre, category nodes sized by clause count, clause nodes as rectangles with band badges and posture-coloured borders. Clicking a clause highlights its neighbourhood and triggers copilot analysis. The second-order edges surface trade possibilities: clicking exclusivity shows its links to rofo, rofr, and tag_along. A “highlight non-market” toggle fades balanced clauses to foreground contentious terms.
The Copilot
The copilot’s core design decision: the LLM explains the deterministic result, it doesn’t pick the band. The engine has already matched the value to a band, scored it, and determined posture. The LLM translates that into prose a non-technical founder or investor can understand – and adds colour (risks, leverage dynamics, precedent) that the math alone doesn’t capture.
The clause analysis route supports a ?reasoned= parameter: reasoned=false returns pure deterministic banding (fast, no LLM call); reasoned=true computes the deterministic result first, then the copilot explains it. The LLM is always an overlay, never the source of truth.
Term Sheet Generation via LangGraph
Babel can also generate term sheets from a natural-language deal description. The generator is built with LangGraph as a stateful, multi-step workflow:
class DealState(TypedDict):
nl_input: str # "Series A, $5M at $20M pre"
overrides: Optional[DealOverrides] # Parsed deal terms
deal: Optional[DealConfig] # Full config with defaults
validation_errors: List[str] # Any issues
selected_clause_ids: List[str] # Which templates to use
rendered_term_sheet: Optional[str] # Final HTML output
clarification_questions: Optional[List[str]] # If input is ambiguous
input → parse_nl → apply_defaults → validate_deal → select_clauses → render_ts → output → END
Each node reads and writes to the shared DealState. The LLM extracts structured deal terms from natural language at temperature 0.0, defaults are merged from a market-standard base config, validation checks business rules, clause templates are selected and deduplicated, and the final term sheet is rendered as HTML. The graph is linear for now but the architecture supports clarification loops.
Design Patterns (agentic behaviour)
Here “agentic” means multi-step workflows that run without human intervention between steps — goal-directed only in the sense that each step has a clear input/output contract and the next step is enqueued on success. Babel doesn’t use a single “agent” framework; that behaviour emerges from several patterns:
| Pattern | Where | How |
|---|---|---|
| Multi-step state machine | ts_generator/graph.py |
LangGraph with typed state, node-per-stage, conditional-edge-ready |
| Tool-augmented reasoning | copilot_service.py |
LLM receives deterministic banding output as structured context |
| Structured extraction | parse_nl node |
LLM → JSON schema (DealOverrides); temperature 0.0 |
| Fallback chains | parse_docling.py, parse_docx.py |
Docling → PyMuPDF → Mammoth → python-docx → XML → plaintext |
| Prompt grounding | analyze_clause() |
Band data injected into the prompt; the LLM explains, it doesn’t invent |
| Idempotent job chaining | workers/runner.py |
Each handler enqueues the next stage; crash at any point and it resumes |
| Deterministic + LLM split | ?reasoned= flag |
Core analysis is always deterministic; LLM is an optional layer |
The term-sheet generator is the clearest agentic example: unstructured input → LLM extraction → validation → template selection → rendered document, all without human intervention. The worker pipeline is agentic in a systems sense: five chained idempotent jobs, each capable of independent retry, coordinated by a queue.
Technology Stack
| Layer | Technology | Role |
|---|---|---|
| Frontend | React + TypeScript, Zustand, Cytoscape.js | SPA with graph visualisation, state management, chat |
| API | FastAPI (Python, async) | REST endpoints with CORS, OpenAPI docs |
| LLM Orchestration | LangGraph | State-machine workflow for term sheet generation |
| LLM Provider | OpenRouter (DeepSeek v3.1 default) | Chat completions, structured extraction |
| Database | Supabase (PostgreSQL + pgvector) | Relational storage, RLS, vector search |
| Document Parsing | Docling (IBM), PyMuPDF, Mammoth, python-docx | Multi-strategy parsing with cascading fallbacks |
| BATNA Engine | TypeScript (packages/batna/) |
Band matching, composite scoring, shared with frontend |
| Job Queue | PostgreSQL (FOR UPDATE SKIP LOCKED) |
Five-stage async worker pipeline |
| ORM | SQLAlchemy (async, asyncpg) |
Database access with session management |