Document Extraction API - Quick Start Guide

Overview

The Document Extraction API is a REST API that accepts PDF, JPEG, and PNG files without requiring the caller to declare the document type. It uses AI to classify each upload, then routes it to the matching extraction workflow based on a configurable document type registry.

Getting Started

1. API Key Authentication

All requests require a valid API key in the Authorization header:

Authorization: Bearer dk_live_your_api_key_here

2. Base URL

Production: https://api.extractable.xyz

Core Endpoints

Submit Document for Processing

POST /api/documents/extract
Content-Type: multipart/form-data
Authorization: Bearer dk_live_your_api_key_here

file: document.pdf
webhookUrl: https://your-domain.com/webhook (optional)
metadata: {"clientId": "12345"} (optional)

Response (202 Accepted):

{
  "jobId": "123e4567-e89b-12d3-a456-426614174000",
  "status": "pending",
  "pollingUrl": "https://api.extractable.xyz/api/documents/extract/123e4567-e89b-12d3-a456-426614174000",
  "createdAt": "2024-01-01T00:00:00Z",
  "message": "Document submitted for processing. Document extraction pipeline initiated."
}
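The same submission can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names build_multipart and submit_document are hypothetical, and the form field names (file, webhookUrl, metadata) are taken from the request shown above.

```python
import json
import uuid
import urllib.request

BASE_URL = "https://api.extractable.xyz"


def build_multipart(fields, file_name, file_bytes, content_type="application/pdf"):
    """Encode text fields plus one file as multipart/form-data.

    Returns a (body_bytes, content_type_header) tuple.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    # The file part carries its own filename and content type.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n".encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def submit_document(api_key, file_name, file_bytes, metadata=None, webhook_url=None):
    """POST a document to /api/documents/extract and return the parsed JSON reply."""
    fields = {}
    if webhook_url:
        fields["webhookUrl"] = webhook_url
    if metadata is not None:
        fields["metadata"] = json.dumps(metadata)
    body, content_type = build_multipart(fields, file_name, file_bytes)
    request = urllib.request.Request(
        f"{BASE_URL}/api/documents/extract",
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as submit_document(api_key, "document.pdf", pdf_bytes, metadata={"clientId": "12345"}) would return the 202 body shown above, from which jobId can be read for later status checks.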

Check Processing Status

GET /api/documents/extract/{jobId}
Authorization: Bearer dk_live_your_api_key_here

Response (200 OK):

{
  "jobId": "123e4567-e89b-12d3-a456-426614174000",
  "status": "completed",
  "fileName": "statement_of_repair.pdf",
  "fileType": "pdf",
  "documentType": "sor",
  "classificationConfidence": 0.95,
  "progress": 100,
  "extractedData": {
    "documentMetadata": {
      "consultant": "ABC Construction Consulting",
      "propertyAddress": "123 Main St, Anytown, ST 12345",
      "borrowerName": "John Doe",
      "lenderName": "First National Bank"
    },
    "constructionSections": [
      {
        "sectionName": "1.0 GENERAL CONDITIONS",
        "lineItems": [
          {
            "description": "Permits and inspections",
            "quantity": 1,
            "unit": "LS",
            "materialCost": 500.00,
            "laborCost": 0.00,
            "totalCost": 500.00
          }
        ]
      }
    ],
    "recapSummary": {
      "subtotal": 45000.00,
      "generalConditions": 4500.00,
      "overheadAndProfit": 7425.00,
      "grandTotal": 56925.00
    }
  },
  "createdAt": "2024-01-01T00:00:00Z",
  "updatedAt": "2024-01-01T00:00:45Z",
  "completedAt": "2024-01-01T00:00:45Z"
}
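The sample payload above is internally consistent: the line item's totalCost equals materialCost plus laborCost, and grandTotal equals subtotal plus generalConditions plus overheadAndProfit. This guide does not state that the API guarantees those relationships, so the following Python sketch is only an optional client-side sanity check built on that assumption. The function name check_totals is illustrative.

```python
import math


def check_totals(extracted_data, tolerance=0.01):
    """Return a list of inconsistencies found in an extractedData payload.

    Assumes totalCost = materialCost + laborCost and
    grandTotal = subtotal + generalConditions + overheadAndProfit,
    which holds for the sample response but is not a documented guarantee.
    """
    problems = []
    for section in extracted_data.get("constructionSections", []):
        for item in section.get("lineItems", []):
            expected = item["materialCost"] + item["laborCost"]
            if not math.isclose(expected, item["totalCost"], abs_tol=tolerance):
                problems.append(
                    f'{section["sectionName"]}: "{item["description"]}" total mismatch'
                )
    recap = extracted_data.get("recapSummary")
    if recap:
        expected = (
            recap["subtotal"] + recap["generalConditions"] + recap["overheadAndProfit"]
        )
        if not math.isclose(expected, recap["grandTotal"], abs_tol=tolerance):
            problems.append("recapSummary: grandTotal mismatch")
    return problems
```

An empty list means no mismatch was detected; any entries are worth routing to manual review rather than treating as API errors.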

List Supported Document Types

GET /api/documents/types
Authorization: Bearer dk_live_your_api_key_here

Response (200 OK):

{
  "supportedTypes": [
    {
      "typeCode": "sor",
      "displayName": "Statement of Repair",
      "description": "HUD 203(k) Statement of Repair documents",
      "confidenceThreshold": 0.85,
      "extractionSupported": true
    },
    {
      "typeCode": "insurance_cert",
      "displayName": "Insurance Certificate",
      "description": "ACORD Insurance Certificate forms",
      "confidenceThreshold": 0.80,
      "extractionSupported": true
    }
  ],
  "capabilities": [
    "AI-powered document classification",
    "Structured data extraction",
    "Webhook notifications",
    "Secure file processing"
  ]
}

Key Features

Universal Document Acceptance

  • Upload any supported file (PDF, JPEG, PNG) without specifying the document type
  • AI automatically identifies document type
  • Supports both single and multi-page documents
  • Handles unknown document types gracefully

Supported Document Types

  • Statement of Repair (SOR): HUD 203(k) documents with detailed construction scope
  • Insurance Certificates: ACORD forms with coverage details
  • Extensible Registry: Easy to add new document types

AI-Powered Classification

  • Uses advanced AI models for document identification
  • Returns confidence scores for transparency
  • Provides alternative type suggestions
  • Configurable confidence thresholds
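One way a client might use the confidence score is to compare the classificationConfidence on a job against the confidenceThreshold listed for that type by GET /api/documents/types. How the service itself applies thresholds is not described in this guide, so the Python sketch below is an assumption about client-side use; the function name classification_is_confident is hypothetical.

```python
def classification_is_confident(job, supported_types):
    """Return True if the job's confidence meets its type's listed threshold.

    `job` is a parsed status response; `supported_types` is the
    "supportedTypes" list from GET /api/documents/types.
    Returns False when the document type is not in that list.
    """
    thresholds = {t["typeCode"]: t["confidenceThreshold"] for t in supported_types}
    threshold = thresholds.get(job.get("documentType"))
    if threshold is None:
        return False
    return job.get("classificationConfidence", 0.0) >= threshold
```

With the sample responses in this guide, a "sor" job at 0.95 confidence passes the 0.85 threshold; a job that falls below its threshold could be flagged for manual review.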

Error Handling

Common Error Responses

401 Unauthorized:

{
  "error": "unauthorized",
  "message": "Invalid or missing API key"
}

400 Bad Request:

{
  "error": "invalid_request",
  "message": "No file provided"
}

413 Payload Too Large:

{
  "error": "file_too_large",
  "message": "File size exceeds limit of 50MB for PDF files"
}

429 Too Many Requests:

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please try again later."
}
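Because every error body above shares the same two fields (error and message), a client can convert them into one exception type. The Python sketch below is illustrative: the class name ExtractionApiError and the function parse_error are hypothetical, and treating only 429 as retryable is an assumption based on its "try again later" message.

```python
import json

# Assumption: only rate-limit responses are worth retrying automatically.
RETRYABLE_STATUSES = {429}


class ExtractionApiError(Exception):
    """An error response from the API, with its status and error code attached."""

    def __init__(self, status, code, message):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message
        self.retryable = status in RETRYABLE_STATUSES


def parse_error(status, body_text):
    """Build an ExtractionApiError from an HTTP status and raw response body."""
    try:
        payload = json.loads(body_text)
    except (ValueError, TypeError):
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    return ExtractionApiError(
        status,
        payload.get("error", "unknown_error"),
        payload.get("message", body_text),
    )
```

Callers can then branch on the retryable flag: back off and resend for 429, and surface 400, 401, and 413 to the user, since resending the same request would fail again.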

File Requirements

Supported Formats

  • PDF: Up to 50MB
  • JPEG/JPG: Up to 10MB
  • PNG: Up to 10MB
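Checking these limits before uploading avoids a round trip that would end in a 413 response. The Python sketch below is a minimal pre-flight check; the name validate_upload is hypothetical, and it assumes 1 MB means 1,000,000 bytes, the more conservative reading, because this guide does not define the unit.

```python
import os

# Assumption: 1 MB = 1,000,000 bytes (conservative; the guide does not say).
MB = 1_000_000
SIZE_LIMITS = {
    ".pdf": 50 * MB,
    ".jpg": 10 * MB,
    ".jpeg": 10 * MB,
    ".png": 10 * MB,
}


def validate_upload(file_name, size_bytes):
    """Return None if the file looks acceptable, otherwise a reason string."""
    extension = os.path.splitext(file_name)[1].lower()
    limit = SIZE_LIMITS.get(extension)
    if limit is None:
        return f"unsupported format: {extension or 'no extension'}"
    if size_bytes > limit:
        return f"file is {size_bytes} bytes; limit for {extension} is {limit} bytes"
    return None
```

In practice size_bytes would come from os.path.getsize(path). The check relies on the file extension only, so it does not detect a mislabeled or corrupted file.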

Best Practices

  • Use high-resolution scans (300 DPI or higher)
  • Ensure text is clearly readable
  • Avoid heavily redacted or corrupted documents
  • Single-page documents process faster than multi-page documents

Testing

cURL Examples

Submit a document:

curl -X POST https://api.extractable.xyz/api/documents/extract \
  -H "Authorization: Bearer dk_live_your_api_key_here" \
  -F "file=@document.pdf" \
  -F 'metadata={"clientId": "CLIENT123"}'

Check status:

curl https://api.extractable.xyz/api/documents/extract/123e4567-e89b-12d3-a456-426614174000 \
  -H "Authorization: Bearer dk_live_your_api_key_here"

List document types:

curl https://api.extractable.xyz/api/documents/types \
  -H "Authorization: Bearer dk_live_your_api_key_here"

Integration Notes

Webhook Configuration

When providing a webhook URL, ensure your endpoint can handle POST requests with the following payload structure:

{
  "jobId": "123e4567-e89b-12d3-a456-426614174000",
  "status": "completed",
  "documentType": "sor",
  "confidence": 0.95,
  "extractedData": { ... },
  "metadata": { ... }
}
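A receiving endpoint can be sketched with Python's built-in http.server module. This is a minimal illustration rather than a production receiver: the names handle_payload, WebhookHandler, and serve are hypothetical, port 8080 is arbitrary, and replying with 200 to acknowledge delivery is an assumption, since this guide does not describe the expected response or any retry or signature scheme.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_payload(payload):
    """Summarize a webhook payload; replace this with real business logic."""
    return {
        "jobId": payload["jobId"],
        "status": payload["status"],
        "documentType": payload.get("documentType"),
        "hasData": bool(payload.get("extractedData")),
    }


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            summary = handle_payload(payload)
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or a payload missing jobId/status.
            self.send_response(400)
            self.end_headers()
            return
        print("received", summary)
        # Assumption: any 2xx reply is treated as a successful delivery.
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Run the receiver until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Keeping handle_payload separate from the HTTP handler makes the parsing logic easy to test without opening a socket.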

Polling Strategy

  • Poll every 5-10 seconds for job status
  • Most documents process within 30-60 seconds
  • Complex multi-page documents may take up to 2 minutes
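The guidance above can be sketched as a polling loop with a fixed interval and an overall timeout. In this Python illustration, poll_job is a hypothetical helper and fetch_status is any caller-supplied function that returns the parsed JSON from GET /api/documents/extract/{jobId}. Only the "pending" and "completed" statuses appear in this guide; treating "failed" as a terminal status is an assumption.

```python
import time

# "completed" appears in this guide; "failed" is an assumed terminal status.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_job(fetch_status, interval=5.0, timeout=180.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() every `interval` seconds until the job finishes.

    Raises TimeoutError if no terminal status is seen within `timeout` seconds.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        # Stop if another wait would run past the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        sleep(interval)
```

The default timeout of 180 seconds leaves some margin over the 2-minute upper bound quoted above; jobs submitted with a webhookUrl do not need this loop at all.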

Rate Limits

  • Per API Key: 100 requests per minute
  • Per Organization: 1000 requests per minute
  • Contact support for higher limits
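A client can stay under the per-key limit by tracking its own request timestamps. The Python sketch below uses a sliding 60-second window; the class name SlidingWindowLimiter is hypothetical, and the server's actual windowing algorithm is not described in this guide, so a 429 response should still be handled even with this in place.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side limiter: at most `max_requests` per `window_seconds`."""

    def __init__(self, max_requests=100, window_seconds=60.0, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._sent = deque()  # timestamps of recent requests, oldest first

    def wait_time(self):
        """Return how many seconds to wait before the next request is allowed."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self._window:
            self._sent.popleft()
        if len(self._sent) < self._max:
            return 0.0
        return self._window - (now - self._sent[0])

    def record(self):
        """Note that a request was just sent."""
        self._sent.append(self._clock())
```

Typical use is to call wait_time() before each request, sleep for the returned duration if it is nonzero, send the request, then call record(). Polling calls count toward the same limit, so many concurrent jobs polled every 5 seconds can consume a large share of the 100 requests per minute.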

Example Workflow

  1. Submit Document: Upload a PDF, JPEG, or PNG file without specifying its document type
  2. AI Classification: System automatically identifies document type
  3. Extraction: If type is supported, structured data is extracted
  4. Results: Receive extracted data via polling or webhook

Document Type Registry

The API maintains an extensible registry of supported document types.

Current Types

  • Statement of Repair: Construction scope documents
  • Insurance Certificate: Coverage verification forms
  • Invoice: Billing documents (identification only)
  • Contract: Agreement documents (identification only)

Adding New Types

Document types can be added to the registry without code changes:

  1. Define recognition criteria
  2. Set confidence thresholds
  3. Optionally add extraction templates
  4. Deploy and test

Support

For technical support or questions:

  • Documentation: Full API documentation available
  • Status Page: Check system status and maintenance windows

Next Steps

  1. Get your API key: Contact support for API access
  2. Review the full API documentation: See the complete OpenAPI specification
  3. Test with sample documents: Use the provided cURL examples
  4. Set up webhooks: Configure your endpoint to receive results
  5. Monitor usage: Track API usage and performance