Extractable Document Platform - Comprehensive Project Documentation

Executive Summary

This document specifies a document extraction platform that lets users create custom extraction templates, process documents with AI, and send the extracted data to other platforms (primarily GoHighLevel, with webhooks for others such as Zapier). The platform will be built as a new flow within the existing Next.js SaaS starter; legacy features will be hidden and the UI will follow a minimal, Linear.app-inspired design.

Project Vision

Core Purpose: Simplify document data extraction by allowing users to define their own extraction templates and seamlessly integrate extracted data with their existing workflows through webhooks and direct platform integrations.

Target Users:

  • Small to medium businesses that process documents regularly
  • GoHighLevel agencies managing multiple client accounts
  • Professionals who need to extract structured data from various document types
  • Users who want to automate document processing workflows

Technical Architecture Overview

Technology Stack

  • Frontend: Next.js 14 with React, TailwindCSS
  • Backend: Next.js API routes, Supabase for database and auth
  • Document Processing: Trigger.dev for orchestration, Gemini AI for extraction
  • Storage: Supabase Storage for document files
  • Integrations: GoHighLevel v2 OAuth, Webhook system for other platforms
  • UI Framework: Custom design system inspired by Linear.app

Core Components

  1. Document Template Engine: AI-powered template creation and management
  2. Document Processing Pipeline: Upload, identification, extraction, output
  3. Integration Hub: GoHighLevel direct integration + webhook system
  4. User Management: Multi-tenant architecture with organization support
  5. Extraction History: Track and manage all extraction activities

Detailed Feature Specifications

1. User Onboarding & Template Creation

Template Creation Flow

  1. Upload Example Document: User uploads a sample document of the type they want to extract
  2. AI Analysis: Gemini AI analyzes the document and identifies all possible data fields
  3. Field Discovery Display: Show user all discovered fields in a clean, organized JSON-like interface
  4. Field Selection: User selects exactly which fields they need extracted
  5. Template Configuration:
    • Name the template
    • Add description
    • Set field validation rules
    • Configure output format preferences
  6. Template Testing: Test template on additional sample documents
  7. Template Activation: Save template to user's profile for future use
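
As a rough illustration of steps 3 to 5, the sketch below turns a list of AI-discovered fields plus the user's selection into a draft template. The type names, property names, and the buildTemplate helper are hypothetical; they loosely mirror the ai_discovered_fields and selected_fields columns described later in this document and are not an existing API.

```typescript
// Hypothetical shapes for illustration only.
type FieldType = "text" | "number" | "date";

interface DiscoveredField {
  key: string; // e.g. "invoice_number"
  type: FieldType;
  example: string; // sample value found in the example document
}

interface DocumentTemplate {
  name: string;
  description: string;
  aiDiscoveredFields: DiscoveredField[];
  selectedFields: DiscoveredField[];
  status: "draft" | "active" | "archived";
}

function buildTemplate(
  name: string,
  description: string,
  discovered: DiscoveredField[],
  selectedKeys: string[],
): DocumentTemplate {
  // Reject keys that were never discovered, so a template cannot
  // reference a field the extraction step knows nothing about.
  const known = new Set(discovered.map((f) => f.key));
  const unknown = selectedKeys.filter((k) => !known.has(k));
  if (unknown.length > 0) {
    throw new Error(`Unknown field keys: ${unknown.join(", ")}`);
  }
  const wanted = new Set(selectedKeys);
  return {
    name,
    description,
    aiDiscoveredFields: discovered,
    selectedFields: discovered.filter((f) => wanted.has(f.key)),
    status: "draft", // activated only after testing (steps 6 and 7)
  };
}

const discovered: DiscoveredField[] = [
  { key: "invoice_number", type: "text", example: "INV-001" },
  { key: "total", type: "number", example: "129.50" },
  { key: "due_date", type: "date", example: "2024-03-01" },
];

const template = buildTemplate("Invoice", "Vendor invoices", discovered, [
  "invoice_number",
  "total",
]);
```

Keeping the full discovered list alongside the selection lets a user add fields later without re-uploading the example document.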

Template Management

  • Multiple Templates: Users can create unlimited document templates
  • Template Versioning: Track changes and maintain version history
  • Template Sharing: Share templates within organization (future feature)
  • Template Import/Export: JSON-based template sharing

2. Document Upload & Processing

Upload Interface

  • Drag & Drop: Modern file upload interface
  • Batch Upload: Support multiple documents at once
  • File Type Support: PDF, images (PNG, JPG), Word documents
  • File Size Limits: Configurable per plan tier

Document Type Identification

  • Manual Selection: User chooses from their saved templates
  • AI Identification:
    • Analyze document content
    • Match against user's existing templates
    • Suggest best matching template with confidence score
    • Allow user to confirm or override suggestion
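
One possible way to turn per-template confidence scores into a suggestion is sketched below. It assumes the AI step returns a score between 0 and 1 for each of the user's templates; the 0.7 threshold and all names here are illustrative assumptions, not decided values.

```typescript
interface TemplateScore {
  templateId: string;
  confidence: number; // assumed range: 0 to 1
}

interface Suggestion {
  templateId: string | null;
  confidence: number;
  needsManualSelection: boolean;
}

function suggestTemplate(scores: TemplateScore[], minConfidence = 0.7): Suggestion {
  if (scores.length === 0) {
    return { templateId: null, confidence: 0, needsManualSelection: true };
  }
  // Pick the highest-scoring template.
  const best = scores.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  if (best.confidence < minConfidence) {
    // Too uncertain: fall back to manual selection.
    return { templateId: null, confidence: best.confidence, needsManualSelection: true };
  }
  return { templateId: best.templateId, confidence: best.confidence, needsManualSelection: false };
}

// The user's explicit choice always wins over the suggestion.
function resolveTemplate(suggestion: Suggestion, userOverride?: string): string | null {
  return userOverride ?? suggestion.templateId;
}

const suggestion = suggestTemplate([
  { templateId: "invoice", confidence: 0.91 },
  { templateId: "contract", confidence: 0.34 },
]);
const chosen = resolveTemplate(suggestion);
```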

Processing Pipeline

  1. File Upload & Storage: Secure upload to Supabase Storage
  2. Document Preprocessing: Convert to optimal format for AI analysis
  3. Template Application: Apply selected/identified template
  4. Data Extraction: Gemini AI extracts specified fields
  5. Validation: Validate extracted data against template rules
  6. Output Generation: Format data for app display and integrations
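
Step 5 (validation) could look roughly like the sketch below. The rule format (required, type, pattern) is an assumption about what the validation_rules JSON might contain; the actual rule schema is not defined in this document.

```typescript
// Hypothetical rule format for illustration.
interface FieldRule {
  required?: boolean;
  type?: "text" | "number" | "date";
  pattern?: string; // regular expression source
}

type ValidationRules = Record<string, FieldRule>;

interface ValidationIssue {
  field: string;
  message: string;
}

function validateExtraction(
  data: Record<string, unknown>,
  rules: ValidationRules,
): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (const [field, rule] of Object.entries(rules)) {
    const value = data[field];
    const missing = value === undefined || value === null || value === "";
    if (missing) {
      if (rule.required) issues.push({ field, message: "required field is missing" });
      continue; // nothing else to check for an absent value
    }
    const text = String(value);
    if (rule.type === "number" && !Number.isFinite(Number(text))) {
      issues.push({ field, message: "expected a number" });
    }
    if (rule.type === "date" && Number.isNaN(Date.parse(text))) {
      issues.push({ field, message: "expected a date" });
    }
    if (rule.pattern && !new RegExp(rule.pattern).test(text)) {
      issues.push({ field, message: "does not match the expected format" });
    }
  }
  return issues;
}

const issues = validateExtraction(
  { invoice_number: "INV-001", total: "abc" },
  {
    invoice_number: { required: true, pattern: "^INV-\\d+$" },
    total: { required: true, type: "number" },
    due_date: { required: true, type: "date" },
  },
);
```

Returning a list of issues rather than throwing lets the results screen flag each problem field while still showing the rest of the extracted data.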

3. Data Output & Integration System

In-App Display

  • Extraction Results: Clean, organized display of extracted data
  • Field Validation: Visual indicators for validation status
  • Edit Capability: Allow manual correction of extracted data
  • Export Options: CSV, JSON, PDF report formats

Webhook System

  • Configuration Interface: Easy webhook setup with URL and authentication
  • Payload Customization: Choose which fields to send, format options
  • Retry Logic: Automatic retry with exponential backoff for failed webhook deliveries
  • Delivery Tracking: Log and monitor webhook delivery status
  • Platform Templates: Pre-configured templates for popular platforms (Zapier, Make, etc.)
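
Payload customization might be implemented along the lines of the sketch below, which keeps only the configured fields and merges any user-supplied headers. The WebhookConfig shape and the extraction_id/data envelope are hypothetical; the real payload format is still to be designed.

```typescript
// Hypothetical configuration shape, loosely based on the webhook_config column.
interface WebhookConfig {
  url: string;
  fields: string[]; // extracted fields the user chose to send
  headers?: Record<string, string>; // e.g. an Authorization header
}

interface WebhookRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildWebhookRequest(
  extractionId: string,
  data: Record<string, unknown>,
  config: WebhookConfig,
): WebhookRequest {
  // Copy only the fields the user selected; skip any that were not extracted.
  const payload: Record<string, unknown> = {};
  for (const key of config.fields) {
    if (key in data) payload[key] = data[key];
  }
  return {
    url: config.url,
    method: "POST",
    headers: { "Content-Type": "application/json", ...config.headers },
    body: JSON.stringify({ extraction_id: extractionId, data: payload }),
  };
}

const request = buildWebhookRequest(
  "ext_123",
  { invoice_number: "INV-001", total: 129.5, notes: "internal" },
  {
    url: "https://hooks.example.com/invoices",
    fields: ["invoice_number", "total"],
    headers: { Authorization: "Bearer example-token" },
  },
);
```

The resulting object can be passed to fetch(request.url, request) by the delivery job.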

GoHighLevel Direct Integration

  • OAuth 2.0 Implementation: Use existing go-high-level-2 integration
  • Field Mapping: Map extracted fields to GHL custom fields or standard fields
  • Contact Management: Create/update contacts with extracted data
  • Opportunity Creation: Create opportunities with document data
  • Custom Field Sync: Sync to custom fields in GHL subaccounts
  • Agency Support: Support for GHL agencies managing multiple locations
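
Field mapping can be pictured as a lookup from extracted field names to either a standard contact field or a custom field ID. The sketch below is an assumption about how the field_mappings JSON might be structured; the output shape is an internal intermediate form, not the GoHighLevel API request format, which should be taken from the GHL API reference.

```typescript
// Hypothetical mapping format for illustration.
type MappingTarget =
  | { kind: "standard"; field: string } // e.g. "email", "firstName"
  | { kind: "custom"; customFieldId: string };

type FieldMappings = Record<string, MappingTarget>;

interface MappedContact {
  standard: Record<string, string>;
  custom: { id: string; value: string }[];
  unmapped: string[]; // extracted fields with no mapping, reported to the user
}

function applyFieldMappings(
  extracted: Record<string, string>,
  mappings: FieldMappings,
): MappedContact {
  const result: MappedContact = { standard: {}, custom: [], unmapped: [] };
  for (const [key, value] of Object.entries(extracted)) {
    const target = mappings[key];
    if (!target) {
      result.unmapped.push(key);
    } else if (target.kind === "standard") {
      result.standard[target.field] = value;
    } else {
      result.custom.push({ id: target.customFieldId, value });
    }
  }
  return result;
}

const mapped = applyFieldMappings(
  { applicant_email: "ana@example.com", policy_number: "P-77", notes: "n/a" },
  {
    applicant_email: { kind: "standard", field: "email" },
    policy_number: { kind: "custom", customFieldId: "cf_policy" },
  },
);
```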

4. User Interface Design System

Design Principles (Linear.app Inspired)

  • Minimal & Clean: Remove visual clutter, focus on content
  • Task-Oriented: Each screen serves a specific purpose
  • Consistent Spacing: 8px grid system throughout
  • Typography Hierarchy: Clear information hierarchy
  • Subtle Animations: Smooth transitions without distraction
  • Dark/Light Mode: Support both themes

Component Library

  • Navigation: Simple sidebar with clear sections
  • Cards: Document cards, template cards, integration cards
  • Forms: Clean form styling with proper validation states
  • Tables: Data tables with sorting, filtering, pagination
  • Modals: For complex actions and confirmations
  • Progress Indicators: For upload and processing states

Color Palette

  • Primary: Modern blue (#2563eb)
  • Secondary: Subtle gray scale
  • Success: Green (#16a34a)
  • Warning: Amber (#d97706)
  • Error: Red (#dc2626)
  • Neutral: Gray variations for text and backgrounds
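
The palette and the 8px grid could be captured as shared design tokens, for example as below. The token names and the spacing helper are illustrative assumptions; the hex values are the ones listed above.

```typescript
// Color tokens taken from the palette above.
const colors = {
  primary: "#2563eb",
  success: "#16a34a",
  warning: "#d97706",
  error: "#dc2626",
} as const;

// Base unit of the 8px grid system.
const GRID_UNIT_PX = 8;

// Convert a number of grid units to a CSS length.
function spacing(units: number): string {
  return `${units * GRID_UNIT_PX}px`;
}

const cardPadding = spacing(2);
```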

Database Schema Design

Core Tables

document_templates

CREATE TABLE document_templates (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organization_id UUID REFERENCES organizations(id),
  user_id UUID REFERENCES users(id),
  name VARCHAR(255) NOT NULL,
  description TEXT,
  ai_discovered_fields JSONB, -- All fields AI found in example
  selected_fields JSONB, -- Fields user chose to extract
  validation_rules JSONB, -- Field validation configuration
  example_document_url TEXT, -- Reference to example document
  status VARCHAR(20) DEFAULT 'active', -- active, archived, draft
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

document_extractions

CREATE TABLE document_extractions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organization_id UUID REFERENCES organizations(id),
  user_id UUID REFERENCES users(id),
  template_id UUID REFERENCES document_templates(id),
  original_filename VARCHAR(255),
  file_url TEXT NOT NULL,
  extracted_data JSONB, -- The actual extracted data
  extraction_confidence JSONB, -- Confidence scores per field
  processing_status VARCHAR(20), -- uploaded, processing, completed, failed
  ai_model_used VARCHAR(50),
  processing_time_ms INTEGER,
  webhook_deliveries JSONB, -- Track webhook delivery attempts
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
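
The processing_status values listed in the comment above suggest a small state machine. The sketch below shows one way to guard status updates in application code; the allowed transitions, in particular retrying a failed extraction, are assumptions rather than settled behavior.

```typescript
type ProcessingStatus = "uploaded" | "processing" | "completed" | "failed";

// Assumed transition table; "failed" -> "processing" models a retry.
const allowedTransitions: Record<ProcessingStatus, ProcessingStatus[]> = {
  uploaded: ["processing", "failed"],
  processing: ["completed", "failed"],
  completed: [],
  failed: ["processing"],
};

function transition(current: ProcessingStatus, next: ProcessingStatus): ProcessingStatus {
  if (!allowedTransitions[current].includes(next)) {
    throw new Error(`Invalid status transition: ${current} -> ${next}`);
  }
  return next;
}

let status: ProcessingStatus = "uploaded";
status = transition(status, "processing");
status = transition(status, "completed");
```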

integration_connections

CREATE TABLE integration_connections (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organization_id UUID REFERENCES organizations(id),
  user_id UUID REFERENCES users(id),
  integration_type VARCHAR(50), -- 'ghl', 'webhook', 'zapier'
  connection_name VARCHAR(255),
  oauth_data JSONB, -- OAuth tokens for platforms like GHL
  webhook_config JSONB, -- Webhook URL, headers, auth
  field_mappings JSONB, -- How to map extracted fields to integration
  status VARCHAR(20) DEFAULT 'active',
  last_sync_at TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

webhook_deliveries

CREATE TABLE webhook_deliveries (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  extraction_id UUID REFERENCES document_extractions(id),
  connection_id UUID REFERENCES integration_connections(id),
  webhook_url TEXT,
  payload JSONB,
  response_status INTEGER,
  response_body TEXT,
  attempt_number INTEGER DEFAULT 1,
  delivered_at TIMESTAMP,
  failed_at TIMESTAMP,
  next_retry_at TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW()
);
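
The attempt_number and next_retry_at columns support retries with exponential backoff. A minimal sketch of how next_retry_at might be computed follows; the RetryPolicy shape and the example values are assumptions, not chosen settings.

```typescript
interface RetryPolicy {
  baseDelayMs: number; // delay after the first failed attempt
  maxDelayMs: number; // upper bound on any single delay
  maxAttempts: number; // total attempts before giving up
}

// Returns when the next attempt should run, or null when attempts are exhausted.
// attemptNumber is the 1-based number of the attempt that just failed.
function nextRetryAt(attemptNumber: number, now: Date, policy: RetryPolicy): Date | null {
  if (attemptNumber >= policy.maxAttempts) return null;
  // Delay doubles with each failed attempt: base, 2x base, 4x base, ...
  const delayMs = Math.min(policy.baseDelayMs * 2 ** (attemptNumber - 1), policy.maxDelayMs);
  return new Date(now.getTime() + delayMs);
}

const policy: RetryPolicy = { baseDelayMs: 1_000, maxDelayMs: 60_000, maxAttempts: 5 };
const retryAt = nextRetryAt(3, new Date(), policy);
```

Adding random jitter to the delay is a common refinement when many deliveries fail at the same time.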

GoHighLevel Integration Nuances

OAuth Implementation Details

Connection Flow

  1. User Initiates: Click "Connect GoHighLevel" in integrations
  2. OAuth Redirect: Redirect to GHL marketplace with v2 scopes
  3. User Authorization: User logs in to GHL and selects a location or agency
  4. Callback Handling: Receive authorization code at /oauth/secret-crm
  5. Token Exchange: Exchange code for access/refresh tokens
  6. Connection Storage: Store tokens and location/company ID
  7. Field Mapping Setup: Allow user to map extraction fields to GHL fields

Required Scopes (go-high-level-2)

  • contacts.readonly and contacts.write: Manage contacts
  • conversations.readonly and conversations.write: Conversations
  • locations/customFields.readonly and locations/customFields.write: Custom fields
  • locations/customValues.readonly and locations/customValues.write: Location-level custom values
  • opportunities.readonly and opportunities.write: Opportunity management
  • companies.readonly: Agency-level access
  • oauth.write and oauth.readonly: Token management
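
Step 2 of the connection flow (the OAuth redirect) could be built as in the sketch below, using the scopes listed above. The authorization endpoint shown is an assumption that should be checked against the current GoHighLevel OAuth documentation, as should support for the state parameter; the client ID, host name, and helper name are placeholders.

```typescript
// Assumed authorization endpoint; verify against the GHL OAuth docs.
const GHL_AUTHORIZE_URL = "https://marketplace.gohighlevel.com/oauth/chooselocation";

const SCOPES = [
  "contacts.readonly",
  "contacts.write",
  "conversations.readonly",
  "conversations.write",
  "locations/customFields.readonly",
  "locations/customFields.write",
  "locations/customValues.readonly",
  "locations/customValues.write",
  "opportunities.readonly",
  "opportunities.write",
  "companies.readonly",
  "oauth.write",
  "oauth.readonly",
];

function buildAuthorizeUrl(clientId: string, redirectUri: string, state: string): string {
  const url = new URL(GHL_AUTHORIZE_URL);
  url.search = new URLSearchParams({
    response_type: "code",
    client_id: clientId,
    redirect_uri: redirectUri,
    scope: SCOPES.join(" "), // scopes are space-separated
    state, // random value checked again in the callback
  }).toString();
  return url.toString();
}

const authorizeUrl = buildAuthorizeUrl(
  "example-client-id",
  "https://app.example.com/oauth/secret-crm",
  "random-state-value",
);
```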

Data Sync Strategies

  1. Contact Creation/Update:

    • Check if contact exists by email or phone
    • Create new contact if not found
    • Update existing contact with extracted data
    • Handle custom field mapping
  2. Custom Field Management:

    • Fetch available custom fields for location
    • Create new custom fields if needed (requires the locations/customFields.write scope)
    • Map extracted fields to existing custom fields
    • Validate field types (text, number, date, etc.)
  3. Opportunity Creation:

    • Create opportunities linked to contacts
    • Use extracted data for opportunity details
    • Set pipeline and stage based on document type
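
The create-or-update decision in strategy 1 can be sketched as a pure function. In practice the existing contacts would come from a GHL contact search; here an in-memory list stands in for that lookup, and the normalization rules and names are assumptions.

```typescript
interface Contact {
  id: string;
  email?: string;
  phone?: string;
}

type SyncAction = { action: "create" } | { action: "update"; contactId: string };

// Normalize so that formatting differences do not prevent a match.
// Empty results are treated as "no value".
function normalizeEmail(email?: string): string | undefined {
  const value = email?.trim().toLowerCase();
  return value ? value : undefined;
}

function normalizePhone(phone?: string): string | undefined {
  const digits = phone?.replace(/\D/g, ""); // keep digits only
  return digits ? digits : undefined;
}

function decideContactSync(
  existing: Contact[],
  extracted: { email?: string; phone?: string },
): SyncAction {
  const email = normalizeEmail(extracted.email);
  const phone = normalizePhone(extracted.phone);
  const match = existing.find(
    (c) =>
      (email !== undefined && normalizeEmail(c.email) === email) ||
      (phone !== undefined && normalizePhone(c.phone) === phone),
  );
  return match ? { action: "update", contactId: match.id } : { action: "create" };
}

const existingContacts: Contact[] = [
  { id: "c1", email: "Ana@Example.com" },
  { id: "c2", phone: "+1 (555) 010-2000" },
];
const byEmail = decideContactSync(existingContacts, { email: " ana@example.com " });
const byPhone = decideContactSync(existingContacts, { phone: "15550102000" });
const noMatch = decideContactSync(existingContacts, { email: "new@example.com" });
```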

Error Handling

  • Token Expiration: Automatic refresh using refresh tokens
  • Rate Limiting: Implement exponential backoff
  • API Errors: Graceful handling of GHL API errors
  • Connection Loss: Detect and prompt for re-authentication
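
The four cases above can be reduced to two small decisions: whether a stored token should be refreshed before a call, and how to react to a failed API response. The sketch below is one way to express that; the status-code mapping is a general HTTP convention assumed here, not documented GHL behavior, and all names are hypothetical.

```typescript
interface StoredToken {
  accessToken: string;
  refreshToken: string;
  expiresAtMs: number; // epoch milliseconds
}

// Refresh slightly early so a token does not expire mid-request.
function needsRefresh(token: StoredToken, nowMs: number, skewMs = 60_000): boolean {
  return nowMs >= token.expiresAtMs - skewMs;
}

type ErrorAction = "refresh_token" | "retry_with_backoff" | "reauthenticate" | "fail";

function classifyApiError(status: number, refreshAlreadyTried: boolean): ErrorAction {
  if (status === 401) {
    // A 401 after a fresh token suggests the connection was revoked.
    return refreshAlreadyTried ? "reauthenticate" : "refresh_token";
  }
  if (status === 429 || status >= 500) return "retry_with_backoff";
  return "fail"; // other client errors will not succeed on retry
}

const token: StoredToken = {
  accessToken: "example-access",
  refreshToken: "example-refresh",
  expiresAtMs: 1_000_000,
};
const shouldRefresh = needsRefresh(token, 950_000);
```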

Security Considerations

  • Token Storage: Encrypt OAuth tokens in database
  • Webhook Validation: Verify that incoming webhooks reference a known, connected location ID
  • Rate Limiting: Prevent abuse of integration endpoints
  • Audit Logging: Log all integration activities
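
Token encryption at rest could be done with authenticated encryption before writing to the oauth_data column. The sketch below uses AES-256-GCM from Node's built-in crypto module; the function names and the "iv.tag.ciphertext" storage format are assumptions, and key management (for example loading the key from a secret store) is out of scope here.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypts with AES-256-GCM. The key must be 32 bytes.
// Output format (an assumption): base64 iv, auth tag, and ciphertext joined by ".".
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh 96-bit nonce for every encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return [iv, tag, ciphertext].map((part) => part.toString("base64")).join(".");
}

// Throws if the key is wrong or the stored value was modified.
function decryptToken(encoded: string, key: Buffer): string {
  const [iv, tag, ciphertext] = encoded.split(".").map((part) => Buffer.from(part, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

const exampleKey = randomBytes(32); // in production, load this from a secret store
const stored = encryptToken("example-refresh-token", exampleKey);
const recovered = decryptToken(stored, exampleKey);
```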

End-to-End Document Extraction Flow

Phase 1: Template Creation (One-time Setup)

  1. User Access: User logs in to the platform and navigates to "Document Templates"
  2. Template Creation: Click "Create New Template"
  3. Document Upload: Upload example document (PDF, image, Word)
  4. AI Processing:
    • Trigger.dev job initiated
    • Gemini AI analyzes document structure
    • Extract all possible data fields with examples
    • Return structured field discovery
  5. Field Review: User reviews discovered fields in organized interface
  6. Field Selection: User selects which fields they need extracted
  7. Template Configuration:
    • Name template (e.g., "Invoice", "Contract", "Application")
    • Add description
    • Set field validation rules (required, format, etc.)
    • Configure output preferences
  8. Template Testing: Test template on 2-3 additional sample documents
  9. Template Activation: Save template to user profile for future use

Phase 2: Document Processing (Ongoing Operations)

  1. Document Upload: User uploads document(s) to process
  2. Type Identification:
    • Manual: User selects from their templates
    • Automatic: AI compares document to existing templates, suggests best match
  3. Processing Pipeline:
    • Document stored in Supabase Storage
    • Trigger.dev job queued for processing
    • Gemini AI extracts data using selected template
    • Extracted data validated against template rules
    • Processing status updated in real-time
  4. Results Display:
    • Show extracted data in clean interface
    • Highlight validation issues
    • Allow manual corrections
    • Show confidence scores per field

Phase 3: Data Integration (Automated)

  1. Integration Triggers: Based on user configuration
  2. GoHighLevel Integration:
    • Map extracted fields to GHL fields
    • Create/update contact with extracted data
    • Create opportunity if configured
    • Update custom fields
    • Log integration results
  3. Webhook Delivery:
    • Format data according to webhook configuration
    • Send to configured webhook URLs
    • Retry failed deliveries with exponential backoff
    • Log delivery status and responses
  4. Notification: User receives notification of processing completion

Phase 4: History & Management

  1. Extraction History: All extractions logged with full audit trail
  2. Integration Monitoring: Track success/failure rates
  3. Template Performance: Analytics on template accuracy and usage
  4. Error Resolution: Interface to review and resolve failed processes

Implementation Checklist

Phase 1: Foundation & UI Framework

  • Design System Implementation

    • Create Linear.app-inspired component library
    • Implement dark/light mode toggle
    • Build responsive grid system
    • Create consistent typography scale
    • Design form components with validation states
  • Navigation & Layout Updates

    • Hide existing tabs (Home, Loan Files, All Draws)
    • Create new minimal sidebar navigation
    • Implement breadcrumb navigation
    • Build dashboard layout for new features
  • Database Schema Setup

    • Create document_templates table
    • Create document_extractions table
    • Create integration_connections table
    • Create webhook_deliveries table
    • Set up proper indexes and relationships
    • Implement RLS policies for multi-tenancy

Phase 2: Template Creation System

  • Template Creation Interface

    • Build document upload component
    • Create field discovery display interface
    • Implement field selection interface
    • Build template configuration forms
    • Add template testing capability
  • AI Integration for Template Creation

    • Set up Trigger.dev job for template analysis
    • Implement Gemini AI document analysis
    • Create field discovery algorithms
    • Build confidence scoring system
    • Add field type detection (text, number, date, etc.)
  • Template Management

    • Build template listing interface
    • Implement template editing
    • Add template versioning
    • Create template import/export
    • Build template sharing (future)

Phase 3: Document Processing Pipeline

  • Upload Interface

    • Build modern drag & drop upload
    • Implement batch upload support
    • Add file type validation
    • Create upload progress indicators
    • Handle file size limits
  • Document Type Identification

    • Build manual template selection
    • Implement AI-powered type identification
    • Create confidence-based suggestions
    • Add override capabilities
  • Processing Engine

    • Set up Trigger.dev extraction jobs
    • Implement Gemini AI extraction
    • Build data validation system
    • Create real-time status updates
    • Add error handling and recovery

Phase 4: Integration System

  • Webhook Framework

    • Build webhook configuration interface
    • Implement webhook delivery system
    • Add retry logic with exponential backoff
    • Create delivery tracking and logging
    • Build webhook testing tools
  • GoHighLevel Integration

    • Implement OAuth 2.0 flow using existing v2 integration
    • Build field mapping interface
    • Create contact creation/update logic
    • Implement custom field management
    • Add opportunity creation features
    • Build agency/multi-location support
  • Integration Management

    • Create integration listing interface
    • Build connection status monitoring
    • Implement integration testing tools
    • Add integration analytics
    • Create troubleshooting interfaces

Phase 5: User Experience & Monitoring

  • Extraction History

    • Build extraction listing interface
    • Implement filtering and search
    • Add export capabilities
    • Create detailed view with edit options
  • Analytics & Monitoring

    • Build template performance analytics
    • Implement extraction accuracy tracking
    • Create integration success monitoring
    • Add usage analytics dashboard
  • User Onboarding

    • Create guided onboarding flow
    • Build tutorial/help system
    • Add sample templates
    • Create documentation

Phase 6: Advanced Features

  • Multi-Platform Integration Framework

    • Design extensible integration architecture
    • Build Zapier integration
    • Add Make.com integration
    • Create custom API endpoints
  • Advanced Template Features

    • Implement conditional field extraction
    • Add multi-page document support
    • Create template inheritance
    • Build collaborative template editing
  • Enterprise Features

    • Implement team management
    • Add role-based permissions
    • Create organization-wide templates
    • Build audit logging

Technical Considerations

Performance Optimization

  • File Processing: Use streaming for large files
  • AI Processing: Implement queueing for high-volume processing
  • Database Queries: Optimize with proper indexing
  • Caching: Implement Redis for frequently accessed data
  • CDN: Use for static assets and processed documents

Security Requirements

  • File Upload Security: Validate file types, scan for malware
  • Data Encryption: Encrypt sensitive data at rest
  • API Security: Implement rate limiting and authentication
  • Integration Security: Secure OAuth token storage
  • Audit Trail: Log all user actions and data access

Scalability Planning

  • Database Sharding: Plan for multi-tenant scaling
  • Message Queues: Use for background processing
  • Load Balancing: Prepare for horizontal scaling
  • Storage Optimization: Implement file lifecycle management
  • Monitoring: Set up comprehensive system monitoring

Success Metrics

User Engagement

  • Template Creation Rate: Templates created per user
  • Processing Volume: Documents processed per month
  • Integration Usage: Active integrations per user
  • User Retention: Monthly and annual retention rates

Technical Performance

  • Processing Speed: Average extraction time
  • Accuracy Rates: Template accuracy scores
  • Integration Success: Webhook delivery success rates
  • System Uptime: Platform availability metrics

Business Metrics

  • User Growth: New user acquisition rate
  • Revenue Growth: Monthly recurring revenue
  • Feature Adoption: Usage of advanced features
  • Customer Satisfaction: Support ticket volume and resolution time

Risk Mitigation

Technical Risks

  • AI Model Changes: Plan for Gemini API updates
  • Integration Dependencies: Monitor third-party API stability
  • Data Processing Failures: Implement comprehensive error handling
  • Security Vulnerabilities: Regular security audits

Business Risks

  • Competition: Monitor competitive landscape
  • Platform Dependencies: Reduce reliance on single providers
  • Compliance: Ensure GDPR, CCPA compliance
  • Customer Churn: Implement retention strategies

Future Roadmap

Short-term (3-6 months)

  • Complete core platform development
  • Launch GoHighLevel integration
  • Implement basic webhook system
  • Add essential analytics

Medium-term (6-12 months)

  • Add more platform integrations (Zapier, Make)
  • Implement advanced template features
  • Build mobile-responsive interface
  • Add team collaboration features

Long-term (12+ months)

  • API marketplace for integrations
  • AI model fine-tuning capabilities
  • Enterprise security features
  • White-label solutions

This document is the foundation for building the document extraction platform. Each section is specific enough to implement from, while leaving room for iterative development and changes driven by user feedback.