Extractable Document Platform - Comprehensive Project Documentation
Executive Summary
This document outlines the development of a new document extraction platform that enables users to create custom document extraction templates, process documents using AI, and integrate extracted data with various platforms (primarily GoHighLevel, but extensible to others like Zapier). The platform will be built as a new flow within the existing Next.js SaaS starter, hiding legacy features and implementing a Linear.app-inspired minimal UI.
Project Vision
Core Purpose: Simplify document data extraction by allowing users to define their own extraction templates and seamlessly integrate extracted data with their existing workflows through webhooks and direct platform integrations.
Target Users:
- Small to medium businesses that process documents regularly
- GoHighLevel agencies managing multiple client accounts
- Professionals who need to extract structured data from various document types
- Users who want to automate document processing workflows
Technical Architecture Overview
Technology Stack
- Frontend: Next.js 14 with React, TailwindCSS
- Backend: Next.js API routes, Supabase for database and auth
- Document Processing: Trigger.dev for orchestration, Gemini AI for extraction
- Storage: Supabase Storage for document files
- Integrations: GoHighLevel v2 OAuth, Webhook system for other platforms
- UI Framework: Custom design system inspired by Linear.app
Core Components
- Document Template Engine: AI-powered template creation and management
- Document Processing Pipeline: Upload, identification, extraction, output
- Integration Hub: GoHighLevel direct integration + webhook system
- User Management: Multi-tenant architecture with organization support
- Extraction History: Track and manage all extraction activities
Detailed Feature Specifications
1. User Onboarding & Template Creation
Template Creation Flow
- Upload Example Document: User uploads a sample document of the type they want to extract
- AI Analysis: Gemini AI analyzes document and identifies all possible data fields
- Field Discovery Display: Show user all discovered fields in a clean, organized JSON-like interface
- Field Selection: User selects exactly which fields they need extracted
- Template Configuration:
- Name the template
- Add description
- Set field validation rules
- Configure output format preferences
- Template Testing: Test template on additional sample documents
- Template Activation: Save template to user's profile for future use
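The selection step above can be sketched as a small pure function. This is a minimal illustration, not the platform's actual model: the type names, the field shape, and `buildTemplateDraft` are all hypothetical, and the source only states that discovered and selected fields are stored as JSON.

```typescript
// Hypothetical shapes for illustration; the spec only says these are stored as JSONB.
type FieldType = "text" | "number" | "date";

interface DiscoveredField {
  name: string;
  type: FieldType;
  example: string; // sample value found in the uploaded example document
}

interface TemplateDraft {
  name: string;
  description: string;
  aiDiscoveredFields: DiscoveredField[];
  selectedFields: DiscoveredField[];
  status: "draft" | "active" | "archived";
}

// Keeps every discovered field for reference, but only the user's picks as selected.
function buildTemplateDraft(
  name: string,
  description: string,
  discovered: DiscoveredField[],
  selectedNames: string[],
): TemplateDraft {
  const byName = new Map(discovered.map((f) => [f.name, f] as const));
  const selectedFields = selectedNames.map((n) => {
    const field = byName.get(n);
    if (!field) throw new Error(`Unknown field selected: ${n}`);
    return field;
  });
  return { name, description, aiDiscoveredFields: discovered, selectedFields, status: "draft" };
}
```

Rejecting unknown field names early keeps a template from referencing fields the AI analysis never reported.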
Template Management
- Multiple Templates: Users can create unlimited document templates
- Template Versioning: Track changes and maintain version history
- Template Sharing: Share templates within organization (future feature)
- Template Import/Export: JSON-based template sharing
2. Document Upload & Processing
Upload Interface
- Drag & Drop: Modern file upload interface
- Batch Upload: Support multiple documents at once
- File Type Support: PDF, images (PNG, JPG), Word documents
- File Size Limits: Configurable per plan tier
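A rough sketch of the upload checks listed above, assuming validation runs on file metadata before storage. The plan names and byte limits are placeholders, since the source only says limits are configurable per plan tier.

```typescript
// MIME types for the formats named in the spec: PDF, PNG, JPG, Word.
const ALLOWED_MIME_TYPES = new Set([
  "application/pdf",
  "image/png",
  "image/jpeg",
  "application/msword",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]);

// Hypothetical plan tiers and limits, for illustration only.
const MAX_BYTES_BY_PLAN: Record<string, number> = {
  free: 5 * 1024 * 1024,
  pro: 25 * 1024 * 1024,
};

interface UploadMeta {
  name: string;
  mimeType: string;
  sizeBytes: number;
}

type UploadCheck = { ok: true } | { ok: false; reason: string };

function validateUpload(file: UploadMeta, plan: string): UploadCheck {
  if (!ALLOWED_MIME_TYPES.has(file.mimeType)) {
    return { ok: false, reason: `Unsupported file type: ${file.mimeType}` };
  }
  const limit = MAX_BYTES_BY_PLAN[plan];
  if (limit === undefined) {
    return { ok: false, reason: `Unknown plan: ${plan}` };
  }
  if (file.sizeBytes > limit) {
    return { ok: false, reason: `File exceeds the ${limit}-byte limit for plan ${plan}` };
  }
  return { ok: true };
}
```

A client-reported MIME type can be spoofed, so a server-side content check would still be needed alongside this.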
Document Type Identification
- Manual Selection: User chooses from their saved templates
- AI Identification:
- Analyze document content
- Match against user's existing templates
- Suggest best matching template with confidence score
- Allow user to confirm or override suggestion
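The suggestion step could look roughly like the following, assuming the AI step returns one confidence score per template in the range 0 to 1. The 0.7 threshold and all names here are illustrative assumptions, not values from the source.

```typescript
interface TemplateScore {
  templateId: string;
  confidence: number; // assumed to be in the range 0..1
}

interface Suggestion {
  templateId: string | null;
  confidence: number;
  needsManualSelection: boolean;
}

// Picks the highest-scoring template and flags low-confidence matches for manual choice.
function suggestTemplate(scores: TemplateScore[], minConfidence = 0.7): Suggestion {
  if (scores.length === 0) {
    return { templateId: null, confidence: 0, needsManualSelection: true };
  }
  const best = scores.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  return {
    templateId: best.templateId,
    confidence: best.confidence,
    needsManualSelection: best.confidence < minConfidence,
  };
}

// The user's explicit choice always overrides the AI suggestion.
function resolveTemplate(suggestion: Suggestion, userChoice?: string): string | null {
  return userChoice ?? suggestion.templateId;
}
```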
Processing Pipeline
- File Upload & Storage: Secure upload to Supabase Storage
- Document Preprocessing: Convert to optimal format for AI analysis
- Template Application: Apply selected/identified template
- Data Extraction: Gemini AI extracts specified fields
- Validation: Validate extracted data against template rules
- Output Generation: Format data for app display and integrations
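The validation step in this pipeline might be sketched as below. The rule shape (`required`, `pattern`) is an assumption, since the source does not define how validation rules are structured.

```typescript
// Hypothetical rule shape; the spec stores rules as JSONB without defining them.
interface FieldRule {
  required?: boolean;
  pattern?: string; // regular expression source, e.g. "^\\d{4}-\\d{2}-\\d{2}$"
}

interface ValidationIssue {
  field: string;
  message: string;
}

// Returns one issue per rule violation; an empty array means the data passed.
function validateExtraction(
  data: Record<string, string | null | undefined>,
  rules: Record<string, FieldRule>,
): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (const [field, rule] of Object.entries(rules)) {
    const value = data[field];
    if (value === undefined || value === null || value.trim() === "") {
      if (rule.required) issues.push({ field, message: "Required field is missing" });
      continue;
    }
    if (rule.pattern && !new RegExp(rule.pattern).test(value)) {
      issues.push({ field, message: "Value does not match the expected format" });
    }
  }
  return issues;
}
```

Returning a list of issues, rather than throwing on the first one, fits the in-app display, which shows validation status per field.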
3. Data Output & Integration System
In-App Display
- Extraction Results: Clean, organized display of extracted data
- Field Validation: Visual indicators for validation status
- Edit Capability: Allow manual correction of extracted data
- Export Options: CSV, JSON, PDF report formats
Webhook System
- Configuration Interface: Easy webhook setup with URL and authentication
- Payload Customization: Choose which fields to send, format options
- Retry Logic: Automatic retry for failed webhook deliveries
- Delivery Tracking: Log and monitor webhook delivery status
- Platform Templates: Pre-configured templates for popular platforms (Zapier, Make, etc.)
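Payload customization could be sketched as follows: only the fields the user selected are sent, wrapped in a small envelope. The config shape and the envelope keys (`extraction_id`, `data`) are assumptions for illustration.

```typescript
// Hypothetical webhook configuration; the spec stores this as JSONB.
interface WebhookConfig {
  url: string;
  fields: string[]; // extracted field names the user chose to send
  headers?: Record<string, string>; // e.g. an auth header
}

interface WebhookRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the outgoing request without sending it, so delivery and retry stay separate.
function buildWebhookRequest(
  extractionId: string,
  extractedData: Record<string, unknown>,
  config: WebhookConfig,
): WebhookRequest {
  const data: Record<string, unknown> = {};
  for (const field of config.fields) {
    if (field in extractedData) data[field] = extractedData[field];
  }
  return {
    url: config.url,
    method: "POST",
    headers: { "Content-Type": "application/json", ...config.headers },
    body: JSON.stringify({ extraction_id: extractionId, data }),
  };
}
```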
GoHighLevel Direct Integration
- OAuth 2.0 Implementation: Use existing go-high-level-2 integration
- Field Mapping: Map extracted fields to GHL custom fields or standard fields
- Contact Management: Create/update contacts with extracted data
- Opportunity Creation: Create opportunities with document data
- Custom Field Sync: Sync to custom fields in GHL subaccounts
- Agency Support: Support for GHL agencies managing multiple locations
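Field mapping might be modeled as a lookup from extracted field name to a GHL target. This is a sketch under assumptions: the target shape, the set of standard fields, and the output structure are hypothetical and are not the GHL API's request format.

```typescript
type StandardField = "email" | "phone" | "firstName" | "lastName";

// Hypothetical mapping target: a standard contact field or a custom field by ID.
type GhlTarget =
  | { kind: "standard"; field: StandardField }
  | { kind: "custom"; customFieldId: string };

type FieldMappings = Record<string, GhlTarget>; // extracted field name -> GHL target

interface ContactUpdate {
  standard: Partial<Record<StandardField, string>>;
  custom: { customFieldId: string; value: string }[];
  unmapped: string[]; // extracted fields with no configured mapping
}

function applyFieldMappings(
  extracted: Record<string, string>,
  mappings: FieldMappings,
): ContactUpdate {
  const update: ContactUpdate = { standard: {}, custom: [], unmapped: [] };
  for (const [name, value] of Object.entries(extracted)) {
    const target = mappings[name];
    if (!target) {
      update.unmapped.push(name);
      continue;
    }
    if (target.kind === "standard") {
      update.standard[target.field] = value;
    } else {
      update.custom.push({ customFieldId: target.customFieldId, value });
    }
  }
  return update;
}
```

Tracking unmapped fields lets the mapping interface prompt the user instead of dropping data silently.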
4. User Interface Design System
Design Principles (Linear.app Inspired)
- Minimal & Clean: Remove visual clutter, focus on content
- Task-Oriented: Each screen serves a specific purpose
- Consistent Spacing: 8px grid system throughout
- Typography Hierarchy: Clear information hierarchy
- Subtle Animations: Smooth transitions without distraction
- Dark/Light Mode: Support both themes
Component Library
- Navigation: Simple sidebar with clear sections
- Cards: Document cards, template cards, integration cards
- Forms: Clean form styling with proper validation states
- Tables: Data tables with sorting, filtering, pagination
- Modals: For complex actions and confirmations
- Progress Indicators: For upload and processing states
Color Palette
- Primary: Modern blue (#2563eb)
- Secondary: Subtle gray scale
- Success: Green (#16a34a)
- Warning: Amber (#d97706)
- Error: Red (#dc2626)
- Neutral: Gray variations for text and backgrounds
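One way to carry the palette into code is a typed token map that can be emitted as CSS variables. Only the four colors with hex values above are included; the token names and the `--color-` prefix are assumptions.

```typescript
// Hex values come from the palette above; the token names are illustrative.
const colors = {
  primary: "#2563eb",
  success: "#16a34a",
  warning: "#d97706",
  error: "#dc2626",
} as const;

type ColorToken = keyof typeof colors;

// Emits one CSS custom property per token, e.g. "--color-primary: #2563eb;".
function cssVariables(): string {
  return (Object.keys(colors) as ColorToken[])
    .map((token) => `--color-${token}: ${colors[token]};`)
    .join("\n");
}
```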
Database Schema Design
Core Tables
document_templates
CREATE TABLE document_templates (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organization_id UUID REFERENCES organizations(id),
  user_id UUID REFERENCES users(id),
  name VARCHAR(255) NOT NULL,
  description TEXT,
  ai_discovered_fields JSONB, -- All fields AI found in example
  selected_fields JSONB, -- Fields user chose to extract
  validation_rules JSONB, -- Field validation configuration
  example_document_url TEXT, -- Reference to example document
  status VARCHAR(20) DEFAULT 'active', -- active, archived, draft
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
document_extractions
CREATE TABLE document_extractions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organization_id UUID REFERENCES organizations(id),
  user_id UUID REFERENCES users(id),
  template_id UUID REFERENCES document_templates(id),
  original_filename VARCHAR(255),
  file_url TEXT NOT NULL,
  extracted_data JSONB, -- The actual extracted data
  extraction_confidence JSONB, -- Confidence scores per field
  processing_status VARCHAR(20), -- uploaded, processing, completed, failed
  ai_model_used VARCHAR(50),
  processing_time_ms INTEGER,
  webhook_deliveries JSONB, -- Track webhook delivery attempts
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
integration_connections
CREATE TABLE integration_connections (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organization_id UUID REFERENCES organizations(id),
  user_id UUID REFERENCES users(id),
  integration_type VARCHAR(50), -- 'ghl', 'webhook', 'zapier'
  connection_name VARCHAR(255),
  oauth_data JSONB, -- OAuth tokens for platforms like GHL
  webhook_config JSONB, -- Webhook URL, headers, auth
  field_mappings JSONB, -- How to map extracted fields to integration
  status VARCHAR(20) DEFAULT 'active',
  last_sync_at TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
webhook_deliveries
CREATE TABLE webhook_deliveries (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  extraction_id UUID REFERENCES document_extractions(id),
  connection_id UUID REFERENCES integration_connections(id),
  webhook_url TEXT,
  payload JSONB,
  response_status INTEGER,
  response_body TEXT,
  attempt_number INTEGER DEFAULT 1,
  delivered_at TIMESTAMP,
  failed_at TIMESTAMP,
  next_retry_at TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW()
);
GoHighLevel Integration Nuances
OAuth Implementation Details
Connection Flow
- User Initiates: Click "Connect GoHighLevel" in integrations
- OAuth Redirect: Redirect to GHL marketplace with v2 scopes
- User Authorization: User logs into GHL and selects location/agency
- Callback Handling: Receive authorization code at /oauth/secret-crm
- Token Exchange: Exchange code for access/refresh tokens
- Connection Storage: Store tokens and location/company ID
- Field Mapping Setup: Allow user to map extraction fields to GHL fields
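The redirect step of this flow could be sketched as a URL builder. The authorization endpoint below is an assumption based on GHL's marketplace install flow and should be checked against the current GHL OAuth documentation. The client ID and app base URL are placeholders; the callback path and the scope names come from this document.

```typescript
// Assumed endpoint; verify against the current GHL OAuth documentation.
const GHL_AUTHORIZE_URL = "https://marketplace.gohighlevel.com/oauth/chooselocation";

// Scope names as listed in this document.
const GHL_SCOPES = [
  "contacts.readonly",
  "contacts.write",
  "conversations.readonly",
  "conversations.write",
  "locations/customFields.readonly",
  "locations/customFields.write",
  "locations/customValues.readonly",
  "locations/customValues.write",
  "opportunities.readonly",
  "opportunities.write",
  "companies.readonly",
  "oauth.write",
  "oauth.readonly",
];

// Builds the URL the user is redirected to when clicking "Connect GoHighLevel".
function buildAuthorizeUrl(clientId: string, appBaseUrl: string, scopes: string[] = GHL_SCOPES): string {
  const url = new URL(GHL_AUTHORIZE_URL);
  url.searchParams.set("response_type", "code");
  url.searchParams.set("client_id", clientId);
  url.searchParams.set("redirect_uri", new URL("/oauth/secret-crm", appBaseUrl).toString());
  url.searchParams.set("scope", scopes.join(" "));
  return url.toString();
}
```

For example, `buildAuthorizeUrl("my-client-id", "https://app.example.com")` yields a URL whose `redirect_uri` points at the `/oauth/secret-crm` callback on the app's own domain.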
Required Scopes (go-high-level-2)
- contacts.readonly and contacts.write: Manage contacts
- conversations.readonly and conversations.write: Conversations
- locations/customFields.readonly and locations/customFields.write: Custom fields
- locations/customValues.readonly and locations/customValues.write: Custom field values
- opportunities.readonly and opportunities.write: Opportunity management
- companies.readonly: Agency-level access
- oauth.write and oauth.readonly: Token management
Data Sync Strategies
Contact Creation/Update:
- Check if contact exists by email or phone
- Create new contact if not found
- Update existing contact with extracted data
- Handle custom field mapping
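The create-or-update decision can be sketched as a pure function over candidate contacts. In practice the candidates would come from a GHL contact search; here they are passed in, and the contact shape and normalization rules are assumptions.

```typescript
// Hypothetical minimal contact shape for matching purposes.
interface GhlContact {
  id: string;
  email?: string;
  phone?: string;
}

type SyncPlan = { action: "create" } | { action: "update"; contactId: string };

const normalizeEmail = (email: string): string => email.trim().toLowerCase();
const normalizePhone = (phone: string): string => phone.replace(/\D/g, ""); // digits only

// Matches on email first or phone second within each candidate; no match means create.
function planContactSync(
  existing: GhlContact[],
  extracted: { email?: string; phone?: string },
): SyncPlan {
  const email = extracted.email ? normalizeEmail(extracted.email) : "";
  const phone = extracted.phone ? normalizePhone(extracted.phone) : "";
  const match = existing.find(
    (c) =>
      (email !== "" && c.email !== undefined && normalizeEmail(c.email) === email) ||
      (phone !== "" && c.phone !== undefined && normalizePhone(c.phone) === phone),
  );
  return match ? { action: "update", contactId: match.id } : { action: "create" };
}
```

Digits-only phone matching ignores country-code differences, so a production version would likely normalize to a single format such as E.164 first.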
Custom Field Management:
- Fetch available custom fields for location
- Create new custom fields if needed (requires permission)
- Map extracted fields to existing custom fields
- Validate field types (text, number, date, etc.)
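Field type validation could be sketched as a coercion step that runs before syncing a value to a custom field. The three types mirror the examples above; accepting only ISO `YYYY-MM-DD` dates and stripping thousands separators are assumptions made for this sketch.

```typescript
type CustomFieldType = "text" | "number" | "date";

type Coerced = { ok: true; value: string | number } | { ok: false; reason: string };

// Converts a raw extracted string into a value suitable for the target field type.
function coerceForField(raw: string, type: CustomFieldType): Coerced {
  const value = raw.trim();
  switch (type) {
    case "text":
      return { ok: true, value };
    case "number": {
      const n = Number(value.replace(/,/g, "")); // drop thousands separators
      return value !== "" && Number.isFinite(n)
        ? { ok: true, value: n }
        : { ok: false, reason: `"${raw}" is not a number` };
    }
    case "date": {
      // Assumption: only ISO dates are accepted; other formats need explicit parsing.
      const isIso = /^\d{4}-\d{2}-\d{2}$/.test(value) && !Number.isNaN(Date.parse(value));
      return isIso ? { ok: true, value } : { ok: false, reason: `"${raw}" is not an ISO date` };
    }
  }
}
```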
Opportunity Creation:
- Create opportunities linked to contacts
- Use extracted data for opportunity details
- Set pipeline and stage based on document type
Error Handling
- Token Expiration: Automatic refresh using refresh tokens
- Rate Limiting: Implement exponential backoff
- API Errors: Graceful handling of GHL API errors
- Connection Loss: Detect and prompt for re-authentication
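Automatic token refresh might be sketched as below. The token shape, the 60-second safety margin, and the injected `refresh` callback are assumptions; the actual token exchange with GHL is left to the caller.

```typescript
// Hypothetical stored-token shape; the spec stores OAuth data as JSONB.
interface StoredTokens {
  accessToken: string;
  refreshToken: string;
  expiresAtMs: number; // absolute expiry as epoch milliseconds
}

// Treats a token as expiring if it lapses within the safety margin.
function isExpiring(tokens: StoredTokens, nowMs: number, skewMs = 60_000): boolean {
  return tokens.expiresAtMs - nowMs <= skewMs;
}

// Returns usable tokens, calling the injected refresh function only when needed.
async function getValidTokens(
  tokens: StoredTokens,
  refresh: (refreshToken: string) => Promise<StoredTokens>,
  nowMs: number = Date.now(),
): Promise<StoredTokens> {
  return isExpiring(tokens, nowMs) ? refresh(tokens.refreshToken) : tokens;
}
```

If `refresh` rejects, for example because the refresh token was revoked, the caller can mark the connection as needing re-authentication.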
Security Considerations
- Token Storage: Encrypt OAuth tokens in database
- Webhook Validation: Validate webhook sources by location ID
- Rate Limiting: Prevent abuse of integration endpoints
- Audit Logging: Log all integration activities
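Token encryption at rest could be sketched with Node's built-in `crypto` module and AES-256-GCM. This is one possible approach, not a choice stated in the source; the `iv.tag.ciphertext` encoding and the function names are assumptions, and key management (for example loading a 32-byte key from a secret store) is out of scope.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypts a token with AES-256-GCM; key must be exactly 32 bytes.
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh 96-bit nonce per encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Base64 never contains ".", so it is safe as a separator.
  return [iv, tag, ciphertext].map((part) => part.toString("base64")).join(".");
}

// Decrypts and authenticates; throws if the value was tampered with or the key is wrong.
function decryptToken(encoded: string, key: Buffer): string {
  const [iv, tag, ciphertext] = encoded.split(".").map((part) => Buffer.from(part, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

GCM authenticates as well as encrypts, so a modified database value fails loudly on decryption instead of yielding a corrupted token.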
End-to-End Document Extraction Flow
Phase 1: Template Creation (One-time Setup)
- User Access: User logs into platform, navigates to "Document Templates"
- Template Creation: Click "Create New Template"
- Document Upload: Upload example document (PDF, image, Word)
- AI Processing:
- Trigger.dev job initiated
- Gemini AI analyzes document structure
- Extract all possible data fields with examples
- Return structured field discovery
- Field Review: User reviews discovered fields in organized interface
- Field Selection: User selects which fields they need extracted
- Template Configuration:
- Name template (e.g., "Invoice", "Contract", "Application")
- Add description
- Set field validation rules (required, format, etc.)
- Configure output preferences
- Template Testing: Test template on 2-3 additional sample documents
- Template Activation: Save template to user profile for future use
Phase 2: Document Processing (Ongoing Operations)
- Document Upload: User uploads document(s) to process
- Type Identification:
- Manual: User selects from their templates
- Automatic: AI compares document to existing templates, suggests best match
- Processing Pipeline:
- Document stored in Supabase Storage
- Trigger.dev job queued for processing
- Gemini AI extracts data using selected template
- Extracted data validated against template rules
- Processing status updated in real-time
- Results Display:
- Show extracted data in clean interface
- Highlight validation issues
- Allow manual corrections
- Show confidence scores per field
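Status updates during processing could be guarded by a small transition table. The four status values come from the schema; the allowed transitions, including retrying a failed extraction, are assumptions.

```typescript
type ProcessingStatus = "uploaded" | "processing" | "completed" | "failed";

// Assumed transitions; "failed" -> "processing" models a retry.
const NEXT_STATUSES: Record<ProcessingStatus, readonly ProcessingStatus[]> = {
  uploaded: ["processing"],
  processing: ["completed", "failed"],
  completed: [],
  failed: ["processing"],
};

// Returns the new status, or throws if the transition is not allowed.
function transition(current: ProcessingStatus, next: ProcessingStatus): ProcessingStatus {
  if (!NEXT_STATUSES[current].includes(next)) {
    throw new Error(`Invalid status transition: ${current} -> ${next}`);
  }
  return next;
}
```

Rejecting invalid transitions prevents a late or duplicate job update from moving a completed extraction back to processing.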
Phase 3: Data Integration (Automated)
- Integration Triggers: Based on user configuration
- GoHighLevel Integration:
- Map extracted fields to GHL fields
- Create/update contact with extracted data
- Create opportunity if configured
- Update custom fields
- Log integration results
- Webhook Delivery:
- Format data according to webhook configuration
- Send to configured webhook URLs
- Retry failed deliveries with exponential backoff
- Log delivery status and responses
- Notification: User receives notification of processing completion
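Retry scheduling with exponential backoff might look like the sketch below. The 1-second base delay, 1-hour cap, and 5-attempt limit are placeholder values, and treating any non-2xx response or network error as retryable is an assumption.

```typescript
// Delay doubles with each attempt: 1s, 2s, 4s, ... up to the cap.
function nextRetryDelayMs(attemptNumber: number, baseMs = 1_000, maxMs = 3_600_000): number {
  return Math.min(maxMs, baseMs * 2 ** (attemptNumber - 1));
}

type DeliveryOutcome =
  | { outcome: "delivered" }
  | { outcome: "retry"; nextAttemptNumber: number; nextRetryAt: Date }
  | { outcome: "failed" };

// Decides what to record after an attempt; responseStatus is null on network errors.
function planNextAttempt(
  attemptNumber: number,
  responseStatus: number | null,
  now: Date,
  maxAttempts = 5,
): DeliveryOutcome {
  if (responseStatus !== null && responseStatus >= 200 && responseStatus < 300) {
    return { outcome: "delivered" };
  }
  if (attemptNumber >= maxAttempts) {
    return { outcome: "failed" };
  }
  return {
    outcome: "retry",
    nextAttemptNumber: attemptNumber + 1,
    nextRetryAt: new Date(now.getTime() + nextRetryDelayMs(attemptNumber)),
  };
}
```

The outcomes correspond to the `delivered_at`, `next_retry_at`, and `failed_at` columns of `webhook_deliveries`. Adding random jitter to the delay would help avoid many retries firing at the same moment.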
Phase 4: History & Management
- Extraction History: All extractions logged with full audit trail
- Integration Monitoring: Track success/failure rates
- Template Performance: Analytics on template accuracy and usage
- Error Resolution: Interface to review and resolve failed processes
Implementation Checklist
Phase 1: Foundation & UI Framework
Design System Implementation
- Create Linear.app-inspired component library
- Implement dark/light mode toggle
- Build responsive grid system
- Create consistent typography scale
- Design form components with validation states
Navigation & Layout Updates
- Hide existing tabs (Home, Loan Files, All Draws)
- Create new minimal sidebar navigation
- Implement breadcrumb navigation
- Build dashboard layout for new features
Database Schema Setup
- Create document_templates table
- Create document_extractions table
- Create integration_connections table
- Create webhook_deliveries table
- Set up proper indexes and relationships
- Implement RLS policies for multi-tenancy
- Create
Phase 2: Template Creation System
Template Creation Interface
- Build document upload component
- Create field discovery display interface
- Implement field selection interface
- Build template configuration forms
- Add template testing capability
AI Integration for Template Creation
- Set up Trigger.dev job for template analysis
- Implement Gemini AI document analysis
- Create field discovery algorithms
- Build confidence scoring system
- Add field type detection (text, number, date, etc.)
Template Management
- Build template listing interface
- Implement template editing
- Add template versioning
- Create template import/export
- Build template sharing (future)
Phase 3: Document Processing Pipeline
Upload Interface
- Build modern drag & drop upload
- Implement batch upload support
- Add file type validation
- Create upload progress indicators
- Handle file size limits
Document Type Identification
- Build manual template selection
- Implement AI-powered type identification
- Create confidence-based suggestions
- Add override capabilities
Processing Engine
- Set up Trigger.dev extraction jobs
- Implement Gemini AI extraction
- Build data validation system
- Create real-time status updates
- Add error handling and recovery
Phase 4: Integration System
Webhook Framework
- Build webhook configuration interface
- Implement webhook delivery system
- Add retry logic with exponential backoff
- Create delivery tracking and logging
- Build webhook testing tools
GoHighLevel Integration
- Implement OAuth 2.0 flow using existing v2 integration
- Build field mapping interface
- Create contact creation/update logic
- Implement custom field management
- Add opportunity creation features
- Build agency/multi-location support
Integration Management
- Create integration listing interface
- Build connection status monitoring
- Implement integration testing tools
- Add integration analytics
- Create troubleshooting interfaces
Phase 5: User Experience & Monitoring
Extraction History
- Build extraction listing interface
- Implement filtering and search
- Add export capabilities
- Create detailed view with edit options
Analytics & Monitoring
- Build template performance analytics
- Implement extraction accuracy tracking
- Create integration success monitoring
- Add usage analytics dashboard
User Onboarding
- Create guided onboarding flow
- Build tutorial/help system
- Add sample templates
- Create documentation
Phase 6: Advanced Features
Multi-Platform Integration Framework
- Design extensible integration architecture
- Build Zapier integration
- Add Make.com integration
- Create custom API endpoints
Advanced Template Features
- Implement conditional field extraction
- Add multi-page document support
- Create template inheritance
- Build collaborative template editing
Enterprise Features
- Implement team management
- Add role-based permissions
- Create organization-wide templates
- Build audit logging
Technical Considerations
Performance Optimization
- File Processing: Use streaming for large files
- AI Processing: Implement queueing for high-volume processing
- Database Queries: Optimize with proper indexing
- Caching: Implement Redis for frequently accessed data
- CDN: Use for static assets and processed documents
Security Requirements
- File Upload Security: Validate file types, scan for malware
- Data Encryption: Encrypt sensitive data at rest
- API Security: Implement rate limiting and authentication
- Integration Security: Secure OAuth token storage
- Audit Trail: Log all user actions and data access
Scalability Planning
- Database Sharding: Plan for multi-tenant scaling
- Message Queues: Use for background processing
- Load Balancing: Prepare for horizontal scaling
- Storage Optimization: Implement file lifecycle management
- Monitoring: Set up comprehensive system monitoring
Success Metrics
User Engagement
- Template Creation Rate: Templates created per user
- Processing Volume: Documents processed per month
- Integration Usage: Active integrations per user
- User Retention: Monthly and annual retention rates
Technical Performance
- Processing Speed: Average extraction time
- Accuracy Rates: Template accuracy scores
- Integration Success: Webhook delivery success rates
- System Uptime: Platform availability metrics
Business Metrics
- User Growth: New user acquisition rate
- Revenue Growth: Monthly recurring revenue
- Feature Adoption: Usage of advanced features
- Customer Satisfaction: Support ticket volume and resolution
Risk Mitigation
Technical Risks
- AI Model Changes: Plan for Gemini API updates
- Integration Dependencies: Monitor third-party API stability
- Data Processing Failures: Implement comprehensive error handling
- Security Vulnerabilities: Regular security audits
Business Risks
- Competition: Monitor competitive landscape
- Platform Dependencies: Reduce reliance on single providers
- Compliance: Ensure GDPR, CCPA compliance
- Customer Churn: Implement retention strategies
Future Roadmap
Short-term (3-6 months)
- Complete core platform development
- Launch GoHighLevel integration
- Implement basic webhook system
- Add essential analytics
Medium-term (6-12 months)
- Add more platform integrations (Zapier, Make)
- Implement advanced template features
- Build mobile-responsive interface
- Add team collaboration features
Long-term (12+ months)
- API marketplace for integrations
- AI model fine-tuning capabilities
- Enterprise security features
- White-label solutions
This document is the foundation for building the new document extraction platform. Each section gives detailed specifications while leaving room for iterative development and user feedback.