Resiliency PDF Restructure UG Work¶

Resiliency PDF Restructure UG Work

Python application that validates PDF structure, extracts CU codes, and updates Smartsheet with validation results.

Python 66% Active

Overview¶

Validates the structure of generated PDFs, extracts Construction Unit (CU) codes, verifies formatting compliance, and updates Smartsheet with validation status and extracted data.

GitHub Repository: Repository may be private or not yet public

Key Features¶

- **PDF Structure Validation**: Checks PDF format and structure - **CU Code Extraction**: Extracts Construction Unit codes from PDFs - **Format Verification**: Ensures PDFs meet specifications - **Smartsheet Integration**: Updates validation results - **Batch Processing**: Validates multiple PDFs efficiently - **Error Reporting**: Detailed validation failure messages - **Automated Execution**: Runs after PDF generation

Use Cases¶

Validating generated foreman reports
Extracting CU codes for billing
Quality assurance for automated reports
Ensuring PDF compliance with requirements
Audit trail for generated documents

Architecture¶

graph LR
    PDF[PDF Files] -->|Read| PY[Python Script]
    PY -->|Validate<br/>Structure| PY
    PY -->|Extract<br/>CU Codes| PY
    PY -->|Update<br/>Results| SS[Smartsheet]

    style PDF fill:#ff9,stroke:#333,stroke-width:2px
    style SS fill:#f9f,stroke:#333,stroke-width:2px
    style PY fill:#3776ab,stroke:#333,stroke-width:2px

File Structure¶

Resiliency-pdf-restructure-ug-work/
├── pdf_restructure.py         # Main validation script
├── pdf_validator.py           # PDF validation logic
├── cu_code_extractor.py       # CU code extraction
├── smartsheet_updater.py      # Smartsheet operations
├── config.py                  # Configuration
├── requirements.txt           # Dependencies
├── .env.example               # Environment template
└── README.md                  # Documentation

Environment Variables¶

Variable	Required	Description	Example
`SMARTSHEET_ACCESS_TOKEN`	Yes	API token	`ll...`
`SHEET_ID`	Yes	Target sheet ID	`1234567890123456`
`PDF_DIRECTORY`	Yes	Directory with PDFs	`/var/pdfs/weekly`
`COLUMN_ID_CU_CODE`	Yes	CU code column	`1111111111111111`
`COLUMN_ID_VALIDATION`	Yes	Validation status column	`2222222222222222`
`COLUMN_ID_VALIDATION_DATE`	Yes	Validation date column	`3333333333333333`
`COLUMN_ID_PDF_PATH`	No	PDF file path column	`4444444444444444`

Setup Instructions¶

1. Clone Repository¶

git clone https://github.com/JFlo21/Resiliency-pdf-restructure-ug-work.git
cd Resiliency-pdf-restructure-ug-work

2. Install Dependencies¶

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

3. Configure Environment¶

SMARTSHEET_ACCESS_TOKEN=your_token
SHEET_ID=1234567890123456
PDF_DIRECTORY=/var/pdfs/weekly
COLUMN_ID_CU_CODE=1111111111111111
COLUMN_ID_VALIDATION=2222222222222222
COLUMN_ID_VALIDATION_DATE=3333333333333333
COLUMN_ID_ERROR_MESSAGE=4444444444444444

Usage Examples¶

Validate All PDFs¶

python pdf_restructure.py

Output:

Validating PDFs in /var/pdfs/weekly...
Validating foreman_john_smith_2025-01-13.pdf... PASS
  Extracted CU code: CU-12345
Validating foreman_jane_doe_2025-01-13.pdf... PASS
  Extracted CU code: CU-12346
Validating foreman_bob_johnson_2025-01-13.pdf... FAIL
  Error: Missing CU code section
Updated Smartsheet with 3 validation results

Validate Specific PDF¶

python pdf_restructure.py --pdf /var/pdfs/weekly/foreman_john.pdf

Dry-Run Mode¶

python pdf_restructure.py --dry-run

Verbose Output¶

python pdf_restructure.py --verbose

Validation Rules¶

Structure Checks¶

PDF Format: Valid PDF structure
Page Count: Expected number of pages
Text Extraction: Text is extractable
Font Compliance: Uses approved fonts
Image Quality: Images meet resolution requirements

CU Code Validation¶

import re

def validate_cu_code(text):
    """
    CU code format: CU-##### where # is digit
    """
    pattern = r'CU-\d{5}'
    match = re.search(pattern, text)

    if not match:
        return None, "CU code not found"

    cu_code = match.group(0)

    # Additional validation
    if not is_valid_cu_code(cu_code):
        return None, f"Invalid CU code: {cu_code}"

    return cu_code, None

Format Requirements¶

def validate_pdf_structure(pdf_path):
    errors = []

    # Check page count
    if page_count != 1:
        errors.append(f"Expected 1 page, found {page_count}")

    # Check for required sections
    required_sections = ['Header', 'Summary', 'Details', 'CU Code']
    for section in required_sections:
        if section not in text:
            errors.append(f"Missing section: {section}")

    # Check date format
    if not re.search(r'\d{4}-\d{2}-\d{2}', text):
        errors.append("Date not found in expected format")

    return errors

Dependencies¶

smartsheet-python-sdk>=3.0.0
PyPDF2>=3.0.0
pdfplumber>=0.10.0
python-dotenv>=0.19.0
pillow>=10.0.0

PDF Extraction¶

Extract Text¶

import pdfplumber

def extract_pdf_text(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            text += page.extract_text()
        return text

Extract Tables¶

def extract_tables(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        tables = []
        for page in pdf.pages:
            tables.extend(page.extract_tables())
        return tables

Extract Metadata¶

from PyPDF2 import PdfReader

def extract_metadata(pdf_path):
    reader = PdfReader(pdf_path)
    return {
        'page_count': len(reader.pages),
        'author': reader.metadata.get('/Author'),
        'title': reader.metadata.get('/Title'),
        'created': reader.metadata.get('/CreationDate'),
    }

Smartsheet Updates¶

Update Validation Results¶

def update_validation_results(sheet_id, pdf_name, cu_code, status, errors):
    # Find row by PDF name
    row = find_row_by_pdf_name(sheet_id, pdf_name)

    if not row:
        logger.warning(f"No row found for PDF: {pdf_name}")
        return

    # Update cells
    cells = [
        {
            'column_id': COLUMN_ID_CU_CODE,
            'value': cu_code or ''
        },
        {
            'column_id': COLUMN_ID_VALIDATION,
            'value': status  # 'Pass' or 'Fail'
        },
        {
            'column_id': COLUMN_ID_VALIDATION_DATE,
            'value': datetime.now().strftime('%Y-%m-%d')
        }
    ]

    if errors:
        cells.append({
            'column_id': COLUMN_ID_ERROR_MESSAGE,
            'value': '; '.join(errors)
        })

    update_row(sheet_id, row.id, cells)

Scheduling¶

Run After PDF Generation¶

# Combined script
#!/bin/bash

# Generate PDFs
python /path/to/generate_weekly_pdfs.py

# Validate PDFs
python /path/to/pdf_restructure.py

# Send notification
mail -s "Weekly PDFs Generated and Validated" admin@company.com < /var/log/pdfs.log

Cron Job¶

# Run every Monday at 7 AM (1 hour after PDF generation)
0 7 * * 1 cd /path/to/repo && /path/to/venv/bin/python pdf_restructure.py

Error Handling¶

Validation Failures¶

class ValidationError(Exception):
    pass

try:
    validate_pdf(pdf_path)
except ValidationError as e:
    logger.error(f"Validation failed for {pdf_path}: {e}")
    update_smartsheet_with_error(pdf_name, str(e))

Missing PDFs¶

def validate_all_pdfs(pdf_dir):
    pdf_files = glob.glob(os.path.join(pdf_dir, '*.pdf'))

    if not pdf_files:
        logger.warning(f"No PDF files found in {pdf_dir}")
        return

    for pdf_file in pdf_files:
        try:
            validate_pdf(pdf_file)
        except Exception as e:
            logger.error(f"Error validating {pdf_file}: {e}")
            continue

Monitoring¶

Validation Metrics¶

metrics = {
    'total_pdfs': 0,
    'passed': 0,
    'failed': 0,
    'cu_codes_extracted': 0,
    'errors': []
}

# Track metrics during validation
# Log to file or database

Alerts¶

def send_alert_if_failures(metrics):
    if metrics['failed'] > 0:
        message = f"PDF validation failures: {metrics['failed']}/{metrics['total_pdfs']}"
        send_email_alert(message)

Troubleshooting¶

Cannot Extract Text¶

# Check PDF is not scanned/image-based
pdftotext file.pdf -

If text extraction fails, PDF may need OCR:

# Install Tesseract OCR
sudo apt-get install tesseract-ocr
pip install pytesseract

CU Code Not Found¶

Check PDF content manually:

pdftotext foreman_report.pdf - | grep -i "CU-"

Permission Errors¶

Ensure script has read access to PDF directory:

ls -la /var/pdfs/weekly
chmod 755 /var/pdfs/weekly

Generate Weekly PDFs - PDF generation
Supabase Smartsheet Promax Offload - Database sync