Resiliency PDF Restructure UG Work

Python application that validates PDF structure, extracts CU codes, and updates Smartsheet with validation results.

Python 66% Active

Overview

Validates the structure of generated PDFs, extracts Construction Unit (CU) codes, verifies formatting compliance, and updates Smartsheet with validation status and extracted data.

GitHub Repository: Repository may be private or not yet public

Key Features

  • PDF Structure Validation: Checks PDF format and structure
  • CU Code Extraction: Extracts Construction Unit codes from PDFs
  • Format Verification: Ensures PDFs meet specifications
  • Smartsheet Integration: Updates validation results
  • Batch Processing: Validates multiple PDFs efficiently
  • Error Reporting: Detailed validation failure messages
  • Automated Execution: Runs after PDF generation

Use Cases

  • Validating generated foreman reports
  • Extracting CU codes for billing
  • Quality assurance for automated reports
  • Ensuring PDF compliance with requirements
  • Audit trail for generated documents

Architecture

graph LR
    PDF[PDF Files] -->|Read| PY[Python Script]
    PY -->|Validate<br/>Structure| PY
    PY -->|Extract<br/>CU Codes| PY
    PY -->|Update<br/>Results| SS[Smartsheet]

    style PDF fill:#ff9,stroke:#333,stroke-width:2px
    style SS fill:#f9f,stroke:#333,stroke-width:2px
    style PY fill:#3776ab,stroke:#333,stroke-width:2px

File Structure

Resiliency-pdf-restructure-ug-work/
├── pdf_restructure.py         # Main validation script
├── pdf_validator.py           # PDF validation logic
├── cu_code_extractor.py       # CU code extraction
├── smartsheet_updater.py      # Smartsheet operations
├── config.py                  # Configuration
├── requirements.txt           # Dependencies
├── .env.example               # Environment template
└── README.md                  # Documentation

Environment Variables

| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| SMARTSHEET_ACCESS_TOKEN | Yes | API token | ll... |
| SHEET_ID | Yes | Target sheet ID | 1234567890123456 |
| PDF_DIRECTORY | Yes | Directory with PDFs | /var/pdfs/weekly |
| COLUMN_ID_CU_CODE | Yes | CU code column | 1111111111111111 |
| COLUMN_ID_VALIDATION | Yes | Validation status column | 2222222222222222 |
| COLUMN_ID_VALIDATION_DATE | Yes | Validation date column | 3333333333333333 |
| COLUMN_ID_PDF_PATH | No | PDF file path column | 4444444444444444 |
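
One way to check these variables at startup is sketched below. It assumes the values are already in the process environment (for example, after python-dotenv has loaded a .env file). The `load_config` function and the `REQUIRED`/`OPTIONAL` tuples are illustrative names, not part of the project.

```python
import os

# Variable names come from the table above; grouping them like this is an assumption.
REQUIRED = (
    "SMARTSHEET_ACCESS_TOKEN",
    "SHEET_ID",
    "PDF_DIRECTORY",
    "COLUMN_ID_CU_CODE",
    "COLUMN_ID_VALIDATION",
    "COLUMN_ID_VALIDATION_DATE",
)
OPTIONAL = ("COLUMN_ID_PDF_PATH",)


def load_config(env=None):
    """Return a dict of settings, raising if a required variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    config = {name: env[name] for name in REQUIRED}
    # Optional variables are included only when they are actually set.
    for name in OPTIONAL:
        if env.get(name):
            config[name] = env[name]
    return config
```

Failing fast here means a missing column ID surfaces before any PDF is processed rather than halfway through a batch.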

Setup Instructions

1. Clone Repository

git clone https://github.com/JFlo21/Resiliency-pdf-restructure-ug-work.git
cd Resiliency-pdf-restructure-ug-work

2. Install Dependencies

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

3. Configure Environment

Copy .env.example to .env and set the values for your sheet:

SMARTSHEET_ACCESS_TOKEN=your_token
SHEET_ID=1234567890123456
PDF_DIRECTORY=/var/pdfs/weekly
COLUMN_ID_CU_CODE=1111111111111111
COLUMN_ID_VALIDATION=2222222222222222
COLUMN_ID_VALIDATION_DATE=3333333333333333
COLUMN_ID_ERROR_MESSAGE=4444444444444444

Usage Examples

Validate All PDFs

python pdf_restructure.py

Output:

Validating PDFs in /var/pdfs/weekly...
Validating foreman_john_smith_2025-01-13.pdf... PASS
  Extracted CU code: CU-12345
Validating foreman_jane_doe_2025-01-13.pdf... PASS
  Extracted CU code: CU-12346
Validating foreman_bob_johnson_2025-01-13.pdf... FAIL
  Error: Missing CU code section
Updated Smartsheet with 3 validation results

Validate Specific PDF

python pdf_restructure.py --pdf /var/pdfs/weekly/foreman_john.pdf

Dry-Run Mode

python pdf_restructure.py --dry-run

Verbose Output

python pdf_restructure.py --verbose
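
The flags shown above could be wired up with argparse roughly as follows. This is a sketch: only the flag names `--pdf`, `--dry-run`, and `--verbose` come from the examples, while the help strings and the `build_parser` function are assumptions about how the script might be organized.

```python
import argparse


def build_parser():
    """Build a parser for the three flags used in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="pdf_restructure.py",
        description="Validate PDFs and update Smartsheet.",
    )
    parser.add_argument(
        "--pdf",
        metavar="PATH",
        help="validate a single PDF instead of the whole directory",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="validate without writing to Smartsheet (assumed meaning)",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="print more detail while validating",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--pdf", "report.pdf", "--dry-run"])
```

argparse stores `--dry-run` as `args.dry_run`, so the rest of the script would test that attribute before calling any Smartsheet update.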

Validation Rules

Structure Checks

  1. PDF Format: Valid PDF structure
  2. Page Count: Expected number of pages
  3. Text Extraction: Text is extractable
  4. Font Compliance: Uses approved fonts
  5. Image Quality: Images meet resolution requirements
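
As an illustration of the first check, a cheap pre-test for a valid PDF is to confirm the file starts with the `%PDF-` header before handing it to a full parser. This is a sketch rather than the project's implementation; `looks_like_pdf` is a hypothetical helper, and a file that passes can still be damaged further in.

```python
def looks_like_pdf(path):
    """Return True if the file begins with the PDF header marker."""
    with open(path, "rb") as f:
        # PDF files begin with "%PDF-" followed by a version number.
        return f.read(5) == b"%PDF-"
```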

CU Code Validation

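import re

def validate_cu_code(text):
    """
    Find the first CU code in text.

    CU code format: CU-##### where # is a digit.
    Returns (cu_code, None) on success or (None, error_message) on failure.
    """
    # The lookahead rejects longer digit runs such as CU-123456
    pattern = r'CU-\d{5}(?!\d)'
    match = re.search(pattern, text)

    if not match:
        return None, "CU code not found"

    cu_code = match.group(0)

    # Additional validation; is_valid_cu_code is a project helper not shown here
    if not is_valid_cu_code(cu_code):
        return None, f"Invalid CU code: {cu_code}"

    return cu_code, None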
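
The `is_valid_cu_code` helper is not shown in this document. A minimal sketch, assuming it only needs to re-check the `CU-#####` shape and optionally consult a list of known codes, might look like this. The `known_codes` parameter is invented for illustration.

```python
import re

_CU_CODE = re.compile(r"CU-\d{5}")


def is_valid_cu_code(cu_code, known_codes=None):
    """Check the CU-##### shape and, if given, membership in known_codes."""
    # fullmatch requires the whole string to match, not just a prefix.
    if not _CU_CODE.fullmatch(cu_code):
        return False
    # known_codes is hypothetical, e.g. a set loaded from a billing system.
    return known_codes is None or cu_code in known_codes
```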

Format Requirements

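import re

import pdfplumber

def validate_pdf_structure(pdf_path):
    errors = []

    # Load the page count and text that the checks below rely on
    with pdfplumber.open(pdf_path) as pdf:
        page_count = len(pdf.pages)
        text = '\n'.join(page.extract_text() or '' for page in pdf.pages)

    # Check page count
    if page_count != 1:
        errors.append(f"Expected 1 page, found {page_count}")

    # Check for required sections
    required_sections = ['Header', 'Summary', 'Details', 'CU Code']
    for section in required_sections:
        if section not in text:
            errors.append(f"Missing section: {section}")

    # Check date format (YYYY-MM-DD)
    if not re.search(r'\d{4}-\d{2}-\d{2}', text):
        errors.append("Date not found in expected format")

    return errors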
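
The date regex above accepts any digits in the right shape, so a string such as 2025-13-45 would pass. If calendar validity matters, a stricter check could parse each candidate with `datetime.strptime`. This is a sketch; `find_valid_date` is a hypothetical helper, not part of the project.

```python
import re
from datetime import datetime

# Lookarounds keep the match from starting or ending inside a longer digit run.
_DATE_SHAPE = re.compile(r"(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)")


def find_valid_date(text):
    """Return the first YYYY-MM-DD substring that is a real calendar date, or None."""
    for candidate in _DATE_SHAPE.findall(text):
        try:
            return datetime.strptime(candidate, "%Y-%m-%d").date()
        except ValueError:
            continue  # right shape but impossible date; try the next match
    return None
```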

Dependencies

smartsheet-python-sdk>=3.0.0
PyPDF2>=3.0.0
pdfplumber>=0.10.0
python-dotenv>=0.19.0
pillow>=10.0.0

PDF Extraction

Extract Text

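import pdfplumber

def extract_pdf_text(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        pages = []
        for page in pdf.pages:
            # extract_text() can come back empty (or None in some versions)
            # for image-only pages, so fall back to an empty string
            pages.append(page.extract_text() or '')
        # Join with newlines so text from adjacent pages does not run together
        return '\n'.join(pages)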

Extract Tables

def extract_tables(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        tables = []
        for page in pdf.pages:
            tables.extend(page.extract_tables())
        return tables

Extract Metadata

from PyPDF2 import PdfReader

def extract_metadata(pdf_path):
    reader = PdfReader(pdf_path)
    # reader.metadata is None when the PDF has no document information dictionary
    metadata = reader.metadata or {}
    return {
        'page_count': len(reader.pages),
        'author': metadata.get('/Author'),
        'title': metadata.get('/Title'),
        'created': metadata.get('/CreationDate'),
    }
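
The raw `/CreationDate` value is a PDF date string such as `D:20250113060000-05'00'`, not a Python datetime. A minimal sketch for turning it into one is shown below; `parse_pdf_date` is a hypothetical helper, and it deliberately ignores the timezone suffix, which may or may not be acceptable for an audit trail.

```python
import re
from datetime import datetime

# Year is mandatory; month, day, hour, minute, and second are each optional.
_PDF_DATE = re.compile(r"(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?")


def parse_pdf_date(raw):
    """Parse a PDF date string into a naive datetime, or return None."""
    if not raw:
        return None
    match = _PDF_DATE.match(str(raw))
    if not match:
        return None
    year, month, day, hour, minute, second = match.groups()
    try:
        # Missing month/day default to 1; missing time parts default to 0.
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
        )
    except ValueError:
        return None  # e.g. month 13
```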

Smartsheet Updates

Update Validation Results

import logging
from datetime import datetime

logger = logging.getLogger(__name__)

# The COLUMN_ID_* constants hold the column IDs from the environment variables.
# find_row_by_pdf_name and update_row are project helpers not shown here.

def update_validation_results(sheet_id, pdf_name, cu_code, status, errors):
    # Find row by PDF name
    row = find_row_by_pdf_name(sheet_id, pdf_name)

    if not row:
        logger.warning(f"No row found for PDF: {pdf_name}")
        return

    # Update cells
    cells = [
        {
            'column_id': COLUMN_ID_CU_CODE,
            'value': cu_code or ''
        },
        {
            'column_id': COLUMN_ID_VALIDATION,
            'value': status  # 'Pass' or 'Fail'
        },
        {
            'column_id': COLUMN_ID_VALIDATION_DATE,
            'value': datetime.now().strftime('%Y-%m-%d')
        }
    ]

    # Record the failure reasons only when there are any
    if errors:
        cells.append({
            'column_id': COLUMN_ID_ERROR_MESSAGE,
            'value': '; '.join(errors)
        })

    update_row(sheet_id, row.id, cells)

Scheduling

Run After PDF Generation

#!/bin/bash
# Combined script: generate PDFs, validate them, then notify

# Generate PDFs
python /path/to/generate_weekly_pdfs.py

# Validate PDFs
python /path/to/pdf_restructure.py

# Send notification
mail -s "Weekly PDFs Generated and Validated" admin@company.com < /var/log/pdfs.log

Cron Job

# Run every Monday at 7 AM (1 hour after PDF generation)
0 7 * * 1 cd /path/to/repo && /path/to/venv/bin/python pdf_restructure.py

Error Handling

Validation Failures

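import logging
import os

logger = logging.getLogger(__name__)

class ValidationError(Exception):
    """Raised when a PDF fails a validation check."""

try:
    validate_pdf(pdf_path)
except ValidationError as e:
    logger.error(f"Validation failed for {pdf_path}: {e}")
    # Report against the file name, derived from the path being validated
    update_smartsheet_with_error(os.path.basename(pdf_path), str(e))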

Missing PDFs

import glob
import os

def validate_all_pdfs(pdf_dir):
    pdf_files = glob.glob(os.path.join(pdf_dir, '*.pdf'))

    if not pdf_files:
        logger.warning(f"No PDF files found in {pdf_dir}")
        return

    for pdf_file in pdf_files:
        try:
            validate_pdf(pdf_file)
        except Exception as e:
            # Log and move on so one bad file does not stop the batch
            logger.error(f"Error validating {pdf_file}: {e}")
            continue

Monitoring

Validation Metrics

metrics = {
    'total_pdfs': 0,
    'passed': 0,
    'failed': 0,
    'cu_codes_extracted': 0,
    'errors': []
}

# Track metrics during validation
# Log to file or database
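
A small helper can keep these counters consistent as each PDF is processed. The sketch below uses the same keys as the dictionary above; `new_metrics` and `record_result` are hypothetical names, and counting a PDF as failed whenever it has any errors is an assumption.

```python
def new_metrics():
    """Return a fresh metrics dict with the keys used in this section."""
    return {
        "total_pdfs": 0,
        "passed": 0,
        "failed": 0,
        "cu_codes_extracted": 0,
        "errors": [],
    }


def record_result(metrics, pdf_name, cu_code, errors):
    """Fold one PDF's outcome into the running totals."""
    metrics["total_pdfs"] += 1
    if cu_code:
        metrics["cu_codes_extracted"] += 1
    if errors:
        metrics["failed"] += 1
        # Prefix each error with the file name so the log stays traceable.
        metrics["errors"].extend(f"{pdf_name}: {err}" for err in errors)
    else:
        metrics["passed"] += 1
    return metrics
```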

Alerts

def send_alert_if_failures(metrics):
    if metrics['failed'] > 0:
        message = f"PDF validation failures: {metrics['failed']}/{metrics['total_pdfs']}"
        send_email_alert(message)
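
`send_email_alert` is not defined in this document. One possible shape, using only the standard library, is sketched below. The SMTP host, sender address, and subject line are assumptions; the recipient reuses the address from the scheduling example.

```python
import smtplib
from email.message import EmailMessage


def build_alert(message, sender, recipient):
    """Build the alert email without sending it, which keeps it easy to test."""
    msg = EmailMessage()
    msg["Subject"] = "PDF validation alert"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(message)
    return msg


def send_email_alert(
    message,
    sender="pdf-validator@localhost",   # assumed sender address
    recipient="admin@company.com",
    host="localhost",                   # assumes a local SMTP relay
):
    """Send the alert through an SMTP server reachable at host."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(build_alert(message, sender, recipient))
```

Separating message construction from delivery lets the alert text be checked in a dry run without contacting a mail server.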

Troubleshooting

Cannot Extract Text

# Check PDF is not scanned/image-based
pdftotext file.pdf -

If the command prints little or no text, the PDF is likely scanned or image-based and may need OCR:

# Install Tesseract OCR
sudo apt-get install tesseract-ocr
pip install pytesseract
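
Inside the script, the same situation can be detected by checking how much text extraction actually returned. The heuristic below is a sketch: `needs_ocr` is a hypothetical helper and the 20-character threshold is an arbitrary assumption that would need tuning against real reports.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: treat a PDF as image-only if almost no text was extracted."""
    # Strip all whitespace so blank lines and spaces do not count as content.
    visible = "".join((extracted_text or "").split())
    return len(visible) < min_chars
```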

CU Code Not Found

Check PDF content manually:

pdftotext foreman_report.pdf - | grep -i "CU-"

Permission Errors

Ensure script has read access to PDF directory:

ls -la /var/pdfs/weekly
chmod 755 /var/pdfs/weekly
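
The same check can run at the start of the script so a permission problem is reported clearly instead of showing up as "no PDF files found". This is a sketch; `check_pdf_directory` is a hypothetical helper.

```python
import os


def check_pdf_directory(path):
    """Return a list of problems that would stop the script reading PDFs from path."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"Not a directory: {path}")
    elif not os.access(path, os.R_OK | os.X_OK):
        # Listing a directory needs read permission; opening files in it needs execute.
        problems.append(f"No read/traverse permission on: {path}")
    return problems
```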

Additional Resources