AI Product Data Extraction Pipeline — Case Study

🔴 The Problem

An e-commerce business received supplier product catalogs as PDFs and unstructured emails every day. Employees spent 8+ hours daily manually reading these documents and typing product names, SKUs, pricing, dimensions, and specifications into Airtable.

This manual entry was not only exhaustive — it was error-prone. Typos and transcription mistakes in SKU numbers and pricing was causing fulfillment issues that cost the client money and led to customer complaints. The process scaled poorly: as the business grew, they needed to hire more people just to keep up with data entry.

8 hours/day of repetitive data entry by 2 full-time staff
5–10% error rate causing fulfillment problems
1–2 day lag between receiving catalogs and products available in the system
Process did not scale — each new supplier required more headcount

🟢 The Solution

I built an end-to-end AI pipeline that monitors the client's supplier inbox, extracts structured data from any format (PDF, Excel, plain email), and automatically creates or updates Airtable records — all without human involvement for standard-confidence extractions.

The pipeline handles two document types with different parsers:

PDF Catalogs: Processed by a Python Cloud Function that extracts all text and tables, then structured by GPT-4
Email catalogs: HTML/text parsed directly in Make.com, then sent to GPT-4 for structured extraction

"The system now processes an entire supplier catalog — 200+ products — in under 4 minutes. What took two people all day is now completely automated."

🛠️ Technical Architecture

📧

Gmail API Watch

Monitors supplier inbox for new emails with attachments

🐍

Python Cloud Function

PDF text and table extraction, preprocessing for GPT

🤖

OpenAI GPT-4

Structured data extraction with JSON schema output

🗃️

Airtable

Upsert destination — deduplication on SKU field

📊 The Results

8+ hours/day of manual data entry completely eliminated
500+ product records processed daily without human input
Extraction accuracy: 97%+ (vs. 90–95% with manual human entry)
Low-confidence extractions (under 85% confidence) routed to a human review queue — only ~3% of records require any human check
Catalog-to-system lag reduced from 1–2 days to under 10 minutes