--- name: bank-statement-converter display_name: Bank Statement Converter version: "1.0.0" description: >- Convert PDF bank and credit card statements into CSV and QBO files. Processes each file independently with mandatory validation against statement summaries before proceeding. Supports any bank format. Keywords: bank statement, PDF, CSV, QBO, QuickBooks, OFX, accounting, finance. tags: - accounting - finance - pdf - csv - quickbooks allowed-tools: >- Read Write Edit Bash --- # Bank Statement Converter Converts PDF bank and credit card statements into structured CSV and QBO (QuickBooks) files. Each PDF is processed independently with full validation before moving to the next. --- ## Mandatory Rules ### Never Do 1. **Never write a single script to batch-process multiple PDFs** — even if you believe the formats are identical. 2. **Never assume two PDFs share the same format** — even from the same bank. 3. **Never skip the validation step.** 4. **Never proceed to the next file before validation passes.** ### Always Do Complete the full pipeline for each PDF before starting the next: ``` Step 1: Read PDF and extract text ↓ Step 2: Analyze format → parse transactions line by line ↓ Step 3: Generate CSV (dates in YYYY-MM-DD) ↓ Step 4: ★ Full CSV Validation ★ — must pass before continuing ↓ Step 5: Generate QBO file ↓ Step 6: ★ Full QBO Validation ★ — must pass before continuing ↓ Step 7: Save files to the same directory as the source PDF ↓ Only after all validations pass → start next PDF ``` --- ## Step 1: Extract Text from PDF Use `pdfplumber` to read the PDF and extract text: ```python import pdfplumber with pdfplumber.open("statement.pdf") as pdf: for page in pdf.pages: text = page.extract_text() print(text) ``` ## Step 2: Analyze Format and Parse Transactions **Every PDF may have a different layout. Always analyze before parsing.** ### 2.1 Examine the PDF Structure After extracting text, observe: - How transactions are arranged (table, list, mixed) - Date position and format - Amount layout (single column with +/-, two separate columns) - Whether descriptions wrap across multiple lines - Whether summary/total rows need to be excluded ### 2.2 Standard Fields Bank statements typically contain: | Field | Common Labels | |-------|---------------| | Date | Date, Trans Date, Posted | | Description | Description, Memo, Details, Payee | | Credit | Credit, Deposits, Money In | | Debit | Debit, Withdrawals, Money Out | ### 2.3 Determine Amount Direction - **Two-column format** (separate Credit/Debit columns): map directly. - **Single-column format** (signed amounts): - Positive or unsigned → Credit (income) - Negative or parenthesized → Debit (expense) ### 2.4 Parsing Principle **Parse based on the actual format of the current PDF.** Do not rely on hardcoded patterns or regular expressions from other statements. ## Step 3: Standardize Dates and Generate CSV ### Date Normalization Convert all dates to `YYYY-MM-DD` (ISO 8601): | PDF Format | Standardized | Example | |------------|-------------|---------| | `1 February` / `Feb 1` | `YYYY-02-01` | `2025-02-01` | | `Nov 03, 2025` | `2025-11-03` | `2025-11-03` | | `03/15/2025` (MM/DD/YYYY) | `2025-03-15` | `2025-03-15` | | `15/03/2025` (DD/MM/YYYY) | `2025-03-15` | `2025-03-15` | | `03-15-25` | `2025-03-15` | `2025-03-15` | ### When the Year Is Missing If the PDF only shows month and day (e.g., `Jan 15`): 1. **Check the PDF header** — usually states the period, e.g., "Statement Period: January 1 – January 31, 2025" 2. **Check nearby rows** — adjacent transactions may include full dates 3. **Check first/last entries** — summary sections often include the year 4. **Handle year boundaries** — if December is followed by January, increment the year **Do not assume all transactions fall in the same year.** Cross-year statements (Dec–Jan) are common. ### CSV Format ```python import csv with open("transactions.csv", "w", newline="", encoding="utf-8") as f: writer = csv.writer(f) writer.writerow(["Date", "Description", "Credit", "Debit"]) for t in transactions: writer.writerow([t["date"], t["description"], t["credit"], t["debit"]]) ``` **Rules:** - Columns: `Date`, `Description`, `Credit`, `Debit` - Dates: `YYYY-MM-DD` - Amounts: numbers only, no currency symbols - Empty credit or debit: leave blank - Descriptions containing commas must be quoted (handled by `csv.writer`) ## Step 4: CSV Validation (Mandatory) ### A1. Transaction Count - [ ] Count rows in the CSV - [ ] Cross-reference every transaction against the PDF - [ ] Confirm the counts match ### A2. Amount Totals - [ ] Sum all Credit values in the CSV - [ ] Sum all Debit values in the CSV - [ ] Compare against the PDF summary totals (usually at the top or bottom of the statement) - [ ] CSV Credit total ≈ PDF "Total Deposits / Money In" - [ ] CSV Debit total ≈ PDF "Total Withdrawals / Money Out" **Tolerance: ≤ $0.01. Larger discrepancies require investigation.** ### A3. Per-Transaction Check - [ ] All dates are `YYYY-MM-DD` - [ ] Dates are correct (e.g., February = 02, not 01) - [ ] Descriptions are complete (no truncation) - [ ] Credit/Debit amounts are positive numbers in the correct column ### If Validation Fails 1. Stop — do not proceed to QBO generation 2. Re-analyze the PDF format from Step 1 3. Fix the issue and regenerate CSV 4. Re-run full validation ## Steps 5 & 6: Generate and Validate QBO QBO is QuickBooks' bank statement import format based on OFX/SGML. ```python import csv from datetime import datetime def csv_to_qbo(csv_file, qbo_file): transactions = [] with open(csv_file, "r", encoding="utf-8") as f: reader = csv.DictReader(f) for row in reader: transactions.append(row) # Date range from all transactions (not just first/last row) dates = [datetime.strptime(t["Date"], "%Y-%m-%d") for t in transactions] dt_start = min(dates).strftime("%Y%m%d") dt_end = max(dates).strftime("%Y%m%d") qbo_lines = [ "OFXHEADER:100", "DATA:OFXSGML", "VERSION:102", "", "", f"{dt_start}", f"{dt_end}", ] for t in transactions: credit = float(t["Credit"]) if t.get("Credit", "").strip() else 0 debit = float(t["Debit"]) if t.get("Debit", "").strip() else 0 amount = credit - debit date_str = t["Date"].replace("-", "") desc = t["Description"].upper() if any(kw in desc for kw in ["PAYMENT", "WITHDRAWAL", "DEBIT", "PURCHASE", "CHARGE"]): trn_type = "DEBIT" elif any(kw in desc for kw in ["TRANSFER IN", "CREDIT", "DEPOSIT", "SALARY", "INCOME"]): trn_type = "CREDIT" else: trn_type = "GEN" qbo_lines.extend([ "", f"{trn_type}", f"{date_str}", f"{amount:.2f}", f"{t['Description']}", f"{t['Description']}", "", ]) qbo_lines.extend(["", ""]) with open(qbo_file, "w", encoding="utf-8") as f: f.write("\n".join(qbo_lines)) ``` ### QBO Field Reference | Field | Description | Example | |-------|-------------|---------| | OFXHEADER | OFX version header | `OFXHEADER:100` | | DTSTART/DTEND | Date range (YYYYMMDD) | `20250101` | | DTPOSTED | Transaction date (YYYYMMDD) | `20250115` | | TRNAMT | Amount (positive = income, negative = expense) | `-25.00` | | TRNTYPE | Type: DEBIT, CREDIT, or GEN | `DEBIT` | ### QBO Validation - [ ] QBO file was generated successfully - [ ] Transaction count matches the CSV - [ ] All dates are YYYYMMDD (8 digits) - [ ] Amounts have correct signs (income positive, expense negative) - [ ] DTSTART/DTEND range is correct --- ## Output Files ### Location Save output files in the **same directory** as the source PDF. ### Naming Convention | Input | CSV Output | QBO Output | |-------|-----------|------------| | `acme_bank_jan.pdf` | `acme_bank_jan_transactions.csv` | `acme_bank_jan.qbo` | | `chase_checking_q1.pdf` | `chase_checking_q1_transactions.csv` | `chase_checking_q1.qbo` | - **CSV**: append `_transactions` suffix to the original filename - **QBO**: use the original filename with `.qbo` extension --- ## Troubleshooting | Problem | Likely Cause | Solution | |---------|-------------|----------| | Transaction count too low | Missed rows or multi-line descriptions not merged | Re-examine PDF text line by line | | Amount totals mismatch | Wrong amount direction or missing transactions | Compare against PDF summary totals | | Wrong date format | Non-ISO format or incorrect month mapping | Verify all dates are YYYY-MM-DD | | Description truncated | Multi-line description only captured first line | Check if descriptions span multiple lines | | QBO import fails | Date format error or missing tags | Confirm YYYYMMDD format and all required fields | --- ## Dependencies - `pdfplumber` — PDF text extraction - Python standard library: `csv`, `datetime` --- ## Multi-File Processing Summary ``` file1.pdf → extract → analyze → parse → CSV → [Validate] → QBO → [Validate] → ✓ saved ↓ file2.pdf → extract → analyze → parse → CSV → [Validate] → QBO → [Validate] → ✓ saved ↓ file3.pdf → extract → analyze → parse → CSV → [Validate] → QBO → [Validate] → ✓ saved ``` **Each file must complete all validations (✓) before processing the next.**