--- name: Python JSON description: UTF-8 JSON file I/O utilities to avoid Windows encoding issues (CP-1252 vs UTF-8) --- # Python JSON Python on Windows defaults to the system encoding (usually CP-1252) when opening files, not UTF-8. This causes mojibake when reading JSON files containing non-ASCII characters like `£`, `€`, or accented letters. ## Instructions When helping users with JSON file operations in Python, follow these guidelines: 1. **Never use raw file operations for JSON**: Always use a utility module that abstracts the encoding issue 2. **Always specify UTF-8 encoding**: When opening files for JSON, use `encoding='utf-8'` explicitly 3. **Use ensure_ascii=False when writing**: This preserves Unicode characters instead of escaping them to `\uXXXX` 4. **Create a json_io utility module**: For any project that reads/writes JSON files, create a dedicated utility 5. **RFC 8259 compliance**: JSON text must be encoded in UTF-8 per the JSON specification ## Examples ### Example 1: Project needs JSON file operations ``` User: Help me read and write JSON files in my Python project Claude: I'll create a json_io utility module to abstract the Windows encoding issue: - Create utils/json_io.py with load_json and save_json functions - Both functions explicitly use UTF-8 encoding - save_json uses ensure_ascii=False to preserve Unicode [Provides implementation using the json_io pattern from reference material below] ``` ### Example 2: Mojibake in JSON data ``` User: My JSON file shows £ as £ when I read it Claude: This is the UTF-8 double-encoding bug on Windows. The £ character (U+00A3, bytes C2 A3 in UTF-8) is being read as CP-1252, where those bytes represent £. Solution: Use explicit UTF-8 encoding when opening files: - with open(path, 'r', encoding='utf-8') as f I'll create a json_io utility module to prevent this issue project-wide. [Provides implementation using the json_io pattern from reference material below] ``` ### Example 3: Writing JSON with special characters ``` User: When I save JSON, my £ symbols become \u00a3 Claude: Python's json.dump() defaults to ensure_ascii=True, which escapes non-ASCII characters. To preserve them: - Use ensure_ascii=False in json.dump() - Always specify encoding='utf-8' when opening the file [Provides implementation using the json_io pattern from reference material below] ``` --- # Reference Implementation Details The sections below contain proven working code that the examples above reference. **Reference Files in This Folder**: - `json_io.py` - Complete UTF-8 JSON I/O utility (copy to utils/) ## json_io Pattern **Purpose**: Abstract JSON file operations to always use UTF-8 encoding ### Code Example ```python """ JSON file I/O utilities with proper UTF-8 encoding. This module abstracts JSON file operations to handle the well-known issue of Python defaulting to system encoding on Windows instead of UTF-8 (which is the default encoding per RFC 8259). """ import json from pathlib import Path from typing import Any, Union def load_json(path: Union[str, Path]) -> Any: """ Load JSON from a file with proper UTF-8 encoding. Args: path: Path to the JSON file Returns: Parsed JSON data """ with open(path, 'r', encoding='utf-8') as f: return json.load(f) def save_json(path: Union[str, Path], data: Any, indent: int = 2) -> None: """ Save data to a JSON file with proper UTF-8 encoding. Args: path: Path to the JSON file data: Data to serialize indent: Indentation level (default: 2) """ with open(path, 'w', encoding='utf-8') as f: json.dump(data, f, indent=indent, ensure_ascii=False) ``` **Key Points**: - Always use `encoding='utf-8'` when opening files - Use `ensure_ascii=False` to preserve Unicode characters in output - Accept both str and Path objects for flexibility - Default indent of 2 for human-readable output ## Usage Pattern ```python from utils.json_io import load_json, save_json # Reading data = load_json('config.json') # Writing save_json('output.json', {'price': '£99.99'}) ``` ## The Problem Explained ### Windows Default Encoding Python's `open()` function uses `locale.getencoding()` to determine the default encoding. On Windows, this is typically CP-1252 (Windows-1252), not UTF-8. ### The Double-Encoding Bug When a UTF-8 file is read with CP-1252 encoding: | Character | UTF-8 bytes | CP-1252 interpretation | |-----------|-------------|----------------------| | £ (U+00A3) | C2 A3 | £ | | € (U+20AC) | E2 82 AC | € | | é (U+00E9) | C3 A9 | é | ### RFC 8259 Requirement Per RFC 8259 (The JSON Data Interchange Format): > JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8. ## Troubleshooting ### Mojibake Characters **Symptom**: Characters like `£`, `€`, `é` appear in your data **Cause**: UTF-8 file read with CP-1252 encoding **Solution**: Use the json_io utility module or add `encoding='utf-8'` to file open calls ### \uXXXX Escapes in Output **Symptom**: Non-ASCII characters appear as `\u00a3` in JSON output **Cause**: `json.dump()` default `ensure_ascii=True` **Solution**: Use `ensure_ascii=False` in json.dump() ## Best Practices Summary 1. Create a `utils/json_io.py` module in every Python project that handles JSON files 2. Never use raw `json.load(open(...))` without explicit UTF-8 encoding 3. Use `ensure_ascii=False` when writing JSON to preserve readability 4. Consider setting `PYTHONUTF8=1` environment variable for system-wide UTF-8 default (Python 3.7+)