# csv-diff [![PyPI](https://img.shields.io/pypi/v/csv-diff.svg)](https://pypi.org/project/csv-diff/) [![Changelog](https://img.shields.io/github/v/release/simonw/csv-diff?include_prereleases&label=changelog)](https://github.com/simonw/csv-diff/releases) [![Tests](https://github.com/simonw/csv-diff/workflows/Test/badge.svg)](https://github.com/simonw/csv-diff/actions?query=workflow%3ATest) [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/csv-diff/blob/main/LICENSE) Tool for viewing the difference between two CSV, TSV or JSON files. See [Generating a commit log for San Francisco’s official list of trees](https://simonwillison.net/2019/Mar/13/tree-history/) (and the [sf-tree-history repo commit log](https://github.com/simonw/sf-tree-history/commits)) for background information on this project. ## Installation pip install csv-diff ## Usage Consider two CSV files: `one.csv` id,name,age 1,Cleo,4 2,Pancakes,2 `two.csv` id,name,age 1,Cleo,5 3,Bailey,1 `csv-diff` can show a human-readable summary of differences between the files: $ csv-diff one.csv two.csv --key=id 1 row changed, 1 row added, 1 row removed 1 row changed Row 1 age: "4" => "5" 1 row added id: 3 name: Bailey age: 1 1 row removed id: 2 name: Pancakes age: 2 The `--key=id` option means that the `id` column should be treated as the unique key, to identify which records have changed. The tool will automatically detect if your files are comma- or tab-separated. You can over-ride this automatic detection and force the tool to use a specific format using `--format=tsv` or `--format=csv`. You can also feed it JSON files, provided they are a JSON array of objects where each object has the same keys. Use `--format=json` if your input files are JSON. Use `--show-unchanged` to include full details of the unchanged values for rows with at least one change in the diff output: % csv-diff one.csv two.csv --key=id --show-unchanged 1 row changed id: 1 age: "4" => "5" Unchanged: name: "Cleo" ### JSON output You can use the `--json` option to get a machine-readable difference: $ csv-diff one.csv two.csv --key=id --json { "added": [ { "id": "3", "name": "Bailey", "age": "1" } ], "removed": [ { "id": "2", "name": "Pancakes", "age": "2" } ], "changed": [ { "key": "1", "changes": { "age": [ "4", "5" ] } } ], "columns_added": [], "columns_removed": [] } ### Adding templated extras You can specify additional keys to be displayed in the human-readable format using the `--extra` option: --extra name "Python format string with {id} for variables" For example, to output a link to `https://news.ycombinator.com/latest?id={id}` for each item with an ID, you could use this: ```bash csv-diff one.csv two.csv --key=id \ --extra latest "https://news.ycombinator.com/latest?id={id}" ``` These extras display something like this: ``` 1 row changed id: 41459472 points: "24" => "25" numComments: "5" => "6" extras: latest: https://news.ycombinator.com/latest?id=41459472 ``` ## As a Python library You can also import the Python library into your own code like so: from csv_diff import load_csv, compare diff = compare( load_csv(open("one.csv"), key="id"), load_csv(open("two.csv"), key="id") ) `diff` will now contain the same data structure as the output in the `--json` example above. If the columns in the CSV have changed, those added or removed columns will be ignored when calculating changes made to specific rows. ## As a Docker container ### Build the image $ docker build -t csvdiff . ### Run the container $ docker run --rm -v $(pwd):/files csvdiff Suppose current directory contains two csv files : one.csv two.csv $ docker run --rm -v $(pwd):/files csvdiff one.csv two.csv ## Alternatives - [csvdiff](https://github.com/aswinkarthik/csvdiff) is a "fast diff tool for comparing CSV files" - you may get better results from this than from `csv-diff` against larger files.