{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"format": "text/markdown"
},
"source": [
"# CSV dialect detection with CleverCSV\n",
"\n",
"**Author**: [Gertjan van den Burg](https://gertjan.dev)\n",
"\n",
"In this note we'll show some examples of using CleverCSV, a package for \n",
"handling messy CSV files. We'll start with a motivating example and then show \n",
"some other files where CleverCSV shines. CleverCSV was developed as part of a \n",
"research project on automating data wrangling. It achieves an accuracy of 97% \n",
"on over 9300 real-world CSV files and improves the accuracy on messy files by \n",
"21% over standard tools.\n",
"\n",
"Handy links:\n",
"\n",
" - [Paper on arXiv](https://arxiv.org/abs/1811.11242)\n",
" - [CleverCSV on GitHub](https://github.com/alan-turing-institute/CleverCSV)\n",
" - [CleverCSV on PyPI](https://pypi.org/project/clevercsv/)\n",
" - [Reproducible Research Repo](https://github.com/alan-turing-institute/CSV_Wrangling/)\n",
"\n",
"## IMDB Movie data\n",
"\n",
"Alice is a data scientist who would like to analyse the movie ratings on IMDB \n",
"for movies of different genres. She found [a dataset shared by a user on \n",
"Kaggle](https://www.kaggle.com/orgesleka/imdbmovies) that contains information \n",
"of over 14,000 movies. Great! \n",
"\n",
"The data is stored in a CSV file, which is a very common data format for \n",
"sharing tabular data. The first few lines of the file look like this:\n",
"\n",
"```\n",
"fn,tid,title,wordsInTitle,url,imdbRating,ratingCount,duration,year,type,nrOfWins,nrOfNominations,nrOfPhotos,nrOfNewsArticles,nrOfUserReviews,nrOfGenre,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,FilmNoir,GameShow,History,Horror,Music,Musical,Mystery,News,RealityTV,Romance,SciFi,Short,Sport,TalkShow,Thriller,War,Western\n",
"titles01/tt0012349,tt0012349,Der Vagabund und das Kind (1921),der vagabund und das kind,http://www.imdb.com/title/tt0012349/,8.4,40550,3240,1921,video.movie,1,0,19,96,85,3,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0\n",
"titles01/tt0015864,tt0015864,Goldrausch (1925),goldrausch,http://www.imdb.com/title/tt0015864/,8.3,45319,5700,1925,video.movie,2,1,35,110,122,3,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0\n",
"titles01/tt0017136,tt0017136,Metropolis (1927),metropolis,http://www.imdb.com/title/tt0017136/,8.4,81007,9180,1927,video.movie,3,4,67,428,376,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0\n",
"titles01/tt0017925,tt0017925,Der General (1926),der general,http://www.imdb.com/title/tt0017925/,8.3,37521,6420,1926,video.movie,1,1,53,123,219,3,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0\n",
"titles01/tt0021749,tt0021749,Lichter der Großstadt (1931),lichter der gro stadt,http://www.imdb.com/title/tt0021749/,8.7,70057,5220,1931,video.movie,2,0,38,187,186,3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0\n",
"```\n",
"\n",
"Seems pretty standard, let's load it with Pandas!\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"autoscroll": "auto",
"options": {
"caption": false,
"complete": true,
"display_data": true,
"display_stream": true,
"dpi": 200,
"echo": true,
"evaluate": true,
"f_env": null,
"f_pos": "htpb",
"f_size": [
6,
4
],
"f_spines": true,
"fig": true,
"include": true,
"name": null,
"option_string": "",
"results": "verbatim",
"term": false,
"wrap": "output"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Exception reporting mode: Minimal\n"
]
},
{
"ename": "ParserError",
"evalue": "Error tokenizing data. C error: Expected 44 fields in line 66, saw 46\n",
"output_type": "error",
"traceback": [
"\u001b[0;31mParserError\u001b[0m\u001b[0;31m:\u001b[0m Error tokenizing data. C error: Expected 44 fields in line 66, saw 46\n\n"
]
}
],
"source": [
"%xmode Minimal\n",
"import pandas as pd\n",
"df = pd.read_csv('./data/imdb.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"format": "text/markdown"
},
"source": [
"\n",
"Oh, that doesn't work. Maybe there's something wrong with the file? Let's try \n",
"opening it with the Python CSV reader:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"autoscroll": "auto",
"options": {
"caption": false,
"complete": true,
"display_data": true,
"display_stream": true,
"dpi": 200,
"echo": true,
"evaluate": true,
"f_env": null,
"f_pos": "htpb",
"f_size": [
6,
4
],
"f_spines": true,
"fig": true,
"include": true,
"name": null,
"option_string": "",
"results": "verbatim",
"term": false,
"wrap": "output"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Detected delimiter = ' ', quotechar = \"'\"\n",
"Loaded 13928 rows.\n"
]
}
],
"source": [
"import csv\n",
"with open('./data/imdb.csv', 'r', newline='') as fid:\n",
" dialect = csv.Sniffer().sniff(fid.read())\n",
" print(\"Detected delimiter = %r, quotechar = %r\" % (dialect.delimiter, dialect.quotechar))\n",
" fid.seek(0)\n",
" reader = csv.reader(fid, dialect=dialect)\n",
" rows = list(reader)\n",
"\n",
"print(\"Loaded %i rows.\" % len(rows))"
]
},
{
"cell_type": "markdown",
"metadata": {
"format": "text/markdown"
},
"source": [
"\n",
"Huh, that's strange, Python thinks the *space* is the delimiter and loads \n",
"13928 rows, but the file should contain 14,762 rows according to the \n",
"documentation. What's going on here?\n",
"\n",
"It turns out that on the 65th line of the file, there's a movie with the title \n",
"``Dr. Seltsam\\, oder wie ich lernte\\, die Bombe zu lieben (1964)`` (the German \n",
"version of Dr. Strangelove). The title has commas in it, that are escaped \n",
"using the ``\\`` character! Why are CSV files so hard? 😑\n",
"\n",
"**CleverCSV to the rescue!**\n",
"\n",
"CleverCSV detects the dialect of CSV files much more accurately than existing \n",
"approaches, and it is therefore robust against these kinds of format \n",
"variations. It even has a wrapper that works with DataFrames!\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"autoscroll": "auto",
"options": {
"caption": false,
"complete": true,
"display_data": true,
"display_stream": true,
"dpi": 200,
"echo": true,
"evaluate": true,
"f_env": null,
"f_pos": "htpb",
"f_size": [
6,
4
],
"f_spines": true,
"fig": true,
"include": true,
"name": null,
"option_string": "",
"results": "verbatim",
"term": false,
"wrap": "output"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" fn | \n",
" tid | \n",
" title | \n",
" wordsInTitle | \n",
" url | \n",
" imdbRating | \n",
" ratingCount | \n",
" duration | \n",
" year | \n",
" type | \n",
" ... | \n",
" News | \n",
" RealityTV | \n",
" Romance | \n",
" SciFi | \n",
" Short | \n",
" Sport | \n",
" TalkShow | \n",
" Thriller | \n",
" War | \n",
" Western | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" titles01/tt0012349 | \n",
" tt0012349 | \n",
" Der Vagabund und das Kind (1921) | \n",
" der vagabund und das kind | \n",
" http://www.imdb.com/title/tt0012349/ | \n",
" 8.4 | \n",
" 40550.0 | \n",
" 3240.0 | \n",
" 1921.0 | \n",
" video.movie | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" titles01/tt0015864 | \n",
" tt0015864 | \n",
" Goldrausch (1925) | \n",
" goldrausch | \n",
" http://www.imdb.com/title/tt0015864/ | \n",
" 8.3 | \n",
" 45319.0 | \n",
" 5700.0 | \n",
" 1925.0 | \n",
" video.movie | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" titles01/tt0017136 | \n",
" tt0017136 | \n",
" Metropolis (1927) | \n",
" metropolis | \n",
" http://www.imdb.com/title/tt0017136/ | \n",
" 8.4 | \n",
" 81007.0 | \n",
" 9180.0 | \n",
" 1927.0 | \n",
" video.movie | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" titles01/tt0017925 | \n",
" tt0017925 | \n",
" Der General (1926) | \n",
" der general | \n",
" http://www.imdb.com/title/tt0017925/ | \n",
" 8.3 | \n",
" 37521.0 | \n",
" 6420.0 | \n",
" 1926.0 | \n",
" video.movie | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" titles01/tt0021749 | \n",
" tt0021749 | \n",
" Lichter der Großstadt (1931) | \n",
" lichter der gro stadt | \n",
" http://www.imdb.com/title/tt0021749/ | \n",
" 8.7 | \n",
" 70057.0 | \n",
" 5220.0 | \n",
" 1931.0 | \n",
" video.movie | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 14756 | \n",
" titles04/index.html.9989 | \n",
" tt0672488 | \n",
" \"Peep Show\" Sectioning (TV Episode 2005) | \n",
" peep show sectioning tv episode | \n",
" http://www.imdb.com/title/tt0672488/ | \n",
" 7.7 | \n",
" 135.0 | \n",
" 1440.0 | \n",
" 2005.0 | \n",
" video.episode | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 14757 | \n",
" titles04/index.html.9992 | \n",
" tt0675644 | \n",
" \"Playhouse 90\" The Miracle Worker (TV Episode ... | \n",
" playhouse the miracle worker tv episode | \n",
" http://www.imdb.com/title/tt0675644/ | \n",
" 7.3 | \n",
" 8.0 | \n",
" 5400.0 | \n",
" 1957.0 | \n",
" video.episode | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 14758 | \n",
" titles04/index.html.9994 | \n",
" tt0679222 | \n",
" \"Private Screenings\" Robert Mitchum and Jane R... | \n",
" private screenings robert mitchum and jane rus... | \n",
" http://www.imdb.com/title/tt0679222/ | \n",
" 7.0 | \n",
" 20.0 | \n",
" 3600.0 | \n",
" 1996.0 | \n",
" video.episode | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 14759 | \n",
" titles04/index.html.9995 | \n",
" tt0680064 | \n",
" \"Providence\" All the King's Men (TV Episode 2002) | \n",
" providence all the king s men tv episode | \n",
" http://www.imdb.com/title/tt0680064/ | \n",
" NaN | \n",
" NaN | \n",
" 3600.0 | \n",
" 2002.0 | \n",
" video.episode | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 14760 | \n",
" titles04/index.html.9997 | \n",
" tt0681024 | \n",
" \"QI\" Adam (TV Episode 2003) | \n",
" qi adam tv episode | \n",
" http://www.imdb.com/title/tt0681024/ | \n",
" 7.6 | \n",
" 89.0 | \n",
" 1800.0 | \n",
" 2003.0 | \n",
" video.episode | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
14761 rows × 44 columns
\n",
"
"
],
"text/plain": [
" fn tid \\\n",
"0 titles01/tt0012349 tt0012349 \n",
"1 titles01/tt0015864 tt0015864 \n",
"2 titles01/tt0017136 tt0017136 \n",
"3 titles01/tt0017925 tt0017925 \n",
"4 titles01/tt0021749 tt0021749 \n",
"... ... ... \n",
"14756 titles04/index.html.9989 tt0672488 \n",
"14757 titles04/index.html.9992 tt0675644 \n",
"14758 titles04/index.html.9994 tt0679222 \n",
"14759 titles04/index.html.9995 tt0680064 \n",
"14760 titles04/index.html.9997 tt0681024 \n",
"\n",
" title \\\n",
"0 Der Vagabund und das Kind (1921) \n",
"1 Goldrausch (1925) \n",
"2 Metropolis (1927) \n",
"3 Der General (1926) \n",
"4 Lichter der Großstadt (1931) \n",
"... ... \n",
"14756 \"Peep Show\" Sectioning (TV Episode 2005) \n",
"14757 \"Playhouse 90\" The Miracle Worker (TV Episode ... \n",
"14758 \"Private Screenings\" Robert Mitchum and Jane R... \n",
"14759 \"Providence\" All the King's Men (TV Episode 2002) \n",
"14760 \"QI\" Adam (TV Episode 2003) \n",
"\n",
" wordsInTitle \\\n",
"0 der vagabund und das kind \n",
"1 goldrausch \n",
"2 metropolis \n",
"3 der general \n",
"4 lichter der gro stadt \n",
"... ... \n",
"14756 peep show sectioning tv episode \n",
"14757 playhouse the miracle worker tv episode \n",
"14758 private screenings robert mitchum and jane rus... \n",
"14759 providence all the king s men tv episode \n",
"14760 qi adam tv episode \n",
"\n",
" url imdbRating ratingCount \\\n",
"0 http://www.imdb.com/title/tt0012349/ 8.4 40550.0 \n",
"1 http://www.imdb.com/title/tt0015864/ 8.3 45319.0 \n",
"2 http://www.imdb.com/title/tt0017136/ 8.4 81007.0 \n",
"3 http://www.imdb.com/title/tt0017925/ 8.3 37521.0 \n",
"4 http://www.imdb.com/title/tt0021749/ 8.7 70057.0 \n",
"... ... ... ... \n",
"14756 http://www.imdb.com/title/tt0672488/ 7.7 135.0 \n",
"14757 http://www.imdb.com/title/tt0675644/ 7.3 8.0 \n",
"14758 http://www.imdb.com/title/tt0679222/ 7.0 20.0 \n",
"14759 http://www.imdb.com/title/tt0680064/ NaN NaN \n",
"14760 http://www.imdb.com/title/tt0681024/ 7.6 89.0 \n",
"\n",
" duration year type ... News RealityTV Romance SciFi \\\n",
"0 3240.0 1921.0 video.movie ... 0 0 0 0 \n",
"1 5700.0 1925.0 video.movie ... 0 0 0 0 \n",
"2 9180.0 1927.0 video.movie ... 0 0 0 1 \n",
"3 6420.0 1926.0 video.movie ... 0 0 0 0 \n",
"4 5220.0 1931.0 video.movie ... 0 0 1 0 \n",
"... ... ... ... ... ... ... ... ... \n",
"14756 1440.0 2005.0 video.episode ... 0 0 0 0 \n",
"14757 5400.0 1957.0 video.episode ... 0 0 0 0 \n",
"14758 3600.0 1996.0 video.episode ... 0 0 0 0 \n",
"14759 3600.0 2002.0 video.episode ... 0 0 0 0 \n",
"14760 1800.0 2003.0 video.episode ... 0 0 0 0 \n",
"\n",
" Short Sport TalkShow Thriller War Western \n",
"0 0 0 0 0 0 0 \n",
"1 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 0 \n",
"... ... ... ... ... ... ... \n",
"14756 0 0 0 0 0 0 \n",
"14757 0 0 0 0 0 0 \n",
"14758 0 0 1 0 0 0 \n",
"14759 0 0 0 0 0 0 \n",
"14760 0 0 0 0 0 0 \n",
"\n",
"[14761 rows x 44 columns]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from clevercsv import read_dataframe\n",
"\n",
"df = read_dataframe('./data/imdb.csv')\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"format": "text/markdown"
},
"source": [
"\n",
"Hooray! 🎉\n",
"\n",
"How does it work? CleverCSV searches the space of all possible dialects of a \n",
"file, and computes a *data consistency measure* that quantifies how much the \n",
"resulting table \"looks like real data\". The consistency measure combines \n",
"patterns of row lengths in the parsing result and the data type of the \n",
"resulting cells. This mimicks how a human would identify the dialect. If \n",
"you're wondering why this problem is hard, it's because every dialect will \n",
"give you *some* table, but not necessarily the correct one. More details can \n",
"be found [in the paper](https://rdcu.be/bLVur).\n",
"\n",
"## Other Examples\n",
"\n",
"We'll compare CleverCSV to the built-in Python CSV module and to Pandas and \n",
"show how these are not as robust as CleverCSV. Note that Pandas always uses \n",
"the comma as separator, unless it is forced to autodetect the dialect, in \n",
"which case it uses the Python Sniffer on the first line (we don't show that \n",
"here). These files are of course selected for this tutorial, because it \n",
"wouldn't be very interesting to show files where all methods are correct.\n",
"\n",
"Some files come from the [UK's open government data portal](data.gov.uk) (see \n",
"[the repo for \n",
"sources](https://github.com/alan-turing-institute/CleverCSVDemo/tree/master/data)), \n",
"whereas others come from MIT-licensed GitHub repositories (the URLs point \n",
"directly to the source files).\n",
"\n",
"We'll define some functions for easy comparisons.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"autoscroll": "auto",
"options": {
"caption": false,
"complete": true,
"display_data": true,
"display_stream": true,
"dpi": 200,
"echo": true,
"evaluate": true,
"f_env": null,
"f_pos": "htpb",
"f_size": [
6,
4
],
"f_spines": true,
"fig": true,
"include": true,
"name": null,
"option_string": "",
"results": "verbatim",
"term": false,
"wrap": "output"
}
},
"outputs": [],
"source": [
"import csv\n",
"import clevercsv\n",
"import io\n",
"import os\n",
"import requests\n",
"import pandas as pd\n",
"\n",
"from termcolor import colored\n",
"from IPython.display import display\n",
"\n",
"def page(url):\n",
" \"\"\" Get the content of a webpage using requests, assuming UTF-8 encoding \"\"\"\n",
" page = requests.get(url)\n",
" content = page.content.decode('utf-8')\n",
" return content\n",
"\n",
"def head(content, num=10):\n",
" \"\"\" Preview a CSV file \"\"\"\n",
" print('--- File Preview ---')\n",
" for i, line in enumerate(io.StringIO(content, newline=None)):\n",
" print(line, end='')\n",
" if i == num - 1:\n",
" break\n",
" print('\\n---')\n",
"\n",
"def sniff_url(content):\n",
" \"\"\" Utility to run the python Sniffer on a CSV file at a URL \"\"\"\n",
" try:\n",
" dialect = csv.Sniffer().sniff(content)\n",
" print(\"CSV Sniffer detected: delimiter = %r, quotechar = %r\" % (dialect.delimiter,\n",
" dialect.quotechar))\n",
" except csv.Error as err:\n",
" print(colored(\"No result from the Python CSV Sniffer\", \"red\"))\n",
" print(colored(\"Error was: %s\" % err, \"red\"))\n",
"\n",
"def detect_url(content, verbose=True):\n",
" \"\"\" Utility to run the CleverCSV detector on a CSV file at a URL \"\"\"\n",
" # We have designed CleverCSV to be a drop-in replacement for the CSV module\n",
" try:\n",
" dialect = clevercsv.Sniffer().sniff(content, verbose=verbose)\n",
" print(\"CleverCSV detected: delimiter = %r, quotechar = %r\" % (dialect.delimiter, \n",
" dialect.quotechar))\n",
" except clevercsv.Error:\n",
" print(colored(\"No result from CleverCSV\", \"red\"))\n",
"\n",
"def pandas_url(content):\n",
" \"\"\" Wrapper around pandas.read_csv(). \"\"\"\n",
" buf = io.StringIO(content)\n",
" print(\n",
" \"Pandas uses: delimiter = %r, quotechar = %r\"\n",
" % (',', '\"')\n",
" )\n",
" try:\n",
" df = pd.read_csv(buf)\n",
" display(df.head())\n",
" except pd.errors.ParserError:\n",
" print(colored(\"ParserError from pandas.\", \"red\"))\n",
"\n",
"\n",
"def compare(input_, verbose=False, n_preview=10):\n",
" if os.path.exists(input_):\n",
" enc = clevercsv.utils.get_encoding(input_)\n",
" content = open(input_, 'r', newline='', encoding=enc).read()\n",
" else:\n",
" content = page(input_)\n",
" head(content, num=n_preview)\n",
" print(\"\\n1. Running Python Sniffer\")\n",
" sniff_url(content)\n",
" print(\"\\n2. Running Pandas\")\n",
" pandas_url(content)\n",
" print(\"\\n3. Running CleverCSV\")\n",
" detect_url(content, verbose=verbose)"
]
},
{
"cell_type": "markdown",
"metadata": {
"format": "text/markdown"
},
"source": [
"\n",
"### Numbers with comma for decimal point\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"autoscroll": "auto",
"options": {
"caption": false,
"complete": true,
"display_data": true,
"display_stream": true,
"dpi": 200,
"echo": true,
"evaluate": true,
"f_env": null,
"f_pos": "htpb",
"f_size": [
6,
4
],
"f_spines": true,
"fig": true,
"include": true,
"name": null,
"option_string": "",
"results": "verbatim",
"term": false,
"wrap": "output"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- File Preview ---\n",
"Department Family,Entity,Payment Date,Expense Type,Expense Area,Supplier,Transaction No.,Amount\n",
"DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,\"43,774.58\"\n",
"DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,\"43,774.58\"\n",
"DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,\"7,660.55\"\n",
"DEPARTMENT OF HEALTH,AIREDALE NHS FOUNDATION TRUST,16/07/2010,COMPUTER SOFTWARE / LICENSE FEES,INFORMATION MANAGEMENT & TECHNOLOGY,ACCENTURE PACS,3003126885,\"7,660.55\"\n",
"\n",
"---\n",
"\n",
"1. Running Python Sniffer\n",
"CSV Sniffer detected: delimiter = '.', quotechar = '\"'\n",
"\n",
"2. Running Pandas\n",
"Pandas uses: delimiter = ',', quotechar = '\"'\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Department Family | \n",
" Entity | \n",
" Payment Date | \n",
" Expense Type | \n",
" Expense Area | \n",
" Supplier | \n",
" Transaction No. | \n",
" Amount | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" DEPARTMENT OF HEALTH | \n",
" AIREDALE NHS FOUNDATION TRUST | \n",
" 16/07/2010 | \n",
" COMPUTER SOFTWARE / LICENSE FEES | \n",
" INFORMATION MANAGEMENT & TECHNOLOGY | \n",
" ACCENTURE PACS | \n",
" 3003126885 | \n",
" 43,774.58 | \n",
"
\n",
" \n",
" 1 | \n",
" DEPARTMENT OF HEALTH | \n",
" AIREDALE NHS FOUNDATION TRUST | \n",
" 16/07/2010 | \n",
" COMPUTER SOFTWARE / LICENSE FEES | \n",
" INFORMATION MANAGEMENT & TECHNOLOGY | \n",
" ACCENTURE PACS | \n",
" 3003126885 | \n",
" 43,774.58 | \n",
"
\n",
" \n",
" 2 | \n",
" DEPARTMENT OF HEALTH | \n",
" AIREDALE NHS FOUNDATION TRUST | \n",
" 16/07/2010 | \n",
" COMPUTER SOFTWARE / LICENSE FEES | \n",
" INFORMATION MANAGEMENT & TECHNOLOGY | \n",
" ACCENTURE PACS | \n",
" 3003126885 | \n",
" 7,660.55 | \n",
"
\n",
" \n",
" 3 | \n",
" DEPARTMENT OF HEALTH | \n",
" AIREDALE NHS FOUNDATION TRUST | \n",
" 16/07/2010 | \n",
" COMPUTER SOFTWARE / LICENSE FEES | \n",
" INFORMATION MANAGEMENT & TECHNOLOGY | \n",
" ACCENTURE PACS | \n",
" 3003126885 | \n",
" 7,660.55 | \n",
"
\n",
" \n",
" 4 | \n",
" DEPARTMENT OF HEALTH | \n",
" AIREDALE NHS FOUNDATION TRUST | \n",
" 16/07/2010 | \n",
" COMPUTER SOFTWARE / LICENSE FEES | \n",
" INFORMATION MANAGEMENT & TECHNOLOGY | \n",
" ACCENTURE PACS | \n",
" 3003129243 | \n",
" 42,022.79 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Department Family Entity Payment Date \\\n",
"0 DEPARTMENT OF HEALTH AIREDALE NHS FOUNDATION TRUST 16/07/2010 \n",
"1 DEPARTMENT OF HEALTH AIREDALE NHS FOUNDATION TRUST 16/07/2010 \n",
"2 DEPARTMENT OF HEALTH AIREDALE NHS FOUNDATION TRUST 16/07/2010 \n",
"3 DEPARTMENT OF HEALTH AIREDALE NHS FOUNDATION TRUST 16/07/2010 \n",
"4 DEPARTMENT OF HEALTH AIREDALE NHS FOUNDATION TRUST 16/07/2010 \n",
"\n",
" Expense Type Expense Area \\\n",
"0 COMPUTER SOFTWARE / LICENSE FEES INFORMATION MANAGEMENT & TECHNOLOGY \n",
"1 COMPUTER SOFTWARE / LICENSE FEES INFORMATION MANAGEMENT & TECHNOLOGY \n",
"2 COMPUTER SOFTWARE / LICENSE FEES INFORMATION MANAGEMENT & TECHNOLOGY \n",
"3 COMPUTER SOFTWARE / LICENSE FEES INFORMATION MANAGEMENT & TECHNOLOGY \n",
"4 COMPUTER SOFTWARE / LICENSE FEES INFORMATION MANAGEMENT & TECHNOLOGY \n",
"\n",
" Supplier Transaction No. Amount \n",
"0 ACCENTURE PACS 3003126885 43,774.58 \n",
"1 ACCENTURE PACS 3003126885 43,774.58 \n",
"2 ACCENTURE PACS 3003126885 7,660.55 \n",
"3 ACCENTURE PACS 3003126885 7,660.55 \n",
"4 ACCENTURE PACS 3003129243 42,022.79 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"3. Running CleverCSV\n",
"CleverCSV detected: delimiter = ',', quotechar = '\"'\n"
]
}
],
"source": [
"compare('./data/airedale.csv', n_preview=5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"format": "text/markdown"
},
"source": [
"\n",
"You'll notice that Python Sniffer says ``.`` is the delimiter, Pandas is \n",
"correct because the file uses the default comma as separator, and CleverCSV \n",
"detects the dialect correctly as well.\n",
"\n",
"### Tab-separated\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"autoscroll": "auto",
"options": {
"caption": false,
"complete": true,
"display_data": true,
"display_stream": true,
"dpi": 200,
"echo": true,
"evaluate": true,
"f_env": null,
"f_pos": "htpb",
"f_size": [
6,
4
],
"f_spines": true,
"fig": true,
"include": true,
"name": null,
"option_string": "",
"results": "verbatim",
"term": false,
"wrap": "output"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- File Preview ---\n",
"UK Availability Disposals and Production of Milk and Milk Products Year\t\tAVAILABILITY of RAW MILK UK Milk Production million litres\tAVAILABILITY of RAW MILK Imports million litres\tAVAILABILITY of RAW MILK Total Available million litres\tDISPOSALS of RAW MILK For Manufacture For Liquid Milk (1) million litres\tDISPOSALS of RAW MILK For Manufacture Total million litres\tDISPOSALS of RAW MILK For Manufacture Condensed Milk (3) million litres\tDISPOSALS of RAW MILK For Manufacture Milk Powders (3)(4) million litres\tDISPOSALS of RAW MILK For Manufacture Butter million cream litres\tDISPOSALS of RAW MILK For Manufacture Cheese million litres\tDISPOSALS of RAW MILK For Manufacture million cream litres\tDISPOSALS of RAW MILK For ManufactureYoghurt million litres\tDISPOSALS of RAW MILK For Manufacture Other Products (5) million litres\tDISPOSALS of RAW MILK For Manufacture Exports million litres\tDISPOSALS of RAW MILK For Manufacture Stock change & wastage million litres\tWHOLESALE PRODUCTION (2) Liquid Milk (1) million litres\tWHOLESALE PRODUCTION (2) Condensed Milk (3) th. tonnes\tWHOLESALE PRODUCTION (2) Milk Powders (3) (4) th. tonnes\tWHOLESALE PRODUCTION (2) Butter (6) th. tonnes\tWHOLESALE PRODUCTION (2) Cheese (6) th. tonnes\tINTERVENTION STOCKS Skimmed Milk Powder th. tonnes\tINTERVENTION STOCKS Skimmed Milk Powder th. tonnes\n",
"1987\t\t14718.2\t\t14718.2\t6813.1\t7857\t698.5\t2872.9\t373.1\t2868.4\t169.5\t\t874.7\t36\t12.1\t6576.3\t180.3\t288.4\t182.3\t270.2\t\t\n",
"1988\t\t14398.8\t\t14398.8\t6858.3\t7480.6\t809.4\t2311.9\t301.8\t3178.4\t190.9\t\t688.3\t46.9\t13\t6605.9\t182.8\t240.1\t146.7\t299.9\t\t\n",
"1989\t\t14186\t\t14186\t6859.4\t7257.5\t794.8\t2313.5\t287.4\t3069.7\t226.4\t\t565.7\t57.7\t11.3\t6623\t207.3\t227.9\t139.5\t283.1\t\t\n",
"1990\t\t14465.7\t\t14465.7\t6892.4\t7488.9\t737.5\t2482.5\t308.8\t3289.1\t248.7\t\t422.3\t74.5\t9.9\t6654.8\t203.8\t236\t151.2\t315.1\t\t\n",
"\n",
"---\n",
"\n",
"1. Running Python Sniffer\n",
"CSV Sniffer detected: delimiter = ' ', quotechar = '\"'\n",
"\n",
"2. Running Pandas\n",
"Pandas uses: delimiter = ',', quotechar = '\"'\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" UK Availability Disposals and Production of Milk and Milk Products Year\\t\\tAVAILABILITY of RAW MILK UK Milk Production million litres\\tAVAILABILITY of RAW MILK Imports million litres\\tAVAILABILITY of RAW MILK Total Available million litres\\tDISPOSALS of RAW MILK For Manufacture For Liquid Milk (1) million litres\\tDISPOSALS of RAW MILK For Manufacture Total million litres\\tDISPOSALS of RAW MILK For Manufacture Condensed Milk (3) million litres\\tDISPOSALS of RAW MILK For Manufacture Milk Powders (3)(4) million litres\\tDISPOSALS of RAW MILK For Manufacture Butter million cream litres\\tDISPOSALS of RAW MILK For Manufacture Cheese million litres\\tDISPOSALS of RAW MILK For Manufacture million cream litres\\tDISPOSALS of RAW MILK For ManufactureYoghurt million litres\\tDISPOSALS of RAW MILK For Manufacture Other Products (5) million litres\\tDISPOSALS of RAW MILK For Manufacture Exports million litres\\tDISPOSALS of RAW MILK For Manufacture Stock change & wastage million litres\\tWHOLESALE PRODUCTION (2) Liquid Milk (1) million litres\\tWHOLESALE PRODUCTION (2) Condensed Milk (3) th. tonnes\\tWHOLESALE PRODUCTION (2) Milk Powders (3) (4) th. tonnes\\tWHOLESALE PRODUCTION (2) Butter (6) th. tonnes\\tWHOLESALE PRODUCTION (2) Cheese (6) th. tonnes\\tINTERVENTION STOCKS Skimmed Milk Powder th. tonnes\\tINTERVENTION STOCKS Skimmed Milk Powder th. tonnes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1987\\t\\t14718.2\\t\\t14718.2\\t6813.1\\t7857\\t698.... | \n",
"
\n",
" \n",
" 1 | \n",
" 1988\\t\\t14398.8\\t\\t14398.8\\t6858.3\\t7480.6\\t80... | \n",
"
\n",
" \n",
" 2 | \n",
" 1989\\t\\t14186\\t\\t14186\\t6859.4\\t7257.5\\t794.8\\... | \n",
"
\n",
" \n",
" 3 | \n",
" 1990\\t\\t14465.7\\t\\t14465.7\\t6892.4\\t7488.9\\t73... | \n",
"
\n",
" \n",
" 4 | \n",
" 1991\\t\\t13992\\t\\t13992\\t6892.8\\t7021.9\\t706.7\\... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" UK Availability Disposals and Production of Milk and Milk Products Year\\t\\tAVAILABILITY of RAW MILK UK Milk Production million litres\\tAVAILABILITY of RAW MILK Imports million litres\\tAVAILABILITY of RAW MILK Total Available million litres\\tDISPOSALS of RAW MILK For Manufacture For Liquid Milk (1) million litres\\tDISPOSALS of RAW MILK For Manufacture Total million litres\\tDISPOSALS of RAW MILK For Manufacture Condensed Milk (3) million litres\\tDISPOSALS of RAW MILK For Manufacture Milk Powders (3)(4) million litres\\tDISPOSALS of RAW MILK For Manufacture Butter million cream litres\\tDISPOSALS of RAW MILK For Manufacture Cheese million litres\\tDISPOSALS of RAW MILK For Manufacture million cream litres\\tDISPOSALS of RAW MILK For ManufactureYoghurt million litres\\tDISPOSALS of RAW MILK For Manufacture Other Products (5) million litres\\tDISPOSALS of RAW MILK For Manufacture Exports million litres\\tDISPOSALS of RAW MILK For Manufacture Stock change & wastage million litres\\tWHOLESALE PRODUCTION (2) Liquid Milk (1) million litres\\tWHOLESALE PRODUCTION (2) Condensed Milk (3) th. tonnes\\tWHOLESALE PRODUCTION (2) Milk Powders (3) (4) th. tonnes\\tWHOLESALE PRODUCTION (2) Butter (6) th. tonnes\\tWHOLESALE PRODUCTION (2) Cheese (6) th. tonnes\\tINTERVENTION STOCKS Skimmed Milk Powder th. tonnes\\tINTERVENTION STOCKS Skimmed Milk Powder th. tonnes\n",
"0 1987\\t\\t14718.2\\t\\t14718.2\\t6813.1\\t7857\\t698.... \n",
"1 1988\\t\\t14398.8\\t\\t14398.8\\t6858.3\\t7480.6\\t80... \n",
"2 1989\\t\\t14186\\t\\t14186\\t6859.4\\t7257.5\\t794.8\\... \n",
"3 1990\\t\\t14465.7\\t\\t14465.7\\t6892.4\\t7488.9\\t73... \n",
"4 1991\\t\\t13992\\t\\t13992\\t6892.8\\t7021.9\\t706.7\\... "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"3. Running CleverCSV\n",
"CleverCSV detected: delimiter = '\\t', quotechar = '\"'\n"
]
}
],
"source": [
"compare('./data/milk.csv', n_preview=5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"format": "text/markdown"
},
"source": [
"\n",
"Sniffer and Pandas are incorrect here, but CleverCSV gets it right.\n",
"\n",
"### File with comments\n",
"\n",
"The Python Sniffer gives no result for this file, and Pandas fails because it \n",
"checks for a rectangular table shape. Note that the text in the comments says \n",
"that the file uses ``|`` as separator, even though it actually uses ``,``!\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"autoscroll": "auto",
"options": {
"caption": false,
"complete": true,
"display_data": true,
"display_stream": true,
"dpi": 200,
"echo": true,
"evaluate": true,
"f_env": null,
"f_pos": "htpb",
"f_size": [
6,
4
],
"f_spines": true,
"fig": true,
"include": true,
"name": null,
"option_string": "",
"results": "verbatim",
"term": false,
"wrap": "output"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- File Preview ---\n",
"#Release 14.4 - par P.49d (lin)\n",
"#Copyright (c) 1995-2012 Xilinx, Inc. All rights reserved.\n",
"\n",
"#Thu Jun 20 07:23:42 2013\n",
"\n",
"#\n",
"## NOTE: This file is designed to be imported into a spreadsheet program\n",
"# such as Microsoft Excel for viewing, printing and sorting. The |\n",
"# character is used as the data field separator. This file is also designed\n",
"# to support parsing.\n",
"#\n",
"#INPUT FILE: project.ncd\n",
"#OUTPUT FILE: project_r_pad.csv\n",
"#PART TYPE: xc3s500e\n",
"#SPEED GRADE: -5\n",
"#PACKAGE: fg320\n",
"#\n",
"# Pinout by Pin Number:\n",
"# \n",
"# -----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,-----,\n",
"Pin Number,Signal Name,Pin Usage,Pin Name,Direction,IO Standard,IO Bank Number,Drive (mA),Slew Rate,Termination,IOB Delay,Voltage,Constraint,IO Register,Signal Integrity,\n",
"A1,,,GND,,,,,,,,,,,,\n",
"A2,,,TDI,,,,,,,,,,,,\n",
"A3,,IBUF,IP,UNUSED,,0,,,,,,,,,\n",
"A4,,DIFFM,IO_L24P_0,UNUSED,,0,,,,,,,,,\n",
"A5,,DIFFMI,IP_L22P_0,UNUSED,,0,,,,,,,,,\n",
"A6,,DIFFS,IO_L20N_0,UNUSED,,0,,,,,,,,,\n",
"A7,,IBUF,IP,UNUSED,,0,,,,,,,,,\n",
"A8,,IOB,IO,UNUSED,,0,,,,,,,,,\n",
"A9,,,VCCO_0,,,0,,,,,2.50,,,,\n",
"\n",
"---\n",
"\n",
"1. Running Python Sniffer\n",
"\u001b[31mNo result from the Python CSV Sniffer\u001b[0m\n",
"\u001b[31mError was: Could not determine delimiter\u001b[0m\n",
"\n",
"2. Running Pandas\n",
"Pandas uses: delimiter = ',', quotechar = '\"'\n",
"\u001b[31mParserError from pandas.\u001b[0m\n",
"\n",
"3. Running CleverCSV\n",
"CleverCSV detected: delimiter = ',', quotechar = ''\n"
]
}
],
"source": [
"compare(\"https://raw.githubusercontent.com/queq/just-stuff/c1b8714664cc674e1fc685bd957eac548d636a43/pov/TopFixed/build/project_r_pad.csv\", n_preview=30)"
]
},
{
"cell_type": "markdown",
"metadata": {
"format": "text/markdown"
},
"source": [
"\n",
"### Semi-colon separated\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"autoscroll": "auto",
"options": {
"caption": false,
"complete": true,
"display_data": true,
"display_stream": true,
"dpi": 200,
"echo": true,
"evaluate": true,
"f_env": null,
"f_pos": "htpb",
"f_size": [
6,
4
],
"f_spines": true,
"fig": true,
"include": true,
"name": null,
"option_string": "",
"results": "verbatim",
"term": false,
"wrap": "output"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- File Preview ---\n",
"log(E);de_log(E);dede_log(E);cepstrum;de_cepstrum;dede_cepstrum\n",
"0.27;1.77;4.97;0.1;0.61;1.75;\n",
"1.75;1.25;1.00;1.25;0.50;0.25;\n",
"---\n",
"\n",
"1. Running Python Sniffer\n",
"\u001b[31mNo result from the Python CSV Sniffer\u001b[0m\n",
"\u001b[31mError was: Could not determine delimiter\u001b[0m\n",
"\n",
"2. Running Pandas\n",
"Pandas uses: delimiter = ',', quotechar = '\"'\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" log(E);de_log(E);dede_log(E);cepstrum;de_cepstrum;dede_cepstrum | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.27;1.77;4.97;0.1;0.61;1.75; | \n",
"
\n",
" \n",
" 1 | \n",
" 1.75;1.25;1.00;1.25;0.50;0.25; | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" log(E);de_log(E);dede_log(E);cepstrum;de_cepstrum;dede_cepstrum\n",
"0 0.27;1.77;4.97;0.1;0.61;1.75; \n",
"1 1.75;1.25;1.00;1.25;0.50;0.25; "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"3. Running CleverCSV\n",
"CleverCSV detected: delimiter = ';', quotechar = ''\n"
]
}
],
"source": [
"compare(\"https://raw.githubusercontent.com/grezesf/Research/17b1e829d1d4b8954661270bd8b099e74bb45ce7/Reservoirs/Task0_Replication/code/preprocessing/factors.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"format": "text/markdown"
},
"source": [
"\n",
"Sniffer fails outright, Pandas is incorrect because it assumes comma.\n",
"\n",
"### File with multiple tables\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"autoscroll": "auto",
"options": {
"caption": false,
"complete": true,
"display_data": true,
"display_stream": true,
"dpi": 200,
"echo": true,
"evaluate": true,
"f_env": null,
"f_pos": "htpb",
"f_size": [
6,
4
],
"f_spines": true,
"fig": true,
"include": true,
"name": null,
"option_string": "",
"results": "verbatim",
"term": false,
"wrap": "output"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- File Preview ---\n",
"2013/12/12 23:20 UPDATE\n",
"ピーク時供給力(万kW),時間帯,供給力情報更新日,供給力情報更新時刻\n",
"4849,17:00〜18:00,12/12,8:30\n",
"\n",
"予想最大電力(万kW),時間帯,予想最大電力情報更新日,予想最大電力情報更新時刻\n",
"4210,17:00〜18:00,12/12,8:30\n",
"\n",
"DATE,TIME,当日実績(万kW),予測値(万kW)\n",
"2013/12/12,0:00,3098,0\n",
"2013/12/12,1:00,2948,0\n",
"\n",
"---\n",
"\n",
"1. Running Python Sniffer\n",
"CSV Sniffer detected: delimiter = '\\r', quotechar = '\"'\n",
"\n",
"2. Running Pandas\n",
"Pandas uses: delimiter = ',', quotechar = '\"'\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" | \n",
" 2013/12/12 23:20 UPDATE | \n",
"
\n",
" \n",
" \n",
" \n",
" ピーク時供給力(万kW) | \n",
" 時間帯 | \n",
" 供給力情報更新日 | \n",
" 供給力情報更新時刻 | \n",
"
\n",
" \n",
" 4849 | \n",
" 17:00〜18:00 | \n",
" 12/12 | \n",
" 8:30 | \n",
"
\n",
" \n",
" 予想最大電力(万kW) | \n",
" 時間帯 | \n",
" 予想最大電力情報更新日 | \n",
" 予想最大電力情報更新時刻 | \n",
"
\n",
" \n",
" 4210 | \n",
" 17:00〜18:00 | \n",
" 12/12 | \n",
" 8:30 | \n",
"
\n",
" \n",
" DATE | \n",
" TIME | \n",
" 当日実績(万kW) | \n",
" 予測値(万kW) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 2013/12/12 23:20 UPDATE\n",
"ピーク時供給力(万kW) 時間帯 供給力情報更新日 供給力情報更新時刻\n",
"4849 17:00〜18:00 12/12 8:30\n",
"予想最大電力(万kW) 時間帯 予想最大電力情報更新日 予想最大電力情報更新時刻\n",
"4210 17:00〜18:00 12/12 8:30\n",
"DATE TIME 当日実績(万kW) 予測値(万kW)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"3. Running CleverCSV\n",
"CleverCSV detected: delimiter = ',', quotechar = ''\n"
]
}
],
"source": [
"compare(\"https://raw.githubusercontent.com/HAYASAKA-Ryosuke/TodenGraphDay/8f052219d037edabebd488e5f6dc2ddbe8367dc1/juyo-j.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"format": "text/markdown"
},
"source": [
"\n",
"Sniffer says ``\\r`` (carriage return) is the delimiter!\n",
"\n",
"## Conclusion\n",
"\n",
"We hope you find CleverCSV useful! The package is still in beta, so if you \n",
"encounter any issues or files where CleverCSV fails, please leave a comment on \n",
"GitHub!\n"
]
}
],
"metadata": {
"kernel_info": {
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}