{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": null, "outputs": [], "source": [ "#r \"nuget: FSharp.Data,6.4.0\"\n", "\n", "Formatter.SetPreferredMimeTypesFor(typeof\u003cobj\u003e, \"text/plain\")\n", "Formatter.Register(fun (x: obj) (writer: TextWriter) -\u003e fprintfn writer \"%120A\" x)\n", "#endif\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](../img/badge-binder.svg)](https://mybinder.org/v2/gh/fsprojects/FSharp.Data/gh-pages?filepath=library/CsvProvider.ipynb)\u0026emsp;\n", "[![Script](../img/badge-script.svg)](https://fsprojects.github.io/FSharp.Data//library/CsvProvider.fsx)\u0026emsp;\n", "[![Notebook](../img/badge-notebook.svg)](https://fsprojects.github.io/FSharp.Data//library/CsvProvider.ipynb)\n", "\n", "# CSV Type Provider\n", "\n", "This article demonstrates how to use the CSV type provider to read CSV files\n", "in a statically typed way.\n", "\n", "The CSV type provider takes a sample CSV as input and generates a type based on the data\n", "present on the columns of that sample. The column names are obtained from the first\n", "(header) row, and the types are inferred from the values present on the subsequent rows.\n", "\n", "## Introducing the provider\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": null, "outputs": [], "source": [ "\u003cdiv class=\"container-fluid\" style=\"margin:15px 0px 15px 0px;\"\u003e\n", " \u003cdiv class=\"row-fluid\"\u003e\n", " \u003cdiv class=\"span1\"\u003e\u003c/div\u003e\n", " \u003cdiv class=\"span10\" id=\"anim-holder\"\u003e\n", " \u003ca id=\"lnk\" href=\"../images/csv.gif\"\u003e\u003cimg id=\"anim\" src=\"../images/csv.gif\" /\u003e\u003c/a\u003e\n", " \u003c/div\u003e\n", " \u003cdiv class=\"span1\"\u003e\u003c/div\u003e\n", " \u003c/div\u003e\n", "\u003c/div\u003e\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "The type provider is located in the `FSharp.Data.dll` assembly. Assuming the package is referenged\n", "we can access its namespace as follows:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 2, "outputs": [], "source": [ "open FSharp.Data\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "### Parsing stock prices\n", "\n", "The Yahoo Finance web site provides daily stock prices in a CSV format that has the\n", "following structure (you can find a larger example in the [`data/MSFT.csv`](../data/MSFT.csv) file):\n", "\n", " [lang=text]\n", " Date,Open,High,Low,Close,Volume,Adj Close\n", " 2012-01-27,29.45,29.53,29.17,29.23,44187700,29.23\n", " 2012-01-26,29.61,29.70,29.40,29.50,49102800,29.50\n", " 2012-01-25,29.07,29.65,29.07,29.56,59231700,29.56\n", " 2012-01-24,29.47,29.57,29.18,29.34,51703300,29.34\n", "\n", "As usual with CSV files, the first row contains the headers (names of individual columns)\n", "and the next rows define the data. We can pass reference to the file to [CsvProvider](https://fsprojects.github.io/FSharp.Data/reference/fsharp-data-csvprovider.html) to\n", "get a strongly typed view of the file:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 3, "outputs": [], "source": [ "[\u003cLiteral\u003e]\n", "let ResolutionFolder = __SOURCE_DIRECTORY__\n", "\n", "type Stocks = CsvProvider\u003c\"../data/MSFT.csv\", ResolutionFolder=ResolutionFolder\u003e\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "The generated type provides two static methods for loading data. The `Parse` method can be\n", "used if we have the data in a `string` value. The `Load` method allows reading the data from\n", "a file or from a web resource (and there\u0027s also an asynchronous `AsyncLoad` version). We could also\n", "have used a web URL instead of a local file in the sample parameter of the type provider.\n", "The following sample calls the `Load` method with an URL that points to a live CSV file on the Yahoo finance web site:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 4, "outputs": [ { "data": { "text/plain": ["HLOC: (76.55M, 75.86M, 75.97M, 76.29M)", "", "HLOC: (76.03M, 75.54M, 75.67M, 76.00M)", "", "HLOC: (76.12M, 74.96M, 75.22M, 75.97M)", "", "HLOC: (74.72M, 73.71M, 74.09M, 74.69M)", "", "HLOC: (74.88M, 74.20M, 74.67M, 74.26M)", "", "HLOC: (75.01M, 74.30M, 74.71M, 74.61M)", "", "HLOC: (74.54M, 73.88M, 73.94M, 74.49M)", "", "HLOC: (73.97M, 73.31M, 73.54M, 73.87M)", "", "HLOC: (74.17M, 73.17M, 73.55M, 73.85M)", "", "HLOC: (73.81M, 72.99M, 73.67M, 73.26M)", "", "val msft: Runtime.CsvFile\u003cCsvProvider\u003c...\u003e.Row\u003e", "", "val firstRow: CsvProvider\u003c...\u003e.Row =", "", " (10/9/2017 12:00:00 AM, 75.97M, 76.55M, 75.86M, 76.29M, 11386502)", "", "val lastDate: System.DateTime = 10/9/2017 12:00:00 AM", "", "val lastOpen: decimal = 75.97M", "", "val it: unit = ()"] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }], "source": [ "// Download the stock prices\n", "let msft =\n", " Stocks\n", " .Load(__SOURCE_DIRECTORY__ + \"/../data/MSFT.csv\")\n", " .Cache()\n", "\n", "// Look at the most recent row. Note the \u0027Date\u0027 property\n", "// is of type \u0027DateTime\u0027 and \u0027Open\u0027 has a type \u0027decimal\u0027\n", "let firstRow = msft.Rows |\u003e Seq.head\n", "let lastDate = firstRow.Date\n", "let lastOpen = firstRow.Open\n", "\n", "// Print the first 10 prices in the HLOC format\n", "for row in msft.Rows |\u003e Seq.truncate 10 do\n", " printfn \"HLOC: (%A, %A, %A, %A)\" row.High row.Low row.Open row.Close\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "The generated type has a property `Rows` that returns the data from the CSV file as a\n", "collection of rows. We iterate over the rows using a `for` loop. As you can see the\n", "(generated) type for rows has properties such as `High`, `Low` and `Close` that correspond\n", "to the columns in the CSV file.\n", "\n", "As you can see, the type provider also infers types of individual rows. The `Date`\n", "property is inferred to be a `DateTime` (because the values in the sample file can all\n", "be parsed as dates) while HLOC prices are inferred as `decimal`.\n", "\n", "## Using units of measure\n", "\n", "The CSV type provider supports F# units of measure: if the header includes the name or symbol of one of the standard SI units, then the generated type\n", "returns values annotated with the appropriate unit.\n", "\n", "In this section, we use a simple file [`data/SmallTest.csv`](../data/SmallTest.csv) which\n", "looks as follows:\n", "\n", " [lang=text]\n", " Name, Distance (metre), Time (s)\n", " First, 50.0, 3.7\n", "\n", "As you can see, the second and third columns are annotated with `metre` and `s`,\n", "respectively. To use units of measure in our code, we need to open the namespace with\n", "standard unit names. Then we pass the `SmallTest.csv` file to the type provider as\n", "a static argument. Also note that in this case we\u0027re using the same data at runtime,\n", "so we use the `GetSample` method instead of calling `Load` and passing the same parameter again.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 5, "outputs": [ { "data": { "text/plain": ["val small: CsvProvider\u003c...\u003e"] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }], "source": [ "let small =\n", " CsvProvider\u003c\"../data/SmallTest.csv\", ResolutionFolder=ResolutionFolder\u003e.GetSample ()\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use the default constructor instead of the `GetSample` static method:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 6, "outputs": [ { "data": { "text/plain": ["val small2: CsvProvider\u003c...\u003e"] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }], "source": [ "let small2 =\n", " new CsvProvider\u003c\"../data/SmallTest.csv\", ResolutionFolder=ResolutionFolder\u003e()\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "but the VisualStudio IntelliSense for the type provider parameters doesn\u0027t work when we use a default\n", "constructor for a type provider, so we\u0027ll keep using `GetSample` instead.\n", "\n", "As in the previous example, the `small` value exposes the rows using the `Rows` property.\n", "The generated properties `Distance` and `Time` are now annotated with units. Look at the\n", "following simple calculation:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 7, "outputs": [ { "data": { "text/plain": ["Second (19.230769230769230769230769231M m/s)", "", "Third (23.4375M m/s)", "", "val it: unit = ()"] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }], "source": [ "open FSharp.Data.UnitSystems.SI.UnitNames\n", "\n", "for row in small.Rows do\n", " let speed = row.Distance / row.Time\n", "\n", " if speed \u003e 15.0M\u003cmetre/second\u003e then\n", " printfn \"%s (%A m/s)\" row.Name speed\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "The numerical values of `Distance` and `Time` are both inferred as `decimal` (because they\n", "are small enough). Thus the type of `speed` becomes `decimal\u003cmetre/second\u003e`. The compiler\n", "can then statically check that we\u0027re not comparing incompatible values - e.g. number in\n", "meters per second against a value in kilometres per hour.\n", "\n", "## Custom separators and tab-separated files\n", "\n", "By default, the CSV type provider uses comma (`,`) as a separator. However, CSV\n", "files sometime use a different separator character than `,`. In some European\n", "countries, `,` is already used as the numeric decimal separator, so a semicolon (`;`) is used\n", "instead to separate CSV columns. The `CsvProvider` has an optional `Separators` static parameter\n", "where you can specify what to use as separator. This means that you can consume\n", "any textual tabular format. Here is an example using `;` as a separator:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 8, "outputs": [ { "data": { "text/plain": ["type AirQuality = CsvProvider\u003c...\u003e", "", "val airQuality: AirQuality", "", "val it: unit = ()"] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }], "source": [ "type AirQuality = CsvProvider\u003c\"../data/AirQuality.csv\", \";\", ResolutionFolder=ResolutionFolder\u003e\n", "\n", "let airQuality = new AirQuality()\n", "\n", "for row in airQuality.Rows |\u003e Seq.truncate 10 do\n", " if row.Month \u003e 6 then\n", " printfn \"Temp: %i Ozone: %f \" row.Temp row.Ozone\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "The air quality dataset ([`data/AirQuality.csv`](../data/AirQuality.csv)) is used in many\n", "samples for the Statistical Computing language R. A short description of the dataset can be found\n", "[in the R language manual](http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html).\n", "\n", "If you are parsing a tab-separated file that uses `\\t` as the separator, you can also\n", "specify the separator explicitly. However, if you\u0027re using an url or file that has\n", "the `.tsv` extension, the type provider will use `\\t` by default. In the following example,\n", "we also set `IgnoreErrors` static parameter to `true` so that lines with incorrect number of elements\n", "are automatically skipped (the sample file ([`data/MortalityNY.csv`](../data/MortalityNY.tsv)) contains additional unstructured data at the end):\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 9, "outputs": [ { "data": { "text/plain": ["CAUSE: Pedal cyclist injured in collision with car, pick-up truck or van, driver injured in traffic accident", "", "Albany County, NY (1 cases)", "", "Bronx County, NY (1 cases)", "", "Broome County, NY (1 cases)", "", "Cayuga County, NY (1 cases)", "", "Chemung County, NY (1 cases)", "", "Dutchess County, NY (1 cases)", "", "Kings County, NY (3 cases)", "", "Monroe County, NY (1 cases)", "", "Nassau County, NY (8 cases)", "", "New York County, NY (1 cases)", "", "Niagara County, NY (2 cases)", "", "Oneida County, NY (2 cases)", "", "Onondaga County, NY (2 cases)", "", "Orange County, NY (2 cases)", "", "Oswego County, NY (2 cases)", "", "Queens County, NY (1 cases)", "", "Rensselaer County, NY (2 cases)", "", "Saratoga County, NY (2 cases)", "", "Schenectady County, NY (1 cases)", "", "Seneca County, NY (1 cases)", "", "Steuben County, NY (1 cases)", "", "Suffolk County, NY (9 cases)", "", "Sullivan County, NY (1 cases)", "", "Ulster County, NY (1 cases)", "", "Westchester County, NY (3 cases)", "", "val mortalityNy: CsvProvider\u003c...\u003e", "", "val cause: CsvProvider\u003c...\u003e.Row =", "", " (\"\", \"Albany County, NY\", 36001,", "", " \"Pedal cyclist injured in collision with car, pick-up truck or\"+[40 chars],", "", " \"V13.4\", 1, 2072701, \"0.0 (Unreliable)\")", "", "val it: unit = ()"] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }], "source": [ "let mortalityNy =\n", " CsvProvider\u003c\"../data/MortalityNY.tsv\", IgnoreErrors=true, ResolutionFolder=ResolutionFolder\u003e.GetSample ()\n", "\n", "// Find the name of a cause based on code\n", "// (Pedal cyclist injured in an accident)\n", "let cause =\n", " mortalityNy.Rows\n", " |\u003e Seq.find (fun r -\u003e r.``Cause of death Code`` = \"V13.4\")\n", "\n", "// Print the number of injured cyclists\n", "printfn \"CAUSE: %s\" cause.``Cause of death``\n", "\n", "for r in mortalityNy.Rows do\n", " if r.``Cause of death Code`` = \"V13.4\" then\n", " printfn \"%s (%d cases)\" r.County r.Count\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, note that it is also possible to specify multiple different separators\n", "for the `CsvProvider`. This might be useful if a file is irregular and contains\n", "rows separated by either semicolon or a colon. You can use:\n", "`CsvProvider\u003c\"../data/AirQuality.csv\", Separators=\";,\", ResolutionFolder=ResolutionFolder\u003e`.\n", "\n", "## Missing values\n", "\n", "It is quite common in statistical datasets for some values to be missing. If\n", "you open the [`data/AirQuality.csv`](../data/AirQuality.csv) file you will see\n", "that some values for the ozone observations are marked `#N/A`. Such values are\n", "parsed as float and will be marked with `Double.NaN` in F#. The values\n", "`NaN`, `NA`, `N/A`, `#N/A`, `:`, `-`, `TBA`, and `TBD`\n", "are recognized as missing values by default, but you can customize it by specifying\n", "the `MissingValues` static parameter of `CsvProvider` as a comma-separated string.\n", "For example, to ignore `this` and `that` we could do:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 10, "outputs": [ { "data": { "text/plain": ["val it: CsvProvider\u003c...\u003e.Row seq = seq [(nan, nan, 1.0M)]"] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }], "source": [ "CsvProvider\u003c\"X,Y,Z\\nthis,that,1.0\", MissingValues=\"this,that\"\u003e\n", " .GetSample()\n", " .Rows\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "The following snippet calculates the mean of the ozone observations\n", "excluding the `Double.NaN` values. We first obtain the `Ozone` property for\n", "each row, then remove missing values and then use the standard `Seq.average` function:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 11, "outputs": [ { "data": { "text/plain": ["val mean: float = 42.12931034"] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }], "source": [ "open System\n", "\n", "let mean =\n", " airQuality.Rows\n", " |\u003e Seq.toArray\n", " |\u003e Array.map (fun row -\u003e row.Ozone)\n", " |\u003e Array.filter (fun elem -\u003e not (Double.IsNaN elem))\n", " |\u003e Array.average\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "If the sample doesn\u0027t have missing values on all columns, but at runtime missing values could\n", "appear anywhere, you can set the static parameter `AssumeMissingValues` to `true` in order to force `CsvProvider`\n", "to assume missing values can occur in any column.\n", "\n", "## Controlling the column types\n", "\n", "By default, the CSV type provider checks the first 1000 rows to infer the types, but you can customize\n", "it by specifying the `InferRows` static parameter of `CsvProvider`. If you specify 0 the entire file will be used.\n", "\n", "Columns with only `0`, `1`, `Yes`, `No`, `True`, or `False` will be set to `bool`. Columns with numerical values\n", "will be set to either `int`, `int64`, `decimal`, or `float`, in that order of preference.\n", "\n", "If a value is missing in any row, by default the CSV type provider will infer a nullable (for `int` and `int64`) or an optional\n", "(for `bool`, `DateTime` and `Guid`). When a `decimal` would be inferred but there are missing values, we will infer a\n", "`float` instead, and use `Double.NaN` to represent those missing values. The `string` type is already inherently nullable,\n", "so by default we won\u0027t generate a `string option`. If you prefer to use optionals in all cases, you can set the static parameter\n", "`PreferOptionals` to `true`. In that case you\u0027ll never get an empty string or a `Double.NaN` and will always get a `None` instead.\n", "\n", "If you have other preferences, e.g. if you want a column to be a `float` instead of a `decimal`,\n", "you can override the default behaviour by specifying the types in the header column between braces, similar to what can be done to\n", "specify the units of measure. This will override both `AssumeMissingValues` and `PreferOptionals`. The valid types are:\n", "\n", "* `int`\n", "\n", "* `int?`\n", "\n", "* `int option`\n", "\n", "* `int64`\n", "\n", "* `int64?`\n", "\n", "* `int64 option`\n", "\n", "* `bool`\n", "\n", "* `bool?`\n", "\n", "* `bool option`\n", "\n", "* `float`\n", "\n", "* `float?`\n", "\n", "* `float option`\n", "\n", "* `decimal`\n", "\n", "* `decimal?`\n", "\n", "* `decimal option`\n", "\n", "* `date`\n", "\n", "* `date?`\n", "\n", "* `date option`\n", "\n", "* `datetimeoffset`\n", "\n", "* `datetimeoffset?`\n", "\n", "* `datetimeoffset option`\n", "\n", "* `guid`\n", "\n", "* `guid?`\n", "\n", "* `guid option`\n", "\n", "* `string`\n", "\n", "* `string option`.\n", "\n", "You can also specify both the type and a unit (e.g `float\u003cmetre\u003e`). Example:\n", "\n", " [lang=text]\n", " Name, Distance (decimal?\u003cmetre\u003e), Time (float)\n", " First, 50, 3\n", "\n", "Additionally, you can also specify some or all the types in the `Schema` static parameter of `CsvProvider`. Valid formats are:\n", "\n", "* `Type`\n", "\n", "* `Type\u003cMeasure\u003e`\n", "\n", "* `Name (Type)`\n", "\n", "* `Name (Type\u003cMeasure\u003e)`\n", "\n", "What\u0027s specified in the `Schema` static parameter will always take precedence to what\u0027s specified in the column headers.\n", "\n", "If the first row of the file is not a header row, you can specify the `HasHeaders` static parameter to `false` in order to\n", "consider that row as a data row. In that case, the columns will be named `Column1`, `Column2`, etc..., unless the\n", "names are overridden using the `Schema` parameter. Note that you can override only the name in the `Schema` parameter\n", "and still have the provider infer the type for you. Example:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 12, "outputs": [ { "data": { "text/plain": ["1.000000 2 3.000000", "", "type OneTwoThree = CsvProvider\u003c...\u003e", "", "val csv: CsvProvider\u003c...\u003e", "", "val it: unit = ()"] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }], "source": [ "type OneTwoThree = CsvProvider\u003c\"1,2,3\", HasHeaders=false, Schema=\"Duration (float\u003csecond\u003e),foo,float option\"\u003e\n", "\n", "let csv = OneTwoThree.GetSample()\n", "\n", "for row in csv.Rows do\n", " printfn \"%f %d %f\" (row.Duration / 1.0\u003csecond\u003e) row.Foo (defaultArg row.Column3 1.0)\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "You don\u0027t need to override all the columns, you can skip the ones to leave as default.\n", "For example, in the titanic training dataset from Kaggle ([`data/Titanic.csv`](../data/Titanic.csv)),\n", "if you want to rename the 3rd column (the `PClass` column) to `Passenger Class` and override the\n", "6th column (the `Fare` column) to be a `float` instead of a `decimal`, you can define only that, and leave\n", "the other columns blank in the schema (you also don\u0027t need to add all the trailing commas).\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 13, "outputs": [ { "data": { "text/plain": ["Braund, Mr. Owen Harris Class = 3 Fare = 7.25", "", "Cumings, Mrs. John Bradley (Florence Briggs Thayer) Class = 1 Fare = 71.2833", "", "Heikkinen, Miss. Laina Class = 3 Fare = 7.925", "", "Futrelle, Mrs. Jacques Heath (Lily May Peel) Class = 1 Fare = 53.1", "", "Allen, Mr. William Henry Class = 3 Fare = 8.05", "", "Moran, Mr. James Class = 3 Fare = 8.4583", "", "McCarthy, Mr. Timothy J Class = 1 Fare = 51.8625", "", "Palsson, Master. Gosta Leonard Class = 3 Fare = 21.075", "", "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) Class = 3 Fare = 11.1333", "", "Nasser, Mrs. Nicholas (Adele Achem) Class = 2 Fare = 30.0708", "", "type Titanic1 = CsvProvider\u003c...\u003e", "", "val titanic1: CsvProvider\u003c...\u003e", "", "val it: unit = ()"] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }], "source": [ "type Titanic1 =\n", " CsvProvider\u003c\"../data/Titanic.csv\", Schema=\",,Passenger Class,,,float\", ResolutionFolder=ResolutionFolder\u003e\n", "\n", "let titanic1 = Titanic1.GetSample()\n", "\n", "for row in titanic1.Rows |\u003e Seq.truncate 10 do\n", " printfn \"%s Class = %d Fare = %g\" row.Name row.``Passenger Class`` row.Fare\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can rename and override the type of any column by name instead of by position:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 14, "outputs": [ { "data": { "text/plain": ["Braund, Mr. Owen Harris Class = 3 Fare = 7.25", "", "Cumings, Mrs. John Bradley (Florence Briggs Thayer) Class = 1 Fare = 71.2833", "", "Heikkinen, Miss. Laina Class = 3 Fare = 7.925", "", "Futrelle, Mrs. Jacques Heath (Lily May Peel) Class = 1 Fare = 53.1", "", "Allen, Mr. William Henry Class = 3 Fare = 8.05", "", "Moran, Mr. James Class = 3 Fare = 8.4583", "", "McCarthy, Mr. Timothy J Class = 1 Fare = 51.8625", "", "Palsson, Master. Gosta Leonard Class = 3 Fare = 21.075", "", "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) Class = 3 Fare = 11.1333", "", "Nasser, Mrs. Nicholas (Adele Achem) Class = 2 Fare = 30.0708", "", "type Titanic2 = CsvProvider\u003c...\u003e", "", "val titanic2: CsvProvider\u003c...\u003e", "", "val it: unit = ()"] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }], "source": [ "type Titanic2 =\n", " CsvProvider\u003c\"../data/Titanic.csv\", Schema=\"Fare=float,PClass-\u003ePassenger Class\", ResolutionFolder=ResolutionFolder\u003e\n", "\n", "let titanic2 = Titanic2.GetSample()\n", "\n", "for row in titanic2.Rows |\u003e Seq.truncate 10 do\n", " printfn \"%s Class = %d Fare = %g\" row.Name row.``Passenger Class`` row.Fare\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "You can even mix and match the two syntaxes like this `Schema=\"int64,DidSurvive,PClass-\u003ePassenger Class=string\"`\n", "\n", "## Transforming CSV files\n", "\n", "In addition to reading, `CsvProvider` also has support for transforming the row collection of CSV files. The operations\n", "available are `Filter`, `Take`, `TakeWhile`, `Skip`, `SkipWhile`, and `Truncate`. All these operations\n", "preserve the schema, so after transforming you can save the results by using one of the overloads of\n", "the `Save` method. You can also use the `SaveToString()` to get the output directly as a string.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 15, "outputs": [ { "data": { "text/plain": ["val it: string =", "", " \"Ozone;Solar.R;Wind;Temp;Month;Day", "", "41;190;7.4;67;5;1", "", "36;118;8;72;5;2", "", "12;149;12.6;74;5;3", "", "18;313;11.5;62;5;4", "", "23;299;8.6;65;5;7", "", "19;99;13.8;59;5;8", "", "8;19;20.1;61;5;9", "", "16;256;9.7;69;5;12", "", "11;290;9.2;66;5;13", "", "14;274;10.9;68;5;14", "", "\""] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }], "source": [ "// Saving the first 10 rows that don\u0027t have missing values to a new csv file\n", "airQuality\n", " .Filter(fun row -\u003e\n", " not (Double.IsNaN row.Ozone)\n", " \u0026\u0026 not (Double.IsNaN row.``Solar.R``))\n", " .Truncate(10)\n", " .SaveToString()\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "It\u0027s also possible to transform the columns themselves by using `Map` and the constructor for the `Row` type.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 16, "outputs": [ { "data": { "text/plain": ["val doubleOzone: Runtime.CsvFile\u003cCsvProvider\u003c...\u003e.Row\u003e"] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }], "source": [ "let doubleOzone =\n", " airQuality.Map(fun row -\u003e AirQuality.Row(row.Ozone * 2.0, row.``Solar.R``, row.Wind, row.Temp, row.Month, row.Day))\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "You can also append new rows, either by creating them directly as in the previous example, or by parsing them from a string.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 17, "outputs": [ { "data": { "text/plain": ["val newRows: CsvProvider\u003c...\u003e.Row array =", "", " [|(41.0, 190.0, 7.4M, 67, 5, 1); (36.0, 118.0, 8M, 72, 5, 2)|]", "", "val airQualityWithExtraRows: Runtime.CsvFile\u003cCsvProvider\u003c...\u003e.Row\u003e"] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }], "source": [ "let newRows =\n", " AirQuality.ParseRows(\n", " \"\"\"41;190;7.4;67;5;1\n", " 36;118;8;72;5;2\"\"\"\n", " )\n", "\n", "let airQualityWithExtraRows = airQuality.Append newRows\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "It\u0027s even possible to create csv files without parsing at all:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 18, "outputs": [ { "data": { "text/plain": ["type MyCsvType = CsvProvider\u003c...\u003e", "", "val myRows: CsvProvider\u003c...\u003e.Row list =", "", " [(1, \"a\", None); (2, \"B\", Some 4/21/2024 6:02:52 PM)]", "", "val myCsv: MyCsvType", "", "val it: string = \"1,a,", "", "2,B,2024-04-21T18:02:52.9245901+00:00", "", "\""] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }], "source": [ "type MyCsvType = CsvProvider\u003cSchema=\"A (int), B (string), C (date option)\", HasHeaders=false\u003e\n", "\n", "let myRows =\n", " [ MyCsvType.Row(1, \"a\", None)\n", " MyCsvType.Row(2, \"B\", Some System.DateTime.Now) ]\n", "\n", "let myCsv = new MyCsvType(myRows)\n", "myCsv.SaveToString()\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling big datasets\n", "\n", "By default, the rows are cached so you can iterate over the `Rows` property multiple times without worrying.\n", "But if you will only iterate once, you can disable caching by setting the `CacheRows` static parameter of `CsvProvider`\n", "to `false`. If the number of rows is very big, you have to do this otherwise you may exhaust the memory.\n", "You can still cache the data at some point by using the `Cache` method, but only do that if you have already\n", "transformed the dataset to be smaller.\n", "\n", "## Related articles\n", "\n", "* [Using JSON provider in a library](JsonProvider.html#jsonlib) also applies to CSV type provider\n", "\n", "* [CSV Parser](CsvFile.html) - provides more information about\n", "working with CSV documents dynamically.\n", "\n", "* [API Reference: CsvProvider type provider](../reference/fsharp-data-csvprovider.html)\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": ".NET (F#)", "language": "F#", "name": ".net-fsharp" }, "language_info": { "file_extension": ".fs", "mimetype": "text/x-fsharp", "name": "polyglot-notebook", "pygments_lexer": "fsharp" }, "polyglot_notebook": { "kernelInfo": { "defaultKernelName": "fsharp", "items": [ { "aliases": [], "languageName": "fsharp", "name": "fsharp" } ] } } }, "nbformat": 4, "nbformat_minor": 2 }