{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(file_handling)=\n", "# File handling\n", "``` {index} File handling\n", "```\n", "Suppose we have a text file like the one below, from which we would like to extract the temperature and density data:\n", "\n", "```\n", "# Density of water at different temperatures, at 1 atm pressure\n", "# Column 1: temperature in degrees Celsius\n", "# Column 2: density in kg/m^3\n", " 0.0 999.8425\n", " 4.0 999.9750\n", " 15.0 999.1026\n", " 20.0 998.2071\n", " 25.0 997.0479\n", " 37.0 993.3316\n", " 50.0 988.04\n", "100.0 958.3665\n", "\n", "# Source: Wikipedia (keyword Density)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(python_file_handling)=\n", "## Python\n", "\n", "It is usually best to read and write files using the `with` statement, which keeps the syntax clean and does several things for us automatically (like closing the file once we are done with it).\n", "\n", "Our file contains 3 lines at the beginning and 1 line at the end which we do not want to extract. Additionally, the temperature and density values are separated by whitespace. Here is an example of how we could extract the data into two arrays:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0. 4. 15. 20. 25. 37. 50. 
100.]\n", "[999.8425 999.975 999.1026 998.2071 997.0479 993.3316 988.04 958.3665]\n" ] } ], "source": [ "import numpy as np\n", "\n", "with open('density_water.dat', 'r') as file:  # r for read\n", "    # skip the header and footer lines by taking a slice of the file,\n", "    # and drop any blank lines left inside that slice\n", "    lines = [line for line in file.readlines()[3:-1] if line.strip()]\n", "\n", "    # initialise 2 zero-filled arrays to store our data\n", "    temp, density = np.zeros(len(lines)), np.zeros(len(lines))\n", "\n", "    for i in range(len(lines)):\n", "        values = lines[i].split()\n", "        temp[i] = float(values[0])\n", "        density[i] = float(values[1])\n", "\n", "print(temp)\n", "print(density)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `lines` variable in the above example is a list containing the lines of the file that hold our numbers, but each line is stored as a string - the values are not recognised as numbers at this point. Blank lines are filtered out because `split()` would return an empty list for them. Note that the temperature and density arrays are initialised as arrays of a specific length, rather than as empty lists which we would then append numbers to. It is good practice to initialise arrays like this when we know exactly how many elements the final array will have. The `for` loop cycles through all elements of the `lines` list. Each element is a string, which we split into a new list with one element per whitespace-separated value in the line - in our case 2. 
Finally, we convert those values from strings to floats and assign them to elements of the temperature and density arrays.\n", "\n", "The example below shows how we could write the data to a text file, here with comma-separated values.\n", "\n", "```python\n", "with open('output.txt', 'w') as file:  # w for write\n", "    file.write('# Temperature and density data\\n')\n", "    for i in range(len(temp)):\n", "        file.write(f'{temp[i]},{density[i]}\\n')\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(numpy_file_handling)=\n", "## NumPy\n", "\n", "NumPy's [`genfromtxt`](https://numpy.org/doc/1.18/reference/generated/numpy.genfromtxt.html) (generate from text) function gives fine-grained control over how a text file is read and builds a NumPy array from its contents. Some of the parameters include:\n", "* `dtype` - data type of the resulting array; `float` by default, while `dtype=None` makes NumPy determine the type of each column automatically\n", "* `comments` - the character(s) marking the start of a comment, everything after which is ignored on that line; `#` by default\n", "* `skip_header` - number of lines to skip at the beginning of the file\n", "* `skip_footer` - number of lines to skip at the end of the file\n", "* `delimiter` - the string used to separate values; any consecutive whitespace by default\n", " \n", "For our file, we do not have to change anything since comments are marked by a `#` at the beginning of the line and the values are separated by whitespace." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0. 999.8425]\n", " [ 4. 999.975 ]\n", " [ 15. 999.1026]\n", " [ 20. 998.2071]\n", " [ 25. 997.0479]\n", " [ 37. 993.3316]\n", " [ 50. 988.04 ]\n", " [100. 
958.3665]]\n" ] } ], "source": [ "import numpy as np\n", "\n", "data = np.genfromtxt('density_water.dat')\n", "print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can save our array to a file in several ways, two of which are shown below. If we plan to use the array in another Python program, it is probably best to save it as a `.npy` file, which we can later load to reconstruct the original array exactly. The reader is encouraged to read about **pickling** in Python, which allows most Python objects to be saved to a file in a similar way, not only NumPy arrays." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0. 999.8425]\n", " [ 4. 999.975 ]\n", " [ 15. 999.1026]\n", " [ 20. 998.2071]\n", " [ 25. 997.0479]\n", " [ 37. 993.3316]\n", " [ 50. 988.04 ]\n", " [100. 958.3665]]\n" ] } ], "source": [ "np.savetxt('data.txt', data) # save data array to a text file\n", "np.save('data.npy', data) # save data array to a .npy file\n", "\n", "A = np.load('data.npy')\n", "\n", "print(A)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(pandas_file_handling)=\n", "## Pandas\n", "\n", "Despite its name, pandas' [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) can read many kinds of delimited text files, including our `.dat` file. The `DataFrame` is the primary data structure in pandas, and `read_csv` returns one. It is a very powerful object with many capabilities; our own tutorial can be found in {ref}`Introduction to Pandas `.\n", "\n", "Unlike in the NumPy example, we have to state explicitly that our comment lines begin with a `#` and that our delimiter is whitespace. Furthermore, we can name the columns of the `DataFrame` by passing our list `col_names` to the `names` parameter. 
By default, `read_csv()` takes the column names from a header line in the file; our file has no such line, so we set `header=None`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " temperature density\n", "0 0.0 999.8425\n", "1 4.0 999.9750\n", "2 15.0 999.1026\n", "3 20.0 998.2071\n", "4 25.0 997.0479\n", "5 37.0 993.3316\n", "6 50.0 988.0400\n", "7 100.0 958.3665\n" ] } ], "source": [ "from pandas import read_csv\n", "\n", "col_names = ['temperature', 'density']\n", "\n", "# sep=r'\\s+' treats any run of whitespace as the delimiter\n", "df = read_csv('density_water.dat', comment='#', sep=r'\\s+', names=col_names, header=None)\n", "print(df)\n", "\n", "df.to_csv('data.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(file_handling_exercises)=\n", "## Exercises\n", "--------\n", "* **Temperature** You are given a table with the average temperature in different countries around the world. Open the file, find the countries whose names end with *stan* (like Turkmeni*stan*) and save their average temperatures in a separate file. Columns are separated by tabs (`\\t`).\n", "\n", "The file can be found at `'Data\\\\TempData.txt'`.\n", "\n", "**HINT**: Structure the data as a list of tuples." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````{admonition} Answer\n", ":class: dropdown\n", "\n", "```python\n", "def isStan(country):\n", "    # True if the country name ends with \"stan\", False otherwise\n", "    return country.endswith(\"stan\")\n", "\n", "with open('Data\\\\TempData.txt', 'r') as file:\n", "    lines = file.readlines()\n", "\n", "    nameTemp = []  # list of (country, temperature) tuples\n", "\n", "    res = 0  # sum of the temperatures of the matching countries\n", "    stan_countries = 0  # number of matching countries\n", "\n", "    for i in range(len(lines)):\n", "        values = lines[i].split('\\t')\n", "        nameTemp.append((values[0].strip(), float(values[1])))\n", "\n", "        if isStan(nameTemp[i][0]):\n", "            res += nameTemp[i][1]\n", "            stan_countries += 1\n", "\n", "print(res/stan_countries)  # mean temperature of the matching countries\n", "```\n", "\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------------\n", "* **Hottest countries**: Using the same file as in the previous exercise, use NumPy to find the countries with an average temperature above 27.\n", "\n", "**HINT**: Set `dtype=['U20', float]` and remember to set the delimiter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````{admonition} Answer\n", ":class: dropdown\n", "\n", "```python\n", "import numpy as np\n", "\n", "data = np.genfromtxt('Data\\\\TempData.txt', delimiter='\\t', dtype=['U20', float])\n", "\n", "# each row holds a country name and its average temperature\n", "for row in data:\n", "    if row[1] > 27:\n", "        print(row[0])\n", "```\n", "\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----------------------\n", "* **Countries per continent** Construct a list showing how many countries and independent regions there are on each continent. Consider only the countries with a `Three_Letter_Country_Code` value. Use the pandas library.\n", "\n", "The file can be found at `Data\\\\CountryContinent.csv`.\n", "\n", "**HINT**: Use the `for index, row in df.iterrows():` structure of the `for` loop. 
The final result should look something like this: `[['Asia', 58], ['Europe', 57], ['Antarctica', 5], ['Africa', 58], ['Oceania', 27], ['North America', 43], ['South America', 14]]`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````{admonition} Answer\n", ":class: dropdown\n", "\n", "```python\n", "from pandas import read_csv, notna\n", "\n", "df = read_csv('Data\\\\CountryContinent.csv')\n", "\n", "continents = df['Continent_Name'].unique()  # array of continent names from the file\n", "res = [[continent, 0] for continent in continents]  # initial list, not counted yet\n", "\n", "for index, row in df.iterrows():\n", "    # missing codes are read in as NaN (a float, not the string \"nan\")\n", "    if notna(row[\"Three_Letter_Country_Code\"]):\n", "\n", "        for j in range(len(res)):\n", "            if row[\"Continent_Name\"] == res[j][0]:  # find correct continent\n", "                res[j][1] += 1  # increase country count by 1\n", "\n", "print(res)\n", "```\n", "\n", "````" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }