{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Code structure for data analysis\n", "\n", "> Marcos Duarte \n", "> Laboratory of Biomechanics and Motor Control ([http://demotu.org/](http://demotu.org/)) \n", "> Federal University of ABC, Brazil" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes data from experiments are stored in different files where each file contains data for different subjects, trials, conditions, etc. This text presents a common and simple solution to write a code to analyze such data. \n", "The basic idea is that the name of the file is created in a structured way and you can use that to run a sequence of procedures inside one or more nested loops. \n", "For instance, consider that the two first letters of the filename encode the initials of the subject's name, the next two letters the different conditions, and the last two characters the trial number. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2017-12-30T09:19:08.523589Z", "start_time": "2017-12-30T09:19:08.518798Z" } }, "outputs": [], "source": [ "subjects = ['AA', 'AB']\n", "conditions = ['c1', 'c2']\n", "trials = ['01', '02', '03']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could open and process these files with:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2017-12-30T09:19:11.658246Z", "start_time": "2017-12-30T09:19:11.648131Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AAc101\n", "AAc102\n", "AAc103\n", "AAc201\n", "AAc202\n", "AAc203\n", "ABc101\n", "ABc102\n", "ABc103\n", "ABc201\n", "ABc202\n", "ABc203\n" ] } ], "source": [ "for subject in subjects:\n", " for condition in conditions:\n", " for trial in trials:\n", " filename = subject + condition + trial\n", " print(filename)\n", " # read file, process data, save results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem with this code is that if one one more files are missing or corrupted (which is typical), it will break. A solution is to read the file inside a `try` function. The `try...except` handles exceptions such as a failure in reading a file and then we can use a `continue` statement to skip each failed iteration in the inner loop. \n", "Let's create some files and implement this idea." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read and save files\n", "\n", "If the data is in text ([ASCII](http://en.wikipedia.org/wiki/ASCII)) format, it's easier to read the file with the [`Numpy`](http://www.numpy.org/) function [`loadtxt`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) or with the [`pandas`](http://pandas.pydata.org/) function [`read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html). Both functions behave similarly; they can skip a certain number of first rows, can read files with different column separators, read numbers and letters, etc. `read_csv` tends to be faster but it returns a `pandas` [`DataFrame`](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html) object, which might not be useful if you are not into `pandas` (but you should be).\n", "\n", "To save data to a file, we can use the counterpart functions `savetxt` and `to_csv`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2017-12-30T09:19:51.468132Z", "start_time": "2017-12-30T09:19:51.375837Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "File ./../data/AAc101.txt saved\n", "File ./../data/AAc102.txt saved\n", "File ./../data/AAc103.txt saved\n", "File ./../data/AAc201.txt saved\n", "File ./../data/AAc202.txt saved\n", "File ./../data/AAc203.txt saved\n", "File ./../data/ABc101.txt saved\n", "File ./../data/ABc102.txt saved\n", "File ./../data/ABc103.txt saved\n", "File ./../data/ABc201.txt saved\n", "File ./../data/ABc202.txt saved\n", "File ./../data/ABc203.txt saved\n" ] } ], "source": [ "import numpy as np\n", "\n", "path = './../data/'\n", "extension = '.txt'\n", "\n", "for subject in subjects:\n", " for condition in conditions:\n", " for trial in trials:\n", " filename = path + subject + condition + trial + extension\n", " data = np.random.randn(5, 3)\n", " header = 'Col A\\tCol B\\tCol C'\n", " np.savetxt(filename, data, fmt='%g',\n", " delimiter='\\t', header = header, comments = '')\n", " print('File', filename, 'saved')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In my case I used the './../' command to move up one directory relatively to my current directory (see the cd (command)). \n", "Let's remove one of the files:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2017-12-30T09:20:01.412652Z", "start_time": "2017-12-30T09:20:01.407628Z" } }, "outputs": [], "source": [ "import os\n", "os.remove('./../data/AAc202.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's read the data in these files and handle a possible missing or corrupted file:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2017-12-30T09:20:08.097063Z", "start_time": "2017-12-30T09:20:08.071650Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "./../data/AAc101.txt loaded\n", "./../data/AAc102.txt loaded\n", "./../data/AAc103.txt loaded\n", "./../data/AAc201.txt loaded\n", "./../data/AAc202.txt [Errno 2] No such file or directory: './../data/AAc202.txt'\n", "./../data/AAc203.txt loaded\n", "./../data/ABc101.txt loaded\n", "./../data/ABc102.txt loaded\n", "./../data/ABc103.txt loaded\n", "./../data/ABc201.txt loaded\n", "./../data/ABc202.txt loaded\n", "./../data/ABc203.txt loaded\n" ] } ], "source": [ "for subject in subjects:\n", " for condition in conditions:\n", " for trial in trials:\n", " filename = path + subject + condition + trial + extension\n", " try:\n", " data = np.loadtxt(filename, skiprows=1)\n", " except Exception as err:\n", " print(filename, err) \n", " continue\n", " else:\n", " print(filename, 'loaded')\n", " \n", " # process data\n", " # ...\n", " # save results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Store results\n", "\n", "The results of the analysis for each file can be stored in a variable in different ways. \n", "We can store the results in a multidimensional variable where each dimension corresponds to the different indices in the loops. With the data above this would produce `results(s, c, t)`, a 2x2x3 array. Or we can store everything in a two-dimensional array where for example each row corresponds to each combination of subject, condition, and trial. \n", "Let's try both ways:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2017-12-30T09:20:29.250460Z", "start_time": "2017-12-30T09:20:29.216320Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2, 2, 3, 3)\n", "[[[[-0.274438 -0.2594482 0.107014 ]\n", " [ 0.526563 0.1208578 -0.1596212 ]\n", " [-0.107973 0.5240266 0.59081164]]\n", "\n", " [[-0.6094492 -0.020314 0.8049366 ]\n", " [ nan nan nan]\n", " [-0.16159672 0.7030814 -0.140353 ]]]\n", "\n", "\n", " [[[-1.1080408 0.45704074 -1.1114716 ]\n", " [-0.4147048 -0.3402416 -0.6172871 ]\n", " [ 0.21572672 1.3349914 0.14154375]]\n", "\n", " [[-0.4043338 -0.2619066 0.071193 ]\n", " [-0.03757986 0.7670396 -0.1359376 ]\n", " [ 0.768179 0.1022648 -0.69671646]]]]\n" ] } ], "source": [ "results = np.empty(shape=(2, 2, 3, 3))*np.NaN\n", "for s, subject in enumerate(subjects):\n", " for c, condition in enumerate(conditions):\n", " for t, trial in enumerate(trials):\n", " filename = path + subject + condition + trial + extension\n", " try:\n", " data = np.loadtxt(filename, skiprows=1)\n", " except Exception as err:\n", " #print(filename, err) \n", " continue\n", " else:\n", " #print(filename, 'loaded')\n", " pass\n", " \n", " results[s, c, t, :] = np.mean(data, axis=0)\n", " \n", "print(results.shape)\n", "print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One problem with this approach is that for many dimensions the data gets convoluted and it might be difficult to read it. \n", "The results for the first subject, condition, and trial are:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2017-12-30T09:20:36.105086Z", "start_time": "2017-12-30T09:20:36.087408Z" } }, "outputs": [ { "data": { "text/plain": [ "array([-0.274438 , -0.2594482, 0.107014 ])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results[0, 0, 0, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the second approach and store the results in a two-dimensional array:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2017-12-30T09:20:43.333769Z", "start_time": "2017-12-30T09:20:43.286426Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(12, 3)\n", "[[-0.274438 -0.2594482 0.107014 ]\n", " [ 0.526563 0.1208578 -0.1596212 ]\n", " [-0.107973 0.5240266 0.59081164]\n", " [-0.6094492 -0.020314 0.8049366 ]\n", " [ nan nan nan]\n", " [-0.16159672 0.7030814 -0.140353 ]\n", " [-1.1080408 0.45704074 -1.1114716 ]\n", " [-0.4147048 -0.3402416 -0.6172871 ]\n", " [ 0.21572672 1.3349914 0.14154375]\n", " [-0.4043338 -0.2619066 0.071193 ]\n", " [-0.03757986 0.7670396 -0.1359376 ]\n", " [ 0.768179 0.1022648 -0.69671646]]\n", "(12, 3)\n", "[[-0.274438 -0.2594482 0.107014 ]\n", " [ 0.526563 0.1208578 -0.1596212 ]\n", " [-0.107973 0.5240266 0.59081164]\n", " [-0.6094492 -0.020314 0.8049366 ]\n", " [ nan nan nan]\n", " [-0.16159672 0.7030814 -0.140353 ]\n", " [-1.1080408 0.45704074 -1.1114716 ]\n", " [-0.4147048 -0.3402416 -0.6172871 ]\n", " [ 0.21572672 1.3349914 0.14154375]\n", " [-0.4043338 -0.2619066 0.071193 ]\n", " [-0.03757986 0.7670396 -0.1359376 ]\n", " [ 0.768179 0.1022648 -0.69671646]]\n" ] } ], "source": [ "results = np.empty(shape=(2*2*3, 3))*np.NaN\n", "results2 = np.empty(shape=(2*2*3, 3))*np.NaN\n", "ind = 0\n", "for s, subject in enumerate(subjects):\n", " for c, condition in enumerate(conditions):\n", " for t, trial in enumerate(trials):\n", " ind += 1\n", " filename = path + subject + condition + trial + extension\n", " try:\n", " data = np.loadtxt(filename, skiprows=1)\n", " except Exception as err:\n", " #print(filename, err) \n", " continue\n", " else:\n", " #print(filename, 'loaded')\n", " pass\n", " \n", " # 1st way, using an index:\n", " results[ind-1, :] = np.mean(data, axis=0)\n", " # 2nd way, no index:\n", " results2[len(conditions)*len(trials)*s + len(trials)*c + t, :] = np.mean(data, axis=0)\n", " \n", "print(results.shape)\n", "print(results)\n", "print(results2.shape)\n", "print(results2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create columns identifying the subject, condition, and trial, which might be useful for running statistical analysis:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2017-12-30T09:20:49.918031Z", "start_time": "2017-12-30T09:20:49.880643Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(12, 6)\n", "[[ 0. 0. 0. -0.274438 -0.2594482 0.107014 ]\n", " [ 0. 0. 1. 0.526563 0.1208578 -0.1596212 ]\n", " [ 0. 0. 2. -0.107973 0.5240266 0.59081164]\n", " [ 0. 1. 0. -0.6094492 -0.020314 0.8049366 ]\n", " [ 0. 1. 1. nan nan nan]\n", " [ 0. 1. 2. -0.16159672 0.7030814 -0.140353 ]\n", " [ 1. 0. 0. -1.1080408 0.45704074 -1.1114716 ]\n", " [ 1. 0. 1. -0.4147048 -0.3402416 -0.6172871 ]\n", " [ 1. 0. 2. 0.21572672 1.3349914 0.14154375]\n", " [ 1. 1. 0. -0.4043338 -0.2619066 0.071193 ]\n", " [ 1. 1. 1. -0.03757986 0.7670396 -0.1359376 ]\n", " [ 1. 1. 2. 0.768179 0.1022648 -0.69671646]]\n" ] } ], "source": [ "results = np.empty(shape=(2*2*3, 3))*np.NaN\n", "ind = 0\n", "indexes = []\n", "for s, subject in enumerate(subjects):\n", " for c, condition in enumerate(conditions):\n", " for t, trial in enumerate(trials):\n", " ind += 1\n", " indexes.append([s, c, t])\n", " filename = path + subject + condition + trial + extension\n", " try:\n", " data = np.loadtxt(filename, skiprows=1)\n", " except Exception as err:\n", " #print(filename, err) \n", " continue\n", " else:\n", " #print(filename, 'loaded')\n", " pass\n", " \n", " results[ind-1, :] = np.mean(data, axis=0)\n", " \n", "results = np.hstack((np.array(indexes), results))\n", "print(results.shape)\n", "print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are just some possible generic approaches to analyze data in multiple files.\n", "\n", "And we can save the results in a file:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2017-12-30T09:20:55.923248Z", "start_time": "2017-12-30T09:20:55.913675Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "File ./../data/results.txt saved\n" ] } ], "source": [ "filename = path + 'results.txt'\n", "header = 'Subject\\tCondition\\tTrial\\tCol A\\tCol B\\tCol C'\n", "np.savetxt(filename, results, fmt='%d\\t%d\\t%d\\t%g\\t%g\\t%g',\n", " delimiter='\\t', header = header, comments = '')\n", "print('File', filename, 'saved')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }