{ "cells": [ { "cell_type": "markdown", "id": "68b92712-4df4-4762-8eeb-1b7a9ec6ca56", "metadata": {}, "source": [ "# Working with messy data with Python and pandas" ] }, { "cell_type": "markdown", "id": "952c0972-0434-44e6-99c3-2594c5158ba1", "metadata": {}, "source": [ "The goal of this workshop is to provide you with some tools that will help you work with real-world data.\n", "\n", "There are many **excellent** textbooks and online tutorials that will teach you how to analyze and visualize data.\n", "\n", "However, these tutorials rely on data that are pre-formatted in a very specific way (i.e., example datasets are ready to analyze).\n", "\n", "There is good reason for this: statistics and plotting libraries contain functions that are designed to have data organized with a precise structure.\n", "- A great resource that describes how data *should* be structured can be found here: https://vita.had.co.nz/papers/tidy-data.pdf\n", "\n", "If data are not arranged appropriately:\n", "- bad: it will be difficult to get your analysis working properly\n", "- worse: you might not be able to get the analysis to work at all\n", "- worst: the analysis will work, but not as you expected!!!\n", "\n", "## Problem\n", "\n", "Real-world data are not usually pre-formatted for analyses.\n", "\n", "Few textbooks or tutorials focus on the process of cleaning up data---that's where this workshop comes in!\n", "\n", "Here, you will learn:\n", "- About the Python library called `pandas`\n", "- How to use Python/pandas to\n", " - Read csv files\n", " - Filter/select data\n", " - Create a clean dataset\n", "\n", "We will work with a small set of example files stored in the `raw_data` folder.\n", "- These files can be downloaded using this link: https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/scds/dash-webinars/tree/main/assets/data\n", "\n", "These files represent an experiment scenario where participants completed a questionnaire and a separate reaction time task.\n", "\n", "Each participant has their data stored in two separate csv files.\n", "\n", "We need to combine all of these files into one clean data set for analysis." ] }, { "cell_type": "markdown", "id": "15b7ac87-876a-474d-a829-aabbfc03f1c3", "metadata": {}, "source": [ "## Overview\n", "\n", "How will we approach this problem?\n", "\n", "1. Select all the survey and trial raw data files\n", "2. Read each file\n", "3. Select the appropriate data in each file and store in a DataFrame\n", "4. Combine all individual data into a single clean data file\n", "5. Write the clean file\n" ] }, { "cell_type": "markdown", "id": "82cd08e7-85d2-41f7-ba98-a6259e5eb238", "metadata": {}, "source": [ "## Organizing Files\n", "The `os` module provides functions that allow you to interact with some basic parts of the operating system.\n", "A very useful function is the `listdir()` function, which can read the contents of a folder on your computer!" ] }, { "cell_type": "code", "execution_count": 1, "id": "96c6ba92-565e-485e-8711-e1bb91b5ae5a", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "import os" ] }, { "cell_type": "code", "execution_count": 2, "id": "49def607-ec20-4baf-9400-b993f183bbc2", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['id_007_trial.csv',\n", " 'id_005_survey.csv',\n", " 'id_008_survey.csv',\n", " '.DS_Store',\n", " 'id_001_trial.csv',\n", " 'id_006_trial.csv',\n", " 'id_007_survey.csv',\n", " 'id_002_survey.csv',\n", " 'id_010_survey.csv',\n", " 'id_005_trial.csv',\n", " 'id_008_trial.csv',\n", " 'id_001_survey.csv',\n", " 'id_003_trial.csv',\n", " 'id_004_survey.csv',\n", " 'id_009_survey.csv',\n", " '.ipynb_checkpoints',\n", " 'id_004_trial.csv',\n", " 'id_010_trial.csv',\n", " 'id_003_survey.csv',\n", " 'id_009_trial.csv',\n", " 'id_002_trial.csv',\n", " 'id_006_survey.csv']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir_raw = \"raw_data/\"\n", "files_raw = os.listdir(dir_raw)\n", "files_raw" ] }, { "cell_type": "markdown", "id": "8b89b803-822d-46cb-80ff-1d12a319e11e", "metadata": {}, "source": [ "There is a problem if .ipynb files are included. We need to exclude non-csv files.\n", "A list comprehension can quickly check each filename for a string pattern." ] }, { "cell_type": "code", "execution_count": 3, "id": "160c331f-5b46-41a7-8c66-ffc643acf5b9", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['id_007_trial.csv',\n", " 'id_005_survey.csv',\n", " 'id_008_survey.csv',\n", " 'id_001_trial.csv',\n", " 'id_006_trial.csv',\n", " 'id_007_survey.csv',\n", " 'id_002_survey.csv',\n", " 'id_010_survey.csv',\n", " 'id_005_trial.csv',\n", " 'id_008_trial.csv',\n", " 'id_001_survey.csv',\n", " 'id_003_trial.csv',\n", " 'id_004_survey.csv',\n", " 'id_009_survey.csv',\n", " 'id_004_trial.csv',\n", " 'id_010_trial.csv',\n", " 'id_003_survey.csv',\n", " 'id_009_trial.csv',\n", " 'id_002_trial.csv',\n", " 'id_006_survey.csv']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "files_raw = [file for file in files_raw if \".csv\" in file]\n", "files_raw" ] }, { "cell_type": "markdown", "id": "c37dbea8-4a59-4813-be6c-8320ac0d3d34", "metadata": {}, "source": [ "We can do this to separate 'survey' files from 'trial' files as well.\n", "This is important because survey and trial files have different structure and will require different selection and filtering tools." ] }, { "cell_type": "code", "execution_count": 4, "id": "92ccf2a7-c59a-424a-b0f0-ad06071797ad", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['id_005_survey.csv', 'id_008_survey.csv', 'id_007_survey.csv', 'id_002_survey.csv', 'id_010_survey.csv', 'id_001_survey.csv', 'id_004_survey.csv', 'id_009_survey.csv', 'id_003_survey.csv', 'id_006_survey.csv']\n", "['id_007_trial.csv', 'id_001_trial.csv', 'id_006_trial.csv', 'id_005_trial.csv', 'id_008_trial.csv', 'id_003_trial.csv', 'id_004_trial.csv', 'id_010_trial.csv', 'id_009_trial.csv', 'id_002_trial.csv']\n" ] } ], "source": [ "files_raw_survey = [f for f in files_raw if \"survey\" in f]\n", "files_raw_trials = [f for f in files_raw if \"trial\" in f]\n", "\n", "print(files_raw_survey)\n", "print(files_raw_trials)" ] }, { "cell_type": "markdown", "id": "1c03f624-cea7-4908-9037-5d5e78eb672f", "metadata": {}, "source": [ "Now we have two lists of file names. One that holds all of the survey files and one that holds all of the experiment files." ] }, { "cell_type": "markdown", "id": "2b9b3ef8-a218-471b-b1f8-8e1bf4aa9a83", "metadata": {}, "source": [ "## pandas\n", "\n", "The `pandas` library lets us easily work with data structures.\n", "\n", "The pandas documentation is a great resource!\n", "\n", "Check out the Intro to pandas: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#intro-to-pandas" ] }, { "cell_type": "code", "execution_count": 5, "id": "afb39251-c315-4c99-80de-c70ab0e3df88", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "0a996d97-fe8c-44eb-a3ea-a23778eb2d2b", "metadata": {}, "source": [ "### Creating DataFrames" ] }, { "cell_type": "code", "execution_count": 6, "id": "a8e01259-8594-4099-963e-3a53f8742d7d", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_Acol_B
0A20
1B30
\n", "
" ], "text/plain": [ " col_A col_B\n", "0 A 20\n", "1 B 30" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df01 = pd.DataFrame({\n", " \"col_A\": [\"A\", \"B\"],\n", " \"col_B\": [20, 30]\n", "})\n", "df01.head()" ] }, { "cell_type": "code", "execution_count": 7, "id": "df14b49e-b0db-4580-b93a-893c3affba33", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 A\n", "1 B\n", "Name: col_A, dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df01[\"col_A\"]" ] }, { "cell_type": "markdown", "id": "beb4d072-6ee3-4cbe-ae0c-82d5978d9bce", "metadata": {}, "source": [ "### DataFrames from Files" ] }, { "cell_type": "code", "execution_count": 8, "id": "8c481ede-90ed-4d48-b43f-597cf9257f8d", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsed
01Participant IDNaN00121
12ConsentNaNComplete47
23Trial 1blue_red_Incongruent6763
34Trial 2blue_blue_Congruent5191
45Trial 3red_red_Congruent51116
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed\n", "0 1 Participant ID NaN 001 21\n", "1 2 Consent NaN Complete 47\n", "2 3 Trial 1 blue_red_Incongruent 67 63\n", "3 4 Trial 2 blue_blue_Congruent 51 91\n", "4 5 Trial 3 red_red_Congruent 51 116" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example = pd.read_csv(\"raw_data/id_001_trial.csv\")\n", "data_read_example.head()" ] }, { "cell_type": "code", "execution_count": 9, "id": "79218861-92f0-4ac7-9609-7a73adc54f0e", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsed
1920Trial 18blue_blue_Congruent26538
2021Trial 19red_red_Congruent1585
2122Trial 20red_red_Congruent37613
2223DebriefNaNComplete642
2324CompleteNaNDone666
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed\n", "19 20 Trial 18 blue_blue_Congruent 26 538\n", "20 21 Trial 19 red_red_Congruent 1 585\n", "21 22 Trial 20 red_red_Congruent 37 613\n", "22 23 Debrief NaN Complete 642\n", "23 24 Complete NaN Done 666" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.tail()" ] }, { "cell_type": "code", "execution_count": 10, "id": "1ac01988-b54c-4232-850f-e3a1dd9dbc0b", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 24 entries, 0 to 23\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 trial_index 24 non-null int64 \n", " 1 task 24 non-null object\n", " 2 stimulus 20 non-null object\n", " 3 value 24 non-null object\n", " 4 time_elapsed 24 non-null int64 \n", "dtypes: int64(2), object(3)\n", "memory usage: 1.1+ KB\n" ] } ], "source": [ "data_read_example.info()" ] }, { "cell_type": "code", "execution_count": 11, "id": "a382e46f-00ab-48e1-b82e-f0a0299ed00f", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextime_elapsed
count24.00000024.000000
mean12.500000341.666667
std7.071068205.208837
min1.00000021.000000
25%6.750000147.250000
50%12.500000355.500000
75%18.250000501.000000
max24.000000666.000000
\n", "
" ], "text/plain": [ " trial_index time_elapsed\n", "count 24.000000 24.000000\n", "mean 12.500000 341.666667\n", "std 7.071068 205.208837\n", "min 1.000000 21.000000\n", "25% 6.750000 147.250000\n", "50% 12.500000 355.500000\n", "75% 18.250000 501.000000\n", "max 24.000000 666.000000" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.describe()" ] }, { "cell_type": "markdown", "id": "d229c9de-9ce0-4b46-94dd-978deb94fdb0", "metadata": {}, "source": [ "### Subsetting DataFrames\n", "\n", "#### Columns" ] }, { "cell_type": "code", "execution_count": 12, "id": "37c7fcea-eecf-499b-9a8d-01d31718fce8", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 21\n", "1 47\n", "2 63\n", "3 91\n", "4 116\n", "5 130\n", "6 153\n", "7 200\n", "8 229\n", "9 275\n", "10 321\n", "11 345\n", "12 366\n", "13 408\n", "14 430\n", "15 461\n", "16 488\n", "17 496\n", "18 516\n", "19 538\n", "20 585\n", "21 613\n", "22 642\n", "23 666\n", "Name: time_elapsed, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[\"time_elapsed\"]" ] }, { "cell_type": "code", "execution_count": 13, "id": "dc17cacc-ffc0-4a66-9df4-b5e243ca750f", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
time_elapsedtask
021Participant ID
147Consent
263Trial 1
391Trial 2
4116Trial 3
5130Trial 4
6153Trial 5
7200Trial 6
8229Trial 7
9275Trial 8
10321Trial 9
11345Trial 10
12366Trial 11
13408Trial 12
14430Trial 13
15461Trial 14
16488Trial 15
17496Trial 16
18516Trial 17
19538Trial 18
20585Trial 19
21613Trial 20
22642Debrief
23666Complete
\n", "
" ], "text/plain": [ " time_elapsed task\n", "0 21 Participant ID\n", "1 47 Consent\n", "2 63 Trial 1\n", "3 91 Trial 2\n", "4 116 Trial 3\n", "5 130 Trial 4\n", "6 153 Trial 5\n", "7 200 Trial 6\n", "8 229 Trial 7\n", "9 275 Trial 8\n", "10 321 Trial 9\n", "11 345 Trial 10\n", "12 366 Trial 11\n", "13 408 Trial 12\n", "14 430 Trial 13\n", "15 461 Trial 14\n", "16 488 Trial 15\n", "17 496 Trial 16\n", "18 516 Trial 17\n", "19 538 Trial 18\n", "20 585 Trial 19\n", "21 613 Trial 20\n", "22 642 Debrief\n", "23 666 Complete" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[[\"time_elapsed\", \"task\"]]" ] }, { "cell_type": "markdown", "id": "c48a0f3d-1bbd-4d74-868a-c79b38147b0b", "metadata": {}, "source": [ "#### Rows" ] }, { "cell_type": "code", "execution_count": 14, "id": "bb43eace-9c27-40a4-a39a-263378ae8bdb", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsed
1314Trial 12red_red_Congruent15408
1415Trial 13red_red_Congruent22430
1516Trial 14blue_red_Incongruent55461
1617Trial 15blue_red_Incongruent66488
1718Trial 16red_red_Congruent48496
1819Trial 17red_blue_Incongruent75516
1920Trial 18blue_blue_Congruent26538
2021Trial 19red_red_Congruent1585
2122Trial 20red_red_Congruent37613
2223DebriefNaNComplete642
2324CompleteNaNDone666
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed\n", "13 14 Trial 12 red_red_Congruent 15 408\n", "14 15 Trial 13 red_red_Congruent 22 430\n", "15 16 Trial 14 blue_red_Incongruent 55 461\n", "16 17 Trial 15 blue_red_Incongruent 66 488\n", "17 18 Trial 16 red_red_Congruent 48 496\n", "18 19 Trial 17 red_blue_Incongruent 75 516\n", "19 20 Trial 18 blue_blue_Congruent 26 538\n", "20 21 Trial 19 red_red_Congruent 1 585\n", "21 22 Trial 20 red_red_Congruent 37 613\n", "22 23 Debrief NaN Complete 642\n", "23 24 Complete NaN Done 666" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[data_read_example[\"time_elapsed\"] > 400]" ] }, { "cell_type": "code", "execution_count": 15, "id": "2499275d-cb5b-4f9c-8f7f-8de239b89b17", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 True\n", "14 True\n", "15 True\n", "16 True\n", "17 True\n", "18 True\n", "19 True\n", "20 True\n", "21 True\n", "22 True\n", "23 True\n", "Name: time_elapsed, dtype: bool" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[\"time_elapsed\"] > 400" ] }, { "cell_type": "code", "execution_count": 16, "id": "7a43bbed-24b6-4c82-beba-e95c55e90d6c", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 False\n", "14 True\n", "15 False\n", "16 False\n", "17 False\n", "18 True\n", "19 False\n", "20 False\n", "21 False\n", "22 False\n", "23 False\n", "Name: task, dtype: bool" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[\"task\"].isin([\"Trial 13\", \"Trial 17\"])" ] }, { "cell_type": "code", "execution_count": 17, "id": "c09ca6b8-5897-4671-91b4-16cb52da2a9e", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsed
1415Trial 13red_red_Congruent22430
1819Trial 17red_blue_Incongruent75516
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed\n", "14 15 Trial 13 red_red_Congruent 22 430\n", "18 19 Trial 17 red_blue_Incongruent 75 516" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[data_read_example[\"task\"].isin([\"Trial 13\", \"Trial 17\"])]" ] }, { "cell_type": "markdown", "id": "ae2349ad-6edd-4503-8a91-44c01f359bc7", "metadata": {}, "source": [ "#### Rows and Columns" ] }, { "cell_type": "code", "execution_count": 18, "id": "0750cb15-8cdf-4401-966b-b4ed6db8ff78", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stimulusvalue
14red_red_Congruent22
\n", "
" ], "text/plain": [ " stimulus value\n", "14 red_red_Congruent 22" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.loc[data_read_example[\"task\"].isin([\"Trial 13\"]), [\"stimulus\", \"value\"]]" ] }, { "cell_type": "code", "execution_count": 19, "id": "a5c04042-474b-482b-a706-c70bbd806ccb", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stimulusvalue
3blue_blue_Congruent51
4red_red_Congruent51
5blue_blue_Congruent16
6blue_red_Incongruent81
\n", "
" ], "text/plain": [ " stimulus value\n", "3 blue_blue_Congruent 51\n", "4 red_red_Congruent 51\n", "5 blue_blue_Congruent 16\n", "6 blue_red_Incongruent 81" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.iloc[3:7,2:4]" ] }, { "cell_type": "code", "execution_count": 20, "id": "39dfa042-0abc-4641-a7f8-c7b991e1966d", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsed
01Participant IDNaN00121
12ConsentNaNComplete222
23Trial 1blue_red_Incongruent67222
34Trial 2blue_blue_Congruent5191
45Trial 3red_red_Congruent51116
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed\n", "0 1 Participant ID NaN 001 21\n", "1 2 Consent NaN Complete 222\n", "2 3 Trial 1 blue_red_Incongruent 67 222\n", "3 4 Trial 2 blue_blue_Congruent 51 91\n", "4 5 Trial 3 red_red_Congruent 51 116" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.iloc[1:3, 4] = 222\n", "data_read_example.head()" ] }, { "cell_type": "markdown", "id": "f721fec7-1ecf-4c7a-aea1-3b11a8745cb2", "metadata": {}, "source": [ "### Creating New Columns" ] }, { "cell_type": "code", "execution_count": 21, "id": "88adb964-fed6-4515-b4a4-ba6232229718", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsedtime_seconds
01Participant IDNaN001210.021
12ConsentNaNComplete2220.222
23Trial 1blue_red_Incongruent672220.222
34Trial 2blue_blue_Congruent51910.091
45Trial 3red_red_Congruent511160.116
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed \\\n", "0 1 Participant ID NaN 001 21 \n", "1 2 Consent NaN Complete 222 \n", "2 3 Trial 1 blue_red_Incongruent 67 222 \n", "3 4 Trial 2 blue_blue_Congruent 51 91 \n", "4 5 Trial 3 red_red_Congruent 51 116 \n", "\n", " time_seconds \n", "0 0.021 \n", "1 0.222 \n", "2 0.222 \n", "3 0.091 \n", "4 0.116 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[\"time_seconds\"] = data_read_example[\"time_elapsed\"] / 1000\n", "data_read_example.head()" ] }, { "cell_type": "markdown", "id": "d6ef3065-8f4d-46e7-9648-dfcd0cbc9d1d", "metadata": {}, "source": [ "#### Renaming Columns" ] }, { "cell_type": "code", "execution_count": 22, "id": "42960600-9ef7-46ba-bf4a-ace024f9f2f4", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsedtime_s
01Participant IDNaN001210.021
12ConsentNaNComplete2220.222
23Trial 1blue_red_Incongruent672220.222
34Trial 2blue_blue_Congruent51910.091
45Trial 3red_red_Congruent511160.116
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed \\\n", "0 1 Participant ID NaN 001 21 \n", "1 2 Consent NaN Complete 222 \n", "2 3 Trial 1 blue_red_Incongruent 67 222 \n", "3 4 Trial 2 blue_blue_Congruent 51 91 \n", "4 5 Trial 3 red_red_Congruent 51 116 \n", "\n", " time_s \n", "0 0.021 \n", "1 0.222 \n", "2 0.222 \n", "3 0.091 \n", "4 0.116 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.rename(columns={\"time_seconds\": \"time_s\"}, inplace=True)\n", "data_read_example.head()" ] }, { "cell_type": "markdown", "id": "031e90c1-528d-4d46-86b5-be189da55264", "metadata": {}, "source": [ "### Summarizing" ] }, { "cell_type": "code", "execution_count": 23, "id": "f1ac2866-34c3-46c3-9b78-bc14556e639c", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0.3555833333333333" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[\"time_s\"].mean()" ] }, { "cell_type": "code", "execution_count": 24, "id": "14c5479a-84f4-489c-bd19-f9005a7f922d", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
time_s
mean0.355583
std0.189743
\n", "
" ], "text/plain": [ " time_s\n", "mean 0.355583\n", "std 0.189743" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.agg(\n", " {\n", " \"time_s\": [\"mean\", \"std\"]\n", " }\n", ")" ] }, { "cell_type": "markdown", "id": "dcd6ac3e-5e34-4a81-91bb-83b8aa54214b", "metadata": {}, "source": [ "#### Grouping" ] }, { "cell_type": "code", "execution_count": 25, "id": "3742041f-f799-4b40-a0f1-b5d5e5119bdf", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
time_elapsedtime_s
stimulus
blue_blue_Congruent253.0000000.253000
blue_red_Incongruent331.0000000.331000
red_blue_Incongruent337.1666670.337167
red_red_Congruent411.0000000.411000
\n", "
" ], "text/plain": [ " time_elapsed time_s\n", "stimulus \n", "blue_blue_Congruent 253.000000 0.253000\n", "blue_red_Incongruent 331.000000 0.331000\n", "red_blue_Incongruent 337.166667 0.337167\n", "red_red_Congruent 411.000000 0.411000" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.groupby(\"stimulus\").mean(numeric_only=True)[[\"time_elapsed\", \"time_s\"]]" ] }, { "cell_type": "markdown", "id": "192963fb-36f8-47e1-971d-649b87da3e96", "metadata": {}, "source": [ "#### Counting" ] }, { "cell_type": "code", "execution_count": 26, "id": "95926985-8970-4eb9-93c7-ed8bebf04521", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "51 2\n", "75 2\n", "48 2\n", "Complete 2\n", "001 1\n", "15 1\n", "37 1\n", "1 1\n", "26 1\n", "66 1\n", "55 1\n", "22 1\n", "31 1\n", "58 1\n", "34 1\n", "61 1\n", "81 1\n", "16 1\n", "67 1\n", "Done 1\n", "Name: value, dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[\"value\"].value_counts()" ] }, { "cell_type": "markdown", "id": "4f061ad1-6516-4f1a-b3eb-64f6101e6404", "metadata": {}, "source": [ "### Sorting" ] }, { "cell_type": "code", "execution_count": 27, "id": "b349fe3c-492d-478c-8282-96571f9a3de4", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsedtime_s
2324CompleteNaNDone6660.666
12ConsentNaNComplete2220.222
2223DebriefNaNComplete6420.642
01Participant IDNaN001210.021
23Trial 1blue_red_Incongruent672220.222
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed \\\n", "23 24 Complete NaN Done 666 \n", "1 2 Consent NaN Complete 222 \n", "22 23 Debrief NaN Complete 642 \n", "0 1 Participant ID NaN 001 21 \n", "2 3 Trial 1 blue_red_Incongruent 67 222 \n", "\n", " time_s \n", "23 0.666 \n", "1 0.222 \n", "22 0.642 \n", "0 0.021 \n", "2 0.222 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.sort_values(by=\"task\").head()" ] }, { "cell_type": "code", "execution_count": 28, "id": "1c91333d-9fba-4ebc-a963-e79f5dd6ef24", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsedtime_s
2122Trial 20red_red_Congruent376130.613
2021Trial 19red_red_Congruent15850.585
1718Trial 16red_red_Congruent484960.496
1415Trial 13red_red_Congruent224300.430
1314Trial 12red_red_Congruent154080.408
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed time_s\n", "21 22 Trial 20 red_red_Congruent 37 613 0.613\n", "20 21 Trial 19 red_red_Congruent 1 585 0.585\n", "17 18 Trial 16 red_red_Congruent 48 496 0.496\n", "14 15 Trial 13 red_red_Congruent 22 430 0.430\n", "13 14 Trial 12 red_red_Congruent 15 408 0.408" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.sort_values(by=[\"stimulus\", \"time_elapsed\"], ascending=False).head()" ] }, { "cell_type": "markdown", "id": "d1611f84-4018-4497-a6bb-a833cd5a0859", "metadata": {}, "source": [ "### Formatting: Long to Wide" ] }, { "cell_type": "code", "execution_count": 29, "id": "1a0839d7-ae90-41cd-a38b-7127f8afa8e5", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TypeValue
0A20
1B30
2C40
3A50
4B30
\n", "
" ], "text/plain": [ " Type Value\n", "0 A 20\n", "1 B 30\n", "2 C 40\n", "3 A 50\n", "4 B 30" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df02 = pd.DataFrame({\n", " \"Type\": [\"A\", \"B\", \"C\", \"A\", \"B\"],\n", " \"Value\": [20, 30, 40, 50, 30]\n", "})\n", "df02" ] }, { "cell_type": "code", "execution_count": 30, "id": "a49b64bd-84d8-4cf0-80fa-0e69e12a7efd", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TypeABC
020.0NaNNaN
1NaN30.0NaN
2NaNNaN40.0
350.0NaNNaN
4NaN30.0NaN
\n", "
" ], "text/plain": [ "Type A B C\n", "0 20.0 NaN NaN\n", "1 NaN 30.0 NaN\n", "2 NaN NaN 40.0\n", "3 50.0 NaN NaN\n", "4 NaN 30.0 NaN" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df02.pivot(columns=\"Type\", values=\"Value\")" ] }, { "cell_type": "code", "execution_count": 31, "id": "9665983a-b1a9-4ced-8dd2-5d623a43ac1d", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TypeABC
Value353040
\n", "
" ], "text/plain": [ "Type A B C\n", "Value 35 30 40" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df02.pivot_table(columns=\"Type\", values=\"Value\", aggfunc=np.mean)" ] }, { "cell_type": "markdown", "id": "5868d644-f52d-4fbb-b0d4-dda6b727fce2", "metadata": {}, "source": [ "### Formatting: Wide to Long" ] }, { "cell_type": "code", "execution_count": 32, "id": "1c410c4a-8ce3-4b55-b7f7-8922c0c3f3e5", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Participant IDTask 1Task 2Task 3
0001202218
\n", "
" ], "text/plain": [ " Participant ID Task 1 Task 2 Task 3\n", "0 001 20 22 18" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df03 = pd.DataFrame({\n", " \"Participant ID\": [\"001\"],\n", " \"Task 1\": [20],\n", " \"Task 2\": [22],\n", " \"Task 3\": [18],\n", "})\n", "df03" ] }, { "cell_type": "code", "execution_count": 33, "id": "95a92d0c-f894-45bf-81e4-9d774191d4e9", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Participant IDtaskvalue
0001Task 120
1001Task 222
2001Task 318
\n", "
" ], "text/plain": [ " Participant ID task value\n", "0 001 Task 1 20\n", "1 001 Task 2 22\n", "2 001 Task 3 18" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df03.melt(id_vars=\"Participant ID\", var_name=\"task\", value_name=\"value\")" ] }, { "cell_type": "code", "execution_count": 34, "id": "7b41dd32-e019-485f-aaed-646287927016", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Participant IDTask 3taskvalue
000118Task 120
100118Task 222
\n", "
" ], "text/plain": [ " Participant ID Task 3 task value\n", "0 001 18 Task 1 20\n", "1 001 18 Task 2 22" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df03.melt(\n", " id_vars=[\"Participant ID\", \"Task 3\"], \n", " value_vars=[\"Task 1\", \"Task 2\"],\n", " var_name=\"task\", \n", " value_name=\"value\")" ] }, { "cell_type": "markdown", "id": "73828442-8818-4159-b545-45ee08454541", "metadata": { "tags": [] }, "source": [ "### Combining Data" ] }, { "cell_type": "markdown", "id": "429b6be8-9031-4d32-a828-2644c2d47266", "metadata": {}, "source": [ "#### Concatenation" ] }, { "cell_type": "code", "execution_count": 35, "id": "36e3bba1-fe72-4b22-874e-f9bc1b13f06d", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_Acol_B
0A22
1B33
2C30
3D10
\n", "
" ], "text/plain": [ " col_A col_B\n", "0 A 22\n", "1 B 33\n", "2 C 30\n", "3 D 10" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df01 = pd.DataFrame({\n", " \"col_A\": [\"A\", \"B\"],\n", " \"col_B\": [22, 33],\n", "})\n", "\n", "df02 = pd.DataFrame({\n", " \"col_A\": [\"C\", \"D\"],\n", " \"col_B\": [30, 10],\n", "})\n", "\n", "pd.concat([df01, df02], ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 36, "id": "bb7f99e7-e97a-4e20-a5d3-5af0288a3abb", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
0A22C30
1B33D10
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 A 22 C 30\n", "1 B 33 D 10" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df01, df02], ignore_index=True, axis=1)" ] }, { "cell_type": "code", "execution_count": 37, "id": "75238f80-13ab-4263-94dd-777168bb4796", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_Acol_Bcol_Ccol_D
0A22.0NaNNaN
1B33.0NaNNaN
2NaNNaNC30.0
3NaNNaND10.0
\n", "
" ], "text/plain": [ " col_A col_B col_C col_D\n", "0 A 22.0 NaN NaN\n", "1 B 33.0 NaN NaN\n", "2 NaN NaN C 30.0\n", "3 NaN NaN D 10.0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df01 = pd.DataFrame({\n", " \"col_A\": [\"A\", \"B\"],\n", " \"col_B\": [22, 33],\n", "})\n", "\n", "df02 = pd.DataFrame({\n", " \"col_C\": [\"C\", \"D\"],\n", " \"col_D\": [30, 10],\n", "})\n", "\n", "pd.concat([df01, df02], ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 38, "id": "e524efba-627a-4d36-9809-49d10289cbce", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_Acol_Bcol_Ccol_D
0A22.0NaNNaN
1B33.0NaNNaN
2NaNNaNC30.0
3NaNNaND10.0
\n", "
" ], "text/plain": [ " col_A col_B col_C col_D\n", "0 A 22.0 NaN NaN\n", "1 B 33.0 NaN NaN\n", "2 NaN NaN C 30.0\n", "3 NaN NaN D 10.0" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df01, df02], ignore_index=True, keys=[\"col_A\", \"col_C\"])" ] }, { "cell_type": "markdown", "id": "ac3dc7ac-8989-4c69-b5d4-fd1216ceccb7", "metadata": {}, "source": [ "#### Merging" ] }, { "cell_type": "code", "execution_count": 39, "id": "ded5917c-4eaa-4d34-a5f9-481cca0ed87d", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_Acol_Bcol_Ccol_D
0A22A30
\n", "
" ], "text/plain": [ " col_A col_B col_C col_D\n", "0 A 22 A 30" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df01 = pd.DataFrame({\n", " \"col_A\": [\"A\", \"B\"],\n", " \"col_B\": [22, 33],\n", "})\n", "\n", "df02 = pd.DataFrame({\n", " \"col_C\": [\"A\", \"C\"],\n", " \"col_D\": [30, 10],\n", "})\n", "\n", "pd.merge(df01, df02, left_on=\"col_A\", right_on=\"col_C\") # how=outer/inner - left/right/cross" ] }, { "cell_type": "markdown", "id": "973e99d2-fa64-4938-b39d-e3f30c1ab21b", "metadata": {}, "source": [ "### Manipulating Text" ] }, { "cell_type": "code", "execution_count": 40, "id": "15984090-5725-4af3-af66-3cfa6806460c", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsedtime_s
01participant idNaN001210.021
12consentNaNComplete2220.222
23trial 1blue_red_Incongruent672220.222
34trial 2blue_blue_Congruent51910.091
45trial 3red_red_Congruent511160.116
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed \\\n", "0 1 participant id NaN 001 21 \n", "1 2 consent NaN Complete 222 \n", "2 3 trial 1 blue_red_Incongruent 67 222 \n", "3 4 trial 2 blue_blue_Congruent 51 91 \n", "4 5 trial 3 red_red_Congruent 51 116 \n", "\n", " time_s \n", "0 0.021 \n", "1 0.222 \n", "2 0.222 \n", "3 0.091 \n", "4 0.116 " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[\"task\"] = data_read_example[\"task\"].str.lower()\n", "data_read_example.head()" ] }, { "cell_type": "code", "execution_count": 41, "id": "5175ace1-634e-483f-b4dd-463c85e3ed5a", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsedtime_s
23trial 1blue_red_Incongruent672220.222
34trial 2blue_blue_Congruent51910.091
45trial 3red_red_Congruent511160.116
56trial 4blue_blue_Congruent161300.130
67trial 5blue_red_Incongruent811530.153
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed time_s\n", "2 3 trial 1 blue_red_Incongruent 67 222 0.222\n", "3 4 trial 2 blue_blue_Congruent 51 91 0.091\n", "4 5 trial 3 red_red_Congruent 51 116 0.116\n", "5 6 trial 4 blue_blue_Congruent 16 130 0.130\n", "6 7 trial 5 blue_red_Incongruent 81 153 0.153" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[data_read_example[\"task\"].str.contains(\"trial\")].head()" ] }, { "cell_type": "code", "execution_count": 42, "id": "142af218-72ec-4f43-8af6-3ddf20361ed0", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 NaN\n", "2 [blue, red, Incongruent]\n", "3 [blue, blue, Congruent]\n", "4 [red, red, Congruent]\n", "5 [blue, blue, Congruent]\n", "6 [blue, red, Incongruent]\n", "7 [red, blue, Incongruent]\n", "8 [red, red, Congruent]\n", "9 [red, blue, Incongruent]\n", "10 [red, blue, Incongruent]\n", "11 [red, blue, Incongruent]\n", "12 [red, blue, Incongruent]\n", "13 [red, red, Congruent]\n", "14 [red, red, Congruent]\n", "15 [blue, red, Incongruent]\n", "16 [blue, red, Incongruent]\n", "17 [red, red, Congruent]\n", "18 [red, blue, Incongruent]\n", "19 [blue, blue, Congruent]\n", "20 [red, red, Congruent]\n", "21 [red, red, Congruent]\n", "22 NaN\n", "23 NaN\n", "Name: stimulus, dtype: object" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example[\"stimulus\"].str.split(\"_\")" ] }, { "cell_type": "code", "execution_count": 43, "id": "33a3110c-a532-4580-826b-cd70e7f7d109", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WordColourCondition
2blueredIncongruent
3blueblueCongruent
4redredCongruent
5blueblueCongruent
6blueredIncongruent
7redblueIncongruent
8redredCongruent
9redblueIncongruent
10redblueIncongruent
11redblueIncongruent
12redblueIncongruent
13redredCongruent
14redredCongruent
15blueredIncongruent
16blueredIncongruent
17redredCongruent
18redblueIncongruent
19blueblueCongruent
20redredCongruent
21redredCongruent
\n", "
" ], "text/plain": [ " Word Colour Condition\n", "2 blue red Incongruent\n", "3 blue blue Congruent\n", "4 red red Congruent\n", "5 blue blue Congruent\n", "6 blue red Incongruent\n", "7 red blue Incongruent\n", "8 red red Congruent\n", "9 red blue Incongruent\n", "10 red blue Incongruent\n", "11 red blue Incongruent\n", "12 red blue Incongruent\n", "13 red red Congruent\n", "14 red red Congruent\n", "15 blue red Incongruent\n", "16 blue red Incongruent\n", "17 red red Congruent\n", "18 red blue Incongruent\n", "19 blue blue Congruent\n", "20 red red Congruent\n", "21 red red Congruent" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_read_example.loc[data_read_example[\"task\"].str.contains(\n", " \"trial\"), \"stimulus\"].str.split(\"_\", expand=True).rename(\n", " columns={0: \"Word\", 1: \"Colour\", 2: \"Condition\"})" ] }, { "cell_type": "markdown", "id": "cd89d8fd-3fcd-4783-80ec-fcfbd6fdb74a", "metadata": { "jupyter": { "outputs_hidden": true }, "tags": [] }, "source": [ "## Putting it all Together\n", "\n", "We can use pandas along with the list of files to create a clean set of data" ] }, { "cell_type": "code", "execution_count": 44, "id": "e8a0e1ab-1171-4fab-8374-3046902953d0", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['id_005_survey.csv', 'id_008_survey.csv', 'id_007_survey.csv', 'id_002_survey.csv', 'id_010_survey.csv', 'id_001_survey.csv', 'id_004_survey.csv', 'id_009_survey.csv', 'id_003_survey.csv', 'id_006_survey.csv']\n", "['id_007_trial.csv', 'id_001_trial.csv', 'id_006_trial.csv', 'id_005_trial.csv', 'id_008_trial.csv', 'id_003_trial.csv', 'id_004_trial.csv', 'id_010_trial.csv', 'id_009_trial.csv', 'id_002_trial.csv']\n" ] } ], "source": [ "print(files_raw_survey)\n", "print(files_raw_trials)" ] }, { "cell_type": "markdown", "id": "6f02ae94-a4ab-4f74-8752-f583685ee48c", "metadata": { "jupyter": { "outputs_hidden": true } }, "source": [ "### Create Survey Data\n", "\n", "Start by creating an empty dataframe to store all of the survey data.\n", "\n", "Then, read each file in `files_raw_survey` into a temporary dataframe and concatenate to the empty survey data." ] }, { "cell_type": "code", "execution_count": 45, "id": "198f5fec-2b7d-4393-a761-c55e6ab3d6b4", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Participant IDAgeGenderIncomeEducation
0538MaleMediumUniversity
0823FemaleHighHigh school
0725MaleHighUniversity
0233MaleHighUniversity
01023FemaleMediumHigh school
0120FemaleMediumUniversity
0421MaleLowUniversity
0926MaleLowCollege
0331FemaleLowCollege
0624FemaleMediumUniversity
\n", "
" ], "text/plain": [ " Participant ID Age Gender Income Education\n", "0 5 38 Male Medium University\n", "0 8 23 Female High High school\n", "0 7 25 Male High University\n", "0 2 33 Male High University\n", "0 10 23 Female Medium High school\n", "0 1 20 Female Medium University\n", "0 4 21 Male Low University\n", "0 9 26 Male Low College\n", "0 3 31 Female Low College\n", "0 6 24 Female Medium University" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_survey = pd.DataFrame()\n", "\n", "for file in files_raw_survey:\n", " tmp_df = pd.read_csv(dir_raw + file)\n", " data_survey = pd.concat([data_survey, tmp_df])\n", " \n", "data_survey" ] }, { "cell_type": "markdown", "id": "bfd74b02-5b94-4071-aec4-ecf84ed81caa", "metadata": { "tags": [] }, "source": [ "### Create Trials data\n", "\n", "We can read the trial data in a similar way, but we will need to select and filter data as we go.\n", "\n", "Lets just start with one file:" ] }, { "cell_type": "code", "execution_count": 46, "id": "ea8cf893-5312-4c15-8a07-22bb52c3d791", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trial_indextaskstimulusvaluetime_elapsed
01Participant IDNaN00727
12ConsentNaNComplete56
23Trial 1red_red_Congruent3280
34Trial 2red_blue_Incongruent76105
45Trial 3red_red_Congruent43116
56Trial 4red_blue_Incongruent85130
67Trial 5red_blue_Incongruent49166
78Trial 6blue_blue_Congruent2171
89Trial 7blue_blue_Congruent55201
910Trial 8blue_blue_Congruent34241
1011Trial 9blue_red_Incongruent35246
1112Trial 10red_blue_Incongruent71288
1213Trial 11blue_red_Incongruent60294
1314Trial 12red_red_Congruent49341
1415Trial 13blue_blue_Congruent20360
1516Trial 14blue_blue_Congruent8365
1617Trial 15red_blue_Incongruent72381
1718Trial 16blue_blue_Congruent35387
1819Trial 17blue_blue_Congruent56436
1920Trial 18blue_blue_Congruent53449
2021Trial 19red_red_Congruent58465
2122Trial 20blue_red_Incongruent35514
2223DebriefNaNComplete549
2324CompleteNaNDone562
\n", "
" ], "text/plain": [ " trial_index task stimulus value time_elapsed\n", "0 1 Participant ID NaN 007 27\n", "1 2 Consent NaN Complete 56\n", "2 3 Trial 1 red_red_Congruent 32 80\n", "3 4 Trial 2 red_blue_Incongruent 76 105\n", "4 5 Trial 3 red_red_Congruent 43 116\n", "5 6 Trial 4 red_blue_Incongruent 85 130\n", "6 7 Trial 5 red_blue_Incongruent 49 166\n", "7 8 Trial 6 blue_blue_Congruent 2 171\n", "8 9 Trial 7 blue_blue_Congruent 55 201\n", "9 10 Trial 8 blue_blue_Congruent 34 241\n", "10 11 Trial 9 blue_red_Incongruent 35 246\n", "11 12 Trial 10 red_blue_Incongruent 71 288\n", "12 13 Trial 11 blue_red_Incongruent 60 294\n", "13 14 Trial 12 red_red_Congruent 49 341\n", "14 15 Trial 13 blue_blue_Congruent 20 360\n", "15 16 Trial 14 blue_blue_Congruent 8 365\n", "16 17 Trial 15 red_blue_Incongruent 72 381\n", "17 18 Trial 16 blue_blue_Congruent 35 387\n", "18 19 Trial 17 blue_blue_Congruent 56 436\n", "19 20 Trial 18 blue_blue_Congruent 53 449\n", "20 21 Trial 19 red_red_Congruent 58 465\n", "21 22 Trial 20 blue_red_Incongruent 35 514\n", "22 23 Debrief NaN Complete 549\n", "23 24 Complete NaN Done 562" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_trials = pd.DataFrame()\n", "\n", "for file in files_raw_trials[0:1]:\n", " tmp_df = pd.read_csv(dir_raw + file)\n", "\n", "tmp_df" ] }, { "cell_type": "markdown", "id": "319f9e17-746e-4050-9c2a-e5e15e0fda70", "metadata": {}, "source": [ "'Participant ID', 'Consent', 'Debrief', and 'Complete' should be in their own columns\n", "\n", "'stimulus' should be broken into its component factors." ] }, { "cell_type": "code", "execution_count": 47, "id": "4aa99378-3f86-42ab-98a5-2e1aa7427c45", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WordColourConditionParticipant IDConsentDebriefCompleteReaction Time
2redredCongruent007CompleteComplete56232
3redblueIncongruent007CompleteComplete56276
4redredCongruent007CompleteComplete56243
5redblueIncongruent007CompleteComplete56285
6redblueIncongruent007CompleteComplete56249
7blueblueCongruent007CompleteComplete5622
8blueblueCongruent007CompleteComplete56255
9blueblueCongruent007CompleteComplete56234
10blueredIncongruent007CompleteComplete56235
11redblueIncongruent007CompleteComplete56271
12blueredIncongruent007CompleteComplete56260
13redredCongruent007CompleteComplete56249
14blueblueCongruent007CompleteComplete56220
15blueblueCongruent007CompleteComplete5628
16redblueIncongruent007CompleteComplete56272
17blueblueCongruent007CompleteComplete56235
18blueblueCongruent007CompleteComplete56256
19blueblueCongruent007CompleteComplete56253
20redredCongruent007CompleteComplete56258
21blueredIncongruent007CompleteComplete56235
\n", "
" ], "text/plain": [ " Word Colour Condition Participant ID Consent Debrief Complete \\\n", "2 red red Congruent 007 Complete Complete 562 \n", "3 red blue Incongruent 007 Complete Complete 562 \n", "4 red red Congruent 007 Complete Complete 562 \n", "5 red blue Incongruent 007 Complete Complete 562 \n", "6 red blue Incongruent 007 Complete Complete 562 \n", "7 blue blue Congruent 007 Complete Complete 562 \n", "8 blue blue Congruent 007 Complete Complete 562 \n", "9 blue blue Congruent 007 Complete Complete 562 \n", "10 blue red Incongruent 007 Complete Complete 562 \n", "11 red blue Incongruent 007 Complete Complete 562 \n", "12 blue red Incongruent 007 Complete Complete 562 \n", "13 red red Congruent 007 Complete Complete 562 \n", "14 blue blue Congruent 007 Complete Complete 562 \n", "15 blue blue Congruent 007 Complete Complete 562 \n", "16 red blue Incongruent 007 Complete Complete 562 \n", "17 blue blue Congruent 007 Complete Complete 562 \n", "18 blue blue Congruent 007 Complete Complete 562 \n", "19 blue blue Congruent 007 Complete Complete 562 \n", "20 red red Congruent 007 Complete Complete 562 \n", "21 blue red Incongruent 007 Complete Complete 562 \n", "\n", " Reaction Time \n", "2 32 \n", "3 76 \n", "4 43 \n", "5 85 \n", "6 49 \n", "7 2 \n", "8 55 \n", "9 34 \n", "10 35 \n", "11 71 \n", "12 60 \n", "13 49 \n", "14 20 \n", "15 8 \n", "16 72 \n", "17 35 \n", "18 56 \n", "19 53 \n", "20 58 \n", "21 35 " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_trials = pd.DataFrame()\n", "\n", "for file in files_raw_trials[0:1]:\n", " tmp_df = pd.read_csv(dir_raw + file)\n", " tmp_trials = tmp_df.loc[tmp_df[\"task\"].str.contains(\n", " \"Trial\"), \"stimulus\"].str.split(\"_\", expand=True).rename(\n", " columns={0: \"Word\", 1: \"Colour\", 2: \"Condition\"})\n", " tmp_trials[\"Participant ID\"] = tmp_df[tmp_df[\"task\"] == \"Participant ID\"][\"value\"].values[0]\n", " tmp_trials[\"Consent\"] = tmp_df[tmp_df[\"task\"] == \"Consent\"][\"value\"].values[0]\n", " tmp_trials[\"Debrief\"] = tmp_df[tmp_df[\"task\"] == \"Debrief\"][\"value\"].values[0]\n", " tmp_trials[\"Complete\"] = tmp_df[tmp_df[\"task\"] == \"Complete\"][\"time_elapsed\"].values[0]\n", " tmp_trials[\"Reaction Time\"] = tmp_df.loc[tmp_df[\"task\"].str.contains(\"Trial\"), \"value\"]\n", "\n", "tmp_trials" ] }, { "cell_type": "code", "execution_count": 48, "id": "956d4db8-191c-40b3-b49f-0692f22ede9a", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WordColourConditionParticipant IDConsentDebriefCompleteReaction Time
2redredCongruent007CompleteComplete56232
3redblueIncongruent007CompleteComplete56276
4redredCongruent007CompleteComplete56243
5redblueIncongruent007CompleteComplete56285
6redblueIncongruent007CompleteComplete56249
...........................
17redredCongruent002CompleteComplete66238
18blueblueCongruent002CompleteComplete66215
19blueredIncongruent002CompleteComplete66238
20blueredIncongruent002CompleteComplete66247
21redredCongruent002CompleteComplete66213
\n", "

200 rows × 8 columns

\n", "
" ], "text/plain": [ " Word Colour Condition Participant ID Consent Debrief Complete \\\n", "2 red red Congruent 007 Complete Complete 562 \n", "3 red blue Incongruent 007 Complete Complete 562 \n", "4 red red Congruent 007 Complete Complete 562 \n", "5 red blue Incongruent 007 Complete Complete 562 \n", "6 red blue Incongruent 007 Complete Complete 562 \n", ".. ... ... ... ... ... ... ... \n", "17 red red Congruent 002 Complete Complete 662 \n", "18 blue blue Congruent 002 Complete Complete 662 \n", "19 blue red Incongruent 002 Complete Complete 662 \n", "20 blue red Incongruent 002 Complete Complete 662 \n", "21 red red Congruent 002 Complete Complete 662 \n", "\n", " Reaction Time \n", "2 32 \n", "3 76 \n", "4 43 \n", "5 85 \n", "6 49 \n", ".. ... \n", "17 38 \n", "18 15 \n", "19 38 \n", "20 47 \n", "21 13 \n", "\n", "[200 rows x 8 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_trials = pd.DataFrame()\n", "\n", "for file in files_raw_trials: # Loop through all trials now\n", " tmp_df = pd.read_csv(dir_raw + file)\n", " tmp_trials = tmp_df.loc[tmp_df[\"task\"].str.contains(\n", " \"Trial\"), \"stimulus\"].str.split(\"_\", expand=True).rename(\n", " columns={0: \"Word\", 1: \"Colour\", 2: \"Condition\"})\n", " tmp_trials[\"Participant ID\"] = tmp_df[tmp_df[\"task\"] == \"Participant ID\"][\"value\"].values[0]\n", " tmp_trials[\"Consent\"] = tmp_df[tmp_df[\"task\"] == \"Consent\"][\"value\"].values[0]\n", " tmp_trials[\"Debrief\"] = tmp_df[tmp_df[\"task\"] == \"Debrief\"][\"value\"].values[0]\n", " tmp_trials[\"Complete\"] = tmp_df[tmp_df[\"task\"] == \"Complete\"][\"time_elapsed\"].values[0]\n", " tmp_trials[\"Reaction Time\"] = tmp_df.loc[tmp_df[\"task\"].str.contains(\"Trial\"), \"value\"]\n", " \n", " data_trials = pd.concat([data_trials, tmp_trials])\n", "\n", "data_trials" ] }, { "cell_type": "markdown", "id": "61b874e0-e372-476b-93ea-d4023f721f2d", "metadata": {}, "source": [ "### Merge Survey and Trials Data\n", "\n", "We want to merge on 'Participant ID', notice that in `survey_data` 'Participant ID' is an integer:" ] }, { "cell_type": "code", "execution_count": 49, "id": "06420905-2262-43de-89ff-d4c902f464b4", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 10 entries, 0 to 0\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Participant ID 10 non-null int64 \n", " 1 Age 10 non-null int64 \n", " 2 Gender 10 non-null object\n", " 3 Income 10 non-null object\n", " 4 Education 10 non-null object\n", "dtypes: int64(2), object(3)\n", "memory usage: 480.0+ bytes\n" ] } ], "source": [ "data_survey.info()" ] }, { "cell_type": "markdown", "id": "a84277ab-e90e-405b-85de-36af07ef296b", "metadata": {}, "source": [ "Merge will give an error since it doesn't match the data type of 'Participant ID' in `data_trials`" ] }, { "cell_type": "code", "execution_count": 50, "id": "10a7acd5-fc54-48ff-966e-75d253530ab9", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Participant IDAgeGenderIncomeEducation
000538MaleMediumUniversity
000823FemaleHighHigh school
000725MaleHighUniversity
000233MaleHighUniversity
001023FemaleMediumHigh school
000120FemaleMediumUniversity
000421MaleLowUniversity
000926MaleLowCollege
000331FemaleLowCollege
000624FemaleMediumUniversity
\n", "
" ], "text/plain": [ " Participant ID Age Gender Income Education\n", "0 005 38 Male Medium University\n", "0 008 23 Female High High school\n", "0 007 25 Male High University\n", "0 002 33 Male High University\n", "0 010 23 Female Medium High school\n", "0 001 20 Female Medium University\n", "0 004 21 Male Low University\n", "0 009 26 Male Low College\n", "0 003 31 Female Low College\n", "0 006 24 Female Medium University" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_survey[\"Participant ID\"] = data_survey[\"Participant ID\"].astype(str).str.zfill(3)\n", "data_survey" ] }, { "cell_type": "markdown", "id": "24d77ccd-99e3-4a75-8594-1b80d9a725c1", "metadata": {}, "source": [ "Now we can merge!" ] }, { "cell_type": "code", "execution_count": 54, "id": "12b5943f-20b2-4181-b248-907b672db867", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Participant IDAgeGenderIncomeEducationWordColourConditionConsentDebriefCompleteReaction Time
000538MaleMediumUniversityredredCongruentCompleteComplete61441
100538MaleMediumUniversityblueblueCongruentCompleteComplete61419
200538MaleMediumUniversityredredCongruentCompleteComplete61442
300538MaleMediumUniversityredredCongruentCompleteComplete61454
400538MaleMediumUniversityredblueIncongruentCompleteComplete61431
.......................................
19500624FemaleMediumUniversityblueblueCongruentCompleteComplete55840
19600624FemaleMediumUniversityredblueIncongruentCompleteComplete55883
19700624FemaleMediumUniversityblueredIncongruentCompleteComplete55870
19800624FemaleMediumUniversityblueblueCongruentCompleteComplete55826
19900624FemaleMediumUniversityredblueIncongruentCompleteComplete55883
\n", "

200 rows × 12 columns

\n", "
" ], "text/plain": [ " Participant ID Age Gender Income Education Word Colour Condition \\\n", "0 005 38 Male Medium University red red Congruent \n", "1 005 38 Male Medium University blue blue Congruent \n", "2 005 38 Male Medium University red red Congruent \n", "3 005 38 Male Medium University red red Congruent \n", "4 005 38 Male Medium University red blue Incongruent \n", ".. ... ... ... ... ... ... ... ... \n", "195 006 24 Female Medium University blue blue Congruent \n", "196 006 24 Female Medium University red blue Incongruent \n", "197 006 24 Female Medium University blue red Incongruent \n", "198 006 24 Female Medium University blue blue Congruent \n", "199 006 24 Female Medium University red blue Incongruent \n", "\n", " Consent Debrief Complete Reaction Time \n", "0 Complete Complete 614 41 \n", "1 Complete Complete 614 19 \n", "2 Complete Complete 614 42 \n", "3 Complete Complete 614 54 \n", "4 Complete Complete 614 31 \n", ".. ... ... ... ... \n", "195 Complete Complete 558 40 \n", "196 Complete Complete 558 83 \n", "197 Complete Complete 558 70 \n", "198 Complete Complete 558 26 \n", "199 Complete Complete 558 83 \n", "\n", "[200 rows x 12 columns]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_clean = pd.merge(data_survey, data_trials, on=\"Participant ID\")\n", "data_clean" ] }, { "cell_type": "markdown", "id": "05544e21-ee1f-4e0b-b05b-d4567ea30b96", "metadata": {}, "source": [ "## Congratulations!!!\n", "\n", "We have now combined each of the raw data files into one clean DataFrame for analysis!\n", "\n", "We can either continue our anlaysis in Python, or write this DataFrame to a file and continue an analysis in another language or software." ] }, { "cell_type": "code", "execution_count": 56, "id": "fa4ac710-ef5f-415c-803c-71548c774c41", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "data_clean.to_csv(\"clean_data.csv\", index=None)" ] }, { "cell_type": "markdown", "id": "81c418f3-a59d-4f65-97fc-466e3a90c9f9", "metadata": {}, "source": [ "## Thank you for attending this workshop!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" } }, "nbformat": 4, "nbformat_minor": 5 }