{ "cells": [ { "metadata": { "toc": true }, "cell_type": "markdown", "source": "

Table of Contents

\n
" }, { "metadata": {}, "cell_type": "markdown", "source": "# Factors in R" }, { "metadata": {}, "cell_type": "markdown", "source": "* List of all possible values of a variable in string format (special string type)\n* Limited number of different values(categories) \n* Stored as a vector of integer values \n* factor's levels will always be character values\n* `factor()` function is used to create a factor\n* `levels()` function can be useful to check the levels of factor variable " }, { "metadata": {}, "cell_type": "markdown", "source": "## Creating or Converting" }, { "metadata": { "scrolled": true, "trusted": false }, "cell_type": "code", "source": "# To create factors in R, you make use of the function factor()\n# Sex vector\nsex_vect <- c(\"Male\", \"Female\", \"Female\", \"Male\", \"Male\", \"Female\")\n# Convert vector into a factor\nfactor_sex_vect <- factor(sex_vect)\nprint(factor_sex_vect)", "execution_count": 42, "outputs": [ { "name": "stdout", "output_type": "stream", "text": "[1] Male Female Female Male Male Female\nLevels: Female Male\n" } ] }, { "metadata": {}, "cell_type": "markdown", "source": "## Types of Categorical variables" }, { "metadata": {}, "cell_type": "markdown", "source": "

" }, { "metadata": {}, "cell_type": "markdown", "source": "## Factor levels:\n* Assign the levels(using argument) while creating factor or change the names of these levels using `levels()` function. \n* Post check the levels using `levels()` function\n* There is also another function `nlevels()` to check the number of levels" }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "# Animals without order\nanimals_vector <- c(\"Elephant\", \"Dog\", \"Donkey\", \"Horse\")\nfactor_animals_vector <- factor(animals_vector)\nprint(factor_animals_vector)\nprint(levels(factor_animals_vector))\nnlevels(factor_animals_vector) # to check the number of levels\n# Temperature with order\ntemperature_vector <- c(\"High\", \"Low\", \"High\",\"Low\", \"Medium\")\nfactor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c(\"Low\", \"Medium\", \"High\"))\nnlevels(factor_temperature_vector)", "execution_count": 2, "outputs": [ { "output_type": "stream", "text": "[1] Elephant Dog Donkey Horse \nLevels: Dog Donkey Elephant Horse\n[1] \"Dog\" \"Donkey\" \"Elephant\" \"Horse\" \n", "name": "stdout" }, { "output_type": "display_data", "data": { "text/plain": "[1] 4", "text/latex": "4", "text/markdown": "4", "text/html": "4" }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": "[1] 3", "text/latex": "3", "text/markdown": "3", "text/html": "3" }, "metadata": {} } ] }, { "metadata": {}, "cell_type": "markdown", "source": "## Ordered factor\n* An \"ordered\" factor is a factor whose levels have a particular order\n* Create ordered factors with the `ordered()` command, or by using `factor()` with the `ordered=TRUE` argument" }, { "metadata": { "scrolled": true, "trusted": false }, "cell_type": "code", "source": "# Temperature with order\ntemperature_vector <- c(\"High\", \"Low\", \"High\",\"Low\", \"Medium\")\nfactor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c(\"Low\", \"Medium\", \"High\"))\nprint(factor_temperature_vector)\n# levels(factor_vector) <- c(\"name1\", \"name2\",...) # syntax to assign the levels\nprint(levels(factor_temperature_vector))", "execution_count": 44, "outputs": [ { "name": "stdout", "output_type": "stream", "text": "[1] High Low High Low Medium\nLevels: Low < Medium < High\n[1] \"Low\" \"Medium\" \"High\" \n" } ] }, { "metadata": { "trusted": false }, "cell_type": "code", "source": "# High\nHigh <- factor_temperature_vector[1]\n\n# Low\nLow <- factor_temperature_vector[2]\n\n# check whether High greater than Low? similarly we can use for heavier, larger, faster, strongly agree depends on the context of the data \nHigh > Low", "execution_count": 45, "outputs": [ { "data": { "text/html": "TRUE", "text/latex": "TRUE", "text/markdown": "TRUE", "text/plain": "[1] TRUE" }, "metadata": {}, "output_type": "display_data" } ] }, { "metadata": {}, "cell_type": "markdown", "source": "## labels:\n* Assign the labels through the labels= argument\n* They are better than using simple integer labels because factors are self describing: `\"low\", \"medium\",` and `\"high\"` is more descriptive than `1, 2, 3`" }, { "metadata": { "trusted": false }, "cell_type": "code", "source": "data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)\nfdata = factor(data)\nprint(fdata) # without labels \nrdata = factor(data,labels=c(\"Low\",\"Medium\",\"High\"))\nprint(rdata) # with labels", "execution_count": 46, "outputs": [ { "name": "stdout", "output_type": "stream", "text": " [1] 1 2 2 3 1 2 3 3 1 2 3 3 1\nLevels: 1 2 3\n [1] Low Medium Medium High Low Medium High High Low Medium\n[11] High High Low \nLevels: Low Medium High\n" } ] }, { "metadata": {}, "cell_type": "markdown", "source": "## Generating Factor Levels(using `gl()` function)" }, { "metadata": {}, "cell_type": "markdown", "source": "`gl()` function generates factors by specifying the pattern of their levels. \n \n`gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)`\n \n`n`: number of levels \n`k`: number of replications \n`length`: length of the result \n`labels`: labels for the resulting factor levels \n`ordered`: whether the result sould be ordered or not \n \n \n`gl(3,2,labels = c(\"green\",\"red\",\"yellow\"))` \n[1] green green red red yellow yellow \nLevels: green red yellow " }, { "metadata": { "trusted": false }, "cell_type": "code", "source": "# usage of gl function in data frame \nclinical.trial <-\n data.frame(patient = 1:100,\n age = rnorm(100, mean = 60, sd = 6),\n treatment = gl(2, 50,\n labels = c(\"Treatment\", \"Control\")),\n center = sample(paste(\"Center\", LETTERS[1:5]), 100, replace = TRUE)) \nprint(head(clinical.trial,20))", "execution_count": 47, "outputs": [ { "name": "stdout", "output_type": "stream", "text": " patient age treatment center\n1 1 56.37911 Treatment Center B\n2 2 72.05500 Treatment Center D\n3 3 54.26399 Treatment Center D\n4 4 59.67424 Treatment Center A\n5 5 58.36585 Treatment Center C\n6 6 50.42562 Treatment Center C\n7 7 61.84052 Treatment Center C\n8 8 57.49042 Treatment Center C\n9 9 59.41043 Treatment Center C\n10 10 57.12441 Treatment Center D\n11 11 58.12204 Treatment Center E\n12 12 59.25759 Treatment Center C\n13 13 51.23221 Treatment Center E\n14 14 56.24048 Treatment Center E\n15 15 55.28236 Treatment Center D\n16 16 62.27550 Treatment Center A\n17 17 64.87295 Treatment Center C\n18 18 55.22675 Treatment Center C\n19 19 56.19110 Treatment Center C\n20 20 67.15331 Treatment Center C\n" } ] }, { "metadata": {}, "cell_type": "markdown", "source": "## `droplevels()`\n* we can drop the Unused Levels from the Factors using 'droplevels()' function, since the factor list sticks around even if you remove some data such that no examples of a particular level still exist (see below)" }, { "metadata": { "trusted": false }, "cell_type": "code", "source": "aq <- transform(airquality, Month = factor(Month, labels = month.abb[5:9]))\nprint(levels(aq$Month))\naq <- subset(aq, Month != \"Jul\")\nprint(levels(aq$Month)) # still the same levels\ntable(aq$Month) # even though one level has 0 entries!\ntable(droplevels(aq)$Month)", "execution_count": 48, "outputs": [ { "name": "stdout", "output_type": "stream", "text": "[1] \"May\" \"Jun\" \"Jul\" \"Aug\" \"Sep\"\n[1] \"May\" \"Jun\" \"Jul\" \"Aug\" \"Sep\"\n" }, { "data": { "text/plain": "\nMay Jun Jul Aug Sep \n 31 30 0 31 30 " }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": "\nMay Jun Aug Sep \n 31 30 31 30 " }, "metadata": {}, "output_type": "display_data" } ] }, { "metadata": {}, "cell_type": "markdown", "source": "## working with `is.factor()`, `is.ordered()`, `as.factor()` and `as.ordered()` \n* to check whethr it is factor or ordered factor \n* to convert the vectors into factor vector or ordered factors " }, { "metadata": { "trusted": false }, "cell_type": "code", "source": "is.factor(temperature_vector)\nis.ordered(temperature_vector)\nis.factor(factor_temperature_vector)\nis.ordered(factor_temperature_vector)", "execution_count": 49, "outputs": [ { "data": { "text/html": "FALSE", "text/latex": "FALSE", "text/markdown": "FALSE", "text/plain": "[1] FALSE" }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": "FALSE", "text/latex": "FALSE", "text/markdown": "FALSE", "text/plain": "[1] FALSE" }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": "TRUE", "text/latex": "TRUE", "text/markdown": "TRUE", "text/plain": "[1] TRUE" }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": "TRUE", "text/latex": "TRUE", "text/markdown": "TRUE", "text/plain": "[1] TRUE" }, "metadata": {}, "output_type": "display_data" } ] }, { "metadata": {}, "cell_type": "markdown", "source": "# Date and Time" }, { "metadata": {}, "cell_type": "markdown", "source": "![Imgur](https://i.imgur.com/2RMPVpT.png)\nsource: https://tbrieder.org/epidata/course_reading/e_aragon.pdf \nsource: http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/resources/R/vars.pdf " }, { "metadata": {}, "cell_type": "markdown", "source": "* `strptime()`\n* `POSIXct()` and `POSIXlt()`\n* `as.Date()`" }, { "metadata": { "trusted": false }, "cell_type": "code", "source": "help(strptime) #- to get conversion formats see .", "execution_count": 5, "outputs": [] }, { "metadata": { "trusted": false }, "cell_type": "code", "source": "myDays <- c(\"10/11/1945\", \"8/19/2003\", \"5/15/1964\")\nmyDates<- as.Date(myDays, format = \"%m/%d/%Y\")\nmyDates", "execution_count": 4, "outputs": [ { "data": { "text/html": "
    \n\t
  1. \n\t
  2. \n\t
  3. \n
\n", "text/latex": "\\begin{enumerate*}\n\\item 1945-10-11\n\\item 2003-08-19\n\\item 1964-05-15\n\\end{enumerate*}\n", "text/markdown": "1. 1945-10-11\n2. 2003-08-19\n3. 1964-05-15\n\n\n", "text/plain": "[1] \"1945-10-11\" \"2003-08-19\" \"1964-05-15\"" }, "metadata": {}, "output_type": "display_data" } ] }, { "metadata": {}, "cell_type": "markdown", "source": "References \nhttps://blogisticreflections.wordpress.com/2009/09/21/r-function-of-the-day-table/ \nhttp://www.endmemo.com/program/R/gl.php " }, { "metadata": {}, "cell_type": "markdown", "source": "https://data-flair.training/blogs/r-factor-functions/?fbclid=IwAR2VrzHLFSR3JYjmCCMR9KqE-Yz71KiOE6C_5QVhKXWfJcl8cI7rAxwuQro" } ], "metadata": { "kernelspec": { "name": "r", "display_name": "R", "language": "R" }, "language_info": { "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.1", "file_extension": ".r", "codemirror_mode": "r" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "base_numbering": 1, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }