Every computer program is a series of instructions—a sequence of separate, small commands. The art of programming is to take a general idea and break it apart into separate steps. (This may be just as important as learning the rules and syntax of a particular language.)
Code is written in either an imperative or a declarative style. R uses imperative style, meaning it strings together instructions to the computer. (Declarative style involves telling the computer what the end result should be, as HTML does.) There are many subdivisions of imperative style, but the primary concern for beginning programmers should be procedural style: that is, describing the steps for achieving a task.
Each step/instruction is a statement—words, numbers, or equations that express a thought.
The central processing unit (CPU) of the computer does not understand any of them! The CPU only takes in machine code, which runs directly on the computer’s hardware. Machine code is basically unreadable, though: it’s a series of tiny numerical operations.
Programming languages bridge the gap between machine code (and the hardware it runs on) and the human programmer. Several popular languages, R among them, are interpreted rather than compiled: an interpreter translates your statements into machine instructions while the program runs, instead of converting the whole program beforehand. What we call our source code is the set of statements we write in our preferred language, which is then translated for the machine.
Source code is simply written in plain text in a text editor. Do not use a word processor.
The computer recognises source code by its file extension. For us, that means the “.R” extension (an R Markdown notebook uses “.Rmd”).
While you do not need a special program to write code, it is usually a good idea to use an IDE (integrated development environment) to help you. Many people (like me) use the oXygen IDE for editing XML documents and creating transformations with XSLT. Python users often use PyCharm or the tools bundled with Anaconda. For R, I like to use RStudio (more on that in a moment).
Short answer: because I like R. I have learned some Python, too, but for some reason R worked better for me. This suggests an important takeaway from this session: there is no single language that is better than any other. What you choose to work with will depend on what materials you are working on, what level of comfort you have with a given language, and what kinds of outputs you would like from your code.
For example, if I am primarily interested in text-based edition projects, I would be wise to work mostly with XML technologies: TEI-XML, XPath, XSLT, and XQuery, just to name a few. However, I have seen people use Python and JavaScript to transform XML. While I would advocate XSLT for such an operation, it is better for you to use your preferred language to get things done.
That all said, R does have some distinct advantages:
The visualisation libraries are excellent.
Being so dependent on variables, the code is more readable than many other languages (like JavaScript).
It was built by statisticians, so it is well suited to dealing with structured text and data sets.
When you first launch R, you will see a console:
This interface allows you to run R commands just as you would run commands in a Bash shell.
When you open this file in RStudio, the command line interface (labeled “Console”) is below the editing window. When you run a code block in the editing window, you will see the results appear in the Console below.
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Note: the ~ takes you to your home directory in a Unix-based system like macOS; it’s a handy shortcut. In Windows you would need to type out the file path, so something like C:\Users\[username]\Desktop. A handy tip: start to type your file path and press the Tab key to see a dropdown menu of the files and folders at your current location.
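As a rough sketch of how the working directory and the ~ shortcut fit together (the Desktop folder is only an assumed example location, not something this notebook requires):

```r
# Show the directory R is currently working in
getwd()

# Expand the ~ shortcut into a full path (the result depends on your system)
home <- path.expand("~")

# Build a path to a Desktop folder; this location is an assumption for illustration
desktop <- file.path(home, "Desktop")

# Only change directory if that folder actually exists on this machine
if (dir.exists(desktop)) {
  setwd(desktop)
}
```

Using file.path rather than pasting slashes together by hand lets R assemble the path with a separator it accepts on your system.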
R can be good for doing some math. Say I am making a travel budget, and I want to add the cost of hotel and flight prices for a trip to Seattle. The flight is £550 and the hotel price per night is £133. R can do the work for you.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter. (On a Windows machine you would press Ctrl+Shift+Enter.)
550 + 133
## [1] 683
R can make all kinds of calculations, so if you want to get the total cost of a five-day trip to Seattle, you can add an operator for multiplication.
550 + 133 * 5
## [1] 1215
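Note that R multiplied before it added. A small sketch of how that ordering works, and how parentheses change it (the variable names are made up for illustration):

```r
flight <- 550           # one return flight
hotel.per.night <- 133  # price of one hotel night
nights <- 5

# Multiplication happens first: flight + (hotel.per.night * nights)
total <- flight + hotel.per.night * nights
total    # 1215

# Parentheses force the addition first, which is NOT what we want here
grouped <- (flight + hotel.per.night) * nights
grouped  # 3415
```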
To make this effective, we need to store these kinds of calculations in variables. Variables can be assigned with either <-, =, or ->. Let’s do that, and let’s compare the price of a 5-day trip to Seattle to a 7-day trip to Paris.
sea.trip.v <- 550 + 133 * 5
paris.trip.v <- 150 + 100 * 7
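For reference, here is a minimal sketch showing that the three assignment forms produce the same result (the variable names a.v, b.v and c.v are just placeholders):

```r
a.v <- 10   # leftward arrow, the most common R convention
b.v = 10    # equals sign also works at the top level
10 -> c.v   # rightward arrow: value first, name second

# All three variables hold the same value
identical(a.v, b.v) && identical(b.v, c.v)  # TRUE
```

Most R code you will read sticks to <-, so that is probably the safest habit to pick up.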
Which is the more expensive trip?
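One way to check, sketched with a logical comparison and a small named vector (trips.v is a name invented for this example):

```r
sea.trip.v <- 550 + 133 * 5
paris.trip.v <- 150 + 100 * 7

# A comparison returns TRUE or FALSE
sea.trip.v > paris.trip.v   # TRUE: Seattle costs more

# Alternatively, label the totals and ask which one is largest
trips.v <- c(seattle = sea.trip.v, paris = paris.trip.v)
names(which.max(trips.v))   # "seattle"
```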
Guess we should go to Paris. What if I just want to do both?
sea.trip.v + paris.trip.v
## [1] 2065
Suppose further that I wanted to add in an optional 3-day trip to New York City. I want to see which trip would be more expensive if I were to take two out of the three options.
nyc.trip.v <- 300 + 150 * 3
sea.and.nyc <- sea.trip.v + nyc.trip.v
sea.and.paris <- sea.trip.v + paris.trip.v
sea.and.nyc > sea.and.paris
## [1] FALSE
Above you can see how powerful even simple R programming can be: you can store mathematical operations in named variables and then use those variables to work with other variables (this becomes very important in word frequency calculations).
You see how this works, and how quickly one can store variables for even practical questions.
There are other important kinds of R data structures that you should know. The first is a vector, which is an ordered collection of values of the same type (all numbers or all words, for instance) stored under a single name. An easy way to create a vector is to use the c function, which basically means “combine.”
v1 <- c("i", "wait", "with", "bated", "breath")
# confirm the value of the variable by running v1
v1
## [1] "i" "wait" "with" "bated" "breath"
# identify a specific value by giving its position in brackets
v1[4]
## [1] "bated"
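Bracket indexing can do more than pick out one item. A short sketch of a few other common patterns (n.v is a throwaway name used only here):

```r
v1 <- c("i", "wait", "with", "bated", "breath")

length(v1)     # how many items the vector holds: 5
v1[2:3]        # a range of positions: "wait" "with"
v1[c(1, 5)]    # several chosen positions: "i" "breath"
v1[-1]         # everything except the first item

# A vector holds one type only: mixing numbers and words turns them all into text
n.v <- c(1, "two", 3)
class(n.v)     # "character"
```

Unlike many other languages, R starts counting positions at 1, not 0.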
The other important data structure is called a data frame. This is probably the most useful for sophisticated analyses, because it renders the data in a table that is very similar to a spreadsheet. It is also more than that: a data frame is actually a special kind of list of vectors and factors that have the same length.
In practice you will often enter your data in a spreadsheet (Excel or Google Sheets) and then export it as a comma-separated values (.csv) or tab-separated values (.tsv) file.
Data frames are basically two-dimensional tables, whereas vectors are one-dimensional. (Unlike a true R matrix, each column of a data frame can hold a different type of data.) Suppose you have a group of texts and you want to keep track of some of their metadata.
David Copperfield / Charles Dickens / novel / British
Pictures from Italy / Charles Dickens / nonfiction / British
Leaves of Grass / Walt Whitman / poetry / American
Sartar Resartus / Thomas Carlyle / nonfiction / British
We can create a data frame to arrange this material in tabular format:
title <- c("David Copperfield", "Pictures from Italy", "Leaves of Grass", "Sartar Resartus")
author <- c("Charles Dickens", "Charles Dickens", "Walt Whitman", "Thomas Carlyle")
genre <- c("novel", "nonfiction", "poetry", "nonfiction")
nationality <- c("British", "British", "American", "British")
Here we have just created variables containing vectors. Next we use the data.frame function, which takes the vector variables as arguments and combines them into a table.
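# stringsAsFactors = TRUE reproduces the factor output below (since R 4.0.0 the default is FALSE)
metadata <- data.frame(title, author, genre, nationality, stringsAsFactors = TRUE)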
str(metadata)
## 'data.frame': 4 obs. of 4 variables:
## $ title : Factor w/ 4 levels "David Copperfield",..: 1 3 2 4
## $ author : Factor w/ 3 levels "Charles Dickens",..: 1 1 3 2
## $ genre : Factor w/ 3 levels "nonfiction","novel",..: 2 1 3 1
## $ nationality: Factor w/ 2 levels "American","British": 2 2 1 2
summary(metadata)
## title author genre nationality
## David Copperfield :1 Charles Dickens:2 nonfiction:2 American:1
## Leaves of Grass :1 Thomas Carlyle :1 novel :1 British :3
## Pictures from Italy:1 Walt Whitman :1 poetry :1
## Sartar Resartus :1
You have just created a data frame. The str function shows you the structure of the data frame, and the summary function shows you the unique values, among other interesting facts. The dollar sign ($) can be used to identify specific variables in the data frame.
metadata$author
## [1] Charles Dickens Charles Dickens Walt Whitman Thomas Carlyle
## Levels: Charles Dickens Thomas Carlyle Walt Whitman
metadata$nationality
## [1] British British American British
## Levels: American British
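Beyond the dollar sign, square brackets let you pull out rows and cells. The following is a minimal sketch, rebuilding the same table so that it runs on its own (the name nonfiction is invented for the example, and stringsAsFactors = FALSE keeps the columns as plain text):

```r
metadata <- data.frame(
  title = c("David Copperfield", "Pictures from Italy", "Leaves of Grass", "Sartar Resartus"),
  author = c("Charles Dickens", "Charles Dickens", "Walt Whitman", "Thomas Carlyle"),
  genre = c("novel", "nonfiction", "poetry", "nonfiction"),
  nationality = c("British", "British", "American", "British"),
  stringsAsFactors = FALSE
)

# Rows go before the comma, columns after it
metadata[2, "author"]    # "Charles Dickens"

# Keep only the rows where a condition is TRUE
nonfiction <- metadata[metadata$genre == "nonfiction", ]
nonfiction$title         # "Pictures from Italy" "Sartar Resartus"

# Count how often each value occurs in a column
table(metadata$nationality)
```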
This is a fairly simple example to show you the syntax and meaning of a data frame, but most of you will be loading data into R. (Though you should remember that the data.frame function is often used in code to transform lists.) Usually that data comes from spreadsheet software (Microsoft Excel, Apple Numbers, Google Sheets).
To load data we use the read.csv or read.table function. (See Gries, pp. 53-54.) From our GitHub site, download the bow-in-the-cloud-metadata-box1.csv file. Let’s use that to run some experiments on data frames.
rm(list = ls(all=TRUE))
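# stringsAsFactors = TRUE reproduces the factor output below (since R 4.0.0 the default is FALSE)
bow.metadata <- read.csv(file = "bow-in-the-cloud-metadata-box1.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)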
str(bow.metadata)
## 'data.frame': 84 obs. of 14 variables:
## $ Reference.Number : Factor w/ 1 level "Eng MS 414": 1 1 1 1 1 1 1 1 1 1 ...
## $ Volume : Factor w/ 1 level "Box 1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Item.number : Factor w/ 84 levels "# 44d","10","11",..: 14 15 16 32 46 64 79 80 81 82 ...
## $ PARENT.WORK.TITLE : Factor w/ 59 levels "“A Voice from America’s Plains”; \"A Cry from the Isles of the West”",..: 31 5 21 10 10 53 10 41 10 7 ...
## $ Creator : Factor w/ 27 levels "Anon.","Anonymous",..: 20 2 4 4 4 4 21 21 14 14 ...
## $ Creator.Role : Factor w/ 2 levels "Author","Photographer": 1 1 1 1 1 1 1 1 1 1 ...
## $ Date.Created..YYYY.MM.DD. : Factor w/ 52 levels "1825-11-05","1825-11-06",..: 52 45 7 24 36 36 38 38 30 33 ...
## $ Creation.Site : Factor w/ 17 levels "","England: Bedfordshire: Woburn",..: 13 8 14 14 14 14 5 5 4 4 ...
## $ Location.in.Bow.in.the.Cloud: Factor w/ 32 levels "","103–5","106–7; 33–34",..: 1 1 3 1 1 32 1 8 1 1 ...
## $ Language : Factor w/ 2 levels "English","n/a": 1 1 1 1 1 1 1 1 1 1 ...
## $ Support : Factor w/ 3 levels "","Paper","Photo card": 2 2 2 2 2 2 2 1 2 2 ...
## $ Extent : Factor w/ 8 levels "","1 sheet",..: 2 2 5 2 2 3 2 2 2 2 ...
## $ Notes : Factor w/ 23 levels "","“West Indian Salesman”, the enclosed poem, was apparently not published.",..: 17 1 6 1 1 20 1 1 1 7 ...
## $ X : logi NA NA NA NA NA NA ...
bow.metadata$Creator[1:10]
## [1] Rawson, Mary Anne (MAR) (1801–1887)
## [2] Anonymous
## [3] Barton, Bernard (1784–1849)
## [4] Barton, Bernard (1784–1849)
## [5] Barton, Bernard (1784–1849)
## [6] Barton, Bernard (1784–1849)
## [7] Roscoe, Jane (1797-1853)
## [8] Roscoe, Jane (1797-1853)
## [9] Hill , Thomas
## [10] Hill , Thomas
## 27 Levels: Anon. Anonymous Ball, Dinah ... Wrangham, Francis (1769–1842)
You may also want to output a file using write.table.
write.table(bow.metadata, file = "bow-metadata-df.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
In your working directory you should now have a new tab-separated (.tsv) file that looks quite similar to the original spreadsheet. Again, not particularly interesting here, but in many cases you will find yourself turning vectors into data frames in R, and then outputting your results into files like this one. It’s also important to know the difference between the two functions: read.csv reads a file into a data frame, while write.table writes a data frame out to a file.
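To see writing and reading side by side, here is a small round-trip sketch. It uses a made-up two-row data frame and a temporary file, so nothing in your working directory is touched:

```r
# A tiny, invented data frame; note the commas inside the author names
small.df <- data.frame(
  title = c("David Copperfield", "Leaves of Grass"),
  author = c("Dickens, Charles", "Whitman, Walt"),
  stringsAsFactors = FALSE
)

# Write it out as tab-separated text to a temporary file
tmp.file <- tempfile(fileext = ".tsv")
write.table(small.df, file = tmp.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back in; read.delim is read.table with tab-separated defaults
back.df <- read.delim(tmp.file, stringsAsFactors = FALSE)
back.df$author   # "Dickens, Charles" "Whitman, Walt"
```

Tabs are a sensible separator here because the names themselves contain commas, which would split the columns in an unquoted comma-separated file.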
Get used to the functions that help you understand R: ? and example().
?c
example(c, echo = FALSE) # change the echo value to TRUE to get the results
The c function is widely used, but it is really only useful for creating small data sets. Many of you will probably want to load text files.
Jeff Rydberg-Cox provides some helpful tips for preparing data for R processing:
Download the text(s) from a source repository.
Remove extraneous material from the text(s).
Transform the text(s) to answer your research questions.
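The second and third steps might look something like the sketch below. The start and end marker lines are invented for illustration (real repositories use their own boilerplate), and splitting on non-word characters is only one of many possible transformations:

```r
# A pretend download: two lines of text wrapped in made-up marker lines
raw.v <- c("*** START OF TEXT ***",
           "Chapter I",
           "My father's family name being Pirrip.",
           "*** END OF TEXT ***")

# Remove extraneous material: keep only what sits between the markers
start <- which(raw.v == "*** START OF TEXT ***")
end <- which(raw.v == "*** END OF TEXT ***")
body.v <- raw.v[(start + 1):(end - 1)]

# Transform: join the lines, lowercase them, and split into words
text.v <- tolower(paste(body.v, collapse = " "))
words.v <- unlist(strsplit(text.v, "\\W+"))
words.v <- words.v[words.v != ""]   # drop any empty strings left by the split
words.v
```

Notice that this crude split breaks "father's" into "father" and "s"; how you handle cases like that depends on your research question.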
The best way to load text files is with the scan function. First, download a text file of Dickens’s Great Expectations into your working directory (it is also available in our corpus directory, in the c19-20 subdirectory).
dickens.v <- scan("dickens_great-expectations.txt", what="character", sep="\n", encoding = "UTF-8")
You have now loaded Great Expectations into a variable called dickens.v. It is now a vector of lines in the book that can be analysed. Let’s see if that is true.
head(dickens.v)
## [1] "Chapter I"
## [2] "My father's family name being Pirrip, and my Christian name Philip, my infant tongue could make of both names nothing longer or more explicit than Pip. So, I called myself Pip, and came to be called Pip."
## [3] "I give Pirrip as my father's family name, on the authority of his tombstone and my sister,—Mrs. Joe Gargery, who married the blacksmith. As I never saw my father or my mother, and never saw any likeness of either of them (for their days were long before the days of photographs), my first fancies regarding what they were like were unreasonably derived from their tombstones. The shape of the letters on my father's, gave me an odd idea that he was a square, stout, dark man, with curly black hair. From the character and turn of the inscription, \"Also Georgiana Wife of the Above,\" I drew a childish conclusion that my mother was freckled and sickly. To five little stone lozenges, each about a foot and a half long, which were arranged in a neat row beside their grave, and were sacred to the memory of five little brothers of mine,—who gave up trying to get a living, exceedingly early in that universal struggle,—I am indebted for a belief I religiously entertained that they had all been born on their backs with their hands in their trousers-pockets, and had never taken them out in this state of existence."
## [4] "Ours was the marsh country, down by the river, within, as the river wound, twenty miles of the sea. My first most vivid and broad impression of the identity of things seems to me to have been gained on a memorable raw afternoon towards evening. At such a time I found out for certain that this bleak place overgrown with nettles was the churchyard; and that Philip Pirrip, late of this parish, and also Georgiana wife of the above, were dead and buried; and that Alexander, Bartholomew, Abraham, Tobias, and Roger, infant children of the aforesaid, were also dead and buried; and that the dark flat wilderness beyond the churchyard, intersected with dikes and mounds and gates, with scattered cattle feeding on it, was the marshes; and that the low leaden line beyond was the river; and that the distant savage lair from which the wind was rushing was the sea; and that the small bundle of shivers growing afraid of it all and beginning to cry, was Pip."
## [5] "\"Hold your noise!\" cried a terrible voice, as a man started up from among the graves at the side of the church porch. \"Keep still, you little devil, or I'll cut your throat!\""
## [6] "A fearful man, all in coarse gray, with a great iron on his leg. A man with no hat, and with broken shoes, and with an old rag tied round his head. A man who had been soaked in water, and smothered in mud, and lamed by stones, and cut by flints, and stung by nettles, and torn by briars; who limped, and shivered, and glared, and growled; and whose teeth chattered in his head as he seized me by the chin."
The head function works like the basic Unix head command: it shows the first part of an object (the first six elements by default). This can be useful for testing whether your code has worked.
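If you do not have the novel to hand, you can try the same pattern on a tiny stand-in file. This sketch writes three lines to a temporary file first (lines.v and tmp.file are names invented for the example):

```r
# Create a small stand-in text file so the example runs anywhere
tmp.file <- tempfile(fileext = ".txt")
writeLines(c("Chapter I",
             "My father's family name being Pirrip.",
             "Ours was the marsh country."), tmp.file)

# Read it one line per vector element, exactly as with the novel above
lines.v <- scan(tmp.file, what = "character", sep = "\n",
                encoding = "UTF-8", quiet = TRUE)

length(lines.v)        # number of lines read: 3
head(lines.v, n = 2)   # only the first two elements
tail(lines.v, n = 1)   # the last element
```

The n argument controls how many elements head and tail return, which is handy when each element is a long paragraph.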