Text data can come from many sources:
The easier it is to convert your text data into digitally stored text, the cleaner your results will be and the fewer transcription errors you will have.
A text corpus is a large and structured set of texts. It typically stores each text as a raw character string, with metadata and other details stored alongside it.
Examples of typical transformations include:
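As a minimal sketch, assuming the transformations in question include lowercasing, stripping punctuation and numbers, and removing stop words, the steps might look like this in base R. The example sentence and the tiny stop-word list are hypothetical; real stop-word lists are much longer.

```r
# Hypothetical example text; not taken from the original notes
raw_text <- "The Quick brown fox, in 2017, jumped over the lazy dogs!"

# A tiny illustrative stop-word list (an assumption; real lists are much longer)
stop_words <- c("the", "in", "over", "a", "an", "of")

clean_text <- tolower(raw_text)                     # convert to lower case
clean_text <- gsub("[[:punct:]]", "", clean_text)   # remove punctuation
clean_text <- gsub("[[:digit:]]", "", clean_text)   # remove numbers

# Split on runs of whitespace to get individual tokens
tokens <- strsplit(clean_text, "[[:space:]]+")[[1]]
tokens <- tokens[tokens != ""]                      # drop any empty strings
tokens <- tokens[!tokens %in% stop_words]           # remove stop words

print(tokens)
```

The order of the steps matters: lowercasing first means the stop-word list only needs lowercase entries.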
Feature extraction involves converting the text string into quantifiable measures. The most common approach is the bag-of-words model, in which each document is represented as a vector that counts how often each term appears in the document. Combining the vectors for all documents produces a term-document matrix, with one row per term and one column per document.
However, the bag-of-words model ignores word order and context. You could randomly scramble the order of the terms in a document and still get the same term-document matrix.
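To make the bag-of-words idea concrete, here is a minimal sketch in base R that builds a small term-document matrix by hand. The two toy documents are hypothetical, and in practice a text-mining package would normally handle this step.

```r
# Hypothetical toy corpus: two very short "documents"
docs <- c(doc1 = "the cat sat on the mat",
          doc2 = "the dog sat on the log")

# Tokenize each document by splitting on spaces (names are preserved)
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary is every distinct term in the corpus
vocab <- sort(unique(unlist(tokens)))

# Count each term per document; factor() with fixed levels keeps zero counts
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = vocab)))

# Rows are terms, columns are documents
print(tdm)
```

Each column of `tdm` is the bag-of-words vector for one document, so terms that never appear in a document show up as zeros.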
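The scrambling claim can be checked directly. The sketch below, using a hypothetical example sentence, shuffles the tokens of a document and confirms that the term counts are unchanged.

```r
# Hypothetical example document
doc <- "the dog bit the man and the man ran away"
tokens <- strsplit(doc, " ", fixed = TRUE)[[1]]

set.seed(123)                 # make the shuffle reproducible
scrambled <- sample(tokens)   # same tokens, random order

# Count terms over a fixed vocabulary so the two vectors line up
vocab <- sort(unique(tokens))
count_terms <- function(tok) table(factor(tok, levels = vocab))

# Term counts match even though the word order differs
same_counts <- all(count_terms(tokens) == count_terms(scrambled))
print(same_counts)
```

Under this representation, "the dog bit the man" and "the man bit the dog" are the same document.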
At this point your data are assembled and ready for analysis. There are several approaches you may take when analyzing text, depending on your research question. Basic approaches include:
More advanced methods include document classification, which assigns documents to different categories. Classification can be supervised (the potential categories are defined before modeling) or unsupervised (the potential categories are unknown prior to analysis). You might also conduct corpora comparison, which compares the content of different groups of text. This is the approach used in plagiarism-detection software such as Turnitin. Finally, you may attempt to detect clusters of document features, an approach known as topic modeling.
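As one illustration of comparing the content of two texts, the sketch below computes the cosine similarity of their term-frequency vectors in base R. This is a simple, generic measure chosen for illustration; it is not a description of how any particular plagiarism-detection product works, and the two texts are hypothetical.

```r
# Hypothetical texts to compare
text_a <- "text analysis turns documents into numbers"
text_b <- "text analysis turns words into counts"

# Term-frequency vector for a text over a shared vocabulary
term_counts <- function(text, vocab) {
  tok <- strsplit(tolower(text), "[[:space:]]+")[[1]]
  as.numeric(table(factor(tok, levels = vocab)))
}

# Shared vocabulary: every distinct term across both texts
vocab <- unique(unlist(strsplit(tolower(c(text_a, text_b)), "[[:space:]]+")))
vec_a <- term_counts(text_a, vocab)
vec_b <- term_counts(text_b, vocab)

# Cosine similarity: 1 means identical term proportions, 0 means no shared terms
cosine_sim <- sum(vec_a * vec_b) / (sqrt(sum(vec_a^2)) * sqrt(sum(vec_b^2)))
print(cosine_sim)
```

The two texts share four of their six terms, so the similarity comes out at 4/6, or about 0.67.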
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.1 (2017-06-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2017-08-23
## Packages -----------------------------------------------------------------
## package * version date source
## backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
## base * 3.4.1 2017-07-07 local
## compiler 3.4.1 2017-07-07 local
## datasets * 3.4.1 2017-07-07 local
## devtools 1.13.3 2017-08-02 CRAN (R 3.4.1)
## digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## graphics * 3.4.1 2017-07-07 local
## grDevices * 3.4.1 2017-07-07 local
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## knitr 1.17 2017-08-10 cran (@1.17)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.1 2017-07-07 local
## Rcpp 0.12.12 2017-07-15 CRAN (R 3.4.1)
## rmarkdown 1.6 2017-06-15 CRAN (R 3.4.0)
## rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
## stats * 3.4.1 2017-07-07 local
## stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
## stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
## tools 3.4.1 2017-07-07 local
## utils * 3.4.1 2017-07-07 local
## withr 2.0.0 2017-07-28 CRAN (R 3.4.1)
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
This work is licensed under the CC BY-NC 4.0 Creative Commons License.