---
title: "Introduction to pitchRx package"
author: "Carson Sievert"
date: "`r Sys.Date()`"
output: html_document
---

The R package [**pitchRx**](http://cran.r-project.org/web/packages/pitchRx/) provides tools for collecting Major League Baseball (MLB) Gameday data and visualizing [PITCHf/x](http://en.wikipedia.org/wiki/PITCHf/x) data. This page provides a rough overview of its scope, but the [RJournal article](http://cpsievert.github.io/pitchRx/RJwrapper.pdf) is more comprehensive. The [source file](https://github.com/cpsievert/pitchRx/blob/gh-pages/index.Rmd) used to generate this page shows how to embed pitchRx animations into documents using [**knitr**](http://yihui.name/knitr/). If coding isn't your thing, you might want to just [play](http://104.131.111.111:3838/pitchRx/) with my PITCHf/x visualization app!

```{r setup, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(
  fig.path = "figure/",
  cache.path = "cache/",
  fig.align = "center",
  warning = FALSE,
  message = FALSE,
  fig.height = 7,
  fig.width = 5,
  tidy = FALSE
)
#knitr::opts_knit$set(animation.fun = knitr::hook_r2swf)
library(pitchRx)
```

Data Collection
----------------------------

### Collecting 'smallish' data

**pitchRx** makes it simple to acquire PITCHf/x data directly from its source. Here, **pitchRx**'s `scrape()` function is used to collect all PITCHf/x data recorded on June 1st, 2013.

```{r scrape, results = 'hide', cache = TRUE}
library(pitchRx)
dat <- scrape(start = "2013-06-01", end = "2013-06-01")
```

```{r names}
names(dat)
dim(dat[["pitch"]])
```

By default, `scrape()` returns a list of `r length(dat)` data frames. The `'pitch'` data frame contains the actual PITCHf/x data, which is recorded on a pitch-by-pitch basis. The dimensions of this data frame indicate that `r dim(dat[["pitch"]])[1]` pitches were thrown on June 1st, 2013. If your analysis requires PITCHf/x data over many months, you surely don't want to pull all that data into a single `R` session!
For this and other reasons, `scrape()` can write directly to a database (see the "Managing PITCHf/x data in bulk" section).

### Collecting data by Gameday IDs

In the previous example, `scrape()` determines the relevant game IDs from the `start` and `end` dates. If you want a more specific query based on particular games, the relevant game IDs can be passed to the `game.ids` argument using the built-in `gids` data object.

```{r game.ids}
data(gids, package = "pitchRx")
head(gids)
```

As you can see, the `gids` object contains game IDs, and those IDs contain the relevant dates as well as abbreviations for the away and home team names. Since the away team is always listed first, we could do the following to collect PITCHf/x data from every away game played by the Minnesota Twins in June of 2013.

```{r scrape2, results = 'hide', cache = TRUE}
# The away team's abbreviation comes right after the date in each game ID,
# so matching "minmlb_" there selects games where Minnesota is the away team.
MNaway13 <- gids[grep("2013_06_[0-9]{2}_minmlb_", gids)]
dat2 <- scrape(game.ids = MNaway13)
```

### Managing PITCHf/x data in bulk

Creating and maintaining a PITCHf/x database is a breeze with **pitchRx** and [dplyr](http://cran.r-project.org/web/packages/dplyr/index.html). With a few lines of code (and some patience), all available PITCHf/x data can be obtained directly from its source and stored in a local [SQLite](http://www.sqlite.org/) database:

```{r sqlite, eval=FALSE}
library(dplyr)
db <- src_sqlite("pitchfx.sqlite3", create = TRUE)
scrape(start = "2008-01-01", end = Sys.Date(), connect = db$con)
```

The website that hosts PITCHf/x data also hosts a wealth of other data that might come in handy for PITCHf/x analysis. Files that contain PITCHf/x data always end with [inning/inning_all.xml](http://gd2.mlb.com/components/game/mlb/year_2012/month_05/day_01/gid_2012_05_01_arimlb_wasmlb_1/inning/inning_all.xml).
`scrape()` can also collect data from three other types of files: [miniscoreboard.xml](http://gd2.mlb.com/components/game/mlb/year_2012/month_05/day_01/miniscoreboard.xml), [players.xml](http://gd2.mlb.com/components/game/mlb/year_2012/month_05/day_01/gid_2012_05_01_arimlb_wasmlb_1/players.xml), and [inning/inning_hit.xml](http://gd2.mlb.com/components/game/mlb/year_2012/month_05/day_01/gid_2012_05_01_arimlb_wasmlb_1/inning/inning_hit.xml). Data from these files can easily be added to our existing PITCHf/x database:

```{r add, eval=FALSE}
files <- c("miniscoreboard.xml", "players.xml", "inning/inning_hit.xml")
scrape(start = "2008-01-01", end = Sys.Date(), suffix = files, connect = db$con)
```

### Building your own custom scraper

**pitchRx** is built on top of the R package [XML2R](http://cran.r-project.org/web/packages/XML2R/index.html). In [this post](https://baseballwithr.wordpress.com/2014/09/25/write-your-own-gameday-scraper-with-xml2r-and-pitchrx-3/), I demonstrate how to use **XML2R** and **pitchRx** to collect attendance data from the Gameday site (similar methods can be used to collect other Gameday data). For a more detailed look at **XML2R**, see the [introductory webpage](http://cpsievert.github.io/XML2R/) and/or the [RJournal paper](http://cpsievert.github.io/pitchRx/RJwrapper.pdf).

PITCHf/x Visualization
--------------------

### 2D animation

**pitchRx** comes pre-packaged with a `pitches` data frame containing four-seam and cut fastballs thrown by Mariano Rivera and Phil Hughes during the 2011 season. These pitches are used to demonstrate PITCHf/x animations using `animateFX()`. As the animation progresses, the pitches come closer to the viewer (that is, imagine you are the umpire/catcher, watching the pitcher throw directly at you). In the animation below, the horizontal and vertical locations of `pitches` are plotted every tenth of a second until they reach home plate (in real time).
Since looking at animations in real time can be painful, this animation sets the delay between frames to half a second.

```{r ani, fig.show = "animate", interval = 0.5, cache = TRUE, fig.height = 12, fig.width = 12, dev = "CairoPNG"}
# adding ggplot2 functions to customize animateFX() output won't work, but
# you can pass a list to the layer argument like this:
x <- list(
  facet_grid(pitcher_name ~ stand, labeller = label_both),
  theme_bw(),
  coord_equal()
)
animateFX(pitches, layer = x)
```

To avoid a cluttered animation, the `avg.by` option averages the trajectory for each unique value of the variable supplied to `avg.by`.

```{r ani2, fig.show = "animate", interval = 0.5, cache = TRUE, fig.height = 12, fig.width = 12, dev = "CairoPNG"}
animateFX(pitches, avg.by = "pitch_types", layer = x)
```

Note that when using `animateFX()`, you may want to wrap the expression with `animation::saveHTML()` to view the result in a web browser. If you want to include the animation in a document, [knitr](http://yihui.name/knitr/options#chunk_options)'s `fig.show = "animate"` chunk option is very useful.

### Interactive animations

See here for a post on creating interactive animations of PITCHf/x data using the [**animint**](https://github.com/tdhock/animint) package.

### Interactive 3D plots

**pitchRx** also makes use of **rgl** graphics. If I want a more revealing look at Mariano Rivera's pitches, I can subset the `pitches` data frame accordingly. Note that the plot below is interactive, so make sure you have JavaScript & [WebGL](http://get.webgl.org/) enabled (if you do, go ahead - click and drag)!

```{r demo, eval = FALSE}
Rivera <- subset(pitches, pitcher_name == "Mariano Rivera")
interactiveFX(Rivera)
```

### Visualizing pitch locations

#### 2D densities

The `strikeFX()` function can be used to quickly visualize pitch location densities (from the perspective of the umpire).
Here is the density of called strikes thrown by Rivera and Hughes in 2011 (for both right- and left-handed batters).

```{r strike, fig.height=7, fig.width=10, dev="CairoPNG"}
strikes <- subset(pitches, des == "Called Strike")
strikeFX(strikes, geom = "tile") +
  facet_grid(pitcher_name ~ stand) +
  coord_equal() +
  theme_bw() +
  viridis::scale_fill_viridis()
```

#### Probabilistic strike-zone densities

Models that estimate event probabilities _conditioned on pitch location_ provide a better inferential tool than density estimation. Here we use the [**mgcv**](https://cran.r-project.org/web/packages/mgcv/index.html) package to fit a [Generalized Additive Model](https://en.wikipedia.org/wiki/Generalized_additive_model) (GAM), which estimates the probability of a called strike as a function of pitch location and batter stance.

```{r mgcv, fig.height = 7, fig.width = 10, dev = "CairoPNG"}
# keep only pitches the batter did not swing at
noswing <- subset(pitches, des %in% c("Ball", "Called Strike"))
# binary response: 1 for a called strike, 0 for a ball
noswing$strike <- as.numeric(noswing$des %in% "Called Strike")
library(mgcv)
m <- bam(strike ~ s(px, pz, by = factor(stand)) + factor(stand),
         data = noswing, family = binomial(link = 'logit'))
x <- list(
  facet_grid(. ~ stand),
  theme_bw(),
  coord_equal(),
  viridis::scale_fill_viridis(name = "Probability of Called Strike")
)
strikeFX(noswing, model = m, layer = x)
```

[Here](https://baseballwithr.wordpress.com/2014/10/23/a-probabilistic-model-for-interpreting-strike-zone-expansion-7/) [are](https://baseballwithr.wordpress.com/2014/11/11/interactive-visualization-of-strike-zone-expansion-5/) [some](http://ssrn.com/abstract=2478447) [other](http://onlinelibrary.wiley.com/doi/10.1002/mde.2630/abstract) places where GAMs were used to understand factors that influence umpire decision making.

### Session Info

```{r info}
devtools::session_info()
```