---
title: How to Convert XML to Dataframe Using xml2
author: Rob Wiederstein
date: '2021-03-05'
slug: xml-to-dataframe
categories:
  - R
tags:
  - xml2
layout: layouts/post/single
draft: no
header:
  image: feature.png
  alt: xml code screenshot
  caption: A dataset that was only available in pdf or xml.
summary: Every time I run into a file with an `.xml` extension, I cringe. Admittedly, though, it's a file format you have to be familiar with when it comes to sending and receiving data over the web. `R` has a package, `xml2`, to assist in the conversion of nested data to tabular data.
repo: https://raw.githubusercontent.com/RobWiederstein/purple-bananas/main/content/post/2021-03-05-xml-to-dataframe/index.Rmd
---

```{r load-packages, include = F}
## Load frequently used packages for blog posts
x <- c(
  'devtools', # for session info
  'ggthemes'
)
lapply(x, library, character.only = T)
```

```{r set-chunk-options, include = F}
## Do not break chunk line
## Do not use spaces or periods "." or underscores "_"
## set options for knitr
knitr::opts_chunk$set(
  comment = '',
  fig.width = 6,
  fig.asp = .8,
  fig.align = "center",
  message = F,
  error = F,
  warning = F,
  tidy = T,
  cache = T,
  dev = 'svg',
  echo = F
)
```

```{r set-ggplot-theme-defaults, include = F}
# from ggthemes
library(ggplot2)
theme_set(ggthemes::theme_fivethirtyeight())
```

```{r define-color-palette, include = F, eval = T}
# color blind friendly palette from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#000000")
```

## Overview

The task here is to take a short, simple `xml` file and convert it to a dataframe. The `xml2` package was designed for exactly this kind of work and is generally regarded as the modern successor to the older `XML` package.

## Background

### Sanity Checks

Let's face it: you wouldn't be on this page if you could've figured it out using W3schools or the `R` help menu. Let's start with some super basic stuff.
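
For orientation, here is a minimal sketch of the end goal described in the Overview: a short XML document turned into a data frame with `xml2`. The toy XML, its tag names (`catalog`, `book`, `title`, `price`), and the resulting column names are invented for illustration and are not the dataset used in this post. It assumes `xml2` is installed.

```r
library(xml2)

# Toy document, invented for illustration only
doc <- read_xml(
  '<catalog>
     <book id="b1"><title>First</title><price>10.50</price></book>
     <book id="b2"><title>Second</title><price>7.25</price></book>
   </catalog>'
)

# One node per record: every <book> element in the document
books <- xml_find_all(doc, ".//book")

# One column per field: an attribute plus the text of two child elements
df <- data.frame(
  id    = xml_attr(books, "id"),
  title = xml_text(xml_find_first(books, "./title")),
  price = as.numeric(xml_text(xml_find_first(books, "./price"))),
  stringsAsFactors = FALSE
)

print(df)
```

The pattern is to pick the repeating element that corresponds to a row, then pull each column out of it with a relative XPath. Real files are rarely this flat, which is why the checks below are worth doing before writing any extraction code.
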
Have you opened the `.xml` file in a browser? Do that first and see if you can get a sense of how the data is organized. Second, open the xml file in a linter and see if it's formatted correctly. (Here's an [example](https://xmllint.com).) I've lost a ton of time double-checking code only to figure out that some `.json` was missing a "}". Finally, open it in a good text editor and beautify the code. (I use Atom.) On my last project, I "peeled" the outer layers of the onion to get at the data. It was hacky and lacked reproducibility, but it worked.

### Definitions

First, it's important to get some definitions straight. (W3schools has a really helpful set of [tutorials](https://www.w3schools.com/xml/default.asp) on the topic.) Broadly speaking, a "node" is an HTML element; the same vocabulary applies to XML.

**DOM (Document Object Model)** - a tree structure that represents the HTML of the website, in which every HTML element is a "node".

**Element Nodes** - model the actual HTML elements in the document.

**Attribute Nodes** - model the attributes of the different HTML elements. Attributes include id, class, title and style.

**Text Nodes** - model the text content inside the different HTML elements.

**Root Node** - the node at the very top of the document tree, usually called the document node.

**Parent Node** - a node that has children; represents an element that has at least one other element or text nested inside it.

**Child Node** - a node that has a parent; represents an element or text that is nested inside another element. For example, a `
` tag is often the child of a `