--- title: How to Convert XML to Dataframe Using Xlm2 author: Rob Wiederstein date: '2021-03-05' slug: xml-to-dataframe categories: - R tags: - xlm2 layout: layouts/post/single draft: no header: image: feature.png alt: xml code screenshot caption: A dataset that was only available in pdf or xml. summary: Everytime I run into a file with an `.xml` extension, I cringe. Though, admittedly, it's a file format that you have to be familiar with when it comes to sending and receiving data over the web. `R` has a package `xlm2` to assist in the conversion of nested data to tabular data. repo: https://raw.githubusercontent.com/RobWiederstein/purple-bananas/main/content/post/2021-03-05-xml-to-dataframe/index.Rmd --- ```{r load-packages, include = F} ## Load frequently used packages for blog posts x <- c( 'devtools', #for session info 'ggthemes' ) lapply(x, library, character.only = T) ``` ```{r set-chunk-options, include = F} ## Do not break chunk line ## Do not use spaces or periods "." or underscores "_" ## set options for knitr knitr::opts_chunk$set( comment = '', fig.width = 6, fig.asp = .8, fig.align="center", message=F, error=F, warning=F, tidy=T, comment='', cache=T, dev='svg', echo=F ) ``` ```{r set-ggplot-theme-defaults, include = F} #from ggthemes library(ggplot2); theme_set(ggthemes::theme_fivethirtyeight()) ``` ```{r define-color-palette, include = F, eval = T} # color blind friendly palette from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/ cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#000000") ``` ## Overview The task here is to take an easy and short `xml` file and convert it to a dataframe. The `xml2` package was specifically designed for the task and is the successor to the `xml` package. ## Background ### Sanity Checks Let's face it: you wouldn't be on this page if you could've figured it out using W3schools or the `R` help menu. Let's start with some super basic stuff. Have you opened the `.xml` file using a browser? Do that first and see if you can get a sense of how the data is organized. Second, open the xml file in a linter and see if it's formatted correctly. (Here's an [example](https://xmllint.com)) I've lost a ton of time double checking code only to figure out that some `.json` was missing a "}". Finally, open it in a good text editor and beautify the code. (I use Atom.) My last project, I "peeled" the outer layers of the onion to get at the data. It was hacky and lacked reproducibility, but it worked. ### Definitions First, it's important to get some defintions first. (W3schools has a really helpful set of [tutorials](https://www.w3schools.com/xml/default.asp) on the topic.) A "node" in general and speaking in a broad way is an HTML element. **DOM (Document Object Model)** - a tree structure that represents the HTML of the website, and every HTML element is a "node". **Element Nodes** - model the actual HTML elements in the document. **Attribute Nodes** - model the various attributes in the different HTML elements. Attributes include id, class, title and style. **Text Nodes** - model the text content inside the different HTML elements. **Root Node** - the node on the very top of the document tree, usually called the document node. **Parent Node** - a node that has children; represents an element that has at least one other element or text nested inside it. **Child Node** - a node that has a parent; represents an element or text that is nested inside another element. For example, a `

` tag is often the child of a `

` tag. ```{r load-more-libraries, include = F} library(xml2) ``` ### Example ```{r pressure, echo=FALSE, fig.cap="See [DOM model](https://en.wikipedia.org/wiki/Document_Object_Model) on Wikipedia", out.width = '65%'} knitr::include_graphics("wikipedia-dom-model.png") ``` ### Structure XML structured data looks like this: ``` #xml prologue ..... ``` ### Xpath XPath is a syntax for defining and navigating parts of an XML document. They can be used across many languages. Xpath has over 200 functions according to W3schools. See the W3 page for [xpath syntax](https://www.w3schools.com/xml/xpath_syntax.asp). ## XML Data All of the functions below are from the `xlm2` package and the help [page](https://xml2.r-lib.org). ```{r read-in-xml, echo = T} #get W3 data url <- "https://www.w3schools.com/xml/simple.xml" # read breakfast menu bkfst <- read_xml(x = url) str(bkfst) ``` ```{r xml-name, echo = T} xml_name(bkfst) ``` ### See XML Data Structure ```{r xml-structure, echo = T} xml_structure(bkfst) ``` ```{r xml-text, echo = T} # see text xml_text(bkfst) ``` ### Extract XML The real key is understanding how to define or set the "xpath" argument. ```{r get-bkfst-name, echo = T} xml_find_all(bkfst, xpath = "//name") ``` ```{r get-bkfst-price, echo = T} xml_text(xml_find_all(bkfst, xpath = "//name")) ``` ### Build table ```{r build-bkfst-table, echo = T} library(tibble) name <- xml_text(xml_find_all(bkfst, xpath = "//name")) price <- xml_text(xml_find_all(bkfst, xpath = "//price")) df <- tibble(name = name, price = price) df ``` ## Lists and XML Lists and xml are similar and easily converted within the `xml2` package. ```{r create-nested-list, echo = T} document <- list(root = list(parent = list(child = list(1)))) xmlDoc <- xml2::as_xml_document(document) xml2::xml_structure(xmlDoc) ``` ```{r extract-one} xml_find_all(xmlDoc, xpath = ".//child") ``` ## Conclusion If you can get a basic understanding of html web pages and xpath syntax, then you should be able to parse xml files efficiently with `xlm2`. ## Acknowledgements (Get bibliographic stuff from "archetype hill".) ## References ## Disclaimer The views, analysis and conclusions presented within this paper represent the author’s alone and not of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article was correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimd as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article. ## Reproducibility ```{r reproducibility, echo = FALSE} ## Reproducibility info options(width = 120) session_info() ```