# Chapter 18 - Data Formats III (XML)

In this chapter, we will learn how to work with one of the most popular structured data formats: [XML](http://www.w3schools.com/xml/). XML is used a lot in Natural Language Processing (NLP), and it is important that you know how to work with it. In theory, you could load XML data just as you read in a text file, but the structure is much too complicated to extract information manually. Therefore, we use an existing library. 

### At the end of this chapter, you will be able to
* read an XML file using `etree.parse`
* read XML from string using `etree.fromstring`
* convert an XML element to a string using `etree.tostring`
* use the following methods and attributes of an XML element (of type `lxml.etree._Element`):
    * to access elements: methods `find`, `findall`, and `getchildren`
    * to access attributes: method `get`
    * to access element information: attributes `tag` and `text`  

* [not needed for assignment] create your own XML and write it to a file

### If you want to learn more about this chapter, you might find the following links useful:
* [XML](http://www.w3schools.com/xml/)
* [detailed XML introduction](http://www.dfki.de/~uschaefer/esslli09/xmlquerylang.pdf)
* [NAF XML](http://www.newsreader-project.eu/files/2013/01/techreport.pdf)
* [Xpath](http://www.w3schools.com/xml/xpath_syntax.asp)
* Other structured data formats: [JSON-LD](http://json-ld.org/), [MicroData](https://www.w3.org/TR/microdata/), [RDF](https://www.w3.org/RDF/)

If you have **questions** about this chapter, please contact us at cltl.python.course@gmail.com.

## 1. Introduction to XML

Natural language processing (NLP) is all about text data. More specifically, we usually want to annotate (manually or automatically) textual data with information about:

* [part of speech](https://en.wikipedia.org/wiki/Part_of_speech)
* [word senses](https://en.wikipedia.org/wiki/Word_sense)
* [syntactic information (in particular dependencies)](https://en.wikipedia.org/wiki/Dependency_grammar)
* [entities](https://en.wikipedia.org/wiki/Entity_linking)
* [semantic role labelling](https://en.wikipedia.org/wiki/Semantic_role_labeling)
* events
* and many more

How should we represent such annotated data? This is what you get from using NLTK tools:

In [None]:
import nltk

In [None]:
text = nltk.word_tokenize("Tom Cruise is an actor.")
text_pos_tagged = nltk.pos_tag(text)
print(text_pos_tagged)
print(type(text_pos_tagged), type(text_pos_tagged[0]))

In this example, we see that the format is a list of [tuples](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences).  The first element of each tuple is the word and the second element is the part of speech tag. Great, so far, this works.  

However, we usually want to store more information. For instance, we may want to indicate that *Tom Cruise* is a named entity. We could also represent syntactic information about this sentence. Now we start to run into difficulties, because some annotations apply to single words and others to combinations of words. In addition, we have more than one annotation per token. Flat, tabular formats such as CSV and TSV are not well suited to representing this kind of linguistic information. So is there a format that handles it better? The answer is yes, and the format is XML.

## 2. Terminology
Let's look at an example (the line numbers are there for explanation purposes). On purpose, we start with a non-linguistic, hopefully intuitive example. In the folder `../Data/xml_data` this XML is stored as the file `course.xml`. You can inspect this file using a text editor (e.g. [Atom](https://atom.io/), [BBEdit](https://www.barebones.com/products/bbedit/download.html) or [Notepad++](https://notepad-plus-plus.org)).

```xml
1.  <Course>
2.      <person role="coordinator">Van der Vliet</person>
3.      <person role="instructor">Van Miltenburg</person>
4.      <person role="instructor">Van Son</person>
5.      <person role="instructor">Postma</person>
7.      <person role="student">Baloche</person>
8.      <person role="student">De Boer</person>
9.      <animal role="student">Rubber duck</animal>
10.     <person role="student">Van Doorn</person>
11.     <person role="student">De Jager</person>
12.     <person role="student">King</person>
13.     <person role="student">Kingham</person>
14.     <person role="student">Mózes</person>
15.     <person role="student">Rübsaam</person>
16.     <person role="student">Torsi</person>
17.     <person role="student">Witteman</person>
18.     <person role="student">Wouterse</person>
19.     <person/>
20. </Course>
```

### 2.1 Elements
Lines 1 to 20 all show examples of [XML elements](http://www.w3schools.com/xml/xml_elements.asp). Each XML element has a **start tag** (e.g. ```<person>```) and an **end tag** (e.g. ```</person>```). An element can contain:
* **text**: *Van der Vliet* on line 2
* **attributes**: *role* attribute in lines 2 to 18
* **elements**: elements can contain other elements, e.g. *person* elements inside the *Course* element. The terminology to talk about this is as follows. In this example, we call `person` the `child` of `Course` and `Course` the `parent` of `person`.

Please note that on line 19 the **start tag** and **end tag** are combined. This is possible when an element has neither children nor text. The syntax for such an element is **``` <START_TAG/>```**.

### 2.2 Root element
A special element is the **root** element. In our example, `Course` is our [root element](https://en.wikipedia.org/wiki/Root_element). The element starts at line 1 (```<Course>```) and ends at line 20 (```</Course>```). Notice the difference between the start tag (no '/') and the end tag (with '/'). The root element is special in that it is the only element without a parent: all other elements are nested inside it.

### 2.3 Attributes
Elements can contain [attributes](http://www.w3schools.com/xml/xml_attributes.asp), which contain information about the element. In this case, this information is the `role` a person has in the course. All attributes are located in the start tag of an XML element.

## 3. Working with XML in Python
Now that we know the basics of XML, we want to be able to access it in Python. In order to work with XML, we will use the [**lxml**](http://lxml.de/) library.

In [None]:
from lxml import etree

We will focus on the following methods/attributes:
* **to parse the XML from file or string**: the methods `etree.parse()` and `etree.fromstring()`
* **to access the root element**: the method `getroot()`
* **to access elements**: the methods `find()`, `findall()`, and `getchildren()`
* **to access attributes**: the method `get()`
* **to access element information**: the attributes `tag` and `text`

### 3.1 Parsing XML from file or string

The **`etree.fromstring()`** method is used to parse XML from a string:

In [None]:
xml_string = """
<Course>
    <person role="coordinator">Van der Vliet</person>
    <person role="instructor">Van Miltenburg</person>
    <person role="instructor">Van Son</person>
    <person role="instructor">Marten Postma</person>
    <person role="student">Baloche</person>
    <person role="student">De Boer</person>
    <animal role="student">Rubber duck</animal>
    <person role="student">Van Doorn</person>
    <person role="student">De Jager</person>
    <person role="student">King</person>
    <person role="student">Kingham</person>
    <person role="student">Mózes</person>
    <person role="student">Rübsaam</person>
    <person role="student">Torsi</person>
    <person role="student">Witteman</person>
    <person role="student">Wouterse</person>
    <person/>
</Course>
"""

tree = etree.fromstring(xml_string)
print(type(tree))

In [None]:
# printing the tree only shows a reference to the tree, but not the tree itself 
# To access information, you will have to use the methods we introduce below.
print(tree)

The **`etree.parse()`** method is used to load XML files on your computer:

In [None]:
tree = etree.parse('../Data/xml_data/course.xml')
print(tree)
print(type(tree))

As you can see, `etree.parse()` returns an `ElementTree`, whereas `etree.fromstring()` returns an `Element`. One of the important differences is that the `ElementTree` class serialises as a complete document, as opposed to a single `Element`. This includes top-level processing instructions and comments, as well as a DOCTYPE and other DTD content in the document. For now, it's not too important that you know what these are; just remember that there is a difference between `ElementTree` and `Element`.
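
The difference can also be checked without a file on disk. The following is a minimal sketch, assuming a shortened version of the course XML; `io.BytesIO` stands in for a real file, and the `try`/`except` import is our addition so that the example still runs with the standard library if `lxml` is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the calls used here
    import xml.etree.ElementTree as etree

xml_bytes = b'<Course><person role="coordinator">Van der Vliet</person></Course>'

# etree.fromstring() returns the root element directly
root_from_string = etree.fromstring(xml_bytes)

# etree.parse() takes a path or a file-like object and returns a whole tree
tree = etree.parse(io.BytesIO(xml_bytes))
root_from_tree = tree.getroot()

print(type(root_from_string).__name__)
print(type(tree).__name__)
print(root_from_string.tag, root_from_tree.tag)
```

Only the tree object has a `getroot()` method; the element returned by `etree.fromstring()` already is the root.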

### 3.1 Accessing root element

While `etree.fromstring()` gives you the root element right away, `etree.parse()` does not. In order to access the root element of `ElementTree`, we first need to use the **`getroot()`** method. Note that this does not show the XML element itself, but only a reference. In order to show the element itself, we can use the **`etree.dump()`** method.

**Hint:** `etree.dump()` is helpful for inspecting (parts of) an XML structure you have loaded from a file. You will see examples of this later.

In [None]:
root = tree.getroot()
print('root', type(root), root)
print()
print('etree.dump example')
etree.dump(root, pretty_print=True)

As with any Python object, we can use the built-in function **`dir()`** to list all attributes and methods of an element (which has the type **`lxml.etree._Element`**), some of which will be illustrated below.

In [None]:
print(type(root))
dir(root)

### 3.2 Accessing elements
There are several ways of accessing XML elements. The **`find()`** method returns the *first* matching child.

In [None]:
first_person_el = root.find('person')
# Printing the element itself again only shows a reference
print(first_person_el)
# Instead, we use etree.dump:
etree.dump(first_person_el, pretty_print=True)

In order to get a list of all person children, we can use the **`findall()`** method.
Notice that this does not return the `animal` since we are looking for `person` elements.

In [None]:
all_person_els = root.findall('person')
# Check how many we found
print(len(all_person_els))
all_person_els

Sometimes, we simply want all the children, regardless of their tags. This can be achieved using the **`getchildren()`** method. The list created below will contain all elements directly under the root (including the `animal` element).

In [None]:
all_child_els = root.getchildren()
print(len(all_child_els))
all_child_els
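
The lxml documentation marks `getchildren()` as deprecated and recommends `list(element)` or a plain loop over the element instead. The following is a minimal sketch of that alternative, assuming a shortened version of the course XML. The import falls back to the standard library if `lxml` is missing, which is our assumption and not part of the chapter.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the calls used here
    import xml.etree.ElementTree as etree

xml_string = """
<Course>
    <person role="instructor">Van Son</person>
    <animal role="student">Rubber duck</animal>
    <person role="student">King</person>
</Course>
"""
root = etree.fromstring(xml_string)

# An element behaves like a sequence of its direct children
children = list(root)
print(len(children))  # 3

# Looping over the element visits the same children in document order
tags = [child.tag for child in root]
print(tags)  # ['person', 'animal', 'person']
```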

### 3.3 Accessing element information
We will now show how to access the attributes, text, and tag of an element.

The **`get()`** method is used to access the attribute of an element.
**Attention**: If an attribute does not exist, `get()` returns `None` instead of raising an error.

In [None]:
first_person_el = root.find('person')
role_first_person_el = first_person_el.get('role')
attribute_not_found = first_person_el.get('blabla')
print('role first person element:', role_first_person_el)
print('value if not found:', attribute_not_found)

The **text** of an element is found in the attribute **`text`**:

In [None]:
print(first_person_el.text)

The **tag** of an element is found in the attribute **`tag`**:

In [None]:
print(first_person_el.tag)
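
In practice, `get()`, `text` and `tag` are often used together in a loop, and it helps to know what happens for an empty element such as `<person/>`. The following is a minimal sketch, assuming a shortened version of the course XML; the default value `'unknown'` is a label we chose for illustration. The import falls back to the standard library if `lxml` is missing, which is our assumption and not part of the chapter.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the calls used here
    import xml.etree.ElementTree as etree

xml_string = """
<Course>
    <person role="coordinator">Van der Vliet</person>
    <animal role="student">Rubber duck</animal>
    <person/>
</Course>
"""
root = etree.fromstring(xml_string)

rows = []
for child in root:
    # The second argument of get() is returned when the attribute is missing
    role = child.get('role', 'unknown')
    # An element without text has text == None, so we replace it with an empty string
    name = child.text if child.text is not None else ''
    rows.append((child.tag, role, name))

for row in rows:
    print(row)
```

Checking for `None` before calling string methods on `text` avoids an `AttributeError` on empty elements.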

## 4. How to deal with more than one layer
In our previous example, we had an XML with only one nested layer (**person**). However, XML can deal with many more. 


Let's look at such an example and think about how you would access the first **`target`** element, i.e. 
```xml
<target id="t1" />
```

```xml
<NAF xml:lang="en" version="v3">
    <terms>
        <term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
        <term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
        <term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
        <term id="t4" type="open" lemma="an" pos="R" morphofeat="DT"/>
        <term id="t5" type="open" lemma="actor" pos="N" morphofeat="NN"/>
    </terms>
    <entities>
        <entity id="e3" type="PERSON">
              <references>
                  <span>
                      <target id="t1" />
                      <target id="t2" />
                  </span>
              </references>
        </entity>
    </entities>
</NAF>
```

Again, we use `etree.fromstring()` to load XML from a string:

In [None]:
naf_string = """
<NAF xml:lang="en" version="v3">
    <text>
        <wf id="w1" offset="0" length="3" sent="1" para="1">tom</wf>
        <wf id="w2" offset="4" length="6" sent="1" para="1">cruise</wf>
        <wf id="w3" offset="11" length="2" sent="1" para="1">is</wf>
        <wf id="w4" offset="14" length="2" sent="1" para="1">an</wf>
        <wf id="w5" offset="17" length="5" sent="1" para="1">actor</wf>
    </text>
    <terms>
        <term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
        <term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
        <term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
        <term id="t4" type="open" lemma="an" pos="R" morphofeat="DT"/>
        <term id="t5" type="open" lemma="actor" pos="N" morphofeat="NN"/>
    </terms> 
    <entities>
        <entity id="e3" type="PERSON">
              <references>
                  <span>
                      <target id="t1" />
                      <target id="t2" />
                  </span>
              </references>
        </entity>
    </entities>
</NAF>
"""

naf = etree.fromstring(naf_string)
print(type(naf))
etree.dump(naf, pretty_print=True)

Please note that the structure is as follows:
* the **`NAF`** element is the parent of the elements **`text`**, **`terms`**, and **`entities`**
* the **`wf`** elements are children of the **`text`** element, which provides us information about the position of words in the text, e.g. that *tom* is the first word in the text (**`id="w1"`**) and in the first sentence (**`sent="1"`**)
* the **`term`** elements are children of the **`terms`** element, which provides us information about lemmatization and part of speech
* the **`entity`** element is a child of the **`entities`** element. We learn from the **`entity`** element that the terms **`t1`** and **`t2`** (i.e. *Tom Cruise*) form an entity of type **`PERSON`**.

One way of accessing the first **`target`** element is by going one level at a time:

In [None]:
entities_el = naf.find('entities')
entity_el = entities_el.find('entity')
references_el = entity_el.find('references')
span_el = references_el.find('span')
target_el = span_el.find('target')
etree.dump(target_el, pretty_print=True)

Is there a better way? The answer is yes! The **`find()`** method also accepts a path of tags separated by slashes, which takes us to our `target` element in one step:

In [None]:
target_el = naf.find('entities/entity/references/span/target')
etree.dump(target_el, pretty_print=True)

You can also use **`findall()`** to find *all* `target` elements:

(Note that **`findall()`** returns a list of XML elements. We can use a loop to iterate over them and print them individually.)

In [None]:

for target_el in naf.findall('entities/entity/references/span/target'):
    etree.dump(target_el, pretty_print=True)
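
If you do not want to spell out the whole path, the path syntax accepted by `find()` and `findall()` also supports `.//tag`, which matches elements at any depth below the current one, and `[@attribute='value']`, which filters on an attribute value. The following is a minimal sketch, assuming a shortened copy of the NAF example. The import falls back to the standard library if `lxml` is missing, which is our assumption and not part of the chapter.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the calls used here
    import xml.etree.ElementTree as etree

naf_string = """
<NAF>
    <entities>
        <entity id="e3" type="PERSON">
            <references>
                <span>
                    <target id="t1" />
                    <target id="t2" />
                </span>
            </references>
        </entity>
    </entities>
</NAF>
"""
naf = etree.fromstring(naf_string)

# Full path, one level at a time
full_path_ids = [el.get('id') for el in naf.findall('entities/entity/references/span/target')]

# './/target' finds target elements at any depth below the NAF element
any_depth_ids = [el.get('id') for el in naf.findall('.//target')]
print(any_depth_ids)  # ['t1', 't2']

# A predicate in square brackets selects on an attribute value
person_entity = naf.find(".//entity[@type='PERSON']")
print(person_entity.get('id'))  # e3
```

The short form is convenient, but it also matches `target` elements in other layers if the document has any, so the full path remains the safer choice when a tag name is reused.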

## 5. Extracting information from XML


Often, we want to extract information from an XML file so that we can analyze and possibly manipulate it in Python. Python containers are very useful for this. In the following example, we want to collect all tokens (i.e. words as they appear in the text) together with their part of speech.

To do this, we have to extract information from the token layer and combine it with information from the term layer.

In [None]:
path_to_tokens = 'text/wf'
path_to_terms = 'terms/term'


# We define dictionaries to map identifiers to tokens ('word forms') and term tags including pos information.
# We can use the identifiers to map the tokens to the pos tags in the next step
tokens = naf.findall(path_to_tokens)
terms = naf.findall(path_to_terms)

id_token_dict = dict()
id_pos_dict = dict()

# Map ids to tokens
for token in tokens:
    id_token_dict[token.get('id')] = token.text

# Map ids to pos tags
for term in terms:
    id_pos_dict[term.get('id')] = term.get('pos')

# Use the ids to map tokens to pos tags
for token_id, token in id_token_dict.items():
    # Term identifiers start with a t, token identifiers with a w. The numbers correspond.
    term_id = token_id.replace('w', 't')
    pos = id_pos_dict[term_id]
    print(token, pos)
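
The loop above prints one token and tag per line. To keep the tokens grouped by part of speech, a `defaultdict` of lists is one option. The following is a minimal sketch, assuming a shortened copy of the NAF example and the same convention that token `w1` belongs to term `t1`; the variable names are ours. The import falls back to the standard library if `lxml` is missing, which is our assumption and not part of the chapter.

```python
from collections import defaultdict

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the calls used here
    import xml.etree.ElementTree as etree

naf_string = """
<NAF>
    <text>
        <wf id="w1">tom</wf>
        <wf id="w2">cruise</wf>
        <wf id="w3">is</wf>
        <wf id="w4">an</wf>
        <wf id="w5">actor</wf>
    </text>
    <terms>
        <term id="t1" lemma="Tom" pos="N"/>
        <term id="t2" lemma="Cruise" pos="N"/>
        <term id="t3" lemma="be" pos="V"/>
        <term id="t4" lemma="an" pos="R"/>
        <term id="t5" lemma="actor" pos="N"/>
    </terms>
</NAF>
"""
naf = etree.fromstring(naf_string)

# Map term ids to pos tags
id_pos_dict = {term.get('id'): term.get('pos') for term in naf.findall('terms/term')}

# Group tokens by the pos tag of the corresponding term
pos_to_tokens = defaultdict(list)
for wf in naf.findall('text/wf'):
    # Assumption (as in this chapter's example): w1 corresponds to t1, w2 to t2, ...
    term_id = wf.get('id').replace('w', 't', 1)
    pos_to_tokens[id_pos_dict[term_id]].append(wf.text)

print(dict(pos_to_tokens))
```

With this structure, `pos_to_tokens['N']` gives all tokens tagged as nouns.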

## 6. EXTRA: Creating your own XML

Please note that this section is optional, meaning that you don't need to understand this section in order to complete the assignment. 

There are three main steps:
* **Step a:** Create an XML object with a root element
* **Step b:** Creating child elements and adding them
* **Step c:** Writing to a file

### Step a: Create an XML object with a root element
You create a new XML object by:
* creating the **`root`** element -> using **`etree.Element`** 
* creating the main XML object -> using **`etree.ElementTree`**

You do not have to fully understand how this works. Please make sure you can reuse this code snippet when you create your own XML.

In [None]:
our_root = etree.Element('Course')
our_tree = etree.ElementTree(our_root)

We can inspect what we have created by using `etree.dump()`:

In [None]:
etree.dump(our_root, pretty_print=True)

As you see, we created an XML object, containing only the root element **Course**.

### Step b: Creating child elements and adding them
There are two ways to add child elements to the root element. The first is to create an element using the **`etree.Element()`** method and using `append()` to add it to the root:

In [None]:
# Define tag, attributes and text of the new element
tag = 'person' # what the start and end tag will be 
attributes = {'role': 'student'} # dictionary of attributes, can be more than one
name_student = 'Lee' # the text of the elements

# Create new Element
new_person_element = etree.Element(tag, attrib=attributes)
new_person_element.text = name_student

# Add to root
our_root.append(new_person_element)

# Inspect the current XML
etree.dump(our_root, pretty_print=True)

However, this is so common that there is a shorter and much more efficient way to do this: by using **`etree.SubElement()`**. It accepts the same arguments as the `etree.Element()` method, but additionally requires the parent as first argument:

In [None]:
# Define tag, attributes and text of the new element
tag = 'person' 
attributes = {'role': 'student'} 
name_student = 'Pitt' 

# Add to root
another_person_element = etree.SubElement(our_root, tag, attrib=attributes) # parent is our_root
another_person_element.text = name_student

# Inspect the current XML
etree.dump(our_root, pretty_print=True)

As we have seen before, XML can have multiple nested layers. Creating these works the same way as adding child elements to the root, but now we specify one of the other elements as the parent (in this case, `new_person_element`).

In [None]:
# Define tag, attributes and text of the new element
tag = 'pet'
attributes = {'role': 'joy'}
name_pet = 'Romeo'

# Add to new_person_element
new_pet_element = etree.SubElement(new_person_element, tag, attrib=attributes) # parent is new_person_element
new_pet_element.text = name_pet

# Inspect the current XML
etree.dump(our_root, pretty_print=True) 

### Step c: Writing to a file
This is how we can write our self-made XML to a file. Please inspect `../Data/xml_data/selfmade.xml` using a text editor to check whether it worked.

In [None]:
with open('../Data/xml_data/selfmade.xml', 'wb') as outfile:
    our_tree.write(outfile,
                   pretty_print=True,
                   xml_declaration=True,
                   encoding='utf-8')
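
Besides opening the file in a text editor, you can check the result by reading it back in. The following is a minimal sketch of such a round trip. It writes to a temporary directory (our choice, so that nothing in `../Data` is touched) and leaves out `pretty_print`, because that argument is specific to `lxml` and the import falls back to the standard library if `lxml` is missing (our assumption, not part of the chapter).

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the calls used here
    import xml.etree.ElementTree as etree

# Build a small tree: a root element with one child
our_root = etree.Element('Course')
person = etree.SubElement(our_root, 'person', attrib={'role': 'student'})
person.text = 'Lee'
our_tree = etree.ElementTree(our_root)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a temporary directory
    path = os.path.join(tmp_dir, 'selfmade.xml')
    with open(path, 'wb') as outfile:
        our_tree.write(outfile, xml_declaration=True, encoding='utf-8')

    # Read the file back in and take its root element
    reloaded_root = etree.parse(path).getroot()

names = [(p.get('role'), p.text) for p in reloaded_root.findall('person')]
print(reloaded_root.tag, names)  # Course [('student', 'Lee')]
```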

## Exercises

### Exercise 1:

Have another look at the XML below. Then print the following information:
* the names of all students
* the names of all instructors whose name starts with 'Van'
* all names containing a space
* the role of 'Rubber duck'

In [None]:
xml_string = """
<Course>
    <person role="coordinator">Van der Vliet</person>
    <person role="instructor">Van Miltenburg</person>
    <person role="instructor">Van Son</person>
    <person role="instructor">Marten Postma</person>
    <person role="student">Baloche</person>
    <person role="student">De Boer</person>
    <animal role="student">Rubber duck</animal>
    <person role="student">Van Doorn</person>
    <person role="student">De Jager</person>
    <person role="student">King</person>
    <person role="student">Kingham</person>
    <person role="student">Mózes</person>
    <person role="student">Rübsaam</person>
    <person role="student">Torsi</person>
    <person role="student">Witteman</person>
    <person role="student">Wouterse</person>
    <person/>
</Course>
"""

tree = etree.fromstring(xml_string)
print(type(tree))

### Exercise 2:
In the folder `../Data/xml_data` there is an XML file called `framenet.xml`, which is a simplified version of the data provided by the [FrameNet project](https://framenet.icsi.berkeley.edu/fndrupal/).

FrameNet is a lexical database describing **semantic frames**, which are representations of events or situations and the participants in it. For example, cooking typically involves a person doing the cooking (`Cook`), the food that is to be cooked (`Food`), something to hold the food while cooking (`Container`) and a source of heat (`Heating_instrument`). In FrameNet, this is represented as a frame called `Apply_heat`. The `Cook`, `Food`, `Heating_instrument` and `Container` are called **frame elements (FEs)**. Words that evoke this frame, such as *fry*, *bake*, *boil*, and *broil*, are called **lexical units (LUs)** of the `Apply_heat` frame. FrameNet also contains relations between frames. For example, `Apply_heat` has relations with the `Absorb_heat`, `Cooking_creation` and `Intentionally_affect` frames. In FrameNet, frame descriptions are stored in XML format.

`framenet.xml` contains the information about the frame `Waking_up`. Parse the XML file and print the following:
* the name of the frame
* the names of all lexical units
* the definitions of all lexical units
* the related frames with their type of relation to `Waking_up` (e.g. `Event` with the `Inherits from` relation)

### Exercise 3:

An exercise in which we collect information from multiple files. Not created yet.