{ "cells": [ { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "# NOSQL AND QUERYING JSON AND XML/HTML OBJECTS/FILES" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "**by Serhat Çevikel**" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Working with JSON Objects" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "- JSON is a hierarchical data format that allows data that is not appropriate to be formatted as columns and rows to be stored and queried.\n", "\n", "- Let's say we are tracking our contact data in a csv file:\n", "\n", "```\n", "Lastname, Firstname, Phone Number\n", "\n", "Membrey, Peter, +852 1234 5678\n", "\n", "Thielen, Wouter, +81 1234 5678\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "- If one of the contacts have more than one phone numbers, we have to create a new column\n", "\n", "```\n", "Lastname, Firstname, Phone Number1, Phone Number2\n", "\n", "Membrey, Peter, +852 1234 5678, +44 1234 565 555\n", "\n", "Thielen, Wouter, +81 1234 5678\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "- But suppose, we have million of records with tens of fields, and in some exceptions, some records may have many multiple values of some fields: e.g. 10 telephone numbers, etc\n", "\n", "- JSON format is a remedy for these kinds of flexibility issues and hierarchical data formats.\n", "\n", "- Integrity rules are softer for handling JSON data" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "- JSON stands for \"JavaScript Object Notation\"\n", "- In JSON, each record is called a \"document\"\n", "- Let's write the first record as a JSON document:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "record1='{\n", "\"firstname\": \"Peter\",\n", "\"lastname\": \"Membrey\",\n", "\"phone_numbers\": [\n", "\"+852 1234 5678\",\n", "\"+44 1234 565 555\"\n", "]\n", "}'\n", "\n", "echo $record1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "- Echoing the JSON as comma separated values as such, is not easy to parse and understand the format.\n", "\n", "- We may use online json parsers for this purpose.\n", "\n", "- You can copy and paste the above string into the input pane:\n", "\n", "http://jsonparseronline.com/" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "kernel": "Bash" }, "source": [ "### Querying JSON with JQ" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "kernel": "Bash" }, "source": [ "- Or better, we will use a handy tool called \"jq\" for this purpose:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "kernel": "Bash" }, "outputs": [], "source": [ "echo $record1 | jq ." ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "kernel": "Bash" }, "source": [ "jq is a parser and querying tool for json, that creates a nice output\n", "\n", "You can have more info on jq following the links:\n", "\n", "[The Home Page](https://stedolan.github.io/jq/)\n", "\n", "[Tutorial](https://stedolan.github.io/jq/tutorial/)\n", "\n", "[Manual](https://stedolan.github.io/jq/manual/)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "kernel": "Bash" }, "source": [ "- Each document (equivalent to a row in RDBMS) in JSON is delimited by curly braces \"{\"\n", "- And all values are given as \"key\" and \"value\" pairs:\n", "\n", "```json\n", "{\n", " \"firstname\": \"Peter\",\n", " \"lastname\": \"Membrey\",\n", " \"phone_numbers\": [\n", " \"+852 1234 5678\",\n", " \"+44 1234 565 555\"\n", " ]\n", "}\n", "```\n", "\n", "- firstname is the key, \"Peter\" is the value, an so on\n", "\n", "- We also have arrays of values for a single key, delimited by square brackets []" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "kernel": "Bash" }, "source": [ "- What is more powerful about JSON format is that you can embedded documents inside other ones: (we print the data here in three visual formats)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "kernel": "Bash" }, "outputs": [], "source": [ "record1b='{\n", "\"firstname\": \"Peter\",\n", "\"lastname\": \"Membrey\",\n", "\"numbers\": [\n", "{\n", "\"phone\": \"+852 1234 5678\"\n", "},\n", "{\n", "\"fax\": \"+44 1234 565 555\"\n", "}\n", "]\n", "}'\n", "\n", "echo -e $record1b \"\\n\"\n", "\n", "echo -e \"$record1b\\n\"\n", "\n", "echo $record1b | jq ." ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "kernel": "Bash" }, "source": [ "- See, the phone and fax numbers are inside embedded documents" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "kernel": "Bash" }, "source": [ "- Multiple documents can be \"collected\" inside \"collections\":\n", "\n", "- A \"collection\" in NoSQL terminology is analogous to a \"table\" in the RDBMS jargon. A collection is a collection of similar items (or documents with similar key-value pairs)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "### A real json database example: UN COMTRADE" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "- We will be using a part of the UN COMTRADE database:\n", "\n", "[UN COMTRADE](https://comtrade.un.org/)\n", "\n", "UN COMTRADE is the widest and most comprehensive database on international trade:\n", "\n", "- 250+ reporter countries\n", "- 290+ partner countries\n", "- 6500+ commodity codes\n", "- 50+ of history\n", "- Both imports and exports\n", "- Both values and quantities!" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Let's first save the path:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "datadir=~/data\n", "comtrade=$datadir/comtrade_s1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "ls $comtrade" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "- classificationS1.json lists the item classification according to SITC1 method\n", "- reporterAreas.json and partnerAreas.json lists the countries and their respective codes\n", "- data files are under 2010" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "#### Reporters" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "jq . $comtrade/reporterAreas.json" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Another way to pretty print and navigate through json files is R:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "reporter <- jsonlite::fromJSON(\"~/data/comtrade_s1/reporterAreas.json\")\n", "reporter" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "str(reporter)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "See that, in not-so-nested structures, the data is automatically flattened into a data frame" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "And a collapsable and interactive gadget for viewing json and similar hierarchical data types: (you may need two execute several times to get the JS gadget)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "listviewer::jsonedit(reporter, mode = \"form\")" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Now let's traverse through this document to list country texts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "jq '.results[].text' $comtrade/reporterAreas.json | tr -d '\"'" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "And let's list the country codes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "jq -r '.results[].id' $comtrade/reporterAreas.json" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "- Separate lists of country names and id's do not mean much.\n", "- Suppose we want to find the country code of turkey" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "reporters=$(jq -r '.results[] | \"\\(.text)\\t\\(.id)\"' $comtrade/reporterAreas.json | \\\n", "xargs -0 -i echo -e \"{}\")\n", "\n", "echo \"$reporters\" | column -s $'\\t' -t" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "We can filter with grep or awk:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "echo \"$reporters\" | awk -F \"\\t\" '$1==\"Turkey\"{print $2}'" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Another way to flatten fields is:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "reporters2=$(jq -r '.results[] | .text + \"\\t\" + .id' $comtrade/reporterAreas.json | xargs -0 -i echo -e \"{}\")\n", "\n", "echo \"$reporters2\" | column -s $'\\t' -t" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "echo \"$reporters\" | awk -F \"\\t\" '$1==\"Turkey\"{print $2}'" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Or we can use the \"select\" statement for filtering values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "jq -r '.results[] | select(.text == \"Turkey\") | .id' $comtrade/reporterAreas.json" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "- All the files under 2010e directory are files in which Turkey is either a reporter or a partner\n", "- So sometimes there exists a pair of files where Turkey and another country swap roles as reporter and partner" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "#### Classification" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Now let's go through the classification file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R", "scrolled": false }, "outputs": [], "source": [ "classification <- jsonlite::fromJSON(\"~/data/comtrade_s1/classificationS1.json\")\n", "listviewer::jsonedit(classification, mode = \"form\")" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Now we will filter for those entries, in which text includes \"textile\" and code is only 3 digits:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "jq -r '.results[] | select((.id|test(\"^\\\\d+$\")) and (.text|test(\"(?i)textile\"))) |\n", "select((.id|tonumber < 1000) and (.id|tonumber > 99)) | .text' \\\n", "$comtrade/classificationS1.json | \\\n", "sed 's/ - /\\t/g' | column -s $'\\t' -t" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "See how it works:\n", "\n", "- We filter for id values that are numeric (so exclude ALL, TOTAL, AG1..AG5) and where text includes case insensitive textile\n", "- We filter for id values larger than 99 and smaller than 1000\n", "- We return the text\n", "- The text has already id info at the beginning split with \" - \". We substitute these character with a tab character" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "**EXERCISE 1:**\n", "\n", "Now your turn: Find the id and text of codes parent of which are 651 (Textile yarn and thread).\n", "\n", "The result will be:\n", "\n", "```\n", "6511 Thrown silk & silk yarn and thread\n", "6512 Yarn of wool and animal hair\n", "6513 Cotton yarn & thread, grey, not mercerized\n", "6514 Cotton yarn & thread, bleached, dyed, mercerd.\n", "6515 Yarn and thread of flax, ramie and true hemp\n", "6516 Yarn and thread of synthetic fibres\n", "6517 Yarn and thread of regenerated fibres\n", "6518 Yarn of glass fibre\n", "6519 Yarn of textile fibres,nes incl.paper yarn\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "**SOLUTION 1:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "pass1=\n", "encrypt=\"U2FsdGVkX1+WtEYqehLba8zmJ9evvmr1ohvj0BWaggG2cQUjY/HMZia25jIkV9zR jyLxrSmHJRCX55wFh7JIB9U1sjWJ1cvipesjhECBk3B3GaP97Kjsq9cDu0jnjYu7 cM9b9CKPMJN1Am/sywL5RQT+z0nDKopIuq78l3p+eeqEHQNvloe35hu/y19+lFmt 6f4EG2i1cQ2JJq8bBMyl9g==\"\n", "solution=$(echo \"$encrypt\" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "#### Data files" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Now let's go through the actual data files that includes trade volumes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "ls $comtrade/2010e | { head -5; tail -5; }" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "ls $comtrade/2010e | grep -P \"2010_792\" | head -5" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "- The code after the first underscore is the reporter's and the code after the second underscore is the partner country's code\n", "\n", "- So there are the files for which Turkey is either a reporter or partner\n", "\n", "- For how many files Turkey is reporter or a partner?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "ls $comtrade/2010e | grep -P \"2010_792\" | wc -l\n", "ls $comtrade/2010e | grep -P \"792_s1\" | wc -l" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Turkey reported her trade with 227 partners, while 165 reporter countries reported their trade partnering with Turkey (for s1 classification only)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "tradedata <- jsonlite::fromJSON(\"~/data/comtrade_s1/2010e/2010_792_100_s1.json\")\n", "listviewer::jsonedit(tradedata, mode = \"form\")" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Descriptions for several variables are:\n", "\n", "- CmdCode: \tCommodity code\n", "- CmdDesc: \tCommodity description\n", "- IsLeaf: \tBasic code without children\n", "- Parentcode: \tHigh level of that commodity code\n", "- pfDesc: \tCommodity classification\n", "- PfCode: \tCommodity classification code\n", "- yr: \tYear\n", "- rtCode \tReporter Code\n", "- ptCode: \tPartner Code\n", "- qtCode: \tQuantity code " ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Now from all files for which Turkey is a reporter, the TradeValue of exports (rgCode is 2) in 651 code (Textile yarn and thread)\n", "\n", "We will report:\n", "- ptTitle (name of partner country)\n", "- TradeValue\n", "- TradeQuantity " ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "For a single file we have:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "jq -r '.dataset[] | select(.cmdCode == \"651\" and .rgCode == 2) |\n", "\"\\(.ptTitle)\\t\\(.TradeValue)\\t\\(.TradeQuantity)\"' $comtrade/2010e/2010_792_100_s1.json | \\\n", "xargs -0 -i echo -e \"{}\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "textileexports=$(for file in $comtrade/2010e/2010_792*.json;\n", "do\n", " jq -r '.dataset[] | select(.cmdCode == \"651\" and .rgCode == 2) |\n", "\"\\(.ptTitle)\\t\\(.TradeValue)\\t\\(.TradeQuantity)\"' $file | \\\n", "xargs -0 -i echo -e \"{}\";\n", "done)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "echo \"$textileexports\" | column -s $'\\t' -t" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Let's exclude World total:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "echo \"$textileexports\" | awk -F \"\\t\" '$1 != \"World\"' | \\\n", "column -s $'\\t' -t" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "echo \"$textileexports\" | \\\n", "awk -F \"\\t\" '$1 != \"World\"' | \\\n", "sort -t $'\\t' -nr -k 2 | \\\n", "numfmt -d $'\\t' --field=2-3 --to-unit=M | \\\n", "column -s $'\\t' -t" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "What we do here:\n", "\n", "- Exclude the row first field of which had \"World\"\n", "- Sort as number in reverse order by second field\n", "- Format 2nd-3rd columns in millions\n", "- And align columns" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "So in 2010, Turkey's significant partners for Textile yarn and thread export are:\n", "\n", "- Italy (USD 166 mio)\n", "- Russion Federation (USD 95 mio)\n", "- Germany (USD 86 mio)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "**EXERCISE 2:**\n", "\n", "- For each file in which Turkey is a partner, return the reporting country name (rtTitle), comodity description (cmdDescE) and trade value (TradeValue) for those items which:\n", "- are exported from those countries to Turkey (rgCode is 2)\n", "- TradeValue's are at least 500000000 (500 mio)\n", "- and are at the most detailed level (IsLeaf is 1) \n", "\n", "So you should fill in this template below:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "```Bash\n", "for file in $comtrade/2010e/*792_s1.json;\n", "do\n", " jq -r '#fill in this part' $file | \\\n", " xargs -0 -i echo -e \"{}\";\n", "done | \\\n", "awk -F \"\\t\" '$1 != \"EU-28\"' | \\\n", "sort -t $'\\t' -nr -k 3 | \\\n", "numfmt -d $'\\t' --field=3 --to-unit=M | \\\n", "column -s $'\\t' -t\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "The output will be:\n", "\n", "```\n", "Germany Passenger motor cars, other than buses 2653\n", "Algeria Gas, natural 2591\n", "USA Special transactions 2010\n", "Russian Federation Crude petroleum 1607\n", "Germany Other parts for motor vehicles 1231\n", "China Statistical machines cards or tapes 1027\n", "France Other parts for motor vehicles 1003\n", "Germany Internal combustion engines, not for aircraft 985\n", "Russian Federation Coal /anthracite, bituminous/ 972\n", "Germany Special transactions 948\n", "Spain Passenger motor cars, other than buses 923\n", "Russian Federation Aluminium and aluminium alloys, unwrought 865\n", "USA Raw cotton, other than linters 862\n", "Russian Federation Refined copper including remelted 861\n", "Saudi Arabia Products of polymerization and copolymerization 849\n", "Italy Other parts for motor vehicles 806\n", "France Aircraft, heavier than air 799\n", "Belgium Products of polymerization and copolymerization 644\n", "Qatar Gas, natural 631\n", "Italy Machinery and mechanical appliances, nes 562\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "You can view a sample data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R", "scrolled": true }, "outputs": [], "source": [ "tradedata <- jsonlite::fromJSON(\"~/data/comtrade_s1/2010e/2010_100_792_s1.json\")\n", "listviewer::jsonedit(tradedata, mode = \"form\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "**SOLUTION 2:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "pass1=\n", "encrypt=\"U2FsdGVkX1+4esKDuc9iGGz0tWhqscfUMmjPGTa2MIp+YPGWOvdJZY0ZLMx3XSI5 42ueQ3AERCACRgSCZOnUhnDJG4wgnAgcmv+RXW0fR82yLC6g5M2+t3XuODqTIZag Y+g6DkbnapcedfV2K5hg0ELct8YzomN7648I/vQxCQEYEPlwxv/Pdw7IBEeIgFTH zJ25jJlAKJerLYTl9G+XXr01Pwlh5VS36wh16/opSTYjeh++FgvCCH2WKBiFfLjK oxzjfLBcpcqWLVOQ8ozqCean7tnTPIHR0QzjJeiqJ91jBOLbxPpnyT93Yu5Ikxne MbZ/7zAVqLmQVJCpLj6igmHtYxMv3Ck1HC5RKLX2Qrb8sPygcafwyGRp+r0jEsOk nl3MxKOVMuWIwvA+gmJwQhUJXmf7Zm1YX9tcn3PJe6IWPom2P9esB+GXKMnFkpL9 IsDEP/nvQ5TXe+gDGpKWzlEirPp6CwAB6YNiH/phRMUmrbg2DtsRSvZA9bMslFbf eXrItWZ34YuwnqDSlAaquA==\"\n", "solution=$(echo \"$encrypt\" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "So according to numbers reported by other countries, in 2010 Turkey's largest imports were:\n", "\n", "- Passenger cars, engines and parts for motor vehicles from Germany and France (some portion for domestic production)\n", "- Natural gas, petroleum and Coal from Algeria and Russia\n", "- (Probably) mobile devices from China\n", "\n", "Note that SITC 1 is a very old classification so some modern commodities may not fit well into this schema" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "### MongoDB" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Now we will import the files into MongoDB" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "First we create a directory for MongoDB database:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "mkdir -p ~/mongo" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "And run the mongo server:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "mongod -dbpath ~/mongo &" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "And import the files into a new database called comtrade and collection called 2010s1:\n", "\n", "**DO NOT REPEAT THIS STEP MULTIPLE TIMES, YOU WILL HAVE DUPLICATE DOCUMENTS!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "for file in $comtrade/2010e/*.json;\n", "do mongoimport --db comtrade --collection 2010s1 $file; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Now from R, import monglite package: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "library(mongolite)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "We create a connection object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "con <- mongo(collection = \"2010s1\", db = \"comtrade\")" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Select the document(s) where rtTitle is Bulgaria and return, commodity description, commodity code and TradeValue columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "bulgaria_vals <- con$find('{\"dataset.rtTitle\" : \"Bulgaria\"}',\n", " fields = '{\"dataset.cmdDescE\": 1, \"dataset.cmdCode\": 1, \"dataset.TradeValue\":1 }')" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "The syntax of find() method has\n", "- The query\n", "- And the projection (selected fields to be returned)\n", "\n", "The fields to be returned get the value 1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Let's view the structure of the return value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "str(bulgaria_vals)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "And view only the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "bulgaria_vals$dataset" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Now we will get the values for countries with which Turkey has at least USD 50 mio export of 651 coded \"Textile yarn and thread\":" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "export_651 <- con$find('{\n", "\"dataset\":\n", "{ \"$elemMatch\":\n", "\n", "{ \"$and\": [\n", " { \"rtTitle\" : \"Turkey\" },\n", " { \"cmdCode\" : \"651\" },\n", " { \"TradeValue\" : { \"$gt\": 50000000 } },\n", " { \"ptTitle\" : { \"$ne\" : \"World\" } },\n", " { \"rgCode\" : 2 }\n", "\n", "] } } }',\n", " fields = '{\"dataset.ptTitle\": 1,\n", " \"dataset.rgCode\": 1,\n", " \"dataset.cmdDescE\": 1,\n", " \"dataset.cmdCode\": 1,\n", " \"dataset.TradeValue\":1 }')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "str(export_651)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "export_651_2 <- do.call(rbind, export_651$dataset)\n", "with(export_651_2, export_651_2[rgCode == \"2\" & cmdCode == \"651\",]) " ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "**EXERCISE 3:**\n", "\n", "Select those countries for which Turkey has at least USD 5 bio exports" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "**SOLUTION 3:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "pass <- readline(prompt = \"Please enter the password for the solution: \")\n", "encrypt <- \"U2FsdGVkX1/cPhgJB5HjYQ4xWOnuyfs5vKrofNpNlV9UHETg+d+KYoLJw7Py0t9p ap3m5rxDCl7JtmwH6VZKMf5sfooPDhhAnI5YXHJJmAc78U5ZYT/aVL75466bi6Te ic4Yd4uM1jcyblnj9LAp2n76eyGFFaEmAY2hmwn+uaDbk91xTNTYnE0gNqmjMJHs t60XDU19QfOTF8YUyG1cbYQwGqMp4k9u1o45Co5f4BRH8aWrgdCp1NBCd+X2itsl ujQYQ45+dnj7D01iezb9gPxvWiEMT+rIP7rWEQh8+wg/mIRR0zctqwk0xrIXLSNo /q1RCrw8DTaoqX5De3EZXxFzWNsw81+/9eehspVmosb9KnoNEKfZzgj99dm/ICDq 8uf9bqxZ373tN6oC23znD1vndhDPVl9KbwoNHjLa9Cb+Do5qoQqQ+hxEnDQm9I96 jai0dRPTnf5RXeb4Zs4Us3+8wahR3gm797WU/qJVIymbZH+5DjSID27XCI2WinPi\"\n", "solution <- system(sprintf(\"echo %s | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:%s 2> /dev/null\", encrypt, pass), intern = T, ignore.stderr = T)\n", "cat(solution, sep = \"\\n\")\n", "eval(parse(text = solution))" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "## Working with XML/HTML Objects" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "### What is XML" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "(https://www.w3schools.com/XML/xml_whatis.asp)\n", "\n", "XML is a software- and hardware-independent tool for storing and transporting data.\n", "\n", "What is XML?\n", "\n", "- XML stands for eXtensible Markup Language\n", "- XML is a markup language much like HTML\n", "- XML was designed to store and transport data\n", "- XML was designed to be self-descriptive\n", "- XML is a W3C Recommendation\n" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "XML Does Not DO Anything\n", "\n", "Maybe it is a little hard to understand, but XML does not DO anything.\n", "\n", "This note is a note to Tove from Jani, stored as XML:\n", "```XML\n", "\n", " Tove\n", " Jani\n", " Reminder\n", " Don't forget me this weekend!\n", "\n", "```\n", "\n", "The XML above is quite self-descriptive:\n", "\n", "- It has sender information.\n", "- It has receiver information\n", "- It has a heading\n", "- It has a message body.\n", "\n", "But still, the XML above does not DO anything. XML is just information wrapped in tags.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "The Difference Between XML and HTML\n", "\n", "XML and HTML were designed with different goals:\n", "\n", "- XML was designed to carry data - with focus on what data is\n", "- HTML was designed to display data - with focus on how data looks\n", "- XML tags are not predefined like HTML tags are" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "XML Does Not Use Predefined Tags\n", "\n", "The XML language has no predefined tags.\n", "\n", "```XML\n", "The tags in the example above (like and ) are not defined in any XML standard. These tags are \"invented\" by the author of the XML document.\n", "\n", "HTML works with predefined tags like

,

, , etc.\n", "\n", "With XML, the author must define both the tags and the document structure.\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "XML is Extensible\n", "\n", "Most XML applications will work as expected even if new data is added (or removed).\n", "\n", "```XML\n", "Imagine an application designed to display the original version of note.xml ( ).\n", "\n", "Then imagine a newer version of note.xml with added and elements, and a removed .\n", "```\n", "\n", "The way XML is constructed, older version of the application can still work:\n", "\n", "```XML\n", "\n", " 2015-09-01\n", " 08:30\n", " Tove\n", " Jani\n", " Don't forget me this weekend!\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "### Basic XML Syntax" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "(https://www.w3schools.com/XML/xml_syntax.asp)\n", "\n", "The syntax rules of XML are very simple and logical. The rules are easy to learn, and easy to use.\n", "\n", "XML Documents Must Have a Root Element\n", "\n", "XML documents must contain one root element that is the parent of all other elements:\n", "\n", "```XML\n", "\n", " \n", " .....\n", " \n", " \n", "```\n", "\n", "In this example is the root element:\n", "\n", "```XML\n", "\n", "\n", " Tove\n", " Jani\n", " Reminder\n", " Don't forget me this weekend!\n", " \n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "All XML Elements Must Have a Closing Tag\n", "\n", "In XML, it is illegal to omit the closing tag. All elements must have a closing tag:\n", "\n", "```XML\n", "

This is a paragraph.

\n", "
\n", "```\n", "\n", "XML Tags are Case Sensitive\n", "\n", "XML tags are case sensitive. The tag is different from the tag .\n", "\n", "Opening and closing tags must be written with the same case:\n", "```XML\n", "This is correct \n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "XML Elements Must be Properly Nested\n", "\n", "In HTML, you might see improperly nested elements:\n", "```XML\n", "This text is bold and italic\n", "```\n", "\n", "In XML, all elements must be properly nested within each other:\n", "```XML\n", "This text is bold and italic\n", "```\n", "\n", "In the example above, \"Properly nested\" simply means that since the element is opened inside the element, it must be closed inside the element." ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "XML Attribute Values Must Always be Quoted\n", "\n", "XML elements can have attributes in name/value pairs just like in HTML.\n", "\n", "In XML, the attribute values must always be quoted:\n", "```XML\n", "\n", " Tove\n", " Jani\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Entity References\n", "\n", "Some characters have a special meaning in XML.\n", "\n", "If you place a character like \"<\" inside an XML element, it will generate an error because the parser interprets it as the start of a new element.\n", "\n", "This will generate an XML error:\n", "```XML\n", "salary < 1000\n", "```\n", " \n", "To avoid this error, replace the \"<\" character with an entity reference:\n", "```XML\n", "salary < 1000\n", "```\n", "\n", "There are 5 pre-defined entity references in XML:\n", "```\n", "< \t< \tless than\n", "> \t> \tgreater than\n", "& \t& \tampersand \n", "' \t' \tapostrophe\n", "" \t\" \tquotation mark\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "### A real XML/HTML dataset" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "On the 2nd of December 2018, I scraped 994 realty listing pages of residences for sale in Mecidiyekoy, Sisli neighbourhood from www.hurriyetemlak.com" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "datadir=~/data\n", "hemlak=$datadir/he_sisli" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "ls -l $hemlak | head" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Viewing the data as is not quite intuitive:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "find $hemlak -mindepth 1 | head -1 | xargs cat" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "An XML/HTMl parser can parse, pretty print, and traverse through the files.\n", "\n", "Xidel is highly performant:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "find $hemlak -mindepth 1 | head -1 | xargs -i xidel --input-format html --output-format html -e \"/\" {}" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Another way to view an indented version of the XML/HTMl file might be:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "a <- xml2::read_html(\"~/data/he_sisli/10005103\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "htmltidy::html_view(a)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "However, opening the file with a web browser (preferably Chrome), hitting the F12 button and viewing the \"elements\" pane is the best option.\n", "\n", "This way we will get the Xpath nodes for the information we want from the files" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "### XPath basics" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "https://www.w3schools.com/xml/xpath_intro.asp\n", "\n", "- XPath can be used to navigate through elements and attributes in an XML document.\n", "- XPath stands for XML Path Language\n", "- XPath uses \"path like\" syntax to identify and navigate nodes in an XML document\n", "- XPath contains over 200 built-in functions\n", "- XPath uses path expressions to select nodes or node-sets in an XML document.\n", "- These path expressions look very much like the path expressions you use with traditional computer file systems" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "#### XPath Nodes" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "https://www.w3schools.com/xml/xpath_nodes.asp\n", "\n", "XPath Terminology\n", "Nodes\n", "\n", "In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes.\n", "\n", "XML documents are treated as trees of nodes. The topmost element of the tree is called the root element.\n", "\n", "Look at the following XML document:\n", "\n", "```XML\n", "\n", "\n", "\n", " \n", " Harry Potter\n", " J K. Rowling\n", " 2005\n", " 29.99\n", " \n", "\n", "```\n", "\n", "Example of nodes in the XML document above:\n", "```XML\n", " (root element node)\n", "\n", "J K. Rowling (element node)\n", "\n", "lang=\"en\" (attribute node) \n", "```\n", "\n", "Atomic values\n", "\n", "Atomic values are nodes with no children or parent.\n", "\n", "Example of atomic values:\n", "\n", "J K. Rowling\n", "\n", "\"en\"\n", "\n", "Items\n", "\n", "Items are atomic values or nodes." ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "#### Relationship of Nodes" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "##### Parent" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Parent\n", "\n", "Each element and attribute has one parent.\n", "\n", "In the following example; the book element is the parent of the title, author, year, and price:\n", "```XML\n", "\n", " Harry Potter\n", " J K. Rowling\n", " 2005\n", " 29.99\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "##### Children" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Element nodes may have zero, one or more children.\n", "\n", "In the following example; the title, author, year, and price elements are all children of the book element:\n", "```XML\n", "\n", " Harry Potter\n", " J K. Rowling\n", " 2005\n", " 29.99\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "##### Siblings" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Nodes that have the same parent.\n", "\n", "In the following example; the title, author, year, and price elements are all siblings:\n", "```XML\n", "\n", " Harry Potter\n", " J K. Rowling\n", " 2005\n", " 29.99\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "##### Ancestors" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "A node's parent, parent's parent, etc.\n", "\n", "In the following example; the ancestors of the title element are the book element and the bookstore element:\n", "```XML\n", "\n", "\n", "\n", " Harry Potter\n", " J K. Rowling\n", " 2005\n", " 29.99\n", "\n", "\n", " \n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "##### Descendants" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "A node's children, children's children, etc.\n", "\n", "In the following example; descendants of the bookstore element are the book, title, author, year, and price elements:\n", "```XML\n", "\n", "\n", "\n", " Harry Potter\n", " J K. Rowling\n", " 2005\n", " 29.99\n", "\n", "\n", " \n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "#### XPath Syntax" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "https://www.w3schools.com/xml/xpath_syntax.asp" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "##### The XML Example Document" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "We will use the following XML document in the examples below.\n", "```XML\n", "\n", "\n", "\n", "\n", "\n", " Harry Potter\n", " 29.99\n", "\n", "\n", "\n", " Learning XML\n", " 39.95\n", "\n", "\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "##### Selecting Nodes" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "XPath uses path expressions to select nodes in an XML document. The node is selected by following a path or steps. The most useful path expressions are listed below:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ExpressionDescription
nodenameSelects all nodes with the name \"nodename\"
/Selects from the root node
//Selects nodes in the document from the current node that match the selection no matter where they are
.Selects the current node
..Selects the parent of the current node
@Selects attributes
" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "In the table below we have listed some path expressions and the result of the expressions:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Path ExpressionResult
bookstoreSelects all nodes with the name \"bookstore\"
/bookstoreSelects the root element bookstore

Note: If the path starts with a slash ( / ) it always represents an absolute \n", "path to an element!

bookstore/bookSelects all book elements that are children of bookstore
//bookSelects all book elements no matter where they are in the document
bookstore//bookSelects all book elements that are descendant of the bookstore element, no matter where they are under the bookstore element
//@langSelects all attributes that are named lang
" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "##### Predicates" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "- Predicates are used to find a specific node or a node that contains a specific value.\n", "- Predicates are always embedded in square brackets.\n", "- In the table below we have listed some path expressions with predicates and the result of the expressions:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Path ExpressionResult
/bookstore/book[1] Selects the first book element that is the child of the bookstore element.\n", "

Note: In IE 5,6,7,8,9 first node is[0], but according to W3C, it is [1]. To solve this problem in IE, set the SelectionLanguage to XPath:

\n", " In JavaScript: xml.setProperty(\"SelectionLanguage\",\"XPath\");
/bookstore/book[last()]Selects the last book element that is the child of the bookstore element
/bookstore/book[last()-1]Selects the last but one book element that is the child of the bookstore element
/bookstore/book[position()<3]Selects the first two book elements that are children of the bookstore element
//title[@lang]Selects all the title elements that have an attribute named lang
//title[@lang='en']Selects all the title elements that have a \"lang\" attribute with a value of \"en\"
/bookstore/book[price>35.00]Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00
/bookstore/book[price>35.00]/titleSelects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00
" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "##### Selecting Unknown Nodes" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "XPath wildcards can be used to select unknown XML nodes." ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WildcardDescription
*Matches any element node
@*Matches any attribute node
node()Matches any node of any kind
" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "In the table below we have listed some path expressions and the result of the expressions:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Path ExpressionResult
/bookstore/*Selects all the child element nodes of the bookstore element
//*Selects all elements in the document
//title[@*]Selects all title elements which have at least one attribute of any kind
" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "##### Selecting Several Paths" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "- By using the | operator in an XPath expression you can select several paths.\n", "- In the table below we have listed some path expressions and the result of the expressions:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Path ExpressionResult
//book/title | //book/priceSelects all the title AND price elements of all book elements
//title | //priceSelects all the title AND price elements in the document
/bookstore/book/title | //priceSelects all the title elements of the book element of the bookstore element AND all the price elements in the document
" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "### Xpath example: Get the price info from listings" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "- Now please open a web browser windows on Chrome, and open:\n", "\n", "https://www.hurriyetemlak.com/konut-satilik/istanbul-sisli-mecidiyekoy-daire/listeleme?pageSize=50&page=1\n", "\n", "- Navigate to any listing, hit F12 (debug tools) and select the elements pane\n", "- By using the element selector on top left, hit any point on the web page and see how the Elements windows navigates\n", "- Now hit the price info with the selector\n", "- Right click the highlighted element on the right pane, and click on \"Copy Element\"\n", "- The result will be something like:\n", "\n", "```XML\n", "500.000 TL\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Now we want to get the path to this node:\n", "\n", "- Right click again, this time click on \"Copy XPath\"\n", "\n", "```XPath\n", "/html/body/div[1]/div[2]/div[2]/div[1]/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li[1]/span\n", "```\n", "\n", "Now we can use Xidel to get the value at this XPath, provided that the queried XML/HTML file has a similar DOM structure (hierarchy of nodes) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "xidel --input-format html --output-format adhoc -e \\\n", "\"/html/body/div[1]/div[2]/div[2]/div[1]/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li[1]/span\" \\\n", "$hemlak/10005103 2> /dev/null" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "However, traversing using only indices might not be correct in all cases: The count of a certain element may change across similar pages\n", "\n", "So we will use attributes to be more robust:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "xidel --input-format html --output-format adhoc -e \\\n", "\"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()\" \\\n", "$hemlak/10005103 2> /dev/null" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Or course it is better to get only the numeric values and skip \".\" and \"TL\" parts" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "xidel --input-format html --output-format adhoc -e \\\n", "\"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()\" \\\n", "$hemlak/10005103 2> /dev/null | \\\n", "tr -d \".\" | grep -Po \"\\d+\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Now we can traverse through all 994 files to get price information" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "he_prices=$(for i in $hemlak/*; do \\\n", "xidel --input-format html --output-format adhoc -e \\\n", "\"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()\" \\\n", "$i 2> /dev/null | \\\n", "tr -d \".\" | grep -Po \"\\d+\";\n", "done)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "echo \"$he_prices\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "ls $hemlak | wc -l\n", "echo \"$he_prices\" | wc -l" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "So we have 994 price information for 994 listing files" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "### Get the square meter information" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Similarly we will get the square meter information from listing files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "he_sqm=$(for i in $hemlak/*; do \\\n", "xidel --input-format html --output-format adhoc -e \\\n", "\"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']//span[@id='realtyGrossSqm']/following-sibling::span/text()\" \\\n", "$i 2> /dev/null | \\\n", "tr -d \".\" | grep -Po \"\\d+\";\n", "done)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "echo \"$he_sqm\" | wc -l" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "### Get \"loan eligibility info\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Note that we also have to control for missing values so that the output is parallel to previous ones" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "he_kredi=$(for i in $hemlak/*; do \\\n", "kredi=$(xidel --input-format html --output-format adhoc -e \\\n", "\"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']//span[text()='Krediye Uygunluk']/following-sibling::span/text()\" \\\n", "$i 2> /dev/null);\n", "if [[ -z $kredi ]]; then echo NA; else echo $kredi; fi;\n", "done)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "echo \"$he_kredi\" | wc -l" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "### Get property age info" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "**EXERCISE 4:**\n", "\n", "Get the age (Bina Yaşı) info from files and save into \"he_age\" object.\n", "\n", "Check that total length is 994" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "**SOLUTION 4:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "pass1=\n", "encrypt=\"U2FsdGVkX1/eIM2iFwU8EiycbOHQU9x+E17J9APxpeP4xuHq88DO5QUGfmMq0yvi 7BqTIovpXlJ08UUBy+C30htC9YXNU0yc5RO3poMIC57QJ/TsIO1S7cJcmHAQJKYl 0mfbwfMgIEjHmjjAQ8XPwmARSwobvXlyl/6IK6lLNDZL3Ve9bwHBJD9cm8KsSEeX TQGukNiUej07kD7jGD6+q3CKz29an0olwDLvK7dgPaKavfZSGIpa7CAHqAriVx83 Fq+uQKu2s1E5vpuA85huxyrVuWqMJFwoq5uiq2KBv446Sm+AOPSOJDVVxS1B+DB8 Ayh9E18VChQ0JS2Aw+wjlRF3wzE3Ew3Ne1pUUUSbPHBde3t3RgfCzD3u+LdR1K7h QQlGJGIOpbSHe7mRNVeiuhNwl0qGe72jdFVPqcC4DE8ZuW5AySn+SQ2I4kvvmhRs jOjHOddqrH6c9BDkoibzrrWoR2teu8aIbMBCNaBrDIU=\"\n", "solution=$(echo \"$encrypt\" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "### Combine info into a tsv file" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Now together with filenames (listing codes), let's combine all information into a tab seperated text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "selected_info=$(paste <(ls $hemlak) <(echo \"$he_prices\") <(echo \"$he_sqm\") \\\n", "<(echo \"$he_kredi\") <(echo \"$he_age\") --delimiters \"\\t\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "echo \"$selected_info\" | column -s $'\\t' -t" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "And save into a file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "echo \"$selected_info\" > selected_info.tsv" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "### Analyze selected info" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "Now let's analyze the selected info using R" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "selected_info <- read.delim(\"selected_info.tsv\", header = F)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "See the structure:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "str(selected_info)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "See missing data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "lapply(selected_info, function(x) which(is.na(x)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "selected_info2 <- na.omit(selected_info)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "str(selected_info2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "selected_info2" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Let's rename columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "names(selected_info2) <- c(\"id\", \"price\", \"sqm\", \"eligibility\", \"age\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "selected_info2" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Let's add another column for price per sqm" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "selected_info2$pricesqm <- with(selected_info2, price / sqm)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "selected_info2" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Now let's see whether price per sqm differs across eligibility values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "with(selected_info2, aggregate(pricesqm, by = list(eligibility), FUN = mean))" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "price per sqm for loan eligible properties are significantly higher" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "The distribution across eligibility classes are:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "table(selected_info2$eligibility)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Now let's draw a simple scatteplot between price per sqm and age for eligible properties" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "eligible <- selected_info2[selected_info2$eligibility == \"Uygun\",]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "with(selected_info2, plot(age, pricesqm))" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "No significant trend is detected without further data wrangling and cleaning (such as omitting outliers)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Bash" }, "outputs": [], "source": [ "cat ~/data/he_sisli/10005103 | head" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "### Extract tables from XML/HTML" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "We will extract all the information to the left of the picture in a listing as a table using R" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Bash" }, "source": [ "First parse the file as html:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "library(XML)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "parsed <- XML::htmlParse(\"~/data/he_sisli/10005103\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "parsed" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "The xpath for the table is:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "xpath1 <- \"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='info-line']/ul[@class='clearfix']\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Apply the XPath to document" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "table1 <- xpathApply(parsed, xpath1)[[1]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "str(table1)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "Convert XMLNode to an XMLInternalDocument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "xml1 <- xmlDoc(table1)\n", "xml1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "R" }, "source": [ "And extract items as two columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "table2 <- sapply(1:2, function(x) xpathSApply(xml1, sprintf(\"//li/span[%s]/text()\", x), xmlValue))\n", "table2" ] } ], "metadata": { "kernelspec": { "display_name": "SoS", "language": "sos", "name": "sos" }, "language_info": { "codemirror_mode": "sos", "file_extension": ".sos", "mimetype": "text/x-sos", "name": "sos", "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", "pygments_lexer": "sos" }, "sos": { "kernels": [ [ "Bash", "bash", "Bash", "#E6EEFF" ], [ "Bash", "bash", "", "#E6EEFF" ], [ "R", "ir", "", "#DCDCDA" ] ], "panel": { "displayed": true, "height": 0, "style": "side" }, "version": "0.20.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }