{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Country Converter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The country converter (coco) is a Python package to convert country names into different classifications and between different naming versions. Internally it uses regular expressions to match country names.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The package is available on PyPI. Use " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pip install country_converter --upgrade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "from the command line or use your preferred Python package installer.\n", "The source code is available on GitHub: https://github.com/IndEcol/country_converter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conversion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The country converter provides one main class which is used for the conversion:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import country_converter as coco" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "converter = coco.CountryConverter()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a list of countries in a certain classification:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "iso3_codes = [\"USA\", \"VUT\", \"TKL\", \"AUT\", \"AFG\", \"ALB\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be converted to any of the provided classifications with:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['United States of America',\n", " 'Republic of Vanuatu',\n", " 'Tokelau',\n", " 'Republic of Austria',\n", " 'Islamic Republic of Afghanistan',\n", " 
'Republic of Albania']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "converter.convert(names=iso3_codes, src=\"ISO3\", to=\"name_official\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['America', 'Oceania', 'Oceania', 'Europe', 'Asia', 'Europe']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "converter.convert(names=iso3_codes, src=\"ISO3\", to=\"continent\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameter \"src\" specifies the input format and \"to\" the output format. Possible values for both parameters can be listed with:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['APEC',\n", " 'BASIC',\n", " 'BRIC',\n", " 'CIS',\n", " 'Cecilia2050',\n", " 'DACcode',\n", " 'EEA',\n", " 'EU',\n", " 'EU12',\n", " 'EU15',\n", " 'EU25',\n", " 'EU27',\n", " 'EU27_2007',\n", " 'EU28',\n", " 'EURO',\n", " 'EXIO1',\n", " 'EXIO1_3L',\n", " 'EXIO2',\n", " 'EXIO2_3L',\n", " 'EXIO3',\n", " 'EXIO3_3L',\n", " 'Eora',\n", " 'FAOcode',\n", " 'G20',\n", " 'G7',\n", " 'GBDcode',\n", " 'GWcode',\n", " 'IEA',\n", " 'IMAGE',\n", " 'ISO2',\n", " 'ISO3',\n", " 'ISOnumeric',\n", " 'MESSAGE',\n", " 'OECD',\n", " 'REMIND',\n", " 'Schengen',\n", " 'UN',\n", " 'UNcode',\n", " 'UNmember',\n", " 'UNregion',\n", " 'WIOD',\n", " 'ccTLD',\n", " 'continent',\n", " 'name_official',\n", " 'name_short',\n", " 'obsolete',\n", " 'regex']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "converter.valid_class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Internally, these names are the column headers of the underlying pandas dataframe (see below)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The convert function can also be called directly, without instantiating the CountryConverter. This is useful for one-off conversions. For repeated conversions, instantiating the CountryConverter once avoids re-reading the file with the matching data on every call." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['US', 'VU', 'TK', 'AT', 'AF', 'AL']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coco.convert(names=iso3_codes, src=\"ISO3\", to=\"ISO2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the classifications can be accessed through shortcuts. For example:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name_short EU27\n", "14 Austria EU27\n", "21 Belgium EU27\n", "35 Bulgaria EU27\n", "55 Croatia EU27\n", "58 Cyprus EU27\n", "59 Czech Republic EU27\n", "60 Denmark EU27\n", "70 Estonia EU27\n", "76 Finland EU27\n", "77 France EU27\n", "84 Germany EU27\n", "87 Greece EU27\n", "101 Hungary EU27\n", "107 Ireland EU27\n", "110 Italy EU27\n", "122 Latvia EU27\n", "128 Lithuania EU27\n", "129 Luxembourg EU27\n", "137 Malta EU27\n", "156 Netherlands EU27\n", "177 Poland EU27\n", "178 Portugal EU27\n", "182 Romania EU27\n", "196 Slovakia EU27\n", "197 Slovenia EU27\n", "204 Spain EU27\n", "215 Sweden EU27" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "converter.EU27" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ISO2 OECD\n", "13 AU 1971.0\n", "14 AT 1961.0\n", "21 BE 1961.0\n", "41 CA 1961.0\n", "45 CL 2010.0\n", "49 CO 2020.0\n", "53 CR 2021.0\n", "59 CZ 1995.0\n", "60 DK 1961.0\n", "70 EE 2010.0\n", "76 FI 1969.0\n", "77 FR 1961.0\n", "84 DE 1961.0\n", "87 GR 1961.0\n", "101 HU 1996.0\n", "102 IS 1961.0\n", "107 IE 1961.0\n", "109 IL 2010.0\n", "110 IT 1962.0\n", "112 JP 1964.0\n", "122 LV 2016.0\n", "128 LT 2018.0\n", "129 LU 1961.0\n", "143 MX 1994.0\n", "156 NL 1961.0\n", "158 NZ 1973.0\n", "166 NO 1961.0\n", "177 PL 1996.0\n", "178 PT 1961.0\n", "196 SK 2000.0\n", "197 SI 2010.0\n", "202 KR 1996.0\n", "204 ES 1961.0\n", "215 SE 1961.0\n", "216 CH 1961.0\n", "228 TR 1961.0\n", "235 GB 1961.0\n", "236 US 1961.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "converter.OECDas(\"ISO2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The return value for entries that are not found is by default set to 'not found':" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "ABC not found in ISO3\n", "XXX not found in ISO3\n" ] }, { "data": { "text/plain": [ "['not found', 'AUT', 'not found']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iso3_codes_missing = [\"ABC\", \"AUT\", \"XXX\"]\n", "converter.convert(iso3_codes_missing, src=\"ISO3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but can also be set to something else:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "ABC not found in ISO3\n", "XXX not found in ISO3\n" ] }, { "data": { "text/plain": [ "['missing', 'AUT', 'missing']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"converter.convert(iso3_codes_missing, src=\"ISO3\", not_found=\"missing\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, the entries that are not found can be passed through unchanged by passing None to not_found:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "ABC not found in ISO3\n", "XXX not found in ISO3\n" ] }, { "data": { "text/plain": [ "['ABC', 'AUT', 'XXX']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "converter.convert(iso3_codes_missing, src=\"ISO3\", not_found=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extend the underlying dataset, an additional dataframe (or file) can be passed. Note that all the entries below (name_short, name_official, regex, ISO2 and ISO3) must be specified." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "add_data = pd.DataFrame.from_dict(\n", " {\n", " \"name_short\": [\"xxx country\", \"abc country\"],\n", " \"name_official\": [\"The XXX country\", \"The ABC country\"],\n", " \"regex\": [\"xxx country\", \"abc country\"],\n", " \"ISO2\": [\"xx\", \"ab\"],\n", " \"ISO3\": [\"xxx\", \"abc\"],\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name_short name_official regex ISO2 ISO3\n", "0 xxx country The XXX country xxx country xx xxx\n", "1 abc country The ABC country abc country ab abc" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "add_data" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['abc country', 'Austria', 'xxx country']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extended_converter = coco.CountryConverter(additional_data=add_data)\n", "extended_converter.convert(iso3_codes_missing, src=\"ISO3\", to=\"name_short\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an alternative to an ad hoc dataframe, additional data files can be passed. These must have the same format as the basic data set. \n", "An example can be found here: \n", "https://github.com/IndEcol/country_converter/tree/master/tests/custom_data_example.txt\n", "\n", "The custom data example contains the ISO3 code mapping for Romania before 2002 and switches the regex matching for congo between DR Congo and Congo Republic.\n", "\n", "To use it, pass the path to the additional country file:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# extended_converter = coco.CountryConverter(additional_data=path/to/datafile)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The passed data (file or dataframe) must at least contain the headers 'name_official', 'name_short' and 'regex'. Of course, if the additional data is to be used for a conversion to any other field, that field must also be included. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additionally passed data always overwrites the existing data.\n", "This can be used to adjust coco for datasets with wrong country names. 
\n", "For example, assuming a dataset erroneously switched the ISO2 codes for India (IN) and Indonesia (ID) (that is, using 'ID' for India and 'IN' for Indonesia), one can accommodate that with: " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Duplicated values in column name_short of merged data - keep last one\n", "Duplicated values in column regex of merged data - keep last one\n" ] } ], "source": [ "switched_converter = coco.CountryConverter(\n", " additional_data=pd.DataFrame.from_dict(\n", " {\n", " \"name_short\": [\"India\", \"Indonesia\"],\n", " \"name_official\": [\"India\", \"Indonesia\"],\n", " \"regex\": [\"india\", \"indonesia\"],\n", " \"ISO2\": [\"ID\", \"IN\"],\n", " \"ISO3\": [\"IDN\", \"IND\"],\n", " }\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'India'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "converter.convert(\"IN\", src=\"ISO2\", to=\"name_short\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'India'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "switched_converter.convert(\"ID\", src=\"ISO2\", to=\"name_short\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular expression matching" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The input parameter \"src\" can be set to \"regex\" to use regular expression matching for a given country list. For example:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "some_names = [\n", " \"United Rep. of Tanzania\",\n", " \"Cape Verde\",\n", " \"Burma\",\n", " \"Iran (Islamic Republic of)\",\n", " \"Korea, Republic of\",\n", " \"Dem. People's Rep. 
of Korea\",\n", "]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Tanzania', 'Cabo Verde', 'Myanmar', 'Iran', 'South Korea', 'North Korea']" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coco.convert(names=some_names, src=\"regex\", to=\"name_short\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The regular expressions can also be used to match any list of countries to any other. For example: " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'norway': 'Norway is a Kingdom too',\n", " 'united_states': 'USA',\n", " 'china': 'Peoples Republic of China',\n", " 'taiwan': 'Republic of China'}" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match_these = [\"norway\", \"united_states\", \"china\", \"taiwan\"]\n", "master_list = [\n", " \"USA\",\n", " \"The Swedish Kingdom\",\n", " \"Norway is a Kingdom too\",\n", " \"Peoples Republic of China\",\n", " \"Republic of China\",\n", "]\n", "\n", "coco.match(match_these, master_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the regular expression matches several times, all results are given as a list and a warning is generated:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Multiple matches for name taiwan in list_b\n" ] }, { "data": { "text/plain": [ "{'norway': 'Norway is a Kingdom too',\n", " 'united_states': 'USA',\n", " 'china': 'Peoples Republic of China',\n", " 'taiwan': ['Taiwan, province of china', 'Republic of China']}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match_these = [\"norway\", \"united_states\", \"china\", \"taiwan\"]\n", "master_list = [\n", " \"USA\",\n", " \"The Swedish Kingdom\",\n", " \"Norway is a Kingdom 
too\",\n", " \"Peoples Republic of China\",\n", " \"Taiwan, province of china\",\n", " \"Republic of China\",\n", "]\n", "\n", "coco.match(match_these, master_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameter \"enforce_sublist\" can be set to ensure consistent output:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Multiple matches for name taiwan in list_b\n" ] }, { "data": { "text/plain": [ "{'norway': ['Norway is a Kingdom too'],\n", " 'united_states': ['USA'],\n", " 'china': ['Peoples Republic of China'],\n", " 'taiwan': ['Taiwan, province of china', 'Republic of China']}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coco.match(match_these, master_list, enforce_sublist=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You get a warning if one of the names couldn't be found:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Could not identify some other country in list_a\n" ] }, { "data": { "text/plain": [ "{'norway': 'Norway is a Kingdom too',\n", " 'united_states': 'USA',\n", " 'china': 'Peoples Republic of China',\n", " 'taiwan': 'Republic of China',\n", " 'some other country': 'not_found'}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match_these = [\"norway\", \"united_states\", \"china\", \"taiwan\", \"some other country\"]\n", "master_list = [\n", " \"USA\",\n", " \"The Swedish Kingdom\",\n", " \"Norway is a Kingdom too\",\n", " \"Peoples Republic of China\",\n", " \"Republic of China\",\n", "]\n", "coco.match(match_these, master_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The value for countries that are not found can also be specified: " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { 
"name": "stderr", "output_type": "stream", "text": [ "Could not identify some other country in list_a\n" ] }, { "data": { "text/plain": [ "{'norway': 'Norway is a Kingdom too',\n", " 'united_states': 'USA',\n", " 'china': 'Peoples Republic of China',\n", " 'taiwan': 'Republic of China',\n", " 'some other country': 'its not there'}" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coco.match(match_these, master_list, not_found=\"its not there\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can also be used to pass the not found country to the new classification:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Could not identify some other country in list_a\n" ] }, { "data": { "text/plain": [ "{'norway': 'Norway is a Kingdom too',\n", " 'united_states': 'USA',\n", " 'china': 'Peoples Republic of China',\n", " 'taiwan': 'Republic of China',\n", " 'some other country': 'some other country'}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coco.match(match_these, master_list, not_found=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Internals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Within the new instance, the raw data for the conversion is saved within a pandas dataframe. \n", "This dataframe can be accessed directly with:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " APEC BASIC BRIC CIS Cecilia2050 DACcode EEA EU EU12 EU15 ... UNcode \\\n", "0 NaN NaN NaN NaN RoW 625.0 NaN NaN NaN NaN ... 4.0 \n", "1 NaN NaN NaN NaN RoW NaN NaN NaN NaN NaN ... 248.0 \n", "2 NaN NaN NaN NaN RoW 71.0 NaN NaN NaN NaN ... 8.0 \n", "3 NaN NaN NaN NaN RoW 130.0 NaN NaN NaN NaN ... 12.0 \n", "4 NaN NaN NaN NaN RoW 880.0 NaN NaN NaN NaN ... 16.0 \n", "\n", " UNmember UNregion WIOD ccTLD continent \\\n", "0 1946.0 Southern Asia RoW af Asia \n", "1 NaN Northern Europe RoW ax Europe \n", "2 1955.0 Southern Europe RoW al Europe \n", "3 1962.0 Northern Africa RoW dz Africa \n", "4 NaN Polynesia RoW as Oceania \n", "\n", " name_official name_short obsolete \\\n", "0 Islamic Republic of Afghanistan Afghanistan NaN \n", "1 Åland Islands Aland Islands NaN \n", "2 Republic of Albania Albania NaN \n", "3 People's Democratic Republic of Algeria Algeria NaN \n", "4 American Samoa American Samoa NaN \n", "\n", " regex \n", "0 afghan \n", "1 \\b(a|å)land \n", "2 albania \n", "3 algeria \n", "4 ^(?=.*americ).*samoa \n", "\n", "[5 rows x 47 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "converter.data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataframe can be extended in both directions. The only requirement is to provide unique values for name_short, name_official and regex.\n", "\n", "Internally, the data is saved in country_data.txt as tab-separated values (utf-8 encoded)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, all pandas indexing and matching methods can be used. 
For example, to get new OECD members since 1995 present in a list:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "59 Czech Republic\n", "70 Estonia\n", "101 Hungary\n", "122 Latvia\n", "128 Lithuania\n", "Name: name_short, dtype: object" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "some_countries = [\n", " \"Australia\",\n", " \"Belgium\",\n", " \"Brazil\",\n", " \"Bulgaria\",\n", " \"Cyprus\",\n", " \"Czech Republic\",\n", " \"Denmark\",\n", " \"Estonia\",\n", " \"Finland\",\n", " \"France\",\n", " \"Germany\",\n", " \"Greece\",\n", " \"Hungary\",\n", " \"India\",\n", " \"Indonesia\",\n", " \"Ireland\",\n", " \"Italy\",\n", " \"Japan\",\n", " \"Latvia\",\n", " \"Lithuania\",\n", " \"Luxembourg\",\n", " \"Malta\",\n", " \"Romania\",\n", " \"Russia\",\n", " \"Turkey\",\n", " \"United Kingdom\",\n", " \"United States\",\n", "]\n", "converter.data[\n", " (converter.data.OECD >= 1995) & converter.data.name_short.isin(some_countries)\n", "].name_short" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Further information can be found here: http://pandas.pydata.org/pandas-docs/stable/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Testing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All regular expressions of the country converter are tested for a unique match to name_short and name_official. \n", "Test sets for alternative names found in various databases are also available. \n", "\n", "The test sets are stored in the ``tests/`` subdirectory. 
The tests require pytest.\n", "I recommend rerunning the tests whenever a regular expression is changed. \n", "\n", "To specify a new test set, just add a tab-separated file with the headers \"name\\_short\" and \"name\\_test\" and provide the name (corresponding to the short name in the main classification file) and the alternative name which should be tested (one pair per row in the file). 
If the file name starts with \"test\\_regex\\_ \" it will be automatically recognised by the test functions.\n", "\n", "Please see the file CONTRIBUTING.rst for further information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Konstantin Stadler" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 4 }