{ "cells": [ { "cell_type": "markdown", "id": "6c7e849a", "metadata": { "ExecuteTime": { "end_time": "2021-06-11T09:24:53.643384Z", "start_time": "2021-06-11T09:24:53.622385Z" } }, "source": [ "# CO2 Emission Analysis and Prediction\n", "\n", "## PART 1: Data Integration & Cleaning Notebook\n", "\n", "### AiGlass \n", "`Seeing Through Data`\n", "\n", "© Explore AI\n", "\n", "---\n", "\n", "\n", "## Overview\n", "Carbon dioxide (CO2) is a colourless, odourless and non-poisonous gas formed by combustion of carbon and in the respiration of living organisms and is considered a greenhouse gas.\n", "\n", "CO2 emissions from the burning of fossil fuels are the primary cause of global warming which happens to be one of the biggest threats facing humanity in this day and age. Although there are plenty of other emissions that are emitted on this earth, including Methane, nitrous oxide, and CFCs, none compare to the emission of CO2, and we as humans are mostly to blame for this. For this analysis we will be focosing on CO2 Emissions and its effect on the world we live in as well as some key factors and stats that may play a role in the emission of CO2 globally.\n", "\n", "The world as we know it, is becoming more modernized by the year, and with this becoming all the more POLLUTED.\n", "\n", "**According to UN Official Data States:**\n", "\n", " 1. Over 3 BILLION PEOPLE of the world’s 8 Billion people are affected by degrading ecosystems \n", " 2. Pollution is responsible for some 9 MILLION premature deaths each year\n", " 3. Over 1 million plant and animal species risk extinction\n", " 4. 200 million people could be displaced EACH YEAR by climate disruption by 2050.\n", "\n", "Our Work is a continuation on the analysis done by [Benjamin from Minneapolis, Minnesota, United States](https://www.kaggle.com/lobosi) on Kaggle. The result of his analysis includes;\n", "\n", " * CO2 Emission has been increasing throughout the time period.\n", " * Coal and Petroleum/other liquids have been the dominant energy source for this time period.\n", " * CO2 Emission has been icreasing 1.71% yearlly on average, and has overall increased by 68.14% over the entire time period.\n", " * As of 2019, the average CO2 emission emitted was 10.98 (MMtonnes CO2) for the year.\n", " * The top CO2 emitters over the entire time period have been China and The United States, both exceding nearlly 4x or more the amount of every other country.\n", " * Throughout the time period, China and India have increased there CO2 Emissions the most out of every other country.\n", " * Throughout the time period, Former soviet republics have had the largest decrease in CO2 emission, The United Kingdom and Germany have also decreased there emissions a bit as well.\n", " * Generally speaking, the larger the population, the more CO2 the country will be likely to emit.\n", " * The larger the GDP, the more likely the country will have a high CO2 emission.\n", " * The larger the Energy Consumption of a country, the larger the CO2 emission.\n", " * A high or low Energy Intensity by GDP of Energy Intensity per capita isnt necesarilly predictive of a large CO2 emission, but generally speaking the lower it is the better (the more energy conserved means less CO2 emitted).\n", "\n", "The dataset used is broadly catategorizing all emitters, transportation, lifestyle, industry etc.. into one total amount for each energy type.\n", "\n", "This notebook looks to further in the analysis and building of several Machine Learning Models which can predict accurately the CO2 emission based of several parameters." ] }, { "cell_type": "markdown", "id": "a5c28b72", "metadata": {}, "source": [ "**Warning:** We are not a climate scientist, some things may be inacurate. This is simply just a study on a subject im interested in, allowing me to go deeper into the subject while at the same time imporving my graphing skills. All my sources are at the bottom of the notebook." ] }, { "cell_type": "markdown", "id": "05600c92", "metadata": {}, "source": [ "\n", "\n", "## Table of Contents\n", "\n", "1. Importing Packages\n", "\n", "2. Loading Data\n", "\n", "3. Integrating Additional Data\n", "\n", "4. Extract Integrated Data" ] }, { "cell_type": "markdown", "id": "997462e2", "metadata": {}, "source": [ " \n", "## 1. Importing Packages\n", "Back to Table of Contents\n", "\n", "---\n", " \n", "| ⚡ Description: Importing Packages ⚡ |\n", "| :--------------------------- |\n", "| In this section we will be importing libraries used throughout our analysis and modelling which will allow us to call functions that are not part of your main python program, and briefly discuss them.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "id": "475dbe93", "metadata": { "ExecuteTime": { "end_time": "2021-06-23T10:30:53.800892Z", "start_time": "2021-06-23T10:30:50.215449Z" } }, "outputs": [], "source": [ "# Libraries for Analysis\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Mute warnings\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "id": "f22a6718", "metadata": {}, "source": [ "\n", "## 2. Loading the Data\n", "\n", "Back to Table of Contents\n", "\n", "---\n", " \n", "| ⚡ Description: Loading the data ⚡ |\n", "| :--------------------------- |\n", "| In this section we will be loading the data from the CSV and EXCEL files into Pandas DataFrames. |\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 2, "id": "fbbb6c18", "metadata": { "ExecuteTime": { "end_time": "2021-06-28T08:49:35.311495Z", "start_time": "2021-06-28T08:49:35.295494Z" } }, "outputs": [], "source": [ "# Load Base Data\n", "df = pd.read_csv(\"data/Our_CO2emission_Clean_Data.csv\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "eb99056f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0Countrye_typeYeare_cone_prodGDPPopulationei_capitaei_gdpCO2_emission
00Worldall1988345.56347.4142106.64927545.0870.138.2121163.84
11Worldcoal198896.8798.4842106.64927545.0870.138.218930.92
22Worldnat_gas198871.0171.8542106.64927545.0870.138.213571.68
33Worldpet/oth1988133.45132.4942106.64927545.0870.138.218661.24
44Worldnuclear198819.2319.2342106.64927545.0870.138.210.00
\n", "
" ], "text/plain": [ " Unnamed: 0 Country e_type Year e_con e_prod GDP Population \\\n", "0 0 World all 1988 345.56 347.41 42106.6 4927545.08 \n", "1 1 World coal 1988 96.87 98.48 42106.6 4927545.08 \n", "2 2 World nat_gas 1988 71.01 71.85 42106.6 4927545.08 \n", "3 3 World pet/oth 1988 133.45 132.49 42106.6 4927545.08 \n", "4 4 World nuclear 1988 19.23 19.23 42106.6 4927545.08 \n", "\n", " ei_capita ei_gdp CO2_emission \n", "0 70.13 8.21 21163.84 \n", "1 70.13 8.21 8930.92 \n", "2 70.13 8.21 3571.68 \n", "3 70.13 8.21 8661.24 \n", "4 70.13 8.21 0.00 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View first 5 rows of Loaded Base Data\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "0fd7b630", "metadata": {}, "outputs": [], "source": [ "# Drop Unamed Column\n", "df = df.drop('Unnamed: 0', axis=1)" ] }, { "cell_type": "markdown", "id": "8d9ed280", "metadata": {}, "source": [ "**Column descriptions:**\n", " * **Country** - Country in question\n", " * **Energy_type** - Type of energy source\n", " * **Year** - Year the data was recorded\n", " * **Energy_consumption** - Amount of Consumption for the specific energy source, measured (quad Btu)\n", " * **Energy_production** - Amount of Production for the specific energy source, measured (quad Btu)\n", " * **GDP** - Countries GDP at purchasing power parities, measured (Billion 2015\\$ PPP)\n", " * **Population** - Population of specific Country, measured (Mperson)\n", " * **Energy_intensity_per_capita** - Energy intensity is a measure of the energy inefficiency of an economy. It is calculated as units of energy per unit of capita (capita = individual person), measured (MMBtu/person)\n", " * **Energy_intensity_by_GDP**- Energy intensity is a measure of the energy inefficiency of an economy. It is calculated as units of energy per unit of GDP, measred (1000 Btu/2015\\$ GDP PPP)\n", " * **CO2_emission** - The amount of C02 emitted, measured (MMtonnes CO2)\n", " \n", "It will also be exciting to see how we can enrich the dataset with extra features. Hence, We will adding the following datasets;\n", "1. **Rate of population change** - To see if a possible change in population of a place will result in change in CO2 emission & to What extent\n", "2. **Population density** - Does the density of a population have any effect on CO2 Emission?\n", "3. **GDP splits** - Example, % for agriculture vs manufacturing; Hypothetically, GDP increase due to agricultural/Green activities should oppose the direct correlation of rise in GDP to CO2 Emission\n", "4. **Rate of Deforestation** - As a result of our research on why the Dip in CO2 Emission of the World occurred in 2009 and the sudden rise in 2010 when Energy Type, Pop, and GDP were Constant.... \n", "\n", "From: [REUTERS: Carbon emissions dip in 2009, to jump in 2010 - report](https://www.reuters.com/article/idINIndia-53062920101121) `“The real surprise was that we were expecting a bigger dip due to the financial crisis in terms of fossil fuel emissions,” said Pep Canadell, executive director of the Global Carbon Project and one of the co-authors of the study published in the latest issue of the journal Nature Geoscience. Scientists say rising levels of CO2, the main greenhouse gas, from burning fossil fuels and deforestation is heating up the planet; So we had BURNING FOSSIL FUEL covered but not the impact of DEFORESTATION...` Then FROM: [Measuring Carbon Emissions from Tropical Deforestation: An Overview](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwjLmN7jouz4AhVfYPEDHbXXDhoQFnoECAkQAw&url=https%3A%2F%2Fwww.edf.org%2Fsites%2Fdefault%2Ffiles%2F10333_Measuring_Carbon_Emissions_from_Tropical_Deforestation--An_Overview.pdf&usg=AOvVaw2x4oTsffUsBJzPk0S6DK_y) It states that Tropical deforestation contributes about 20% of annual global greenhouse gas (GHG) emissions and reducing it will be necessary to avoid dangerous climate change. China and the US are the world’s number one and two emitters, but numbers three and four are Indonesia and Brazil, with ~80% and ~70% of their emissions respectively from deforestation.\n", "\n", "5. **Emission per Capita** - Also: probing into the theory that a unit increase in Population is directly impacting on the increase in CO2 Emission, we opted to getting a column which represents per capita emission for each country per energy type which will be plotted against the co2 emission and resulting graph compared with the graph of countries/population of highest emitters. The idea is if the comparism correlates, then our Hypothesis theory of increase in pop is directly propotional to increase in CO2 Emission, is 100% valid, if not; To be modified with extra clause." ] }, { "cell_type": "code", "execution_count": 5, "id": "896867e2", "metadata": {}, "outputs": [], "source": [ "# Load Population Growth (Rate of population change)\n", "pop_df = pd.read_csv(\"data/Population_Growth_from_world_Bank_Integrate.csv\")\n", "# Load Population Density per Country\n", "den_df = pd.read_excel('data/Population_Density_per_country_data.xls')\n", "# Load Manufacturing GDP Contribution (GDP splits)\n", "mgdp_df = pd.read_excel('data/GDP_split_Manufacturing_contribution_data_per_Country.xls')\n", "# Load Agri GDP Contribution (GDP splits)\n", "agdp_df = pd.read_excel('data/GDP_split_Agricultural_contribution_data_per_Country.xls')\n", "# Load Deforestation Impact per country \n", "forest_df = pd.read_csv(\"data/Deforestation_data.csv\") # Forest area (% of land area)\n", "land_df = pd.read_csv(\"data/Land_Area_Data.csv\") # Land Area (sq. km)\n", "# Load Emission per Capita\n", "df['emission_per_cap'] = df['CO2_emission']/df['Population']" ] }, { "cell_type": "code", "execution_count": 6, "id": "71bdc75b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Country Name198819891990199119921993199419951996...2012201320142015201620172018201920202021
0Aruba-1.246457-0.0638791.8168303.8987395.4460526.0486695.6449304.6101563.531110...0.5033850.5832900.5905080.5410480.5028600.4718740.4592660.4374150.428017NaN
1Africa Eastern and Southern2.9871722.9564052.9130592.8710782.8320132.7912942.7513742.7104202.673851...2.7634262.7614962.7504002.7325982.7122182.6909022.6656202.6366662.605427NaN
\n", "

2 rows × 35 columns

\n", "
" ], "text/plain": [ " Country Name 1988 1989 1990 1991 \\\n", "0 Aruba -1.246457 -0.063879 1.816830 3.898739 \n", "1 Africa Eastern and Southern 2.987172 2.956405 2.913059 2.871078 \n", "\n", " 1992 1993 1994 1995 1996 ... 2012 2013 \\\n", "0 5.446052 6.048669 5.644930 4.610156 3.531110 ... 0.503385 0.583290 \n", "1 2.832013 2.791294 2.751374 2.710420 2.673851 ... 2.763426 2.761496 \n", "\n", " 2014 2015 2016 2017 2018 2019 2020 2021 \n", "0 0.590508 0.541048 0.502860 0.471874 0.459266 0.437415 0.428017 NaN \n", "1 2.750400 2.732598 2.712218 2.690902 2.665620 2.636666 2.605427 NaN \n", "\n", "[2 rows x 35 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check\n", "pop_df.head(2)" ] }, { "cell_type": "code", "execution_count": 7, "id": "da6dcb71", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Country NameIndicator Name19881989199019911992199319941995...2012201320142015201620172018201920202021
0ArubaForest area (% of land area)NaNNaN2.3333332.3333332.3333332.3333332.3333332.333333...2.3333332.3333332.3333332.3333332.3333332.3333332.3333332.3333332.333333NaN
\n", "

1 rows × 36 columns

\n", "
" ], "text/plain": [ " Country Name Indicator Name 1988 1989 1990 1991 \\\n", "0 Aruba Forest area (% of land area) NaN NaN 2.333333 2.333333 \n", "\n", " 1992 1993 1994 1995 ... 2012 2013 2014 \\\n", "0 2.333333 2.333333 2.333333 2.333333 ... 2.333333 2.333333 2.333333 \n", "\n", " 2015 2016 2017 2018 2019 2020 2021 \n", "0 2.333333 2.333333 2.333333 2.333333 2.333333 2.333333 NaN \n", "\n", "[1 rows x 36 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check\n", "forest_df.head(1)" ] }, { "cell_type": "markdown", "id": "b09ace62", "metadata": {}, "source": [ "**OBSERVATION:** After Careful studies, we observed that several countries have either been integrated into another or have had their names modified or changed thereby tending to result in lot's of missing values. SEE the table below\n", "\n", "| Country Name in Data | Current Name | Replacement Name |**Special Case** |\n", "| --- | --- | --- |--- |\n", "| Burma | Myanmar | Myanmar |\n", "| Congo-Brazzaville | Republic of the Congo | Congo, Rep. |\n", "| Congo-Kinshasa | Democratic Republic of the Congo | Congo, Dem. Rep. |\n", "| Côte d’Ivoire | --- | Cote d'Ivoire |\n", "| Guadeloupe | overseas département and overseas region of FRANCE | **DROP** |\n", "| Laos | Lao People's Democratic Republic | Lao PDR |\n", "| Macau | special administrative region CHINA | Macao SAR, China |\n", "| Martinique | Island and overseas territorial collectivity of FRANCE | **DROP** |\n", "| North Korea | Korea, Dem. People's Rep. | Korea, Dem. People's Rep. | **Lump Together North & South Korea or DROP** |\n", "| Reunion | Réunion La Réunion (French) | **DROP** |\n", "| Saint Kitts and Nevis | Federation of Saint Christopher and Nevis | St. Kitts and Nevis |\n", "| Saint Lucia | --- | St. Lucia |\n", "| Saint Vincent/Grenadines | --- | St. Vincent and the Grenadines |\n", "| South Korea | Korea, Dem. People's Rep. | --- | **Lump Together North & South Korea or DROP** |\n", "| Taiwan | Republic of China (ROC) | **DROP** |\n", "| The Bahamas | --- | Bahamas, The |\n", "| Kyrgyzstan | Kyrgyz Republic | Kyrgyz Republic |\n", "| Slovakia | Slovak Republic | Slovak Republic |\n", "| Palestinian Territories | Israel | West Bank and Gaza |\n", "\n", "So Let's proceed to making this changes..." ] }, { "cell_type": "code", "execution_count": 8, "id": "c1fc6ff6", "metadata": {}, "outputs": [], "source": [ "'''\n", "This countries Have been merged to Major countries already present\n", "in our Dataset and won't be present in dataset adding the additional \n", "features, hence deleting them.\n", "'''\n", "# Drop Rows of Countries: Guadeloupe, Martinique, Reunion, Taiwan\n", "df = df[df.Country.isin(['Guadeloupe', 'Martinique', 'Reunion', 'Taiwan'])==False]\n", "\n", "# Decalre Replacement Names as Dict\n", "replace_values = {'Burma' : 'Myanmar', \n", " 'Congo-Brazzaville' : 'Congo, Rep.', \n", " 'Congo-Kinshasa' : 'Congo, Dem. Rep.', \n", " \"Côte d’Ivoire\": \"Cote d'Ivoire\",\n", " \"Laos\": 'Lao PDR', \n", " 'Macau': 'Macao SAR, China', \n", " 'Saint Kitts and Nevis': 'St. Kitts and Nevis',\n", " 'Saint Lucia': 'St. Lucia', \n", " 'Saint Vincent/Grenadines': 'St. Vincent and the Grenadines',\n", " 'The Bahamas': 'Bahamas, The', 'Kyrgyzstan': 'Kyrgyz Republic', \n", " 'Slovakia': 'Slovak Republic',\n", " 'Palestinian Territories': 'West Bank and Gaza'\n", " } \n", "# Apply Replacement Names\n", "df = df.replace({\"Country\": replace_values}) " ] }, { "cell_type": "code", "execution_count": 9, "id": "f494fa33", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Countrye_typeYeare_cone_prodGDPPopulationei_capitaei_gdpCO2_emissionemission_per_cap
150Myanmarall19880.090.0826.3540085.62.133.234.830.00012
\n", "
" ], "text/plain": [ " Country e_type Year e_con e_prod GDP Population ei_capita \\\n", "150 Myanmar all 1988 0.09 0.08 26.35 40085.6 2.13 \n", "\n", " ei_gdp CO2_emission emission_per_cap \n", "150 3.23 4.83 0.00012 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check\n", "df[df['Country'] == 'Myanmar'].head(1)" ] }, { "cell_type": "markdown", "id": "81132ab3", "metadata": {}, "source": [ "\n", "## 3. Integrating Additional Data\n", "\n", "Back to Table of Contents\n", "\n", "---\n", " \n", "| ⚡ Description: Exploratory data analysis ⚡ |\n", "| :--------------------------- |\n", "| In this section, we will be Integrating and Engineering our features with areas that may prove vaible in or analysis on the Emmission of CO2. |\n", "\n", "---\n", "Hence let's proceed to Integrating additional features." ] }, { "cell_type": "markdown", "id": "bbc50108", "metadata": {}, "source": [ "#### 3.1 Integrating Population Growth to Base DF" ] }, { "cell_type": "code", "execution_count": 10, "id": "5ce1df03", "metadata": {}, "outputs": [], "source": [ "# Defining function that Integrates Pop_Growth\n", "def add_pop_growth(row):\n", " \n", " val = pop_df.loc[pop_df[\"Country Name\"].str.contains(row['Country']), str(row['Year'])]\n", " return round(float(list(val)[0]),3) if len(val)>0 else np.NaN" ] }, { "cell_type": "code", "execution_count": 11, "id": "668705d0", "metadata": {}, "outputs": [], "source": [ "# Applying Fuction\n", "df['pop_growth'] = df.apply(add_pop_growth, axis=1)" ] }, { "cell_type": "code", "execution_count": 12, "id": "82582c42", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['North Korea', 'South Korea', 'New Zealand', 'Kuwait', 'Eritrea'],\n", " dtype=object)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check for missing values\n", "df[df[\"pop_growth\"].isnull()][\"Country\"].unique()" ] }, { "cell_type": "markdown", "id": "b0c63a9f", "metadata": {}, "source": [ "`North Korea & South Korea` needs to be collapse as one Country called `Korea, Dem. People's Rep`. While for 'New Zealand', 'Kuwait', 'Eritrea' the missing values will have to be delt with conventionally.\n", "\n", "| Country | Year's with Missing Values | **Action** |\n", "| --- | --- | --- |\n", "| New Zealand | 1991 only | **Fill with mean/median** |\n", "| Kuwait | 1992 to 1995 | **Fill with mean/median** |\n", "| Eritrea | 2012 to 2019 | **Fill with mean/median** |\n", "\n", "#### 3.2 Integrating Population Density to Modified DF" ] }, { "cell_type": "code", "execution_count": 13, "id": "fa072339", "metadata": {}, "outputs": [], "source": [ "# Defining function that Integrates Pop_Density\n", "def add_pop_den(row):\n", " \n", " val = den_df.loc[den_df[\"Country Name\"].str.contains(row['Country']), str(row['Year'])]\n", " return round(float(list(val)[0]),3) if len(val)>0 else np.NaN" ] }, { "cell_type": "code", "execution_count": 14, "id": "6768eb0a", "metadata": {}, "outputs": [], "source": [ "# Applying Fuction\n", "df['pop_density'] = df.apply(add_pop_den, axis=1)" ] }, { "cell_type": "code", "execution_count": 15, "id": "a63d4895", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Luxembourg', 'North Korea', 'South Korea', 'Kuwait', 'Kosovo'],\n", " dtype=object)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check for missing values\n", "df[df[\"pop_density\"].isnull()][\"Country\"].unique()" ] }, { "cell_type": "markdown", "id": "9b655d7b", "metadata": {}, "source": [ "So we see a couple of additional missing values present in the population density Column for Countries 'Luxembourg', 'North Korea', 'South Korea', 'Kuwait', 'Kosovo'\n", "\n", "We will look to treat this later\n", "#### 3.3 Integrating GDP Split (Agric & Manuf) to Modified DF\n", "The calculation of a country's GDP encompasses all private and public consumption, government outlays, investments, additions to private inventories, paid-in construction costs, and the foreign balance of trade. \n", "\n", "We will be focusing on Just the GDP Contributions of the Manufacturing & Agricultural Industries per country per time." ] }, { "cell_type": "code", "execution_count": 16, "id": "13c8c188", "metadata": {}, "outputs": [], "source": [ "# Defining function that Integrates Manufacturing GDP Contribution\n", "def add_gdp_manu(row):\n", " \n", " val = mgdp_df.loc[mgdp_df[\"Country Name\"].str.contains(row['Country']), str(row['Year'])]\n", " return round(float(list(val)[0]), 3) if len(val)>0 else np.NaN\n", "\n", "# Defining function that Integrates Agricuture GDP Contribution\n", "def add_gdp_agri(row):\n", " \n", " val = agdp_df.loc[agdp_df[\"Country Name\"].str.contains(row['Country']), str(row['Year'])]\n", " return round(float(list(val)[0]), 3) if len(val)>0 else np.NaN" ] }, { "cell_type": "code", "execution_count": 17, "id": "441eaa7e", "metadata": {}, "outputs": [], "source": [ "# Applying Fuction Respectively\n", "df['manuf_GDP'] = df.apply(add_gdp_manu, axis=1)\n", "df['agri_GDP'] = df.apply(add_gdp_agri, axis=1)" ] }, { "cell_type": "markdown", "id": "7bbb0858", "metadata": {}, "source": [ "It's also worthy of Note that the `manuf_GDP & agri_GDP` are percentage contribution of the overal GDP, Hence we'll have to extract the value for computation." ] }, { "cell_type": "code", "execution_count": 18, "id": "30027310", "metadata": {}, "outputs": [], "source": [ "df['Manuf_GDP'] = (df['manuf_GDP']/100)*df['GDP']\n", "df['Agric_GDP'] = (df['agri_GDP']/100)*df['GDP']" ] }, { "cell_type": "markdown", "id": "8e9cea0c", "metadata": {}, "source": [ "#### 3.4 Integrating Deforestation Data to Modified DF" ] }, { "cell_type": "code", "execution_count": 19, "id": "a61e281c", "metadata": {}, "outputs": [], "source": [ "# Defining function that Integrates Forest Area % of Land & Land Area Sq.M Data\n", "def add_forest(row):\n", " \n", " val = forest_df.loc[forest_df[\"Country Name\"].str.contains(row['Country']), str(row['Year'])]\n", " return round(float(list(val)[0]), 3) if len(val)>0 else np.NaN\n", "\n", "def add_land(row):\n", " \n", " val = land_df.loc[land_df[\"Country Name\"].str.contains(row['Country']), str(row['Year'])]\n", " return round(float(list(val)[0]), 3) if len(val)>0 else np.NaN" ] }, { "cell_type": "code", "execution_count": 20, "id": "6ae127fe", "metadata": {}, "outputs": [], "source": [ "# Applying Fuction Respectively\n", "df['Forest'] = df.apply(add_forest, axis=1)\n", "df['Land'] = df.apply(add_land, axis=1)\n", "# Get Exact Forest area in SqM\n", "df['Deforestation'] = (df['Forest']/100)*df['Land'] \n", "\n", "# Note: we cleaned all land info below 1990 as to avoid errors since Forest data starts from 1990\n", "# Drop redundant Forest & Land Columns\n", "df = df.drop(['Forest', 'Land'], axis=1)" ] }, { "cell_type": "markdown", "id": "5d141498", "metadata": {}, "source": [ "The forest data begins from year 1990, hence we will be experiencing missing values across all Countries for the years 1988 & 1989 and maybe a few others within the dataset.\n", "\n", "#### 3.5 Extracting Emission per Capita\n", "This refers to the per capita/person emission for each country per energy type" ] }, { "cell_type": "code", "execution_count": 21, "id": "81f83594", "metadata": {}, "outputs": [], "source": [ "# Adding the emission_per_cap column\n", "df['emission_per_cap'] = df['CO2_emission']/df['Population']" ] }, { "cell_type": "code", "execution_count": 22, "id": "30584df6", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Let's Reposition our Target variable \n", "CO2 Emission to the End of our Dataframe\n", "\"\"\"\n", "# Seperate Other Features From Target Variable\n", "others = df.drop(['CO2_emission', 'emission_per_cap'], axis=1)\n", "co = df[['emission_per_cap', 'CO2_emission']]\n", "# Delete df\n", "del df\n", "# concat both Tables into fresh df\n", "df = pd.concat([others, co], axis=1)\n", "# Dropping % version of agric & Manufac GDP\n", "df = df.drop(['manuf_GDP', 'agri_GDP'], axis=1)" ] }, { "cell_type": "code", "execution_count": 23, "id": "c418c336", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Countrye_typeYeare_cone_prodGDPPopulationei_capitaei_gdppop_growthpop_densityManuf_GDPAgric_GDPDeforestationemission_per_capCO2_emission
0Worldall1988345.56347.4142106.64927545.0870.138.211.7739.285NaN2233.75513NaN0.00429521163.84
1Worldcoal198896.8798.4842106.64927545.0870.138.211.7739.285NaN2233.75513NaN0.0018128930.92
2Worldnat_gas198871.0171.8542106.64927545.0870.138.211.7739.285NaN2233.75513NaN0.0007253571.68
3Worldpet/oth1988133.45132.4942106.64927545.0870.138.211.7739.285NaN2233.75513NaN0.0017588661.24
4Worldnuclear198819.2319.2342106.64927545.0870.138.211.7739.285NaN2233.75513NaN0.0000000.00
\n", "
" ], "text/plain": [ " Country e_type Year e_con e_prod GDP Population ei_capita \\\n", "0 World all 1988 345.56 347.41 42106.6 4927545.08 70.13 \n", "1 World coal 1988 96.87 98.48 42106.6 4927545.08 70.13 \n", "2 World nat_gas 1988 71.01 71.85 42106.6 4927545.08 70.13 \n", "3 World pet/oth 1988 133.45 132.49 42106.6 4927545.08 70.13 \n", "4 World nuclear 1988 19.23 19.23 42106.6 4927545.08 70.13 \n", "\n", " ei_gdp pop_growth pop_density Manuf_GDP Agric_GDP Deforestation \\\n", "0 8.21 1.77 39.285 NaN 2233.75513 NaN \n", "1 8.21 1.77 39.285 NaN 2233.75513 NaN \n", "2 8.21 1.77 39.285 NaN 2233.75513 NaN \n", "3 8.21 1.77 39.285 NaN 2233.75513 NaN \n", "4 8.21 1.77 39.285 NaN 2233.75513 NaN \n", "\n", " emission_per_cap CO2_emission \n", "0 0.004295 21163.84 \n", "1 0.001812 8930.92 \n", "2 0.000725 3571.68 \n", "3 0.001758 8661.24 \n", "4 0.000000 0.00 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "a28c9e54", "metadata": {}, "source": [ "\n", "## 4. Extract Integrated Data\n", "\n", "Back to Table of Contents" ] }, { "cell_type": "code", "execution_count": 24, "id": "ea29e5c6", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Extract & Save Data as CSV\n", "Unhash: To Run\n", "\"\"\" \n", "\n", "# df.to_csv('data/Our_CO2emission_Analysis_Data.csv')" ] }, { "cell_type": "markdown", "id": "89a5e130", "metadata": {}, "source": [ "### Kindly Proceed to Notebook PART 2 For further engineering and Exploratory Analysis" ] }, { "cell_type": "code", "execution_count": null, "id": "539febb7", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }