# CO2 Emission Analysis and Prediction

## PART 1: Data Integration & Cleaning Notebook

### AiGlass 
`Seeing Through Data`

© Explore AI

---
<img src="https://dailyenergyinsider.com/wp-content/uploads/2022/05/shutterstock_2076711193.jpg" width=70%/>

## Overview
Carbon dioxide (CO2) is a colourless, odourless and non-poisonous gas formed by the combustion of carbon and by the respiration of living organisms. It is considered a greenhouse gas.

CO2 emissions from the burning of fossil fuels are the primary cause of global warming, which is one of the biggest threats facing humanity today. Although there are plenty of other emissions, including methane, nitrous oxide, and CFCs, none compare to the emission of CO2, and we as humans are mostly to blame for this. For this analysis we will be focusing on CO2 emissions and their effect on the world we live in, as well as some key factors and statistics that may play a role in the emission of CO2 globally.

The world as we know it is becoming more modernized by the year, and with this it is becoming all the more POLLUTED.

**According to official UN data:**

    1. Over 3 BILLION PEOPLE of the world’s 8 Billion people are affected by degrading ecosystems 
    2. Pollution is responsible for some 9 MILLION premature deaths each year
    3. Over 1 million plant and animal species risk extinction
    4. 200 million people could be displaced EACH YEAR by climate disruption by 2050.

Our work is a continuation of the analysis done by [Benjamin from Minneapolis, Minnesota, United States](https://www.kaggle.com/lobosi) on Kaggle. The results of his analysis include:

 * CO2 Emission has been increasing throughout the time period.
 * Coal and Petroleum/other liquids have been the dominant energy source for this time period.
 * CO2 Emission has been increasing by 1.71% yearly on average, and has increased by 68.14% overall across the entire time period.
 * As of 2019, the average CO2 emission emitted was 10.98 (MMtonnes CO2) for the year.
 * The top CO2 emitters over the entire time period have been China and the United States, each emitting nearly 4x or more the amount of every other country.
 * Throughout the time period, China and India have increased their CO2 Emissions more than any other country.
 * Throughout the time period, former Soviet republics have had the largest decrease in CO2 emission. The United Kingdom and Germany have also decreased their emissions somewhat.
 * Generally speaking, the larger the population, the more CO2 the country will be likely to emit.
 * The larger the GDP, the more likely the country will have a high CO2 emission.
 * The larger the Energy Consumption of a country, the larger the CO2 emission.
 * A high or low Energy Intensity by GDP or Energy Intensity per capita isn't necessarily predictive of a large CO2 emission, but generally speaking the lower it is the better (the more energy conserved, the less CO2 emitted).

The dataset used broadly categorizes all emitters (transportation, lifestyle, industry, etc.) into one total amount for each energy type.

This notebook looks to extend that analysis and to build several Machine Learning models that can accurately predict CO2 emission from several parameters.

**Warning:** We are not climate scientists, so some things may be inaccurate. This is simply a study of a subject we are interested in, allowing us to go deeper into the subject while at the same time improving our graphing skills. All our sources are at the bottom of the notebook.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Integrating Additional Data</a>

<a href=#four>4. Extract Integrated Data</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section we will import the libraries used throughout our analysis and modelling, which give us access to functions that are not part of core Python, and briefly discuss them. |

---

In [1]:
# Libraries for Analysis
import numpy as np
import pandas as pd

# Mute warnings
import warnings
warnings.filterwarnings('ignore')

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section we will be loading the data from the CSV and Excel files into pandas DataFrames. |

---

In [2]:
# Load Base Data
df = pd.read_csv("data/Our_CO2emission_Clean_Data.csv")

In [3]:
# View first 5 rows of Loaded Base Data
df.head()

Unnamed: 0.1,Unnamed: 0,Country,e_type,Year,e_con,e_prod,GDP,Population,ei_capita,ei_gdp,CO2_emission
0,0,World,all,1988,345.56,347.41,42106.6,4927545.08,70.13,8.21,21163.84
1,1,World,coal,1988,96.87,98.48,42106.6,4927545.08,70.13,8.21,8930.92
2,2,World,nat_gas,1988,71.01,71.85,42106.6,4927545.08,70.13,8.21,3571.68
3,3,World,pet/oth,1988,133.45,132.49,42106.6,4927545.08,70.13,8.21,8661.24
4,4,World,nuclear,1988,19.23,19.23,42106.6,4927545.08,70.13,8.21,0.0


In [4]:
# Drop Unnamed Column
df = df.drop('Unnamed: 0', axis=1)
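
A column named `Unnamed: 0` typically appears when a DataFrame was saved to CSV together with its index and then read back. The sketch below reproduces this on a tiny stand-in frame (values copied from the first rows shown above) and shows two common ways to avoid the extra column. The names `sample`, `reloaded_a` and `reloaded_b` are ours, used only for illustration.

```python
import io
import pandas as pd

# Tiny stand-in frame (values copied from the first two rows shown above).
sample = pd.DataFrame({
    "Country": ["World", "World"],
    "e_type": ["all", "coal"],
    "CO2_emission": [21163.84, 8930.92],
})

# By default to_csv() writes the index as an unnamed first column ...
csv_with_index = sample.to_csv()
# ... which read_csv() then loads back as "Unnamed: 0".
reloaded = pd.read_csv(io.StringIO(csv_with_index))
print(list(reloaded.columns))

# Option A: tell read_csv that the first column is the index.
reloaded_a = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option B: do not write the index in the first place.
reloaded_b = pd.read_csv(io.StringIO(sample.to_csv(index=False)))
print(list(reloaded_a.columns), list(reloaded_b.columns))
```

Either option would make the explicit `drop` unnecessary; dropping the column afterwards, as done here, gives the same result.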

**Column descriptions** (the loaded data uses the short column names given in brackets):
 * **Country** - Country in question
 * **Energy_type** (`e_type`) - Type of energy source
 * **Year** - Year the data was recorded
 * **Energy_consumption** (`e_con`) - Amount of consumption for the specific energy source, measured in quad Btu
 * **Energy_production** (`e_prod`) - Amount of production for the specific energy source, measured in quad Btu
 * **GDP** - Country's GDP at purchasing power parities, measured in billion 2015\$ PPP
 * **Population** - Population of the specific country, measured in Mperson
 * **Energy_intensity_per_capita** (`ei_capita`) - Energy intensity is a measure of the energy inefficiency of an economy. Here it is calculated as units of energy per capita (capita = individual person), measured in MMBtu/person
 * **Energy_intensity_by_GDP** (`ei_gdp`) - Energy intensity is a measure of the energy inefficiency of an economy. Here it is calculated as units of energy per unit of GDP, measured in 1000 Btu/2015\$ GDP PPP
 * **CO2_emission** - The amount of CO2 emitted, measured in MMtonnes CO2
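
As a quick sanity check of the two energy-intensity definitions, the sketch below recomputes them for the `World` / `all` / 1988 row shown earlier. One assumption is ours and not stated by the source: the arithmetic only reproduces the tabulated values if the `Population` figures are read as thousands of persons, so treat that reading as a hypothesis rather than a documented unit.

```python
# Values from the World / all / 1988 row of the base data.
e_con_quad_btu = 345.56      # energy consumption, quad Btu (1 quad = 1e15 Btu)
gdp_billion_usd = 42106.6    # GDP, billion 2015$ PPP
population = 4927545.08      # ASSUMPTION: thousands of persons

btu = e_con_quad_btu * 1e15

# Energy per person, expressed in million Btu (MMBtu) per person.
ei_capita = btu / (population * 1e3) / 1e6
# Energy per dollar of GDP, expressed in thousand Btu per 2015$ PPP.
ei_gdp = btu / (gdp_billion_usd * 1e9) / 1e3

print(round(ei_capita, 2), round(ei_gdp, 2))  # 70.13 8.21
```

Both results agree with the `ei_capita` and `ei_gdp` values in that row.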
 
It will also be exciting to see how we can enrich the dataset with extra features. Hence, we will be adding the following datasets:
1. **Rate of population change** - To see whether a change in the population of a place results in a change in CO2 emission, and to what extent.
2. **Population density** - Does the density of a population have any effect on CO2 emission?
3. **GDP splits** - For example, % for agriculture vs manufacturing. Hypothetically, a GDP increase due to agricultural/green activities should run counter to the direct correlation between rising GDP and CO2 emission.
4. **Rate of Deforestation** - Added as a result of our research into why the world's CO2 emission dipped in 2009 and rose suddenly in 2010 while energy type, population, and GDP were constant.

From [REUTERS: Carbon emissions dip in 2009, to jump in 2010 - report](https://www.reuters.com/article/idINIndia-53062920101121): `“The real surprise was that we were expecting a bigger dip due to the financial crisis in terms of fossil fuel emissions,” said Pep Canadell, executive director of the Global Carbon Project and one of the co-authors of the study published in the latest issue of the journal Nature Geoscience. Scientists say rising levels of CO2, the main greenhouse gas, from burning fossil fuels and deforestation is heating up the planet.`

So we had BURNING FOSSIL FUEL covered, but not the impact of DEFORESTATION.

Then, from [Measuring Carbon Emissions from Tropical Deforestation: An Overview](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwjLmN7jouz4AhVfYPEDHbXXDhoQFnoECAkQAw&url=https%3A%2F%2Fwww.edf.org%2Fsites%2Fdefault%2Ffiles%2F10333_Measuring_Carbon_Emissions_from_Tropical_Deforestation--An_Overview.pdf&usg=AOvVaw2x4oTsffUsBJzPk0S6DK_y): tropical deforestation contributes about 20% of annual global greenhouse gas (GHG) emissions, and reducing it will be necessary to avoid dangerous climate change. China and the US are the world’s number one and two emitters, but numbers three and four are Indonesia and Brazil, with ~80% and ~70% of their emissions respectively coming from deforestation.

5. **Emission per Capita** - To probe the theory that a unit increase in population directly increases CO2 emission, we opted to derive a column representing per capita emission for each country per energy type. This will be plotted against CO2 emission, and the resulting graph compared with the graph of the countries/populations of the highest emitters. The idea is that if the comparison correlates, our hypothesis that an increase in population is directly proportional to an increase in CO2 emission is supported. If not, the hypothesis will be modified with an extra clause.

In [5]:
# Load Population Growth (Rate of population change)
pop_df = pd.read_csv("data/Population_Growth_from_world_Bank_Integrate.csv")
# Load Population Density per Country
den_df = pd.read_excel('data/Population_Density_per_country_data.xls')
# Load Manufacturing GDP Contribution (GDP splits)
mgdp_df = pd.read_excel('data/GDP_split_Manufacturing_contribution_data_per_Country.xls')
# Load Agri GDP Contribution (GDP splits)
agdp_df = pd.read_excel('data/GDP_split_Agricultural_contribution_data_per_Country.xls')
# Load Deforestation Impact per country 
forest_df = pd.read_csv("data/Deforestation_data.csv") # Forest area (% of land area)
land_df = pd.read_csv("data/Land_Area_Data.csv") # Land Area (sq. km)
# Compute Emission per Capita (CO2 emission divided by population)
df['emission_per_cap'] = df['CO2_emission']/df['Population']

In [6]:
# Check
pop_df.head(2)

Unnamed: 0,Country Name,1988,1989,1990,1991,1992,1993,1994,1995,1996,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Aruba,-1.246457,-0.063879,1.81683,3.898739,5.446052,6.048669,5.64493,4.610156,3.53111,...,0.503385,0.58329,0.590508,0.541048,0.50286,0.471874,0.459266,0.437415,0.428017,
1,Africa Eastern and Southern,2.987172,2.956405,2.913059,2.871078,2.832013,2.791294,2.751374,2.71042,2.673851,...,2.763426,2.761496,2.7504,2.732598,2.712218,2.690902,2.66562,2.636666,2.605427,


In [7]:
# Check
forest_df.head(1)

Unnamed: 0,Country Name,Indicator Name,1988,1989,1990,1991,1992,1993,1994,1995,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Aruba,Forest area (% of land area),,,2.333333,2.333333,2.333333,2.333333,2.333333,2.333333,...,2.333333,2.333333,2.333333,2.333333,2.333333,2.333333,2.333333,2.333333,2.333333,


**OBSERVATION:** After careful study, we observed that several countries have either been integrated into another country or have had their names modified or changed, which tends to result in lots of missing values. See the table below.

| Country Name in Data | Current Name | Replacement Name | **Special Case** |
| --- | --- | --- | --- |
| Burma | Myanmar | Myanmar | |
| Congo-Brazzaville | Republic of the Congo | Congo, Rep. | |
| Congo-Kinshasa | Democratic Republic of the Congo | Congo, Dem. Rep. | |
| Côte d’Ivoire | --- | Cote d'Ivoire | |
| Guadeloupe | Overseas département and overseas region of FRANCE | **DROP** | |
| Laos | Lao People's Democratic Republic | Lao PDR | |
| Macau | Special administrative region of CHINA | Macao SAR, China | |
| Martinique | Island and overseas territorial collectivity of FRANCE | **DROP** | |
| North Korea | Korea, Dem. People's Rep. | Korea, Dem. People's Rep. | **Lump Together North & South Korea or DROP** |
| Reunion | Réunion (French: La Réunion) | **DROP** | |
| Saint Kitts and Nevis | Federation of Saint Christopher and Nevis | St. Kitts and Nevis | |
| Saint Lucia | --- | St. Lucia | |
| Saint Vincent/Grenadines | --- | St. Vincent and the Grenadines | |
| South Korea | Korea, Rep. | --- | **Lump Together North & South Korea or DROP** |
| Taiwan | Republic of China (ROC) | **DROP** | |
| The Bahamas | --- | Bahamas, The | |
| Kyrgyzstan | Kyrgyz Republic | Kyrgyz Republic | |
| Slovakia | Slovak Republic | Slovak Republic | |
| Palestinian Territories | Israel | West Bank and Gaza | |
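
A list like the one above can be drafted programmatically by comparing the country names in the two sources. The following is a minimal sketch on made-up miniature tables; `base` and `world_bank` are hypothetical stand-ins for the base DataFrame and one of the additional tables, not the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables, for illustration only.
base = pd.DataFrame({"Country": ["Burma", "Laos", "Kenya", "Taiwan"]})
world_bank = pd.DataFrame({"Country Name": ["Myanmar", "Lao PDR", "Kenya"]})

# Names in the base data with no exact counterpart in the other table.
unmatched = sorted(set(base["Country"]) - set(world_bank["Country Name"]))
print(unmatched)  # ['Burma', 'Laos', 'Taiwan']
```

Each unmatched name then needs a manual decision (rename or drop), which is what the table records.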

So let's proceed to making these changes...

In [8]:
'''
These countries have been merged into major countries already present
in our dataset and won't be present in the datasets supplying the
additional features, hence we delete them.
'''
# Drop Rows of Countries: Guadeloupe, Martinique, Reunion, Taiwan
df = df[~df['Country'].isin(['Guadeloupe', 'Martinique', 'Reunion', 'Taiwan'])]

# Declare Replacement Names as Dict
replace_values = {'Burma' : 'Myanmar', 
                  'Congo-Brazzaville' : 'Congo, Rep.', 
                  'Congo-Kinshasa' : 'Congo, Dem. Rep.', 
                  "Côte d’Ivoire": "Cote d'Ivoire",
                  "Laos": 'Lao PDR', 
                  'Macau': 'Macao SAR, China', 
                  'Saint Kitts and Nevis': 'St. Kitts and Nevis',
                  'Saint Lucia': 'St. Lucia', 
                  'Saint Vincent/Grenadines': 'St. Vincent and the Grenadines',
                  'The Bahamas': 'Bahamas, The', 'Kyrgyzstan': 'Kyrgyz Republic', 
                  'Slovakia': 'Slovak Republic',
                  'Palestinian Territories': 'West Bank and Gaza'
                 }    
# Apply Replacement Names
df = df.replace({"Country": replace_values}) 

In [9]:
# Check
df[df['Country'] == 'Myanmar'].head(1)

Unnamed: 0,Country,e_type,Year,e_con,e_prod,GDP,Population,ei_capita,ei_gdp,CO2_emission,emission_per_cap
150,Myanmar,all,1988,0.09,0.08,26.35,40085.6,2.13,3.23,4.83,0.00012


<a id="three"></a>
## 3. Integrating Additional Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Integrating additional data ⚡ |
| :--------------------------- |
| In this section, we will be integrating and engineering features in areas that may prove valuable in our analysis of the emission of CO2. |

---
Hence, let's proceed to integrating the additional features.

#### 3.1 Integrating Population Growth to Base DF

In [10]:
# Defining function that Integrates Pop_Growth
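def add_pop_growth(row):
    # Note: str.contains does substring (regex) matching on the country name,
    # and the first matching row is used.
    val = pop_df.loc[pop_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    return round(float(list(val)[0]), 3) if len(val) > 0 else np.nan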
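
One thing to keep in mind with this lookup is that `str.contains` matches substrings and interprets its argument as a regular expression by default. The sketch below, on a made-up list of names, shows how that differs from exact matching. It is only an illustration of the pandas behaviour, not a claim about which rows are affected in the real tables.

```python
import pandas as pd

names = pd.Series(["Niger", "Nigeria", "Guinea", "Guinea-Bissau", "Congo, Rep."])

# Substring (regex) matching: "Niger" also matches "Nigeria".
substring_hits = names[names.str.contains("Niger")].tolist()
print(substring_hits)  # ['Niger', 'Nigeria']

# Exact matching returns only the intended row.
exact_hits = names[names == "Niger"].tolist()
print(exact_hits)  # ['Niger']

# regex=False treats characters such as "." literally instead of as wildcards.
literal_hits = names[names.str.contains("Congo, Rep.", regex=False)].tolist()
print(literal_hits)  # ['Congo, Rep.']
```

Exact matching is stricter, so it only works once the names in both tables agree fully. Partial matching is more forgiving but can pick up an unintended row.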

In [11]:
# Applying Function
df['pop_growth'] = df.apply(add_pop_growth, axis=1)

In [12]:
# Check for missing values
df[df["pop_growth"].isnull()]["Country"].unique()

array(['North Korea', 'South Korea', 'New Zealand', 'Kuwait', 'Eritrea'],
      dtype=object)

`North Korea` and `South Korea` need to be collapsed into one country called `Korea, Dem. People's Rep.`, while for 'New Zealand', 'Kuwait' and 'Eritrea' the missing values will have to be dealt with conventionally.

| Country | Years with Missing Values | **Action** |
| --- | --- | --- |
| New Zealand | 1991 only | **Fill with mean/median** |
| Kuwait | 1992 to 1995  | **Fill with mean/median** |
| Eritrea | 2012 to 2019 | **Fill with mean/median** |
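
The table above can be derived, and the proposed fill applied, along the following lines. This is a minimal sketch on made-up numbers: the frame `toy` and its values are hypothetical, and the per-country median is just one of the two options named in the table.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only (not taken from the real data).
toy = pd.DataFrame({
    "Country": ["Kuwait"] * 5 + ["Kenya"] * 5,
    "Year": [1990, 1991, 1992, 1993, 1994] * 2,
    "pop_growth": [2.0, 4.0, np.nan, np.nan, 6.0, 3.0, 3.0, 3.0, np.nan, 3.0],
})

# Which years are missing for each country?
missing_years = (
    toy[toy["pop_growth"].isna()]
    .groupby("Country")["Year"]
    .agg(["min", "max", "count"])
)
print(missing_years)

# Fill each gap with that country's own median growth rate.
country_median = toy.groupby("Country")["pop_growth"].transform("median")
toy["pop_growth_filled"] = toy["pop_growth"].fillna(country_median)
print(toy)
```

Filling per country rather than with a global statistic keeps one country's values from leaking into another's.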

#### 3.2 Integrating Population Density to Modified DF

In [13]:
# Defining function that Integrates Pop_Density
def add_pop_den(row):
    
    val = den_df.loc[den_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    return round(float(list(val)[0]), 3) if len(val) > 0 else np.nan

In [14]:
# Applying Function
df['pop_density'] = df.apply(add_pop_den, axis=1)
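
As an aside, the same kind of integration can be expressed without a row-wise `apply`, by reshaping the wide table (one column per year) into long form and merging on country and year. The sketch below uses made-up miniature tables. `base`, `wide` and their values are hypothetical, and unlike the functions in this notebook it matches country names exactly.

```python
import pandas as pd

# Hypothetical miniature tables, for illustration only.
base = pd.DataFrame({
    "Country": ["Kenya", "Kenya", "Myanmar"],
    "Year": [1988, 1989, 1988],
})
wide = pd.DataFrame({
    "Country Name": ["Kenya", "Myanmar"],
    "1988": [40.1, 60.2],
    "1989": [41.5, 61.0],
})

# Reshape the wide table into one row per (country, year).
long = wide.melt(id_vars="Country Name", var_name="Year", value_name="pop_density")
long["Year"] = long["Year"].astype(int)
long = long.rename(columns={"Country Name": "Country"})

# A left merge keeps every base row and leaves NaN where no match exists.
merged = base.merge(long, on=["Country", "Year"], how="left")
print(merged)
```

With the real tables, any non-year columns (for example `Unnamed: 0` or `Indicator Name`) would need to be dropped before the `melt`.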

In [15]:
# Check for missing values
df[df["pop_density"].isnull()]["Country"].unique()

array(['Luxembourg', 'North Korea', 'South Korea', 'Kuwait', 'Kosovo'],
      dtype=object)

So we see a couple of additional missing values in the population density column, for the countries 'Luxembourg', 'North Korea', 'South Korea', 'Kuwait' and 'Kosovo'.

We will look to treat these later.

#### 3.3 Integrating GDP Split (Agric & Manuf) to Modified DF
The calculation of a country's GDP encompasses all private and public consumption, government outlays, investments, additions to private inventories, paid-in construction costs, and the foreign balance of trade. 

We will be focusing on just the GDP contributions of the manufacturing and agricultural industries, per country per year.

In [16]:
# Defining function that Integrates Manufacturing GDP Contribution
def add_gdp_manu(row):
    
    val = mgdp_df.loc[mgdp_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    return round(float(list(val)[0]), 3) if len(val) > 0 else np.nan

# Defining function that Integrates Agriculture GDP Contribution
def add_gdp_agri(row):
    
    val = agdp_df.loc[agdp_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    return round(float(list(val)[0]), 3) if len(val) > 0 else np.nan

In [17]:
# Applying Functions Respectively
df['manuf_GDP'] = df.apply(add_gdp_manu, axis=1)
df['agri_GDP'] = df.apply(add_gdp_agri, axis=1)

It's also worth noting that `manuf_GDP` and `agri_GDP` are percentage contributions to the overall GDP, hence we'll have to convert them to absolute values for computation.

In [18]:
df['Manuf_GDP'] = (df['manuf_GDP']/100)*df['GDP']
df['Agric_GDP'] = (df['agri_GDP']/100)*df['GDP']
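
As a worked instance of this conversion, take the `World` row for 1988. The agricultural share of 5.305% used below is not read from the source file. It is back-computed from the `Agric_GDP` value that appears in this notebook's output, so treat it as illustrative.

```python
gdp_world_1988 = 42106.6   # billion 2015$ PPP, from the base data
agri_share_pct = 5.305     # ASSUMPTION: back-computed from the notebook output

# Percentage share of GDP converted to an absolute amount.
agric_gdp = (agri_share_pct / 100) * gdp_world_1988
print(round(agric_gdp, 5))
```

The result is about 2233.755 billion 2015\$ PPP, in line with the `Agric_GDP` value shown for that row.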

#### 3.4 Integrating Deforestation Data to Modified DF

In [19]:
# Defining functions that Integrate Forest Area (% of land area) & Land Area (sq. km) Data
def add_forest(row):
    
    val = forest_df.loc[forest_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    return round(float(list(val)[0]), 3) if len(val) > 0 else np.nan

def add_land(row):
    
    val = land_df.loc[land_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    return round(float(list(val)[0]), 3) if len(val) > 0 else np.nan

In [20]:
# Applying Functions Respectively
df['Forest'] = df.apply(add_forest, axis=1)
df['Land'] = df.apply(add_land, axis=1)
# Get exact forest area in sq. km
# (despite its name, this column holds the forest area, not a rate of loss)
df['Deforestation'] = (df['Forest']/100)*df['Land'] 

# Note: we cleared all land info before 1990 to avoid errors, since Forest data starts from 1990
# Drop redundant Forest & Land Columns
df = df.drop(['Forest', 'Land'], axis=1)

The forest data begins in 1990, hence we will see missing values across all countries for the years 1988 and 1989, and possibly a few others within the dataset.
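
The `Deforestation` column computed above is the forest area itself. If a rate of change is wanted later, one option is a year-on-year difference within each country. The sketch below shows both steps on made-up numbers; `toy` and all of its values are hypothetical, with one row per country and year.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [1990, 1991, 1992] * 2,
    "forest_pct": [50.0, 49.0, 48.0, 10.0, 10.0, 10.5],  # % of land area
    "land_sq_km": [1000.0] * 3 + [200.0] * 3,
})

# Same formula as the notebook: forest area in sq. km.
toy["forest_area"] = toy["forest_pct"] / 100 * toy["land_sq_km"]

# Year-on-year change within each country; negative values mean forest loss.
toy = toy.sort_values(["Country", "Year"])
toy["forest_change"] = toy.groupby("Country")["forest_area"].diff()
print(toy)
```

The first year of each country has no predecessor, so its change is `NaN`.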

#### 3.5 Extracting Emission per Capita
This refers to the per capita/person emission for each country per energy type

In [21]:
# Adding the emission_per_cap column
df['emission_per_cap'] = df['CO2_emission']/df['Population']
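
To make the calculation concrete, the sketch below repeats it on the first three `World` rows shown earlier in this notebook. The frame name `world` is ours.

```python
import pandas as pd

# First three rows of the base data shown earlier in this notebook.
world = pd.DataFrame({
    "e_type": ["all", "coal", "nat_gas"],
    "CO2_emission": [21163.84, 8930.92, 3571.68],  # MMtonnes CO2
    "Population": [4927545.08] * 3,
})

world["emission_per_cap"] = world["CO2_emission"] / world["Population"]
print(world["emission_per_cap"].round(6).tolist())  # [0.004295, 0.001812, 0.000725]
```

The resulting unit is MMtonnes CO2 per unit of `Population`. If the population figures are in thousands of persons (an assumption on our part), 0.004295 corresponds to roughly 4.3 tonnes of CO2 per person.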

In [22]:
"""
Let's Reposition our Target variable 
CO2 Emission to the End of our Dataframe
"""
# Separate Other Features From Target Variable
others = df.drop(['CO2_emission', 'emission_per_cap'], axis=1)
co = df[['emission_per_cap', 'CO2_emission']]
# Delete df
del df
# concat both Tables into fresh df
df = pd.concat([others, co], axis=1)
# Dropping % versions of Agric & Manuf GDP
df = df.drop(['manuf_GDP', 'agri_GDP'], axis=1)
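
Moving the target columns to the end can also be done by selecting the columns in the desired order, which avoids deleting and rebuilding the DataFrame. This is a minimal sketch on a made-up one-row frame; `toy`, `targets` and `features` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature frame, for illustration only.
toy = pd.DataFrame({
    "Country": ["A"],
    "CO2_emission": [1.0],
    "GDP": [2.0],
    "emission_per_cap": [0.5],
})

targets = ["emission_per_cap", "CO2_emission"]
features = [col for col in toy.columns if col not in targets]

# Selecting columns in the desired order moves the targets to the end.
toy = toy[features + targets]
print(list(toy.columns))  # ['Country', 'GDP', 'emission_per_cap', 'CO2_emission']
```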

In [23]:
df.head()

Unnamed: 0,Country,e_type,Year,e_con,e_prod,GDP,Population,ei_capita,ei_gdp,pop_growth,pop_density,Manuf_GDP,Agric_GDP,Deforestation,emission_per_cap,CO2_emission
0,World,all,1988,345.56,347.41,42106.6,4927545.08,70.13,8.21,1.77,39.285,,2233.75513,,0.004295,21163.84
1,World,coal,1988,96.87,98.48,42106.6,4927545.08,70.13,8.21,1.77,39.285,,2233.75513,,0.001812,8930.92
2,World,nat_gas,1988,71.01,71.85,42106.6,4927545.08,70.13,8.21,1.77,39.285,,2233.75513,,0.000725,3571.68
3,World,pet/oth,1988,133.45,132.49,42106.6,4927545.08,70.13,8.21,1.77,39.285,,2233.75513,,0.001758,8661.24
4,World,nuclear,1988,19.23,19.23,42106.6,4927545.08,70.13,8.21,1.77,39.285,,2233.75513,,0.0,0.0


<a id="four"></a>
## 4. Extract Integrated Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [24]:
"""
Extract & Save Data as CSV
Uncomment to run
""" 

# Note: by default to_csv also writes the index, which reloads as an "Unnamed: 0" column
# df.to_csv('data/Our_CO2emission_Analysis_Data.csv')

### Kindly proceed to Notebook PART 2 for further engineering and exploratory analysis