---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.10 (Pandas-02)</h1>

<a href="https://colab.research.google.com/github/arifpucit/data-science/blob/master/Section-3-Python-for-Data-Scientists/Lec-3.10(Pandas-02-Overview-of-Pandas-Series).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="right" width="400" height="400"  src="images/pandas-apps.png"  >

## _Overview of Pandas Series Data Structure.ipynb_

#### Read about Pandas Data: https://pandas.pydata.org/docs/user_guide

## Learning agenda of this notebook

1. Overview of Python Pandas library and its data structures
2. Creating a Series
    - From Python List
    - From NumPy Arrays
    - From Python Dictionary
    - From a scalar value
3. Attributes of a Pandas Series
4. Understanding Index in a Series and its usage
    - Identification
    - Selection/Filtering/Subsetting
    - Alignment

In [None]:
# To install this library in Jupyter notebook
import sys
!{sys.executable} -m pip install pandas --quiet

In [2]:
import pandas as pd
pd.__version__ , pd.__path__

('1.3.4',
 ['/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas'])

<img align="right" width="500" height="600"  src="images/series-anatomy.png"  >

## 2. Creating a Series
> **A Series is a one-dimensional array capable of holding a sequence of values of any data type (integers, floating-point numbers, strings, Python objects, etc.), which by default has numeric data labels starting from zero. You can think of a Pandas Series as a column in a spreadsheet or in a Pandas DataFrame object.**
- To create a Series object, use the `pd.Series()` constructor.

**```pd.Series(data, index, dtype, name)```**
- Where,
   - `data`: can be a Python list, a Python dictionary, a NumPy array, or a scalar value.
   - `index`: If you do not pass the index argument, it defaults to `np.arange(n)`, where `n` is the length of `data`. Index labels must be hashable (typically numbers or strings), and the index must have the same length as `data`. Non-unique index values are allowed. The index is used for three purposes:
       - Identification.
       - Selection.
       - Alignment.
   - `dtype`: Optionally, you can assign any valid NumPy data type to the series object (for example `np.int64` or `np.float32`). If not specified, it is inferred from `data`.
   - `name`: Optionally, you can assign a name to a series, which becomes the `name` attribute of the series object. It also becomes the column name if that series object is later used to create a dataframe.
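
As a quick illustration of the four arguments described above, here is a minimal sketch that passes all of them at once. The student labels and marks are made-up values used only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical marks for three students (labels and values are made up)
marks = pd.Series(
    data=[71, 85, 64],               # a Python list
    index=["MS01", "MS02", "MS03"],  # one label per data value
    dtype=np.float64,                # force a float data type
    name="marks",                    # stored in the `name` attribute
)

print(marks)
print(marks.name, marks.dtype, list(marks.index))
```

Each argument ends up in the attribute of the same name, which the attributes section of this notebook inspects one by one.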

### a. Creating a Series from Python List

In [2]:
import pandas as pd
import numpy as np
list1 = ['Arif', 'Rauf', 'Maaz', '','Hadeed']  # note the empty string

# When index is not provided, it creates an index for the data starting from zero and with a step size of one.
s = pd.Series(data=list1)
print(s)
print(type(s))

0      Arif
1      Rauf
2      Maaz
3          
4    Hadeed
dtype: object
<class 'pandas.core.series.Series'>


>Observe that the output is shown in two columns: the index is on the left and the data value is on the right. If we do not explicitly specify an index for the data values while creating a series, then by default the indices range from 0 through N - 1. Here N is the number of data elements.

**You can explicitly specify the index for a Series object. The labels can be integers, floats, or strings, and the index must have the same length as the data; otherwise a `ValueError` is raised.**

In [3]:
list1 = ['Arif', 'Rauf', 'Maaz', 'Hadeed']
indices = ['MS01', 'MS02', '', 'MS02']   # non-unique index values are allowed and you can have empty string as index

s = pd.Series(data=list1, index=indices)
print(s)
print(type(s))

MS01      Arif
MS02      Rauf
          Maaz
MS02    Hadeed
dtype: object
<class 'pandas.core.series.Series'>


In [5]:
s['MS01']

'Arif'

>Also note that non-unique indices are allowed
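
The two rules above (duplicate labels are allowed, but the index length must match the data) can be checked with a small sketch. The numeric values are made up for this example; only the labels mirror the cell above.

```python
import pandas as pd

s = pd.Series(data=[10, 20, 30, 40], index=["MS01", "MS02", "", "MS02"])

# A label that occurs once returns a single value ...
single = s["MS01"]
print(single)

# ... while a repeated label returns a Series holding every match
repeated = s["MS02"]
print(repeated)

# An index whose length differs from the data raises a ValueError
try:
    pd.Series(data=[10, 20, 30, 40], index=["MS01", "MS02"])
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```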

In [1]:
list1 = ['Arif', 'Rauf', 'Maaz', 'Hadeed']
indices = [2.1, 2.2, 2.3, 2.4]   

s = pd.Series(data=list1, index=indices)
print(s)
print(type(s))

2.1      Arif
2.2      Rauf
2.3      Maaz
2.4    Hadeed
dtype: object
<class 'pandas.core.series.Series'>

**You can create a series with NaN values using `np.nan`, which is the IEEE 754 floating-point representation of Not a Number. NaN values act as a placeholder for missing numerical values in the array.**

In [6]:
list1 = [1, 2.7, np.nan, 54]
s = pd.Series(data=list1)
print(s)
print(type(s))

0     1.0
1     2.7
2     NaN
3    54.0
dtype: float64
<class 'pandas.core.series.Series'>


>Also note the `dtype` of the series object is inferred from the data as `float64`
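
To see why the inferred type is `float64`, the sketch below compares an all-integer list with the same list plus one `np.nan`. The variable names are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 54])              # no missing values
with_nan = pd.Series([1, 2, np.nan, 54])  # one missing value

print(ints.dtype)        # an integer dtype
print(with_nan.dtype)    # float64, because NaN is a floating-point value
print(with_nan.isna())   # True only at the position of the NaN
print(with_nan.sum())    # NaN is skipped by default
```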

**You can use the `dtype` argument to set the data type of the series object.**

In [7]:
list1 = [27, 33, 19]
s = pd.Series(data=list1, dtype=np.uint8)
print(s)
print(type(s))

0    27
1    33
2    19
dtype: uint8
<class 'pandas.core.series.Series'>


**Optionally, you can assign a name to a series, which becomes the `name` attribute of the series object. It also becomes the column name if that series object is later used to create a dataframe.**

In [8]:
list1 = ['Arif', 'Rauf', '', 'Hadeed']
indices = ['MS01', 'MS02', 'MS03', 'MS04']
s = pd.Series(data=list1, index=indices, name='myseries1') 
print(s)
print(type(s))

MS01      Arif
MS02      Rauf
MS03          
MS04    Hadeed
Name: myseries1, dtype: object
<class 'pandas.core.series.Series'>


### b. Creating a Series from NumPy Array

In [None]:
s = pd.Series(data = np.arange(4))
print(s)
print(type(s))

In [9]:
arr1 = np.array([22.3,33.6, 98, 44])
s = pd.Series(data=arr1, dtype='float64')
print(s)
print(type(s))

0    22.3
1    33.6
2    98.0
3    44.0
dtype: float64
<class 'pandas.core.series.Series'>


### c. Creating a Series from Python Dictionary

In [10]:
my_dict = {
    'name':"Arif", 
    'gender':"Male", 
    'Role':"Teacher", 
    'subject':"Data Science"}
s = pd.Series(data=my_dict)
print(s)
print(type(s))

name               Arif
gender             Male
Role            Teacher
subject    Data Science
dtype: object
<class 'pandas.core.series.Series'>


**When you create a series from a dictionary, the keys automatically become the index and the values become the data.**
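
If you also pass `index` along with a dictionary, Pandas picks the values for those labels in the order given. A minimal sketch, with made-up keys and prices:

```python
import pandas as pd

prices = {"apple": 120, "banana": 60, "mango": 250}   # made-up values

# Only the requested keys are kept, in the requested order.
# "grape" is not a key of the dictionary, so its value is NaN.
s = pd.Series(data=prices, index=["mango", "apple", "grape"])
print(s)
```

Because of the NaN, the data type of the result becomes `float64`.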

### d. Creating a Series from Scalar value

In [11]:
s = pd.Series(data=25)
print(s)
print(type(s))

0    25
dtype: int64
<class 'pandas.core.series.Series'>
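
A scalar becomes more useful when combined with the `index` argument: the value is repeated once for every label. A small sketch with arbitrary labels:

```python
import pandas as pd

# The scalar 25 is repeated for each of the three labels
s = pd.Series(data=25, index=["a", "b", "c"])
print(s)
```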


### e. Creating an Empty Series

In [12]:
# Pass at least `dtype`; otherwise Pandas 1.x issues a DeprecationWarning about the default dtype
s = pd.Series(dtype='float64')
print(s)
print(type(s))

Series([], dtype: float64)
<class 'pandas.core.series.Series'>


## 3. Attributes of a Pandas Series
- We can access certain properties of a series, called attributes, by writing the property name after the series name using dot `.` notation

In [24]:
my_dict = {0:"Rauf", 1:np.nan, 2:"Maaz", 3:"Hadeed", 4:"Mujahid", 5:"Mohid", 6:"Jamil"}
s = pd.Series(my_dict, name="myseries1")
s

0       Rauf
1        NaN
2       Maaz
3     Hadeed
4    Mujahid
5      Mohid
6      Jamil
Name: myseries1, dtype: object

In [14]:
# `name` attribute of a series object returns the name of the series object
s.name

'myseries1'

In [15]:
# `index` attribute of a series object returns the index labels and their data type
s.index

Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')

In [18]:
# `values` attribute of a series object returns the values as a NumPy array
s.values

array(['Rauf', nan, 'Maaz', 'Hadeed', 'Mujahid', 'Mohid', 'Jamil'],
      dtype=object)

In [19]:
# `dtype` attribute of a series object returns the type of the underlying data
s.dtype

dtype('O')

In [20]:
# `shape` attribute of a series object returns a tuple with the shape of the underlying data
s.shape

(7,)

In [21]:
# `nbytes` attribute of a series object returns the number of bytes of the underlying data (each object reference takes 8 bytes here)
s.nbytes

56

In [None]:
# `size` attribute of a series object returns the number of elements in the underlying data
s.size

In [22]:
# `ndim` attribute of a series object returns the number of dimensions of the underlying data
s.ndim

1

In [25]:
# `hasnans` attribute of a series object returns True if there are NaN values in the data
s.hasnans

True
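
The attributes above depend on the data type chosen for the series. As a rough sketch, the same four made-up values are stored once as `int64` and once as `uint8` to show the effect on `nbytes`:

```python
import numpy as np
import pandas as pd

values = [27, 33, 19, 45]
wide = pd.Series(values, dtype=np.int64)    # 8 bytes per element
narrow = pd.Series(values, dtype=np.uint8)  # 1 byte per element

print(wide.nbytes, narrow.nbytes)
print(wide.size, wide.shape, wide.ndim, wide.hasnans)
```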

<img align="right" width="500" height="500"  src="images/series-anatomy.png"  >

## 4. Understanding Index in a Series
- Every item in a series object has an associated index label. 
- The Pandas series object supports both integer-based (default) and label/string-based indexing and provides a host of methods for performing operations involving the index.
<br><br>
    - When the index is unique, Pandas uses a hash table to map each key to its value, so a lookup takes O(1) time. 
    - When the index is non-unique but sorted, Pandas uses binary search, which takes logarithmic time, O(log N).
    - When the index is non-unique and unsorted, a lookup takes linear time, O(N), because Pandas needs to check every key in the index.<br><br>
- Index in series object is used for three purposes:
    - Identification
    - Selection/Filtering/Subsetting
    - Alignment <br><br>
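
Which of the three lookup cases applies can be checked from the index itself. The sketch below uses the `is_unique` and `is_monotonic_increasing` properties of the index on three small made-up series; it only reports the properties and does not measure lookup time.

```python
import pandas as pd

unique_sorted = pd.Series([10, 20, 30], index=[1, 2, 3])
dup_sorted = pd.Series([10, 20, 30], index=[1, 2, 2])
dup_unsorted = pd.Series([10, 20, 30], index=[2, 1, 2])

for label, s in [("unique, sorted", unique_sorted),
                 ("duplicates, sorted", dup_sorted),
                 ("duplicates, unsorted", dup_unsorted)]:
    # is_unique: no repeated labels; is_monotonic_increasing: labels never decrease
    print(label, s.index.is_unique, s.index.is_monotonic_increasing)
```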

### a. Changing Index of a Series Object
- In the above examples, we have seen that
    - If we create a Series object from a dictionary, the keys of the dictionary become the index 
    - If we create a Series object from a list or NumPy array, the index defaults to the integers 0, 1, 2, ...
    - Last but not least, we can assign indices of our own choice, which can be integers or strings
- Let us see how we can change the indices of a series object after creation

In [26]:
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
s = pd.Series(data=list1)
print(s)
print(s.index)

0       Rauf
1       Arif
2       Maaz
3     Hadeed
4    Mujahid
dtype: object
RangeIndex(start=0, stop=5, step=1)


>The `index` attribute of the series object shows that the index of this series runs from 0 to 4 with a step value of 1

**Let us modify the index of this series object to some random integers by assigning a random array of integers to `index` attribute of this series object**

In [27]:
arr1 = np.random.randint(low = 100, high = 200, size = 5)

s.index = arr1

print(s)
print(s.index)

113       Rauf
152       Arif
176       Maaz
191     Hadeed
179    Mujahid
dtype: object
Int64Index([113, 152, 176, 191, 179], dtype='int64')


In [28]:
s.index = [1,4,2,6.3,9]

print(s)
print(s.index)

1.0       Rauf
4.0       Arif
2.0       Maaz
6.3     Hadeed
9.0    Mujahid
dtype: object
Float64Index([1.0, 4.0, 2.0, 6.3, 9.0], dtype='float64')


**Changing index of a series to a list of strings**

In [29]:
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
s = pd.Series(data=list1)
print(s)
print(s.index)

0       Rauf
1       Arif
2       Maaz
3     Hadeed
4    Mujahid
dtype: object
RangeIndex(start=0, stop=5, step=1)


In [30]:
indices = ['num1', 'num2', 'num3', 'num4', 'num5']

s.index = indices

print(s)
print(s.index)

num1       Rauf
num2       Arif
num3       Maaz
num4     Hadeed
num5    Mujahid
dtype: object
Index(['num1', 'num2', 'num3', 'num4', 'num5'], dtype='object')
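
When assigning to the `index` attribute as shown above, the new index must have exactly as many labels as the series has values. The sketch below shows the error raised otherwise, and also `rename()`, which changes selected labels and returns a new series. The data and labels are made up for this example.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# Assigning an index of the wrong length raises a ValueError
try:
    s.index = ["a", "b"]
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)

# An index of the right length is accepted
s.index = ["a", "b", "c"]

# rename() maps old labels to new ones and returns a new series
renamed = s.rename({"a": "first"})
print(renamed)
```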


<img align="right" width="300" height="300"  src="images/series-anatomy.png"  >

### b. First use of Index (Identification)
- Every data value of a series object has an associated index (integer or string), so we can use this index/label to identify or access data value(s)
- There are three ways to access elements of a series:
    - Using the `s[]` operator and specifying the index (integer/label)
    - Using `s.loc[]` and specifying the index (integer/label)
    - Using `s.iloc[]` and specifying the position (an integer value from 0 to length-1). It also supports negative indexing, so the last element can be accessed at position -1

**Identification using Integer Indices or by Position**

In [31]:
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
indices = [5, 10, 15, 20, 25]
s = pd.Series(data=list1, index=indices)
s

5        Rauf
10       Arif
15       Maaz
20     Hadeed
25    Mujahid
dtype: object

In [41]:
# Pass an index label to the subscript operator
s[25]

# With an integer index, the subscript operator looks up labels, not positions
#s[0] # raises a KeyError because the label 0 does not exist

'Mujahid'

In [43]:
# Pass an index label to loc
s.loc[20]
# loc does not work on positions
#s.loc[0] # raises a KeyError because the label 0 does not exist

'Hadeed'

In [45]:
# iloc is position based, so passing a label that is not a valid position raises an IndexError
#s.iloc[20] 

In [46]:
# iloc takes a position, not an index label
s.iloc[3]


'Hadeed'

**Fancy Indexing**

In [47]:
# Can access multiple values by specifying a list of indices
s[[20, 5]]

20    Hadeed
5       Rauf
dtype: object

In [48]:
# Can access multiple values by specifying a list of indices
s.loc[[20, 5]]

20    Hadeed
5       Rauf
dtype: object

In [49]:
# Can access multiple values by specifying list of positions
s.iloc[[3, 0]]

20    Hadeed
5       Rauf
dtype: object

**Negative Indexing works only with `iloc`**

In [None]:
#s[-1]
#s.loc[-1]
s.iloc[-1]
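
The commented-out lines in the cell above fail because `-1` is not one of the labels. A minimal sketch that catches the errors instead of commenting the lines out, using a shortened version of the same series:

```python
import pandas as pd

s = pd.Series(["Rauf", "Arif", "Maaz"], index=[5, 10, 15])

last = s.iloc[-1]      # position based: the last element
print(last)

# Label based: there is no label -1, so a KeyError is raised
try:
    s.loc[-1]
    loc_failed = False
except KeyError:
    loc_failed = True

# With an integer index, [] also looks up the label -1
try:
    s[-1]
    subscript_failed = False
except KeyError:
    subscript_failed = True

print(loc_failed, subscript_failed)
```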

**Identification using String Indices or by Position**

In [50]:
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
indices = ['num1', 'num2', 'num3', 'num4', 'num5']
s = pd.Series(data=list1, index=indices)
s

num1       Rauf
num2       Arif
num3       Maaz
num4     Hadeed
num5    Mujahid
dtype: object

In [53]:
# Give index to subscript operator (which in this case is a string or label)
s['num1']

'Rauf'

In [55]:
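# With a string index, [] also accepts an integer position in the Pandas version used here.
# Newer Pandas releases deprecate this fallback, so prefer s.iloc[2]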
s[2]

'Maaz'

In [56]:
# Give index to  loc method (which in this case is a string or label)
s.loc['num1']

'Rauf'

In [1]:
# loc does not fall back to positions the way [] did above
#s.loc[0]

In [57]:
# iloc method is position based, so will flag an error if you pass it string indices
#s.iloc['num1'] 
# however will work fine if you pass an integer specifying the position
s.iloc[0]

'Rauf'

In [None]:
s.iloc[-1]

**Fancy Indexing**

In [None]:
# Can access multiple values by specifying a list of indices (which in this case are strings or labels)
s[['num3', 'num1']]

In [None]:
# Can access multiple values by specifying a list of indices (which in this case are strings or labels)
s.loc[['num3', 'num1']]

In [None]:
# iloc method is position based, so will flag an error if you pass it string indices
#s.iloc['num3', 'num1'] 
# however will work fine if you pass an integer specifying the position
s.iloc[[2,0]]

<img align="right" width="400" height="400"  src="images/series-anatomy.png"  >

### c. Second use of Index (Selection)
- A series can be sliced using the `:` symbol, which returns a subset of a series object (values with their corresponding indices).
- A slice takes three arguments, `[start:stop:step]`, and all of them are optional

- A slice can be used in three ways to slice a Pandas Series object:
    - Using the `s[]` operator and specifying index labels or integer positions
    - Using `s.loc[]` and specifying the index (integer/label)
    - Using `s.iloc[]` and specifying positions (integer values from 0 to length-1). It also supports negative indexing, so the last element is at position -1
- Keep the following points in mind:
    - With integer bounds, `s[]` slices by position, so the `stop` argument is NOT inclusive. With string labels, `s[]` slices by label and the `stop` argument is inclusive.
    - The `stop` argument is inclusive for `s.loc[]` for both integer and label indices.
    - The `stop` argument is NOT inclusive for `s.iloc[]`, because it is position based.
  
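>**Note: In the Pandas version used in this notebook (1.3.4), slicing a series returns a view of the original object, which is similar to a shallow copy. So if you modify an element in the original series object, the change is also visible in the sliced series object. Newer Pandas releases with Copy-on-Write enabled make the slice behave like an independent copy. Calling `.copy()` on the slice gives an independent object in every version.**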
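
When the slice must not share data with the original series, an explicit `.copy()` is the safe choice. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

independent = s.iloc[1:3].copy()   # explicit copy of the slice
independent.iloc[0] = 999          # modify the copy only

print(s)             # the original series is untouched
print(independent)
```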

**Selection/Filtering/Subsetting of Series object having Integer indices**

In [1]:
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
indices = [5, 10, 15, 20, 25]
s = pd.Series(data=list1, index=indices)
s

5        Rauf
10       Arif
15       Maaz
20     Hadeed
25    Mujahid
dtype: object

In [67]:
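# With an integer index, s[] treats an integer slice as positions, not labels.
# Positions 5 to 14 do not exist in a five-element series, so the result is empty.
s[5:15]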

Series([], dtype: object)

In [61]:
# With an integer index, the subscript operator treats the slice bounds as positions,
# not as index labels.
# The `stop` argument is NOT inclusive for `s[]` with integer bounds
s[1:4]

10      Arif
15      Maaz
20    Hadeed
dtype: object

In [62]:
#The loc[] method considers the slice object as actual indices and not as positional indices
# The stop argument is inclusive for `s.loc[]` for both integer and label indices
s.loc[5:15]

5     Rauf
10    Arif
15    Maaz
dtype: object

In [63]:
# The iloc[] method considers the slice object as positional index and not as the actual indices
# The `stop` argument is NOT inclusive for `s.iloc[]` being position based
s.iloc[1:4]

10      Arif
15      Maaz
20    Hadeed
dtype: object

**Selection/Filtering/Subsetting of Series object having String Indices**

In [64]:
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
indices = ['num1', 'num2', 'num3', 'num4', 'num5']
s = pd.Series(data=list1, index=indices)
s

num1       Rauf
num2       Arif
num3       Maaz
num4     Hadeed
num5    Mujahid
dtype: object

In [65]:
s[0:2]

num1    Rauf
num2    Arif
dtype: object

In [66]:
# With string labels as slice bounds, the subscript operator slices by label
# instead of by position.
# The `stop` argument is inclusive for `s[]` with string labels, while it is NOT inclusive for integer bounds.
s['num2':'num4']

num2      Arif
num3      Maaz
num4    Hadeed
dtype: object

In [None]:
# Integer bounds in `s[]` are positions even when the index holds strings, so `stop` is NOT inclusive here.
s[0:2]

In [None]:
#The loc[] method considers the slice object as actual indices and not as positional indices
# The stop argument is inclusive for `s.loc[]` for both integer and label indices
s.loc['num2':'num4']

In [None]:
# The iloc[] method considers the slice object as positional index and not as the actual indices
# iloc method is position based, so will flag an error if you pass it string indices
#s.iloc['num2': 'num4'] 
# however will work fine if you pass an integer values (specifying positions) in the slice operator
# Moreover the stop index is not inclusive
s.iloc[1:4]

**Understanding Step with Series object having String Indices**

In [None]:
s

In [None]:
# The step works fine with string indices as well
s['num2':'num5':1]

In [None]:
s['num2':'num5':2]

In [None]:
s['num5':'num3':-1]

<img align="right" width="300" height="300"  src="images/series-anatomy.png"  >

### d. Third use of Index (Alignment)
- We can perform basic arithmetic operations like addition, subtraction, multiplication, division, etc., on two Series objects, to produce a new Series instance.
- The operation is done on each corresponding pair of elements. This is done by matching the indices of the two series objects.

**Example 1:** Adding two series objects with the same integer indices

In [68]:
list1 = [1,3,5,7,9]
list2 = [2,4,6,8,10]
s1 = pd.Series(data=list1)
s2 = pd.Series(data=list2)

In [69]:
print(s1)
print(s1.index)

0    1
1    3
2    5
3    7
4    9
dtype: int64
RangeIndex(start=0, stop=5, step=1)


In [70]:
print(s2)
print(s2.index)

0     2
1     4
2     6
3     8
4    10
dtype: int64
RangeIndex(start=0, stop=5, step=1)


In [71]:
s3 = s1 + s2
print(s3)
print(s3.index)

0     3
1     7
2    11
3    15
4    19
dtype: int64
RangeIndex(start=0, stop=5, step=1)


**Example 2:** Adding two series objects having different integer indices

In [72]:
list1 = [6,9,7,5]
index1 = [0,1,2,3]
list2 = [8,6,2,1]
index2 = [0,2,3,5]
s1 = pd.Series(data=list1, index=index1)
s2 = pd.Series(data=list2, index=index2)

In [73]:
print(s1)
print(s1.index)

0    6
1    9
2    7
3    5
dtype: int64
Int64Index([0, 1, 2, 3], dtype='int64')


In [74]:
print(s2)
print(s2.index)

0    8
2    6
3    2
5    1
dtype: int64
Int64Index([0, 2, 3, 5], dtype='int64')


In [75]:
s3 = s1 + s2
print(s3)
print(s3.index)

0    14.0
1     NaN
2    13.0
3     7.0
5     NaN
dtype: float64
Int64Index([0, 1, 2, 3, 5], dtype='int64')


**Problem:** When a mathematical operation is performed on series with mismatched indices, every label that is present in only one of the series gets NaN in the result by default.

**Solution:** To handle this problem, instead of using the operators (`+, -, *, /`), an explicit call to `s.add()`, `s.sub()`, `s.mul()` or `s.div()` is preferred. These methods accept a `fill_value` argument, which replaces the missing values in either series with a specific value, so that the result holds a concrete number in place of NaN.

In [76]:
s1.add(s2, fill_value=0) # Compare it with above result

0    14.0
1     9.0
2    13.0
3     7.0
5     1.0
dtype: float64
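
The same idea carries over to the other arithmetic methods. The sketch below reuses the two series of Example 2 with `sub()` and `mul()`. The choice of `fill_value` (0 for subtraction, 1 for multiplication) is an assumption made here so that a missing operand leaves the other value unchanged.

```python
import pandas as pd

s1 = pd.Series([6, 9, 7, 5], index=[0, 1, 2, 3])
s2 = pd.Series([8, 6, 2, 1], index=[0, 2, 3, 5])

diff = s1.sub(s2, fill_value=0)   # a missing operand is treated as 0
prod = s1.mul(s2, fill_value=1)   # a missing operand is treated as 1

print(diff)
print(prod)
```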

**Example 3:** Adding two series objects having different string indices

In [3]:
list1 = [6,9,7,5, 2]
labels1 = ['num1', 'num2', 'num3', 'num4', 'num5']

list2 = [8,6,2,3,6]
labels2 = ['num1', 'num2', 'num3', 'num8', 'num5']

s1 = pd.Series(data=list1, index=labels1)
s2 = pd.Series(data=list2, index=labels2)


In [4]:
print(s1)
print(s1.index)

num1    6
num2    9
num3    7
num4    5
num5    2
dtype: int64
Index(['num1', 'num2', 'num3', 'num4', 'num5'], dtype='object')


In [5]:
print(s2)
print(s2.index)

num1    8
num2    6
num3    2
num8    3
num5    6
dtype: int64
Index(['num1', 'num2', 'num3', 'num8', 'num5'], dtype='object')


In [6]:
# Let us use the `add()` method
#s1+s2
s3 = s1.add(s2, fill_value=5)
#s3 = s1.add(s2)
print(s3)
print(s3.index)

num1    14.0
num2    15.0
num3     9.0
num4    10.0
num5     8.0
num8     8.0
dtype: float64
Index(['num1', 'num2', 'num3', 'num4', 'num5', 'num8'], dtype='object')


**My dear students, please make time to practice the following topics related to Series:**
- Boolean/Fancy Indexing and Slicing
- Use of the `reset_index()` method for completely resetting the index
- Use of other manipulation methods like 
    - `s.pop(index)` is passed an index label; it returns the data item at that label and removes it from the series
    - `s.drop(indexes)` is passed one index label or a list of labels and returns a new series without those items. The original series remains unchanged unless the `inplace=True` argument is passed
    - `s1.append(s2, ignore_index=False, verify_integrity=False)` concatenates two series and returns the concatenated series; the original series remain unchanged. This method is deprecated since Pandas 1.4 and removed in Pandas 2.0, where `pd.concat([s1, s2])` is used instead
    - `s1.update(s2)` modifies the series `s1` in place using the values from the passed series
>**We will discuss these while studying Pandas Dataframe object InshaAllah**

# Pandas Series vs NumPy 1-D Arrays
>- In a series object we can define our own labeled index to access the elements. The labels can be numbers or strings. NumPy arrays are accessed by integer position only.
>- In a series object the index labels can be in any order, including descending order. In NumPy arrays, indexing always starts at zero for the first element and the index is fixed.
>- Arithmetic on series with misaligned indices aligns the operands by label and generates NaN for labels present in only one of them. NumPy arrays have no labels to align: arithmetic is positional and follows broadcasting rules, and it fails when the shapes are incompatible.
>- A Series requires more memory than a NumPy array holding the same data, because it also stores an index.
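
The difference in arithmetic behaviour can be sketched as follows. The arrays, labels, and values are made up for this example.

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic is positional, and shapes (3,) and (2,) cannot be broadcast
a = np.array([1, 2, 3])
b = np.array([10, 20])
try:
    a + b
    numpy_failed = False
except ValueError as err:
    numpy_failed = True
    print("ValueError:", err)

# Pandas: operands are aligned by label, and unmatched labels give NaN
s1 = pd.Series([1, 2, 3], index=["x", "y", "z"])
s2 = pd.Series([10, 20], index=["z", "x"])
total = s1 + s2
print(total)
```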
    
    