DSEP Mapping Research Team, UC Berkeley
Berkeley Institute for Data Science
Zhongling Jiang, Vinitra Swamy
The research_grant_history dataset contains historical grant information for all Berkeley research conducted from 1987 to 2016. The information includes activity type, sponsor class, funding amount, department, project information, PI, etc. Our goal is to visualize research trends over the most recent ten years. The components we look at include:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import pandas as pd
from datascience import *
import numpy as np
import locale
import re
import csv
Based on previous work on data cleaning and exploration, we create two datasets: by_dept_funding.csv and by_dept_activity.csv. Both datasets group the data by department; the first shows funding information and the second shows research type information for each department. Both are sorted by grant amount.
by_dept_funding = pd.read_csv('by_dept_funding.csv')
by_dept_activity = pd.read_csv('by_dept_activity.csv')
by_dept_funding.head(3)
by_dept_activity.head(3)
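The two per-department files are loaded ready-made, so the grouping step that produced them is not shown here. As a rough sketch of how such a table could be derived from per-grant records, assuming a hypothetical `Sponsor Class` column and invented toy values (neither is taken from the real data):

```python
import pandas as pd

# Toy per-grant records; the 'Sponsor Class' column name and all values are assumptions.
toy = pd.DataFrame({
    'Department': ['Dept A', 'Dept A', 'Dept B', 'Dept B', 'Dept B'],
    'Sponsor Class': ['Federal', 'Other', 'Federal', 'Federal', 'Non Profit'],
    'Grant Amount': [100.0, 50.0, 300.0, 200.0, 25.0],
})

# Count grants per department and sponsor class (one column per class).
counts = pd.crosstab(toy['Department'], toy['Sponsor Class'])

# Add total funding per department, then sort by it in descending order.
counts['Grant Amount'] = toy.groupby('Department')['Grant Amount'].sum()
toy_by_dept = counts.sort_values('Grant Amount', ascending=False).reset_index()
```

The real files may have been built differently; this only illustrates the "group by department, sort by grant amount" shape described above.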
## Add total number of sponsors and research
by_dept_funding['num_of_sponsors'] = by_dept_funding['Federal'] + by_dept_funding['State of California'] + by_dept_funding['Non Profit'] + by_dept_funding['University of California'] + by_dept_funding['Other']
by_dept_funding['num_of_research'] = by_dept_activity['Total']
by_dept_funding.head(4)
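The `num_of_research` assignment relies on pandas pairing the two tables by row index, so it is only correct if both files list the departments in the same order. A minimal sketch of a key-based alternative, using toy tables (the values are invented, and the presence of a `Dept/Division` column in the activity table is an assumption):

```python
import pandas as pd

funding = pd.DataFrame({
    'Dept/Division': ['Dept A', 'Dept B', 'Dept C'],
    'Federal': [5, 3, 1],
})
# Same departments, deliberately listed in a different row order.
activity = pd.DataFrame({
    'Dept/Division': ['Dept C', 'Dept A', 'Dept B'],
    'Total': [10, 40, 20],
})

# Index-based assignment pairs rows by position in the file, not by department.
by_index = funding.assign(num_of_research=activity['Total'])

# Merging on the department key pairs rows correctly regardless of row order.
by_key = funding.merge(
    activity[['Dept/Division', 'Total']].rename(columns={'Total': 'num_of_research'}),
    on='Dept/Division',
    how='left',
)
```

In the toy example, `by_index` gives Dept A the count that belongs to Dept C, while `by_key` keeps each count with its own department.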
We create the bubble chart using Plotly. Each bubble represents a department: the horizontal position shows the funding received, the vertical position shows the number of research projects, and the bubble size represents the number of sponsors.
The top four organizations that receive the most funding are ERSO Engineering, SSL Space Sciences Lab, MCB Molecular & Cell Biology, and the California Institute for Quantitative Biosciences (QB3).
The top four organizations that produce the most research are ERSO Engineering, MCB Molecular & Cell Biology, Graduate Division Dean, and the California Institute for Quantitative Biosciences (QB3).
import plotly.plotly as py
import cufflinks as cf
import pandas as pd
cf.set_config_file(offline=False, world_readable=True, theme='pearl')
by_dept_funding.iplot(kind='bubble', x='Grant Amount', y='num_of_research',
                      size='num_of_sponsors', text='Dept/Division',
                      xTitle='Funding Received', yTitle='Number of Research',
                      filename='simple-bubble-chart2')
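The cell above publishes the chart through an online Plotly account, and in plotly 4 and later the `plotly.plotly` module lives in the separate `chart_studio` package. As an offline fallback, the same encoding can be sketched with matplotlib alone. The data below is invented and the marker scaling factor is an arbitrary choice; only the column names are taken from the cell above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for by_dept_funding; all values are made up.
toy = pd.DataFrame({
    'Dept/Division': ['Dept A', 'Dept B', 'Dept C'],
    'Grant Amount': [1.0e6, 4.0e6, 2.5e6],
    'num_of_research': [20, 75, 40],
    'num_of_sponsors': [15, 60, 30],
})

fig, ax = plt.subplots()
# x = funding, y = number of projects, marker area scaled by number of sponsors.
ax.scatter(toy['Grant Amount'], toy['num_of_research'],
           s=toy['num_of_sponsors'] * 10, alpha=0.5)
# Label each bubble with its department name.
for _, row in toy.iterrows():
    ax.annotate(row['Dept/Division'], (row['Grant Amount'], row['num_of_research']))
ax.set_xlabel('Funding Received')
ax.set_ylabel('Number of Research Projects')
```

With `%matplotlib inline` active, the figure renders at the end of the cell; the static version loses the hover text that the interactive chart provides.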
We want to know which departments have been most active recently (in the past 10 years), measured by the number of research projects per year. We plot the four departments with the most funding from the list above.
cleaned_research = pd.read_csv('cleaned_research_spo_data.csv')
recent_data = cleaned_research[cleaned_research['Year'] > 2006]
# Find number of research
grouped = recent_data.groupby([recent_data['Department'], recent_data['Year']]).size()
plt.figure()
plt.subplot(2,2,1)
grouped['ERSO Engineering Research Support Organization'].plot.line()
plt.subplot(2,2,2)
grouped['SSL Space Sciences Lab'].plot.line()
plt.subplot(2,2,3)
grouped['MCB Molecular & Cell Biology'].plot.line()
plt.subplot(2,2,4)
grouped['The California Institute for Quantitative Biosciences (QB3)'].plot.line()
plt.show()
Total grant funding for each type of research over time. Basic research has received the most funding over the past 10 to 15 years.
cleaned_research = pd.read_csv('cleaned_research_spo_data.csv')
a = cleaned_research['Grant Amount'].groupby([cleaned_research['Activity Type'], cleaned_research['Year']]).sum()
plt.figure()
plt.plot(a['Applied research'], label = 'Applied Research')
plt.plot(a['Basic research'], label = 'Basic research')
plt.plot(a['Services'], label = 'Services')
plt.plot(a['Training'], label = 'Training')
plt.plot(a['Other'], label = 'Other')
plt.legend(loc=2,prop={'size':10})
plt.show()
With the drop-down list, we can visualize the funding sources for each department. For example, federal funding accounts for the largest share of funding for ERSO Engineering. The second subplot shows which types of research are more heavily funded in that department.
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
# ERSO Engineering Research Lab
# funding source count
# apply ipywidget
grant_data = pd.read_csv('grant_data.csv')
def plot_funding_and_research_type(dept):
    # Sponsor-class counts for the selected department
    x = by_dept_funding.loc[by_dept_funding['Dept/Division'] == dept, ['Federal', 'State of California', 'Non Profit', 'University of California', 'Other']].values.flatten().tolist()
    y = [1, 2, 3, 4, 5]
    # Total grant amount per activity type and year for the selected department
    z = cleaned_research.loc[cleaned_research['Department'] == dept, ['Activity Type', 'Grant Amount', 'Department', 'Year']]
    a = z['Grant Amount'].groupby([z['Activity Type'], z['Year']]).sum()
    plt.figure()
    # Top panel: funding sources
    plt.subplot(2, 1, 1)
    LABELS = ['Federal', 'State of California', 'Non Profit', 'University of California', 'Other']
    plt.barh(y, x, align='center')
    plt.yticks(y, LABELS)
    # Bottom panel: funding over time by activity type
    plt.subplot(2, 1, 2)
    types_of_research = ['Applied research', 'Basic research', 'Services', 'Training', 'Other']
    for research_type in types_of_research:
        # Skip activity types this department has no grants for (avoids a KeyError)
        if research_type in a.index.get_level_values(0):
            plt.plot(a[research_type], label=research_type)
    plt.legend(loc=2, prop={'size': 10})
    plt.show()
plot_funding_and_research_type('ERSO Engineering Research Support Organization')
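# Sorted, de-duplicated department names; missing values are dropped explicitly
# instead of discarding an arbitrary first element of an unordered set.
unique_divisions = sorted(by_dept_funding['Dept/Division'].dropna().unique())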
interact(plot_funding_and_research_type, dept=unique_divisions)
by_dept_funding.head(3)
We can also see the most popular projects, i.e., the ones that receive the most funding within each department. Further text analysis could be conducted to investigate trends in research topics.
# Which projects receive the most funding in each department (e.g., ERSO)?
clean_grant = pd.read_csv('cleaned_research_spo_data.csv') # the dataset has been sorted by grant amount
def top_project(data, dept, n):
    # This file is filtered by 'Department' and summed over 'Grant Amount' in the
    # cells above, so those column names are used here as well.
    by_dept = data.loc[data['Department'] == dept]
    return by_dept[['Department', 'Grant Amount', 'Title', 'Activity Type', 'Project Begin Date', 'Project End Date']].head(n)
# top_project(clean_grant, 'ERSO Engineering Research Support Organization', 10)
interact(top_project, data=fixed(clean_grant), dept=unique_divisions, n=widgets.IntSlider(min=0, max=50, step=5, value=10))
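As a starting point for the text analysis suggested above, one could count the most frequent words in project titles. This is only a sketch: the titles and the tiny stop-word list are invented, and `top_title_words` is a hypothetical helper that is not part of the original notebook.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {'of', 'the', 'and', 'in', 'for', 'a', 'to', 'on'}

def top_title_words(titles, n=3):
    """Return the n most common non-stop-words across the given titles."""
    counts = Counter()
    for title in titles:
        # Lowercase and keep alphabetic runs only.
        words = re.findall(r'[a-z]+', title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Invented titles, used only to exercise the helper.
toy_titles = [
    'Modeling of Solar Wind Plasma',
    'Solar Energy Storage in Cells',
    'Plasma Dynamics and Solar Flares',
]
top_words = top_title_words(toy_titles, n=2)
```

On the real data, the input could be the `Title` column of one department's rows with missing values dropped, and comparing the counts across years would give a first look at how topics shift.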