# Lecture 1: Welcome to CSCI 1360!
CSCI 1360: Foundations for Informatics and Analytics

## Overview and Objectives

In this lecture, we'll go over the basic principles of the field of data science and analytics. By the end of the lecture, you should be able to

 - Broadly define "data science" and understand its role as an interdisciplinary field of study

 - Identify the six major skill divisions of a data scientist

 - Provide some justification for why there is a sudden interest and national need for trained data scientists

No programming just yet; this is mainly a history and background lesson.

## Part 1: Data Science

At some point in the last few years, you've most likely stumbled across at least one of the many memes surrounding "big data" and "data science."

![meme](http://www.octopus-hr.co.uk/hrmoz/images/articles/1030_2.gif)

These terms have so thoroughly saturated the tech sector's vernacular that they've become almost meaningless. Indeed, many have argued that data science may not be anything new at all, but rather a rehashing and rebranding of ideas that, for whatever reason, did not gain traction previously.

Before differentiating data science as a field unto itself and justifying its existence, I'll first offer a working definition. In my opinion, the most cogent and concise definition of data science is encapsulated in the following Venn diagram:

![datasci](http://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png)

Data Science as a proper field of study is the confluence of three major aspects:

1: **Hacking skills**: the ability to code, and knowledge of available tools.

2: **Math and statistics**: strong quantitative skills that can be implemented in code.

3: **Substantive expertise**: some specialized area of emphasis.

The third point is crucial to the overall definition of data science, and is also the reason that defining data science as a distinct field is so controversial. After all, isn't a data scientist whose substantive expertise is biology just a quantitative biologist with strong programming skills?

I'll frame the answer to this question with a somewhat snarky data science definition I've stumbled across before:

`Data Scientist (n.): Person who is better at statistics than any software engineer, and better at software engineering than any statistician.`

## Part 2: How is data science different?

The best answer here is "it depends." This being a college course, however, we'll need to be more rigorous than that if we expect to get any credit.

A big part of the ascendance of data science is, essentially, [Moore's Law](https://en.wikipedia.org/wiki/Moore%27s_law).

Once upon a time, digital storage was a luxury and even the most basic computations took hours. We're now right smack in the middle of an exponential expansion in digitized data creation, with enough processing power to crunch a significant portion of it.

![datagrowth1](http://www.kdnuggets.com/2012/07/data-science-vdhar-image002.jpg)

![datagrowth2](https://qph.is.quoracdn.net/main-qimg-ac33ff5752c393e3c0657c62a3b409af?convert_to_webp=true)

In effect, data science is not "new", but rather the result of technologies that have made long-standing theoretical ideas feasible in practice.

 - Digital storage is cheap enough to store everything.
 - Modern CPUs / GPUs can perform trillions of calculations per second.
 - Rather than pre-selecting features of data to save, we can now put sensors on everything and store the information for downstream analysis at a later time.

There is, nevertheless, a lot of overlap between this new field of "data science" and those that many regard as the progenitors: statistics and machine learning.

Many data scientists come from academia, with Ph.D.s in fields such as machine learning or statistics!

Many *other* data scientists have stronger software engineering backgrounds.

To quote Joel Grus in his book, *Data Science from Scratch*:

**`In short, pretty much no matter how you define data science, you'll find practitioners for whom the definition is totally, absolutely wrong.`**

So we fall back to: "a professional who uses scientific methods to liberate and create meaning from raw data"

Unfortunately, this is too broad to be useful; applied statisticians would claim this definition is, almost word-for-word, what they've been doing for centuries.

Is there any hope?!

(spoiler alert: yes)

While the popular media tropes of "big data," "skills," and "jobs" don't inherently justify spawning a new field, there is still a solid case to be made for data science as its own entity.

## Part 3: "Greater" Data Science

It's important to note first: data science **did not develop overnight.**

Despite the seemingly rapid rise of data science as a field, "big data" as a meme, and the data scientist as an industry gold standard, this was something envisioned [over 50 years ago by John Tukey](https://projecteuclid.org/euclid.aoms/1177704711) in his paper, *The Future of Data Analysis* (1962!).

Tukey presented the broad concepts of

 - Data analytics
 - Interpretation of said analytics
 - Visualization

as *their own field*, not just branches or extensions of math and stats.

There is legitimate value in training people in the practice of **extracting information from data**.

from "getting acquainted" with the data

to delivering major conclusions based on it

This is the field of "Greater Data Science", or GDS.

### 1: Data Exploration and Preparation

Ever heard this?

> "A data scientist spends 80% of their time cleaning and rearranging their data."

It's a little rhetorical (what about cat videos?)

But data are messy, often missing certain values or corrupted in other ways that would break naive attempts at analysis.

Furthermore, there's something to be said for exploring data beforehand, "getting a feel for it."

No such thing as an assumption-free algorithm that will "just work" for any data!

Understanding the data, really intuitively *getting it*, will dramatically improve our lives (of trying to analyze it).
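
To make the "messy data" point concrete, here's a minimal Python sketch (a preview only; nothing you need to run or fully understand yet). The `readings` list and its values are made up for illustration, with `None` standing in for a measurement that was never recorded. It shows a naive average failing outright, followed by one of many possible fixes: dropping the missing entries first.

```python
# Hypothetical sensor readings; None marks a value that was never recorded.
readings = [21.5, 22.0, None, 23.5, None, 21.0]

# Naive attempt: adding a number to None is not defined, so this fails.
try:
    naive_mean = sum(readings) / len(readings)
    naive_failed = False
except TypeError:
    naive_failed = True

# One simple cleaning strategy (an assumption, not the only option):
# keep only the readings that are actually present.
clean = [r for r in readings if r is not None]
mean_clean = sum(clean) / len(clean)

print("Naive analysis failed:", naive_failed)
print("Usable readings:", len(clean), "of", len(readings))
print("Mean of usable readings:", mean_clean)
```

Note that even this tiny fix carries an assumption: by dropping the missing values, we're betting that they're missing at random and not, say, because the sensor fails whenever it gets too hot.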

### 2: Data Representation and Transformation

There are *countless* data formats, with more appearing all the time:

![xkcd](http://imgs.xkcd.com/comics/standards.png)

Text in a CSV file? Images in a TIFF stack? Compressed video in Ogg format? Or, moving beyond flat files: user activity stored in a MySQL database? Event logs stored in NoSQL key-value format? Or, moving beyond even concrete formats: are we reading 8-bit pixel data, or floating-point cepstral transformations? Are we receiving the raw image data, or the wavelet filter bank?
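
As a small taste of why representation matters, the sketch below stores the same two records as CSV text and as JSON text, then reads both using Python's standard `csv` and `json` modules. The records, field names, and variable names are all hypothetical, and this is a preview, not something you're expected to write yet.

```python
import csv
import io
import json

# The same two hypothetical records, serialized two different ways.
csv_text = "user,clicks\nalice,3\nbob,5\n"
json_text = '[{"user": "alice", "clicks": 3}, {"user": "bob", "clicks": 5}]'

# CSV carries no type information: every field arrives as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON keeps the distinction between numbers and strings.
json_rows = json.loads(json_text)

# Convert the CSV fields explicitly so both sources share one representation.
normalized = [
    {"user": row["user"], "clicks": int(row["clicks"])}
    for row in csv_rows
]

print(type(csv_rows[0]["clicks"]).__name__)   # type of the raw CSV field
print(type(json_rows[0]["clicks"]).__name__)  # type of the JSON field
print(normalized == json_rows)                # do they agree after conversion?
```

The data are "the same" in both files, but what your program actually receives differs, and it's on you to transform it into something your analysis can use.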

### 3: Computing with Data

Computing tools change; computer science fundamentals don't.

*...nevertheless*, you'll need to be well-versed in a broad range of tools for reading, analyzing, and interpreting data at each stage of the data science pipeline.

 - What programming language will you use?
 - What constructs will you employ within that language to maximize the efficiency of the algorithms you implement?
 - What kind of platforms will you deploy your programs on?
 - Can you containerize your applications for fine-grained resource scheduling?

Arguably as important as the pieces of the project you'll assemble is how you plan to document your efforts. Inevitably, your code will fall to someone else to maintain and improve in the future, but without the knowledge you acquired while assembling it in the first place. And just as inevitably, you will inherit code initially created by somebody long since gone. Thorough documentation strategies will be your lifeline.

### 4: Data Visualization and Presentation

> "If you can't write about it, you can't prove you know or understand it."

Not a lot of opportunity for written answers in this course...

...but the art of *visualization* is a close proxy!

Visualization is arguably **one of the most important aspects of data science**, as it is one of (if not *the*) primary ways in which analysis results are conveyed.

### 5: Data Modeling

Broadly speaking, there are two modeling cultures:

 - *Generative modeling*. In this case, you start with some dataset $\mathcal{X}$ and, using your knowledge of the data, construct a model $\mathcal{M}$ that could have feasibly *generated* your dataset $\mathcal{X}$.

 - *Predictive modeling*. In this case, you start with some dataset $\mathcal{X}$ and attempt to directly map it to some output space.

Data scientists should be familiar with and competent in both modeling paradigms. 
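
Here's a rough sketch of the two cultures side by side, using only Python's standard library. Everything in it is an illustrative assumption: the `hours` and `scores` values are invented, the generative model is assumed to be a normal distribution, and the predictive model is assumed to be a straight line fit by least squares. Real models in either culture are usually far richer.

```python
import random
import statistics

# Hypothetical dataset: hours studied (input) and exam score (output).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 61.0, 70.0, 79.0, 88.0]

# Generative view: assume the scores were drawn from a normal distribution,
# estimate its parameters, then use that model to generate new data.
mu = statistics.mean(scores)
sigma = statistics.stdev(scores)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
simulated_scores = [rng.gauss(mu, sigma) for _ in range(5)]

# Predictive view: skip the story about where the data came from and
# fit a direct mapping from input to output (a least-squares line).
x_bar = statistics.mean(hours)
slope = (
    sum((x - x_bar) * (y - mu) for x, y in zip(hours, scores))
    / sum((x - x_bar) ** 2 for x in hours)
)
intercept = mu - slope * x_bar


def predict(x):
    """Map a number of study hours to a predicted score."""
    return intercept + slope * x


print("Simulated scores from the generative model:", simulated_scores)
print("Predicted score for 6 hours of study:", predict(6.0))
```

The generative model can produce new, plausible-looking datasets; the predictive model can't, but it answers "what output goes with this input?" directly.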

### 6: Science about Data Science

Always need a healthy dose of meta!

GDS should be able to evaluate itself.

 - Packaging commonly-used analysis techniques or workflows into easily-accessible extensions

 - Analyzing an algorithm's runtime and memory efficiency

 - Measuring the effectiveness of human workflows from data extraction and exploration to final inferences

 - Verifying existing results!
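
As one small example of the second bullet, the sketch below compares two ways of finding a value in a sorted list by counting how many elements each one inspects. The function names and the test data are my own, chosen for illustration; counting steps is used here in place of wall-clock timing so the result doesn't depend on your hardware.

```python
def linear_search_steps(items, target):
    """Count the elements a left-to-right scan inspects before stopping."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Count the elements binary search inspects (input must be sorted)."""
    steps = 0
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1  # target can only be in the upper half
        else:
            hi = mid - 1  # target can only be in the lower half
    return steps


data = list(range(1024))        # already sorted
worst_case_target = data[-1]    # last element: worst case for a linear scan

linear_steps = linear_search_steps(data, worst_case_target)
binary_steps = binary_search_steps(data, worst_case_target)

print("Linear scan inspected:", linear_steps, "elements")
print("Binary search inspected:", binary_steps, "elements")
```

The linear scan's cost grows in proportion to the size of the list, while binary search's grows with its logarithm. Measuring that difference, and not just asserting it, is the "science about data science" in miniature.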

## Part 4: The Perfect [Data] Storm

Oh, and: there is a lot of money in data science. A *lot*.

![salaries](http://cobweb.cs.uga.edu/~squinn/courses/su16/csci1360e/assets/datascisalaries.png)

As our society relies ever more heavily on digitized data collection and automated decision making, those with the right skill sets to help shape these new infrastructures will continue to be in demand for a while.

Everybody's building a data science program these days

...and [UGA is no exception](http://news.uga.edu/releases/article/presidential-informatics-hiring-initiative)!

![ugaii](http://cs.uga.edu/~squinn/courses/fa16/csci1360/assets)

Not just tech companies! Any *company* that deals with any kind of data will likely have a need for highly trained data scientists.

We live in an era where

 - **cheap digital storage mechanisms**
 - **powerful computing hardware**
 - **unprecedented digital connectivity**

have combined to create a perfect storm **for those with the skills to ingest, structure, and analyze all the data** that is being generated.

## Review Questions

Some things to consider.

1: Why is the combination of hacking skills and substantive expertise, devoid of math and statistics knowledge, considered the "danger zone"?

2: What is the difference between *machine learning* and *artificial intelligence*?

3: Of the six divisions of "GDS," which one do you think is most prevalent in mainstream media? Why?

4: What's wrong with saying that data science is just statistics, but with computers?

5: Think of a company you wouldn't normally consider a "tech company," and contemplate how they might use a data science division to improve their bottom line.

## Course Administrivia

**Website**: https://eds-uga.github.io/csci1360-fa16

**Lectures**: Tuesdays and Thursdays.

**Flipped Lectures**: Every Wednesday!

**Programming Assignments**: 10 of them (not including "Assignment 0"), all to be done on JupyterHub.

**Exams**: 2 of them!

**Slack**: Dedicated Slack team, https://eds-uga-csci1360.slack.com, to ask and answer questions.

### What the flip is a flipped lecture?

Remember those "review questions" a few slides ago?

Unstructured time for *students* to structure, starting from the review questions of previous lectures.

 - Work out one or more of the review questions
 - Bring one or more of your own questions to the other students
 - Invite other students to come up and work out your questions

There is a **Participation** component to your grade; this will affect it!

### JupyterHub

Web portal through which all the programming assignments will be done!

[live demo]

### Grading

 - Assignments: 60% (10 of them)
 - Participation: 10% (flipped lectures, answering questions in Slack, participating in lecture)
 - Midterm Exam: 10%
 - Final Exam: 20%
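
If you'd like to see how those weights combine, here's a minimal sketch of the arithmetic in Python. The weights come from the list above; the scores are entirely hypothetical, and the sketch assumes each component is expressed as a percentage and ignores extra credit.

```python
# Weights taken from the grading breakdown above.
weights = {
    "assignments": 0.60,
    "participation": 0.10,
    "midterm": 0.10,
    "final": 0.20,
}

# Hypothetical percentage scores for one student (made up for illustration).
scores = {
    "assignments": 90.0,
    "participation": 100.0,
    "midterm": 80.0,
    "final": 85.0,
}

# Sanity check: the weights should account for the whole grade.
total_weight = sum(weights.values())

# The course grade is the weighted sum of the component scores.
course_grade = sum(weights[part] * scores[part] for part in weights)

print("Total weight:", total_weight)
print("Course grade:", course_grade)
```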

*Ample* opportunities for extra credit on assignments and exams.

Don't ask about doing "extra" extra credit. Just don't. kthx bai

Please don't copy code.

### Notetaker!

A student in this class requires a notetaker!

If you plan to attend class regularly and are interested, please see me after class.

For providing this service, you will be paid \$80 for the first student and \$40 for each additional student in this class. Payment will be made at the end of the semester.

## Questions?

### https://eds-uga.github.io/csci1360-fa16

![questions](https://cdn.meme.am/instances/55118459.jpg)

## Appendix: Additional Resources

 1. Grus, Joel. *Data Science from Scratch: First Principles with Python*. 2015.
 2. Donoho, David. *50 Years of Data Science*. 2015.