# *Data Analysis and Machine Learning Applications for Physicists*

*Material for a* [*University of Illinois*](http://illinois.edu) *course offered by the* [*Physics Department*](https://physics.illinois.edu). *This content is maintained on* [*GitHub*](https://github.com/illinois-mla) *and is distributed under a* [*BSD3 license*](https://opensource.org/licenses/BSD-3-Clause).

[Table of contents](Contents.ipynb)

## Welcome!

![Einstein-as-a-data-scientist](./img/Intro/pynstein.jpg)

**ACTIVITY:** Discuss these questions:
1. What is *Data Science*? *Machine Learning / Artificial Intelligence*? *Statistics / Data Analysis*? How are these related?
2. What distinguishes *Data* from *Models* from *Parameters* and their estimation?

![Data-models-statistics triangle](./img/Intro/MLS-triangle.png)

Further reading:
- [Data mining and statistics: what's the connection?](http://statweb.stanford.edu/~jhf/ftp/dm-stat.pdf)
- [The rise of the "data engineer"](https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603)
- [Humorous contrasts between ML and Stats](http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf)
 - Python $\leftrightarrow$ R
 - conference talk $\leftrightarrow$ journal article

![AI-Circle](./img/Intro/AI-Circle.jpg)

### How will this course be different from a CS class?

Physics and astronomy students have different preparation:
- Strong background and experience with mathematical tools (linear algebra, multivariate calculus) needed for rigorous discussion of statistics.
- Weak / varied background in traditional CS core topics such as fundamental algorithms, databases, etc.

Physics and astronomy research also has different needs:
- Our data and models are often fundamentally different from those in typical CS contexts.
- We ask different types of questions about our data, sometimes requiring new methods.
- We have different priorities for judging a "good" method: interpretability, error estimates, etc.

### What is Data?

Data are a finite set of measurements:
- Usually viewed as a 2D table, e.g., a spreadsheet, [FITS table](http://docs.astropy.org/en/stable/io/fits/usage/table.html), [ROOT tree](https://root.cern.ch/root/html/guides/users-guide/Trees.html#trees), [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)...
- **columns = features**: numeric / categorical?
- **rows = samples**: ordered? independent?
- measurement errors?
- binned / un-binned?
- similarity measure on samples?
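
As a minimal sketch of these ideas, the example below builds a small synthetic table with NumPy. The feature names (`energy`, `angle`), the random seed, and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical synthetic data: 100 samples (rows) x 2 numeric features (columns).
rng = np.random.default_rng(seed=123)
n_samples = 100
energy = rng.normal(loc=50.0, scale=5.0, size=n_samples)  # feature 0
angle = rng.uniform(low=0.0, high=np.pi, size=n_samples)  # feature 1
X = np.column_stack([energy, angle])
print(X.shape)  # (100, 2)

# Un-binned data keeps every measurement; binned data keeps only counts per bin.
counts, edges = np.histogram(energy, bins=10)

# One possible similarity measure on samples: Euclidean distance between rows.
# Features with different units are rescaled first (here, to z-scores).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
distance_01 = np.linalg.norm(Z[0] - Z[1])
```

The rescaling step is one choice among many. Which similarity measure is appropriate depends on the units and meaning of each feature.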

**ACTIVITY:** Pick one of these ML problems and describe the rows (samples) and columns (features) of the data you might use to solve the problem.
1. Learn a fast approximation to a slow exact calculation.
2. Learn to identify Higgs particle decays from LHC event data.
3. Learn to estimate the distance to a quasar using optical images.

### What is a Model?

Models specify the probabilities of possible measurements:
- Explicit: probability density function.
- Implicit: algorithm to generate random outcomes (forward / generative model).
- Usually wrong (except "Toy MC")
- Observables (latent variables):
 - integrability: required to calculate normalized probabilities.
- Parameters (and hyper-parameters):
 - differentiability: required to find most probable (uphill) direction.
- Bias-variance tradeoffs (regularization).

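The distinction between explicit and implicit models, and the roles of integrability and differentiability, can be sketched with a one-dimensional Gaussian. This is only an illustration under stated assumptions: the function names (`gaussian_pdf`, `generate`, `dlogL_dmu`), the seed, and the parameter values are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Explicit model: a normalized probability density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def generate(rng, mu, sigma, size):
    """Implicit (generative) model: an algorithm that produces random outcomes."""
    return rng.normal(loc=mu, scale=sigma, size=size)

# Integrability: the density should integrate to one over the observable x.
# A simple Riemann sum on a fine grid is used here.
x_grid = np.linspace(-10.0, 10.0, 2001)
dx = x_grid[1] - x_grid[0]
norm = np.sum(gaussian_pdf(x_grid, mu=0.0, sigma=1.0)) * dx

def dlogL_dmu(data, mu, sigma):
    """Derivative of the Gaussian log-likelihood with respect to mu."""
    return np.sum(data - mu) / sigma ** 2

# Differentiability: the sign of the gradient gives the "uphill" direction,
# toward more probable parameter values.
rng = np.random.default_rng(seed=1)
data = generate(rng, mu=2.0, sigma=1.0, size=1000)
grad_at_zero = dlogL_dmu(data, mu=0.0, sigma=1.0)  # positive: increase mu
```

The gradient vanishes when `mu` equals the sample mean, which is the most probable value of `mu` for this model with `sigma` held fixed.
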
### What is special about ML in physics and astronomy?

- We are data **producers**, not (only) data consumers:
 - Experiment / survey design.
 - Optimization of statistical errors.
 - Control of systematic errors.
- Our data represent measurements of physical processes:
 - Measurements often reduce to counting photons, etc., with random errors that are known a priori.
 - Dimensions and units are important.
- Our models are usually traceable to an underlying physical theory:
 - Models constrained by theory and previous observations.
 - Parameter values often intrinsically interesting.
- A parameter error estimate is just as important as its value:
 - Prefer methods that handle input data errors (weights) and provide output parameter error estimates.
- In some experiments and scientific domains, the data sets are *huge* ("Big Data"):
 - See one of my [recent talks](https://absuploads.aps.org/presentation.cfm?pid=14316)
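
One simple instance of a method that uses input data errors as weights and returns an output error estimate is the inverse-variance weighted mean. The sketch below assumes independent Gaussian errors, and the measurement values are hypothetical.

```python
import numpy as np

# Hypothetical repeated measurements of one quantity, each with its own
# known (1-sigma) Gaussian error.
values = np.array([10.2, 9.8, 10.5, 10.0])
errors = np.array([0.2, 0.1, 0.4, 0.2])

# Inverse-variance weights: more precise measurements count more.
weights = 1.0 / errors ** 2
mean = np.sum(weights * values) / np.sum(weights)

# Error on the weighted mean, valid for independent Gaussian errors.
mean_error = 1.0 / np.sqrt(np.sum(weights))
print(f"{mean:.3f} +/- {mean_error:.3f}")
```

The combined error is smaller than the smallest individual error, as expected when independent measurements are combined.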

---

**Postprocessing for html export of notebook**

Python postamble (do not edit):

```
!pip install jupyter_contrib_nbextensions >/dev/null
!jupyter nbconvert *.ipynb --to html_embed
```

Please see the full instructions at

https://illinois-mla.github.io/syllabus/assets/html-embed-export-tutorial.pdf

for what to do after this cell executes to obtain a pdf of your notebook.