{ "cells": [ { "cell_type": "code", "execution_count": 3, "metadata": { "init_cell": true, "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" }, "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "metadata": { "notebookRunGroups": { "groupValue": "" }, "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Data Science introduction" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## 1. Defining data science" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "### What's Data Science\n", " \n", "> **Data Science** is defined as a scientific field that uses scientific methods to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains." 
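, "\n", "As a minimal, hypothetical sketch of what this looks like in practice, the snippet below extracts one actionable insight - an average - from a tiny structured dataset (the numbers are invented for illustration):\n", "\n", "```python\n", "# Hypothetical structured data: four product ratings.\n", "ratings = [5, 3, 4, 4]\n", "\n", "# Extract a simple, actionable insight: the average rating.\n", "average = sum(ratings) / len(ratings)\n", "print(average)  # 4.0\n", "```"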
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Main goal - extract knowledge from data.\n", "- Uses scientific methods, such as probability and statistics.\n", "- Obtained knowledge should be applied to produce some actionable insights.\n", "- Should be able to operate on both structured and unstructured data.\n", "- Application domain is important, and some degree of expertise in the problem domain is required." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### What you can do with data\n", "\n", "
\n", "\n", "
\n", " \n", "
\n", " \n", "- Data acquisition. \n", "- Data storage. \n", " - A relational database.\n", " - A NoSQL database.\n", " - Data Lake.\n", "- Data processing. \n", "- Visualization / human insights.\n", "- Training a predictive model.\n", " \n", "
\n", "\n", "
\n", "\n", "\n", " The data science venn diagram[1]\n", "\n", "
\n", " \n", "\n", "
\n", "\n", "\n", "\n", "*1. The data science venn diagram. URL: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram (visited on 2022-08-27).*\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Related fields\n", "\n", "- Databases\n", "- Big Data\n", "- Machine Learning\n", "- Artificial Intelligence\n", "- Visualization" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### What you can do with data\n", "\n", "- Data acquisition\n", "- Data storage\n", "- Data processing\n", "- Visualization / human insights\n", "- Training a predictive model" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## 2. Data Science ethics" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", "

\n", " \n", "- Data ethics\n", "- Applied ethics\n", "- Ethics culture\n", "\n", "

\n", "\n", "
\n", " \n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "notes" } }, "source": [ "### Basic definition\n", "\n", "- **Data ethics** is a new branch of ethics that “studies and evaluates moral problems related to data, algorithms and corresponding practices”. \n", "- **Applied ethics** is the practical application of moral considerations. It’s the process of actively investigating ethical issues in the context of real-world actions, products, and processes, and taking corrective measures\n", "- **Ethics culture** is about operationalizing applied ethics to make sure that our ethical principles and practices are adopted in a consistent and scalable manner across the entire organization. " ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Ethics concepts\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "\n", "- Accountability\n", "- Transparency\n", "- Fairness\n", "- Reliability & Safety\n", "- Privacy & Security\n", "- Inclusiveness\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "Responsible AI at Microsoft[1]\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "\n", "*1. jcodella. Ethics and responsible use - Personalizer - Azure Cognitive Services. URL: https://learn.microsoft.com/en-us/azure/cognitive-services/personalizer/ethics-responsible-use (visited on 2022-10-01).*\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "notes" } }, "source": [ "* Accountability - makes practitioners responsible for their data & AI operations, and for compliance with these ethical principles.\n", "* Transparency - ensures that data and AI actions are understandable (interpretable) to users, explaining the what and why behind decisions.\n", "* Fairness - focuses on ensuring AI treats all people fairly, addressing any systemic or implicit socio-technical biases in data and systems.\n", "* Reliability & Safety - ensures that AI behaves consistently with defined values, minimizing potential harms or unintended consequences.\n", "* Privacy & Security - is about understanding data lineage, and providing data privacy and related protections to users.\n", "* Inclusiveness - is about designing AI solutions with intention, and adapting them to meet a broad range of human needs & capabilities." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Ethics challenges\n", "\n", "1. Data ownership\n", "2. Informed consent\n", "3. Intellectual property\n", "4. Data privacy\n", "5. Right to be forgotten\n", "6. Dataset bias\n", "7. Data quality\n", "8. Algorithm fairness\n", "9. Misrepresentation\n", "10. Free choice" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "notes" } }, "source": [ "| Ethics Challenge | Case Study |\n", "|--- |--- |\n", "| **Informed consent** | 1972 - [Tuskegee Syphilis Study](https://en.wikipedia.org/wiki/Tuskegee_Syphilis_Study) - African American men who participated in the study were promised free medical care _but were deceived_ by researchers who failed to inform subjects of their diagnosis or about the availability of treatment. Many subjects died, and partners or children were affected; the study lasted 40 years. |\n", "| **Data privacy** | 2007 - The [Netflix data prize](https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/) provided researchers with _10M anonymized movie rankings from 50K customers_ to help improve recommendation algorithms. However, researchers were able to correlate anonymized data with personally-identifiable data in _external datasets_ (e.g., IMDb comments) - effectively \"de-anonymizing\" some Netflix subscribers. |\n", "| **Collection bias** | 2013 - The City of Boston [developed Street Bump](https://www.boston.gov/transportation/street-bump), an app that let citizens report potholes, giving the city better roadway data to find and fix issues. However, [people in lower income groups had less access to cars and phones](https://hbr.org/2013/04/the-hidden-biases-in-big-data), making their roadway issues invisible in this app. Developers worked with academics to address _equitable access and digital divides_ issues for fairness. |\n", "| **Algorithmic fairness** | 2018 - The MIT [Gender Shades Study](http://gendershades.org/overview.html) evaluated the accuracy of gender classification AI products, exposing gaps in accuracy for women and persons of color. A [2019 Apple Card](https://www.wired.com/story/the-apple-card-didnt-see-genderand-thats-the-problem/) algorithm seemed to offer less credit to women than men. Both illustrate issues of algorithmic bias leading to socio-economic harms. |\n", "| **Data misrepresentation** | 2020 - The [Georgia Department of Public Health released COVID-19 charts](https://www.vox.com/covid-19-coronavirus-us-response-trump/2020/5/18/21262265/georgia-covid-19-cases-declining-reopening) that appeared to mislead citizens about trends in confirmed cases with non-chronological ordering on the x-axis. This illustrates misrepresentation through visualization tricks. |\n", "| **Illusion of free choice** | 2020 - Learning app [ABCmouse paid $10M to settle an FTC complaint](https://www.washingtonpost.com/business/2020/09/04/abcmouse-10-million-ftc-settlement/) where parents were trapped into paying for subscriptions they couldn't cancel. This illustrates dark patterns in choice architectures, where users were nudged towards potentially harmful choices. |\n", "| **Data privacy & user rights** | 2021 - Facebook [Data Breach](https://www.npr.org/2021/04/09/986005820/after-data-breach-exposes-530-million-facebook-says-it-will-not-notify-users) exposed data from 530M users, resulting in a $5B settlement to the FTC. However, it refused to notify users of the breach, violating user rights around data transparency and access. 
|" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Applied ethics\n", "\n", "- Professional codes, e.g.\n", " - [Oxford Munich](http://www.code-of-ethics.org/code-of-conduct/) Code of Ethics\n", " - [Data Science Association](http://datascienceassn.org/code-of-conduct.html) Code of Conduct (created 2013)\n", " - [ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics) (since 1993)\n", "- Ethics checklists, e.g.\n", " * [Deon](https://deon.drivendata.org/) - a general-purpose data ethics checklist\n", " * [AI Fairness Checklist](https://www.microsoft.com/en-us/research/project/ai-fairness-checklist/)\n", " * [22 questions for ethics in data and AI](https://medium.com/the-organization/22-questions-for-ethics-in-data-and-ai-efb68fd19429)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## 3. Defining data\n", "\n", "- Quantitative data\n", "- Qualitative data\n", "- Structured data - IoT, surveys, analysis of behavior\n", "- Unstructured data - texts, images or videos, logs\n", "- Semi-structured data - social network\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## 4. Introduction to statistics and probability" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Probability and random variables\n", "\n", "> **Probability** is a number between 0 and 1 that expresses how probable an **event** is. 
And when we talk about events, we use **random variables**.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Aggregations: min, max, and everything in between\n", "\n", "Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## 5. Your turn! 🚀\n", "\n", "1. [Word cloud](../../data-science/introduction/defining-data-science.html#your-turn)\n", "2. [Small diabetes study](../../data-science/introduction/introduction-to-statistics-and-probability.html#your-turn)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## 6. References\n", "\n", "1. [Data Science introduction](https://ocademy-ai.github.io/machine-learning/data-science/introduction/introduction.html)" ] } ], "metadata": { "celltoolbar": "Slideshow", "init_cell": "run_on_kernel_ready", "jupytext": { "formats": "ipynb" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13 (main, May 24 2022, 21:28:31) \n[Clang 13.1.6 (clang-1316.0.21.2)]" }, "rise": { "autolaunch": true, "chalkboard": { "color": [ "rgb(250, 250, 250)", "rgb(250, 250, 250)" ] }, "enable_chalkboard": true, "header": "", "scroll": true }, "vscode": { "interpreter": { "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" } } }, "nbformat": 4, "nbformat_minor": 2 }