{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# How Engineers Solve Big and Difficult Problems Part 1: The Challenges/Opportunities Presented to Engineers by AI/ML\n", "\n", "### [Neil D. Lawrence](http://inverseprobability.com), University of Cambridge\n", "\n", "### 2022-11-14" ], "id": "accbc21b-d54a-4dbe-be2a-7cdac01f8ab5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Abstract**: Machine learning solutions, in particular those based on\n", "deep learning methods, form an underpinning of the current revolution in\n", "“artificial intelligence” that has dominated popular press headlines and\n", "is having a significant influence on the wider tech agenda. In this talk\n", "I will give an overview of where we are now with machine learning\n", "solutions, and what challenges we face both in the near and far future.\n", "These include practical application of existing algorithms in the face\n", "of the need to explain decision making, mechanisms for improving the\n", "quality and availability of data, and dealing with large unstructured\n", "datasets." 
], "id": "db402c75-0ca1-4b5c-b3e9-1d26bc64bf8b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ], "id": "41d9cda9-3c8b-4384-ab09-dbdca264e7a5" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.rcParams.update({'font.size': 22})" ], "id": "1fd6aa4a-04d3-456e-9402-5276469c548a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## notutils\n", "\n", "This small package is a helper package for various notebook utilities.\n", "\n", "The software can be installed using" ], "id": "406d71f7-492d-489d-874e-30dc1d4e8811" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install notutils" ], "id": "26643453-2ab4-47f4-a65a-a8e7668aaf6c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "from the command prompt where you can access your Python installation.\n", "\n", "The code is also available on GitHub.\n", "\n", "Once `notutils` is installed, it can be imported in the usual manner." ], "id": "26cd4808-0f35-451c-bc9f-11c8ba6f7966" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import notutils" ], "id": "d4f16428-a18b-44eb-9254-84899dcc2455" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## pods\n", "\n", "In Sheffield we created a suite of software tools for ‘Open Data\n", "Science’. 
Open data science is an approach to sharing code, models and\n", "data that should make it easier for companies, health professionals and\n", "scientists to gain access to data science techniques.\n", "\n", "You can also check this blog post on [Open Data\n", "Science](http://inverseprobability.com/2014/07/01/open-data-science).\n", "\n", "The software can be installed using" ], "id": "cde8c24a-4307-4b61-9d47-63ef3a2a1f81" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install pods" ], "id": "169ce6d1-da25-4bb3-8c7b-4a5e8c036df6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "from the command prompt where you can access your Python installation.\n", "\n", "The code is also available on GitHub.\n", "\n", "Once `pods` is installed, it can be imported in the usual manner." ], "id": "dbbe10f6-923a-4fc8-8649-1b3641cf26f0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods" ], "id": "71d44f45-8dd5-4a1b-971a-a8bc5e25d135" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## mlai\n", "\n", "The `mlai` software is a suite of helper functions for teaching and\n", "demonstrating machine learning algorithms. It was first used in the\n", "Machine Learning and Adaptive Intelligence course in Sheffield in 2013.\n", "\n", "The software can be installed using" ], "id": "8c5067ae-74c8-4db3-b96e-b684f81b8b6b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install mlai" ], "id": "932e3b5e-053f-48a5-9810-f23863dbec8f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "from the command prompt where you can access your Python installation.\n", "\n", "The code is also available on GitHub.\n", "\n", "Once `mlai` is installed, it can be imported in the usual manner." 
], "id": "2132df0a-f1e2-4419-aad9-ccbc6a4e3918" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mlai" ], "id": "32531eb6-4ca1-4061-b323-486eed6f3eac" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Complexity in Action\n", "\n", "As an exercise in understanding complexity, watch the following video.\n", "You will see the basketball being bounced around, and the players\n", "moving. Your job is to count the passes of those dressed in white and\n", "ignore those of the individuals dressed in black." ], "id": "67d806c2-ef93-422e-a10e-dae5b708d344" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('vJG698U2Mvo')" ], "id": "34e154b1-08ec-4c37-8064-735e56f3363e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: Daniel Simons’s famous illusion “monkey business”. Focus on\n", "the movement of the ball distracts the viewer from seeing other aspects\n", "of the image.\n", "\n", "In a classic study Simons and Chabris (1999) ask subjects to count the\n", "number of passes of the basketball between players on the team wearing\n", "white shirts. 
Fifty percent of the time, these subjects don’t notice the\n", "gorilla moving across the scene.\n", "\n", "The phenomenon of inattentional blindness is well known, e.g., in their\n", "paper Simons and Chabris quote the Hungarian neurologist, Rezsö Bálint,\n", "\n", "> It is a well-known phenomenon that we do not notice anything happening\n", "> in our surroundings while being absorbed in the inspection of\n", "> something; focusing our attention on a certain object may happen to\n", "> such an extent that we cannot perceive other objects placed in the\n", "> peripheral parts of our visual field, although the light rays they\n", "> emit arrive completely at the visual sphere of the cerebral cortex.\n", ">\n", "> Rezsö Bálint 1907 (translated in Husain and Stein 1988, page 91)\n", "\n", "When we combine the complexity of the world with our relatively low\n", "bandwidth for information, problems can arise. Our focus on what we\n", "perceive to be the most important problem can cause us to miss other\n", "(potentially vital) contextual information.\n", "\n", "This phenomenon is known as selective attention or ‘inattentional\n", "blindness’." ], "id": "542d8dc3-4a94-4d6e-8b41-3b30d9d88a10" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('_oGAzq5wM_Q')" ], "id": "e3b4436c-1d9d-42ed-a5f4-2f5bacf6767e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: For a longer talk on inattentional blindness from Daniel Simons\n", "see this video." ], "id": "5e3c31c6-958e-4733-87a5-2bef68e8de6f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Selective Attention Bias\n", "\n", "We are going to see how inattentional biases can play out in data analysis\n", "by going through a simple example. The analysis involves body mass index\n", "and activity information." 
], "id": "449c7485-691d-44b0-ab95-5848e99ddee3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BMI Steps Data\n", "\n", "The BMI Steps example is taken from Yanai and Lercher (2020). We are\n", "given a data set of body-mass index measurements against step counts.\n", "For convenience we have packaged the data so that it can be easily\n", "downloaded." ], "id": "122d481d-3b32-45a4-925b-b9a009ceeb25" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods" ], "id": "9734d7af-360c-4a43-be62-43aa04f2021b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the packaged BMI/steps data set.\n", "data = pods.datasets.bmi_steps()\n", "X = data['X']\n", "y = data['Y']" ], "id": "e42fe7d2-d341-411a-88c1-0fc2d95cc2a2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is good practice to give our variables interpretable names so that\n", "the analysis may be clearly understood by others. Here the `steps` count\n", "is the first dimension of the covariate, the `bmi` is the second\n", "dimension and the `gender` is stored in `y` with `1` for female and `0`\n", "for male." ], "id": "5310ef12-fe3d-4679-8380-1b4e3764fd9b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Column 0 of X holds the step counts, column 1 holds the BMI.\n", "steps = X[:, 0]\n", "bmi = X[:, 1]\n", "gender = y[:, 0]" ], "id": "affb3807-7af7-4a80-a17b-88af97673512" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the mean steps and the mean of the BMI." 
], "id": "c881d977-f62b-4c4a-9b09-1c09193a65a8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Steps mean is {mean}.'.format(mean=steps.mean()))" ], "id": "57c02cfe-4b39-4933-9c53-5be9695ed0a1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('BMI mean is {mean}.'.format(mean=bmi.mean()))" ], "id": "9c8e4f3e-5b96-4c56-ae32-17137c6898cc" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BMI Steps Data Analysis\n", "\n", "We can also separate out the means from the male and female populations.\n", "In Python this can be done by setting male and female indices as\n", "follows." ], "id": "2a9c6357-594d-4030-8bfc-dadf19dfceaa" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "male_ind = (gender==0)\n", "female_ind = (gender==1)" ], "id": "9d7c1486-3c27-49b9-9445-bbf13f6a11ce" }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can extract the variables for the two populations." ], "id": "0d836ab6-5592-4986-8ae2-a9855e94bd6b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "male_steps = steps[male_ind]\n", "male_bmi = bmi[male_ind]" ], "id": "5bb17acc-19a1-4e28-9fb6-167bd97e3a61" }, { "cell_type": "markdown", "metadata": {}, "source": [ "And as before we compute the mean." 
], "id": "4abe5acc-541e-404b-8e88-789c6174fd49" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Male steps mean is {mean}.'.format(mean=male_steps.mean()))" ], "id": "55c60654-b3aa-4406-a5a5-b2b76b4dfbb5" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Male BMI mean is {mean}.'.format(mean=male_bmi.mean()))" ], "id": "8493f902-f223-483d-b3d5-80c8d85cbb96" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we can get the same result for the female portion of the\n", "population." ], "id": "27bcc4fc-25ad-4376-b3cf-0d33c77e0935" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "female_steps = steps[female_ind]\n", "female_bmi = bmi[female_ind]" ], "id": "5b5d08b5-176c-4df1-9b6e-fc18651c2dd0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Female steps mean is {mean}.'.format(mean=female_steps.mean()))" ], "id": "289707cd-332b-4407-a5bd-a4a42e43bcb6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Female BMI mean is {mean}.'.format(mean=female_bmi.mean()))" ], "id": "c1d07f8a-8085-4b83-8c0c-b207591f06a2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interestingly, the female BMI average is slightly higher than the male BMI\n", "average. The number of steps in the male group is higher than that in\n", "the female group. Perhaps the steps and the BMI are anti-correlated. The\n", "more steps, the lower the BMI.\n", "\n", "SciPy provides a statistics module, `scipy.stats`. We’ll import its\n", "`pearsonr` function so that we can try to understand the correlation\n", "between the `steps` and the `bmi`." 
], "id": "8bbb9657-b979-43c1-91a7-44e512759831" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import pearsonr" ], "id": "77e161f6-d873-4b64-8603-8703620af857" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corr, _ = pearsonr(steps, bmi)\n", "print(\"Pearson's overall correlation: {corr}\".format(corr=corr))" ], "id": "77b4aab6-e219-4cbf-a461-89e18cfe616e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "male_corr, _ = pearsonr(male_steps, male_bmi)\n", "print(\"Pearson's correlation for males: {corr}\".format(corr=male_corr))" ], "id": "da318633-c94f-47c0-a7a6-ee14f0e31b13" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "female_corr, _ = pearsonr(female_steps, female_bmi)\n", "print(\"Pearson's correlation for females: {corr}\".format(corr=female_corr))" ], "id": "3ae0582c-21e1-4556-a0e4-16e346e465fd" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mlai.plot as plot\n", "import mlai\n", "import matplotlib.pyplot as plt" ], "id": "5080d27e-e1ba-4048-a673-79ad32c00643" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "_ = ax.plot(X[male_ind, 0], X[male_ind, 1], 'g.',markersize=10)\n", "_ = ax.plot(X[female_ind, 0], X[female_ind, 1], 'r.',markersize=10)\n", "_ = ax.set_xlabel('steps', fontsize=20)\n", "_ = ax.set_ylabel('BMI', fontsize=20)\n", "xlim = (0, 15000)\n", "ylim = (15, 32.5)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "mlai.write_figure(filename='bmi-steps.svg',\n", " directory='./datasets',\n", " transparent=True)" ], "id": "be30ecc0-a4db-4e09-9ccd-fd213d56c4d9" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Hypothesis as a Liability\n", "\n", "This analysis 
is from an article titled “A Hypothesis as a Liability”\n", "(Yanai and Lercher, 2020). They start their article with the following\n", "quote from Hermann Hesse.\n", "\n", "> “ ‘When someone seeks,’ said Siddhartha, ‘then it easily happens that\n", "> his eyes see only the thing that he seeks, and he is able to find\n", "> nothing, to take in nothing. \\[…\\] Seeking means: having a goal. But\n", "> finding means: being free, being open, having no goal.’ ”\n", ">\n", "> Hermann Hesse\n", "\n", "Their idea is that having a hypothesis can constrain our thinking.\n", "However, in answer to their paper Felin et al. (2021) argue that some\n", "form of hypothesis is always necessary, suggesting that a hypothesis\n", "*can* be a liability.\n", "\n", "My view is captured in the introductory chapter to an edited volume on\n", "computational systems biology that I worked on with Mark Girolami,\n", "Magnus Rattray and Guido Sanguinetti.\n", "\n", "Figure: Quote from Lawrence (2010) highlighting the importance of\n", "interaction between data and hypothesis.\n", "\n", "Popper nicely captures the interaction between hypothesis and data by\n", "relating it to the chicken and the egg. The important thing is that\n", "these two co-evolve." ], "id": "0358de2b-a466-4d90-bf0b-5c337b7c955f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What is Machine Learning?\n", "\n", "What is machine learning? At its most basic level machine learning is a\n", "combination of\n", "\n", "$$\\text{data} + \\text{model} \\stackrel{\\text{compute}}{\\rightarrow} \\text{prediction}$$\n", "\n", "where *data* is our observations. They can be actively or passively\n", "acquired (meta-data). The *model* contains our assumptions, based on\n", "previous experience. That experience can be other data, it can come from\n", "transfer learning, or it can merely be our beliefs about the\n", "regularities of the universe. 
In humans our models include our inductive\n", "biases. The *prediction* is an action to be taken or a categorization or\n", "a quality score. The reason that machine learning has become a mainstay\n", "of artificial intelligence is the importance of predictions in\n", "artificial intelligence. The data and the model are combined through\n", "computation.\n", "\n", "In practice we normally perform machine learning using two functions. To\n", "combine data with a model we typically make use of:\n", "\n", "**a prediction function**, which is used to make the predictions. It\n", "includes our beliefs about the regularities of the universe, our\n", "assumptions about how the world works, e.g., smoothness, spatial\n", "similarities, temporal similarities.\n", "\n", "**an objective function**, which defines the ‘cost’ of misprediction.\n", "Typically, it includes knowledge about the world’s generating processes\n", "(probabilistic objectives) or the costs we pay for mispredictions\n", "(empirical risk minimization).\n", "\n", "The combination of data and model through the prediction function and\n", "the objective function leads to a *learning algorithm*. The class of\n", "prediction functions and objective functions we can make use of is\n", "restricted by the algorithms they lead to. If the prediction function or\n", "the objective function is too complex, then it can be difficult to find\n", "an appropriate learning algorithm. 
Much of the academic field of machine\n", "learning is the quest for new learning algorithms that allow us to bring\n", "different types of models and data together.\n", "\n", "A useful reference for the state of the art in machine learning is the UK\n", "Royal Society Report, [Machine Learning: Power and Promise of Computers\n", "that Learn by\n", "Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).\n", "\n", "You can also check my blog post on [What is Machine\n", "Learning?](http://inverseprobability.com/2017/07/17/what-is-machine-learning)." ], "id": "616699af-79cf-4bde-8176-e369b3afb744" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Artificial Intelligence and Data Science\n", "\n", "Machine learning technologies have been the driver of two related but\n", "distinct disciplines. The first is *data science*. Data science is an\n", "emerging field that arises from the fact that we now collect so much\n", "data by happenstance, rather than by *experimental design*. Classical\n", "statistics is the science of drawing conclusions from data, and to do so\n", "statistical experiments are carefully designed. In the modern era we\n", "collect so much data that there’s a desire to draw inferences directly\n", "from the data.\n", "\n", "As well as machine learning, the field of data science draws from\n", "statistics, cloud computing, data storage (e.g. streaming data),\n", "visualization and data mining.\n", "\n", "In contrast, artificial intelligence technologies typically focus on\n", "emulating some form of human behaviour, such as understanding an image,\n", "or some speech, or translating text from one form to another. The recent\n", "advances in artificial intelligence have come from machine learning\n", "providing the automation. 
But in contrast to data science, in artificial\n", "intelligence the data is normally collected with the specific task in\n", "mind. In this sense it has strong relations to classical statistics.\n", "\n", "Classically, artificial intelligence worried more about *logic* and\n", "*planning* and focused less on data-driven decision making. Modern\n", "machine learning owes more to the field of *Cybernetics* (Wiener, 1948)\n", "than artificial intelligence. Related fields include *robotics*, *speech\n", "recognition*, *language understanding* and *computer vision*.\n", "\n", "There are strong overlaps between the fields: the wide availability of\n", "data by happenstance makes it easier to collect data for designing AI\n", "systems. These relations are coming through the wide availability of sensing\n", "technologies that are interconnected by cellular networks, WiFi and the\n", "internet. This phenomenon is sometimes known as the *Internet of\n", "Things*, but this feels like a dangerous misnomer. We must never forget\n", "that we are interconnecting people, not things.\n", "\n", "Convention for the Protection of *Individuals* with regard to Automatic\n", "Processing of *Personal Data* (1981/1/28)" ], "id": "5b1fe25b-1b8e-4e8c-b3d1-61de7048a7fa" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Evolved Relationship with Information\n", "\n", "The high bandwidth of computers has resulted in a close relationship\n", "between the computer and data. Large amounts of information can flow\n", "between the two. The degree to which the computer is mediating our\n", "relationship with data means that we should consider it an intermediary.\n", "\n", "Originally our low bandwidth relationship with data was affected by two\n", "characteristics. Firstly, our tendency to over-interpret, driven by our\n", "need to extract as much knowledge from our low bandwidth information\n", "channel as possible. Secondly, our improved understanding of the\n", "domain of *mathematical* statistics and how our cognitive biases can\n", "mislead us.\n", "\n", "With this new setup there is a potential for assimilating far more\n", "information via the computer, but the computer can present this to us in\n", "various ways. If its motives are not aligned with ours then it can\n", "misrepresent the information. This needn’t be nefarious; it can simply be\n", "because the computer is pursuing a different objective from ours. For\n", "example, if the computer is aiming to maximize our interaction time that\n", "may be a different objective from ours, which may be to summarize\n", "information in a representative manner in the *shortest* possible length\n", "of time.\n", "\n", "For example, for me, it was a common experience to pick up my telephone\n", "with the intention of checking when my next appointment was, but to soon\n", "find myself distracted by another application on the phone and end up\n", "reading something on the internet. 
By the time I’d finished reading, I\n", "would often have forgotten the reason I picked up my phone in the first\n", "place.\n", "\n", "There are great benefits to be had from the huge amount of information\n", "we can unlock from this evolved relationship between us and data. In\n", "biology, large scale data sharing has been driven by a revolution in\n", "genomic, transcriptomic and epigenomic measurement. The improved\n", "inferences that can be drawn through summarizing data by computer have\n", "fundamentally changed the nature of biological science. Now this\n", "phenomenon is also influencing us in our daily lives as data measured by\n", "*happenstance* is increasingly used to characterize us.\n", "\n", "Better mediation of this flow requires a better understanding of\n", "human-computer interaction. This in turn involves understanding our own\n", "intelligence better, what its cognitive biases are and how these might\n", "mislead us.\n", "\n", "For further thoughts see this Guardian article on [marketing in the internet\n", "era](https://www.theguardian.com/media-network/2015/jul/23/data-driven-economy-marketing)\n", "from 2015.\n", "\n", "You can also check my blog post on [System\n", "Zero](http://inverseprobability.com/2015/12/04/what-kind-of-ai). This\n", "was also written in 2015." ], "id": "ce5a760d-ba72-4dc1-870d-ae5702a25d5a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New Flow of Information\n", "\n", "Classically the field of statistics focused on mediating the\n", "relationship between the machine and the human. Our limited bandwidth of\n", "communication means we tend to over-interpret the limited information\n", "that we are given. In the extreme, we assign motives and desires to\n", "inanimate objects (a process known as anthropomorphizing). 
Much of\n", "mathematical statistics was developed to help temper this tendency and\n", "understand when we are justified in drawing conclusions from data.\n", "\n", "Figure: The trinity of human, data, and computer, highlighting the\n", "modern phenomenon. The communication channel between computer and data\n", "now has an extremely high bandwidth. The channel between human and\n", "computer and the channel between data and human are narrow. There is a new\n", "direction of information flow: information is reaching us mediated by the\n", "computer. The focus on classical statistics reflected the importance of\n", "the direct communication between human and data. The modern challenges\n", "of data science emerge when that relationship is being mediated by the\n", "machine.\n", "\n", "Data science brings new challenges. In particular, there is a very large\n", "bandwidth connection between the machine and data. This means that our\n", "relationship with data is now commonly being mediated by the machine.\n", "This is true whether in the acquisition of new data, which now happens by\n", "happenstance rather than with purpose, or in the interpretation of that\n", "data, where we are increasingly relying on machines to summarize what the\n", "data contains. This is leading to the emerging field of data science,\n", "which must not only deal with the same challenges that mathematical\n", "statistics faced in tempering our tendency to over-interpret data but\n", "must also deal with the possibility that the machine has either\n", "inadvertently or maliciously misrepresented the underlying data." 
], "id": "6d0440ec-13ea-498b-b45b-91d9076ee4f1" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Science Africa\n", "\n", "Figure: Data Science Africa is a\n", "ground-up initiative for capacity building around data science, machine\n", "learning and artificial intelligence on the African continent.\n", "\n", "Figure: Data Science Africa meetings held up to October 2021.\n", "Data Science Africa is a bottom-up initiative for capacity building in\n", "data science, machine learning and artificial intelligence on the\n", "African continent.\n", "\n", "As of October 2021 there have been five workshops and five schools,\n", "located in Nyeri, Kenya (twice); Kampala, Uganda; Arusha, Tanzania;\n", "Abuja, Nigeria; Addis Ababa, Ethiopia; Accra, Ghana; Kampala, Uganda and\n", "Kimberley, South Africa.\n", "\n", "The main notion is *end-to-end* data science. For example, going from\n", "data collection in the farmer’s field to decision making in the Ministry\n", "of Agriculture. Or going from malaria disease counts in health centers\n", "to medicine distribution.\n", "\n", "The philosophy is laid out in Lawrence (2015). The key idea is that the\n", "modern *information infrastructure* presents new solutions to old\n", "problems. Modes of development change because less capital investment is\n", "required to take advantage of this infrastructure. The philosophy is\n", "that local capacity building is the right way to leverage these\n", "challenges in addressing data science problems in the African context.\n", "\n", "Data Science Africa is now a non-governmental organization registered in\n", "Kenya. 
The organising board of the meeting is entirely made up of\n", "scientists and academics based on the African continent.\n", "\n", "Figure: The lack of existing physical infrastructure on the African\n", "continent makes it a particularly interesting environment for deploying\n", "solutions based on the *information infrastructure*. The idea is\n", "explored more in this Guardian op-ed on [How Africa\n", "can benefit from the data\n", "revolution](https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information).\n", "\n", "Guardian article on [Data Science\n", "Africa](https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information)" ], "id": "39dd7341-c288-4647-9fb3-dfd61207f2fd" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Prediction of Malaria Incidence in Uganda\n", "\n", "Martin Mubangizi, Ricardo Andrade Pacheco and John Quinn\n", "\n", "As an example of using Gaussian process models within the full pipeline\n", "from data to decision, we’ll consider the prediction of malaria incidence\n", "in Uganda. For the purposes of this study malaria reports come in two\n", "forms, HMIS reports from health centres and Sentinel data, which is\n", "curated by the WHO. There are limited sentinel sites and many HMIS\n", "sites.\n", "\n", "The work is from Ricardo Andrade Pacheco’s PhD thesis, completed in\n", "collaboration with John Quinn and Martin Mubangizi (Andrade-Pacheco et\n", "al., 2014; Mubangizi et al., 2014). 
John and Martin were initially from\n", "the AI-DEV group from the University of Makerere in Kampala and\n", "latterly they were based at UN Global Pulse in Kampala. You can see the\n", "work summarized on the UN Global Pulse [disease outbreaks project site\n", "here](https://diseaseoutbreaks.unglobalpulse.net/uganda/).\n", "\n", "Malaria data is spatial data. Uganda is split into districts, and health\n", "reports can be found for each district. This suggests that models such\n", "as conditional random fields could be used for spatial modelling, but\n", "there are two complexities with this. First of all, occasionally\n", "districts split into two. Secondly, each sentinel site is a specific\n", "location within a district, such as Nagongera, which is a sentinel site\n", "based in the Tororo district.\n", "\n", "Figure: Ugandan districts. Data SRTM/NASA.\n", "\n", "(Andrade-Pacheco et al., 2014; Mubangizi et al., 2014)\n", "\n", "The common standard for collecting health data on the African continent\n", "is from the Health Management Information Systems (HMIS). However, this\n", "data suffers from missing values (Gething et al., 2006) and diagnosis of\n", "diseases like typhoid and malaria may be confounded.\n", "\n", "Figure: The Tororo district, where the sentinel site, Nagongera, is\n", "located.\n", "\n", "[World Health Organization Sentinel Surveillance\n", "systems](https://www.who.int/immunization/monitoring_surveillance/burden/vpd/surveillance_type/sentinel/en/)\n", "are set up “when high-quality data are needed about a particular disease\n", "that cannot be obtained through a passive system”. 
Several sentinel\n", "sites give accurate assessment of malaria disease levels in Uganda,\n", "including a site in Nagongera.\n", "\n", "Figure: Sentinel and HMIS data along with rainfall and temperature\n", "for the Nagongera sentinel station in the Tororo district.\n", "\n", "In collaboration with the AI Research Group at Makerere we chose to\n", "investigate whether Gaussian process models could be used to assimilate\n", "information from these two different sources of disease information.\n", "Further, we were interested in whether local information on rainfall and\n", "temperature could be used to improve malaria estimates.\n", "\n", "The aim of the project was to use WHO Sentinel sites, alongside rainfall\n", "and temperature, to improve predictions from HMIS data of levels of\n", "malaria.\n", "\n", "Figure: The Mubende District.\n", "\n", "Figure: Prediction of malaria incidence in Mubende.\n", "\n", "Figure: The project arose out of the Gaussian process summer school\n", "held at Makerere in Kampala in 2013. The school led, in turn, to the\n", "Data Science Africa initiative." ], "id": "7ef185fc-e0f2-4489-9231-c8d04dc77d11" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Early Warning Systems\n", "\n", "Figure: The Kabarole district in Uganda.\n", "\n", "Figure: Estimate of the current disease situation in the Kabarole\n", "district over time. Estimate is constructed with a Gaussian process with\n", "an additive covariance function.\n", "\n", "Health monitoring system for the Kabarole district. Here we have fitted\n", "the reports with a Gaussian process with an additive covariance\n", "function. It has two components: one is a long time scale component (in\n", "red above), the other is a short time scale component (in blue).\n", "\n", "Monitoring proceeds by considering two aspects of the curve. 
Is the blue\n", "line (the short term report signal) above the red (which represents the\n", "long term trend)? If so, we have higher than expected reports. If this is\n", "the case *and* the gradient of the blue line is still positive (i.e. reports are going\n", "up) we encode this with a *red* color. If it is the case and the\n", "gradient of the blue line is negative (i.e. reports are going down) we\n", "encode this with an *amber* color. Conversely, if the blue line is below\n", "the red *and* decreasing, we color *green*. On the other hand, if it is\n", "below red but increasing, we color *yellow*.\n", "\n", "This gives us an early warning system for disease. Red is a bad\n", "situation getting worse; amber is bad but improving. Green is good and\n", "getting better; yellow is good but degrading.\n", "\n", "Finally, there is a gray region which represents when the scale of the\n", "effect is small.\n", "\n", "\n", "\n", "Figure: The map of Ugandan districts with an overview of the malaria\n", "situation in each district.\n", "\n", "These colors can now be observed directly on a spatial map of the\n", "districts to give an immediate impression of the current status of the\n", "disease across the country." 
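, "\n", "\n", "The monitoring rule described in this section can be written compactly.\n", "What follows is a sketch in notation introduced here for illustration, not\n", "taken from the original analysis: $f_s(t)$ denotes the short time scale\n", "(blue) signal, $f_\\ell(t)$ the long time scale (red) trend, and\n", "$\\epsilon$ is a hypothetical threshold below which the effect is\n", "treated as small. The additive covariance is the sum of the covariances\n", "of the two components,\n", "\n", "$$\n", "k(t, t^\\prime) = k_\\ell(t, t^\\prime) + k_s(t, t^\\prime),\n", "$$\n", "\n", "and the color assigned at time $t$ is\n", "\n", "$$\n", "c(t) = \\begin{cases}\n", "\\text{gray} & |f_s(t) - f_\\ell(t)| < \\epsilon, \\\\\n", "\\text{red} & f_s(t) > f_\\ell(t),\\ \\dot{f}_s(t) > 0, \\\\\n", "\\text{amber} & f_s(t) > f_\\ell(t),\\ \\dot{f}_s(t) < 0, \\\\\n", "\\text{green} & f_s(t) < f_\\ell(t),\\ \\dot{f}_s(t) < 0, \\\\\n", "\\text{yellow} & f_s(t) < f_\\ell(t),\\ \\dot{f}_s(t) > 0,\n", "\\end{cases}\n", "$$\n", "\n", "where the gray case takes precedence over the others. The text does not\n", "say how the scale of the effect is measured, so the absolute difference\n", "used for the gray case is an assumption."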
], "id": "a19dd848-568d-478a-a9b9-bf2671560389" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Thanks!\n", "\n", "For more information on these subjects and more you might want to check\n", "the following resources.\n", "\n", "- twitter: [@lawrennd](https://twitter.com/lawrennd)\n", "- podcast: [The Talking Machines](http://thetalkingmachines.com)\n", "- newspaper: [Guardian Profile\n", " Page](http://www.theguardian.com/profile/neil-lawrence)\n", "- blog:\n", " [http://inverseprobability.com](http://inverseprobability.com/blog.html)" ], "id": "e949b1b5-be3d-48ca-a817-a8b40f9658c2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ], "id": "91009742-0a8b-4894-939c-46a819f71a60" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Andrade-Pacheco, R., Mubangizi, M., Quinn, J., Lawrence, N.D., 2014.\n", "Consistent mapping of government malaria records across a changing\n", "territory delimitation. Malaria Journal 13.\n", "\n", "\n", "Felin, T., Koenderink, J., Krueger, J.I., Noble, D., Ellis, G.F.R.,\n", "2021. The data-hypothesis relationship. Genome Biology 22.\n", "\n", "\n", "Gething, P.W., Noor, A.M., Gikandi, P.W., Ogara, E.A.A., Hay, S.I.,\n", "Nixon, M.S., Snow, R.W., Atkinson, P.M., 2006. Improving imperfect data\n", "from health management information systems in Africa using space–time\n", "geostatistics. PLoS Medicine 3.\n", "\n", "\n", "Lawrence, N.D., 2015. [How Africa can benefit from the data\n", "revolution](https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information).\n", "\n", "Lawrence, N.D., 2010. Introduction to learning and inference in\n", "computational systems biology.\n", "\n", "Mubangizi, M., Andrade-Pacheco, R., Smith, M.T., Quinn, J., Lawrence,\n", "N.D., 2014. 
Malaria surveillance with multiple data sources using\n", "Gaussian process models, in: 1st International Conference on the Use of\n", "Mobile ICT in Africa.\n", "\n", "Simons, D.J., Chabris, C.F., 1999. Gorillas in our midst: Sustained\n", "inattentional blindness for dynamic events. Perception 28, 1059–1074.\n", "\n", "\n", "Wiener, N., 1948. Cybernetics: Control and communication in the animal\n", "and the machine. MIT Press, Cambridge, MA.\n", "\n", "Yanai, I., Lercher, M., 2020. A hypothesis is a liability. Genome\n", "Biology 21." ], "id": "bd90b480-b57f-4c56-b9de-3e8365f37924" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": {} }