{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "### 计算新闻传播学课程简介\n", "***\n", "***\n", "# 数据科学的编程工具:大数据\n", "***\n", "***\n", "\n", "王成军 \n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 http://computational-communication.com" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 关于大数据的图片" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](./img/bigdata.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# 数字" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](./img/bignumber.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# 网络" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](./img/bignetwork.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# 文本" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](./img/bigword.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> # Big data is like teenage sex: \n", "\n", "> # Everyone talks about it, nobody really knows how to do it, \n", ">\n", "# everyone thinks everyone else is doing it, so everyone claims they are doing it. \n", "\n", "> --Dan Ariely of Duke University" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> # Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation , search, sharing , storage, transfer, visualization, and information privacy . (WIKIPEDIA)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 云计算\n", "\n", "2006: AWS EC2 (cloud-based computing clusters)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "# Tools in the Ecosystem: \"Hadoop\" and Map/Reduce" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "# 阿里云" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "# 百度云" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Map/Reduce\n", "\n", "Google article on MapReduce by Dean and Ghemawat, 2004" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](./img/mapreduce.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The nightmare that is to find out a way to split the data\n", "- word count\n", "- network?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "# An alternative to Hadoop, Spark with Python \n", "- [wordcount with spark and python](https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "把系统与算法结合,设计大规模分布式的机器学习算法与系统,使得机器学习算法可以在多处理器和多机器的集群环境下作业,处理更大量级的数据。\n", "这方面较为知名的系统包括: \n", "- 加州大学伯克利分校的Spark\n", "- 谷歌的TensorFlow\n", "- 华盛顿大学的Dato GraphLab\n", "- 卡内基梅陇大学的Petuum\n", "- 微软的DMTK系统" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Giant Data Sets Are Around\n", "\n", "- amazon https://aws.amazon.com/public-data-sets/ \n", "- 雅虎开放数据 http://webscope.sandbox.yahoo.com/\n", "- Stanford Large Network Dataset Collection https://snap.stanford.edu/data/\n", " - bigdata/twitter7/tweets2009-07.txt.gz \n", "- Gdelt news data http://gdeltproject.org/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](./img/gdelt.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Big Query Demo of GDELT\n", "\n", "https://bigquery.cloud.google.com/table/gdelt-bq:gdeltv2.events?_ga=1.69322323.2127553167.1443183592" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](./img/bigquery.png)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import display_html, HTML\n", "HTML('')\n", "# the webpage we would like to crawl" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

imMens: Real-time Visual Querying of Big Data from Stanford Visualization Group on Vimeo.

\n", "\n", "[Bin-Summarize-Smooth: A Framework for Visualizing Large Data](http://vita.had.co.nz/papers/bigvis.pdf) (Hadley Wickham)\n", "\n", "[\"Why Exploring Big Data is Hard and What We Can Do About It\"](https://www.youtube.com/watch?t=2&v=UP5412nU2lI), Danyel Fisher's talk at OpenVisConf 2015 \n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](./img/dilbert.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Big Data and whole data are not the same. Without taking into account the sample of a data set, the size of the data set is meaningless. For example, a researcher may seek to understand the topical frequency of tweets, yet if Twitter removes all tweets that contain problematic words or content – such as references to pornography or spam – from the stream, the topical frequency would be inaccurate. Regardless of the number of tweets, it is not a representative sample as the data is skewed from the beginning. \n", "> d. boyd and K. Crawford, \"Critical Questions for Big Data\"\n", "\n", "Information, Communication & Society Volume 15, Issue 5, 2012 http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### ... four quantitatively adept social scientists reported that Google’s flu-tracking service not only wildly overestimated the number of flu cases in the United States in the 2012-13 flu season — a well-known miss — but has also consistently overshot in the last few years. Google Flu Trends’ estimate for the 2011-12 flu season was more than 50 percent higher than the cases reported by the Centers for Disease Control and Prevention. ...Their technical criticism of Google Flu Trends is that it is not using a broader array of data analysis tools. Indeed, their analysis shows that combining Google Flu Trends with C.D.C. data, and applying a few tweaking techniques, works best. \n", "\n", " \n", "[Google Flu Trends: The Limits of Big Data (NYT)](http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/?_r=0)\n", "\n", "> Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (14 March): 1203-1205." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "### The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results-- or from the accumulated evidence of Web-based text patterns and formatted tables-- in both cases without needing any manually annotated data. \n", "> Halevy, Norvig, Pereira " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "![](./img/datascience.png)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "# Type A: Analysis\n", "- making sense of data\n", " - very similar to a statistician \n", " \n", "# Type B: Builders\n", "- mainly interested in using data in production.\n", " - strong coders and may be trained software engineers. \n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "# Everyone should learn Python\n", "## Everyone should learn some SQL." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 参考文献\n", " \n", "- 维克托•迈尔•舍恩伯格(Viktor Mayer-Schönberger) 大数据时代:生活、工作与思维的大变革, 浙江人民出版社 Big Data:A Revolution That Will Transform How We Live, Work, and Think 译者: 周涛 2012-12 页数: 261\n", "\n", "- http://ghostweather.slides.com/lynncherny/what-is-big-data-anyway" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ " " ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python [conda env:anaconda]", "language": "python", "name": "conda-env-anaconda-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "780px", "left": "1366px", "top": "123px", "width": "341px" }, "toc_section_display": false, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }