{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Recipes for processing Twitter data with jq\n", "This notebook is a companion to [Getting Started Working with Twitter Data Using jq](http://nbviewer.jupyter.org/github/gwu-libraries/notebooks/blob/master/20160407-twitter-analysis-with-jq/Working-with-twitter-using-jq.ipynb). It focuses on recipes that the [Social Feed Manager](http://gwu-libraries.github.io/sfm-ui/) team has used when preparing datasets of tweets for researchers.\n", "\n", "We will continue to add recipes to this notebook. If you have any suggestions, please [contact us](http://gwu-libraries.github.io/sfm-ui/contact).\n", "\n", "This notebook requires at least [jq](https://stedolan.github.io/jq/) 1.5. Note that only earlier versions may be available from your package manager; manual installation may be necessary.\n", "\n", "These recipes can be used with any data source that outputs tweets as line-oriented JSON. Within the context of SFM, this is usually the output of [twitter_rest_warc_iter.py or twitter_stream_warc_iter.py](http://sfm.readthedocs.io/en/latest/processing.html#warc-iterators) within a [processing container](http://sfm.readthedocs.io/en/latest/processing.html#processing-in-container). Alternatively, [Twarc](https://github.com/DocNow/twarc) is a command-line tool for retrieving data from the Twitter API that outputs tweets as line-oriented JSON.\n", "\n", "For the purposes of this notebook, we will use a line-oriented JSON file that was created using Twarc. It contains the user timeline of @SocialFeedMgr. The command used to produce this file was `twarc.py --timeline socialfeedmgr > tweets.json`.\n", "\n", "For an explanation of the fields in a tweet, see the [Tweet Field Guide](https://dev.twitter.com/overview/api/tweets). 
For other helpful tweet processing utilities, see [twarc utils](https://github.com/DocNow/twarc/tree/master/utils).\n", "\n", "For the sake of brevity, some of the examples output only a subset of a tweet's fields and/or a subset of the tweets contained in `tweets.json`. The following example outputs the tweet id and text of the first 5 tweets.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;39m[\u001b[0;32m\"798895564335280129\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Social Feed Manager 1.3 is out, with collection portability, monitoring page, one-time harvest option, more https://t.co/L956zwfrGQ\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"797074713612877824\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"@SMLabTO How might we update the toolkit's listing for Social Feed Manager? Info and docs available via https://t.co/j3zQ7kGwNn\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"797064988116611073\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"RT @justin_littman: New on the @SocialFeedMgr blog: On retweets, replies, quotes & favorites: A guide for researchers. https://t.co/SjfIuLu…\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"794234002496487424\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"RT @ianmilligan1: Used @SocialFeedMgr to collect almost 80,000 #CdnPoli tweets this morning. Great, intuitive UI. https://t.co/BuS3S7f6kf #…\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"793896478037114880\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Software doesn't live forever. 
How to get collections OUT of Social Feed Manager, a new blog post by @justin_littman https://t.co/CagQvSF7pJ\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n" ] } ], "source": [ "!head -n5 tweets.json | jq -c '[.id_str, .text]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dates\n", "For both filtering and output, it is often necessary to parse and/or normalize the `created_at` date. The following shows the original `created_at` date and the date as an ISO 8601 date." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;39m[\u001b[0;32m\"2016-11-16T09:28:39Z\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"2016-11-16T09:28:39Z\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"2016-11-11T08:53:15Z\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"2016-11-11T08:53:15Z\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"2016-11-11T08:14:36Z\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"2016-11-11T08:14:36Z\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"2016-11-03T13:45:17Z\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"2016-11-03T13:45:17Z\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"2016-11-02T15:24:04Z\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"2016-11-02T15:24:04Z\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n" ] } ], "source": [ "!head -n5 tweets.json | jq -c '[.created_at, (.created_at | strptime(\"%A %B %d %T %z %Y\") | todate)]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering text\n", "#### Case sensitive" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;39m[\u001b[0;32m\"797064988116611073\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"RT @justin_littman: 
New on the @SocialFeedMgr blog: On retweets, replies, quotes & favorites: A guide for researchers. https://t.co/SjfIuLu…\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"793896478037114880\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Software doesn't live forever. How to get collections OUT of Social Feed Manager, a new blog post by @justin_littman https://t.co/CagQvSF7pJ\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"786179196577992707\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"A detailed look at recent technical work to improve our social media harvesters. New blog post by @justinlittman: https://t.co/FFHqJxfxl6\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"773553804558073857\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"When is a Collection not an Archive? New blog post by @save4use https://t.co/JtxyksXLdV\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"742420371434033153\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"RT @justin_littman: My blog post on collecting the tweets of #PulseNightclub with @SocialFeedMgr: https://t.co/qRQRNPRiOO\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"728300814129827840\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Another Try at Harvesting the Twitter Streaming API to WARC files, blog post by @justin_littman https://t.co/RJL9OqaGDW\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"720616794168369152\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"More on Social Feed Manager and blog posts about the WARC approach at https://t.co/iUdSktNyRp https://t.co/XKIiqaKDDp\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n" ] } ], "source": [ "!cat tweets.json | jq -c 'select(.text | contains(\"blog\")) | [.id_str, .text]'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!cat tweets.json | jq -c 'select(.text | contains(\"BLOG\")) | [.id_str, 
.text]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Case insensitive\n", "To ignore case, use a [regular expression filter](https://stedolan.github.io/jq/manual/#RegularexpressionsPCRE) with the case-insensitive flag." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;39m[\u001b[0;32m\"797064988116611073\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"RT @justin_littman: New on the @SocialFeedMgr blog: On retweets, replies, quotes & favorites: A guide for researchers. https://t.co/SjfIuLu…\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"793896478037114880\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Software doesn't live forever. How to get collections OUT of Social Feed Manager, a new blog post by @justin_littman https://t.co/CagQvSF7pJ\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"786179196577992707\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"A detailed look at recent technical work to improve our social media harvesters. New blog post by @justinlittman: https://t.co/FFHqJxfxl6\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"773553804558073857\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"When is a Collection not an Archive? 
New blog post by @save4use https://t.co/JtxyksXLdV\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"742420371434033153\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"RT @justin_littman: My blog post on collecting the tweets of #PulseNightclub with @SocialFeedMgr: https://t.co/qRQRNPRiOO\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"728300814129827840\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Another Try at Harvesting the Twitter Streaming API to WARC files, blog post by @justin_littman https://t.co/RJL9OqaGDW\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"720616794168369152\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"More on Social Feed Manager and blog posts about the WARC approach at https://t.co/iUdSktNyRp https://t.co/XKIiqaKDDp\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n" ] } ], "source": [ "!cat tweets.json | jq -c 'select(.text | test(\"BLog\"; \"i\")) | [.id_str, .text]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Filtering on multiple terms (OR)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;39m[\u001b[0;32m\"797064988116611073\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"RT @justin_littman: New on the @SocialFeedMgr blog: On retweets, replies, quotes & favorites: A guide for researchers. https://t.co/SjfIuLu…\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"793896478037114880\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Software doesn't live forever. How to get collections OUT of Social Feed Manager, a new blog post by @justin_littman https://t.co/CagQvSF7pJ\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"786179196577992707\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"A detailed look at recent technical work to improve our social media harvesters. 
New blog post by @justinlittman: https://t.co/FFHqJxfxl6\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"773553804558073857\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"When is a Collection not an Archive? New blog post by @save4use https://t.co/JtxyksXLdV\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"742420371434033153\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"RT @justin_littman: My blog post on collecting the tweets of #PulseNightclub with @SocialFeedMgr: https://t.co/qRQRNPRiOO\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"741072574239649792\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Glad to contribute to twarc, for the benefit of everyone needing access to twitter data. https://t.co/gwKXit2cho\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"728300814129827840\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Another Try at Harvesting the Twitter Streaming API to WARC files, blog post by @justin_littman https://t.co/RJL9OqaGDW\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"720616794168369152\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"More on Social Feed Manager and blog posts about the WARC approach at https://t.co/iUdSktNyRp https://t.co/XKIiqaKDDp\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n" ] } ], "source": [ "!cat tweets.json | jq -c 'select(.text | test(\"BLog|twarc\"; \"i\")) | [.id_str, .text]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Filtering on multiple terms (AND)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;39m[\u001b[0;32m\"728300814129827840\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Another Try at Harvesting the Twitter Streaming API to WARC files, blog post by @justin_littman https://t.co/RJL9OqaGDW\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n" ] } ], "source": [ "!cat tweets.json 
| jq -c 'select((.text | test(\"BLog\"; \"i\")) and (.text | test(\"twitter\"; \"i\"))) | [.id_str, .text]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filter dates\n", "The following shows tweets created after November 5, 2016." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;39m[\u001b[0;32m\"798895564335280129\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Wed Nov 16 14:28:39 +0000 2016\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"2016-11-16T09:28:39Z\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"797074713612877824\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Fri Nov 11 13:53:15 +0000 2016\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"2016-11-11T08:53:15Z\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"797064988116611073\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"Fri Nov 11 13:14:36 +0000 2016\"\u001b[0m\u001b[1;39m,\u001b[0;32m\"2016-11-11T08:14:36Z\"\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n" ] } ], "source": [ "!cat tweets.json | jq -c 'select((.created_at | strptime(\"%A %B %d %T %z %Y\") | mktime) > (\"2016-11-05T00:00:00Z\" | fromdateiso8601)) | [.id_str, .created_at, (.created_at | strptime(\"%A %B %d %T %z %Y\") | todate)]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Is retweet" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;39m[\u001b[0;32m\"797064988116611073\"\u001b[0m\u001b[1;39m,\u001b[0;39m796843045341790200\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"794234002496487424\"\u001b[0m\u001b[1;39m,\u001b[0;39m794224602310463500\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", 
"\u001b[1;39m[\u001b[0;32m\"793786215220776960\"\u001b[0m\u001b[1;39m,\u001b[0;39m793748406602723300\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"791966708390957056\"\u001b[0m\u001b[1;39m,\u001b[0;39m791929123723632600\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"791633052053176321\"\u001b[0m\u001b[1;39m,\u001b[0;39m791632539341447200\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"789442614571466752\"\u001b[0m\u001b[1;39m,\u001b[0;39m789416330009120800\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"785833961847001088\"\u001b[0m\u001b[1;39m,\u001b[0;39m785802401512947700\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"780736483682549760\"\u001b[0m\u001b[1;39m,\u001b[0;39m780732775963983900\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"773231772398194688\"\u001b[0m\u001b[1;39m,\u001b[0;39m773229286589341700\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"766639464454193152\"\u001b[0m\u001b[1;39m,\u001b[0;39m765636639133691900\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"758383727672225792\"\u001b[0m\u001b[1;39m,\u001b[0;39m758316697560416300\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"752856372644024320\"\u001b[0m\u001b[1;39m,\u001b[0;39m752584251648774100\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"750704374876364801\"\u001b[0m\u001b[1;39m,\u001b[0;39m750391045301559300\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"743462280852086784\"\u001b[0m\u001b[1;39m,\u001b[0;39m743459400967413800\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"743458848434958336\"\u001b[0m\u001b[1;39m,\u001b[0;39m743458460205989900\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", 
"\u001b[1;39m[\u001b[0;32m\"743410385923997696\"\u001b[0m\u001b[1;39m,\u001b[0;39m743121113035771900\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"743167286006079488\"\u001b[0m\u001b[1;39m,\u001b[0;39m743166124133650400\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"743128282720284672\"\u001b[0m\u001b[1;39m,\u001b[0;39m743127197066612700\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"742420371434033153\"\u001b[0m\u001b[1;39m,\u001b[0;39m742418514686926800\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"742053812895027200\"\u001b[0m\u001b[1;39m,\u001b[0;39m742048151176151000\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"720621348565970944\"\u001b[0m\u001b[1;39m,\u001b[0;39m720621197550071800\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"720223544014213120\"\u001b[0m\u001b[1;39m,\u001b[0;39m720222105435009000\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"715610839890403329\"\u001b[0m\u001b[1;39m,\u001b[0;39m715607896793477100\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n" ] } ], "source": [ "!cat tweets.json | jq -c 'select(has(\"retweeted_status\")) | [.id_str, .retweeted_status.id]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Is quote" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;39m[\u001b[0;32m\"789100742430887936\"\u001b[0m\u001b[1;39m,\u001b[0;39m789098583362502700\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"745995798354210816\"\u001b[0m\u001b[1;39m,\u001b[0;39m745988440794210300\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", "\u001b[1;39m[\u001b[0;32m\"741072574239649792\"\u001b[0m\u001b[1;39m,\u001b[0;39m741018168773226500\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n", 
"\u001b[1;39m[\u001b[0;32m\"720616794168369152\"\u001b[0m\u001b[1;39m,\u001b[0;39m720615458412605400\u001b[0m\u001b[1;39m\u001b[1;39m]\u001b[0m\r\n" ] } ], "source": [ "!cat tweets.json | jq -c 'select(has(\"quoted_status\")) | [.id_str, .quoted_status.id]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Output\n", "To write output to a file use `> `. For example: `cat tweets.json | jq -r '.id_str' > tweet_ids.txt`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CSV\n", "Following is a CSV output that has fields similar to the CSV output produced by [SFM's export functionality](http://sfm.readthedocs.io/en/latest/quickstart.html#exports).\n", "\n", "Note that is uses the `-r` flag for jq instead of the `-c` flag.\n", "\n", "Also note that is it is necessary to remove line breaks from the tweet text to prevent it from breaking the CSV. This is done with `(.text | gsub(\"\\n\";\" \"))`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"2016-11-16T09:28:39Z\",\"798895564335280129\",\"SocialFeedMgr\",124,27,1,0,,\"http://twitter.com/SocialFeedMgr/status/798895564335280129\",\"Social Feed Manager 1.3 is out, with collection portability, monitoring page, one-time harvest option, more https://t.co/L956zwfrGQ\",false,false\r\n", "\"2016-11-11T08:53:15Z\",\"797074713612877824\",\"SocialFeedMgr\",124,27,0,0,\"SMLabTO\",\"http://twitter.com/SocialFeedMgr/status/797074713612877824\",\"@SMLabTO How might we update the toolkit's listing for Social Feed Manager? Info and docs available via https://t.co/j3zQ7kGwNn\",false,false\r\n", "\"2016-11-11T08:14:36Z\",\"797064988116611073\",\"SocialFeedMgr\",124,27,4,0,,\"http://twitter.com/SocialFeedMgr/status/797064988116611073\",\"RT @justin_littman: New on the @SocialFeedMgr blog: On retweets, replies, quotes & favorites: A guide for researchers. 
https://t.co/SjfIuLu…\",true,false\r\n", "\"2016-11-03T13:45:17Z\",\"794234002496487424\",\"SocialFeedMgr\",124,27,6,0,,\"http://twitter.com/SocialFeedMgr/status/794234002496487424\",\"RT @ianmilligan1: Used @SocialFeedMgr to collect almost 80,000 #CdnPoli tweets this morning. Great, intuitive UI. https://t.co/BuS3S7f6kf #…\",true,false\r\n", "\"2016-11-02T15:24:04Z\",\"793896478037114880\",\"SocialFeedMgr\",124,27,8,4,,\"http://twitter.com/SocialFeedMgr/status/793896478037114880\",\"Software doesn't live forever. How to get collections OUT of Social Feed Manager, a new blog post by @justin_littman https://t.co/CagQvSF7pJ\",false,false\r\n" ] } ], "source": [ "!head -n5 tweets.json | jq -r '[(.created_at | strptime(\"%A %B %d %T %z %Y\") | todate), .id_str, .user.screen_name, .user.followers_count, .user.friends_count, .retweet_count, .favorite_count, .in_reply_to_screen_name, \"http://twitter.com/\" + .user.screen_name + \"/status/\" + .id_str, (.text | gsub(\"\\n\";\" \")), has(\"retweeted_status\"), has(\"quoted_status\")] | @csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Header row\n", "The header row should be written to the output file with `>` before appending the CSV with `>>`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"created_at\",\"twitter_id\",\"screen_name\",\"followers_count\",\"friends_count\",\"retweet_count\",\"favorite_count\",\"in_reply_to_screen_name\",\"twitter_url\",\"text\",\"is_retweet\",\"is_quote\"\r\n" ] } ], "source": [ "!echo \"[]\" | jq -r '[\"created_at\",\"twitter_id\",\"screen_name\",\"followers_count\",\"friends_count\",\"retweet_count\",\"favorite_count\",\"in_reply_to_screen_name\",\"twitter_url\",\"text\",\"is_retweet\",\"is_quote\"] | @csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Splitting files\n", "Excel can load CSV files with over a million rows. 
However, for practical purposes, a much smaller number is recommended.\n", "\n", "The following uses the split command to split the CSV output into multiple files. Note that the flags accepted may be different in your environment.\n", "\n", "```\n", "cat tweets.json | jq -r '[.id_str, (.text | gsub(\"\\n\";\" \"))] | @csv' | split --lines=5 -d --additional-suffix=.csv - tweets\n", "ls *.csv\n", "tweets00.csv tweets01.csv tweets02.csv tweets03.csv tweets04.csv\n", "tweets05.csv tweets06.csv tweets07.csv tweets08.csv tweets09.csv\n", "```\n", "\n", "`--lines=5` sets the number of lines to include in each file.\n", "\n", "`--additional-suffix=.csv` sets the file extension.\n", "\n", "`tweets` is the base name for each file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tweet ids\n", "When outputting tweet ids, `.id_str` should be used instead of `.id`. See [Ed Summers's blog post](http://inkdroid.org/2016/11/30/overflow/) for an explanation." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "798895564335280129\r\n", "797074713612877824\r\n", "797064988116611073\r\n", "794234002496487424\r\n", "793896478037114880\r\n" ] } ], "source": [ "!head -n5 tweets.json | jq -r '.id_str'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 1 }