{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### The following code shows how to transform binary type information in spark dataframe into struct type\n", "- **This notebook is running with a pyspark kernel** " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-10-12T20:24:24.217088Z", "start_time": "2018-10-12T20:24:10.620922Z" } }, "outputs": [], "source": [ "geotwt_sdf=spark.read.parquet(\"BMC_UserGeoTwt/BMC_GeoTwt_Snappy*\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2018-10-10T19:40:08.118939Z", "start_time": "2018-10-10T19:39:59.358839Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "38820846\n", "+--------------------+---------+------------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+---------------+-------------+----------+--------------------+------+---------+----------------+-----------+-----------------+\n", "| ctime| uid| uname| lat| lng| profile_location| term| hashtag| fulltext|place_full_name| country|place_type| bounding_box|c_code|attribute| geoid| place_name|__index_level_0__|\n", "+--------------------+---------+------------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+---------------+-------------+----------+--------------------+------+---------+----------------+-----------+-----------------+\n", "|[54 75 65 20 53 6...|126371773|CompassUSAJobBoard|37.5407246|-77.4360481|[43 68 61 72 6C 6...|[6A 6F 69 6E 20 6...|[6A 6F 62 20 46 6...|Join the Crothall...| Richmond, VA|United States| city|[7B 75 27 74 79 7...| US| [7B 7D]|00f751614d8ce37b| Richmond| 0|\n", "|[54 75 65 20 53 6...|126371773|CompassUSAJobBoard|28.3252878|-81.5331286|[43 68 61 72 6C 6...|[72 65 63 6F 6D 6...|[6A 6F 62 20 43 6...|Can you recommend...|Celebration, FL|United States| city|[7B 75 27 74 79 7...| US| [7B 7D]|01bbe9ba4078361c|Celebration| 1|\n", "+--------------------+---------+------------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+---------------+-------------+----------+--------------------+------+---------+----------------+-----------+-----------------+\n", "only showing top 2 rows\n", "\n" ] } ], "source": [ "print geotwt_sdf.count()\n", "geotwt_sdf.show(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "We notice that there are several columns have been encoded into binary type. Well, it is pretty easy to cast byte array into string using `astype` function. However, it becomes tricky for colume with structured information,e.g., the `bounding_box` column. The `bounding_box` column contains a json string that is supposed to be read in as a struct type column. \n", " \n", "
\n", "We have to specify the struct type manually in order to let the reader recognize the information. \n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-10-12T20:24:38.277563Z", "start_time": "2018-10-12T20:24:38.271964Z" } }, "outputs": [], "source": [ "schema = StructType([\n", " StructField(\"type\", StringType(), True),\n", " StructField(\"coordinates\", ArrayType(ArrayType(ArrayType(FloatType()))),\n", " True),\n", "])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-10-12T20:24:50.559957Z", "start_time": "2018-10-12T20:24:49.623497Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+------------------+----------+-----------+--------------------+---------------+-------------+----------+------+----------------+----------+-----------------+--------------------+--------------------+--------------------+-------------+---------+---------+--------+--------+\n", "| uid| uname| lat| lng| fulltext|place_full_name| country|place_type|c_code| geoid|place_name|__index_level_0__| ctime_str| term_str| htag_str|plocation_str| ll_lat| ll_lng| ur_lat| ur_lng|\n", "+---------+------------------+----------+-----------+--------------------+---------------+-------------+----------+------+----------------+----------+-----------------+--------------------+--------------------+--------------------+-------------+---------+---------+--------+--------+\n", "|126371773|CompassUSAJobBoard|37.5407246|-77.4360481|Join the Crothall...| Richmond, VA|United States| city| US|00f751614d8ce37b| Richmond| 0|Tue Sep 25 04:52:...|join crothall tea...|job FacilitiesMgm...|Charlotte, NC|37.447044|-77.60104|37.61272|-77.3853|\n", "+---------+------------------+----------+-----------+--------------------+---------------+-------------+----------+------+----------------+----------+-----------------+--------------------+--------------------+--------------------+-------------+---------+---------+--------+--------+\n", "only showing top 1 row\n", "\n" ] } ], "source": [ "temp_sdf=geotwt_sdf.withColumn('ctime_str',geotwt_sdf.ctime.astype('string'))\n", "temp_sdf=temp_sdf.withColumn('term_str',temp_sdf.term.astype('string'))\n", "temp_sdf=temp_sdf.withColumn('bbox_str',temp_sdf.bounding_box.astype('string'))\n", "temp_sdf=temp_sdf.withColumn('coords',func.regexp_replace('bbox_str','u',\"\"))\n", "temp_sdf=temp_sdf.withColumn('bbox',func.from_json('coords',schema)) \n", "\n", "temp_sdf=temp_sdf.withColumn('htag_str',temp_sdf.hashtag.astype('string'))\n", "temp_sdf=temp_sdf.withColumn('plocation_str',temp_sdf.profile_location.astype('string'))\n", "\n", "temp_sdf = temp_sdf.withColumn(\n", " 'll_lat',\n", " temp_sdf.bbox.coordinates.getItem(0).getItem(0).getItem(1)).withColumn(\n", " 'll_lng',\n", " temp_sdf.bbox.coordinates.getItem(0).getItem(0).getItem(0))\n", "temp_sdf = temp_sdf.withColumn(\n", " 'ur_lat',\n", " temp_sdf.bbox.coordinates.getItem(0).getItem(2).getItem(1)).withColumn(\n", " 'ur_lng',\n", " temp_sdf.bbox.coordinates.getItem(0).getItem(2).getItem(0))\n", "temp_sdf = temp_sdf.drop('ctime', 'profile_location', 'term', 'hashtag',\n", " 'bounding_box', 'attribute', 'bbox', 'coords',\n", " 'bbox_str')\n", "\n", "temp_sdf.show(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" }, "toc": { "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }