{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The following code shows how to transform binary type information in spark dataframe into struct type\n",
"- **This notebook is running with a pyspark kernel** "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2018-10-12T20:24:24.217088Z",
"start_time": "2018-10-12T20:24:10.620922Z"
}
},
"outputs": [],
"source": [
"geotwt_sdf=spark.read.parquet(\"BMC_UserGeoTwt/BMC_GeoTwt_Snappy*\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2018-10-10T19:40:08.118939Z",
"start_time": "2018-10-10T19:39:59.358839Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"38820846\n",
"+--------------------+---------+------------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+---------------+-------------+----------+--------------------+------+---------+----------------+-----------+-----------------+\n",
"| ctime| uid| uname| lat| lng| profile_location| term| hashtag| fulltext|place_full_name| country|place_type| bounding_box|c_code|attribute| geoid| place_name|__index_level_0__|\n",
"+--------------------+---------+------------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+---------------+-------------+----------+--------------------+------+---------+----------------+-----------+-----------------+\n",
"|[54 75 65 20 53 6...|126371773|CompassUSAJobBoard|37.5407246|-77.4360481|[43 68 61 72 6C 6...|[6A 6F 69 6E 20 6...|[6A 6F 62 20 46 6...|Join the Crothall...| Richmond, VA|United States| city|[7B 75 27 74 79 7...| US| [7B 7D]|00f751614d8ce37b| Richmond| 0|\n",
"|[54 75 65 20 53 6...|126371773|CompassUSAJobBoard|28.3252878|-81.5331286|[43 68 61 72 6C 6...|[72 65 63 6F 6D 6...|[6A 6F 62 20 43 6...|Can you recommend...|Celebration, FL|United States| city|[7B 75 27 74 79 7...| US| [7B 7D]|01bbe9ba4078361c|Celebration| 1|\n",
"+--------------------+---------+------------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+---------------+-------------+----------+--------------------+------+---------+----------------+-----------+-----------------+\n",
"only showing top 2 rows\n",
"\n"
]
}
],
"source": [
"print geotwt_sdf.count()\n",
"geotwt_sdf.show(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
" \n",
"We notice that there are several columns have been encoded into binary type. Well, it is pretty easy to cast byte array into string using `astype` function. However, it becomes tricky for colume with structured information,e.g., the `bounding_box` column. The `bounding_box` column contains a json string that is supposed to be read in as a struct type column. \n",
" \n",
"
\n",
"We have to specify the struct type manually in order to let the reader recognize the information. \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2018-10-12T20:24:38.277563Z",
"start_time": "2018-10-12T20:24:38.271964Z"
}
},
"outputs": [],
"source": [
"schema = StructType([\n",
" StructField(\"type\", StringType(), True),\n",
" StructField(\"coordinates\", ArrayType(ArrayType(ArrayType(FloatType()))),\n",
" True),\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2018-10-12T20:24:50.559957Z",
"start_time": "2018-10-12T20:24:49.623497Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+------------------+----------+-----------+--------------------+---------------+-------------+----------+------+----------------+----------+-----------------+--------------------+--------------------+--------------------+-------------+---------+---------+--------+--------+\n",
"| uid| uname| lat| lng| fulltext|place_full_name| country|place_type|c_code| geoid|place_name|__index_level_0__| ctime_str| term_str| htag_str|plocation_str| ll_lat| ll_lng| ur_lat| ur_lng|\n",
"+---------+------------------+----------+-----------+--------------------+---------------+-------------+----------+------+----------------+----------+-----------------+--------------------+--------------------+--------------------+-------------+---------+---------+--------+--------+\n",
"|126371773|CompassUSAJobBoard|37.5407246|-77.4360481|Join the Crothall...| Richmond, VA|United States| city| US|00f751614d8ce37b| Richmond| 0|Tue Sep 25 04:52:...|join crothall tea...|job FacilitiesMgm...|Charlotte, NC|37.447044|-77.60104|37.61272|-77.3853|\n",
"+---------+------------------+----------+-----------+--------------------+---------------+-------------+----------+------+----------------+----------+-----------------+--------------------+--------------------+--------------------+-------------+---------+---------+--------+--------+\n",
"only showing top 1 row\n",
"\n"
]
}
],
"source": [
"temp_sdf=geotwt_sdf.withColumn('ctime_str',geotwt_sdf.ctime.astype('string'))\n",
"temp_sdf=temp_sdf.withColumn('term_str',temp_sdf.term.astype('string'))\n",
"temp_sdf=temp_sdf.withColumn('bbox_str',temp_sdf.bounding_box.astype('string'))\n",
"temp_sdf=temp_sdf.withColumn('coords',func.regexp_replace('bbox_str','u',\"\"))\n",
"temp_sdf=temp_sdf.withColumn('bbox',func.from_json('coords',schema)) \n",
"\n",
"temp_sdf=temp_sdf.withColumn('htag_str',temp_sdf.hashtag.astype('string'))\n",
"temp_sdf=temp_sdf.withColumn('plocation_str',temp_sdf.profile_location.astype('string'))\n",
"\n",
"temp_sdf = temp_sdf.withColumn(\n",
" 'll_lat',\n",
" temp_sdf.bbox.coordinates.getItem(0).getItem(0).getItem(1)).withColumn(\n",
" 'll_lng',\n",
" temp_sdf.bbox.coordinates.getItem(0).getItem(0).getItem(0))\n",
"temp_sdf = temp_sdf.withColumn(\n",
" 'ur_lat',\n",
" temp_sdf.bbox.coordinates.getItem(0).getItem(2).getItem(1)).withColumn(\n",
" 'ur_lng',\n",
" temp_sdf.bbox.coordinates.getItem(0).getItem(2).getItem(0))\n",
"temp_sdf = temp_sdf.drop('ctime', 'profile_location', 'term', 'hashtag',\n",
" 'bounding_box', 'attribute', 'bbox', 'coords',\n",
" 'bbox_str')\n",
"\n",
"temp_sdf.show(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
},
"toc": {
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"toc_cell": false,
"toc_position": {},
"toc_section_display": "block",
"toc_window_display": false
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}