{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Titanic\n",
"\n",
"Learning objective: build a model that predicts survivors from Titanic passenger information."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import findspark\n",
"findspark.init()\n",
"import pyspark\n",
"sc = pyspark.SparkContext()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql.session import SparkSession\n",
"spark = SparkSession(sc)\n",
"titanic = spark.read.option(\"header\",\"true\").csv(\"/Users/ryanshin/Downloads/train.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+\n",
"|PassengerId|Survived|Pclass| Name| Sex| Age|SibSp|Parch| Ticket| Fare|Cabin|Embarked|\n",
"+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+\n",
"| 1| 0| 3|Braund, Mr. Owen ...| male| 22| 1| 0| A/5 21171| 7.25| null| S|\n",
"| 2| 1| 1|Cumings, Mrs. Joh...|female| 38| 1| 0| PC 17599|71.2833| C85| C|\n",
"| 3| 1| 3|Heikkinen, Miss. ...|female| 26| 0| 0|STON/O2. 3101282| 7.925| null| S|\n",
"| 4| 1| 1|Futrelle, Mrs. Ja...|female| 35| 1| 0| 113803| 53.1| C123| S|\n",
"| 5| 0| 3|Allen, Mr. Willia...| male| 35| 0| 0| 373450| 8.05| null| S|\n",
"| 6| 0| 3| Moran, Mr. James| male|null| 0| 0| 330877| 8.4583| null| Q|\n",
"| 7| 0| 1|McCarthy, Mr. Tim...| male| 54| 0| 0| 17463|51.8625| E46| S|\n",
"| 8| 0| 3|Palsson, Master. ...| male| 2| 3| 1| 349909| 21.075| null| S|\n",
"| 9| 1| 3|Johnson, Mrs. Osc...|female| 27| 0| 2| 347742|11.1333| null| S|\n",
"| 10| 1| 2|Nasser, Mrs. Nich...|female| 14| 1| 0| 237736|30.0708| null| C|\n",
"| 11| 1| 3|Sandstrom, Miss. ...|female| 4| 1| 1| PP 9549| 16.7| G6| S|\n",
"| 12| 1| 1|Bonnell, Miss. El...|female| 58| 0| 0| 113783| 26.55| C103| S|\n",
"| 13| 0| 3|Saundercock, Mr. ...| male| 20| 0| 0| A/5. 2151| 8.05| null| S|\n",
"| 14| 0| 3|Andersson, Mr. An...| male| 39| 1| 5| 347082| 31.275| null| S|\n",
"| 15| 0| 3|Vestrom, Miss. Hu...|female| 14| 0| 0| 350406| 7.8542| null| S|\n",
"| 16| 1| 2|Hewlett, Mrs. (Ma...|female| 55| 0| 0| 248706| 16| null| S|\n",
"| 17| 0| 3|Rice, Master. Eugene| male| 2| 4| 1| 382652| 29.125| null| Q|\n",
"| 18| 1| 2|Williams, Mr. Cha...| male|null| 0| 0| 244373| 13| null| S|\n",
"| 19| 0| 3|Vander Planke, Mr...|female| 31| 1| 0| 345763| 18| null| S|\n",
"| 20| 1| 3|Masselmani, Mrs. ...|female|null| 0| 0| 2649| 7.225| null| C|\n",
"+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"titanic.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Fields\n",
"* Source: https://www.kaggle.com/c/titanic/data\n",
"* Survived: 1 if the passenger survived, 0 if the passenger died\n",
"* SibSp: number of siblings or spouses aboard\n",
"* Parch: number of parents or children aboard"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------+-------------+------------------------------------+\n",
"|count(PassengerId)|sum(Survived)|(sum(Survived) / count(PassengerId))|\n",
"+------------------+-------------+------------------------------------+\n",
"| 891| 342.0| 0.3838383838383838|\n",
"+------------------+-------------+------------------------------------+\n",
"\n"
]
}
],
"source": [
"# Import under a namespace so Spark's sum/count do not shadow Python's built-ins\n",
"from pyspark.sql import functions as F\n",
"titanic.select(F.count(\"PassengerId\"), F.sum(\"Survived\"), F.sum(\"Survived\") / F.count(\"PassengerId\")).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Survival Rate\n",
"* Number of survivors (`sum(\"Survived\")`) / total number of passengers (`count(\"PassengerId\")`) = 342 / 891 ≈ 0.384"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import IFrame\n",
"IFrame('https://www.zepl.com/viewer/notebooks/bm90ZTovL1NEUkx1cmtlci9jZTY4ZWU0ZDg2ODM0ZDNmOWExMTA0ZTdiMGJkYzI0ZS9ub3RlLmpzb24', width='100%', height=600)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}