{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "using Gadfly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# *Parallel computing*\n", "\n", "The dataset being analysed is a big data set the which is collectively about **~70GB**. The first barrier before doing any analysis was to be able to read the data quickly and efficiently.\n", "\n", "We started off by experimenting with the ***DataFrames*** package the output dataframe had all the required properties and was easy to manipulate but the reading of the file took to long also the **DateTime** type was being read as string and extracting information such as time/date using string manipulation only made the process really slow. Hence we switched to ***CSV*** module, it had a better performance in reading the data but the output dataframe had the type of **NullableArrays**. This was not in our favour as most of the functions to manipulate the dataframes do not work with the collumns which had **Nullable** type. Hence we wrote a ***type*** which could read a dataframe using ***CSV.read()*** and the convert it into another DataFrame in which columns were tyepe stable and not the **NullableArray**. This gave us overall better performance than using only DataFrames.\n", "\n", "Now even loading all the data sequentially would have been very time consuming as each file is big enough to take a significant amount of time. This is where the parallel computing feature of **JULIA** comes in handy.We introduced the ***@everywhere*** macros and used ***SharedArrays*** and ***DistributedArrays (DArrays)*** which allows to map the function on different processors of the system and collect output from these parallel process.Hence it helps us to avoid following the sequential order and do it one by one hence we were able to load all the data in about as much time as it takes to load one dataset with maximum loading time.\n", "\n", "This feature also helped us to *parallelize* many function which used multiple files for the analysis and plotting of the graph. Also, in one of the analysis we were to cluster the *latitudes and longitudes* we used ***Clustering*** packge for this and used the inbuilt **K-means** function to implement the *k-means algorithm*, we were able to put all the files in parallel for the process of finding the k-means hence were able to save time there too by avoiding the sequential process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we demonstrate the parallel computing capabilitiesof ***JULIA*** by calculating the value of *pi* using ***Monte Carlo*** simulation with ***10 Billion*** sample points. The number of workers assigned was made to vary to measure the exectution time :" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: Method definition parallel_findpi(Any) in module Main at In[2]:3 overwritten at In[3]:3.\n" ] }, { "data": { "text/plain": [ "parallel_findpi (generic function with 1 method)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#this script is to estimate pi using monte Carlo simulation\n", "function parallel_findpi(n)\n", " inside = @parallel (+) for i = 1:n\n", " x, y = rand(2)\n", " x^2 + y^2 <= 1 ? 1 : 0\n", " end\n", " 4 * inside / n\n", "end" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code to produce the graph-->\n", "#time = [893.081954, 437.424609, 288.011919, 219.332380,\n", "# 111.764335,98.156851,63.623772,51.070448,43.202897,34.187504,28.481205, 26.948234,26.782099]\n", "\n", "#nprocs=[1,2,3,4,8,10,16,20,24,32,48,64,80]\n", "\n", "#plot(x=nprocs,y=time, Geom.point, Geom.line, Theme(default_color=color(\"orange\")),\n", "#Guide.xlabel(\"no. of processors\"), Guide.ylabel(\"Execution time\"),\n", "#Guide.title(\"Measurement of time in Parallel processing\"))\n", "\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# *Maps*\n", "\n", "The following section has been plotted using hexbins feature of ***Gadfly-Julia***.The plot consists of latitudes on Y-axis and longitudes on X-axis.These maps plots pickup and dropoff location of every taxi ride taken between Aug'13 to Dec'2015. Using this data we can see that we can infer about the map of NY city, without atually using any map features/API." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pickup for Green Taxi \n", "The Map below shows pickup for green taxi. The map has a small density in lower Manhattan region for green taxis pickup implying that not many green taxis are active in this region. Further most of the pickups are in Brooklyn and in upper Manhattan region. " ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " " ], "text/plain": [ "HTML{String}(\" \")" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(\"\"\" \"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dropoff for Green taxi\n", "The Map below shows dropoff for green taxi. The map shows that dropoff's for green taxis has been in the entire city. The map also shows the bridges connecting the islands.Also most of the dropoff's are concentrated within the city." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " " ], "text/plain": [ "HTML{String}(\" \")" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "HTML(\"\"\" \"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pickup for Yellow Taxis\n", "We can see that the number of yellow taxis having pickups in Manhataan region is very large. Also the number of pickups for Yellow taxis at JFK and Laguardia Airport is large." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " " ], "text/plain": [ "HTML{String}(\" \")" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(\"\"\" \"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dropoff for Yellow Taxis\n", "We can see that the number of yellow taxis having dropoff's in Manhataan region is very large. Also the number of dropoff's for Yellow taxis at JFK and Laguardia Airport is large. These facts suggest that mostly Yellow taxis operate at the two airports in Newyork. Also number of taxis operating out of Manhattan region is low." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " " ], "text/plain": [ "HTML{String}(\" \")" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(\"\"\" \"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pick-ups analysis\n", "In the following graph we anallyse the pick up trends for the yellow taxis, which has two vendors. Few intresting features crop up such as there is a clear dropping in the pick ups for both the vendors over time implying the degradation in popularity of yellow taxis over time. Also both the pick ups for both the vendors seems to co-related implying that number of taxis related depends mostly on external conditions such as year and temperature. For a given year the number of taxi pickups are more during Feb-April and then during October " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code to produce graph---> include(\"Graphs/yellow_taxi_monthly_analysis.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following graph we analyse the pickup trends for the green taxi which were introduced in August 2013 in NY. The analysis is over few months from 2013 to 2015. There are **two vendors** for the green taxi labelled as Vendor1 and Vendor2. As can be seen *Vendor2* has much higher growth rate in the early stages. This has led to more number of the pickups by vendor2 over time. Both the vendors tend to stabalise about an *equillibrium* value over time." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code to generate graph--> include(\"Graphs/green_taxi_monthly_analysis.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This graph shows the comparison between **vendor1 and vendor2 of yellow taxi**
\n", "over all days of week for the month of january 2015. It clearly projects the domination of vendor2 over vendor1.
\n", "Another fact which crops up is the frequencies of taxies rises as the weekdays past by
\n", "it is peaked on **saturday** and it drops sharply on ***sunday (as it is a holiday)***.
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code--> include(\"Graphs/vendorweekly.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following graph presents the comparison of the two vendors of the yellow taxi\n", " in per hour basis. Rush hours can be deduced from the graph it is from about **6:00 PM to 9:00 PM.** This can be justified as it is the time around which most of the people will be coming back from the office.
\n", "The global minimum at about **6:00 AM** in the morning gives an idea that very small fraction of the working class actually have to start travelling as early as 6 in the morning.
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code--> include(\"Graphs/vendorperhour.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following graph shows the comparison between vendor1 and vendor2 on the weekdays. As can be seen the peak hours for both the vendors is from 6:00PM to 9:00PM and it, decreases afterwards hitting a minima at early morning." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code--> include(\"Graphs/vendorperhourweekdays.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following graph shows the comparison between the two vendors of the yellow taxi on the *weekends*.
\n", "A clear feature which crops up which is different than the normal weekdays is during dayhours the frequency
\n", "of both the vendors become more or less stagnant.
\n", "Also contrary to weekdays we see a sharp rise in the taxi uses for both the vendors late in night
\n", "**(12:00 AM to 1:00 AM)** on weekends, this takes into account the fact that many people on weekends stay out till \n", "late in night in new york." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code-->include(\"Graphs/vendorperhourweekend.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we present the one to one comparison for the total number of pickups for the **yellow** taxis as a whole and **green** taxis as a whole. As can be seen from the graph, after the introduction of the green taxis in **Aug'13** they have steadily increased and saturated about an equillibrium value *(look at the gif providing the visuals of the spreading of the green taxis)*, which is way below the mean value of the yellow taxi pick ups which apart from some *seasonal trends* are more or less have decreased a bit due to introduction of Green taxis and saturated." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code--> include(\"Graphs/total_pickups.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following is the gif which shows the variation of dropoff locations for the green taxis over years" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract some of the prime locations (hotspots) around which most of the pickups and dropoffs\n", "are concentrated,
we clustered the *latitudes and longitudes* of all the pickup and drop off points,\n", "using the **K-means** algorithm (a type of unsupervised ML algorithm). \n", "
The following **Clusters** appeared :" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# *Airport Analysis*\n", "\n", "On the basis of the prime location extracted from the above clustering, we analyse the ***JFK AIRPORT*** (as it is the international airport), the following analysis is for the month of january,2015 :" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Airport pickup by days of week\n", "The following graph shows the pick ups and drop offs at JFK airport in new york on weekdays.
\n", "Few things that can be inferred from the graph is pickups are always significantly higher
\n", "than the dropoffs. Also the dropoffs and pickups seems correlated, as both follow the same trend.
\n", "As can be observed the peak for both dropoffs and pickups is on Friday and the minimum is on Tuesdays.
\n", "We can deduce that minimum number of flights to and from NY from JFK aiports is scheduled on Tuesdays.
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code-->include(\"Graphs/JFKportperday.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Airport Pickup by hour \n", "The following graph is the analysis of pick ups and drop offs with respect to daily hours (all days cumullitive).
\n", "Some points which can be deduced from the graph is most of the flights are scheduled to leave from JFK at near
\n", "about **6:00 AM and 3:00 PM**. Also this is also the time when significant amount of flights arrive at JFK airport.
\n", "*(as these points corresponds to local maximum)*.
\n", "Another feature which can be extracted is that late in night almost no flights are scheduled to take off from
\n", "JFK but there are relatively more flights landing on JFK airport at late night.
\n", "Also the decreasing drop offs at the airport after 3:00PM tells about the decreasing schedule of flights to
\n", "take off from JFK in nooon.
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code--> include(\"Graphs/JFKportperhour.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Influx at JFK Airport\n", "In the following graph we have analysed the net influx of the people from *JFK Airport*. The influx has been defined as the difference between the *pickups* and *dropoffs*, offcourse this measure can not be taken as the exact number of the people entering NY on any given month, but it gives an idea about how many people had entered in that particular month.\n", "\n", "An intersting feature that crops up is a shrap drop in the influx in the month of ***February 2014***, Investigation about this led us to the winter storm that hit New York in *Feb'14*. This storm has been reported to be the *7th snowiest on record* in New york city." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code--> include(\"Graphs/netfluxJFK_graph.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# * Card Vs. Cash*\n", "From the available data we plotted the percentage of users who paid with cash and who paid with card in every month from 2013 to 2015. It can be seen clearly in the graph that the percentage of people paying through cards is steadily increasing and percentage of people paying through cosh is steadily decreasing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Percentage of cash and card used for Yellow taxi's\n", "The use of Card payment has grown from around 54% to 63% and cash payments have reduced proportionally.This could be due ease of payments that cards offer as compared to cash." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code--> include(\"Graphs/mode_count_graph.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Number of tips given with cash and card payment\n", "It is intresting to note that people tend to tip much more when they pay by card.This could be attributed to the fact that online payment portal has tip option included and it is easier to transact with cards. Also many vendors may not be reporting the tips they have received in cash and this could justify for very few tips given while paying with cash." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code--> include(\"Graphs/tip_cash_card_graph.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# *Regression of cost with distance and time*\n", "\n", "We performed the regression on the cost, trip distance and the journey time at different time of day to come up with a regressive model to predict the cost of the journey on the basis of the distance of the journey anf the travel time. At different time in days the contribution of *time* in the model would be expected to vary hence we performed the regression after segregating the 24 hours in the buckets and performing regression on those buckets.\n", "\n", "Here we present the crossection of the regression model obtained for the month of *january 2015* in the tine slot of \n", "*9:00AM to 12:00PM*. To visualise the variation with journey distance, journey time hass been kept constant and to visualise the variation with journey time, journey distance has been kept constant. An interesting feature which appears is that for vendor1 distance seems to play a dominant part in deciding cost rather than time and for vendor2 time seems to play a dominant part rather than distance. It seems both the vendor use different charging model." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code--> include(\"Graphs/vendor_regression_plot_time.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#code--> include(\"Graphs/vendor_regression_plot_dist.jl\")\n", "HTML(\"\"\"\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*A 3D plot of the regression model obtained for vendor1 in the above mentioned bucket : *" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "HTML{String}(\"\\n\")" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(\"\"\"\n", "\n", "\"\"\")" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Julia 0.5.0", "language": "julia", "name": "julia-0.5" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.5.0" } }, "nbformat": 4, "nbformat_minor": 2 }