Analysis of the taxi demand in Manhattan


A smart city is a city that uses technology and data to manage its assets and resources efficiently. The project idea was to use trip record data of taxis in New York City (NYC) to help taxi drivers move more passengers around the city. To increase mobility within a city, taxis must have the highest possible occupancy rate and, therefore, be close to high-demand areas when they don’t have passengers. The problem is that many taxis drive around the city empty for longer than necessary because their drivers are unaware of the “hot” regions. This situation impacts not only taxi drivers, but also passengers and the community as a whole. Taxis running empty means people wait longer than necessary to hail a cab. It also means that, since most taxis still have a combustion engine, more pollution is generated without the benefit of providing mobility. The question this project set out to answer was: which zones of Manhattan should an idle taxi drive to during the month of January?


Many tools from the Data Science track were used in this analysis, and the most important ones were related to merging, filtering, grouping, and plotting. The merging was done with the merge command of the pandas module. Because the project aimed at helping taxi drivers during their working hours, it was important to plot a graph that reflected the location of each zone, which makes it easier to interpret. For that, external data on the coordinates of each zone had to be combined with the trip record data. The external data came from the NYC Taxi & Limousine Commission, and it had to be converted from shapefile to CSV, which was done with an online converter. By merging both data sets, the longitude and the latitude of the corresponding NYC zone were added to each data point. Moreover, because this project focused on the Manhattan area, a Boolean variable was used to keep only the data points for the selected region. The Boolean variable, named is_manhattan, was true or false according to whether the value in the column borough was equal to Manhattan. The rows were then filtered to keep only those where it was true.

The grouping was needed to obtain the total number of pickups for each zone of Manhattan. First, a new column containing solely the integer one was added to the data set. Then the rows were grouped by the latitude and longitude of each data point, and the ones in this column were summed within each group.

Figure 1: Partial view of the table generated after the grouping
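The merge, filter, and group steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two small tables stand in for the real trip records and the converted zone file, and the column names (PULocationID, LocationID, latitude, longitude, pickups) as well as the coordinate values are assumptions made for the example. Only is_manhattan and borough are names taken from the text.

```python
import pandas as pd

# Tiny stand-in for the trip records (the real file has over a million rows).
# The column name PULocationID is an assumption for this sketch.
trips = pd.DataFrame({"PULocationID": [236, 236, 237, 163, 1, 236]})

# Stand-in for the zone table converted from shapefile to CSV.
# Column names and coordinate values are made up for illustration.
zones = pd.DataFrame({
    "LocationID": [236, 237, 163, 1],
    "borough": ["Manhattan", "Manhattan", "Manhattan", "EWR"],
    "latitude": [40.78, 40.77, 40.76, 40.69],
    "longitude": [-73.96, -73.97, -73.98, -74.17],
})

# Merge: attach the borough and coordinates of the pickup zone to every trip.
merged = trips.merge(zones, left_on="PULocationID", right_on="LocationID", how="left")

# Filter: Boolean mask that is True only for Manhattan pickups.
is_manhattan = merged["borough"] == "Manhattan"
manhattan = merged[is_manhattan].copy()

# Group: add a column of ones, then sum it per (latitude, longitude) pair.
manhattan["pickups"] = 1
pickups_per_zone = manhattan.groupby(
    ["latitude", "longitude"], as_index=False
)["pickups"].sum()

print(pickups_per_zone)
```

Counting rows with groupby(...).size() would give the same totals without the helper column; the column of ones is kept here only because it mirrors the approach described in the text.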

Finally, the plotting was done with the scatter command of the matplotlib module. The graph uses three variables: latitude, longitude, and number of pickups. The latitude was placed on the x-axis and the longitude on the y-axis, while the number of pickups was represented by both the size and the color of each circle. The circles were also customized in terms of transparency and border color by setting the parameters alpha and edgecolors.
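A scatter plot of this kind might look like the sketch below. The data frame is a made-up stand-in for the grouped table, and the scaling factor for the circle size, the color map, and the specific alpha and edgecolors values are assumptions chosen for the example; the original write-up only states that these parameters were set.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the grouped table; all values are hypothetical.
pickups_per_zone = pd.DataFrame({
    "latitude": [40.78, 40.77, 40.76],
    "longitude": [-73.96, -73.97, -73.98],
    "pickups": [30000, 28000, 500],
})

fig, ax = plt.subplots()

# Latitude on the x-axis and longitude on the y-axis, as described in the text.
points = ax.scatter(
    pickups_per_zone["latitude"],
    pickups_per_zone["longitude"],
    s=pickups_per_zone["pickups"] / 100,  # circle size follows pickups (scale is an assumption)
    c=pickups_per_zone["pickups"],        # circle color follows pickups too
    cmap="viridis",                       # assumed color map
    alpha=0.6,                            # transparency
    edgecolors="black",                   # border color
)

fig.colorbar(points, ax=ax, label="Number of pickups")
ax.set_xlabel("Latitude")
ax.set_ylabel("Longitude")
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure. Dividing the pickup counts before passing them to s keeps the largest circles from covering their neighbors.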

The process had several steps. First, the CSV file with the records of taxi trips in NYC in January was imported. Then, the latitude and the longitude of each zone were imported and merged with the original data set. After that, the data was filtered to keep only the pickup information for the region of Manhattan. Next, the data points belonging to the same zone within Manhattan were grouped together to obtain the number of pickups per zone. Finally, a scatter graph, which can be found below, was plotted to better visualize the areas with high taxi demand. To make the graph more user-friendly, both the size and the color of the circles were tied to the number of pickups. Moreover, the position of each circle was directly related to the location of the zone within Manhattan.


Figure 2: Scatter graph of the number of pickups in Manhattan in January 2018
Figure 3: Taxi Zones of Manhattan

The graph represents the number of pickups in each zone of Manhattan during the month of January 2018. The location of the points is related to the location of the zones within the borough. As explained above, both the color and the size of each circle represent the number of pickups. This was done to make the graph easier to read, as it is a tool to be used while the driver is working. When analyzing the graph and the map of the taxi zones of Manhattan, one can see that the zones with the highest demand are 236, 237 and 163. On the other hand, the zones with the lowest demand are 127, 128 and 153. From the graph and the map above, a taxi driver can see which regions have higher taxi demand within the Manhattan area during the month of January. If the driver is without a passenger, he/she can decide which direction to drive in while waiting to be hailed. For instance, if the driver is currently in zone 41, he/she should drive to the southwest part of the neighborhood. On the other hand, if the driver is in zone 237, he/she should not move to another zone, as he/she is already in a “hot” zone.


To increase the reliability of the model developed in this project, the trip record data should be processed further, and other graphs should be plotted. Firstly, the data should be filtered by day of the week. The areas with high demand on a Sunday will most likely not be the same as the areas with high demand on a Wednesday. Secondly, the data should also be filtered by time of day. A taxi driver working at 6 am on a Saturday will face a different demand pattern than a taxi driver working on the same day at 10 pm. Moreover, adding external data would also improve the predictive power of the historical data. For example, linking the demand to certain big events that happened in January 2018 can help a taxi driver anticipate how the same event in 2019 will alter the demand for taxis on a given day. Those additions would make the tool more useful and, in turn, increase mobility within Manhattan.
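The day-of-week and time-of-day breakdown suggested above was not part of the project, but it could be approached along these lines. The sketch uses a handful of invented trips, and the column names pickup_datetime and PULocationID are assumptions rather than the project's actual schema.

```python
import pandas as pd

# A few invented trips; 2018-01-06 was a Saturday and 2018-01-10 a Wednesday.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2018-01-06 06:15",
        "2018-01-06 22:05",
        "2018-01-06 22:40",
        "2018-01-10 08:30",
    ]),
    "PULocationID": [236, 163, 163, 237],
})

# Derive the weekday name and the hour of day from the pickup timestamp.
trips["weekday"] = trips["pickup_datetime"].dt.day_name()
trips["hour"] = trips["pickup_datetime"].dt.hour

# Count pickups per weekday, hour, and zone.
demand = (
    trips.groupby(["weekday", "hour", "PULocationID"])
    .size()
    .reset_index(name="pickups")
)

# Example query: demand per zone on Saturdays at 10 pm.
saturday_10pm = demand[(demand["weekday"] == "Saturday") & (demand["hour"] == 22)]
print(saturday_10pm)
```

Each weekday and hour combination could then be plotted with the same kind of scatter graph, giving the driver one picture per shift instead of a single picture for the whole month.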


The project had three main challenges. The first was knowing which functions and methods to use to achieve a certain outcome. The second was ensuring that the code delivered what was expected. The third was allocating time to the project without compromising my studies.

To overcome the difficulty of identifying the functions and methods that needed to be used, I focused on what I wanted to achieve in each step and researched ways of accomplishing it online. For instance, I wanted my graph to display three variables, so I searched online and found a post on Stack Overflow that covered exactly that topic. Regarding the second challenge, the solution was to create tables and display them every time something was altered in the data set. That way, it was possible to see whether the expected outcome was being reached or whether something needed to be changed in the code. Lastly, to dedicate time to the project without compromising my studies, I made the effort of working on it a few hours every week. By doing so, I was able to dedicate enough time to my academic tasks while also progressing with the project.

I believe my biggest personal success with this project was delivering insights from a data set with over one million points using the tools that I learned during my track. Although DataCamp is very helpful in explaining the tools and how to use them, one cannot assess how much he/she has learned without putting theory into practice. By working on this project, I was able to see how much I have learned and how I can leverage other sources of information, such as coding forums and blogs, to get the outcome that I want. With the knowledge that I gained in the previous months, I was able to manipulate the data in a way that gave me the information I was looking for: where in Manhattan a taxi driver has the highest chances of picking up a passenger during the month of January.

Andréia Alencar — Data Science Track TechLabs Esade

