The taxi market in NYC vs SF: inferring trip purpose using LDA

Was the taxi market really that much more developed in New York than in San Francisco? I used Latent Dirichlet Allocation with GPS data to infer taxi trip purposes in each city, to see just how different the two markets were. 

Back in 2014, when Uber and Lyft were taking off, people in San Francisco were much more excited about them than were people in New York. New Yorkers told me, “yeah, Uber might be better than a taxi, but it’s not really that big a deal.”

Why the difference? Maybe it’s just that San Franciscans are techno-optimists, especially compared to critical New Yorkers. But I also suspected it had something to do with differences between the two cities’ taxi markets. 

Maybe San Francisco’s taxis were just worse

Pre-2014, San Francisco’s taxi system was notoriously dysfunctional, while New York’s system more or less worked. In San Francisco, it was hard to get a cab.  It seemed like people only took cabs to or from the airport or after a night of drinking. In New York, taxis have long been a more ubiquitous and integral part of the city’s transport network. Compared to such different baselines, Uber and Lyft would look a lot more transformational in San Francisco than in New York.

In other words, maybe ride-hailing allowed San Franciscans to experience the kind of convenience New Yorkers already had.

Unfortunately, I didn’t have data for how people were using Uber and Lyft. But I did have taxi data from 2012-2013, so I could test my baseline assumption: before Uber, people in New York used taxis for a greater range of purposes than did people in San Francisco.

The timing of trips suggests differences in taxi usage

Simply looking at the number of taxi trips per hour over a typical week, we can see that taxi demand in SF is spikier than in NYC. In SF, the number of trips is much higher on Friday and Saturday nights than other times of the week, whereas in NYC it’s more even. This suggests SF taxi usage is more concentrated in social, “going out” trips, while NYC taxi usage might be more broad.
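
For anyone who wants to reproduce this kind of weekly profile, here is a minimal sketch. It is not the code behind my charts: the sample pickups are made up, the function names are hypothetical, and the peak-to-mean ratio is just one simple way I could imagine quantifying "spikiness."

```python
from collections import Counter
from datetime import datetime

def hour_of_week(ts):
    """Return 0..167, where 0 is Monday 12am-1am."""
    return ts.weekday() * 24 + ts.hour

def weekly_profile(pickup_times):
    """Share of all trips falling in each of the 168 hours of the week."""
    counts = Counter(hour_of_week(ts) for ts in pickup_times)
    total = sum(counts.values())
    return [counts.get(h, 0) / total for h in range(168)]

# Made-up pickups, for illustration only
pickups = [
    datetime(2013, 10, 4, 23, 15),  # Friday 11pm
    datetime(2013, 10, 4, 23, 40),  # Friday 11pm
    datetime(2013, 10, 5, 23, 5),   # Saturday 11pm
    datetime(2013, 10, 7, 8, 30),   # Monday 8am
]

profile = weekly_profile(pickups)
peak_hour = max(range(168), key=lambda h: profile[h])
# A perfectly even week has a share of 1/168 in every hour, so this
# ratio is 1.0 for flat demand and grows as demand gets spikier.
peak_to_mean = max(profile) * 168
```

Run on real pickup timestamps for each city, the two profiles can be plotted on the same axes and the ratios compared directly.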


What can the spatial patterns of trips tell us?

I wanted to see if the spatial pattern of trips added to the story. To do that, I borrowed a machine learning technique called Latent Dirichlet Allocation (LDA) to identify characteristic “types” of trips based on their origin and destination locations.

I used timestamped taxi GPS records provided by the SFMTA and the New York TLC.  The data are from 2012 and 2013, a time when Uber was only beginning to impact the taxi market.

The San Francisco dataset consisted of complete trip records (more than 700,000 total) from one of the city’s larger taxi companies for October 2012 and mid-July through October 2013.

A technique to infer trip purpose: LDA

LDA is a topic modeling technique originally developed to classify text. The idea is that each document in a collection contains a mixture of latent (unobserved) topics, and these underlying topics give rise to a predictable vocabulary. Thus the observed pattern of word frequency in the document collection can be used to infer the topics it contains.

For example, a document containing high frequencies of the words “vulnerable,” “flood,” and “damage” might be about disasters, while high occurrence of the words “jobs,” “conservative,” and “voters” likely signals a topic of politics.

When LDA is applied to taxi trip patterns, each “word” is an origin-destination pair and each “document” is the set of trips occurring within a one-hour period. The inferred “topics” can be thought of as types of trips that might correspond to trip purposes or activities. In this case, trip types are defined by the paired locations of origin and destination.

So in the time period Monday 8-9am, for example, we might see a lot of trips from the Upper East Side to the Financial District. We could thus infer the trip type is “going to work.”
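
To make the word/document analogy concrete, here is a rough sketch of how trips could be turned into LDA input. The zone labels, the sample trips, and the function name are made up for illustration, and the "origin>destination" string format is just one convenient encoding.

```python
from collections import defaultdict
from datetime import datetime

def build_documents(trips):
    """Group trips into one-hour "documents" of origin-destination "words".

    trips: iterable of (pickup_time, origin_zone, destination_zone).
    Returns {period_start: [word, ...]}, one key per calendar hour.
    """
    docs = defaultdict(list)
    for pickup_time, origin, destination in trips:
        # Truncate the timestamp to the start of its hour
        period = pickup_time.replace(minute=0, second=0, microsecond=0)
        docs[period].append(f"{origin}>{destination}")
    return dict(docs)

# Made-up trips, for illustration only
trips = [
    (datetime(2013, 10, 7, 8, 5), "UES", "FiDi"),
    (datetime(2013, 10, 7, 8, 40), "UES", "FiDi"),
    (datetime(2013, 10, 7, 8, 55), "UWS", "Midtown"),
    (datetime(2013, 10, 7, 23, 10), "LES", "Williamsburg"),
]

documents = build_documents(trips)
```

Here the Monday 8-9am document contains "UES>FiDi" twice, which is exactly the kind of repeated word a topic model picks up on.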

New York has pretty diverse taxi trip types

From the LDA results, I calculated the expected frequency of each type of trip during each hour in a typical week in October, shown in the chart below. Each color represents a different trip type. The height of the bar represents the expected frequency of trips for the given one-hour period, with frequency in terms of the % of all trips that week. Keep in mind that the LDA model defines trip type in terms of the pattern of origin and destination. Thus the color of the bars represents the spatial patterns, while the size of the bars represents how many trips happen when.

Expected frequency of trips by type for each hour in a typical week. (Frequency is in terms of % of all trips in the week)
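
One plausible way to compute a table like this is sketched below. I am assuming that the expected number of trips of a given type in a one-hour period is that period's trip count times the model's mixture weight for the type; the function name and the sample numbers are made up for illustration.

```python
from datetime import datetime

def expected_weekly_frequency(doc_topics, doc_sizes, k):
    """Return a 168 x k table of expected trip shares.

    doc_topics: {period_start: [theta_1, ..., theta_k]}, the per-period type mixture
    doc_sizes:  {period_start: number of trips in that period}
    Each entry is a share of all trips, so the whole table sums to 1.
    """
    table = [[0.0] * k for _ in range(168)]
    for period, theta in doc_topics.items():
        h = period.weekday() * 24 + period.hour  # 0 = Monday 12am-1am
        for t in range(k):
            table[h][t] += doc_sizes[period] * theta[t]
    total = sum(doc_sizes[p] for p in doc_topics)
    return [[v / total for v in row] for row in table]

# Made-up mixtures for two Monday mornings and one Friday night
doc_topics = {
    datetime(2013, 10, 7, 8): [0.8, 0.2],
    datetime(2013, 10, 14, 8): [0.6, 0.4],
    datetime(2013, 10, 4, 23): [0.1, 0.9],
}
doc_sizes = {
    datetime(2013, 10, 7, 8): 100,
    datetime(2013, 10, 14, 8): 100,
    datetime(2013, 10, 4, 23): 200,
}

table = expected_weekly_frequency(doc_topics, doc_sizes, k=2)
```

Periods that fall in the same hour of the week (the two Monday mornings here) are pooled, which is what turns a month of data into a single "typical" week.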

The following summarizes my qualitative descriptions of the inferred trip types for New York.

“Morning commute” (light blue)

This type of trip has origins concentrated in the Financial District and around Central Park. Destinations are concentrated in midtown, as well as LaGuardia. It is prominent during morning and midday hours, with lower occurrence on Sundays.

“Evening commute” (yellow)

The large spike in trips during the evening hours is mostly drawn from this trip type, which clearly represents the evening commute. Trip origins are highly concentrated in lower and mid Manhattan; destinations are more dispersed.

“Night/social” (green)

This clearly represents late night, social trips. Trips are very likely to originate in a handful of locations in the Lower East Side and Williamsburg, and they are likely to end in the same locations as well as elsewhere in Brooklyn and Manhattan.

“Airport/tourist” (orange)

Origins and destinations are across Manhattan and at the LaGuardia and JFK airports. The highest concentrations are in midtown, where visitors to the city are likely to stay.

“Other” (dark blue)

This type rarely applies, and its trips appear to be relatively dispersed across the city. It is probably a catch-all type representing origin-destination pairs that are unlikely under the other types.

San Francisco’s trip types are more limited

In comparison, we can identify fewer distinct trip types for San Francisco. The chart below shows expected trip frequency by hour and day for a typical week. A summary of types follows.

Expected frequency of trips by type for each hour in a typical week

“Work” (light blue)

Describes trips during most of the day, with origins concentrated downtown and destinations downtown and extending down the peninsula along highway 101.

“Night/social” (yellow)

Applies to evening hours and especially Friday and Saturday nights. Origins are highly concentrated in a few locations: Tenderloin, SOMA, the Mission and the Marina. Destinations are likely to be those same locations but also other residential neighborhoods, indicating this type largely represents trips between social destinations and trips home.

“Tourist/airport” (green)

Appears to be mainly airport trips, especially from Union Square. Most likely in the very early morning.

“Late night/airport” (orange)

Most prevalent on weekends around 2am-5am. Trips are likely to begin or end in late night spots like SOMA, the Tenderloin, and the Marina, along with the airport.


“Recreation/other”

Appears to be a catch-all category. It exhibits a more dispersed pattern, with more leisure destinations (such as the Berkeley Marina), and covers destinations in the East Bay and the Peninsula that are unlikely in other trip types.

The charts below show a side-by-side comparison of expected frequencies by trip type for each hour in each city.

Expected trip frequency by trip type for each hour in New York vs San Francisco

Not only are NYC taxi trips more evenly distributed throughout the day, the spatial pattern indicates distinct morning and evening commutes. In SF, in contrast, we find no separate topic for the morning commute. This is probably not because five topics is too few, since the “Recreation/other” type already seems to be an “extra” topic.

So maybe San Francisco has only four, not five, distinct trip types. There seems to be some overlap between the Tourist/airport and Late night/airport types (although there may be differences in the directionality of trips–I’d have to look into that more.) In comparison, it was possible to clearly describe each of the five New York trip types. This supports my hypothesis that New York taxi usage is more broad-based than in SF.

Another difference between the two cities is that San Francisco’s “Night/social” trip type begins earlier than New York’s. In San Francisco, this type applies to trips starting around 7pm, at the same time as the evening commute, compared with 10 or 11pm in New York.

Obviously, trip types don’t perfectly describe actual trip purposes. The commute and work trip types have a higher probability on weekends than commuting alone would explain, which suggests to me they may also represent shopping trips.

Also, New York is simply a larger area than SF, with more possible origins and destinations. It shouldn’t be surprising that we find more distinct trip types.

The take-away: taxi usage in New York was more diverse

Even with these limitations, the results are evidence that taxis serve a richer range of uses in New York than in San Francisco. That’s not too surprising: New York is more diverse in lots of ways.

In future work, I might explore patterns at a finer spatial resolution, especially in New York. This same methodology could be applied to Uber trip data and would allow a longitudinal analysis and a comparison with taxi trips. We could ask, have Uber and Lyft changed the use cases for taxis, in either city?

Methodology detail

The New York dataset included all trips in the city. I chose to use only October 2013, which totals more than 15 million trips, of which I randomly sampled 10%. (I chose October because it has no major holidays and typically few weather extremes, and it can be compared with the SF dataset. The SF data includes summer months and so might reflect more tourist activity.)

Origin and destination pairs were defined using a 1000m grid. I also considered using a smaller grid, census tracts, or traffic analysis zones. But for San Francisco there were too few trips in each unit, and while finer units worked better for denser New York, I wanted consistency between cities. (In the future I might refine the model using smaller spatial units.)
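
As a rough illustration of the gridding step, the sketch below snaps GPS points to 1000m cells and builds an origin-destination label from them. The reference corner, the sample coordinates, and the function names are made up, and the flat-earth (equirectangular) approximation is my assumption here; a projected coordinate system would be the more careful choice.

```python
import math

CELL_M = 1000.0                # grid size used in the text
EARTH_RADIUS_M = 6_371_000.0   # mean Earth radius

def grid_cell(lat, lon, lat0, lon0):
    """Return (col, row) of the 1000 m cell containing (lat, lon).

    (lat0, lon0) is a reference corner south-west of the study area.
    Uses an equirectangular approximation, which is adequate at city scale.
    """
    x = math.radians(lon - lon0) * EARTH_RADIUS_M * math.cos(math.radians(lat0))
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return math.floor(x / CELL_M), math.floor(y / CELL_M)

def od_word(origin, destination, lat0, lon0):
    """Label a trip by its origin cell and destination cell."""
    o_col, o_row = grid_cell(origin[0], origin[1], lat0, lon0)
    d_col, d_row = grid_cell(destination[0], destination[1], lat0, lon0)
    return f"{o_col}_{o_row}>{d_col}_{d_row}"

# Hypothetical reference corner and trip endpoints (lat, lon)
LAT0, LON0 = 40.70, -74.02
word = od_word((40.7135, -73.99), (40.70, -74.02), LAT0, LON0)
```

The origin here lies about 2.5km east and 1.5km north of the corner, so the trip is labeled "2_1>0_0". Shrinking CELL_M gives finer cells at the cost of fewer trips per origin-destination pair.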

I used the LDA implementation in the Gensim package developed by Radim Rehurek. To process the data, I grouped trips into one-hour periods, obtaining 744 periods of roughly 1000 OD pairs each for New York and 3696 periods of about 100 pairs each for San Francisco.
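
A minimal sketch of what fitting such a model with Gensim can look like is below. It is not my original script: the toy documents, the function name, and the settings for passes and the random seed are placeholders chosen for illustration.

```python
def fit_trip_types(documents, k=5, alpha="symmetric", seed=0):
    """Fit LDA to a list of documents, each a list of OD-pair strings.

    Gensim is imported inside the function so that the data preparation
    above it can run even where Gensim is not installed.
    """
    from gensim import corpora, models

    dictionary = corpora.Dictionary(documents)               # OD pair -> integer id
    corpus = [dictionary.doc2bow(doc) for doc in documents]  # counts per period
    lda = models.LdaModel(
        corpus,
        num_topics=k,          # number of trip types
        id2word=dictionary,
        alpha=alpha,           # Dirichlet prior on per-period type mixtures
        passes=10,             # placeholder value
        random_state=seed,
    )
    # Per-period mixture over trip types, keeping near-zero types too
    mixtures = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
    return lda, dictionary, mixtures

# Toy one-hour documents, for illustration only
documents = [
    ["UES>FiDi", "UES>FiDi", "UWS>Midtown"],
    ["LES>Williamsburg", "LES>LES", "Williamsburg>LES"],
    ["Midtown>JFK", "Midtown>LGA", "UES>FiDi"],
]
```

Calling fit_trip_types(documents, k=2) returns the fitted model along with each period's mixture over trip types; lda.show_topics() then lists the most probable origin-destination pairs for each type.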

As is typically done with LDA, I selected the model parameter α, which defines the Dirichlet distribution, by minimizing the “perplexity” of a set of test data.
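
For reference, perplexity as defined in Blei et al. (2003) is shown below. In the taxi setting, I read M as the number of held-out one-hour periods, w_d as the origin-destination pairs observed in period d, and N_d as the number of trips in that period.

```latex
\mathrm{perplexity}(D_{\mathrm{test}})
  = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```

Lower perplexity means the model assigns higher probability to the held-out trips.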

One must also select the number of trip types, k. In text analysis applications, k is often chosen by perplexity minimization as well. In this case, however, the goal was to identify interpretable types. Perplexity should generally decrease with increasing k, so the idea is to choose the lowest k that still gives meaningful results. I decided the most interpretable results were with k=5.  For purposes of comparability, I used the same k value for both cities, although a smaller k might be more appropriate for SF.


Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3, 993–1022.