Behind the scenes with MBTA data.

We are excited to announce our new Open Data Portal!

The MBTA has published the MBTA Ridership and Service Statistics, also known as “The Blue Book,” since 1988. According to the Blue Book from 2005,

“The MBTA receives frequent inquiries from customers, students, peer transit providers, government agencies, community organizations, transportation enthusiasts, and the media for information regarding its operations, and this book is intended to address these needs. Additionally, this book serves as a management and analytical tool for MBTA staff.”  

The most recent Blue Book edition was released in 2014 and contained data on ridership, bus speeds, track distances, fleet rosters, and more. However, the Blue Book has largely lacked consistency in updates. Moreover, because of the Blue Book’s print format, published data has only existed in an aggregated, non-interactive form, and has failed to include extensive historical data.

As demand for more up-to-date data increases, and to address the shortcomings of previous Blue Books, the offices of Performance Management & Innovation and Transportation Planning have worked to create an online open data portal. The portal, which went public on Monday, October 7, is designed to be easily navigable and searchable by mode type and data category. Many of the datasets on the portal had previously been available on other platforms, such as our performance API, though users will now have the ability to download customized datasets using an in-program filtering option. Metrics on reliability and performance are available, as well as historical data about the MBTA’s financials, assets, and system information. New data will be added in the future.

Though we will no longer be publishing new Blue Book versions, the MBTA Open Data Portal provides the same service. The portal allows users to download data directly from the site and view reported figures within applications and maps on the portal. Datasets that are GTFS-compatible are marked as such and can be downloaded for outside visual development. Data exists on the portal in its non-aggregated form to increase options for the user, but can be aggregated by the user upon download to mimic the reporting of the previous Blue Book editions.

How To Use

View all datasets with mode (rapid transit, bus, etc.) or category (ridership, performance, etc.) tags by clicking on the icons under the mode and category headers. Alternately, you can explore all public MBTA datasets by searching for keywords in the dataset title, summary, tags, or description using the search bar at the top of the page. Clicking inside the search bar and pressing enter without entering text will populate all public datasets within the portal. Under the “Overview” tab of a dataset, you can view the description, data dictionary, data limitations, attributes, related data, and metadata. The download and API buttons on the right side of the page allow for download format selection. Under the “Data” tab, you can sort and filter the records by any of the attributes and then download only those filtered records. The “API Explorer” tab shows the query functionality and query URL for the particular dataset.

Here at the MBTA, we keep a close eye on ridership trends and are always working on ways to better collect and visualize ridership data. After the unfortunate derailment on the Red Line in June that drastically affected service throughout the summer, we used Tableau to explore how ridership on the line had been impacted. This post will examine some of the changes in ridership we saw in the period following the derailment as normal service was restored.

The Data

Readers of the blog and transit data enthusiasts will remember that “ridership” does not mean any particular measure, and the ridership we report to the public is estimated based on multiple data sources and historically-based factors. To examine the effects of the derailment, we used card and ticket validations (referred to as “taps” in this post for simplicity) at the gates of stations that serve the Red Line. The data showing taps at gates should record the majority of people passing through them reliably, are comparable to other time periods going back to 2013, and are available at a very granular level. For stations where passengers can board multiple lines (in this case, Downtown Crossing and Park Street), we used the same “split” factors that we used for the dashboard and other reporting to assign a portion of their entries to the Red Line. We did not use any factor to estimate non-interaction; our assumption throughout this post is that non-interaction was roughly static throughout the time periods examined. We also did not attempt to account for passengers who board the Red Line via transfer from another line. It seems possible that fewer passengers would transfer from other lines than usual given the reduced service levels, but we had no reasonable way to measure this.

To explore the data, we queried our research database for all taps on all gates, grouping the data into 30-minute periods and adding attributes for the service date (measured from 3 AM until 2:59 AM the next morning), the type of service in effect for that date (Weekday, Saturday, Sunday, or Holiday), the station and other characteristics about the taps. Once this dataset was built, we loaded it into Tableau and built some views to start exploring.
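The grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the query we ran against the research database: the `(station, timestamp)` layout and the function names are assumptions made for the example, and the sample taps are made up.

```python
from collections import Counter
from datetime import datetime, timedelta

SERVICE_DAY_START_HOUR = 3  # the service day runs from 3:00 AM to 2:59 AM


def service_date(ts):
    """Return the service date for a timestamp (taps before 3 AM count toward the prior day)."""
    return (ts - timedelta(hours=SERVICE_DAY_START_HOUR)).date()


def half_hour_bin(ts):
    """Floor a timestamp to the start of its 30-minute period."""
    return ts.replace(minute=0 if ts.minute < 30 else 30, second=0, microsecond=0)


def count_taps(taps):
    """Count taps per (service date, station, start of 30-minute period).

    `taps` is an iterable of (station, timestamp) pairs; this layout is hypothetical.
    """
    counts = Counter()
    for station, ts in taps:
        counts[(service_date(ts), station, half_hour_bin(ts).time())] += 1
    return counts


# Made-up sample taps, not MBTA data.
taps = [
    ("Braintree", datetime(2019, 6, 11, 8, 14)),
    ("Braintree", datetime(2019, 6, 11, 8, 29)),
    ("Braintree", datetime(2019, 6, 12, 1, 45)),  # after midnight, still the June 11 service date
]
counts = count_taps(taps)
```

Subtracting three hours before taking the date is what keeps late-night taps attached to the service day on which they belong.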

Service was affected differently in different parts of the Red Line. The initial derailment damaged the signal bunkers housed at JFK/UMass station and forced the Red Line to operate in manual block mode. Automatic signaling was fully restored from JFK/UMass north to Alewife on July 31, but automatic signaling was not restored on the Ashmont branch until September 11, and was not completely restored on the entirety of the Red Line until September 23. We also know that many people traveling from either Red Line branch do not frequently use the Red Line north of downtown, and similarly, that many people traveling from stations north of downtown are only going as far as downtown, or even just to Kendall. To examine how these geographically-distributed service impacts have affected ridership spatially, we divided the boardings into five groups based on area of the city. These groups were:


Area                     Stations Included
Cambridge / Somerville   Alewife, Davis, Porter, Harvard, Central, Kendall / MIT
Downtown                 Charles / MGH, Park Street, Downtown Crossing, South Station
South Boston             Broadway, Andrew
Dorchester               JFK / UMass, Savin Hill, Fields Corner, Shawmut, Ashmont
Quincy / Braintree       North Quincy, Wollaston, Quincy Center, Quincy Adams, Braintree


The Results

To get an idea of longer-term trends on the Red Line, we put together the following chart, which shows daily weekday taps on the Red Line (all stations) over the last two years, with a 20-day moving average that smooths the data to show trends. You can see that we generally have a big dip in ridership in December (holidays like Thanksgiving and Christmas, when we run reduced service, are excluded, but we see lower ridership on the weekdays surrounding them). You can also see that we generally have our highest ridership from late September through October, when school is in session and there are few breaks in most people’s schedules. You can also see lower ridership in March 2018, when there were a number of storms that closed schools and otherwise affected ridership. Finally, you can see the drop in ridership over this summer, likely due to the impacts of the derailment. Ridership is usually low in the week around the 4th of July and towards the end of August, but a decrease can be seen this summer right around June 11 (the day of the derailment), and while July had some higher-ridership days, the overall ridership was about 5% lower than last summer.


Chart of Total Taps with Moving Average
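For readers who want to reproduce this kind of smoothing on their own data, here is a minimal sketch of a moving average in Python. We built our charts in Tableau, so this is not the code behind them; the sketch assumes a trailing window (the current day plus the days before it), and the sample numbers are made up.

```python
def moving_average(values, window=20):
    """Trailing moving average; positions with fewer than `window` values are None."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_total = 0.0
    for i, value in enumerate(values):
        running_total += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running_total -= values[i - window]
        averages.append(running_total / window if i >= window - 1 else None)
    return averages


daily_taps = [100, 120, 110, 90, 130]  # made-up numbers, not MBTA data
smoothed = moving_average(daily_taps, window=3)
```

A longer window gives a smoother line that reacts more slowly to changes, which is why a shorter window can be more useful when zooming in on a few months.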

As expected, ridership was less affected in places where service was less affected. Here are some views of the above chart, filtering to just the taps at the Cambridge / Somerville and Quincy / Braintree stations as grouped above. First, the charts show the last 15 months (showing the last two summers) with a 20-day moving average, then they show taps at the stations since May 2019, with a 10-day moving average. We chose the Quincy and Cambridge stations as they had the greatest difference in service as well as the greatest difference in ridership.

 A chart of the Quincy / Braintree branch with moving average

 A zoomed-in version of the chart of the Quincy / Braintree branch with moving average


 A chart of the Cambridge section with moving average


A zoomed-in version of the chart of the Cambridge section with moving average

You can see from the above charts that in Cambridge and Somerville, ridership returned close to its previous levels quite soon after the derailment, and with the exception of the 4th of July week, remained at this level until the last couple weeks of August. In Quincy and Braintree, however, ridership did not rebound to the same level, and this drop continued for the remainder of the summer. 

We took a look at the median weekday ridership compared to the previous year in each of the areas. The time periods here are divided into three: January 1 – May 31, June 1 – August 31, and September. For September 2019, the data is complete through September 27.


Change in Median Weekday Ridership
Area                     January-May   June-August   September
Cambridge / Somerville   2.0%          -1.3%         -1.7%
Downtown                 1.2%          -3.0%         -2.1%
South Boston             -1.5%         -7.8%         -4.9%
Dorchester               0.4%          -9.3%         -3.3%
Quincy / Braintree       -3.9%         -11.9%        -3.1%
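As a rough illustration of how a year-over-year change in median weekday ridership can be computed, here is a short Python sketch. The function name and the daily totals are made up for the example; they are not the figures behind the table.

```python
from statistics import median


def median_change_pct(current, previous):
    """Percent change in the median of `current` relative to the median of `previous`."""
    previous_median = median(previous)
    return (median(current) - previous_median) / previous_median * 100.0


# Made-up weekday tap totals for the same period in two years.
last_year = [1000, 1020, 980, 1010, 400]  # includes one unusually low day
this_year = [950, 990, 970, 960, 2000]    # includes one unusually high day
change = median_change_pct(this_year, last_year)
```

In this toy example the medians are 1,000 and 970, so the change is about -3%, even though the single outlier day in each year would pull a comparison of averages in the opposite direction.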


In the first 5 months of 2019, ridership was generally steady or up slightly compared to the previous year. While part of this is attributed to the low ridership in March 2018 due to snow, we chose to use the median here to mitigate the effect of such days (as well as the abnormally high ridership on February 5, 2019 due to the Patriots’ championship parade). The exception to this trend was the Braintree branch, where ridership was down nearly 4%. While Wollaston station was closed in both time periods, the decline is likely due to increasing construction impacts from various projects along the branch, or perhaps due to people switching to Commuter Rail in the area.

After the derailment, we saw more disparate impacts. The Dorchester and Braintree branches saw the biggest drop in median ridership, likely because service was affected there the most and also because those areas have higher levels of car ownership (in the case of Quincy residents) and more alternate routes to downtown. Since Wollaston station re-opened, we might have expected a greater increase in Quincy and Braintree; however, we looked at the data and noticed that most Wollaston riders seemed to switch to North Quincy while Wollaston was closed (ridership at Wollaston, North Quincy and Quincy Center combined did not significantly change after Wollaston reopened). Ridership in Cambridge and Somerville barely dropped at all compared to the previous summer, which is likely an effect of service being better and there being fewer alternate routes: passengers could switch to the Commuter Rail at Porter, but if they were going somewhere for which that trip was convenient, they probably were taking the Fitchburg line already. Downtown ridership was down, but that is likely largely a product of the ridership in the other areas.

So far in September, we have seen ridership much closer to last September than over the summer, but in most areas, we are down a few percent. Some of this is due to missing data for the last few days of September – we tend to see higher ridership at the end of September than at the beginning. To be sure, we also took a look at the median ridership through the first 13 non-holiday weekdays of each month, as well as the averages. The medians were very close to the average ridership, and through the first 13 days, the changes between the median ridership in the two months were similar, as shown above.

Last month, as people returned from vacations and went back to school, ridership (as measured by taps at stations) had rebounded on the Red Line compared to the summer, and overall is down 2.5% from last September. In Cambridge and Somerville, where service was least impacted by the derailment, ridership is nearly the same as last September and was only slightly down during the summer. In Braintree and Quincy, ridership is down nearly 4 percent, but there were still significant service impacts in this area into September. In South Boston and Dorchester, ridership is also down even though service is largely restored. It is possible that usual riders may have switched to another service or mode, and either may have found that this new method serves their trip better, or may not be aware that service has been restored. We will continue to watch ridership at these stations now that full service is restored and we move into our usual high ridership month of October.


In our previous post about passenger walk distances, we used the Rider Census to examine how accessible transit is to its users and found that passengers walked further than the assumed half-mile to stations at the ends of the Red and Orange Lines, while they walked less than this to stations in the center of our region. Our main conclusion, which is perhaps obvious, was that the structure of the network itself has a large impact on how passengers interact with the network.

We wanted to use this data set to look at passengers’ entire journeys rather than just their access point. To do so, we developed a metric we call “substitution propensity.” In a transportation network, each station is only attractive for a set number of destinations. For example, Savin Hill is a station on the Red Line, so Savin Hill is useful for trips north to downtown Boston. However, for trips west to Ruggles or Dudley Square, Savin Hill is not as useful; it’s likely that people would walk to the nearby stop for the 15 bus instead. In other cases, two nearby stations might serve very similar journeys: for example, much of the E branch of the Green Line and the Orange Line run nearly parallel to each other.

Substitution, as it relates to walkability, is defined here as the propensity with which passengers exclusively choose a particular route over other nearby alternative routes. Substitution explains differences in how passengers choose to access MBTA services: passengers will walk for longer distances in areas in which there are fewer service options. This is also a useful metric for determining what qualities passengers value in MBTA services. For example, there may be situations in which bus routes are not substituted for rail routes even when the bus route is faster because passengers may value frequency over faster travel times.


To measure substitution, we used the 2015-2017 Rider Census data, which includes information about the most recent journey survey respondents took using the MBTA system. We categorized each journey by its starting mode, or the type of service used at the start of the respondent’s journey, and its ending mode, or the type of service used at the end of the respondent’s journey. We defined four categories for the starting mode and ending modes: commuter rail, bus, light rail (the Green and Mattapan lines), and heavy rail (the Red, Orange, and Blue Lines). This resulted in each journey being assigned to one of 16 categories. To give an example, for a passenger who begins their journey at Lynn, takes the commuter rail to North Station, transfers to the Green Line and finishes their journey at Prudential, the journey would be classified as “Commuter Rail to Light Rail.”
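The categorization is simple enough to sketch in Python. The names below (`MODES`, `journey_category`) are ours for this illustration and do not come from the Rider Census data set.

```python
from itertools import product

MODES = ("Commuter Rail", "Bus", "Light Rail", "Heavy Rail")

# Every ordered (starting mode, ending mode) pair: 4 x 4 = 16 categories.
CATEGORIES = [f"{start} to {end}" for start, end in product(MODES, repeat=2)]


def journey_category(starting_mode, ending_mode):
    """Label a journey by the modes used at its start and end."""
    if starting_mode not in MODES or ending_mode not in MODES:
        raise ValueError("unknown mode")
    return f"{starting_mode} to {ending_mode}"


# The Lynn to Prudential example: commuter rail first, Green Line last.
example = journey_category("Commuter Rail", "Light Rail")
```

Because the pairs are ordered, "Bus to Heavy Rail" and "Heavy Rail to Bus" count as different categories.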

While the survey data provided helpful insights on clustering and completed journeys, we had to account for undersampled evening commutes in the data set. We assumed that the trips from point A to point B by morning commuters are duplicated as trips from point B to point A by those same commuters in the evening, assuming that passengers use the same MBTA service for both commutes.
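A minimal sketch of that duplication step, in Python, might look like the following. The `Journey` record is a hypothetical layout, the coordinates are placeholders rather than survey data, and swapping the starting and ending modes on the return trip is our reading of the assumption that passengers use the same services in both directions.

```python
from collections import namedtuple

# Hypothetical record layout; the survey's actual fields are not shown in this post.
Journey = namedtuple("Journey", "origin destination starting_mode ending_mode")


def reverse_journey(journey):
    """Build the assumed return trip: same services, opposite direction."""
    return Journey(
        origin=journey.destination,
        destination=journey.origin,
        starting_mode=journey.ending_mode,
        ending_mode=journey.starting_mode,
    )


def with_return_trips(journeys):
    """Return the original journeys followed by one reversed copy of each."""
    journeys = list(journeys)
    return journeys + [reverse_journey(j) for j in journeys]


# Placeholder coordinates, not survey data.
morning = [Journey((42.46, -70.95), (42.35, -71.08), "Commuter Rail", "Light Rail")]
both_directions = with_return_trips(morning)
```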

We then used the k-nearest-neighbors algorithm for each journey in the 2015-17 Rider Census to select the ten most similar origin-destination pairs. We determined similarity on the basis of a passenger’s origin and destination locations. The origin location would be the latitude-longitude coordinates of the street intersection nearest to the passenger’s home, and the destination location would be the latitude-longitude coordinates of the street intersection nearest to their workplace. The ten most similar journeys were determined using Euclidean distance in four dimensions: the longitude and latitude of the passenger’s origin point and the longitude and latitude of the passenger’s destination point. We calculated the percentage of the ten most similar journeys that belonged to the same category. That measure is the propensity for substitution. Among similar origin-destination pairs, if passengers’ journeys vary greatly, the percentage approaches 0%. If journeys do not vary, the percentage approaches 100%.
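To make the calculation concrete, here is a brute-force Python sketch of the measure: for each journey, find the k nearest other journeys in the four coordinates and report the share that fall in the same category. It is an illustration under stated assumptions, not our production code: the tuple layout is hypothetical, the journey itself is excluded from its own neighbors (the post does not say either way), and the sample coordinates are made up.

```python
import math


def same_category_share(journeys, k=10):
    """Percent of each journey's k nearest neighbors that share its category.

    `journeys` is a list of (origin_lat, origin_lon, dest_lat, dest_lon, category)
    tuples. Distance is plain Euclidean distance on the four coordinates.
    """
    shares = []
    for i, journey in enumerate(journeys):
        # Assumption: a journey does not count as its own neighbor.
        others = [other for j, other in enumerate(journeys) if j != i]
        others.sort(key=lambda other: math.dist(journey[:4], other[:4]))
        neighbors = others[:k]
        same = sum(1 for other in neighbors if other[4] == journey[4])
        shares.append(100.0 * same / len(neighbors) if neighbors else None)
    return shares


# Made-up journeys, not survey data.
journeys = [
    (42.300, -71.050, 42.350, -71.060, "Heavy Rail to Heavy Rail"),
    (42.301, -71.051, 42.351, -71.061, "Heavy Rail to Heavy Rail"),
    (42.302, -71.049, 42.349, -71.059, "Bus to Heavy Rail"),
    (42.400, -71.100, 42.360, -71.060, "Bus to Bus"),
]
shares = same_category_share(journeys, k=2)
```

Comparing every pair of journeys is fine for a small example; for a full survey data set, a spatial index or a library implementation of nearest-neighbor search would be a more practical choice.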

Next, we mapped the substitution metric in QGIS. The survey data was converted to a spatial point data set, with the location of the point determined by the latitude-longitude coordinate of the origin location. We duplicated the survey data while reversing the origin locations and the destination locations, effectively mapping every journey as two points: one representing the origin location and the other representing the destination location. Adjacent points were grouped into 500m hexagons, and the average propensity for substitution was calculated for each hexagon. At 100%, the ten nearest neighbors of journeys that started and ended in that hexagon were taken using the same MBTA service, on average. Alternatively, at 0%, the ten nearest neighbors of journeys that started and ended in that hexagon were taken using different MBTA services.
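We did this step in QGIS, but the idea of averaging a value over a hexagonal grid can be sketched in plain Python. The sketch below swaps in a standard axial-coordinate hexagon grid and makes several assumptions: points are already projected to meters, hexagons are pointy-topped, and `size` is the center-to-corner distance (the post does not say how the 500m was measured). The sample points are made up.

```python
import math
from collections import defaultdict


def hex_cell(x, y, size):
    """Axial (q, r) index of the pointy-top hexagon containing point (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r
    # Round in cube coordinates, then fix the component with the largest error
    # so that q + r + s still equals zero.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def average_by_hex(points, size=500.0):
    """Average the value of all points falling in each hexagon.

    `points` is an iterable of (x_meters, y_meters, value) tuples.
    """
    grouped = defaultdict(list)
    for x, y, value in points:
        grouped[hex_cell(x, y, size)].append(value)
    return {cell: sum(values) / len(values) for cell, values in grouped.items()}


# Made-up projected points with a substitution percentage attached.
points = [(0.0, 0.0, 100.0), (10.0, 10.0, 50.0), (866.0, 0.0, 20.0)]
averages = average_by_hex(points)
```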



A few interesting trends are shown in the substitution map above. Immediately beyond the terminal stations of the Red and Orange Lines, the metric approaches 0%; this is probably because some passengers choose to walk to the Red and Orange Line stations, while other passengers choose to take a bus. Near terminal stations that have large average walk distances, many passengers choose to take other MBTA services rather than walk, since not everyone finds that walk distance acceptable. Another interesting observation is that substitution near Andrew and Broadway, the two Red Line stations that serve South Boston, is relatively low; this is most likely because passengers are choosing to take one of the many bus routes rather than the Red Line. In fact, the eastern half of South Boston has a cluster of hexagons with percentages over 80%, meaning that the bus route is practical enough that passengers forgo the walk to Broadway or Andrew.

To illustrate the usefulness of this approach, we conducted an analysis focused specifically on South Boston. Five bus lines converge on City Point at the edge of South Boston: Routes 5, 7, 9, 10, and 11. We filtered the survey data to identify trips that started or ended with one of those bus lines (n=696), and since the survey data is biased towards morning trips, duplicated the survey data while flipping the starting and ending locations. We then applied the same k-nearest-neighbors algorithm to the data, and mapped the data using the same procedure. The resulting data showed the same cluster around City Point where all five of the bus lines converge.

Subsequently, we grouped the individual points using the k-nearest neighbor algorithm into twenty clusters. The four variables we used to cluster the data were the origin location latitude, origin location longitude, destination location latitude, and destination location longitude. We filtered out the clusters with fewer than 20 data points, leaving twelve clusters, which enabled us to identify unusual trip patterns and ignore them. For each usable cluster, we calculated the average substitution percent and plotted the clusters as lines, with the endpoints of the lines representing the average origin and destination locations of passenger journeys in that particular cluster.
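The filtering and averaging that follow the clustering can be sketched as below. This Python example starts from journeys that already carry a cluster label and does not reproduce the clustering itself; the row layout, the function name, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def summarize_clusters(rows, min_points=20):
    """Average endpoints and substitution percent for each sufficiently large cluster.

    `rows` is an iterable of
    (cluster_id, origin_lat, origin_lon, dest_lat, dest_lon, substitution_pct).
    Clusters with fewer than `min_points` rows are dropped.
    """
    grouped = defaultdict(list)
    for cluster_id, *fields in rows:
        grouped[cluster_id].append(fields)

    summary = {}
    for cluster_id, members in grouped.items():
        n = len(members)
        if n < min_points:
            continue
        # zip(*members) turns the list of rows into columns to average.
        means = [sum(column) / n for column in zip(*members)]
        summary[cluster_id] = {
            "origin": (means[0], means[1]),            # one end of the plotted line
            "destination": (means[2], means[3]),       # the other end
            "avg_substitution_pct": means[4],
            "n": n,
        }
    return summary


# Made-up rows; a small threshold is used so the example stays short.
rows = [
    ("a", 42.0, -71.0, 42.2, -71.2, 80.0),
    ("a", 42.2, -71.2, 42.4, -71.4, 60.0),
    ("b", 42.5, -71.5, 42.6, -71.6, 10.0),
]
summary = summarize_clusters(rows, min_points=2)
```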

The resulting map illustrates that passengers using the bus network, whose journeys start or end near the western portion of South Boston, typically use the same bus route. Passengers whose journeys begin near Andrew or Broadway, however, use different bus routes to serve the same journey. This is potentially a sign that some of the bus routes in South Boston could be consolidated without substantially impacting passenger experience.


In the last two posts, we have used the Rider Census data set to examine how people access transit in greater detail than is usually possible. First, we found that the distance traveled to access transit on foot varies much more than the commonly applied rule of thumb of ½ mile. In this post, we found that people, perhaps unsurprisingly, use different transit services when they have multiple options. Importantly, we do not know from this analysis if an individual might choose different services on different days, nor the reasons why they might choose one service over another. Future analysis can examine these questions, using this and other survey data.