Transportation Flows Mapping Using R
The geographic visualization of data using programming languages, and R specifically, has seen a substantial upsurge in adoption and popularity in the GIS and data analytics community in recent years. While the learning curve for scripting may be steeper than for more traditional, out-of-the-box GIS applications, scripting brings other benefits, such as customizable processes and the ability to handle complex spatial analysis operations. The latter is imperative for projects with large amounts of data, as is often the case with transportation and commuting flows, which typically contain a considerable number of records covering trip origins and destinations, modes of transport and travel times. An added perk is that R produces creative and visually appealing graphics, which was one of the motivators behind the choice of technique for this project. The primary motivator, however, was R’s capacity for transportation data modelling and mapping, as the aim of the project was to map commuting flows.
Story of R
R is an open source software environment and language for statistical computing and graphics. It is highly extensible, which makes it particularly useful to researchers from varied academic and professional fields, ranging from social science, biology and engineering to finance, energy and many others in between. It is also one of the most rapidly growing software programs in the world, most likely due to the expansion of data science. In the context of Geographic Information Systems (GIS), it can be described as a powerful command-line system made up of a range of tailored packages, each offering additional components for handling and analyzing spatial data. The packages used in this project were ggplot2 and maptools and, to a lesser extent, plyr. The first two are among the most common in the R geospatial community. Others encountered during research and worth exploring further were leaflet and mapview for interactive maps; shiny for web applications; and ggmap, sp and sf for general GIS capabilities. Because R is open source, its community is very helpful in organizing and sharing information. One neat resource is the readily available cheat sheets for many of the packages (e.g. the ggplot2 cheat sheet), which make finding information genuinely fast.
There are some stunning examples of data visualization in R. One that made a significant media splash a few years ago was by Paul Butler, a mathematics student at the University of Toronto at the time, who plotted social media friendship connections. According to one author, it drew admiration as well as disbelief from many that this was done with fewer than 150 lines of code in an “old dusty” statistical software such as R. It also inspired further data visualization explorations using R. One of my favorite recent works of this kind is the compelling book London: The Information Capital by geographer James Cheshire and designer Oliver Uberti. Most of the examples in the book were produced in R, and specifically with its ggplot2 package, in combination with graphic design applications, and they serve as innovative illustrations of data visualization approaches and of what the software can do. Both of these projects inspired mine.
Transportation Mapping and Modelling
I would like to give some background on the type of analysis that was conducted. A common type of analysis in transportation geography, transportation planning and transportation engineering is the geographic analysis of origin-destination data, which shows how many people travel (or could potentially travel) between places. The basic unit of analysis in most transport models is the trip: a single-purpose journey from an origin “A” to a destination “B” (not to be confused with Timothy Leary’s definition). Trips are often grouped by transport mode or by number of people travelling, and are represented as desire lines connecting zone centroids. Desire lines are the shortest straight lines between origin and destination points, and they can be converted to routes. They do not need to represent the movement of people only; they can show commodity flows and retail trade as well. TransCAD is often treated as the industry standard for this type of modelling. It is, however, quite costly and is used primarily by transportation planning firms and agencies. R, on the other hand, is starting to see dedicated transportation planning packages, and its GIS packages are increasingly used in the transportation field. Most importantly, it’s free.
The dataset used for the project was the 2009-2013 5-Year American Community Survey Commuting Flows, obtained via the Inter-University Consortium for Political and Social Research. The survey covers the entire United States and focuses on the journeys to work of workers aged 16 and over. Data in the original survey was tabulated by several categories: means of transportation to work, private vehicle occupancy, time leaving home to go to work, travel time and aggregate travel time to work, etc. For the purposes of the project, all workers in commuting flows were selected, grouped together across all transportation modes. The trips were based on intra-county and inter-county commutes.
Two main components are needed to map transportation flows in general: the coordinates of the place of origin and the coordinates of the place of destination. Common practice in the transportation planning field is to use population-weighted centroids for origins and destinations, regardless of the geographic unit of analysis, which in this case was U.S. counties. A population-weighted centroid shapefile for U.S. counties was therefore needed so that it could be merged with the original survey data. It was found on the U.S. Census Bureau website and is based on 2010 U.S. Census population numbers and their distribution within each county. The study area for the project was the United States. Canada and Mexico were excluded, even though both countries appear in the survey as workplace geographies, because the survey does not specify regions within either country, which would make any population-weighted centroid calculated for them unrealistic. Additionally, these records were not numerous enough to change the model significantly.
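The merge described above can be sketched in base R. This is only an illustration under stated assumptions: the column names (`origin_fips`, `dest_fips`, `workers`, `fips`, `lon`, `lat`) and the handful of rows are made up for the example and are not taken from the survey or Census files. The centroid table is merged onto the flow table twice, once for origins and once for destinations.

```r
# Toy commuting flows: one row per origin-destination county pair.
# Column names and values are illustrative, not taken from the ACS files.
flows <- data.frame(
  origin_fips = c("36061", "36061", "06037"),
  dest_fips   = c("36047", "06037", "06037"),
  workers     = c(120, 15, 900),
  stringsAsFactors = FALSE
)

# Toy population-weighted centroids, one row per county (approximate values).
centroids <- data.frame(
  fips = c("36061", "36047", "06037"),
  lon  = c(-73.97, -73.95, -118.26),
  lat  = c(40.78, 40.65, 34.05),
  stringsAsFactors = FALSE
)

# Attach origin coordinates and rename them so they stay distinguishable.
od <- merge(flows, centroids, by.x = "origin_fips", by.y = "fips")
names(od)[names(od) == "lon"] <- "o_lon"
names(od)[names(od) == "lat"] <- "o_lat"

# Attach destination coordinates in the same way.
od <- merge(od, centroids, by.x = "dest_fips", by.y = "fips")
names(od)[names(od) == "lon"] <- "d_lon"
names(od)[names(od) == "lat"] <- "d_lat"

print(od)
```

Keeping the county codes as character strings preserves leading zeros (as in "06037"), which matters when the code is the join key.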
In the first step, the data was loaded and reformatted in R (available from https://www.r-project.org/ ). Although analysis can be conducted in R directly, it is much easier to use RStudio, which provides a user-friendly graphical interface (available from https://www.rstudio.com/ ). The RStudio interface and a snippet of the project’s code are displayed in Figure 1 below.
Figure 1: RStudio interface and snippet of code in the project
Following this, the two datasets, the original commuting survey and the population-weighted centroids, were joined on county name and code. The unified file was then subset to exclude Canada and Mexico, and some columns were renamed to make the origin and destination coordinates easier to read. In the next step, ggplot2 was used to set continuous position scales for the x and y axes and then to plot line segments with an alpha (transparency) setting. The number of trips to be plotted was experimented with, showing either all trips or only flows with more than 5, 10, 15, 20, 25 or 50 trips. Showing all trips resulted in too dense a plot, as the entire United States was used as the study area. For a smaller study area mapped at a larger scale, showing all trips would be acceptable. The best results came from filtering to flows with more than 10 intra-county and inter-county journey-to-work trips, which produced the plot displayed in Figure 2.
The final map was then graphically improved in Adobe Creative Suite, resulting in the image in Figure 3.
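The filtering and plotting steps can be sketched as follows. This is a minimal example, not the project's actual script: the data frame, its column names, the alpha value and the colours are all assumptions chosen for illustration. Only the general approach (a trip threshold, `geom_segment()` with transparency, and continuous x and y scales) comes from the description above.

```r
library(ggplot2)

# Toy origin-destination table with centroid coordinates already attached.
# Column names and values are illustrative, not the project's actual data.
od <- data.frame(
  o_lon   = c(-73.97, -73.97, -118.26, -87.65),
  o_lat   = c(40.78, 40.78, 34.05, 41.85),
  d_lon   = c(-73.95, -118.26, -117.16, -73.97),
  d_lat   = c(40.65, 34.05, 32.72, 40.78),
  workers = c(120, 15, 60, 8)
)

# Keep only flows above a trip threshold (more than 10, as in the text).
min_trips <- 10
od_filtered <- od[od$workers > min_trips, ]

# Draw each flow as a semi-transparent segment so overlapping lines build up
# into brighter areas. The alpha value and colours are assumed, not the
# settings used for the original map.
p <- ggplot(od_filtered) +
  geom_segment(
    aes(x = o_lon, y = o_lat, xend = d_lon, yend = d_lat),
    alpha = 0.3, colour = "white"
  ) +
  scale_x_continuous(name = NULL, breaks = NULL) +
  scale_y_continuous(name = NULL, breaks = NULL) +
  theme(
    panel.background = element_rect(fill = "black"),
    panel.grid = element_blank()
  )
```

Printing `p` renders the plot. Changing `min_trips` is a quick way to compare how dense the map becomes at different thresholds.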
Figure 3: Final mapping project after graphical improvements
The final design, showing thousands of commuting trips, resembled a NASA image of the United States from space at night. It indicated some predictable commuting patterns, such as a higher concentration of journey-to-work lines in large urban centres and in areas with high population densities, such as the Northeast. However, some patterns were not so obvious and required further digging, first into data accuracy (which passed the test) and then into the way the original survey was designed. For instance, there are lines from Honolulu, Anchorage and Puerto Rico to the mainland, even though the survey was designed to represent daily commuting flows by car, truck or van, public transport, and other means of commuting. The survey asked all workers, for primary and secondary jobs, how they commuted during the reference week in which it was conducted and answered. These unusual results are attributable to people who worked during the reference week at a location different from their home or usual place of work, such as people away from home on business. The place-of-work data therefore showed some interesting geographic patterns of workers recorded as making work trips to distant parts of the country (e.g., workers who lived in New York and worked in California).
The final mapping product was printed on a 24” x 36” canvas and framed, as shown in Figure 4. The size was chosen for its 2 to 3 aspect ratio, which seemed best suited to the east-west and north-south extent of the United States. Other options would be to print on acrylic or aluminum, which is less cost effective and more time consuming (most shops require around 10 days to complete it). However, canvas was my preferred choice for this project because of the aesthetic I was aiming for: accentuated high-commuting areas and dimmed low-commuting areas. Another pleasant surprise was that the finished print looked more like a painting than a transportation data visualization project.
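For readers who want to export a plot at the same proportions, one option is `ggsave()` from ggplot2. The sketch below assumes a landscape 36 x 24 inch PDF and uses a placeholder plot and a temporary file name; the article does not say how the original map was exported before it was taken into Adobe Creative Suite.

```r
library(ggplot2)

# Placeholder plot standing in for the flow map (hypothetical data).
segments <- data.frame(
  x = c(0, 1), y = c(0, 1),
  xend = c(1, 0), yend = c(1, 0)
)
p <- ggplot(segments) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))

# 36 x 24 inches matches a 24" x 36" canvas in landscape orientation.
# The file name and the PDF format are assumptions for this example.
out_file <- file.path(tempdir(), "commuting_flows.pdf")
ggsave(out_file, plot = p, width = 36, height = 24, units = "in")
```

A vector format such as PDF keeps thin lines sharp at print size and can be opened in graphic design software for further editing.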
Figure 4: Printed map on canvas