May 2018 – Mark Nagelberg

In the Digging into Data Science Series, I dive into specific tools and technologies used in data science and provide a list of resources you can use to learn more yourself. I keep posts in this series updated on a regular basis as I learn more about the technology or find new resources. Post last updated on May 28, 2018.

If you’re familiar at all with Data Science, you have probably heard of Anaconda. Anaconda is a distribution of Python (and R) targeted towards people doing data science. It’s totally free, open-source, and runs on Windows, Linux, and Mac.

Up until recently, I basically treated my installation of Anaconda as a vanilla python install and was totally ignorant about the unique features and benefits it provides (big mistake). I decided to finally do some research into what Anaconda does and how to get the most out of it.

What is Anaconda and why should you use it?

Anaconda is much more than simply a distribution of Python. It provides two main features to make your life way easier as a data scientist: 1) pre-installed packages and 2) a package and environment manager called Conda.

Pre-installed packages

One of the main selling points of Anaconda is that it comes with over 250 popular data science packages pre-installed. This includes popular packages such as NumPy, SciPy, Pandas, Bokeh, and Matplotlib, among many others. So, instead of installing Python and running pip install once for each of the packages you need, you can just install Anaconda and be fairly confident that most of what you’ll need for your project will be there.
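If you want to see which of these packages are actually present in your own installation, a quick check along these lines can help. This is only a minimal sketch: the package list is just the examples named above, and `find_installed` is a helper name I made up for illustration, not something Anaconda provides.

```python
import importlib.util


def find_installed(package_names):
    """Map each top-level module name to True if it can be imported here, else False."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in package_names
    }


# Import names for the packages mentioned above (an assumed list, adjust as needed).
PACKAGES = ["numpy", "scipy", "pandas", "bokeh", "matplotlib"]

for name, present in find_installed(PACKAGES).items():
    print(f"{name:12s} {'installed' if present else 'missing'}")
```

On a full Anaconda install I would expect every line to say "installed"; on a plain Python install, most of them will probably be missing.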

Anaconda has also created an “R Essentials” bundle of packages for the R language, which is another reason to use Anaconda if you are an R programmer or expect to have to do some development in R.

In addition to the packages, it comes with other useful data science tools preinstalled:

IPython / Jupyter notebooks: I’ll be doing a separate “Digging into Data Science Tools” post on this later, but Jupyter notebooks are an incredibly useful tool for exploring data in Python and sharing explorations with others (including non-Python programmers). You work in the notebook right in your web browser and can add Python code, Markdown, and inline images. When you’re ready to share, you can save your results to PDF or HTML.
Spyder: Spyder is a powerful IDE for python. Although I personally don’t use it, I’ve heard it’s quite good, particularly for programmers who are used to working with tools like RStudio.

In addition to saving you from having to do pip install a million times, having all this stuff pre-installed is also super useful if you’re teaching a class or workshop where everyone needs to have the same environment. You can just have them all install Anaconda, and you’ll know exactly what tools they have available (no need to worry about what OS they’re running, no need to troubleshoot issues for each person’s particular system).

Package and environment management with Conda

Anaconda comes with an amazing open source package manager and environment manager called Conda.

Conda as a package manager

A package manager is a tool that helps you find packages, install packages, and manage dependencies across packages (i.e. packages often require certain other packages to be installed, and a package manager handles all this messiness for you).

Probably the most popular package manager for Python is its built-in tool called pip. So, why would you want to use Conda over pip?

Conda is really good at making sure you have the proper dependencies installed for data science packages. Researching for this blog post, I came across many stories of people having a horrible time installing important and widely used packages with pip (especially on Windows). The fundamental issue seems to be that many scientific packages in Python have external dependencies on libraries in other languages like C and pip does not always handle this well. Since Conda is a general-purpose package management system (i.e. not just a python package management system), it can easily install python packages that have external dependencies written in other languages (e.g. NumPy, SciPy, Matplotlib).
There are many open source packages available to install through Conda which are not available through pip.
You can use Conda to install and manage different versions of python (python itself is treated as just another package).
You can still use pip while using Conda if you cannot find a package through Conda.
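To make the points above a bit more concrete, here is a rough sketch of what the corresponding commands look like when assembled from Python. The helper names are hypothetical (I made them up for this post); the commands they build, `conda install` with `--yes` and `--channel`, and `python -m pip install`, are the standard ones. Nothing is executed unless you call `run_if_conda_available` yourself.

```python
import shutil
import subprocess
import sys


def conda_install_command(packages, channel=None):
    """Build the argument list for `conda install`, e.g. for numpy or python=3.6."""
    command = ["conda", "install", "--yes"]
    if channel is not None:
        command += ["--channel", channel]
    return command + list(packages)


def pip_install_command(packages):
    """Build a pip command tied to the current interpreter, for packages Conda lacks."""
    return [sys.executable, "-m", "pip", "install"] + list(packages)


def run_if_conda_available(command):
    """Run a conda command only when conda is on PATH; otherwise just report it."""
    if shutil.which("conda") is None:
        print("conda not found; would have run:", " ".join(command))
        return None
    return subprocess.run(command, check=True)


# Python itself is just another package, so a version is requested the same way.
print(" ".join(conda_install_command(["python=3.6", "numpy", "scipy"])))
```

In day-to-day use you would simply type these commands in a terminal; wrapping them in Python is mostly handy for setup scripts.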

Together, Conda’s package management and the pre-installed packages make Anaconda particularly attractive to newcomers to Python, since they significantly lower the amount of pain and struggle required to get stuff running (especially on Windows).

Conda as an environment manager

An “environment” is basically just a collection of packages along with the version of python you’re using. An environment manager is a tool that sets up particular environments you need for particular applications. Managing your environment avoids many headaches.

For example, suppose you have an application that’s working perfectly. Then, you update your packages and it no longer works because of some “breaking change” to one of the packages. With an environment manager, you set up your environment with particular versions of the packages and ensure that the packages are compatible with the application.

Environments also have huge benefits when sending your application to someone else to run (they are able to run the program with the same environment on their system) and deploying applications to production (the environment on your local development machine has to be the same as the production server running the application).
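One common way to pin and share an environment is a Conda environment file. The sketch below generates one from Python; the function name, the project name, and the version numbers are all placeholders I chose for illustration, not recommendations.

```python
def render_environment_file(name, conda_packages, pip_packages=()):
    """Return the text of a Conda environment file with pinned package versions."""
    lines = [f"name: {name}", "dependencies:"]
    lines += [f"  - {spec}" for spec in conda_packages]
    if pip_packages:
        # Packages that are only available from PyPI go in a nested pip section.
        lines.append("  - pip")
        lines.append("  - pip:")
        lines += [f"    - {spec}" for spec in pip_packages]
    return "\n".join(lines) + "\n"


# Hypothetical project pins (placeholder versions).
text = render_environment_file(
    "my-analysis",
    ["python=3.6", "numpy=1.14", "pandas=0.22"],
    pip_packages=["some-pypi-only-package==1.0"],
)
print(text)
```

Saved as environment.yml, a file like this can be turned back into an environment with `conda env create -f environment.yml`. Going the other way, `conda env export` writes out the environment you currently have.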

If you’re in the Python world, you’re probably familiar with the environment manager virtualenv (and its built-in counterpart, venv). So why use Conda for environment management over virtualenv?

Conda environments can manage different versions of python. In contrast, virtualenv must be associated with the specific version of python you’re running.
Conda still gives you access to pip and pip packages are still tracked in Conda environments.
As mentioned earlier, Conda is better than pip at handling external dependencies of scientific computing packages.
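When you juggle several environments, it is easy to lose track of which one a script is actually running in. Here is a small sketch for checking. It relies on the `CONDA_DEFAULT_ENV` and `CONDA_PREFIX` variables that Conda sets when an environment is activated; the function name is my own.

```python
import os
import sys


def describe_environment(environ=None):
    """Summarize the interpreter and, if one is active, the Conda environment."""
    environ = os.environ if environ is None else environ
    return {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "interpreter_prefix": sys.prefix,
        # Both values are None when no Conda environment has been activated.
        "conda_env_name": environ.get("CONDA_DEFAULT_ENV"),
        "conda_env_path": environ.get("CONDA_PREFIX"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running this in two different Conda environments should show a different Python version and prefix in each, which is exactly the "different versions of Python" point above.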

For a great introduction on managing python environments and packages using Conda, see this awesome blog post by Gergely Szerovay. It explains why you need environments and basics of how to manage them in Conda. Environments can be a somewhat confusing topic, and like a lot of things in programming, there are some up-front costs in learning how to use them, but they will ultimately save tons of time and prevent many headaches.

Bonus: no admin privileges required

In Anaconda, installations and updates of packages are independent of system libraries and administrative privileges. For people working on their personal laptop, this may not seem like a big deal, but if you are working on a company machine where you don’t have admin privileges, this is crucial. Imagine having to run to IT whenever you wanted to install a new Python package: it would be totally miserable, and it’s not a problem to be underestimated.

Further resources / sources

Conda Commands Cheat Sheet
Conda Documentation
Conda Quick Start Guide
Conda: Myths and Misconceptions
Does Conda replace the need for virtualenv? (Stack Overflow)
Python Environment Management with Conda (includes description of using Anaconda alongside Jupyter Notebooks)
Python Tutorial: Anaconda – Installation and Using Conda (YouTube)
Table Comparison of conda vs. pip vs. virtualenv commands
Understanding and using python virtualenvs from a Data Scientist perspective
What advantages are there of using Anaconda? (Reddit)
What is the difference between pyenv, virtualenv, anaconda? (Stack Overflow)
Why should I use anaconda instead of traditional Python distributions for data science? (Quora)
Why do we need Anaconda when we have pip? (Stack Overflow)
Why you need Python environments and how to manage them with Conda

For a long time, I’ve been interested in transportation and urban economics. When I was doing my Masters, I planned to specialize in these areas if I continued on to a PhD. So, when I saw a job position open for Data Scientist at the City of Winnipeg Transportation Assets Division, I didn’t have to spend much more than two seconds considering whether I would apply.

Well, a few months have passed and I’m happy to announce that I was successful: I’m starting the position this week. To say I’m excited is a huge understatement. The Division has been doing great things with the recent development of the Transportation Management Centre (TMC) and I’m looking forward to being a part of these cutting-edge efforts to improve the City’s transportation system.

To get up to speed, I’ve been looking through various sources to get an idea of what municipalities have been up to in this space. I was pointed to the Big Transportation Data for Big Cities Conference, which took place in 2016 in Toronto and involved transportation leaders from 18 big cities across North America. The presentations are all available online and are a great source to understand the kind of transportation data cities are collecting, how they’re using it, possibilities for future use, and challenges that remain.

How cities are using transportation data

Municipalities are collecting unprecedented amounts of data and working to apply it in a variety of ways. Steve Buckley from the City of Toronto Transportation Services provides a useful categorization of the main areas of use for city transportation data: describing, evaluating, operating, predicting, and planning.

Describing (Understanding)

A fundamental application of the transportation data flowing into municipalities is simply to provide situational awareness about what is actually happening on the ground. This understanding is a prerequisite to all other forms of data use.

In the past, this was hard and expensive, but with widespread GPS, mobile applications, wireless communication technology, and inexpensive sensors, this kind of descriptive data is becoming cheaper to collect, easier to collect, and more detailed.

There appears to still be a lot of “low hanging fruit” for improving safety and congestion by simply having more detailed data and observing what is actually happening on the ground. For example, one particularly interesting presentation from Nat Gale from the Los Angeles Department of Transportation points out that only 6% of their streets account for 65% of deaths and serious injuries for people walking and biking (obviously prime targets for safety improvements). His presentation goes on to describe how they installed a simple and inexpensive “scramble” pedestrian crossing at one of the most dangerous intersections in the city (Hollywood / Highland) and this appears to have increased the safety of the intersection dramatically.
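A statistic like "6% of streets account for 65% of deaths and serious injuries" is straightforward to compute once incidents are counted per street. Here is a minimal sketch of the arithmetic. The counts below are invented for illustration and have nothing to do with the actual Los Angeles data.

```python
def share_of_streets_for(target_share, incidents_per_street):
    """Smallest fraction of streets that together reach target_share of all incidents."""
    total = sum(incidents_per_street)
    if total == 0:
        raise ValueError("no incidents recorded")
    running = 0
    # Rank streets from most to fewest incidents and accumulate until the target is met.
    ranked = sorted(incidents_per_street, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running >= target_share * total:
            return rank / len(incidents_per_street)
    return 1.0


# Made-up incident counts for ten streets (100 incidents in total).
counts = [40, 30, 10, 5, 5, 4, 3, 2, 1, 0]
print(share_of_streets_for(0.65, counts))
```

With these made-up numbers, the two worst streets (20% of the total) already cover 70 of the 100 incidents, so they are the obvious places to look first.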

Evaluating (Measuring)

While descriptive data is crucial, it is not sufficient. You also need to understand what is most important in the data (i.e. key performance indicators) and have reliable ways of figuring out whether an intervention (e.g. light timing change) actually produced better results.

Along these lines, one particularly interesting presentation was from Dan Howard (San Francisco Municipal Transportation Agency) on their use of transit arrival and departure data to determine transit travel times (no GPS data required). Using this data, they can compare travel times before and after interventions, and understand the source of delays by simply examining the statistical distribution of travel times (e.g. lognormal distribution means good schedule adherence, normal distribution implies random events affect travel times, and multiple peaks indicate intersection / signal issues).
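The before-and-after comparison described above can be sketched with nothing more than departure and arrival timestamps. This is my own simplified illustration, not SFMTA's actual method, and the trips are invented.

```python
from datetime import datetime
from statistics import mean, median


def travel_minutes(departure, arrival, fmt="%H:%M:%S"):
    """Minutes between leaving one stop and arriving at the next (same-day times)."""
    delta = datetime.strptime(arrival, fmt) - datetime.strptime(departure, fmt)
    return delta.total_seconds() / 60


def summarize(trips):
    """Summary statistics for a list of (departure, arrival) pairs."""
    times = [travel_minutes(dep, arr) for dep, arr in trips]
    return {"trips": len(times), "median": median(times), "mean": mean(times)}


# Invented trips on one route segment, before and after a hypothetical signal change.
before = [("08:00:00", "08:12:00"), ("08:15:00", "08:29:00"),
          ("08:30:00", "08:41:00"), ("08:45:00", "09:05:00")]
after = [("08:00:00", "08:10:00"), ("08:15:00", "08:26:00"),
         ("08:30:00", "08:40:00"), ("08:45:00", "08:58:00")]

print("before:", summarize(before))
print("after: ", summarize(after))
```

A mean sitting well above the median is a crude hint that the distribution has a long right tail. Diagnosing the shapes mentioned in the presentation (lognormal, normal, multiple peaks) would take a proper histogram or distribution fit on far more trips than this.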

Operating

A key theme throughout many of the presentations is the potential benefit of getting traffic data in real time. For example, several municipalities have live camera observations, weather data, and mobile application data (among other sources). These sources can inform operational improvements such as real-time congestion and light timing adjustment, traffic officer deployment planning, construction management, and detection of equipment / mechanical failures.

Predicting

The improved detail of the data, its real-time nature, and evaluation techniques come together to enable a variety of valuable predictive analytics that let municipalities respond proactively (e.g. determining the locations at highest risk of congestion or accidents and preventing accidents before they happen).

Planning (Prioritizing investments)

With improved data and improved insights from the data, municipalities can do better planning of investments to yield the highest value in terms of some target (e.g. commute times, accidents).

Municipalities are starting to capitalize on the benefits of open data

One common thread throughout many of the presentations is the benefits of opening up city data to the public, third parties, and other government departments. Although this is not without its challenges, there are many potential benefits.

Personally, as a data-oriented person, I’m particularly gung-ho about opening data up to the public, as long as the data does not infringe on anyone’s privacy and the cost of making the data public is not too high. I feel like this should be almost a moral imperative of public institutions – if you’re collecting public data, then the public should be able to access that data (again, after considering privacy concerns and resource constraints).

But there are also much more selfish reasons, beyond moral principle, for cities to open up their data, and based on these presentations, municipalities are starting to understand these benefits.

One important advantage is that by making the data public, you create opportunities for others to do analysis or write software applications that your organization simply does not have the resources to do. For example, it may not be a core competency of a transportation department to build, deploy, and maintain mobile applications. However, many people want something like this to exist, and making transit schedules accessible through a public API lets others do this work. In these cases, the municipality plays the role of enabler.

Another thing to consider is that people can be quite ingenious and figure out things to do with the data that you never dreamed of. By making the data public, you can crowdsource the ingenuity and resourcefulness of citizens for the benefit of the public. Municipalities can do this not only by opening the data, but also by hosting public events such as urban data challenges or open data hackathons. Sara Diamond from OCAD University went through several examples of clever visualizations and related projects resulting from open transit data.

Another advantage of opening data is that it promotes collaboration with other municipalities and other departments within a single municipality. Opening the data builds competencies that can come in handy even if the data is not made public: for example, it may help a municipality share critical transportation data with other departments (e.g. emergency response teams).

This collaborative approach seems central for many municipalities in the conference. For example, Abraham Emmanuel from the City of Chicago talks about the City’s Transportation Management Center, which is working to “develop an integrated and modular system that can be accessed from anywhere on the City network” and “create interfaces with external systems to collect and share data” (where “external systems” can include the Chicago Transit Authority, Utilities, Third Parties, and others).

Municipalities are opening up to open source

Increasingly, municipalities are beginning to understand the value of open source software and are incorporating it into their operations. Bibiana McHugh from TriMet Portland provides a useful comparison of proprietary and open source software, with open source providing more control, fostering innovation / competition, attracting a broader user and developer base, and having low entry costs.

Catherine Lawson from the University at Albany Visualization and Informatics Lab (AVAIL) similarly presents benefits of open source, noting advantages such as defensible outputs (open platforms allow for 3rd party verification of output) and trustworthiness (open platforms can lead to a robust shared confidence in outcomes). In contrast, the advantages of proprietary models include alignment with procurement processes and the fact that it is the traditional, (currently) best-understood model.

Perhaps the best illustration of open source in action is given in Holly Krambeck’s (World Bank) presentation showing how open-source solutions can “leapfrog” traditional intelligent transportation systems in resource-constrained cities. She talks about the OpenTraffic program, where “data providers” (e.g. taxi hailing companies) collect GPS location data from mobile devices and host an open-source application called “Traffic Engine” that translates the raw GPS data into anonymized traffic statistics. These are sent to a server, pooled with other data providers’ statistics, and served through an API for users to access the data. OpenTraffic is built using fully open-source software and you can find a detailed report of how the project works here.
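To give a feel for what "translating raw GPS data into anonymized traffic statistics" might involve, here is a toy sketch. It is my own guess at the general idea and is not how Traffic Engine is actually implemented: it assumes speeds have already been matched to road segments, drops the device IDs, and suppresses any group with too few distinct devices. All names and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def anonymized_segment_speeds(observations, min_devices=3):
    """Average speed per (segment, hour), without device IDs and without sparse groups.

    observations: iterable of (device_id, segment_id, hour, speed_kmh) tuples.
    """
    groups = defaultdict(lambda: {"devices": set(), "speeds": []})
    for device_id, segment_id, hour, speed_kmh in observations:
        group = groups[(segment_id, hour)]
        group["devices"].add(device_id)  # used only for the threshold, never output
        group["speeds"].append(speed_kmh)
    return {
        key: {"observations": len(g["speeds"]), "mean_speed_kmh": mean(g["speeds"])}
        for key, g in groups.items()
        if len(g["devices"]) >= min_devices  # too few devices could expose one driver
    }


# Invented observations from four hypothetical taxis.
sample = [
    ("taxi-1", "seg-A", 8, 30.0),
    ("taxi-2", "seg-A", 8, 40.0),
    ("taxi-3", "seg-A", 8, 50.0),
    ("taxi-4", "seg-B", 8, 60.0),
]
print(anonymized_segment_speeds(sample))
```

Only seg-A survives in the output, because seg-B was seen by a single device. A real system would also have to handle map matching and much larger volumes, which this sketch ignores.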

I think this is very exciting not just for the municipalities that reap the benefits of open source, but for programmers who now have the opportunity to build a reputation for themselves and their city, all while contributing a public good that benefits everyone.

Challenges

Of course, there are challenges that come along with the opportunities of producing large scale, highly detailed transportation data. Mark Fox from the University of Toronto Transportation Research Institute has an extremely useful presentation outlining some of the main challenges often associated with open city data. These include:

Granularity: datasets often have different levels of aggregation.
Completeness: it is important to think carefully about what to open to the public and to have a reason behind opening it.
Interoperability: datasets across different departments may describe similar things but may not be comparable due to slightly differing schemas / data types.
Complexity: the data may be very complex, which limits how it can be presented to the public.
Reliability: whenever you collect data, there are questions about its reliability that limit the ability to use and apply it.
Empowerment: this is an interesting challenge I had not considered, which refers to the incentives often built into government organizations to avoid failure at all costs and not engage in any risk-taking through experimentation. There also may tend to be a focus on short-term delivery of political goals and a lack of a long-term strategy of innovation.

Ann Cavoukian from Ryerson University (and formerly the Information and Privacy Commissioner for Ontario) adds privacy to this list of challenges. Her presentation focuses entirely on these issues, along with “Privacy by Design” standards to help mitigate these risks. She points out that extensive data collection and analytics can lead to “expanded surveillance, increasing the risk of unauthorized use and disclosure, on a scale previously unimaginable”. With recent privacy and data breach scandals from Equifax and Facebook since this presentation took place, I assume these issues are even more at the forefront of municipalities’ concerns with respect to transportation data.

