Budding UW Data Scientists Use Their Powers for Social Good
Earn a degree in the field of data science these days and your ticket is punched: Google, Amazon, Facebook, leading-edge academic research, a well-funded startup—they’re all clamoring for people proficient in the tools and techniques needed to sift through today’s endless streams of digital data in search of something valuable.
Social service organizations and local governments are confronting the data deluge, too, often without the capacity to pay the salaries that profit-driven companies can offer these sought-after experts.
Enter the University of Washington’s just-concluded Data Science for Social Good summer internship. The program set interdisciplinary student teams, guided by professional data scientists and subject-matter experts, to work on thorny, real-world urban problems including family homelessness, paratransit bus service, community well-being, and sidewalk mapping for accessible route planning.
During their final presentations last week, four student teams showed off tools they built over the summer that should provide lasting value to the organizations whose data they worked with, and the community at large. In sharing their process, the teams also highlighted the challenges inherent in drawing insight from big data.
One team, working with the Bill & Melinda Gates Foundation and Building Changes, sought to parse data from King, Pierce, and Snohomish counties on family homelessness. The nonprofits and the counties are in the midst of a multi-year initiative aimed at making family homelessness rare, brief, and one-time.
Each county uses a federally mandated system to track family homelessness, but there are differences in the way they enter data, count what constitutes a family, and define an episode of homelessness. This presented the DSSG team with a classic data-wrangling problem as they tried to look for factors that lead to families successfully moving out of homelessness programs and into permanent housing.
“We spent the bulk of the summer trying to find ways to process the data into an analyzable format,” said Joan Wang, one of the DSSG interns, during her team’s presentation.
They used clustering algorithms to better define and identify individual households within the anonymous county data. They reviewed literature and consulted with county experts to create a uniform definition of a single episode of homelessness. A family might enroll in multiple programs that overlap—such as emergency shelter followed by rapid re-housing—and would show up as multiple entries in the tracking system. By aggregating these events into a single episode, the data better matched the reality of a family’s experience.
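The episode-aggregation step described above can be illustrated with a minimal Python sketch. This is not the team's code. The record layout, the function name `merge_episodes`, and the `max_gap_days` tolerance are all assumptions made for illustration, and the sample records are invented. The sketch sorts each household's enrollments by entry date and folds any enrollment that overlaps the current episode into that episode.

```python
from datetime import date, timedelta
from itertools import groupby
from operator import itemgetter


def merge_episodes(enrollments, max_gap_days=0):
    """Collapse overlapping program enrollments into episodes, per household.

    enrollments: iterable of (household_id, program, entry_date, exit_date).
    max_gap_days: hypothetical tolerance (an assumption, not from the article);
    enrollments that start within this many days of the running episode's end
    are treated as part of the same episode.
    """
    episodes = []
    # Sort by household, then by entry date, so each household's rows are
    # contiguous and in chronological order.
    ordered = sorted(enrollments, key=lambda e: (e[0], e[2]))
    for household, rows in groupby(ordered, key=itemgetter(0)):
        current = None
        for _, program, start, end in rows:
            gap = timedelta(days=max_gap_days)
            if current is not None and start <= current["end"] + gap:
                # Overlaps (or nearly touches) the running episode: extend it.
                current["end"] = max(current["end"], end)
                current["programs"].append(program)
            else:
                if current is not None:
                    episodes.append(current)
                current = {
                    "household": household,
                    "start": start,
                    "end": end,
                    "programs": [program],
                }
        if current is not None:
            episodes.append(current)
    return episodes


# Invented sample records: household H1 moves from emergency shelter into
# rapid re-housing with overlapping dates, then returns a year later.
records = [
    ("H1", "emergency shelter", date(2015, 1, 1), date(2015, 2, 15)),
    ("H1", "rapid re-housing", date(2015, 2, 10), date(2015, 6, 1)),
    ("H1", "emergency shelter", date(2016, 3, 1), date(2016, 3, 20)),
    ("H2", "transitional housing", date(2015, 5, 1), date(2015, 9, 1)),
]
episodes = merge_episodes(records)
```

With these sample records, the two overlapping H1 enrollments become one episode and the later shelter stay becomes a second. How large a gap should still count as the same episode is the kind of definitional question the team settled by reviewing literature and consulting county experts.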
In the end, the team fed the processed data into an interactive diagram that illustrates the flow of families through the system, visualizing the individual programs that contributed to successful exits from homelessness. (It’s a Sankey diagram, commonly used to chart the flow of energy through an economy. You can check it out here.)
“Generally, nonprofits don’t have the capacity to do anything more complicated than a regression analysis, so the machine learning and decision trees (which are used by the for-profit sector) were leaps and bounds more advanced than what we’re used to seeing and provided a huge benefit to the counties,” said Anne Martens of the Gates Foundation via e-mail. “The project allowed the counties to look at the data in new ways, which has already influenced their decision-making process.”
Data Science in Transit
Another team delved into data from King County Metro’s Paratransit service, which provides on-demand, door-to-door transportation for people whose disabilities prevent them from using regularly scheduled bus service. Fares for paratransit service, mandated by the Americans with Disabilities Act, cannot be more than double the fares on regular buses, but the service can cost 10 times as much to provide. The program is funded from the same bucket of money that funds regular bus service. As such, reducing costs to operate paratransit benefits all transit riders in King County.
The team (pictured at top) sought to help King County Metro better predict paratransit usage, analyze its highest-cost rides, and more efficiently reschedule riders when a paratransit bus breaks down in the middle of a run.
One tool they built plots historical usage data (which can be overlaid with holidays, the closure of a community center, and other factors that might affect usage) to help Metro contract for only the paratransit service it needs for a given hour or day. For one particular Tuesday, Metro could have contracted for 30 fewer hours of bus service, saving around $1,500, the team said.
Another tool could help dispatchers pick the least costly alternative to serve riders when the paratransit bus they were scheduled on breaks down. The tool automatically finds nearby buses that could be diverted to pick up stranded riders without causing riders on those buses to miss their appointments.
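The selection logic behind such a tool might look something like the Python sketch below. The article does not describe how the team's tool works internally, so everything here is an assumption for illustration: the `Bus` fields, the `pick_rescue_bus` function, the per-minute cost model, and the sample values. The sketch keeps only buses whose detour fits within the spare time of the riders already on board, then picks the cheapest of those.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bus:
    """Hypothetical view of a nearby in-service bus (all fields are assumptions)."""
    bus_id: str
    detour_minutes: float   # extra driving time to pick up the stranded rider
    slack_minutes: float    # least spare time before any onboard rider's appointment
    cost_per_minute: float  # operating cost while driving the detour


def pick_rescue_bus(buses: List[Bus]) -> Optional[Bus]:
    """Return the cheapest bus that can detour without making riders late."""
    # A bus is feasible only if the detour fits inside its riders' slack.
    feasible = [b for b in buses if b.detour_minutes <= b.slack_minutes]
    if not feasible:
        return None  # no diversion works; the dispatcher needs another option
    return min(feasible, key=lambda b: b.detour_minutes * b.cost_per_minute)


# Invented example values.
nearby = [
    Bus("B1", detour_minutes=10, slack_minutes=5, cost_per_minute=1.0),
    Bus("B2", detour_minutes=12, slack_minutes=20, cost_per_minute=1.0),
    Bus("B3", detour_minutes=15, slack_minutes=30, cost_per_minute=1.0),
]
choice = pick_rescue_bus(nearby)
```

In this example B1 is the closest bus but is ruled out because its riders have only five minutes to spare, so the sketch selects B2. A real dispatch tool would also need to account for vehicle capacity, accessibility equipment, and live traffic, none of which is modeled here.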
“Any sort of cost savings that we can provide immediately translates to the ridership of the scheduled [bus] system,” said Anat Caspi, director of the Taskar Center for Accessible Technology within the UW’s computer science department, in an interview earlier this summer. Caspi guided the paratransit team and a team building routing capabilities on top of Access Map, an app that identifies obstacles for people with limited mobility traveling through the city.
Career Opportunities in Urban Challenges
The interns were given tutorials on data management software such as SQL for database queries, ArcGIS for geospatial data, and Socrata for open government data; the programming languages Python and R; machine learning and analytics tools such as GraphLab; visualization tools from Tableau Software and D3, which was used for the homelessness diagram; and more.
DSSG, modeled on programs at the University of Chicago and Georgia Tech, is an outgrowth of the UW’s eScience Institute, a multipronged effort to advance data-driven discovery in all fields. Housed in a light-filled space on the sixth floor of the physics and astronomy building at the UW, the institute pairs subject-matter experts from departments across the university with data scientists who can help them apply the latest methods and technologies to data in their domain.
The idea is to bring to urban challenges the same data-intensive approach to research and problem solving that is now the coin of the realm in the hard sciences, training students along the way. It meshes with another new effort called Urban@UW, which is uniting researchers and practitioners from a range of fields, on and off campus, to tackle the multifaceted, integrated challenges facing cities in the 21st century, starting with Seattle. (Stay tuned for more coverage of Urban@UW, which will be featured at Xconomy’s upcoming Seattle 2035 conference this fall.)
Some 144 students applied to the initial 10-week DSSG internship program. Sixteen were selected, hailing from 10 different departments. The university students—graduates and undergraduates alike—were joined by six high school students from the UW’s Alliances for Learning and Vision for Underrepresented Americans (ALVA) program.
DSSG seemed to resonate with the participating students. It came through in their presentations, each of which ended with a laundry list of things they’d do next if they had more time.
Frank Fineis, a member of the paratransit team, is clearly aware of the opportunities unfurling in front of data scientists. In describing potential data science career paths, he paraphrased Voltaire, or possibly Spider-Man’s Uncle Ben.
“It’s definitely a super-trendy buzzword,” he said. “It’s cool there’s this focus on social good, when I feel like there’s so many evil ways to use it, you know, like get hired working for a Raytheon. It’s cool that there’s this push for open data and for using your powers for good.”