data science on frankhecker.com

Getting down with data

Wed, 22 May 2019 12:00:00 -0400

The RStudio Cloud service provides a web-based alternative to installing the RStudio desktop software, and includes tutorials for using the R statistical language and the Tidyverse set of functions.

tl;dr: Everyone can now work with data and visualize it. Should you?

NOTE: This article was originally published in my Civility and Truth Substack newsletter. I have republished it here without changes.

I haven’t had time yet to look at the detailed median household income data for Howard County (for which I’m going to try to do some maps of income by census tract), so that will have to wait for a future post. In the meantime I wanted to talk a bit about how I do these visualizations, how you can do them too if you have the time and interest, and what I’ve learned in the process.

Easy data analysis and visualization for free

Once upon a time anyone wanting to do serious statistical analysis and graphic visualization of data needed to purchase a license for proprietary software products like SAS or SPSS that cost hundreds or even thousands of dollars per user. The traditional alternative for most users was Microsoft Excel, which included at least a basic set of statistical functions and graphing operations. However it was still not exactly cheap, especially for home users, and given its origin in accounting spreadsheets it was not really that suitable for advanced statistical and data visualization work.

R and the tidyverse

What has changed from then until now? First, noncommercial alternatives arose to SAS, SPSS, and similar products, most notably the R statistical programming language and its associated runtime environment. Unlike SAS and SPSS, R was developed through an open collaborative process in which anyone could participate, and the resulting software was distributed in both binary and source form at no charge. R relatively quickly gained many users, and today it is pretty much the most popular language (along with Python) for so-called “data science” projects.

Unfortunately as a programming environment R is relatively difficult to use, especially for people coming to it as a first language. The second advance was to simplify the use of R by dictating a particular way of programming in it. This was accomplished by the statistician Hadley Wickham and his colleagues, who developed a set of R extensions or “packages” known colloquially as the “Hadleyverse” and now renamed as the “tidyverse.”

The tidyverse packages implement a simplified philosophy for working with data, basically treating all data as sets of tables whose rows and columns can be manipulated in various ways, with the output of each manipulation producing a new table used as input to the next manipulation. The tidyverse packages also include an accompanying set of functions (“ggplot” and others) to graph data in various ways, again adhering to a particular philosophy of how to transform data into visuals.
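
To make the "table in, table out" idea concrete, here is a minimal sketch in Python rather than R. It is not tidyverse code: plain lists of dictionaries stand in for tables, and the helper functions, column names, and numbers are all made up for illustration.

```python
# Each helper takes a table (a list of dicts) and returns a NEW table,
# leaving its input untouched, so the steps can be chained.

def filter_rows(table, predicate):
    """Return a new table containing only the rows that satisfy predicate."""
    return [row for row in table if predicate(row)]

def add_column(table, name, compute):
    """Return a new table with one extra column computed from each row."""
    return [{**row, name: compute(row)} for row in table]

def sort_rows(table, key, descending=False):
    """Return a new table sorted by the given column."""
    return sorted(table, key=lambda row: row[key], reverse=descending)

# Hypothetical input table (made-up values).
precincts = [
    {"precinct": "A", "registered": 2000, "ballots": 900},
    {"precinct": "B", "registered": 1500, "ballots": 900},
    {"precinct": "C", "registered": 0, "ballots": 0},
]

# The output of each manipulation is the input to the next one.
with_voters = filter_rows(precincts, lambda r: r["registered"] > 0)
with_turnout = add_column(with_voters, "turnout",
                          lambda r: r["ballots"] / r["registered"])
result = sort_rows(with_turnout, "turnout", descending=True)

for row in result:
    print(row["precinct"], f"{row['turnout']:.0%}")
```

The point of the design is that no step modifies the table it was given, which makes it easy to inspect the intermediate results or to reorder the steps.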

Data analysis and visualization as a service

So-called “free and open source” software products like R and the tidyverse packages are a godsend for people like me who can’t or don’t want to pay for expensive proprietary software. But to paraphrase a former colleague of mine, free software is only free if your time has no value: the time and effort spent downloading, installing, and configuring software can be daunting, especially for a casual user who just wants to do a basic data plot. This is especially true if you want to do more advanced things, like displaying data on maps.

To address this issue a startup, RStudio (founded by J.J. Allaire, with Hadley Wickham as its chief scientist), has worked to lower the barriers to widespread use of R and the tidyverse packages. Its first product, also called RStudio, provides an integrated development environment (IDE) to simplify creating R-based data analyses. In its RStudio Server version it allows an organization to stand up a central web site to which users can connect and use R, the tidyverse packages, and other R-based capabilities without having to install software on their own PCs.

However, RStudio Server removed a burden from end users only to place it on the people charged with standing up the server system with all its necessary software. That was fine for larger organizations, but a problem for small businesses, not to mention individual users.

To address that issue RStudio is now developing a new service, RStudio.cloud, currently being made available for testing by the public. With RStudio.cloud all you need is a browser: the R and RStudio software is already pre-installed for you, with additional packages easily installable on the service if and when you need them. RStudio.cloud also includes a full set of interactive tutorials (see the graphic above), so that anyone who’s familiar with (say) working with Excel spreadsheets, formulas, and macros can learn to do basic data analyses and visualizations.

(If you want to try RStudio.cloud yourself, you can sign up for a free account and work through some of the interactive tutorials. If you want to explore a non-trivial project, I’ve shared a version of my “hocodata” project on RStudio.cloud for others to access.)

Public data for public use

Of course, it’s not enough to know how to do data analyses and visualizations. You also need some actual data to work with. Here, as in other areas, government (local, state, and Federal) has come to the rescue—though not always, and not always completely.

Government’s “data exhaust”

Governments by their nature generate a lot of data about the jurisdictions over which they hold sway. The most notable (and ancient) example of this is the census, which has gone from being a simple count of people to collecting all sorts of relevant demographic, economic, and other data about populations.

Governments also collect a lot of other data in the course of their operations, for example about crimes both serious and petty, building permits and zoning decisions, the locations of fire hydrants and streetlights, and so on. Traditionally this data was generated and kept as paper documents, but now it is almost always generated and stored as digital files or as entries in a digital database—a sort of “data exhaust” that is emitted by the day-to-day running of governments.

Having generated this data, it’s natural for governments to consider giving citizens access to it. In some cases this is part of an overarching strategy to improve visibility into the workings of government. A good example is the “HoCoStat” system proposed by former Howard County Executive Allan Kittleman during his successful 2014 campaign.

In other cases government just takes data and makes it available without an overall strategy. After all, the data is being produced in digital form already, whether that be as Excel spreadsheets or in some other form, and the incremental work to make it publicly available may not be that large. For example, although the full HoCoStat system was never deployed, under Allan Kittleman Howard County did stand up a new OpenHoward site that collected data produced by various Howard County agencies. Somewhat confusingly, there is also a separate site, data.howardcountymd.gov, that hosts a variety of data provided by the Howard County GIS division. That project too appears to have been done as an incremental effort.

However, governments do not always make data available, or make it available only in inconvenient ways, for a variety of reasons. For example, some government agencies release data only in the form of PDF documents, the electronic equivalent of traditional paper reports. These can be relatively difficult to extract data from. In other cases data may be displayable on a public web site, but with no way to download it in a more convenient form.

But even here people have created automated ways to access data in odd formats, whether that be extracting tables from PDF files or “scraping” data off of web sites. The result is yet more data to add to that available from more convenient sources.

The downside of data

So with all this data available, and free ways to analyze it, are we living in utopia (at least as far as data analysis and visualization are concerned)? I don’t really think so: there are downsides to having lots of data to analyze just as there are downsides to not having it.

First, we tend to think that data is more accurate and reflective of reality than it actually is. For example, take the median household income estimate for Howard County and comparable estimates for other counties. In 2017 the estimate for Howard County median household income was $111,473 while the estimate for Stafford County, Virginia, was $112,795, or $1,322 more. This difference was enough to propel Stafford County into the list of top ten counties by median household income, and knock Howard County out of it.

But the margins of error for these estimates were $2,666 for Howard County and $5,081 for Stafford County. There’s therefore a good possibility that Howard County and Stafford County had pretty much equal median household incomes for 2017, and a fair chance that Howard County’s median household income was actually higher than Stafford County’s.
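
Here is a rough sketch of that reasoning in Python. It rests on assumptions the figures themselves don't state: that the quoted margins of error are at the 90 percent confidence level (the convention for Census Bureau American Community Survey estimates, if that is where these come from), that the two estimates are independent, and that a normal approximation is reasonable.

```python
import math

# Figures quoted in the text (2017 median household income estimates).
howard_estimate, howard_moe = 111_473, 2_666
stafford_estimate, stafford_moe = 112_795, 5_081

# ASSUMPTION: margins of error are at the 90% confidence level,
# i.e. 1.645 standard errors on either side of the estimate.
Z_90 = 1.645

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Difference between the two estimates, and the margin of error of that
# difference (root sum of squares, assuming independent estimates).
difference = stafford_estimate - howard_estimate
moe_of_difference = math.sqrt(howard_moe**2 + stafford_moe**2)

# Convert to a standard error, then ask how likely it is that the true
# difference has the opposite sign (Howard actually higher).
se_of_difference = moe_of_difference / Z_90
z_score = difference / se_of_difference
prob_howard_higher = 1.0 - normal_cdf(z_score)

print(f"difference: ${difference:,}")
print(f"margin of error of the difference: ${moe_of_difference:,.0f}")
print(f"chance Howard was actually higher: {prob_howard_higher:.0%}")
```

Under those assumptions the $1,322 gap sits well inside its own margin of error (roughly $5,700), and the chance that Howard County's true figure was the higher one comes out at about one in three.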

This failure to take margins of error into account is ubiquitous in people’s treatment of data (and I’ve been guilty of it myself). It’s not that significant an issue with respect to median household income estimates, but it can be a big deal indeed when it comes to data measurements that drive funding and personnel decisions, as with student test scores. It’s quite possible that many if not most of the reported test score increases and decreases that are alternately lauded or derided are actually just random year-to-year fluctuations that don’t reflect any underlying change in students’ ability to learn or teachers’ ability to teach.

School test scores provide another reason not to put too much faith in data: When data measurements are used to drive rewards and punishments, the temptation to game the measurements in various ways can be irresistible. With school test scores such gaming can range from “teaching to the test” up to outright fraud, as shown by scandals around the US. We now have to account not only for the possibility of random fluctuations, which are relatively benign in origin, but also for the degree to which the data might be fraudulently measured or reported.

Finally, in many cases we should question ourselves as to whether some data is actually useful, or should be used. For example, do student test scores actually tell us anything useful? Would it be better not to do student testing at all, or to restrict it to certain narrow purposes? One benefit of working directly with raw data, as opposed to consuming pre-cooked graphs and tables prepared by others, is that it can give you a good sense of the limits to what data can tell us.

Useful datasets for Howard County election analysis

Sun, 01 Mar 2015 07:00:17 -0500

tl;dr: I release two useful Howard County election datasets in preparation for future posts.

In the coming days and weeks I’ll be posting some analyses of Howard County election results. Unfortunately the data released by the Howard County Board of Elections and the Maryland State Board of Elections is not always in the most useful form for analysis. In particular I was looking for per-precinct turnout statistics for the 2014 general election in Howard County, along with some way to match up precincts with the county council district of which they’re a part. That data is available in the 2014 general election results per precinct/district published by the Howard County Board of Elections, but unfortunately that document is a PDF document.

PDF files are great for reading by humans, but lousy for reading by machines. They violate guideline 8 in the Open Data Policy Guidelines published by the Sunlight Foundation:

For maximal access, data must be released in formats that lend themselves to easy and efficient reuse via technology. … This means releasing information in open formats (or “open standards”), in machine-readable formats, that are structured (or machine-processable) appropriately. … While formats such as HTML and PDF are easily opened for most computer users, these formats are difficult to convert the information to new uses.

Since the data I wanted wasn’t in a format I could use, I manually extracted the data from the PDF document and converted it into a useful format (comma-separated values, or CSV) myself. Then since someone else might find a use for them, I published the files online in a datasets area of my GitHub hocodata repository. The first two files are as follows:

hocomd-2014-precinct-council.csv. This dataset maps the 118 Howard County election precincts to the county council districts in which those precincts are included.
hocomd-2014-general-election-turnout.csv. This dataset contains turnout statistics for each of the 118 Howard County precincts in the 2014 general election, including the number of registered voters and ballots cast in each precinct on election day.
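
As a rough illustration of how two files like these fit together, here is a Python sketch that joins a precinct-to-district table with a turnout table and totals turnout by council district. The column names and the sample rows are made up; the real files may be laid out differently.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-ins for the two CSV files (hypothetical column names).
PRECINCT_COUNCIL_CSV = """precinct,council_district
01-001,1
01-002,1
02-001,2
"""

TURNOUT_CSV = """precinct,registered,ballots_cast
01-001,2000,1000
01-002,1000,400
02-001,1500,900
"""

def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def turnout_by_district(precinct_rows, turnout_rows):
    """Join the two tables on precinct and compute turnout per district."""
    district_of = {r["precinct"]: r["council_district"] for r in precinct_rows}
    registered = defaultdict(int)
    ballots = defaultdict(int)
    for row in turnout_rows:
        district = district_of[row["precinct"]]
        registered[district] += int(row["registered"])
        ballots[district] += int(row["ballots_cast"])
    # Turnout is total ballots over total registered voters, not an
    # average of the per-precinct percentages.
    return {d: ballots[d] / registered[d] for d in registered}

turnout = turnout_by_district(read_rows(PRECINCT_COUNCIL_CSV),
                              read_rows(TURNOUT_CSV))
for district, share in sorted(turnout.items()):
    print(f"District {district}: {share:.1%}")
```

To try it on real data, the two inline strings would be replaced by the contents of the downloaded files, with the column names adjusted to match.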

Stay tuned for some interesting ways to use this data.

Walter Carson (wcarson@columbiaunion.net) - 2015-03-01 14:38

Thank you. As always, of interest. How might such data be used to look at the state legislative districts, if at all? Best wishes. WEC

hecker - 2015-03-01 19:50

See my future posts for some ideas on how this data might be used. Probably the first thing I’ll do is look at different county council districts to see if there seems to be any real difference in 2014 general election turnout between the districts. A similar analysis could be done for legislative districts, or at least those portions of the districts within Howard County. (A more complete analysis would need data from Carroll County, Baltimore County, etc.)

Fun with Howard County building permit data

Mon, 16 Feb 2015 18:53:59 -0500

tl;dr: I have fun creating graphs and maps with building permit data from data.howardcountymd.gov.

I’ve written previously about the cornucopia of interesting data sets that Howard County government has made available at the data.howardcountymd.gov site. I had some spare time over a long weekend and decided to try analyzing some of that data, including making use of the various map files on the site (under the “Spacial Data (GIS)” tab).

The particular data set I decided to start with was for building permits issued for residential and commercial construction—not because I have a burning interest in building permits but because I mentioned this type of data in my last post and thought it would be a relatively easy data set to analyze. The particular question I decided to look at was how many residential building permits were issued in each zip code within Howard County in 2014—basically to get a feel for where the most construction was occurring in the county. (It’s only an approximate measure because some permits cover multiple units.)

To do the analysis I used the skills and the tools I learned in the courses that are part of the Johns Hopkins data science specialization series on Coursera. (See my Coursera-related posts for more on my experiences in these classes.) I won’t go over the process here since I’ve separately published full details on my RPubs page, with the source code available in my hocodata GitHub repository.

I first created a simple table of the top zip codes for residential permits issued. This was sort of boring so I won’t reproduce it here; you can find it in the first example analysis I did. More interesting is the bar chart I created as part of the second example. It’s clear from the chart that there’s wide variation among Howard County zip codes in terms of residential construction. The two Ellicott City zip codes combined (21042 and 21043) accounted for the largest fraction of residential building permits in 2014; in contrast there were almost no permits issued for east Columbia (21045).
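
The counting step behind a table like that is simple enough to sketch. The version below is in Python, not the R code I actually used, and the column names and sample rows are invented for illustration rather than taken from the county's export.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the real export's columns may differ.
PERMITS_CSV = """permit_id,permit_type,zip_code,issued_date
1,Residential,21042,2014-03-01
2,Residential,21043,2014-04-15
3,Commercial,21045,2014-05-20
4,Residential,21042,2014-07-09
5,Residential,21044,2013-11-30
"""

def residential_permits_by_zip(csv_text, year):
    """Count residential permits issued in the given year, per zip code."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        is_residential = row["permit_type"] == "Residential"
        in_year = row["issued_date"].startswith(str(year))
        if is_residential and in_year:
            counts[row["zip_code"]] += 1
    return counts

counts = residential_permits_by_zip(PERMITS_CSV, 2014)

# most_common() lists zip codes from most to fewest permits.
for zip_code, n in counts.most_common():
    print(zip_code, n)
```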

However what I really wanted to create was a map showing exactly where permits were being issued across the county. The Howard County GIS division provides on data.howardcountymd.gov a set of map data for zip codes within Howard County. After doing a bit of research and experimentation, in my third example I was able to use this in conjunction with the building permit data to produce a map that is a nice alternative to the bar chart.

I have to stop here and ask the unspoken question: What’s the point of all this? I’d answer as follows:

First, this shows that releasing government data empowers people to do interesting things with it, especially when combined with free software and easily available online information and training. Maybe not everybody is interested in building permit data or any other individual government data set, but I suspect that there are a fair number of people out there who are, including small businesses, nonprofit organizations, or just individual activists and interested citizens.

Second, I did all this in a way that is completely reproducible by anyone else. How often have you seen a graph or map in a newspaper or government report and wondered, where exactly did that data come from? Wonder no longer: In my examples I start with the raw data as released by Howard County and show all my work in analyzing the data and creating the tables, charts, and maps.

Finally, this is all reusable and adaptable. For example, suppose you have a better source of data on construction activity, perhaps one that gives the actual numbers of residential units, commercial square footage, and so on. You can easily plug that modified data into the analysis steps I’ve documented, and create better versions of the charts and maps in my examples.

You can also reuse the overall technical approach for any type of data tied to a geographic area within Howard County. For example, in addition to zip code areas the data.howardcountymd.gov site contains map data for Howard County school districts, election precincts, census tracts, and many other subdivisions of the county. If you have data sets that are based on those subdivisions (for example, vote totals or turnout percentages for precincts) then you can adapt the code I wrote (all of which is in the public domain) to create your own maps showing how that data varies across the county.

The bottom line is that the data is out there for the picking, as are the tools to make sense of it. You just need to spend some time learning how to use them or (if you don’t feel up to the task yourself) finding someone who can. Have fun!

Howard County government by the numbers

Mon, 19 Jan 2015 09:00:00 -0500

tl;dr: As we wait to hear more about Allan Kittleman’s HoCoStat proposal, you don’t have to wait to download lots of useful county-related data at data.howardcountymd.gov.

During his (ultimately successful) campaign for Howard County Executive, one of Allan Kittleman’s key proposals was to establish HoCoStat, a program to (in Kittleman’s words), “measure . . . response and process times for various government functions” to help “increase responsiveness, improve efficiency and heighten accountability.” Kittleman’s administration is in its early days, and nothing much has been heard yet about how and when HoCoStat might be implemented. (Even the original HoCoStat proposal has disappeared from Kittleman’s web site as it’s being redesigned, although the Internet Archive has a copy.)

But don’t despair! While we’re waiting for HoCoStat to make an appearance there are other Howard County data-related resources we can explore. In particular, the data.howardcountymd.gov site has a good and growing collection of county-related datasets, many of them tied to county maps, which is no surprise since the site is maintained by the county’s Geographic Information System (GIS) Division. Part of what makes the site great is that it not only presents predefined maps and PDF documents but also provides the raw data used to create those maps.

For example, suppose you’re interested in building permits issued in Howard County. At the simplest level you can view an interactive map showing the locations for all such permits; you can click on the icons corresponding to the issued permits and see the exact address, date when the permit was issued, and other information.

But let’s suppose you want to do more in-depth analysis of permits issued: For example, which areas are seeing the most residential or commercial permits issued? Or, what is the trend for permits issued over time? The data.howardcountymd.gov site also lets you download the raw data behind the map in a variety of formats, for example in CSV format for use with Excel spreadsheets or statistical software like R, KML format for use with Google Maps and Google Earth, and several others. Armed with the relevant data files you can create your own maps and do your own analysis, including combining the Howard County data with data from other sources like US Census data.

All in all the site—which is still evolving—is a model for how Howard County government can make useful data available to the Howard County individual and corporate taxpayers who are ultimately paying for county services. It would be great to see this strategy extended to HoCoStat as well. For example, when promoting the HoCoStat proposal Allan Kittleman pointed to (among others) Montgomery County’s CountyStat site as a model to emulate. While CountyStat is very nice, it has the disadvantage that you can’t see the raw data behind the performance indicators.

For example, CountyStat has some summary statistics relating to issuance of building permits: average number of days to issue a residential permit, commercial permits for new construction, or other commercial permits. But there’s a lot more one might want to know: For example, what’s the variability in the time to issue permits? Are there some permits that for whatever reason took a really long time to issue? How does the time to issue permits vary across the county? Are there particular areas that (for whatever reason) are experiencing greater or lesser delays in getting permits issued? Having the raw data behind the indicators would permit (no pun intended) interested parties to answer these questions, from commercial developers doing large-scale projects down to a small contractor building a single home.
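
To see why an average alone can mislead, here is a small Python sketch using invented days-to-issue figures (they are not CountyStat or Howard County data). With the raw values in hand, questions about variability and unusually slow permits take only a few lines.

```python
import statistics

# Hypothetical number of days taken to issue each of ten permits.
days_to_issue = [12, 15, 9, 30, 14, 11, 95, 13, 16, 10]

mean_days = statistics.mean(days_to_issue)      # the usual headline figure
median_days = statistics.median(days_to_issue)  # the typical permit
stdev_days = statistics.stdev(days_to_issue)    # spread around the mean
slowest = max(days_to_issue)

# Flag permits that took more than twice the median time. The cutoff is
# an arbitrary choice for illustration, not an established standard.
unusually_slow = [d for d in days_to_issue if d > 2 * median_days]

print(f"mean: {mean_days} days, median: {median_days} days")
print(f"standard deviation: {stdev_days:.1f} days, slowest: {slowest} days")
print(f"unusually slow permits: {unusually_slow}")
```

In this made-up sample the mean is 22.5 days while the median is 13.5: two slow permits pull the average well above what a typical applicant experienced, which is exactly the kind of detail a single summary statistic hides.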

As I wrote in my previous post on Howard County government data initiatives, providing unfettered access to raw data (subject to reasonable concerns relating to individual privacy and corporate confidentiality) is key to making government data useful: It allows the private and civic sectors to exercise their own creativity in using that data, rather than trying to have government anticipate every possible use for it, and also lets the private and civic sectors hold government accountable by enabling them to do their own independent analyses of government data. It’s great to see what Howard County government (and the GIS Division in particular) has been and is doing to make useful data generally available. I hope that, as the Kittleman administration gets down to work and the HoCoStat program is implemented, that spirit of openness and commitment to serve citizens through government data continues.

Online competency-based education

Sun, 28 Sep 2014 08:00:00 -0400

Following up from my previous post on my experience with Coursera, here are a few links of interest (mostly) relating to online education, with a focus on “competency-based education,” i.e., education directed specifically at teaching people to become competent at one or more tasks or disciplines:

“Hire Education: Mastery, Modularization, and the Workforce Revolution” (Michelle Weise and Clayton Christensen). Clayton Christensen is famous for his theory of “disruptive innovation,” which I think is useful not so much as a proven theory but rather as a way to structure plausible narratives about business success or failure. When Christensen fails in his predictions it’s usually because he doesn’t pay attention to things that don’t fit neatly into his preferred narratives. For example, he and co-author Michael Horn previously hyped for-profit education companies and failed to see that for many of them actually educating students was not the point. Rather those companies identified a “heads I win, tails you lose” business proposition in “chasing Title IV money [i.e., government-subsidized student loans] in a federal financial aid system ripe for gaming.” This represents a second try by Christensen and his associates to forecast the future of post-secondary education.

“The MOOC Misstep and the Open Education Infrastructure” (David Wiley). One of Clayton Christensen’s blind spots is that he tends to overlook what’s going on in the area of not for profit endeavors. In his blog “Iterating toward Openness” David Wiley covers the general area of open educational resources (or OER); this post is a good introduction to his thinking.

Web Literacy Map (Mozilla project). A real-world example of the sort of competency-based open education initiative that Wiley’s promoting. See also the Open Badges project, a Mozilla-sponsored initiative to create an open infrastructure for granting and publishing credentials.

A Smart Way to Skip College in Pursuit of a Job (Eduardo Porter for the New York Times). “Nanodegrees” are online education provider Udacity’s own take on competency-based education, created in cooperation with major employers.

“Missing Links: How Coding Bootcamps Are Doing What Higher Ed and Recruiting Can’t” (Robert McGuire for SkilledUp). You may be beginning to see a trend here: A lot of the action in competency-based training is around software development, data science, and related fields. That’s because there’s high demand for skilled employees in certain fields and a lack of truly focused traditional educational offerings to meet that demand. A related trend: Sites like SkilledUp that are trying to become trusted guides to these new-style offerings.

Last but not least, here are some other people’s reviews of the Johns Hopkins Data Science Specialization courses on Coursera that I’m currently taking:

From a local point of view these changes (if indeed they continue and are amplified) are not likely to affect high-end universities like Johns Hopkins; they’ll survive based on their ability to select the most talented applicants and plug them into a set of networks that will maximize their chances of success.¹ The question is rather how they’ll affect institutions like Howard Community College that serve a broader student population that’s looking to acquire job-relevant skills.

¹ Note that from this point of view online offerings like the Johns Hopkins Data Science Specialization help to promote the institution and identify potential applicants. In fact, just this week I received an email from the Bloomberg School of Public Health inviting me to attend one of their “virtual info sessions” for people considering applying.

Adventures in online education

Tue, 09 Sep 2014 08:00:00 -0400

The last three months or so I’ve been in school (which is why I haven’t been posting as much lately). Not a real bricks-and-mortar school—I’ve been participating in the “Data Science Specialization” series of online courses created by faculty at the Johns Hopkins Bloomberg School of Public Health and offered by Coursera, a startup in the online education space. It’s been an interesting experience, and well worth a blog post.

The obvious first question is, why am I doing this? Mainly because I thought it would be fun. I was an applied mathematics (and physics) major in college, enjoyed the courses I had in probability, statistics, stochastic processes, etc., and wanted to revisit what I had learned and (for the most part) forgotten. It’s one of my hobbies, a (bit) more active one than watching TV or reading. Also, I’ve done some minor fiddling about with statistics on the blog (for example, looking at Howard County election data), am thinking about doing some more in the future, and wanted to have a better grounding in how best to do this. Finally, “data scientist” is one of the most hyped job categories in the last few years, and even though I probably won’t have much occasion to use this stuff in my current job it certainly can’t hurt to learn new skills in anticipation of future jobs.

The next question is, why an online course? Because I didn’t have the time (or the money) to commit to attending an in-person class, but I wanted the structure that a formal class provides. I’ve been (re)learning linear algebra out of a textbook for over four years now, and I still haven’t gotten past chapter 3. Part of the reason is that I’m doing every exercise and blogging about it, but mainly it’s that I don’t have an actual deadline to finish my studies. In the Coursera series there are nine courses, each lasting a month, with quizzes every week and course projects every 2-4 weeks depending on the course. I’ve been doing pretty well in the courses thus far and don’t want to spoil my record. For example, the first project in the current class was due Sunday but I was concerned about missing the deadline and so finished it last Friday night.

I like the way the series of courses is structured as well, not just as a class in statistics (only) but covering the whole range of skills needed to wrangle with data in its various forms, not least including the problems of getting datasets and cleaning them up. Each class thus far has only been a month long, so the time commitment is not that great and I know any work I do today will pay off in a completed course not too far down the road. It is a fairly serious commitment of time though, especially since the course video lectures cover only a fraction of what you need to know in order to do the course projects and correctly answer the more difficult quiz questions. I’ve probably spent almost 10 hours each week working on various aspects of the classes, including doing a copious amount of Internet searching to find out the additional information I need. But it’s been time well-spent: I feel like I’m getting a good understanding of how to do “data science” tasks—not that I know everything, but I have a much better picture of what I need to know, and what it would take to finish learning it.

The course I’m currently taking (“Exploratory Data Analysis”), like the others in the series, is what’s been referred to as a MOOC, or “massive open online course,” open at no charge to anyone in the world who wants to participate over the Internet. The instructors provide video lectures and create the quizzes and class projects but are not otherwise directly involved; the students provide help to each other in online discussion forums, assisted by “community TAs,” i.e., former students who volunteer as teaching assistants. MOOCs have recently been the subject of both hype and caution; now that I’ve been involved in them day-to-day I can provide a personal perspective on the controversy.

First, I think MOOCs are good for the sort of people who invented them in the first place: Internet-savvy folks with a technological bent who are motivated to learn something and have the necessary free time and background experience and knowledge to do so effectively. I’ve certainly appreciated having convenient no-charge access to a wide variety of classes, many of which (like the courses I’m taking now) have been put together by people who are leaders and innovators within their fields. I’d even consider paying for at least some of these courses (at $49 each) in order to get a more formal “verified certificate” (as opposed to a “statement of accomplishment”), and may do so for later courses within this series. That is potentially good news for Coursera, which in the end is a profit-making enterprise.

However, for people who are not Internet-savvy, not all that motivated, and don’t have the necessary background, MOOCs aren’t a good choice. In fact, they’re about the worst choice there is. The dropout rates in MOOCs are extremely high (well above 90% in many cases), and the first serious test of MOOCs as a replacement for in-person college courses (at San Jose State University) was not a raging success. Which is not to say that online learning in general is doomed; in its more traditional forms (for example, University of Maryland University College) it’s doing quite fine.

MOOCs are simply the latest in a long line of attempts to move away from the traditional classroom model and “disrupt” the existing educational establishment. They’ll eventually find a place in the overall educational picture, most likely serving a variety of needs, from “learning as hobby” (what I’m doing) to high-end vocational education (what Coursera competitor Udacity seems to be morphing into) to supplementing traditional classes. But that’s for the future, and no real concern of mine; in the meantime I’m just trying to learn how to plot in R.

Nina Basu - 2014-09-10 15:30

I LOVE Coursera!