Is Campaign Finance Data Unusually Dirty Data? At first glance, it sure seems that way.
In an idle moment, just poking around looking at different data files, I decided to load some campaign finance data from New York’s Open Data site. Just go to the site, search on “elections,” pick a file and see what you get. I looked at a couple of different files. The analysis below, which is typical, is from the file “Campaign Finance Expenditures Submitted to the New York State Board of Elections Beginning 1999.” (Note: though I have a bunch of questions, I did not call the Board of Elections. I probably will, but I don’t think it’s necessary before playing with the issues discussed below.)
Given the issues around campaign finance, should we be at all surprised that the data appear especially dirty? I don’t mean this in a political sense, but in a geek sense.
What a mess:
- Misspellings and different spellings of the same names
- Incomplete data
- Non-existent data
- Inconsistent date formats
- Invalid data
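A couple of these problems are easy to profile with a few lines of code. The following is only a rough sketch, not the analysis I actually ran: the helper names (`date_shape`, `name_key`, `group_variants`) and the sample values are made up for illustration. It reduces each date to its "shape" so that mixed formats stand out, and groups names that differ only in case, punctuation, or spacing. It will not catch true misspellings, which need fuzzy matching.

```python
import re
from collections import Counter, defaultdict


def date_shape(value):
    """Reduce a date string to its shape: digits become 9, letters become A."""
    value = value.strip()
    if not value:
        return "(empty)"
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def name_key(name):
    """Crude normalization: lowercase, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def group_variants(names):
    """Return only the normalized keys that appear under more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name.strip())
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


# Made-up sample values, purely for illustration.
sample_dates = ["01/02/1999", "1999-01-02", "Jan 2, 1999", ""]
sample_names = ["Acme Printing, Inc.", "ACME PRINTING INC", "Main St. Deli"]

shape_counts = Counter(date_shape(d) for d in sample_dates)
variants = group_variants(sample_names)
print(shape_counts)
print(variants)
```

More than one shape in the date column, or more than one spelling under the same key, is a quick signal that the column needs cleaning before any totals can be trusted.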
Here’s an example of an easily avoidable problem: identifying the state in the contributor’s address. It should be pretty easy to get that one right. Right?
Yet, over the fifteen-year period, 1999 through 2014, 9.0 percent of the records (over 195,000 of them) did not even list a state. Those records were associated with reported contributions (perhaps accurate, perhaps valid, but perhaps not) of over $131 million, about 4.7 percent of the reported contributions. Still more records listed a state that was clearly invalid, so fewer than 91 percent of all records had a valid state identifier. Can you imagine if the Post Office had an error rate like that?
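For anyone who wants to run the same kind of check, here is a minimal sketch of the state tally. It is not the code behind the numbers above. The column names `State` and `Amount` are assumptions (the real file’s headers may differ), the valid list covers only the 50 states plus DC, and the four sample rows are invented.

```python
import csv
import io

# Two-letter codes for the 50 states plus DC. Territories and military
# codes are left out, which is an assumption of this sketch.
VALID_STATES = frozenset(
    "AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME "
    "MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI "
    "SC SD TN TX UT VT VA WA WV WI WY".split()
)


def state_quality(rows, state_col="State", amount_col="Amount"):
    """Tally missing, invalid, and valid state codes, by record and by dollars."""
    total = missing = invalid = 0
    total_amt = missing_amt = 0.0
    for row in rows:
        state = (row.get(state_col) or "").strip().upper()
        raw = (row.get(amount_col) or "0").replace(",", "").replace("$", "")
        try:
            amount = float(raw)
        except ValueError:
            amount = 0.0  # an unparseable amount counts as zero dollars
        total += 1
        total_amt += amount
        if not state:
            missing += 1
            missing_amt += amount
        elif state not in VALID_STATES:
            invalid += 1
    if total == 0:
        raise ValueError("no rows to analyze")
    return {
        "records": total,
        "missing_pct": 100 * missing / total,
        "invalid_pct": 100 * invalid / total,
        "valid_pct": 100 * (total - missing - invalid) / total,
        "missing_amount_pct": 100 * missing_amt / total_amt if total_amt else 0.0,
    }


# Invented sample rows standing in for the real file.
sample_csv = "State,Amount\nNY,100.00\n,50.00\nny,25.00\nXX,25.00\n"
report = state_quality(csv.DictReader(io.StringIO(sample_csv)))
print(report)
```

To try it on a real download, pass `csv.DictReader(open(path, newline=""))` in place of the sample and adjust the two column names to match the file’s header row.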
Why is it this bad? A few theories:
- No one involved in preparing, submitting, and reviewing (never mind analyzing) the data has a stake in clean data. Indeed, some might even be advantaged by dirty data, as it clouds and muddies what might otherwise be evident.
- There’s little or no penalty for inadequate data.
- Campaigns tend to be short-term affairs, especially losing ones. So even if inclined to get it right, there’s no opportunity for improvement.
Well, you can add your own theories. I have others. They’re less geeky and much more cynical.