By tptacek - 3 days ago
This, to me, is so neat.
tptacek - 3 days ago
That's some small-government activism I can get behind!
floatrock - 2 days ago
Under Florida public records law, source code produced by state employees is, in very narrow circumstances, a non-exempt public record (the code can't process sensitive data, etc.). I'm considering a future endeavor where I periodically request the code for such projects until the IT department decides it's worth the effort to open source it.
I like to think this is a step towards consolidating publicly funded code and reducing duplicate effort. Ahh, imagine making a pull request to your city's website! But I'm getting ahead of myself...
hpincket - 2 days ago
Hundreds of thousands of records a month. I ended up importing them into Excel(1) and then using... what was that called? An MS/Windows library that came with IE 5 (and a few other things) and provided regex support, with a few quirks, accessible via VBA.
The point was, I could programmatically mine it -- including regex pattern matching and replacement of and within cell contents -- while also having a flexible UI within which to find and handle one-off cases. When the one-offs demonstrated a repeating pattern, I could quickly iterate and add that pattern to the programmatic mining logic.
This included adding color cueing for items of particular interest or needing manual follow-up, and using Excel's sorting capabilities to bring potentially related instances into visually displayed groups. And the like.
It ended up working quite well. I might have preferred something else to VBA, and I did use Perl and other stuff, elsewhere (something that also gave me both power and the flexibility to rapidly iterate).
But the point is, with such data, I found it very useful to combine regex and rapid programmatic manipulation, together with a good visual interface (including visual cues, the ability to comment upon instances -- Excel cell-level comments -- etc.) and manual manipulation.
As a final aside, the extensive set of Excel keyboard shortcuts greatly aided in rapidly and effectively navigating and massaging the imported data.
--
1. This was back when Excel had... I think it was a 64K (or a bit less) limit on the number of rows in a sheet.
P.S. I tended to retain the originally imported data in its columns, and to produce my mining of it in other columns. That way, I could always and immediately see what I started with, for any particular record. (And, if things visually started to be "too many columns", well, Excel lets you hide a range of columns from the view. As one example of how its features really helped, on the visual front while doing this work.)
I still had to learn and allow for some quirks Excel exhibited with respect to importing text data. That included making sure the cells/columns being imported into carried the correct/needed formatting designation before importing into them (usually, "Text").
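The keep-the-original, derive-alongside approach described above translates to any scripting language. Here's a minimal Python sketch of the same idea -- retain the raw record untouched, mine derived fields with regexes, and flag rows that need manual follow-up (the record formats, field names, and regexes below are invented for illustration, not taken from the actual data):

```python
import re

# Hypothetical raw records, standing in for rows imported from a monthly dump.
records = [
    "ACME Corp  inv#10023  $450.00",
    "acme corp INV 10023 450",
    "Widgets LLC inv#88  $12.50",
]

INVOICE_RE = re.compile(r"inv\s*#?\s*(\d+)", re.IGNORECASE)
AMOUNT_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")  # require the "$" to avoid matching IDs

rows = []
for original in records:
    inv = INVOICE_RE.search(original)
    amt = AMOUNT_RE.search(original)
    rows.append({
        "original": original,                        # kept as-is, like the source columns
        "invoice": inv.group(1) if inv else None,    # derived, like the mining columns
        "amount": float(amt.group(1)) if amt else None,
        "flag": inv is None or amt is None,          # analogue of color-cueing one-offs
    })

for row in rows:
    print(row["invoice"], row["amount"], "FLAG" if row["flag"] else "")
```

When a flagged one-off turns out to follow a pattern (here, the second record's missing "$"), you extend the regexes and re-run -- the same iterate-on-exceptions loop the comment describes, just without the spreadsheet UI.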
pasbesoin - 3 days ago
If it is annual: they got 17M tickets over 7 years, so over 10 years, assuming they issue just over 19M tickets, each parking ticket needs to be at least $10 to cover the cost. Even at $100 per ticket, is IBM banking on a 10% share? That seems excessive to me, but I never worked in government, so could someone enlighten me on this?
Is there, by any chance, a conflict of interest that makes government unwilling to make improvements that cut down parking tickets or any other similar source of income? Or maybe that's what public audits are for?
Bobbleoxs - 2 days ago
I give this props. I'm sure it required a ton of work to get the FOIA-requested data -- this, I'm assuming, was done in the same painstaking way. I wrote a blog post about it:
https://austingwalters.com/foia-requesting-100-universities/
lettergram - 2 days ago
Did you give more thought to the address cleaning bit? Or does anyone have an idea how to go about transforming mangled addresses into coordinates?
I have a problem that's been bothering me for months, similar to what you have here: people at an emergency-service call center are inputting the addresses of the emergencies. For emergencies that happen in the public domain, there is often no specific address, but rather names of landmarks. Something like "Street StreetName / Opposite Train Station Y", which can be written as "st stName / opp tr st y" or any of infinitely many other variations.
I don't have any after-data to corroborate, but I do have previous instances where the operator inputted the same address better. If I can extract the correct landmarks, I think I can do a Google Places search for them, with a cleaned query, like "Store Amazon, Best Street, Ohio" to get coordinates that can fall into an acceptable area.
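One cheap first pass before the Places search is a token-level abbreviation expansion. A minimal Python sketch, where the abbreviation table is invented for illustration -- a real one would be built by studying how operators actually shorten words in the call logs:

```python
import re

# Hypothetical abbreviation table (illustrative only).
ABBREVIATIONS = {
    "st": "street",
    "opp": "opposite",
    "tr": "train",
}

def expand(raw: str) -> str:
    """Expand known abbreviations token by token, leaving other words alone."""
    tokens = re.split(r"(\W+)", raw)  # capturing group keeps separators, so spacing survives
    return "".join(ABBREVIATIONS.get(t.lower(), t) for t in tokens)

print(expand("st stName / opp tr st y"))
# -> "street stName / opposite train street y"
```

Note the last "st" really meant "station" -- a flat lookup can't resolve that ambiguity, which is exactly where the previous, better-entered instances of the same address would help disambiguate before querying a geocoder.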
PS: in the example you gave with Lake Shore Drive, I think you could easily correct the names with an algorithm based on the Levenshtein distance
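A Levenshtein-based correction pass along those lines might look like the following sketch. The canonical street list here is invented; a real one would come from the city's open street data:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Hypothetical canonical list (illustrative only).
CANONICAL = ["Lake Shore Drive", "Michigan Avenue", "Wacker Drive"]

def correct(name: str) -> str:
    """Snap a possibly mangled name to the nearest canonical street name."""
    return min(CANONICAL, key=lambda c: levenshtein(name.lower(), c.lower()))

print(correct("Lakeshore Drv"))  # -> "Lake Shore Drive"
```

In practice you'd also want a distance cutoff so that a name far from everything in the list is left unmatched rather than snapped to the least-bad candidate.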
kioleanu - 2 days ago