Monday, November 28, 2011

Open Data Initiatives - Growth or Failure?

One of the key areas of growth I have been watching is the birth of the open data movement.  The Open Data Foundation is an important group whose mandate includes helping many agencies migrate to an open data architecture.  (Disclosure: I sit on the Advisory Board.)  As a standards guy for more than a decade, I find their mandate admirable:

"The Open Data Foundation is committed to using and contributing to international standards and is a project- and results-focussed organisation. We believe in using open standards to deliver measurable benefits in solving business problems in the collection, production, and dissemination of statistics. The aim is use and integrate these standards in a coherent and consistent way, to develop tools and techniques to make them easy to use, and thereby work towards a universal and harmonised statistical architecture."

As noble as it sounds, the devil is often in the details.  I have a meeting later today with David Eaves, an Open Data guru who has worked with many governments, including the Government of British Columbia, Vancouver City Council (as championed by Andrea Reimer), and even at the national level.  The gist of the work is this: if we, as taxpayers, are footing the bill to create this data, shouldn't we be able to use it to make informed decisions?  The answer to me is an overwhelming yes.  So what about the details then?

If you take a look at the data publishing done by most agencies, it is often in mixed standards: flattened PDFs, spreadsheets, custom CSV (both text and binary), XML, and more.  Some of these formats are easy to work with, but trying to parse a spreadsheet with a non-deterministic structure is a daunting task.

I recently took on such work for a proof of concept for the government of British Columbia.  I worked with some CSV data from this site and created a mobile application that runs on iOS, Android, and the BlackBerry tablet.



One of the things that jumps right out is that open data needs a specialized set of Message Exchange Patterns: the published data has to be deterministic, available in multiple formats (such as JSON and XML), and callback and notification support is required in the event the data changes.  As an example, this data is published as static files.  As soon as I use it by embedding it, my copy can become obsolete if the original changes.
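Until publishers offer real callbacks, the best a consumer can do is poll and compare.  A minimal sketch of that idea (the sample rows here are invented, not from the actual BC data set): fingerprint the copy you embedded, then hash whatever you fetch later to detect a silent upstream change.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Hash a published file so a consumer can detect silent updates."""
    return hashlib.sha256(data).hexdigest()

# An app that embeds open data stores the fingerprint alongside its
# cached copy, then periodically re-fetches and compares.
cached = b"Duane,Nickull,Human,Vancouver\n"
cached_fp = fingerprint(cached)

fetched = b"Duane,Nickull,Human,Victoria\n"  # upstream quietly changed
stale = fingerprint(fetched) != cached_fp
```

This is pull, not push: it tells you *that* the data changed, never *when*, which is exactly why publisher-side notification support belongs in the architecture.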

Another issue is that spreadsheets are not deterministic.  If you export a spreadsheet to CSV, you can end up with something like this:


Duane, Nickull, Human, Vancouver
Bill, Gates, Human, Seattle

This is an annotation.  No one knows how to account for it if it changes, nor how many lines it will take.  Sometimes naturally occurring commas can appear inside an annotation as well.  OMG - what can be done?

Second, Set, Of, Data
Third, Set, Of, Data

this can cause major headaches for anyone trying to parse the data.  XML is far better, though the XML I worked with came with issues of its own.


These are relatively small data sets, too.  Imagine large data sets being requested by mobile devices.
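A data server built for this would, at minimum, paginate so a handset only pulls what it can render over a cellular link.  A hypothetical sketch of the server-side slicing (the page size and the stand-in records are invented):

```python
def page(records, page_num, page_size=50):
    """Return one page of records plus a has-more flag, so a mobile
    client never has to download the entire data set at once."""
    start = page_num * page_size
    chunk = records[start:start + page_size]
    has_more = start + page_size < len(records)
    return chunk, has_more

# A client walks the pages until has_more comes back False.
records = list(range(120))        # stand-in for a large published data set
first, more = page(records, 0)
```

Pagination only helps with size; it does nothing for the staleness and determinism problems above, which is why a purpose-built server is the more interesting idea.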

What this industry requires is a new type of data server that can address these problems.  Ideas are already rolling around in my head.




Do not spam this blog! Google and Yahoo DO NOT follow comment links for SEO. If you post an unrelated link advertising a company or service, you will be reported immediately for spam and your link deleted within 30 minutes. If you want to sponsor a post, please let us know by reaching out to duane dot nickull at gmail dot com.