Google’s Open Data Toolkit Mess/Wealth
Google, the company that aims to organize the world’s information, has quite a few tools in its portfolio that are geared towards (or can be used for) publishing structured data. Some even help visualize it. I wonder: how are they all interrelated? How can they cooperate? Which should be used when? And what is Google’s roadmap for them? Is it worth knowing them all? Or will they go where numerous Labs experiments have gone before?
But let’s take a brief look at each before we get all fuzzy.
For many, Google Docs is Excel without the license fee, but some use it to share structured data with others. The British news(-paper) website The Guardian, for example, is known for sharing public data on their Data Blog using the Google Docs spreadsheets module (a more recent example). It’s a pragmatic approach that serves them quite well. Of course, the Google Docs API allows for programmatic access, so Docs is not only about manual data entry. However, for those who favor human labor, Docs offers a form feature. For display and visualization, there are the built-in chart functions of Docs plus a growing ecosystem of both free and paid gadgets. And, not bad at all, Google Docs even comes with a chat function for real-time discussion.
However, the downside to using Docs for sharing public data is that Google Docs spreadsheets tend to be simplistic. Spreadsheets hardly support anything more complex than two-dimensional data tables. Additionally, Google Docs has no notion of a schema or metadata. Fields might be interpreted as numbers, strings and dates, but that’s about it. If you want to tell the spreadsheet that two floating point values are actually the latitude and longitude of a geo point, it’s not possible. Neither is telling a person’s name from a country name. So data in Google Docs isn’t self-explanatory. It always needs additional documents to describe it and applications to enforce certain rules when handling the data – or very cautious users. When it comes to sharing data with a crowd of potentially unlimited size, this can get problematic.
Google Fusion Tables
Fusion Tables appears to be one of the Google tools that only a few people know about. And when you try it for yourself, it’s hard to get excited about its user experience. Actually, it seems as if Fusion Tables hasn’t (yet?) been exposed to UI designers at all.
But: The purpose of this app is exactly what open data people should be looking for. It’s about defining data tables (hence the name), filling them up with data, maintaining the data collaboratively, discussing it, visualizing it and “merging” it with other open data tables. SQL users, think JOIN.
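For readers who don’t speak SQL, here is a rough sketch of what “merging” two tables on a shared column means. It uses Python’s built-in sqlite3 module rather than Fusion Tables itself, and the table names, column names and values are all made up for illustration.

```python
import sqlite3

# Two tiny in-memory tables sharing a "region" key column.
# All names and numbers are placeholders, not real data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (region TEXT, visitors INTEGER);
    CREATE TABLE contacts (region TEXT, contact TEXT);
    INSERT INTO visits VALUES ('North', 10), ('South', 20);
    INSERT INTO contacts VALUES ('North', 'Ann'), ('South', 'Bob');
""")

# "Merging" the tables on the shared key is a JOIN in SQL terms:
# each row of one table is paired with the matching row of the other.
merged = conn.execute("""
    SELECT v.region, v.visitors, c.contact
    FROM visits AS v
    JOIN contacts AS c ON c.region = v.region
    ORDER BY v.region
""").fetchall()

print(merged)  # [('North', 10, 'Ann'), ('South', 20, 'Bob')]
```

The result is one wider table that carries the columns of both sources, which is roughly what a merge of two public tables gives you.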
I won’t judge any of the tools here by the data that’s currently available there. But it’s interesting to see how people use the service. One example is a list of MD5-hashed email addresses affected by the Gawker password incident, so people can find out if their password has been compromised. The way the table authors created a “Using this spreadsheet” column clearly shows that Fusion Tables needs more ways to provide general table information, instructions, attribution and such.
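To illustrate how such a lookup works: you compute the MD5 digest of your own address and check whether it shows up in the published list. The following is a minimal sketch. The normalization step (trimming and lowercasing) is my assumption about how the addresses were prepared, and the function names and the sample address are hypothetical.

```python
import hashlib

def md5_of_email(email: str) -> str:
    """Return the hex MD5 digest of an email address."""
    # Assumption: addresses were trimmed and lowercased before hashing.
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def is_listed(email: str, published_hashes: set) -> bool:
    """Check whether the digest of `email` is among the published ones."""
    return md5_of_email(email) in published_hashes

# Stand-in for the digests published in the table (example address only).
published = {md5_of_email("alice@example.com")}

print(is_listed(" Alice@Example.com ", published))  # True
print(is_listed("bob@example.com", published))      # False
```

The point of publishing digests instead of plain addresses is that you can check your own address without the list handing everybody else’s address to spammers.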
Naturally, there is an API for read and write access to Fusion Tables. People who know SQL will likely feel comfortable with it, since one basically interacts with Fusion Tables via a subset of SQL, issued via REST web service queries.
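To give an idea of what “SQL issued via REST” looks like in practice, here is a small sketch that packs a SQL statement into a query URL. Note that the endpoint below is a placeholder and not the real Fusion Tables address, and that the parameter name `sql`, the table identifier and the column names are assumptions made up for this illustration. Check the API docs for the actual values.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint, NOT the real Fusion Tables API address.
QUERY_ENDPOINT = "https://example.invalid/fusiontables/query"

def build_query_url(sql: str) -> str:
    """URL-encode a SQL statement into a GET query string."""
    # Assumption: the statement travels in a parameter named "sql".
    return QUERY_ENDPOINT + "?" + urlencode({"sql": sql})

# Hypothetical table identifier and column names.
statement = "SELECT name, score FROM 12345 WHERE score > 10"
url = build_query_url(statement)
print(url)

# Decoding the query string gives back the original statement.
decoded = parse_qs(urlsplit(url).query)["sql"][0]
print(decoded == statement)  # True
```

Fetching that URL with any HTTP client would then return the matching rows, so the whole database interaction boils down to ordinary web requests.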
As for built-in visualization tools, Fusion Tables offers a set that seems to be growing quickly. Currently there are table, map, intensity map, line chart, bar chart, pie chart, scatter plot, motion, timeline and storyline. People familiar with Google’s Visualization API know these. A special benefit of the maps module in Fusion Tables is the easy handling of thousands of data points on a map. For an example that lets you try out all available visualization methods, open the Wikileaks Afghan War Diary table.
A mystery to me is how editing rights can be controlled. It seems to me as if only the owner of a table can edit data, even if the table is public. However, a basic comment function allows others to annotate single values or rows.
Google Public Data Explorer
Public Data Explorer has been around as a Labs product for a while, allowing users to explore a couple of datasets prepared by Google. But, just recently, it was announced on the Official Google Blog that Public Data Explorer is now open to user submissions. If you’re interested, the Nieman Lab blog has additional insights about the new launch.
For that purpose, a new data exchange format called Dataset Publishing Language (DSPL) has been released.
DSPL is worth a closer look, since it tackles many of the problems mentioned above: lack of schema, lack of metadata, lack of self-descriptiveness and being limited to low dimensionality. In DSPL, the meta information about the dataset is described in an XML file. This file contains general information and describes “concepts”. For example, the DSPL package about European unemployment data (available from the code site) defines – among others – a concept of countries and of country_groups. It also contains relationship information telling us that country_groups consist of countries. These concepts are then linked to data tables which are provided as separate CSV files.
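To make the idea of “concepts in XML, data in CSV” a bit more tangible, here is a heavily simplified sketch. The XML below is only inspired by DSPL: the element and attribute names are my own illustration and do not follow the real DSPL schema, and the tables are inlined as strings instead of being read from files.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, DSPL-inspired metadata. Element and attribute names are
# illustrative only and do NOT match the real DSPL schema.
METADATA = """
<dataset>
  <concept id="country_group" table="country_groups.csv"/>
  <concept id="country" table="countries.csv" parent="country_group"/>
</dataset>
"""

# Stand-ins for the separate CSV files shipped next to the XML file.
TABLES = {
    "country_groups.csv": "country_group,name\neu,European Union\n",
    "countries.csv": "country,name,country_group\nde,Germany,eu\nfr,France,eu\n",
}

def load_concepts(xml_text):
    """Map each concept id to its data table and optional parent concept."""
    root = ET.fromstring(xml_text.strip())
    return {
        c.get("id"): {"table": c.get("table"), "parent": c.get("parent")}
        for c in root.findall("concept")
    }

def rows_for(concept, concepts):
    """Read the CSV table linked to a concept as a list of dicts."""
    text = TABLES[concepts[concept]["table"]]
    return list(csv.DictReader(io.StringIO(text)))

concepts = load_concepts(METADATA)

# Use the declared relationship to group countries by country group.
parent = concepts["country"]["parent"]
members = {}
for row in rows_for("country", concepts):
    members.setdefault(row[parent], []).append(row["name"])

print(members)  # {'eu': ['Germany', 'France']}
```

Because the relationship between countries and country groups is declared in the metadata, a tool can do the grouping without anybody having to explain the columns to it first.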
Compared to Google Docs, preparing data for Public Data Explorer will naturally consume more time upfront, since a lot of decisions have to be made on the schema level and, depending on your data, many relationships need to be modelled into the schema. On the other hand, this makes visualizing, grouping and filtering that data a lot easier.
When it comes to visualization, Public Data Explorer offers a limited, but well-optimized set of tools. All visualization forms like bar charts, scatter plots, maps etc. can be used to depict change over time (as known from Gapminder).
One thing that seems to be missing here, though, is an API for programmatic access to data. As of now, data has to be uploaded manually in the form of DSPL files. This will make it difficult, if not impossible, to keep datasets updated over a longer period of time. Also, there is no table view and no export mechanism that people could use to access the raw data.
So, for now, Public Data Explorer is a way to quickly make visualizations of data accessible to the public. But it doesn’t serve as a platform to distribute raw data to the public. In that sense, it is not as open as Fusion Tables or Docs.
Google Base started in November 2005 as a universal approach for storing structured data. The last mention of the platform in the official Google blog dates back to June 2009. Back when it started, Base was not at all a typical Google product. Until then, the mantra of the search giant was: just dump your data to the web and we will make sense of it. They didn’t care about semantics all that much, or at least they wished they wouldn’t have to. But Google Base took a different approach. At first, Base seemed to be just the backend of Google’s product search (called Froogle back then), opened to the public for uploading their own stuff. But they also had pre-defined templates to add certain object types like recipes, job offers and other classifieds. Plus users could, and still can, create their own templates/schemas.
While the product search has since evolved into the Google Merchant Center, the rest of its original purpose is no longer visible. The user interface for creating new data feeds, as they call them, still exists. But there seems to be no outlet for that data. Other than the API, there is no way to display/search/access that data. This used to be different.
Unfortunately, it’s completely unclear to me what purpose Base serves these days.
Freebase is the creation of a company called Metaweb, which was acquired by Google in July 2010. As the name indicates, Freebase is a database that is free to use. The main concept of Freebase is that each item (entity) can be linked to other items. For example, a specific movie could be one entity and a director could be another entity. Of course, the director can be linked to the movie he/she directed. The schema explorer gives an overview of all the domains and types that a schema already exists for. And the data sources overview lists the (or some of the) sources of data currently contained in Freebase.
It appears that the Freebase concept is comparable to Wikipedia in the way its content is curated by a community. People (and bots) are obviously watching new data entries and taking care that no two entities exist which describe the same thing. To get a grasp of the somewhat complex concept, I recommend reading the overview texts available in the API docs.
If there is a spectrum of well-definedness of data and Google Docs is on one end of this spectrum (namely the loosely defined end), then Freebase is on the very opposite end. This comes with benefits as well as with a cost. The downside is that the effort to get data into the system might be pretty high. I have to admit that I haven’t tried it myself so far.
So Google provides a number of different tools that could help you publish structured data. Except for Google Base, which looks like an abandoned stepchild right now, all serve their own purpose.
As a quick overview, in the spirit of open data, I created this Google spreadsheet:
It’s public for read/write, so feel free to work with it.
When looking at these tools from outside Google, it’s hard not to wonder: Do we need all these tools with their overlap of functionality? Why don’t these folks sit down together and merge their code? Or at least make it easier for their various tools to interoperate? Actually, it would be great if one could use Fusion Tables as data storage for Public Data Explorer instead of all these static CSV files. Or if Fusion Tables had the same fine-grained control over who can read/edit data as Docs has. Or if creating a Fusion Table was as painless as creating a Google Docs spreadsheet.
On the other hand, Google has been criticized a lot for becoming a big and slow behemoth. Trying to align all these tools (and teams) with each other in the first place might have had the downside that many of these tools wouldn’t even have been launched by now. So I decide to be grateful for what’s there already and give suggestions for improvement.
Man, these are pretty exciting times with the Google/Metaweb acquisition only a couple of months ago, Google just having opened Public Data Explorer and Fusion Tables being under active development, too. I’m curious to see how these tools develop and what they come up with next.
BTW, I’m happy to get feedback. Please let me know if I missed something, got something wrong or if you have additional insights or a different opinion.