Marian Steinbach: Blog

What I learned about CouchDB

2011/11/11 (8 comments)

With many people talking about CouchDB, I got curious and took a closer look. Map/Reduce in particular made me want to find out if CouchDB would be a good solution to create aggregated statistics from a large number of records. A presentation on SlideShare caught my eye, especially the structured keys and group levels (see slide 43). Since I am now maintaining three growing time series databases (radiation data from Japan and Germany plus air quality data) I wondered if CouchDB would be an option to store and process that data.

Short disclaimer: I am by no means a CouchDB, NoSQL or DBMS expert. I usually make MySQL do the things I want, without ever having crossed the boundaries of one server. All I did was some testing; I didn’t go into production. The reason I post my experience anyway is to foster discussion, learn from it and help others save their time.

Data import

I imported data from CSV via couchdbkit (Python). A CSV file contained about 850,000 rows. Example row:

"1150000004";"2011-08-15 16:30:00";"38";"0"

The script I used is https://gist.github.com/1357839. Note that it makes use of bulk_save() to write 10,000 rows at once. A dry run without writing objects to CouchDB took 50 seconds. The actual run took 3 minutes 56 seconds. In comparison, MySQL needs 45 seconds to import the same data, including creating several indexes.

This is what the row above looks like as a CouchDB document:

{
   "_id": "003fd8eca823e1b46995ef4ca3000ec5",
   "_rev": "1-6770700eb84c2df9f5eef3bd24348723",
   "doc_type": "Measure",
   "sa": 38,
   "ra": 0,
   "station_id": "1150000004",
   "datetime": [ 2011, 8, 15, 16, 30, 0 ]
}

The doc_type: “Measure” is added by couchdbkit according to the class I used for the objects in Python. This is of course useful to distinguish different types of documents within the same database.

The “datetime” field is represented as a list because I wanted to mimic the behaviour I saw in the presentation mentioned above. It wouldn’t be necessary though. One could also code this field as a DateTimeProperty and later use the map function to create a structured key as needed.
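The conversion and batching described above can be sketched roughly as follows. This is not the script from the gist, just a standard-library illustration. The helper names row_to_doc and batches are made up, and the column order (station_id, datetime, sa, ra) is an assumption read off the example row and document. Writing to CouchDB is left out on purpose.

```python
import csv
import io
from datetime import datetime
from itertools import islice


def row_to_doc(row):
    """Turn one CSV row into a dict shaped like the document shown above."""
    station_id, timestamp, sa, ra = row  # assumed column order
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "doc_type": "Measure",
        "station_id": station_id,
        # Date as a list, so a map function can emit it as a structured key.
        "datetime": [dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second],
        "sa": int(sa),
        "ra": int(ra),
    }


def batches(iterable, size):
    """Yield lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


sample = io.StringIO('"1150000004";"2011-08-15 16:30:00";"38";"0"\n')
reader = csv.reader(sample, delimiter=";", quotechar='"')
docs = (row_to_doc(row) for row in reader)
for batch in batches(docs, 10000):
    # In a real import, this is where the batch would be written in bulk.
    print(len(batch), batch[0]["datetime"])
```

Building the batches lazily keeps memory use flat, no matter how many rows the CSV file contains.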

Storage consumption

After importing my first chunk of 850,000 rows, the CouchDB database had a size of 283 MB. I tried compacting it, which caused CouchDB to increase the use of disk space up to 569 MB, only to reduce it to 288 MB. (If you do the math, you can see that the required disk space for compaction equals the space used by the DB before compaction plus the compacted one. This is why it is advisable to run multiple smaller instances of CouchDB on the same server. Otherwise CouchDB can become a big, unmanageable behemoth.) As for the amount of disk space used, I’m impressed in a bad way. Disk space might be cheap, but that doesn’t make it unimportant.
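The math mentioned in the parentheses, spelled out as a tiny sketch. The variable names are made up; the numbers are the ones reported above.

```python
size_before_mb = 283  # database size after the import
size_after_mb = 288   # database size once compaction had finished

# During compaction the old file and the new, compacted file exist side by side.
expected_peak_mb = size_before_mb + size_after_mb
print(expected_peak_mb)  # 571, close to the observed peak of 569 MB
```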

The graph above illustrates the relation of disk space required by CSV, MySQL and CouchDB for roughly the same data.

The interesting thing here is that this is only the beginning. I haven’t created any views yet. Views can easily become multiple times as big as the underlying document store.

Querying the data

In CouchDB, views are what queries are in SQL. To start with something simple, I opened Futon, the CouchDB Admin interface, and created a temporary view based on this simple map function:

function(doc) {
   emit(doc.station_id, doc.sa);
}

and this reduce function:

_count

Creating this temporary view took about 4 minutes. This is kind of stunning. Compared to SQL, what this view does is similar to this:

SELECT station_id, COUNT(*) FROM measures
GROUP BY station_id

On the same machine (my MacBook) this takes 640ms, without any indexes.
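To make the comparison more tangible, here is a rough emulation in plain Python of what the map function and the _count reduce compute together. It only illustrates the logic and says nothing about how CouchDB builds its index. The sample documents and the names map_doc and count_by_key are made up.

```python
from collections import Counter

# Made-up sample documents, reduced to the fields the map function reads.
docs = [
    {"station_id": "1150000004", "sa": 38},
    {"station_id": "1150000004", "sa": 41},
    {"station_id": "1150000005", "sa": 12},
]


def map_doc(doc):
    """Counterpart of the map function: one (key, value) pair per document."""
    yield doc["station_id"], doc["sa"]


def count_by_key(documents):
    """Counterpart of the _count reduce, grouped by key."""
    counts = Counter()
    for doc in documents:
        for key, _value in map_doc(doc):
            counts[key] += 1
    return dict(counts)


print(count_by_key(docs))  # {'1150000004': 2, '1150000005': 1}
```

The emitted value is ignored here, just as _count only counts rows per key.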

It seems that my database containing 850,000 documents is much too big for development purposes. The actual, untruncated database that I took the test data from currently contains 7.8 million rows. How would I test and get an idea of response times at that scale? Do I need a cluster to find out?

After several coffee breaks, I had finally created one meaningful view with a reduce function that gave me daily, weekly or hourly mean values. Once the view was built, querying it seemed quite fast. But, honestly, I didn’t bother to measure. After creating two simple views, the disk consumption was at 1.6 GB.
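The idea behind structured keys and group levels can be sketched in plain Python. This is not the view I built, only an emulation of how a group level cuts the key down to a prefix before values are aggregated. The rows and the function name mean_by_group_level are made up, and weekly values would need a key containing the week, which is left out here.

```python
from collections import defaultdict

# Made-up rows as a map function might emit them:
# key = [year, month, day, hour], value = measured value
rows = [
    ([2011, 8, 15, 16], 38),
    ([2011, 8, 15, 17], 42),
    ([2011, 8, 16, 9], 40),
]


def mean_by_group_level(emitted, group_level):
    """Average the values per key prefix of length `group_level`."""
    totals = defaultdict(lambda: [0, 0])  # prefix -> [sum, count]
    for key, value in emitted:
        prefix = tuple(key[:group_level])
        totals[prefix][0] += value
        totals[prefix][1] += 1
    return {prefix: s / n for prefix, (s, n) in totals.items()}


print(mean_by_group_level(rows, 3))  # daily means
print(mean_by_group_level(rows, 4))  # hourly means
```

Keeping sum and count separately, instead of averaging averages, is what allows partial results to be combined correctly.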

Here is where my journey ended. Maybe I just quit where the fun would have started, but right now, CouchDB just doesn’t seem to be an option to store my rather small but growing databases in. The alternative will most likely be to stick with MySQL, calculating data aggregation using background workers and writing the results to MySQL tables as well. With time series data, there is always the option to archive the original fine-grained values and only keep aggregated data for past periods. And with the space-efficiency of MySQL it seems as if I could make much more use of disks and RAM than I could by using CouchDB.
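A minimal sketch of that alternative, with assumptions stated up front: SQLite from the Python standard library stands in for MySQL so the example runs anywhere, and the daily_means table, the measured_at column and the aggregate_daily function are hypothetical. In MySQL the write would use REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE instead of SQLite's INSERT OR REPLACE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL server
conn.executescript("""
    CREATE TABLE measures (station_id TEXT, measured_at TEXT, sa INTEGER);
    CREATE TABLE daily_means (
        station_id TEXT, day TEXT, mean_sa REAL,
        PRIMARY KEY (station_id, day)
    );
""")
conn.executemany(
    "INSERT INTO measures VALUES (?, ?, ?)",
    [  # made-up sample rows
        ("1150000004", "2011-08-15 16:30:00", 38),
        ("1150000004", "2011-08-15 18:30:00", 42),
        ("1150000004", "2011-08-16 09:30:00", 40),
    ],
)


def aggregate_daily(connection):
    """Recompute daily means; a background worker would run this periodically."""
    connection.execute("""
        INSERT OR REPLACE INTO daily_means (station_id, day, mean_sa)
        SELECT station_id, substr(measured_at, 1, 10), AVG(sa)
        FROM measures
        GROUP BY station_id, substr(measured_at, 1, 10)
    """)
    connection.commit()


aggregate_daily(conn)
print(conn.execute("SELECT * FROM daily_means ORDER BY day").fetchall())
```

Because the aggregate table has a primary key on station and day, running the worker again overwrites existing rows instead of duplicating them.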

What puzzles me though is how others cope with CouchDB. How do people test their views? Is anyone working locally during development? Did I do a foolish thing?

Discuss (8 comments)

What are you breathing?

2011/10/12

This is a new data visualization I created recently, depicting long-term measurements of emissions in the atmosphere in my home area Nordrhein-Westfalen (NRW). It uses data scraped from LANUV, the environment agency of NRW. (Since the domain is limited to a region in Germany, the application is German, too.)

What I found particularly interesting when analyzing the original measurements was the fact that some measurement stations are equipped with wind direction sensors. For these stations, every measurement of emissions is linked to a wind direction. For some stations, this reveals differences in emission levels, depending on the wind direction.

There are definitely many interesting ways to analyse the measurements, aside from wind direction. However I decided to start with a very limited set of displays, get the application public and then gather feedback on what users would find interesting and what puzzles them.

So please let the feedback flow! Comment here (or on the German post at G+) or contact me (see right hand side).

To the application

Discuss

Radiation is there, and so is the data

2011/07/11 (21 comments)

Video on vimeo

The German Bundesamt für Strahlenschutz (BfS) is, among other duties, in charge of measuring gamma radiation in the atmosphere throughout Germany. A network of about 1750 Geiger counters collects data which is displayed publicly on a very basic user interface. Since the tsunami in Japan and the subsequent, ongoing nuclear disaster in Fukushima, interest in radiation has risen everywhere. According to a BfS press relations officer, the BfS has been confronted with numerous requests to open up their raw data to the public. Fortunately, the BfS has reacted and given access to raw data downloads on a per-user basis.

I have started archiving that data in order to be able to create time-series visualizations. The above video shows a first non-interactive attempt with the data I have so far. (Unfortunately, the BfS only publishes 2-hour-interval readings for a 24 hour timespan and 24-hour-interval data for 7 days. So far, there is no access to historic data reaching further into the past.)

So now we can see, for a limited time frame, how radiation behaved dynamically. Not everybody knows that radiation is a fact of daily life. Even fewer might know that it differs quite a bit between locations and also changes over time. To me it’s particularly interesting to see how values rise and fall, similar to waves moving over the country. The probable explanation for this is the influence of rain, which causes readings to rise. The BfS has more information about this in German.

Of course, radiation is a difficult thing to substantiate. Even if numeric values are displayed as colours or as marks on a scale, the meaning remains abstract. And I have no plans to change this.

However, I can imagine a whole lot of other things to try out. For example, why not try to add a layer of rain information, so one can maybe see if a local radiation rise actually correlates with rain. Or, of course, try different measures for values, like circle radius or go into the third dimension. And, naturally, there should be an interactive way to “scrub” over the time scale, so users can choose the time frame and speed for themselves. (With a good video player, this works, but the vimeo player doesn’t really allow for that.) And last but not least, there could be numerous ways to interact with single dots (sensors) or a group of them, to get details, curves etc. or compare radiation on two spots.

What do you think? What would you like to know and see?

Update: The Python source code is available here.

Discuss (21 comments)

I’m a Business

2011/06/01 (11 comments)

This is a big moment for me. I am now a business. Starting now, I am self-employed. Ready to take on your Interaction Design / UXD assignments and willing to excel.

For you, this means that I will no longer see you as someone only interested in my content. I’ll rather regard you as a potential client or project partner. This means that I might think more about your needs and what could make you happy. And it means that I should have much more time for you, because this blog has now become a much more important place for me. Isn’t that nice?

Seriously, more than ever I am willing to get involved with you. Please feel invited to comment on stuff here, ask questions, give feedback and talk. Openly or in private. My contact details are available on every page.

In case you need an idea what I did so far, for now I’ll refer you to my LinkedIn or XING page. If you need more details, don’t hesitate to contact me.

Discuss (11 comments)

An Update on Radiation Data from Japan

2011/03/19 (17 comments)

It’s time to give a brief update on my efforts to make radiation data provided by the Japanese administration more accessible.

The good news: I can now offer you a far more complete raw data download with higher data quality. Data now goes back to March 1. This means the complete development of the nuclear crisis since the earthquake and tsunami is covered.

Please see the Japan Radiation Open Data page for details on the download.

Some bad facts remain. Fukushima and Miyagi prefectures still don’t report values to the SPEEDI system (which is the source of the data). Their values have been missing since 2011-03-11 05:40 UTC. And Ishikawa prefecture hasn’t contributed values since yesterday.

I’m curious to see how grassroots projects like geigercrowd and pachube make progress in closing the data gap.

In the meantime, people have started using the data available. The example below is a screenshot from a Japan Radiation Map created by Geir Endahl.

Eron Villarreal has contributed a data dashboard using Tableau Software.

Please comment to let others know what you are doing with the radiation data.

Update: I have done a draft animation to visualize the “burst” effects in Ibaraki around March 14.

Discuss (17 comments)

A Crowdsourced Japan Radiation Spreadsheet

2011/03/15 (16 comments)

The Japan Ministry of Education, Culture, Sports, Science and Technology (MEXT) publishes real time radiation data on their website, but there are two issues:

  • The site is under heavy load
  • The historic values aren’t available

So I created a Google Docs Spreadsheet to collect the real time data and store the values for later use.

Please, if you can, contribute to the document. Copy the real time values from the original source to the spreadsheet.

And, more important, please spread the word.

Here is the link, again: http://goo.gl/b5QX5

Update 2011-03-16 0:03 UTC: Data from the bousai website is now gathered automatically and can be downloaded as CSV.

Update 2011-03-16 20:07 UTC: The original spreadsheet, open to the public for editing, is still available. There is now a Google Docs Spreadsheet with prefecture max values derived from the CSV download mentioned above. That file is read-only for the public. The URL is http://goo.gl/RMuNt.

Update 2011-03-17 23:40: Information about the location of sensor stations is now available in a Google Doc under http://goo.gl/iDo0N.

Update 2011-03-19 10:34 UTC: Updated information is available in a new blog post. In brief: The data collection I started manually with the spreadsheet this post was about is now done automatically and available for download. However, people still work with the spreadsheet and have entered additional data sources.

Update 2011-03-22 14:18 UTC: Someone changed the settings for the original crowdsourced spreadsheet so that only a few people can edit now. I decided to leave it that way and updated the meta-information in the sheet so that people don’t waste their time with manually entering stuff, fighting trolls or looking at incomplete data.

Discuss (16 comments)

Google’s Open Data Toolkit Mess/Wealth

2011/02/26 (3 comments)

Google, the company that aims to organize the world’s information, has quite a few tools in their portfolio that are focused on (or can be used for) publishing structured data. Some even help visualize it. I wonder, how are these all interrelated? How can they cooperate? Which should be used when? And what is Google’s roadmap for them? Is it worth knowing them all? Or will they go where numerous Labs experiments have gone before?

But let’s take a brief look at each before we get all fuzzy.

Google Docs
https://docs.google.com/

For many it’s Excel without a license fee, but some use it to share structured data with others. The British news(-paper) website The Guardian, for example, is known for sharing public data on their Data Blog using the Google Docs spreadsheets module (a more recent example). It’s a pragmatic approach that serves them quite well. Of course, the Google Docs API allows for programmatic access, so Docs is not only about manual data entry. However, for those who favor human labor, Docs offers a form feature. For display and visualization, there are the built-in chart functions of Docs plus a growing ecosystem of both free and paid gadgets. And, not bad at all, Google Docs even comes with a chat function for real time discussion.

However, the downside to using Docs for sharing public data is that Google Docs spreadsheets tend to be simplistic. Spreadsheets hardly support anything more complex than two-dimensional data tables. Additionally, Google Docs has no notion of a schema and metadata. Fields might be interpreted as numbers, strings and dates, but that’s about it. If you want to tell the spreadsheet that two floating point values are actually latitude and longitude of a geo point, it’s not possible. Telling a person’s name from a country name isn’t, either. So data in Google Docs isn’t self-explanatory. It always needs additional documents to describe it and applications to enforce certain rules when handling the data – or very cautious users. When it comes to sharing data with a crowd of potentially unlimited size, this can get problematic.

Google Fusion Tables
http://www.google.com/fusiontables/

Fusion Tables appears to be one of the Google tools only few know about. And when you try it for yourself, it’s hard to freak out about its user experience. Actually it seems as if Fusion Tables hasn’t (yet?) been exposed to UI designers at all.

But: The purpose of this app is exactly what open data people should be looking for. It’s about defining data tables (hence the name), filling them with data, maintaining the data collaboratively, discussing it, visualizing it and “merging” it with other open data tables. SQL users, think JOIN.

I won’t judge any of the tools here by the data that’s currently available there. But it’s interesting to see how people use the service. One example is a list of MD5-hashed email addresses affected by the Gawker password incident, so people can find out if their password has been compromised. The way the table authors created a “Using this spreadsheet” column clearly shows that Fusion Tables needs more ways to give general table information, instructions, attribution and such.

Naturally, there is an API for read and write access to Fusion Tables. People who know SQL will likely feel comfortable with that, since one basically interacts with fusion tables via a subset of SQL, issued via REST web service queries.

As for built-in visualization tools, Fusion Tables offers a set that seems to be growing quickly. Currently there are table, map, intensity map, line chart, bar chart, pie chart, scatter plot, motion, timeline and storyline. People familiar with Google’s Visualization API know these. A special benefit of the maps module in Fusion Tables is the easy handling of thousands of data points on a map. For an example that lets you try out all available visualization methods, open the Wikileaks Afghan War Diary table.

A mystery to me is how editing rights can be controlled. It seems to me as if only the owner of a table can edit data, even if the table is public. However, a basic comment function allows others to annotate single values or rows.

Google Public Data Explorer
http://www.google.com/publicdata/home

Public Data Explorer has been around as a Labs product for a while, allowing users to explore a couple of datasets prepared by Google. But, just recently, it has been announced in the Official Google Blog that Public Data Explorer is now open to user submissions. If you’re interested, the Nieman Lab blog has additional insights about the new launch.

For that purpose, a new data exchange format called Dataset Publishing Language (DSPL) has been released.

DSPL is worth a closer look, since it tackles many of the problems which have already been mentioned above: lack of schema, lack of metadata, lack of self-descriptiveness and being limited to low dimensionality. In DSPL, the meta information about the dataset is described in an XML file. This file contains general information and describes “concepts”. For example, the DSPL package about European unemployment data (available from the code site) defines, among others, a concept of countries and of country_groups. It also contains relationship information telling us that country_groups consist of countries. These concepts are then linked to data tables which are provided as separate CSV files.

Compared to Google Docs, preparation of data for Public Data Explorer will naturally consume more time upfront, since a lot of decisions have to be made on the schema level and, depending on your data, many relationships need to be modelled into the schema. On the other hand, this makes visualizing, grouping and filtering that data a lot easier.

When it comes to visualization, Public Data Explorer offers a limited, but well-optimized set of tools. All visualization forms like bar charts, scatter plots, maps etc. can be used to depict change over time (as known from Gapminder).

One thing that seems to be missing here though is an API for programmatic access to data. As of now, data has to be uploaded manually in the form of DSPL files. This will make it difficult or even impossible to keep datasets updated over a longer period of time. Also, there is no table view and no export mechanism that people could use to access the raw data.

So, by now Public Data Explorer is a way to quickly make visualizations of data accessible to the public. But it doesn’t serve as a platform to distribute raw data to the public. In that sense, it is not as open as Fusion Tables or Docs.

Google Base
http://www.google.de/base/

Google Base started in November 2005 as some universal approach for storing structured data. The last mention of the platform in the official Google blog dates back to June 2009. Back when it started, Base was not at all a typical Google product. Until then, the mantra of the search giant was: just dump your data to the web and we will make sense of it. They didn’t care about semantics all that much, or at least they wished they wouldn’t have to. But Google Base took a different approach. At first, Base seemed to be just the backend of Google’s product search (called Froogle back then), opened to the public for uploading their own stuff. But they also had pre-defined templates to add certain object types like recipes, job offers and other classifieds. Plus users could, and still can, create their own templates/schemas.

While the product search has meanwhile evolved into the Google Merchant Center, the rest of its original purpose is no longer visible. The user interface for creation of new data feeds, as they call it, still exists. But there seems to be no outlet for that data. Other than the API, there is no way to display/search/access that data. This used to be different.

Unfortunately, it’s completely unclear to me what purpose Base serves now.

Freebase
http://www.freebase.com/

Freebase is the creation of a company called Metaweb, which was acquired by Google in July 2010. As the name indicates, Freebase is a database that is free to use. The main concept of Freebase is that each item (entity) can be linked to other items. For example, a specific movie could be one entity and a director could be another entity. Of course, the director can be linked to the movie he/she directed. The schema explorer gives an overview of all the domains and types that a schema already exists for. And the data sources overview lists the (or some of the) sources of data currently contained in Freebase.

It appears that the Freebase concept is comparable to Wikipedia in the way its content is curated by a community. People (and bots) are obviously watching new data entries and take care that no two entities exist which describe the same thing. To get a grasp on the somewhat complex concept, I recommend reading the overview texts available in the API docs.

If there is a spectrum of well-definedness of data and Google Docs is on one end of this spectrum (namely the loosely defined end), then Freebase is on the very opposite end. This comes with benefits as well as with a cost. The downside is that the effort to get data into the system might be pretty high. I have to admit that I haven’t tried it myself so far.

Bottom line

So Google provides a number of different tools that could help you publish structured data. Except for Google Base, which looks like an abandoned stepchild right now, all serve their own purpose.

As a quick overview, in the spirit of open data, I created this Google spreadsheet:

It’s public for read/write, so feel free to work with it.

When looking at these tools from outside Google, it’s hard not to wonder: Do we need all these tools with their overlap of functionality? Why don’t these folks sit together and merge their code? Or at least make it easier for their various tools to interoperate? Actually it would be great if one could use Fusion Tables as a data storage for Public Data Explorer instead of all these static CSV files. Or if Fusion Tables had the same fine-grained control over who can read/edit data as Docs has. Or if creating a Fusion Table was as painless as creating a Google Docs Spreadsheet.

On the other hand, Google has been criticized a lot for becoming a big and slow behemoth. Trying to align all these tools (and teams) with each other in the first place might have had the downside that many of these tools wouldn’t even have been launched by now. So I decide to be grateful for what’s there already and give suggestions for improvement.

Man, these are pretty exciting times with the Google/Metaweb acquisition only a couple of months ago, Google just having opened Public Data Explorer and Fusion Tables being under active development, too. I’m curious to see how these tools develop and what they come up with next.

BTW, I’m happy for feedback. Please let me know if I missed something, got something wrong or if you have additional insights or a different opinion.

Discuss (3 comments)

LinkedIn’s Beautiful Network Graph Visualization

2011/01/25 (1 comment)

My LinkedIn Map - click for large version

LinkedIn Maps is a new way for LinkedIn users to get a picture of their social network. While this is an obvious thing to do for a social networking platform, it’s astonishing how long it took them to come up with something like this.

One nice thing about this visualization is the color-clustering. Look how neatly it distinguishes between my study contacts (KISD), the Austin/TX bunch (Trilogy and AIESEC) and my former employer nexum AG.

What does your map look like?

Discuss (1 comment)

Six Principles of URL Design

2011/01/19 (1 comment)

Completely without notice here in my very own blog, last year I offered a session on URL design at the User Experience Bar Camp (UXCamp) in Berlin. Now that this space serves as a spot worthy of content, I re-post it here.

Unfortunately, Myriad renders horribly when embedded into a Powerpoint presentation / PDF and then uploaded to slideshare. So please do your eyes a favor and watch the slides as large as you can. Here you can find the presentation on Slideshare: Sexy URLs don’t end in .aspx?id=23859

As a sneak preview, these are the six principles:

  1. Simplicity
  2. Meaningfulness
  3. Hackability
  4. Unambiguousness
  5. Persistence
  6. Canonicalization

If you’re interested in my background thoughts, find them after the break.

Read on and discuss (1 comment)

What’s Wrong With Youtube’s Statistics Graph?

2011/01/15

Youtube allows all visitors to see access statistics for videos. This is what it looks like.

I love that feature. Not that I really need these figures all that much, but I’m a visual person, and often I just click on the statistics button out of curiosity. Sometimes the graph reveals interesting facts. For example, the fact that the Double Rainbow video (for which you see the current graph displayed above), with now more than 24 million views, took 6 months before really taking off. Actually the annotations (letters A to G above the graph) suggest that a mention at huffingtonpost.com was the first important publication driving traffic for the video.

But why, why have Youtube folks decided to show us these figures as an accumulation?

It’s a matter of fact that the overall number of views can only stay the same or grow. It will never fall. So the fact that it rises over time is not that exciting. It would rather be interesting to see when it rose and by how much. So why not show us the number of views within a certain time period instead? Analytics does it. Goo.gl does it. Feedburner does it.

The benefit would be that the Y-axis wouldn’t have to go all the way up to 24M in the example above, but it would scale to the maximum number of views per day instead and allow for much more detailed reading of the figures. That means: We would get much more information on the same space.
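To illustrate the difference, here is a small sketch that turns accumulated totals into views per day. The numbers are made up and have nothing to do with the actual video.

```python
cumulative_views = [100, 150, 150, 400, 1000]  # made-up running totals, one per day

# Views per day are the differences between consecutive running totals.
daily_views = [
    later - earlier
    for earlier, later in zip(cumulative_views, cumulative_views[1:])
]
print(daily_views)       # [50, 0, 250, 600]
print(max(daily_views))  # 600: the Y-axis only needs to reach this value
```

The accumulated series needs an axis up to 1000, while the per-day series fits into 600 and makes the quiet second day visible.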

I have only two guesses as an explanation:

  • Youtube wants to show something positive to all users all the time. The graph for every video eventually rises at some point.
  • Youtube doesn’t want to give us all the details and the ability to compare views per time span.

Who knows?

Discuss