Marian Steinbach: Blog

KISDtalk: Data are the new Streets

2013/06/17

On May 28 I had the occasion to talk at Köln International School of Design (KISD) about how the availability of data transforms and enables services, especially in the mobility world. The video shows the presentation slides, me and the audience are audio-only. Thanks to KISD for having me!

Find the video on Vimeo.

Discuss

My Scraping Stack

2013/05/24 (4 comments)

Enabling yourself to work with data sources from the web still has a lot to do with scraping. And, I assume, it will for a long time to come. Even if the data you need is available via a webservice API, what you do in order to harvest larger amounts of data is something like scraping. Again and again you will have to answer the same questions.

  • What language should I use to write my scraper?
  • Which HTTP client should I use?
  • How can I parse HTML input, maybe XML, JSON and all the rest?
  • How can I scrape pages that require form submission, session handling etc.?
  • How can I store the data?
  • How can I avoid scraping the same stuff over and over again?
  • How can I make the scraper fault tolerant, so that it won't break on a single 404 response?

If you're simply gathering content from one HTML table, you probably won't mind. But as your projects grow larger, these questions will become more important.
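
To give an idea of what the answers can look like in practice, here is a minimal sketch in Python. The libraries (requests, lxml) and the URL are just illustrative choices on my part, not a prescription:

import time
import requests
from lxml import html

# Hypothetical target page; replace with whatever you actually want to scrape.
BASE_URL = "http://example.com/some-table-page"

def fetch(url, retries=3):
    """Fetch a URL without letting a single 404 or timeout break the whole run."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 404:
                return None  # skip missing pages instead of crashing
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # simple backoff, then retry
    return None

def parse_rows(page_text):
    """Yield the text content of every table row on the page."""
    tree = html.fromstring(page_text)
    for row in tree.xpath("//table//tr"):
        yield [cell.text_content().strip() for cell in row.xpath("./td")]

text = fetch(BASE_URL)
if text:
    for cells in parse_rows(text):
        print(cells)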

Read on and discuss (4 comments)

Green Party Success and Voter Turnout (German)

2012/09/28

A German social scientist has observed a correlation between the percentage of votes for the Green Party and low voter turnout. This motivated him to write a book about how the Green Party puts our democracy at risk, which in turn motivated me to look into his arguments.

Read my German text, if you’re so inclined

Discuss

The quantified occasional runner

2012/09/23 (8 comments)

I have a geeky personal long-term project going on: I run to create data. Of course, I run to become a fitter person. Actually, I also run to be a more persistent mountain biker, and it takes less time to train endurance by running than by mountain biking. But, as a consequence, I have been running more or less consistently for about a year now. (I ran before, but that was around 2003, and I didn't collect data back then.)

By now, I have collected data for about 80 runs. Here is when I ran, and how far.

About my data collection

I am wearing a Zephyr HxM heart rate sensor on a chest strap, which also collects cadence data. This data is recorded by the SportsTracker app on my Android phone. The app also records, of course, position over time via the phone's GPS sensor. During the course of my data gathering, two different Android phones have done the job: an HTC Desire and a Samsung Galaxy S II.

Very important to me: after uploading the recorded data to the SportsTrackLive.com website via the app, I can download the data log in GPX and CSV format. The CSV files contain one sample every 10 seconds, with every sample consisting of a time stamp, geo position (latitude, longitude), altitude, cadence (current steps per minute), heart rate (current beats per minute), and current speed.
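
Just to illustrate how easy such a log is to work with: a few lines of Python are enough to read the samples back in. Note that the column names and the semicolon delimiter below are my assumptions based on the description above, not the actual format guaranteed by SportsTrackLive:

import csv

# Assumed column layout: one sample every 10 seconds.
FIELDS = ["timestamp", "latitude", "longitude", "altitude", "cadence", "heart_rate", "speed"]

def read_samples(path):
    """Yield one dict per sample from a downloaded CSV log."""
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=";"):  # delimiter assumed
            yield dict(zip(FIELDS, row))

samples = list(read_samples("run.csv"))  # hypothetical file name
heart_rates = [float(s["heart_rate"]) for s in samples if s["heart_rate"]]
print("average heart rate:", sum(heart_rates) / len(heart_rates))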

The SportsTrackLive web platform gives me additional data for a specific run, with three values being of special interest to me: the overall number of heart beats recorded, temperature and humidity (though I don't know whether these refer to the start time, the end time or somewhere in between). The web interface also displays the total ascent in meters. (Of course, since GPS altitude data is very rough, all altitude data has to be taken with a grain of salt.)

My running

The graph above already tells you a bit about my running and how it has developed. I started in July 2011, slowly but steadily increasing the distance (and duration, of course) from somewhere around 4 kilometers. Invisible to you, but available in my memories: some of these dots in August 2011 happened in Italy, where I explored parts of Piemonte and Liguria in running shoes. Hot, slow, gorgeous! One month later, in September 2011, I exceeded 10 kilometers for the first time. The plot above also reveals that I frequently ran about 8 kilometers in September and October of 2011, simply because I had a favourite route and fell into a routine.

Then, in winter 2011/2012, I got lazy. Cold temperatures and a lack of sunlight in the morning tend to make it hard for me to motivate myself. Therefore I almost had to re-start in March 2012. Luckily, it didn't take me as long to reach my 8-kilometer home round as the first time. I kept running throughout the summer, sometimes increasing the distance up to 17 kilometers (which feels pretty cool when you feel like it and can simply do it!).

The plot above again shows all the runs over the course of 15 months. Instead of distance, the vertical position of the dots now reflects the average speed during each run. You can see that I increased my speed more or less steadily over the two training periods (July-October 2011 and March-September 2012), with one exception: this year, it seems as if I started faster (more than 8 kph), then became slower on average in May, only to become faster again after that. Comparing the two periods visually, it also seems like I increased my speed faster last year, while also starting from a lower level.

So my runs became longer and faster. Judging from both speed and distance, one could say that I increased the intensity of my running during the course of the year – which was by far not accidental.

With increased performance, of course, the heart has to pump more oxygen to the muscle fibres. This, not surprisingly, is reflected by the average heart rate of each run.

This looks similar to the speed plot above, doesn’t it? Let’s draw a scatterplot of heart rate over speed to learn more about their correlation:

This shows a pretty high positive correlation between speed and heart rate (r=0.74). As a reference for those comparing their own data with mine: my minimum heart rate at rest is about 55 bpm, and the highest heart rate I have ever measured on myself was 199 bpm, when I deliberately ran uphill last year to measure my maximum heart rate. (Be sure to have a napkin with you in case you have to puke!)

The aim of my training should, of course, be to gain fitness. Fitness is a complex construct, and I don’t know a lot about it, but one thing people say is: If you’re fit, you can run the same intensity (=speed) with a lower heart rate than if you’re not. So, the research question here is:

Have I become more fit during my 15 months of running?

This question cannot be answered by looking at one single measure within the data. The average heart rate during my runs (and probably during yours as well) is influenced by the running intensity. This, in turn, is influenced mostly by my running speed, but also by the degree of ascent, and maybe by even more: what about temperature and humidity? Or anything else?

I ran a linear regression with the following predictors, trying to predict the average heart rate:

  • Sequence number of the run (first=1 to last=82)
  • Length of the run (distance in kilometers)
  • Climb (ascent in meters)
  • Time of day (hour of the start time, 6 to 21)
  • Average speed (in kilometers/hour)
  • Temperature (in degrees Celsius)
  • Relative Humidity (percentage between 0 and 100)

Here is the result:

4.879 * avg_speed +
0.2939 * timeofday +
-0.2124 * temperature +
-0.1084 * humidity +
0.0337 * climb +
108.2558

The regression model has been created and tested in Weka, using 10-fold cross-validation as a verification method. More details: The correlation coefficient is 0.7525, the mean absolute error is 3.7692.

What does it mean? It tells us that the variance in my recorded average heart rate can be explained mostly by the running speed, but also, to a small degree, by the hour of day (the earlier, the lower the heart rate), by temperature (the lower the temperature, the higher the heart rate), by humidity (the drier, the higher my heart rate) and finally by climb (the higher the ascent, the higher the heart rate). Actually, the correlation coefficient is only slightly greater than the one for speed alone (remember, that was 0.74). This means that the four additional variables don't really influence the outcome a whole lot.

Note that two input variables don't seem to have any measurable influence at all: (a) the sequence number (which kind of reflects my training experience up to that run), and (b) the length of the run. They haven't even been included in the model by Weka.

So if it doesn't matter to a linear regression whether I ran zero or a hundred times before, does this mean that my training is worthless? Not quite. Actually, it could mean that my fitness hasn't increased in a linear way. Which would be reasonable, since I hardly did anything during the winter months and might have lost some fitness.
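
By the way, if you want to try this kind of regression on your own running log without installing Weka, a plain ordinary-least-squares fit in Python gets you close. This is not the Weka pipeline (no cross-validation, no feature selection), and the file name and column order are assumptions for illustration:

import numpy as np

# Assumed columns in runs.csv: avg_speed, timeofday, temperature, humidity, climb, avg_heart_rate
data = np.loadtxt("runs.csv", delimiter=",")
X, y = data[:, :5], data[:, 5]

# Append a constant column for the intercept and solve the least-squares problem.
X = np.column_stack([X, np.ones(len(X))])
coefficients, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

names = ["avg_speed", "timeofday", "temperature", "humidity", "climb", "intercept"]
for name, value in zip(names, coefficients):
    print(name, round(value, 4))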

A different approach

I want confirmation. I will bend the numbers as long as I need to in order to make them tell me that I'm a fitter person now.

At least, the linear regression shows that running speed is probably the most important variable influencing the heart rate during my runs. What if I create a composite variable that represents the number of heartbeats required to run at a certain speed? Lower values would indicate better fitness.

A composite like that could simply be created by dividing the average heart rate for a run by the average speed during that run. For a heart rate of 147 beats per minute and a speed of 9 kph, this would result in a value of 16.33. The same heart rate at a 0.5 kph faster speed would result in a value of 15.47. You get the idea. Let's call this composite the heart rate over speed (HROS).
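
In code, the composite is nothing more than a division per run (the two example values from above):

def hros(avg_heart_rate, avg_speed):
    """Heart rate over speed: average beats per minute divided by average km/h."""
    return avg_heart_rate / avg_speed

print(round(hros(147, 9.0), 2))  # 16.33
print(round(hros(147, 9.5), 2))  # 15.47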

Here is the HROS over time plot:

Now this looks like the HROS represents what I have experienced: a first improvement in 2011, which was neutralized by doing nothing during the winter, and then a second improvement in 2012. It looks like, on average, I now need fewer heartbeats to run at a certain speed than I ever did. Cool! And it tells me that I should try to run more during winter, aiming to at least keep my fitness level.

That was fun! Have you gathered data yourself? Does the HROS work for you? Let me know!

Update Sept. 27, 2012

Although it wasn't my goal to predict my heart rate for future runs, I let R create another linear regression model, as I had done in Weka. It is quite different at first sight. Here is the output of the summary.lm function in R:

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2261 -3.0338 -0.6747  2.5519 17.8290 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   858.81122  648.53753   1.324   0.1905    
avg_speed       6.32242    0.93682   6.749 7.19e-09 ***
id              0.23391    0.26434   0.885   0.3798    
datetime       -0.05003    0.04257  -1.175   0.2446    
dayofyear      -0.03446    0.01531  -2.251   0.0281 *  
length_km      -0.11381    0.27781  -0.410   0.6835    
timeofday       0.24620    0.18753   1.313   0.1943    
temperature    -0.09219    0.13920  -0.662   0.5104    
humidity       -0.07576    0.05422  -1.397   0.1676    
climb           0.04963    0.01954   2.540   0.0137 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 4.543 on 59 degrees of freedom
Multiple R-squared: 0.709,	Adjusted R-squared: 0.6646 
F-statistic: 15.97 on 9 and 59 DF,  p-value: 7.439e-13

The great thing about this summary is that it tells us how relevant each variable is for the outcome, mainly in the column labeled “Pr(>|t|)”. It contains, as far as I understand, the p-value, which is known from Null Hypothesis Significance Testing (NHST). A low p-value stands for a high significance of the variable. This is also indicated by the stars on the right side of the table. As we can see, only three variables have a notable significance level (p < 0.05). The average speed stands out in particular, as it already did in the model created by Weka. Also interesting is the selection of the other two variables: where Weka chose timeofday, temperature and humidity, R thinks we should instead look at dayofyear (which is the date of the run as a number between 1 and 366). Note that its coefficient is negative, which implies a negative correlation: the more time has passed in the year, the lower my heart rate. How nice, as it suggests the existence of the training effect I am looking for.

As suggested by the German Wikipedia page on linear regression, I eliminated the non-significant variables from the model. The output then is:

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3019 -2.6789 -0.8496  2.9888 15.9475 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.889716   4.627295  21.155  < 2e-16 ***
avg_speed    5.448352   0.582350   9.356 2.48e-14 ***
climb        0.050204   0.014406   3.485 0.000816 ***
dayofyear   -0.016825   0.007508  -2.241 0.027907 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 4.662 on 77 degrees of freedom
Multiple R-squared: 0.6228,	Adjusted R-squared: 0.6081 
F-statistic: 42.38 on 3 and 77 DF,  p-value: 2.797e-16

Boiled down to only three predictors, the significance of the remaining variables increases. The most significant one is still the average running speed (avg_speed). The coefficient of ~5.5 means that when I run 1 kilometer per hour faster, my average heart rate will be 5.5 beats per minute higher. The second most significant variable is the number of meters ascended (climb). Here, for every 100 meters of additional ascent, my average heart rate will increase by 5 beats (0.050204 * 100). The last one means: two months (60 days) from now, my average heart rate will be 1 beat lower than today, given the other variables stay the same (-0.016825 * 60). Let's note the model in a more concise form:

 5.448352 * avg_speed +
-0.016825 * dayofyear +
 0.050204 * climb +
97.889716

Since I originally published this post, I have been out running twice. The first run of the two happened in a mood like “Let's see how fast I can run on my 8 km track.” This resulted in me running faster than ever, but I ended up stopping about 1 km short of my usual finish, overpaced. The second run was more recreational. Here are the numbers:

Day of Year  Time of Day  Distance  Average Speed  Ascent  Temperature  Humidity  Average Heart Rate
268          7            7.26 km   11.05 kph      60 m    16 °C        77 %      169 bpm
271          7            8.24 km   8.9 kph        70 m    12 °C        88 %      143 bpm

Could those two heart rate values have been predicted by the linear models above? Plugging the values into the first model (by Weka), the results would have been 154.5 bpm (off by 14.5 bpm) and 144.0 bpm (off by 1.0). The simplified model by R would have predicted values of 156.6 (off by 12.4) and 145.3 (off by 2.3).
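
If you want to reproduce those numbers, both models fit into a few lines of Python; the coefficients are taken verbatim from the formulas in this post:

def predict_weka(avg_speed, timeofday, temperature, humidity, climb):
    """Average heart rate according to the Weka model above."""
    return (4.879 * avg_speed + 0.2939 * timeofday - 0.2124 * temperature
            - 0.1084 * humidity + 0.0337 * climb + 108.2558)

def predict_r(avg_speed, dayofyear, climb):
    """Average heart rate according to the simplified R model above."""
    return (5.448352 * avg_speed - 0.016825 * dayofyear
            + 0.050204 * climb + 97.889716)

# The fast run (day 268) and the slow run (day 271) from the table above.
print(predict_weka(11.05, 7, 16, 77, 60), predict_r(11.05, 268, 60))  # ~154.5 and ~156.6
print(predict_weka(8.9, 7, 12, 88, 70), predict_r(8.9, 271, 70))      # ~144.0 and ~145.3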

While both models are pretty close when predicting the actual result of the slow run, both fail miserably to predict the fast run. That’s not a big surprise, considering that the models have been built based on data with a maximum heart rate of 157 bpm and a max average speed of 10.4 kph. Remember that linear-looking scatterplot of heart rate over speed? Here is how it looks now:

The blue line is the regression line, the fast run is marked in red, for your convenience. The slow run blends nicely with the rest and actually resides a bit below the regression line.

To me as a runner, this leads to an interesting conclusion (which others might have come to in different ways): I can't simply expect to be able to run 11 kph over the course of 8 kilometers at a bearable heart rate of 159 based only on what previous, slower runs looked like.

In contrast to what the scatterplot suggests, the relationship between speed and heart rate might very well not be linear, but, for example, quadratic. For a motor vehicle, the relationship between speed and the force required is quadratic, due to the effect of air resistance. A source suggests that air resistance has little influence on slow running like mine, since runners at about 20 kph see only an 8 percent contribution from air resistance. But what about the relationship between the force required and heart rate? Is it non-linear?

And then there is the concept of the aerobic threshold. It implies that there is a point on the intensity scale where the linear relationship between intensity and heart rate ends.

However, I have a very concrete goal now: to draw a dot into that scatterplot above, at the position of x=11 and y=160. Let’s see how close I can get there this year.

Discuss (8 comments)

What I learned about CouchDB

2011/11/11 (12 comments)

With many people talking about CouchDB, I got curious and took a closer look. Especially Map/Reduce made me want to find out whether CouchDB would be a good solution to create aggregated statistics from a large number of records. A presentation on SlideShare caught my eye, especially the structured keys and group levels (see slide 43). Since I am now maintaining three growing time series databases (radiation data from Japan and Germany plus air quality data), I wondered if CouchDB would be an option to store and process that data.

Short disclaimer: I am by no means a CouchDB, NoSQL or DBMS expert. I usually make MySQL do the things I want, without ever having crossed the boundaries of one server. All I did was some testing; I didn't go into production. The reason I post my experience anyway is to foster discussion, learn from it and help others save time.

Data import

I imported data from CSV via couchdbkit (Python). A CSV file contained about 850,000 rows. Example row:

"1150000004";"2011-08-15 16:30:00";"38";"0"

The script I used is https://gist.github.com/1357839. Note that it makes use of bulk_save() to write 10,000 rows at once. A dry run without writing objects to CouchDB took 50 seconds. The actual run took 3 minutes 56 seconds. In comparison, MySQL needs 45 seconds to import the same data, including creating several indexes.
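
The gist has the details; the core of the import boils down to something like the following sketch (couchdbkit's bulk_save, with a made-up database and file name):

import csv
from couchdbkit import Server

db = Server().get_or_create_db("measures")  # database name just an example

docs, BATCH_SIZE = [], 10000
with open("measures.csv", newline="") as f:
    for station_id, timestamp, sa, ra in csv.reader(f, delimiter=";"):
        docs.append({
            "doc_type": "Measure",
            "station_id": station_id,
            "datetime": timestamp,  # the real script turns this into a [y, m, d, h, min, s] list
            "sa": int(sa),
            "ra": int(ra),
        })
        if len(docs) >= BATCH_SIZE:
            db.bulk_save(docs)  # one HTTP request per 10,000 documents
            docs = []
if docs:
    db.bulk_save(docs)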

This is what the row above looks like as a CouchDB document:

{
   "_id": "003fd8eca823e1b46995ef4ca3000ec5",
   "_rev": "1-6770700eb84c2df9f5eef3bd24348723",
   "doc_type": "Measure",
   "sa": 38,
   "ra": 0,
   "station_id": "1150000004",
   "datetime": [ 2011, 8, 15, 16, 30, 0 ]
}

The doc_type: “Measure” is added by couchdbkit according to the class I used for the objects in Python. This is of course useful to distinguish different types of documents within the same database.

The “datetime” field is represented as a list because I wanted to mimic the behaviour I saw in the presentation mentioned above. It wouldn’t be necessary though. One could also code this field as a DateTimeProperty and later use the map function to create a structured key as needed.

Storage consumption

After importing my first chunk of 850,000 rows, the CouchDB database had a size of 283 MB. I tried compacting it, which caused CouchDB to increase its use of disk space up to 569 MB, only to reduce it to 288 MB afterwards. (If you do the math, you can see that the disk space required for compaction equals the space used by the database before compaction plus the compacted one. This is why it is advisable to run multiple smaller instances of CouchDB on the same server; otherwise CouchDB can become a big, unmanageable behemoth.) As for the amount of disk space used, I'm impressed in a bad way. Disk space might be cheap, but it's not so cheap that it doesn't matter.

The graph above illustrates the relation of disk space required by CSV, MySQL and CouchDB for roughly the same data.

The interesting thing here is that this is only the beginning. I haven’t created any views yet. Views can easily become multiple times as big as the underlying document store.

Querying the data

In CouchDB, views are what queries are in SQL. To start with something simple, I opened Futon, the CouchDB Admin interface, and created a temporary view based on this simple map function:

function(doc) {
   emit(doc.station_id, doc.sa);
}

and this reduce function:

_count

Creating this temporary view took about 4 minutes. This is kind of stunning. If I were to compare this to SQL, what this view does is similar to this:

SELECT station_id, COUNT(*) FROM measures
GROUP BY station_id

On the same machine (my MacBook) this takes 640ms, without any indexes.

It seems that my database containing 850,000 documents is much too big for development purposes. The actual, untruncated database that I took the test data from currently contains 7.8 million rows. How would I test and get an idea of response times at that scale? Do I need a cluster to find out?

After several coffee breaks, I had finally created one meaningful view with a reduce function that gave me daily, weekly or hourly mean values. Once the view was built, querying it seemed quite fast. But, honestly, I didn’t bother to measure. After creating two simple views, the disk consumption was at 1.6 GB.
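
For reference, this is the kind of structured-key view I mean (a sketch, not my exact design document): the map function emits the [year, month, day, hour, minute, second] list as key, CouchDB's builtin _stats reduce keeps sums and counts, and group_level decides at query time whether you get hourly, daily or monthly aggregates.

from couchdbkit import Server

db = Server().get_or_create_db("measures")  # database name just an example

# The view's key is the datetime list already stored in each document;
# _stats is a builtin reduce that tracks sum, count, min and max per key group.
db.save_doc({
    "_id": "_design/stats",
    "views": {
        "sa_by_time": {
            "map": "function(doc) { if (doc.doc_type == 'Measure') emit(doc.datetime, doc.sa); }",
            "reduce": "_stats",
        }
    },
})

# group_level=3 groups keys by [year, month, day], i.e. daily aggregates;
# use 4 for hourly, 2 for monthly.
for row in db.view("stats/sa_by_time", group_level=3):
    stats = row["value"]
    print(row["key"], stats["sum"] / stats["count"])  # daily mean of "sa"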

Here is where my journey ended. Maybe I just quit where the fun would have started, but right now, CouchDB just doesn’t seem to be an option to store my rather small but growing databases in. The alternative will most likely be to stick with MySQL, calculating data aggregation using background workers and writing the results to MySQL tables as well. With time series data, there is always the option to archive the original fine-grained values and only keep aggregated data for past periods. And with the space-efficiency of MySQL it seems as if I could make much more use of disks and RAM than I could by using CouchDB.

What puzzles me though is how others cope with CouchDB. How do people test their views? Is anyone working locally during development? Did I do a foolish thing?

Discuss (12 comments)

What are you breathing?

2011/10/12

This is a new data visualization I created recently, depicting long-term measurements of emissions into the atmosphere in my home region of Nordrhein-Westfalen (NRW). It uses data scraped from LANUV, the environment agency of NRW. (Since the domain is limited to a region in Germany, the application is in German, too.)

What I found particularly interesting when analyzing the original measurements was the fact that some measurement stations are equipped with wind direction sensors. For these stations, every measurement of emissions is linked to a wind direction. For some stations, this reveals differences in emission levels, depending on the wind direction.

There are definitely many interesting ways to analyse the measurements, aside from wind direction. However I decided to start with a very limited set of displays, get the application public and then gather feedback on what users would find interesting and what puzzles them.

So please let the feedback flow! Comment here (or on the German post on G+) or contact me (see the right-hand side).

To the application

Discuss

Radiation is there, and so is the data

2011/07/11 (21 comments)

Video on Vimeo

The German Bundesamt für Strahlenschutz (BfS) is, among other duties, in charge of measuring gamma radiation in the atmosphere throughout Germany. A network of about 1750 Geiger counters collects data, which is displayed publicly on a very basic user interface. Since the tsunami in Japan and the subsequent, ongoing nuclear disaster in Fukushima, interest in radiation has risen everywhere. According to a BfS press relations officer, the BfS has been confronted with numerous requests to open up their raw data to the public. Fortunately, the BfS has reacted and given access to raw data downloads on a per-user basis.

I have started archiving that data in order to be able to create time-series visualizations. The above video shows a first non-interactive attempt with the data I have so far. (Unfortunately, the BfS only publishes 2-hour-interval readings for a 24 hour timespan and 24-hour-interval data for 7 days. So far, there is no access to historic data reaching further into the past.)

So now we can see, for a limited time frame, how radiation behaved dynamically. Not everybody knows that radiation is a fact of daily life. Even fewer might know that it differs quite a bit between locations and also changes over time. To me it's particularly interesting to see how values rise and fall like waves moving over the country. The probable explanation for this is the influence of rain, which causes readings to rise. The BfS has more information about this in German.

Of course, radiation is a difficult thing to substantiate. Even if numeric values are displayed as colours or as marks on a scale, the meaning remains abstract. And I have no plans to change this.

However, I can imagine a whole lot of other things to try out. For example, why not add a layer of rain information, so one can see whether a local rise in radiation actually correlates with rain. Or, of course, try different visual encodings for the values, like circle radius, or go into the third dimension. And, naturally, there should be an interactive way to “scrub” over the time scale, so users can choose the time frame and speed for themselves. (With a good video player this works, but the Vimeo player doesn't really allow for that.) And last but not least, there could be numerous ways to interact with single dots (sensors) or a group of them, to get details, curves etc. or to compare radiation at two spots.

What do you think? What would you like to know and see?

Update: The Python source code is available here.

Discuss (21 comments)

I’m a Business

2011/06/01 (11 comments)

 

This is a big moment for me. I am now a business. Starting now, I am self-employed. Ready to take on your Interaction Design / UXD assignments and willing to excel.

For you, this means that I will no longer see you as someone only interested in my content. I’ll rather regard you as a potential client or project partner. This means that I might think more about your needs and what could make you happy. And it means that I should have much more time for you, because this blog has now become a much more important place for me. Isn’t that nice?

Seriously, more than ever I am willing to get involved with you. Please feel invited to comment on stuff here, ask questions, give feedback and talk. Openly or in private. My contact details are available on every page.

In case you need an idea of what I have done so far, I'll refer you to my LinkedIn or XING page for now. If you need more details, don't hesitate to contact me.

Discuss (11 comments)

An Update on Radiation Data from Japan

2011/03/19 (17 comments)

It’s time to give a brief update on my efforts to make radiation data provided by the Japanese administration more accessible.

The good news: I can now offer you a far more complete raw data download with higher data quality. Data now goes back to March 1. This means the complete development of the nuclear crisis since the earthquake and tsunami is covered.

Please see the Japan Radiation Open Data page for details on the download.

Some bad facts remain. Fukushima and Miyagi prefectures still don't report values to the SPEEDI system (which is the source of the data). Their values have been missing since 2011-03-11 05:40 UTC. And Ishikawa prefecture hasn't contributed values since yesterday.

I’m curious to see how grassroots projects like geigercrowd and pachube make progress in closing the data gap.

In the meantime, people have started using the data available. The example below is a screenshot from a Japan Radiation Map created by Geir Endahl.

Eron Villarreal has contributed a data dashboard using Tableau Software.

Please comment to let others know what you are doing with the radiation data.

Update: I have done a draft animation to visualize the “burst” effects in Ibaraki around March 14.

Discuss (17 comments)

A Crowdsourced Japan Radiation Spreadsheet

2011/03/15 (18 comments)

The Japan Ministry of Education, Culture, Sports, Science and Technology (MEXT) publishes real time radiation data on their website, but there are two issues:

  • The site is under heavy load
  • The historic values aren’t available

So I created a Google Docs Spreadsheet to collect the real time data and store the values for later use.

Please, if you can, contribute to the document. Copy the real time values from the original source to the spreadsheet.

And, more important, please spread the word.

Here is the link, again: http://goo.gl/b5QX5

Update 2011-03-16 0:03 UTC: Data from the bousai website is now gathered automatically and can be downloaded as CSV.

Update 2011-03-16 20:07 UTC: The original spreadsheet, open to the public for editing, is still available. There is now a Google Docs Spreadsheet with prefecture max values derived from the CSV download mentioned above. That file is read-only for the public. The URL is http://goo.gl/RMuNt.

Update 2011-03-17 23:40: Information about the locations of the sensor stations is now available in a Google Doc at http://goo.gl/iDo0N.

Update 2011-03-19 10:34 UTC: Updated information is available in a new blog post. In brief: The data collection I started manually with the spreadsheet this post was about is now done automatically and available for download. However, people still work with the spreadsheet and have entered additional data sources.

Update 2011-03-22 14:18 UTC: Someone changed the settings for the original crowdsourced spreadsheet so that only a few people can edit now. I decided to leave it that way and updated the meta-information in the sheet so that people don’t waste their time with manually entering stuff, fighting trolls or looking at incomplete data.

Discuss (18 comments)