Marian Steinbach: Blog

The quantified occasional runner


I have a geeky personal long-term project going on: I run to create data. Of course, I also run to become a fitter person. Actually, I run to become a more persistent mountain biker as well, and it takes less time to train endurance by running than by mountain biking. As a consequence, I have been running more or less consistently for about a year now. (I ran before, but that was around 2003, and I didn’t collect data back then.)

By now, I have collected data for about 80 runs. Here is when I ran, and how far.

About my data collection

I wear a Zephyr HxM heart rate sensor on a chest belt, which also collects cadence data. This data is recorded by the SportsTracker app on my Android phone. The app also records, of course, position over time via the phone’s GPS sensor. Over the course of my data gathering, two different Android phones have done the job: an HTC Desire and a Samsung Galaxy SII.

Very important to me: after uploading the recorded data to the website via the app, I can download the data log in GPX and CSV format. The CSV files contain one sample every 10 seconds, with every sample consisting of a time stamp, geo position (latitude, longitude), altitude, cadence (current steps per minute), heart rate (current beats per minute), and current speed.
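As a rough illustration, such a log can be reduced to the per-run averages used later in this post. This is only a sketch: the column names (`heart_rate`, `speed_kph`) and the three sample rows are hypothetical and will probably not match the real export exactly.

```python
import csv
import io

# Hypothetical excerpt of a run log, one sample every 10 seconds.
# Column names and values are invented for illustration only.
SAMPLE_CSV = """time,latitude,longitude,altitude,cadence,heart_rate,speed_kph
2012-09-24T07:00:00,50.94,6.96,55,80,140,8.5
2012-09-24T07:00:10,50.94,6.96,55,82,144,9.0
2012-09-24T07:00:20,50.94,6.96,56,82,148,9.5
"""


def run_averages(csv_text):
    """Return (average heart rate, average speed) over all samples of a run."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    avg_hr = sum(float(row["heart_rate"]) for row in rows) / len(rows)
    avg_speed = sum(float(row["speed_kph"]) for row in rows) / len(rows)
    return avg_hr, avg_speed


avg_hr, avg_speed = run_averages(SAMPLE_CSV)
print(f"average heart rate: {avg_hr:.1f} bpm, average speed: {avg_speed:.1f} kph")
```

Because the samples are evenly spaced in time, a plain mean over the samples is a time-weighted average.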

The SportsTrackLive web platform gives me additional data for a specific run, with three values being of special interest to me: overall number of heart beats recorded, temperature and humidity (though I don’t know whether for start time, end time or somewhere in between). The web interface also displays the total ascent in meters. (Of course, since GPS altitude data is very rough, all altitude data has to be taken with a grain of salt.)

My running

The graph above already tells you a bit about my running and how it has developed. I started in July 2011, slowly but steadily increasing the distance (and duration, of course) from somewhere around 4 kilometers. Invisible to you, but available in my memories: Some of these dots in August 2011 happened in Italy, where I explored parts of Piemonte and Liguria in running shoes. Hot, slow, gorgeous! One month later, in September 2011, I exceeded 10 kilometers for the first time. The plot above also reveals that I frequently ran about 8 kilometers in September and October of 2011, simply because I had a favourite route and fell into a routine.

Then, in winter 2011/2012, I got lazy. Cold temperatures and a lack of sunlight in the morning tend to make it hard for me to motivate myself. Therefore I almost had to start over in March 2012. Luckily, it didn’t take me as long as the first time to get back to my 8-kilometer home loop. I kept running throughout the summer, sometimes increasing the distance up to 17 kilometers (which feels pretty cool when you feel like it and you can simply do it!).

The plot above again shows all the runs over the course of 15 months. Instead of distance, the vertical position of the dots now reflects the average speed during that run. It’s visible that I increased my speed more or less steadily over the two training periods (July-October 2011 and March-September 2012), with one exception: This year, it seems as if I started faster (more than 8 kph) and then became slower on average in May, to become faster again after that. Comparing the two periods visually, it also seems like I increased speed faster last year, while also starting from a lower level.

So my runs became longer and faster. Judging from both speed and distance, one could say that I increased the intensity of my running during the course of the year – which was by no means accidental.

With increased performance, of course, the heart has to pump more oxygen to the muscle fibres. This, not surprisingly, is reflected by the average heart rate of each run.

This looks similar to the speed plot above, doesn’t it? Let’s draw a scatterplot of heart rate over speed to learn more about their correlation:

This shows a pretty high positive correlation between speed and heart rate (r=0.74). As a background for those comparing their own data with mine: My minimum heart rate at rest is about 55 bpm, and the highest heart rate I have ever measured on myself was 199 bpm, when I deliberately ran uphill last year to measure my max heart rate. (Be sure to have a napkin with you if you have to puke!)
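For those who want to compute such a correlation on their own runs: the value r is presumably Pearson's correlation coefficient. A minimal sketch of that calculation follows. The five data points are hypothetical placeholders, not my actual runs.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical per-run averages (speed in kph, heart rate in bpm).
speeds = [8.0, 8.5, 9.0, 9.5, 10.0]
heart_rates = [138, 145, 143, 150, 152]

print(f"r = {pearson_r(speeds, heart_rates):.2f}")
```

A value near +1 means heart rate rises almost in lockstep with speed, a value near 0 means no linear relationship.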

The aim of my training should, of course, be to gain fitness. Fitness is a complex construct, and I don’t know a lot about it, but one thing people say is: If you’re fit, you can run the same intensity (=speed) with a lower heart rate than if you’re not. So, the research question here is:

Have I become more fit during my 15 months of running?

This question cannot be answered by looking at one single measure within the data. The average heart rate during my runs (and probably during yours as well) is influenced by the running intensity. This in turn is influenced mostly by my running speed, but also by the amount of ascent, and maybe by even more. What about temperature and humidity? Or anything else?

I ran a linear regression with the following predictors, trying to predict the average heart rate:

  • Sequence number of the run (first=1 to last=82)
  • Length of the run (distance in kilometers)
  • Climb (ascent in meters)
  • Time of day (hour of the start time, 6 to 21)
  • Average speed (in kilometers/hour)
  • Temperature (in degrees Celsius)
  • Relative Humidity (percentage between 0 and 100)

Here is the result:

4.879 * avg_speed +
0.2939 * timeofday +
-0.2124 * temperature +
-0.1084 * humidity +
0.0337 * climb +

The regression model has been created and tested in Weka, using 10-fold cross-validation as a verification method. More details: The correlation coefficient is 0.7525, the mean absolute error is 3.7692.
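To illustrate what cross-validation does, here is a minimal sketch with a single predictor: each sample is held out once, the line is fitted on the remaining samples, and the error is measured only on the held-out ones. This is not Weka's implementation (which differs in how it splits the data and in the regression variant it fits), and the data is synthetic and noise-free, chosen only so that the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def cross_validated_mae(xs, ys, folds=10):
    """Mean absolute error on held-out samples, using k-fold cross-validation."""
    errors = []
    for k in range(folds):
        # In round k, every sample whose index is k modulo `folds` is held out.
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % folds != k]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % folds == k]
        slope, intercept = fit_line(*zip(*train))
        errors += [abs(y - (slope * x + intercept)) for x, y in test]
    return sum(errors) / len(errors)


# Synthetic data: heart rate is an exact linear function of speed.
speeds = [8.0 + 0.1 * i for i in range(30)]
heart_rates = [100.0 + 5.0 * s for s in speeds]

print(cross_validated_mae(speeds, heart_rates))  # practically zero for noise-free data
```

With real, noisy runs the error on held-out samples is larger than the error on the training data, which is why it is the more honest number to report.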

What does it mean? It tells us that the variance in my recorded average heart rate can be explained mostly by the running speed, but in addition, to a small degree, by the hour of day (the earlier, the lower the heart rate), by temperature (the lower the temperature, the higher the heart rate), by humidity (the drier, the higher my heart rate) and finally by climb (the greater the ascent, the higher the heart rate). Actually, the correlation coefficient is only slightly greater than the one for speed alone (remember, that was 0.74). This means that the four additional variables don’t really influence the outcome a whole lot.

Note that two input variables don’t seem to have any measurable influence at all: (a) the sequence number (which kind of reflects my training experience up to that run), and (b) the length of the run. They haven’t even been included in the model by Weka.

So if it doesn’t matter to a linear regression whether I ran zero or a hundred times before, does this mean that my training is worthless? Not quite. Actually, it could mean that my fitness hasn’t increased in a linear way. Which would be reasonable, since I hardly did anything during the winter months and might have lost some.

A different approach

I want confirmation. I will bend the numbers as long as I need in order to make them tell me I’m a fitter person now.

At least, the linear regression shows that running speed is probably the most important variable influencing the heart rate during my runs. What if I create a composite variable that represents the number of heart beats required to run at a certain speed? Lower values would indicate a better fitness.

A composite like that could simply be created by dividing the average heart rate for a run by the average speed during that run. For a heart rate of 147 beats per minute and a speed of 9 kph, this would result in a value of 16.33. The same heart rate at a speed 0.5 kph faster would result in a value of 15.47. You get the idea. Let’s call this composite the Heart rate over speed (HROS).
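The two example values can be reproduced with a few lines. The second function is an added interpretation, not part of the HROS definition itself: multiplying by 60 converts the ratio into heart beats per kilometer, which may be easier to picture.

```python
def hros(avg_heart_rate_bpm, avg_speed_kph):
    """Heart rate over speed: average heart rate divided by average speed."""
    return avg_heart_rate_bpm / avg_speed_kph


def heartbeats_per_km(avg_heart_rate_bpm, avg_speed_kph):
    """Same ratio in a more tangible unit: beats per hour divided by km per hour."""
    return avg_heart_rate_bpm * 60 / avg_speed_kph


print(round(hros(147, 9.0), 2))             # 16.33
print(round(hros(147, 9.5), 2))             # 15.47
print(round(heartbeats_per_km(147, 9.0)))   # 980
```

Since the two quantities differ only by the constant factor 60, a plot of either one over time has the same shape.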

Here is the HROS over time plot:

Now this looks like the HROS represents what I have experienced: A first improvement in 2011, which was neutralized by doing nothing during the winter, and then a second improvement in 2012. It looks like, on average, I now need fewer heart beats to run at a certain speed than I ever did. Cool! And it tells me that I should try to run more during winter, aiming to at least keep my fitness level.

That was fun! Have you gathered data yourself? Does the HROS work for you? Let me know!

Update Sept. 27, 2012

Although it wasn’t my goal to predict my heart rate for future runs, I let R create another linear regression model, like the one I built in Weka. It is quite different at first sight. Here is the output of the summary.lm function in R:

    Min      1Q  Median      3Q     Max 
-8.2261 -3.0338 -0.6747  2.5519 17.8290 

               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   858.81122  648.53753   1.324   0.1905    
avg_speed       6.32242    0.93682   6.749 7.19e-09 ***
id              0.23391    0.26434   0.885   0.3798    
datetime       -0.05003    0.04257  -1.175   0.2446    
dayofyear      -0.03446    0.01531  -2.251   0.0281 *  
length_km      -0.11381    0.27781  -0.410   0.6835    
timeofday       0.24620    0.18753   1.313   0.1943    
temperature    -0.09219    0.13920  -0.662   0.5104    
humidity       -0.07576    0.05422  -1.397   0.1676    
climb           0.04963    0.01954   2.540   0.0137 *  
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 4.543 on 59 degrees of freedom
Multiple R-squared: 0.709,	Adjusted R-squared: 0.6646 
F-statistic: 15.97 on 9 and 59 DF,  p-value: 7.439e-13

The great thing about this summary is that it tells us how relevant each variable is for the outcome, mainly in the column labeled “Pr(>|t|)”. It contains, as far as I understand, the p-value, which is known from Null Hypothesis Significance Testing (NHST). A low p-value stands for a high significance of the variable. This is also indicated by the stars on the right side of the table. As we can see, only three variables have a notable significance level (p < 0.05). The average speed stands out in particular, as it already did in the model created by Weka. Also interesting is the selection of the other two variables: Where Weka chose timeofday, temperature and humidity, R thinks we should instead look at the dayofyear (which is the date of the run as a number between 1 and 366). Note that the coefficient is negative, which implies a negative correlation: The more time passed in the year, the lower my heart rate. How nice, as it suggests the existence of the training effect I am looking for.
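Two other columns of the summary can be cross-checked by hand, which helps in reading the table. The sketch below relies on two standard relationships: the t value is the estimate divided by its standard error, and the adjusted R-squared follows from R-squared, the residual degrees of freedom and the number of predictors. The input numbers are copied from the output above; the function names are my own.

```python
def t_value(estimate, std_error):
    """t statistic of a coefficient: estimate divided by its standard error."""
    return estimate / std_error


def adjusted_r_squared(r_squared, residual_df, n_predictors):
    """Adjusted R-squared, penalizing R-squared for the number of predictors."""
    # Observations used in the fit = residual df + predictors + 1 (intercept).
    n_observations = residual_df + n_predictors + 1
    return 1 - (1 - r_squared) * (n_observations - 1) / residual_df


print(round(t_value(6.32242, 0.93682), 3))           # avg_speed: 6.749
print(round(t_value(-0.03446, 0.01531), 3))          # dayofyear: -2.251
print(round(adjusted_r_squared(0.709, 59, 9), 4))    # 0.6646
```

The larger the absolute t value, the smaller the p-value in the last column, which is why avg_speed gets three stars.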

As suggested by the German Wikipedia page on linear regression, I eliminated the non-significant variables from the model. The output then is:

    Min      1Q  Median      3Q     Max 
-9.3019 -2.6789 -0.8496  2.9888 15.9475 

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.889716   4.627295  21.155  < 2e-16 ***
avg_speed    5.448352   0.582350   9.356 2.48e-14 ***
climb        0.050204   0.014406   3.485 0.000816 ***
dayofyear   -0.016825   0.007508  -2.241 0.027907 *  
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 4.662 on 77 degrees of freedom
Multiple R-squared: 0.6228,	Adjusted R-squared: 0.6081 
F-statistic: 42.38 on 3 and 77 DF,  p-value: 2.797e-16

Boiled down to only three predictors (plus the intercept), the significance of the remaining ones increases. The most significant one is still the average running speed (avg_speed). The coefficient of ~5.5 means that when I run 1 kilometer per hour faster, my average heart rate will be about 5.5 beats per minute higher. The second most significant variable is the number of meters ascended (climb). Here, for every 100 meters of additional ascent, my average heart rate will increase by 5 beats per minute (0.050204 * 100). The last one means: two months (60 days) from now, my average heart rate will be 1 beat per minute lower than today, given the other variables are the same (-0.016825 * 60). Let’s note that model in a more concise form:

 5.448352 * avg_speed +
-0.016825 * dayofyear +
 0.050204 * climb +
97.889716
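The reduced model is easy to turn into a small function, which also makes the reading of the coefficients concrete. The coefficients and the intercept are copied from the R output above. The function name and the reference run (9 kph, 50 m climb, day 200) are arbitrary choices for illustration.

```python
# Coefficients of the reduced model, copied from the R summary above.
INTERCEPT = 97.889716
COEF_AVG_SPEED = 5.448352
COEF_CLIMB = 0.050204
COEF_DAYOFYEAR = -0.016825


def predict_heart_rate(avg_speed_kph, climb_m, day_of_year):
    """Predicted average heart rate (bpm) according to the reduced model."""
    return (INTERCEPT
            + COEF_AVG_SPEED * avg_speed_kph
            + COEF_CLIMB * climb_m
            + COEF_DAYOFYEAR * day_of_year)


baseline = predict_heart_rate(9.0, 50, 200)  # arbitrary reference run

print(round(predict_heart_rate(10.0, 50, 200) - baseline, 2))  # +1 kph:    5.45
print(round(predict_heart_rate(9.0, 150, 200) - baseline, 2))  # +100 m:    5.02
print(round(predict_heart_rate(9.0, 50, 260) - baseline, 2))   # +60 days: -1.01
```

Because the model is linear, these differences do not depend on the reference run chosen.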

Since I originally published the post, I have been out running twice. The first of the two runs happened in a mood like “Let’s see how fast I can run on my 8 km track.” This resulted in me running faster than ever, but ending 1 km ahead of my finish, overpaced. The second run was more recreational. Here are the numbers:

Day of Year  Time of Day  Distance  Average Speed  Ascent  Temperature  Humidity  Average Heart Rate
268          7            7.26 km   11.05 kph      60 m    16 °C        77 %      169 bpm
271          7            8.24 km   8.9 kph        70 m    12 °C        88 %      143 bpm

Could those two heart rate values have been predicted by the linear models above? Plugging the values into the first model (by Weka), the results would have been 154.5 bpm (off by 14.5 bpm) and 144.0 bpm (off by 1). The simplified model by R would have predicted values of 156.6 bpm (off by 12.4) and 145.3 bpm (off by 2.3).
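These predictions can be recomputed from the published coefficients. One caveat: the intercept of the Weka model is not shown in this post, so the sketch below infers it from the stated 154.5 bpm prediction for the fast run. That inferred value is an assumption, and it is only as precise as the rounded prediction it is derived from.

```python
# The two new runs, as listed in the table above.
RUNS = [
    dict(avg_speed=11.05, timeofday=7, temperature=16, humidity=77,
         climb=60, dayofyear=268, measured=169),
    dict(avg_speed=8.9, timeofday=7, temperature=12, humidity=88,
         climb=70, dayofyear=271, measured=143),
]

# Coefficients as published in this post.
WEKA = dict(avg_speed=4.879, timeofday=0.2939, temperature=-0.2124,
            humidity=-0.1084, climb=0.0337)
R_REDUCED = dict(avg_speed=5.448352, climb=0.050204, dayofyear=-0.016825)
R_INTERCEPT = 97.889716


def weighted_sum(coefficients, run):
    """Sum of coefficient * value over all predictors of a model."""
    return sum(coef * run[name] for name, coef in coefficients.items())


# ASSUMPTION: the Weka intercept is inferred from the stated prediction of
# 154.5 bpm for the first run; it is not given in the model listing.
weka_intercept = 154.5 - weighted_sum(WEKA, RUNS[0])

for run in RUNS:
    weka_prediction = weka_intercept + weighted_sum(WEKA, run)
    r_prediction = R_INTERCEPT + weighted_sum(R_REDUCED, run)
    print(f"measured {run['measured']} bpm, "
          f"Weka {weka_prediction:.1f} bpm, R {r_prediction:.1f} bpm")
```

With the intercept inferred this way (roughly 108.25), the Weka prediction for the slow run comes out at about 144.0 bpm, which matches the figure quoted above and suggests the inferred value is consistent with the original model.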

While both models are pretty close when predicting the actual result of the slow run, both fail miserably to predict the fast run. That’s not a big surprise, considering that the models have been built based on data with a maximum heart rate of 157 bpm and a max average speed of 10.4 kph. Remember that linear-looking scatterplot of heart rate over speed? Here is how it looks now:

The blue line is the regression line, the fast run is marked in red, for your convenience. The slow run blends nicely with the rest and actually resides a bit below the regression line.

To me as a runner, this leads to an interesting conclusion (which others might have come to in different ways): I can’t simply expect to be able to run 11 kph over the course of 8 kilometers at a bearable heart rate of 159, only from looking at previous slower runs.

In contrast to what the scatterplot looks like, the relationship between speed and heart rate might very well not be linear, but – for example – quadratic. For a motor vehicle, the relationship between speed and force required is quadratic, due to the effect of air resistance. A source suggests that air resistance would have little influence on slow running like mine, since runners at about 20 kph see only 8 percent influence of air resistance. But what about the relationship between force required and heart rate? Is it non-linear?

And then there is the concept of the aerobic threshold. It implies that there is a point on the intensity scale where the linear relationship between intensity and heart rate ends.

However, I have a very concrete goal now: to draw a dot into that scatterplot above, at the position of x=11 and y=160. Let’s see how close I can get there this year.


Oliver Wrede on 2012/09/24 at 12:07h GMT:

Very nice!

Michael Pavey on 2012/09/26 at 04:17h GMT:

Fun piece of research! I’m currently studying statistics, so this is a great source of inspiration :)

Marian Steinbach on 2012/09/27 at 13:53h GMT:

Michael (and Oliver, too), nice to meet you here! Note that I updated the post after learning a bit more about R’s regression functions and also adding two more runs to the data.

Tamim on 2012/10/03 at 21:26h GMT:

I probably won’t go to the trouble of reading the article :) But somehow, respect, man.

Alessandro on 2012/11/30 at 12:33h GMT:

A critical remark:

What matters in the end is “Adjusted R-squared: 0.6081”.
You have found variables that do have an influence, but they explain only about 60.81% of the variance of the target variable. In other words, there are still variables with a strong influence that are not being considered.

For example, interactions may play a role.

And for variable selection I can recommend the step() command in R. It should look roughly like this:
step(Model, direction="backward", k=log(n)) where n is the length of the data

W. on 2013/04/16 at 21:42h GMT:

Interesting, I did a very similar calculation myself a while back. Have you tried to scale your HROS with distance somehow? Since you are a fitter person if you can keep up a certain speed over a longer distance, this should also be a factor. There are calculators for estimating best times, from 10K times to half-marathons … The model uses some small coefficient for that, but it would make you look fitter since your later distances are greater than your first runs…
