Analysing Hermannslauf 2016 results in R using tidyverse packages

This year, I participated in the Hermannslauf, a public run from Detmold to Bielefeld. Since then, I wanted to have a look at the results in more detail.

The data can be found here. My tool of choice is R, a programming language for statistical analyses and data visualisation. I am using tidyverse packages (more information about the tidyverse can be found on its homepage http://tidyverse.org) for the analysis, a collection of packages designed to work together.

How to get the data

During the run, I tracked the route with Meerun. The resulting .gpx file can be loaded and processed with R. I used import functions I wrote for a different project (more on this topic in a future article) to load and plot the track.

Map showing the Hermannslauf route

The results can be retrieved as .json elements. I will not go into detail about the retrieval today but rather focus on the analysis.

url <- "http://www.davengo.com/event/result/45-hermannslauf-2016/search/list"
start_offset <- 0
end_offset <- 5625
all_data <- tibble::data_frame()
for (i in seq(start_offset, end_offset, 125)) {
    body <- list(
        type = "simple",
        term = NULL,
        category = "31,1 km Hermannslauf",
        offset = i
        )
    response <- httr::POST(url, body = body, encode = "json")
    httr::stop_for_status(response)
    raw_data <- httr::content(response)
    raw_data <- raw_data$results
    if (length(raw_data) == 0) {
        writeLines(c("Stopped at i = ", i), sep = "")
        return(all_data)
    }
    all_data <- c(all_data, list(raw_data))
    }

The data are now buried in a list of lists of lists and can be processed further.

Data tidying

After all the data are retrieved, it is easy to unwrap the present and have a peek at the content:

df <- flatten(raw_data) %>%
bind_rows() %>%
select(-hash, -combined) %>%
mutate(Gender = if_else(is.na(rankFemale), "m", "f")) %>%
select(-rankFemale, -rankMale)
names(df) <- c("Team", "Number", "Name", "Age", "Rank", "Time", "First_name", "Rank_age_group", "Gender")
df$Time <- as.POSIXct(df$Time, format = "%H:%M:%S", origin = "2016-04-30")
df$Age <- as.factor(df$Age)
levels(df$Age) <- c(30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, "H", "J", 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, "H", "J")
df$Age <- factor(df$Age, levels = c("J", "H", "30", "35", "40", "45", "50", "55", "60", "65", "70", "75", "80"), ordered = TRUE)
df <- select(df, Name, First_name, Time, Rank, Rank_age_group, Age, Gender, Number, Team)
df

This is our data frame with which I will be answering all my questions:

## # A tibble: 5,723 × 9
##         Name First_name                Time  Rank Rank_age_group   Age
##        <chr>      <chr>              <dttm> <int>          <int> <ord>
## 1       Name      Name 2016-11-12 01:49:10     1              1    35
## 2       Name      Name 2016-11-12 01:51:02     2              1     H
## 3       Name      Name 2016-11-12 01:52:09     3              2    35
## 4       Name      Name 2016-11-12 01:52:30     4              3    35
## 5       Name      Name 2016-11-12 01:52:40     5              2     H
## 6       Name      Name 2016-11-12 01:55:42     6              3     H
## 7       Name      Name 2016-11-12 01:56:04     7              4     H
## 8       Name      Name 2016-11-12 01:56:32     8              1    45
## 9       Name      Name 2016-11-12 01:56:35     9              5     H
## 10      Name      Name 2016-11-12 01:56:51    10              2    45
## # ... with 5,713 more rows, and 3 more variables: Gender <chr>,
## #   Number <int>, Team <chr>

The age groups H and J stand for Haupt and Jugend and probably contain ages 20 – 24 and 25 – 29, respectively. All the other age groups represent an age range of five years, e.g. the group 30 contains runners from 30 to 34 years.

Analysis

My results

Now, that all the data are in one tidy data frame the analysis can begin. First, let’s look at my results. How well did I fare? For all the questions the tidy data frame can be »queried«.

df %>% filter(Name == "Rüßler")

## # A tibble: 1 × 9
##     Name First_name                Time  Rank Rank_age_group   Age Gender
##    <chr>      <chr>              <dttm> <int>          <int> <ord>  <chr>
## 1 Rüßler     Martin 2016-11-12 03:05:07  2906            367    30      m
## # ... with 2 more variables: Number <int>, Team <chr>

So, I am on rank 2906 of 5723. Not too bad for my first run, I think.

All results

How does it compare to the others?

ggplot(data = df, aes(x = Time, colour = Gender, fill = Gender)) +
annotate("text", x = median(df$Time), y = 105, label = "Median") +
geom_segment(aes(x = median(df$Time), y = 0, xend = median(df$Time), yend = 100), colour = "black") +
geom_histogram(binwidth = 60) +
labs(title = "Distribution of running time", x = "Time", y = "Count")

"Distribution of running time"

Almost exactly median time!

Percentage of women

We can also see that far more men than women participated in the run. How many exactly?

df %>% group_by(Gender) %>%
summarise(Count = n())

## # A tibble: 2 × 2
##   Gender Count
##    <chr> <int>
## 1      f  1212
## 2      m  4511

Age

What about the age distribution? What ages are participating the most?

df %>% group_by(Age, Gender) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Age, y = Count)) +
geom_bar(aes(fill = Gender), stat = "identity", position = "dodge") +
labs(title = "Distribution of age")

Age distribution

This graph draws a highly different picture for men and women. While the biggest group of female runners is the Haupt group, men between 45 and 54 are most frequent among the male runners. Is the midlife crisis driving men to participate in athleticism?

Number

Another question I was asking myself is whether the number is an indicator of the individual’s finishing place, i.e. whether the fastest runners also enrol earlier for the Hermannslauf than others. This assumes, of course, that the number is assigned in order of enrolment. First, a plot of number versus time to get an impression of the data.

df %>% group_by(Gender) %>%
ggplot(aes(x = Number, y = Time)) +
geom_point(size = 0.5) +
geom_smooth() +
labs(title = "Running time vs. number", x = "Number", y = "Time")

"Running time vs. number"

The package Intubate provides standard R functions with pipe support. In this case, I can use lm.

library(broom)
library(intubate)
df %>% ntbt_lm(Number ~ Rank) %>%
tidy()

##          term     estimate   std.error statistic    p.value
## 1 (Intercept) 3.314367e+03 51.79919801 63.984901 0.00000000
## 2        Rank 3.107841e-02  0.01567484  1.982694 0.04744936

Apparently, there is no correlation between the number and final rank. But there is a small cluster right at the beginning. A closer look reveals that this group of men were really fast in both regards, running and enrolment:

df %>% filter(Number <= 15, Gender == "m") %>%
ggplot(aes(x = Number, y = Time)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm") +
labs(title = "Running time vs. number (detail)")

Running time vs. number

df %>% filter(Number <= 15, Gender == "m") %>%
ntbt_lm(Number ~ Rank) %>%
tidy()

##          term    estimate   std.error statistic      p.value
## 1 (Intercept) 7.476406711 1.068633924  6.996228 6.351134e-05
## 2        Rank 0.006382947 0.002837795  2.249263 5.106469e-02

Most of these runners from the cluster are members of the TSVE 1890 Bielefeld, which organised the run.

Teams

I was also interested in the teams. How many runners run without a team? And what are the biggest teams?

df %>% group_by(Team, Gender) %>%
summarise(Count = n()) %>%
arrange(desc(Count))

## Source: local data frame [1,369 x 3]
## Groups: Team [1,161]
##
##                        Team Gender Count
##                       <chr>  <chr> <int>
## 1                                m  2161
## 2                                f   521
## 3       TSVE 1890 Bielefeld      m   149
## 4          LG Oerlinghausen      m    68
## 5       TSVE 1890 Bielefeld      f    67
## 6        Laufspass SW Sende      m    60
## 7        Laufspass SW Sende      f    42
## 8                     CLAAS      m    30
## 9  Active Sportshop Herford      f    27
## 10 Active Sportshop Herford      m    26
## # ... with 1,359 more rows

2161 of 4511 men (roughly 48 %) and 521 of 1212 women (roughly 43 %) had no team affiliation. Those can be excluded with a filter to create a list containing the number of runners per team. Here, the results are not grouped by gender. I am only showing the first ten results:

df %>% group_by(Team) %>%
summarise(Team_Size = n()) %>%
arrange(desc(Team_Size)) %>%
filter(Team != "") %>%
head(n = 10)

## # A tibble: 10 × 2
##                        Team Team_Size
##                       <chr>     <int>
## 1       TSVE 1890 Bielefeld       216
## 2        Laufspass SW Sende       102
## 3          LG Oerlinghausen        94
## 4  Active Sportshop Herford        53
## 5      LC Solbad Ravensberg        38
## 6            Non-Stop-Ultra        35
## 7                     CLAAS        32
## 8              Teuto Runner        30
## 9                  LV Oelde        26
## 10                uniRenner        26

Look, there’s my team, the uniRenner! We and LV Oelde share rank 9. The TSVE 1890 Bielefeld as the organiser of the Hermannslauf is by far the largest team and its many members are actively participating in the run. This is visualised in the following bar chart:

df %>% group_by(Team) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
filter(Team != "", Count <= 15) %>%
ggplot(aes(x = Count)) +
geom_bar() +
labs(title = "Team sizes", x = "Number of teams", y = "Team size")

Team size vs. number of teams

How many teams are there in total?

df %>% group_by(Team) %>%
summarise(Team_Size = n()) %>%
arrange(desc(Team_Size)) %>%
filter(Team_Size == 1) %>%
nrow()

## [1] 738

There are many teams with just one member. This may be due to spelling errors which I have not checked for.

I hope that I could show how relatively simple it is to do an exploratory data analysis and have fun with it as well. This is but a fraction of questions to analyse and I already have some new ideas which I will discuss in a future article.

Title image by David Marcu.