
Getting Started with GoldenCheetah OpenData

In this post I'm going to explain what the GoldenCheetah OpenData project is and how you can work with the data it has collected using Jupyter notebooks.

GoldenCheetah OpenData Project

Large collections of sports workout data are rarely open to the public. Popular sites like Strava, TodaysPlan and TrainingPeaks collect large volumes of athlete data but, quite rightly, do not publish it publicly. Yet there is a growing appetite for such data, both to inform the development of new tools and to feed models and machine learning algorithms.

So I started a project to fill that gap: the GoldenCheetah OpenData project. My first priority was to make sure we did the right thing, in the right way, to protect user privacy and comply with GDPR regulations. As a result, we anonymise all the data before sending it out of GoldenCheetah, removing personally identifiable information and personal metadata. Crucially, we get the user's explicit consent to share anything (and offer options to revoke that consent too).

So, in April 2018, the 3.5 development release of GoldenCheetah started to ask users if they would share their data publicly. So far, as of November 2018, over 1300 users have said 'Yes' and shared over 700,000 workouts.

The data shared is posted publicly both on an S3 bucket you can explore and download via a browser, and via a project on the Open Science Framework.

A library with all the book titles erased

In early May 2018 I posted a tweet to announce the availability of the data, expecting lots of folks to clamour to get hold of it and trigger a flurry of startling new insights and analysis from this treasure trove of information.

That wasn't quite what happened.

The problem, of course, is that all the data was hidden away in a gazillion zip files: a huge collection of raw data that was almost impossible to navigate.

We needed to provide tools and extracts of the data to get folks started.

Generated CSV datasets

To get things started I developed some python programs that read through all the raw data and generated comma-separated values (CSV) files folks could work with. Those scripts run on the same server that receives and posts the raw data to the OSF and S3 buckets.
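As a heavily simplified sketch of the idea behind those extract scripts: walk a zip of per-workout files and emit one CSV row per workout. The real OpenData file layout and field names differ; the JSON structure and the `duration_s` / `avg_watts` fields here are placeholders for illustration only.

```python
import csv
import io
import json
import zipfile

def zip_to_csv(zip_bytes: bytes) -> str:
    """Read per-workout JSON files from a zip and return CSV text.

    Field names are illustrative, not the real OpenData schema.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["file", "duration_s", "avg_watts"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            workout = json.loads(zf.read(name))
            writer.writerow([name, workout["duration_s"], workout["avg_watts"]])
    return out.getvalue()
```

The real scripts do much more (anonymisation checks, metric calculations, aggregation across athletes), but the walk-the-zips-and-flatten pattern is the core of it.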

There are three main CSV files so far, all focused primarily on power data:
  • athletes.csv - one line per athlete (1,300 or more), providing athlete bio data such as gender and age, along with career PBs for the most popular power metrics.
  • activities.csv - one line per activity (700k or more), providing the same metrics as above, but for each workout.
  • activities_mmp.csv - one line per activity (700k or more), listing peak power bests for durations from 1 second to 36,000 seconds.
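The CSVs are straightforward to load with pandas. A minimal sketch, using made-up column names and values since the actual schema may differ; with the real files you would pass the download path or URL to read_csv instead of the inline sample:

```python
import io
import pandas as pd

# Illustrative stand-in for athletes.csv; column names are hypothetical.
sample = io.StringIO(
"""athlete_id,gender,age,peak_5s,peak_60s,peak_1200s
a1,M,42,1100,520,310
a2,F,35,900,430,265
a3,M,51,1250,560,295
""")

athletes = pd.read_csv(sample)

print(athletes.shape)                    # (3, 6)
print(athletes["peak_1200s"].mean())     # 290.0
```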
As part of validating the datasets I started to plot the data and explore the values. It became clear, really quickly, that some of the data was of poor quality. Not everyone is as particular about their data as I am. Who knew?

Clearly I needed to do some data profiling to understand the data better. This would then help to generate rules for data editing and cleansing to get rid of some of the dirt.

I spent a good few weeks playing with the data and ended up creating two spreadsheets that summarised the distributions of power values for different durations. These power profiles are also published online:
 Power Duration Profile Spreadsheet

Armed with this analysis I could see that much of the data is normally distributed; I could calculate percentile values and create probability density functions.
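As an illustration of that kind of analysis (on synthetic data, not the real OpenData sample), scipy can fit a normal distribution to a set of power values and read off percentiles from the fitted curve:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a sample of 20-minute power bests (watts).
rng = np.random.default_rng(0)
sample = rng.normal(loc=300, scale=50, size=5000)

# Fit a normal distribution, then read percentile values off the fit.
mu, sigma = stats.norm.fit(sample)
p5, p95 = stats.norm.ppf([0.05, 0.95], loc=mu, scale=sigma)

print(round(mu), round(sigma))
print(round(p5), round(p95))
```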

Crucially, these insights define upper and lower bounds to help identify bad data. In later analyses they will help to determine the plausibility and likelihood of model outputs (i.e. is it really possible that an 80kg, 50-year-old bloke can generate 450W for an hour?).

Critical Power Probability Density Function aka "CP Distribution"
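A minimal sketch of how such bounds can be applied as a cleansing rule, here using empirical percentiles on synthetic data rather than the published power profile values:

```python
import numpy as np

# Fake sample of 20-minute power bests, plus two obviously bad rows.
rng = np.random.default_rng(42)
watts = rng.normal(loc=280, scale=45, size=10_000)
watts = np.append(watts, [5, 1500])

# Derive upper and lower plausibility bounds from the distribution,
# then keep only the rows that fall inside them.
lo, hi = np.percentile(watts, [0.5, 99.5])
clean = watts[(watts >= lo) & (watts <= hi)]

print(len(watts) - len(clean), "rows rejected")
```

In practice the bounds would come from the published power profiles (per duration, per gender and so on) rather than from the sample being cleaned, but the filtering step looks the same.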

Python Notebooks for working with the data

So, armed with these rules, I started to explore the data using a Jupyter Python notebook.

Using the notebook I could load in the CSV files, then edit and clean the data, before wrangling it into different structures and producing some basic plots to describe and visualise it.
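A small sketch of that load-wrangle-plot loop, using made-up peak-power rows rather than the real activities_mmp.csv schema (the column names here are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen, no GUI needed
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative per-activity peak-power rows (long format).
mmp = pd.DataFrame({
    "duration_s": [1, 5, 60, 300, 1200] * 2,
    "watts":      [950, 820, 450, 350, 290,
                   1010, 870, 480, 360, 300],
})

# Wrangle: collapse per-activity bests into a mean MMP curve.
curve = mmp.groupby("duration_s")["watts"].mean()

# Plot: power versus duration on a log-x axis, MMP-curve style.
fig, ax = plt.subplots()
ax.plot(curve.index, curve.values, marker="o")
ax.set_xscale("log")
ax.set_xlabel("Duration (s)")
ax.set_ylabel("Mean peak power (W)")
fig.savefig("mmp_curve.png")
```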

Over the course of November 2018 I spent a few hours each weekend playing with the data and tweeting what I'd found.

Jupyter Notebook and 3000 odd athlete season MMP curves

I started to get a much better feel for the quality of the data, and some of the tweets I posted generated a lot of discussion. Some of these discussions got quite heated. My standard refrain to such criticism was "hey, it's public data, the notebooks are online, go look for yourself".

Which leads me to this post: getting started with GoldenCheetah and OpenData.

Installing Python and getting a Jupyter Notebook Setup

Step one: install Python 3

You will need to install Python 3, the language all the code is written in. This can be done by downloading the installer from the download page on the python.org website.

Instructions for each platform are described on the website page.

Step two: launch python and install key python packages 

You need to make sure python is in your path. Once that is done, open a 'CMD' prompt and run python to upgrade pip, the package installer we will use for the key dependencies:

>  python -m pip install --upgrade pip

Once we have that resolved we can install the third-party dependencies, one at a time in case we hit issues. (Note that math, random, datetime, io and zipfile, which the notebook also uses, are part of the Python standard library and do not need to be installed.)

> python -m pip install pandas
> python -m pip install numpy
> python -m pip install requests
> python -m pip install python-dateutil
> python -m pip install scipy
> python -m pip install statsmodels
> python -m pip install lmfit
> python -m pip install matplotlib


Now we can install Jupyter and get cracking

> python -m pip install jupyter

Step three: Set up a folder to work in and get the sample notebook

You can put this anywhere; I'm choosing to use C:/opendata
> cd C:/
> mkdir opendata
> cd opendata

Open your browser and download the example notebook from:
http://goldencheetah-opendata.s3.us-east-1.amazonaws.com/notebooks/BasicOpenDataNotebook.ipynb
and save to the C:/opendata folder you just created.
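If you prefer, the same download can be scripted using the requests package installed earlier. A small sketch (the URL is the one above; the download helper is just for illustration):

```python
import requests

NOTEBOOK_URL = ("http://goldencheetah-opendata.s3.us-east-1.amazonaws.com"
                "/notebooks/BasicOpenDataNotebook.ipynb")

def download(url: str, dest: str) -> None:
    """Download url to the local file dest, raising on HTTP errors."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open(dest, "wb") as f:
        f.write(resp.content)
```

Calling `download(NOTEBOOK_URL, "BasicOpenDataNotebook.ipynb")` from inside the C:/opendata folder saves the notebook there.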

Step four: Launch jupyter and open notebook

> cd C:/opendata
> jupyter notebook

A browser window will open, from which you can select the notebook you just downloaded.

Working with the jupyter notebook

Once the notebook is loaded you will see something like this:



From the Kernel menu you should select 'Restart & Run All'.

Each cell will then be processed and the output will be updated. The notebook downloads data along the way and creates the plots above and others.

From here you now have a notebook to play with.

Have fun !

Where is this going?

We have initiated a development project for working with the OpenData files. Part of this will be a python module that will fetch, parse and wrangle the data. It is likely this will be released by the end of Q1 2019.

When it is released check back as there is likely to be another blog post with some examples for using it to create your own versions of the athletes.csv, activities.csv and activities_mmp.csv files, amongst other things.
