
Getting Started with GoldenCheetah OpenData

In this post I'm going to explain what the GoldenCheetah OpenData project is and how you can work with the data it has collected using Jupyter notebooks.

GoldenCheetah OpenData Project

Large collections of sports workout data are generally not open to the general public. Popular sites like Strava, TodaysPlan and TrainingPeaks collect large volumes of athlete data but, quite rightly, do not publish it publicly. Yet there is a growing appetite for such data, both to inform the development of new tools and to feed models and machine learning algorithms.

So I started a project to address this: the GoldenCheetah OpenData project. My first priority was to make sure we did the right thing, in the right way, to protect user privacy and comply with GDPR. As a result, we anonymise all the data before it leaves GoldenCheetah, removing personally identifiable information and personal metadata. Crucially, we get the user's explicit consent before sharing anything (and offer options to revoke that consent too).

So, in April 2018, the 3.5 development release of GoldenCheetah started asking users if they would share their data publicly. As of November 2018, over 1,300 users have said 'yes' and shared over 700,000 workouts.

The shared data is posted publicly both on an S3 bucket, which you can explore and download via a browser, and on a project on the Open Science Framework.
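If you would rather script against the bucket than click around in a browser, S3 exposes a public listing API. Below is a minimal sketch; it assumes the bucket allows anonymous listing, and the bucket address is inferred from the notebook link later in this post:

# List the first page (up to 1,000 keys) of the public OpenData bucket.
# Assumes the bucket permits anonymous ListObjectsV2 requests.
import requests
import xml.etree.ElementTree as ET

BUCKET_URL = "http://goldencheetah-opendata.s3.amazonaws.com/"

resp = requests.get(BUCKET_URL, params={"list-type": "2"})
resp.raise_for_status()

# S3 responds with XML in a fixed namespace
ns = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}
root = ET.fromstring(resp.content)
for key in root.findall(".//s3:Key", ns):
    print(key.text)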

A library with all the book titles erased

In early May 2018 I posted a tweet announcing the availability of the data, expecting lots of folks to clamour to get hold of it and trigger a flurry of startling new insights and analysis from this treasure trove of information.

That wasn't quite what happened.

The problem, of course, is that all the data was hidden away in a gazillion zip files: a huge collection of raw data that was almost impossible to navigate.

We needed to provide tools and extracts of the data to get folks started.

Generated CSV datasets

To get things started I developed some Python programs that read through all the raw data and generate comma-separated values (CSV) files folks can work with. Those scripts run on the same server that receives and posts the raw data to the OSF and S3 buckets.

There are three main CSV files so far, all focused primarily on power data (a short loading example follows the list):
  • athletes.csv - one line per athlete (1,300 or more), providing athlete bio data such as gender and age, along with career PBs for the most popular power metrics.
  • activities.csv - one line per activity (700k or more), providing the same metrics as above, but for each workout.
  • activities_mmp.csv - one line per activity (700k or more), listing peak power bests for durations from 1 to 36,000 seconds.
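As a taster, here is a minimal sketch of pulling those extracts into pandas. The filenames assume you have already downloaded the three files into your working folder:

# Load the generated CSV extracts into pandas dataframes.
# Download the files from the S3 bucket or OSF project first.
import pandas as pd

athletes = pd.read_csv("athletes.csv")
activities = pd.read_csv("activities.csv")
mmp = pd.read_csv("activities_mmp.csv")

print(athletes.shape, activities.shape, mmp.shape)
print(athletes.head())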
As part of validating the datasets I started to plot the data and explore the values. It became clear, really quickly, that some of the data was of poor quality. Not everyone is as particular about their data as I am. Who knew?

Clearly I needed to do some data profiling to understand the data better. This would then help to generate rules for data editing and cleansing to get rid of some of the dirt.

I spent a good few weeks playing with the data and ended up creating two spreadsheets that summarised the distributions of power values for different durations. These power profiles are also published online:
 Power Duration Profile Spreadsheet

Armed with this analysis I could see that much of the data is normally distributed, so I could calculate percentile values and create probability density functions.

Crucially, these insights define upper and lower bounds to help identify bad data. In later analyses they will help to determine the plausibility and likelihood of model outputs (i.e. is it really possible that an 80kg 50-year-old bloke is capable of generating 450W for an hour?).
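As a sketch of that idea, here is how you might fit a normal distribution to a single duration column and flag rows falling outside a percentile band. The column name is my own invention, so check the CSV header for the real ones:

# Flag implausible values using percentile bounds from a fitted
# normal distribution. 'peak_3600s' (1 hour peak power) is a
# hypothetical column name.
import pandas as pd
from scipy import stats

mmp = pd.read_csv("activities_mmp.csv")
values = mmp["peak_3600s"].dropna()

mu, sigma = stats.norm.fit(values)
lower = stats.norm.ppf(0.001, mu, sigma)  # 0.1st percentile
upper = stats.norm.ppf(0.999, mu, sigma)  # 99.9th percentile

bad = mmp[(mmp["peak_3600s"] < lower) | (mmp["peak_3600s"] > upper)]
print(f"bounds: {lower:.0f}W to {upper:.0f}W, flagged {len(bad)} rows")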

Critical Power Probability Density Function aka "CP Distribution"

Python Notebooks for working with the data

So, armed with these rules, I started to explore the data using a Jupyter Python notebook.

Using the notebook I could load in the CSV files, edit and clean the data, then wrangle it into different structures and do some basic plots to describe and visualise it.
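A minimal version of that load, clean and plot loop looks something like this; the column name and the plausibility bounds are my assumptions rather than anything baked into the files:

# Load the activities extract, drop rows outside rough plausibility
# bounds, and plot a histogram. 'average_power' is an assumed column
# name; check the CSV header for the real one.
import pandas as pd
import matplotlib.pyplot as plt

activities = pd.read_csv("activities.csv")
clean = activities[activities["average_power"].between(30, 500)]

clean["average_power"].hist(bins=50)
plt.xlabel("Average power (watts)")
plt.ylabel("Activities")
plt.title("Distribution of average power across workouts")
plt.show()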

Over the course of November 2018 I spent a few hours each weekend playing with the data and tweeting what I'd found.

Jupyter Notebook and 3000 odd athlete season MMP curves

I started to get a much better feel for the quality of the data, and some of the tweets I posted generated a lot of discussion. Some of those discussions got quite heated. My standard refrain to such criticism was "hey, it's public data, the notebooks are online, go look for yourself".

Which leads me to this post: getting started with GoldenCheetah and OpenData.

Installing Python and getting a Jupyter Notebook Setup

Step one: install Python 3

You will need to install Python 3, the language all the code is written in. This can be done by downloading the installer from the download page on the python.org website.

Instructions for each platform are described on the website page.

Step two: launch Python and install key Python packages

You need to make sure Python is on your path. Once that is done, open a 'CMD' prompt and upgrade pip, which we will use to install the key dependencies:

>  python -m pip install --upgrade pip

Once we have that resolved we can install all the dependencies, one at a time in case we have issues:

> python -m pip install pandas
> python -m pip install numpy
> python -m pip install requests
> python -m pip install python-dateutil
> python -m pip install scipy
> python -m pip install statsmodels
> python -m pip install lmfit
> python -m pip install matplotlib

Note that math, random, datetime, io and zipfile ship with Python's standard library, so there is nothing to install for them.


Now we can install Jupyter and get cracking

> python -m pip install jupyter

Step three: Set up a folder to work in and get the sample notebook

You can put this anywhere; I'm choosing to use C:/opendata
> cd C:/
> mkdir opendata
> cd opendata

Open your browser and download the example notebook from:
http://goldencheetah-opendata.s3.us-east-1.amazonaws.com/notebooks/BasicOpenDataNotebook.ipynb
and save it to the C:/opendata folder you just created.
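Alternatively, the requests package installed earlier can fetch the notebook for you:

# Download the example notebook into the current folder.
import requests

URL = ("http://goldencheetah-opendata.s3.us-east-1.amazonaws.com/"
       "notebooks/BasicOpenDataNotebook.ipynb")

resp = requests.get(URL)
resp.raise_for_status()
with open("BasicOpenDataNotebook.ipynb", "wb") as f:
    f.write(resp.content)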

Step four: Launch Jupyter and open the notebook

> cd C:/opendata
> jupyter notebook

A browser window will open, from which you can select the notebook you just downloaded.

Working with the jupyter notebook

Once the notebook is loaded you will see a page of cells, mixing code with commentary, output and plots.
From the Kernel menu select 'Restart & Run All'.

Each cell will then be executed and its output refreshed. The notebook downloads data along the way and generates a series of plots.

From here you now have a notebook to play with.

Have fun !

Where is this going?

We have initiated a development project for working with the OpenData files. Part of this will be a Python module to fetch, parse and wrangle the data. It is likely this will be released by the end of Q1 2019.

When it is released check back as there is likely to be another blog post with some examples for using it to create your own versions of the athletes.csv, activities.csv and activities_mmp.csv files, amongst other things.
