Over January 2019 I implemented the Banister model in GoldenCheetah, and along the way I learned a little about its strengths and weaknesses.
This post is about that: explaining the Banister model and how it relates to the PMC, how it has been implemented in GoldenCheetah and what its limitations are. I've also added a bit at the end covering some of the things I'm looking to do next, from potential model improvements through to deep learning.
In some ways this post is a longer written form of this tutorial I recorded covering Banister and GoldenCheetah.
The Banister Impulse Response model
In 1975 Eric Banister proposed an impulse-response model that could be used to correlate past training with changes in performance in order to predict future improvements from future training.

Originally proposed for collegiate swimmers, it was reworked in 1990 for running and is of course also applicable to cycling. Each type of sport needed a way of calculating impulse (aka training load) and a way of quantifying performance (e.g. 8-minute power, critical speed).
To support this he invented an HR-based metric called TRIMP to quantify the load of any type of running workout. Over time the TRIMPs from workouts are accumulated into a performance curve, which is fitted to actual performance tests. The resulting parameter estimates can then be used to predict future performance.
The test Banister used for running was a maximal effort over a standard distance (e.g. 1500m), with the result expressed as a point score relative to the world record for that distance (e.g. the 1500m WR is 3:26.00). For cycling a 6-minute TTE test was also used.
The key element to all of this is that the Banister model learns the individual's response to an impulse and uses that to predict future performance. This of course means it needs a fair amount of historic data to work with. Fortunately, a large proportion of the users of GoldenCheetah have that.
Figure 1: Overview of the Banister IR model
Fitting PTE, NTE and Performance Tests
At the heart of the Banister model are two curves: the Positive Influence curve and the Negative Influence curve. As the names suggest, they represent accumulated training load that will have a positive impact on performance (by eliciting physiological adaptations) or a negative impact on performance (tired or sore legs, residual fatigue).

It is also common to refer to the Positive Influence curve as Positive Training Effect (PTE) or just fitness, and the Negative Influence curve as Negative Training Effect (NTE) or just fatigue.
These curves represent accumulated load, but with different time decay constants, so that NTE residual fatigue will clear in 5-10 days, whereas the PTE adaptations will remain for 30-50 days. These decays are represented by two model parameters: t1 for PTE decay and t2 for NTE decay.
If we subtract NTE from PTE we get a third curve: the Performance curve. This represents performance taking into account adaptations and likely fatigue, but it is still in the arbitrary units we used as an input (e.g. TRIMP, TSS, BikeScore). So at this point we need to find some coefficients to translate from these arbitrary units to ones that represent actual performance measures (e.g. 6-minute power is measured in watts).
So the formula for the Performance curve is actually:
Performance(t) = p0 + k1*PTE(t) - k2*NTE(t)
These new coefficients are less interesting as they just translate units. But p0 represents the baseline performance, which for 6-minute power might be the power you can put out when you are at your least fit, i.e. your untrained performance. The other two parameters, k1 and k2, are just coefficients that convert e.g. BikeScore into watts. Example values for k1 and k2 are things like -0.00612 and 0.00402.
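To make this concrete, here is a minimal sketch of how the curves might be computed from daily loads, using one common discretisation of the exponential decay. The parameter values below are illustrative placeholders, not GoldenCheetah's defaults:

```python
import math

def banister_curves(loads, t1=45.0, t2=15.0, p0=250.0, k1=0.1, k2=0.15):
    """Accumulate daily training loads (e.g. BikeScore) into PTE, NTE
    and Performance curves. A sketch only: t1/t2 are decay constants in
    days, p0/k1/k2 are illustrative values, not fitted ones."""
    d1, d2 = math.exp(-1.0 / t1), math.exp(-1.0 / t2)
    pte = nte = 0.0
    performance = []
    for load in loads:          # one load value per day
        pte = pte * d1 + load   # fitness: decays slowly, gains with load
        nte = nte * d2 + load   # fatigue: decays quickly, gains with load
        performance.append(p0 + k1 * pte - k2 * nte)
    return performance
```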
To estimate all these parameters, t1, t2, p0, k1 and k2, we need to fit the performance curve to actual performances. So the last aspect here is the need to perform regular performance tests so we can get sufficient data points to fit the model.

Ideally, we will have more than five observations to fit a five-parameter model to. So for most people this would mean at least one full season of training. Once we have that we can fit, and use the parameters to make predictions about future adaptations from a future training plan.
Relating this to the Performance Manager Chart (PMC)
So at this point, most readers are likely trying to translate the Banister model into the terms they're used to from the TrainingPeaks Performance Manager Chart: CTL, ATL and TSB.

We should look at this mathematically and then conceptually to understand how they relate. First off, in the PMC the coefficients k1 and k2 have been removed along with p0; this removes the need to test and fit. Secondly, the time decays have been retained but fixed at 42 and 7 days respectively (you can change them, but there is no guidance on how you might validate them since they don't relate to anything you can measure).
So, so far, so good; the PMC is like Banister but doesn't need any performance tests. We can think of the two models as basically being expressed as:
PMC: TSB = CTL - ATL
Banister: Performance = PTE - NTE
Overall, the shapes of the CTL and PTE curves are generally very similar, and likewise the ATL and NTE curves; see figures 2 and 3 below.
Figure 2: CTL vs PTE curve
Figure 3: ATL vs NTE curve
Where the PTE and NTE curves are accumulated over time with a decay applied, the CTL and ATL curves are computed as weighted rolling averages. As a result the dimensions of these curves are very, very different. Go back and look at the y-axis on the two figures above.
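The difference is easy to see in code. Below is a sketch of the PMC recursion in its common exponentially weighted moving average form (an assumption; implementations differ in the exact smoothing). The averaging keeps CTL and ATL in the same units as the daily load, while PTE and NTE simply accumulate and so grow much larger:

```python
def pmc_curves(loads, ctl_days=42.0, atl_days=7.0):
    """Sketch of the PMC rolling averages, for contrast with PTE/NTE."""
    ctl = atl = 0.0
    tsb = []
    for load in loads:
        ctl += (load - ctl) / ctl_days  # chronic training load: smoothed
        atl += (load - atl) / atl_days  # acute training load: smoothed
        tsb.append(ctl - atl)           # training stress balance
    return tsb
```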
Figure 4: TSB vs Performance
Now, when we look at the Banister Performance and PMC TSB curves, we see the curve shapes are very different. So we arrive at the conceptual difference between the two models.
In the PMC model TSB conceptually represents short-term fatigue, where in the Banister model NTE conceptually represents residual fatigue. Quite how real either is remains up for grabs.
There is a lot of tasseography around the meaning of certain values and rates of change for ATL and TSB. For the purposes of this post, you should now be able to relate Banister and the PMC and may have started to question which approach reflects performance and fatigue more accurately.
Implementation of Banister in GoldenCheetah
Over Christmas 2018 I started to work on planning functionality for GC: goals, periodisation, load planning and so on. Keen not to develop a glorified diary, I wanted to include modelling to support this. So I got sidetracked into the Banister model.

In truth, it is something I've meant to implement for a very long time, but I have always been held back by the need to perform tests. And not just that: you need to have performed tests in the past in order that the Banister model can learn enough about you to predict the future.
No time machine needed
Short of getting everyone into a time machine to go back and perform tests, I needed a different approach. As part of the recent CP explainer series with Dr Len I'd already discussed embedding testing into general riding. Indeed, we discussed how maximal efforts often occur in training without needing to schedule them -- either as part of a workout, or climbing a hill, and so on.

So, inspired by Mike Puchowicz's FPCA analysis, I developed a method for finding absolute best intervals across all workouts, and also a filtering algorithm to remove obvious sub-maximal efforts. I previously explained both here on this blog; it's worth a quick read if you haven't already.
This means no time machine is needed; past peak efforts are found automatically. It was rather useful regardless of Banister. I also used it to help improve the MMP filter used in the GoldenCheetah CP plot and model fitting.
Figure 5: Maximal and Submaximal Performances
Implementation Exemplar
Dave Clarke and Phil Skiba published a paper a few years ago that was intended to teach lay people about sports performance modelling. It's open access and rather good. It also has an exemplar spreadsheet that includes the Banister model.

Armed with this, and the Levenberg-Marquardt code previously used to fit the Critical Power model, I set about adding the model. It was easily done and took about 2 days tops (which was rather annoying since I'd put off doing it for so long).
To my surprise, with my personal data, it seemed to work really well. I was quite shocked. I'd always been led to believe that the model was overly complex and impractical to use. Yet here it was predicting my CP history with some accuracy. So I tweeted my results and thought it was time to start testing against other data.
Imagine if you could use your PMC to track and predict CP #banister — Mark Liversedge (@liversedge) January 10, 2019
[dark red predicted, light red my curated setting] pic.twitter.com/jZ4bIkdqgx
As I tested I stumbled across, and attempted to fix, lots of issues and shortcomings. Let's run through them in the order that I found them.
Issue#1: Maximal Performance Duration
If you didn't read the post about power index and submax filtering: in a nutshell, the algorithm was set to look for peak efforts between 3 and 20 minutes long -- longer than 3 minutes to avoid conflating with W', and shorter than 20 minutes to avoid conflating with fatigue.

I found that even 3 minutes was too short; athletes with really high W' values would inflate the CP estimate when I used intervals shorter than 4 minutes.
I also found that long-form TTers and some triathletes almost never do maximal efforts of any significant duration; rather, they perform long tests at threshold. As a result their peak efforts between 4 and 20 minutes were not as good as their peak efforts at 60 minutes.
So the search range for maximal efforts was adjusted from 3-20 minutes to 4-60 minutes. The search takes a bit longer, and for most athletes it makes no difference (peaks are found below 20 minutes), but for those long-form TTers and triathletes it gave better results.
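For illustration, here is a naive sketch of a peak-effort search over that range: a simple sliding-window mean-maximal search over 1 Hz power samples. The real GoldenCheetah code is more efficient and also applies the sub-maximal filtering, which is not shown here:

```python
def best_efforts(watts, durations=(240, 360, 600, 1200, 2400, 3600)):
    """Best average power for durations between 4 and 60 minutes
    (given in seconds), from 1 Hz power samples."""
    peaks = {}
    for d in durations:
        if len(watts) < d:
            continue
        rolling = sum(watts[:d])                # sum of the first window
        best = rolling
        for i in range(d, len(watts)):
            rolling += watts[i] - watts[i - d]  # slide the window by 1s
            best = max(best, rolling)
        peaks[d] = best / d                     # back to average watts
    return peaks
```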
Issue#2: Unintentional weighting due to filtering submax efforts
The peak performances found on a weekly basis are filtered; this means that there are periods with no observations and periods with clusters of observations. Since we are using a damped least squares fit, the fit will be skewed towards areas with large numbers of observations.

In practice this seems to apply where there is either a high rate of change (detraining and retraining) or where the filtering algorithm breaks down. To fix this, the filtering code needed a lot of adjustments, from looking ahead over 4 weeks to looking ahead over 2 months.
I guess the takeaway here is the code needed a lot of tuning and will likely need to be improved over time. But for now, so long as there are enough observations over a long enough period, this weighting does not have a significant impact on the fit. Typically this means 20 or more observations over 2 years.
Issue#3: Fit plausibility
With 5 parameters to fit you obviously need a lot of data. For most athletes this was ok; we typically get about 20 maximal efforts per season to fit to. But even so, the optimal fit often led to implausibly low values for t1 and t2, and they would also converge on the same values. For some athletes the model fit wasn't at all stable and sometimes failed to converge.

One way of overcoming this would be to constrain the model fit, but this was tricky too, since it was the ratio of t1 to t2 that also needed to be constrained -- although I could perhaps constrain t1 to the range 30-50 days and t2 to 7-20 days.
After spending a while trying to manage this I hit a decision point: do I remove them from the fit and make them constants we can tune (like the PMC), or do I do a brute force fit across a range of plausible values (say 50:7 through 30:20)?
I landed on just making them tunable by the user, though that might change in the future. The other problem, the time dependence of p0, t1 and t2, was also in my mind.
So now, instead of fitting a 5-parameter model, I'm fitting a 3-parameter one with two tunable constants.
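With t1 and t2 fixed, PTE and NTE can be precomputed once and the model becomes linear in p0, k1 and k2, so the fit is straightforward. Here is a sketch using SciPy's Levenberg-Marquardt fitting (an assumption for illustration; GoldenCheetah uses its own C++ LM code, and the t1/t2 defaults below are placeholders):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_banister(loads, test_days, test_watts, t1=45.0, t2=15.0):
    """Fit p0, k1, k2 to observed performances, with t1/t2 held as
    tunable constants."""
    d1, d2 = np.exp(-1.0 / t1), np.exp(-1.0 / t2)
    pte, nte = np.zeros(len(loads)), np.zeros(len(loads))
    for i, w in enumerate(loads):   # precompute the two curves
        pte[i] = (pte[i - 1] if i else 0.0) * d1 + w
        nte[i] = (nte[i - 1] if i else 0.0) * d2 + w

    def model(day, p0, k1, k2):     # performance on the test days
        idx = day.astype(int)
        return p0 + k1 * pte[idx] - k2 * nte[idx]

    popt, _ = curve_fit(model, np.asarray(test_days, dtype=float),
                        np.asarray(test_watts, dtype=float),
                        p0=[200.0, 0.1, 0.1])
    return popt                     # fitted p0, k1, k2
```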
Of the three remaining parameters (p0, k1 and k2), p0 is really, really interesting. For example, Damien Grauser, one of the GoldenCheetah developers, has an untrained CP (p0) of about 285W. That was equivalent to me at my best (!).
At some point it might be interesting to model how this changes over time, which leads me to the next issue.
Issue#4: As we get older p0, t1 and t2 change
Over the years we tend to see a drop in performance and in our ability to make gains and hold on to fitness. Obviously we're talking about many years here, not week to week.

So when I started to fit the Banister model to performances to estimate these parameters, I didn't fit to all-time data. Instead, I looked for windows to fit against. Initially I looked to split workout history into seasons, figuring everyone was like me and had an extended off-season over winter before getting back into the saddle ready for spring.
So I started by splitting seasons based upon gaps in history. This worked ok for me, but then I found the first few seasons were poor: a lot of my workouts were hidden from Banister because they didn't have power data. As a result I looked like a mega-fast gainer who could do a handful of power workouts and see CP rise by 20%, while in later seasons I appeared to slow up as all the workouts had power and became visible to the model.
Some athletes had no off-seasons, and so my code did a fit against a 10-year season. Of course this meant the fit at any point in time was pretty poor, since p0, t1 and t2 were kept constant throughout.
So in the end, I settled on a minimum and maximum season length: long enough to have sufficient data to be stable, but short enough that t1, t2 and p0 could change over time. This turned out to be about 2 years.
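A hypothetical sketch of that windowing idea: start a new window at a long gap in the ride history, and cap each window at roughly two years. The thresholds here are assumptions for illustration, not the values GoldenCheetah actually uses:

```python
def fit_windows(ride_dates, max_len_days=730, gap_days=60):
    """Split a sorted ride history (datetime.date objects) into
    windows suitable for independent Banister fits."""
    windows = []
    start = prev = ride_dates[0]
    for d in ride_dates[1:]:
        gap = (d - prev).days               # off-season detection
        length = (d - start).days           # cap window length
        if gap > gap_days or length > max_len_days:
            windows.append((start, prev))   # close the current window
            start = d
        prev = d
    windows.append((start, prev))
    return windows
```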
Issue#5: Diminishing returns
At the beginning of your training history, or after a long layoff, you will make quick gains from relatively little work. As you get more and more fit, the gains come much more slowly.

The plain Banister model is linear. Fundamentally that is because, if you strip it back (with a bit of poetic license), the model is a simple equation of the form impulse * x = response. Which is of course a straight line.
For now I need to just suck this up; it is a well known problem with the model, and not surprisingly there are a number of solutions to it in the literature and elsewhere.
Issue#6: One input is too blunt
Another well known issue with the Banister model is that trying to encapsulate training impulse into a single number is fraught with difficulty.

The impact of high intensity work at shorter durations versus low intensity work at longer durations, the amount of time working in a heavily fatigued state, or the combination of HIIT and LIT all influence the resulting adaptations in different ways. This cannot be expressed in a single number.
Like issue 5 above, I need to suck this up for now, but again there are other approaches that may help to address this.
Next Steps: Multisport, Non-linearity and Deep Learning
So, where next I guess?

Different Sports or Performance Measures
For starters, the current code only supports a single performance metric, which is only applicable to power and cycling. We need to add support for multiple sports like running and swimming.

To support this, and to make the whole thing a little more flexible, we'll allow the user to define an input metric (e.g. BikeScore, GOVSS etc) and a performance metric (Power Index performances, manually recorded performances, or a user metric).
This might also facilitate multiple Banister curves, e.g. one for CP, one for Pmax, one for 5-minute power, each with distinct inputs: volume, intensity, time in W'bal zones.
Dynamical and non-linear variants of Banister
Banister has seen some research activity quite recently; new formulations have emerged that add additional parameters to remove the linearity, which helps to address issue#5 above.

Additionally, there are solutions that add a Kalman filter to the Busso time-variant model of Banister, addressing issue#5 and issue#4 above.
Deep learning with Neural Networks
Alan Couzens recently compared Banister with multi-layer perceptrons (MLPs), aka old-school neural networks, and got excellent results from a very simple network. He also reported positive results using RNNs/LSTMs too.

I'm curious to try CNNs, as they have been found to work well for time-series analysis, and adding them to a desktop application like GoldenCheetah is quite easily achieved using the C++ library Dlib, without needing frameworks like Google's TensorFlow with embedded Python (which would be a real faff).
The beauty of a deep learning approach is also the ability to use multiple inputs, so it addresses issue 6 above, but it could also help with issues 3, 4 and 5 too.
In addition, if the model is pre-trained, perhaps against the GoldenCheetah OpenData athletes, we could use it as a generic model for athletes with little or no training history, as well as re-training it against data for those athletes that have it.
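To illustrate the multi-input idea, here is a toy sketch using scikit-learn's MLPRegressor on synthetic data. It is purely hypothetical -- not Alan Couzens' setup, nor the Dlib/C++ code GoldenCheetah would actually use -- and simply maps a window of recent daily loads plus a second input stream (intensity) to a tested performance:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
days, window = 1000, 90
loads = rng.gamma(2.0, 50.0, size=days)        # fake daily training load
intensity = rng.uniform(0.5, 1.0, size=days)   # fake intensity factor

test_days = np.arange(window, days, 30)        # a "test" every 30 days
X = np.array([np.concatenate([loads[d - window:d],
                              intensity[d - window:d]])
              for d in test_days])
y = 250 + 0.05 * X[:, :window].mean(axis=1)    # fake tested watts

net = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=5000,
                   random_state=42)
net.fit(X, y)
print(net.predict(X[:3]))                      # predicted watts
```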
I think it is safe to say that there is a lot of road left to travel with performance modelling, in fact I suspect the journey is only just starting!