We do seem to have let things slip over the holiday period but rest assured, Esteemed Readers, that we have not abandoned you or, for that matter, William of Ockham who is familiarising himself with the health and safety implications of bringing razors to work. Following on from the last post, we take a look at an article on overfitting. Unlike much of the literature we review in this column, we rather like this article and think it's a real shame that many of the folk who publish predictive models are palpably unaware of its existence.
What is overfitting, one might ask? The author notes that, "Occam's razor, or the principle of parsimony, calls for using models and procedures that contain all that is necessary for the modeling but nothing more". How aptly put! Blofeld and Random Forest had better mind their step. You might ask what all this means and we will respond that the predictive models that we're talking about all use adjustable parameters to fit the data. To some extent, the more parameters that you use, the better the fit that you'll achieve. One definition of overfitting is using a model that has too many parameters. Or at least more than you needed.
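For anybody who'd like to see the "more parameters, better fit" effect for themselves, here is a minimal sketch of our own (it is not taken from the article). We assume synthetic data generated from a straight line plus noise and fit polynomials of increasing degree by ordinary least squares; the function name, noise level and choice of degrees are all made up for illustration.

```python
import numpy as np

def training_rmse_by_degree(x, y, degrees):
    """Least-squares polynomial fits; returns {degree: RMSE on the training data}."""
    results = {}
    for d in degrees:
        # Design matrix with columns 1, x, ..., x**d, i.e. d + 1 adjustable parameters.
        X = np.vander(x, d + 1, increasing=True)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        results[d] = float(np.sqrt(np.mean(resid ** 2)))
    return results

# Hypothetical data: the underlying trend is a straight line, the rest is noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)

train_rmse = training_rmse_by_degree(x, y, degrees=[1, 3, 6, 9])
for d, r in train_rmse.items():
    print(f"degree {d}: {d + 1} parameters, training RMSE = {r:.3f}")
```

The training RMSE can only go down (or stay put) as the degree goes up because each model contains the previous one as a special case. This tells you nothing about how the higher-degree fits will behave on data they haven't seen, which is rather the point.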
By now you'll have figured out why we are keen to see the actual models rather than just reading about their wonderful r-squares, q-squares and root mean squares. When the actual model is presented you can see exactly how many parameters it uses. "But isn't this just the number of descriptors?", you might ask. Basically yes, if it's a linear regression model and you don't count the intercept as a parameter. Once you enter the non-linear world of neural nets this is not the case any more. We don't believe that any journal that wishes to be considered respectable should be publishing new predictive models unless these models are fully specified in the article.
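To put rough numbers on the difference between descriptors and parameters, here is a small sketch of our own. It assumes a plain fully connected feed-forward net with one weight per connection and one bias per non-input node; the function names and layer sizes are hypothetical and other architectures will count differently.

```python
def linear_regression_params(n_descriptors, count_intercept=True):
    """One coefficient per descriptor, plus the intercept if you choose to count it."""
    return n_descriptors + (1 if count_intercept else 0)

def dense_net_params(layer_sizes):
    """layer_sizes = [inputs, hidden..., outputs]; weights plus biases for each layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_descriptors = 10
print(linear_regression_params(n_descriptors))    # 10 coefficients + 1 intercept = 11
print(dense_net_params([n_descriptors, 5, 1]))    # (10*5 + 5) + (5*1 + 1) = 61
```

Same ten descriptors, but a single hidden layer of five nodes already gives the net 61 adjustable parameters to play with.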
So we hope we've now got your attention. We're predicting solubility and have two models at our disposal, both of which have satisfied validation criteria. "What are these validation criteria?", we hear you cry. Fair point! The models satisfactorily predicted the solubilities of compounds that were not used to train the model. We'll discuss validation in a future post because to do so here would get us bogged down in the data-analytic equivalent of Passchendaele. Anyway, back to the models. We'll use root mean square error (RMSE) as our measure of model quality and there are problems with this. However, it's another point that we're trying to address and RMSE will work well for that. One model predicts log(S/M) with RMSE = 0.3 and uses 100 parameters and the other uses 3 parameters and predicts log(S/M) with RMSE of 0.4. The observant amongst you will have noticed that we're actually using the logarithm of the molar solubility rather than solubility itself and there are really good reasons for doing this which we'll not go into right now. Anyway, which model are you going to use to make your predictions?
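In case RMSE is unfamiliar, here is a minimal sketch of how it is calculated. The observed and predicted log(S/M) values are invented purely for illustration and do not come from either model. Because the quantity is a base-10 logarithm, an RMSE of 0.4 corresponds to a typical error of about a factor of 2.5 in S itself, while an RMSE of 0.3 corresponds to about a factor of 2.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up log(S/M) values; every prediction is off by 0.4 log units.
observed = [-2.0, -3.5, -5.0, -4.2]
predicted = [-2.4, -3.1, -5.4, -3.8]

error = rmse(observed, predicted)
fold_error = 10 ** error  # the same error expressed as a multiplicative factor in S
print(f"RMSE = {error:.2f} log units, roughly a factor of {fold_error:.1f} in S")
```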
Being regular readers of The Crapshoot has of course made you cynical. You've seen some of the underhand tricks that folk can use to persuade you that the trends that they have uncovered are stronger than they actually are (see examples 1 and 2 to get an idea of what we mean). The models have both been validated but you can't see the details so you're quite right to be suspicious. You're also thinking that the model with RMSE of 0.4 using 3 parameters is less likely to be overfit than the model with RMSE of 0.3 with 100 parameters. Also you might expect the first model to work better for compounds that are not chemically similar to the compounds used to train it. However, in the predictive modelling world validation is assumed to be valid and we can only ask who guards the guardians. Once models have been validated, the numbers of descriptors used become irrelevant.
Let's go back to the article on solubility prediction that we mentioned in the previous post. The cross-validation results for partial least squares (PLS), artificial neural net (ANN), support vector machine (SVM) and random forest (RF) models are given in Table 2. The cross-validated RMSE is lowest for RF as is the RMSE for the external test set. Random Forest is the best model! Long live Random Forest! It was validated so who are you, the uncouth authors of a blog that nobody reads, to question this finding?
It is true that the readership of The Crapshoot could comfortably assemble in the ensuite portion of a budget London hotel room. However, we really do object to being called uncouth and so we're going home (and taking our ball with us). Our parting shot is that we've not quite used up all the ammo from that nice paper on overfitting...

4 comments:
Uhm, a truly budget London hotel room won't have en suite :-)
In which case 'ensuite' would imply the presence of a wash basin.
Thanks for pointing to the excellent article on overfitting.
Hi
I'm assuming that "Also you might expect the first model to work better for compounds that are not chemically similar to the compounds used to train it." refers to the sentence before but it is a bit ambiguous - it could refer to the order of the models when they were first introduced.
Other than that I really enjoy your blog. I do think it would be a nice exercise for you guys to pull the one most useful bit of information out of every article that you hate. Just an idea.