It really has been a while since we posted so we start this post with a grovelling apology to our loyal and patient readers. Cider consumption has increased sharply and we have struggled to generate and maintain motivation. This is partly due to the nature of the literature that posting on the column forces us to read and in fact we’d almost forgotten what we planned to post on.
Regular readers of this column will be well aware of what we mean by categorical sins. The usual trick involves plotting average values of some property of interest (e.g. solubility, bioavailability or promiscuity) for each category. We think this is very naughty because variation is hidden. An even more insidious practice involves transforming continuous variables like ClogP and molecular weight (MW) into categories such as MW between 300 and 500. Why do we term this behaviour sinful? We do this because the motivation for hiding variation is normally to make weak trends appear less so, in the process making the people presenting the trends look smarter and more cultured. That is why we use the term sin. We stress that it is not a consequence of the participation of Jesuits in our early education.
A number of relevant publications have appeared since we first introduced our readership, which we estimate to number no more than 6 hardy souls, to the concept of categorical sin. The publication that features in today’s Crapshoot claims to present simple, interpretable ADMET rules of thumb and it can be found in a journal with an enticingly high impact factor. In fact one of our fellow bloggers has even beaten us to it.
Let’s dive straight in! We ask that you take a look at the top left plot (there are 6 of them) in Figure 2 which is labelled a. Four categories of molecular weight have been defined with the boundaries at 300, 500 and 700 and mean values of log(Solubility) have been plotted for each category. Presented like this, the relationship between solubility and molecular weight looks strong. However we’ve seen that plotting data in this manner can make a weak relationship appear to be stronger than it is. At least r-square values were not quoted for this plot as was done in the other article.
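To see how binning can flatter a weak trend, here is a minimal sketch in Python using made-up data (none of it comes from the article). We simulate a weak dependence of log(Solubility) on MW, with a slope and noise level picked purely for illustration, and compare r-square for the individual compounds with r-square for the four category means.

```python
import random
from statistics import mean

def r_squared(xs, ys):
    """Squared Pearson correlation coefficient for paired values."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def mw_bin(mw):
    """Index of the MW category using the boundaries 300, 500 and 700."""
    return sum(mw >= edge for edge in (300, 500, 700))

# Synthetic data: hypothetical slope and noise, NOT taken from the article.
rng = random.Random(0)
mws = [rng.uniform(150, 850) for _ in range(2000)]
log_sol = [-0.002 * mw + rng.gauss(0, 1.0) for mw in mws]

# Group compounds into the four categories and average within each one.
groups = {}
for mw, y in zip(mws, log_sol):
    groups.setdefault(mw_bin(mw), []).append((mw, y))
bin_mw = [mean(mw for mw, _ in g) for _, g in sorted(groups.items())]
bin_sol = [mean(y for _, y in g) for _, g in sorted(groups.items())]

r2_raw = r_squared(mws, log_sol)
r2_binned = r_squared(bin_mw, bin_sol)
print(f"r-square, individual compounds: {r2_raw:.2f}")
print(f"r-square, category means:       {r2_binned:.2f}")
```

The underlying relationship is identical in both calculations; only the averaging differs, and the averaging is what makes the noise disappear from view.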
Our other criticism of this plot is that variation is hidden. Wait a minute, we hear you cry, there are error bars. Patience, dear readers, error bars are indeed present but the variation is still hidden. How can that be and why do you persist in playing these silly games with us?
Typically we calculate the standard deviation when we want to quantify variation. There can be problems with this if the distribution suffers from excessive skewness, kurtosis or halitosis but the standard deviation is a good place to start. If Figure 2b (solubility by charge type) had been presented showing the standard deviations for each category we would have been satisfied and the Crapshoot would have been focusing elsewhere. However the error bars are not standard deviations.
The error bars in these plots are actually confidence intervals for the mean. Each of these confidence intervals is derived from the standard error in the mean which in turn is obtained by dividing the standard deviation by the square root of the number of observations. Confidence intervals defined in this manner would normally be used to address questions like whether mean solubility is significantly different for cations and neutrals. If you have enough data even very small differences in means will become significant and we’ll return to that point at the end of the post.
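For readers who like to see the arithmetic, a rough Python sketch of that recipe follows. The sample values are invented and the 1.96 multiplier assumes a 95% interval under a normal approximation; the article may well have used a different confidence level.

```python
import math
from statistics import mean, stdev

def mean_confidence_interval(values, z=1.96):
    """Return (mean, standard deviation, standard error, CI half-width).

    z=1.96 is an assumed multiplier (95% interval, normal approximation).
    """
    n = len(values)
    sd = stdev(values)            # sample standard deviation
    sem = sd / math.sqrt(n)       # standard error of the mean
    return mean(values), sd, sem, z * sem

# Invented log(Solubility) values for one hypothetical category.
sample = [-4.1, -3.2, -5.0, -2.7, -3.9, -4.4, -3.0, -4.8]
m, sd, sem, half_width = mean_confidence_interval(sample)
print(f"mean = {m:.2f}, SD = {sd:.2f}, SEM = {sem:.2f}")
print(f"error bar: {m - half_width:.2f} to {m + half_width:.2f}")
```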
Well that was all a bit of a mouthful, wasn’t it? Let’s take a look at Figure 3a (the top left plot in Figure 3) which shows mean values of log(Bioavailability) for four categories of MW (<300, 300-500, 500-700, >700). You’ll notice that the error bars are greater for the <300 and >700 groups of compounds. So this means that there is more variation in bioavailability for these groups of compounds, doesn’t it? Well not exactly. It could mean that there are fewer compounds in these groups than in the other groups. The problem is that we don’t know how many compounds are in each group so we don’t really know too much about the standard deviations for the 4 groups of compounds. And that is why we say that the variation is hidden.
There are plots in this article for a number of properties relevant to drug discovery and it is not our intention to review them all. Compounds are either grouped according to MW as described above or according to charge type. The article claims to provide simple, interpretable ADMET rules of thumb but we were not sure how these plots should be used. Do we expect a compound with 499 MW to be more similar in its properties to a compound of 301 MW than to a compound with 501 MW just because somebody has chosen to set boundaries at 300 and 500? We were unclear where all this was heading until we got to section 2.6 (Rules of thumb for a given set molecular properties). At that point it became clearer where things were heading and we did rather wish we had gone somewhere else instead.
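To make the point concrete, here is a small Python sketch that works backwards from an error bar to the standard deviation it implies. The half-width of 0.2 log units and the group sizes are hypothetical, as is the 1.96 multiplier (95% interval, normal approximation); nothing here is read off Figure 3.

```python
import math

def implied_sd(half_width, n, z=1.96):
    """Standard deviation implied by a CI half-width for a group of size n."""
    return half_width * math.sqrt(n) / z

# One and the same error bar, three hypothetical group sizes.
half_width = 0.2
for n in (25, 400, 10000):
    print(f"n = {n:>5}: implied SD = {implied_sd(half_width, n):.2f}")
```

The same error bar is compatible with a narrow spread or a very broad one, depending entirely on a group size that the plot does not tell us.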
A new categorisation scheme is introduced in section 2.6. Compounds were categorised as desirable if MW is less than 400 and logP < 4. All other compounds were categorised as less-desirable. This categorisation was overlaid onto the four charge types (neutral, anion, cation and zwitterion) to give a total of eight categories. Hope you’re still with us because we must admit to having become a bit disorientated ourselves with all of this slicing and dicing of the data. Now it’s time to bring it all together. Analyses were performed for 13 properties relevant to ADMET by comparing mean values for each category with mean values for the full data sets. Take a look at Table 3 if you want to see what these look like. The comparisons are coded as higher, lower or average with respect to the average for the full data set.
The observant amongst you will be wondering how one might use the results of this analysis. Let’s take a look at the very first row of Table 3 which corresponds to solubility for neutral compounds. The desirable category is labelled ‘average’ and the less-desirable category is labelled ‘lower’. How should we use this information? Presumably if our favourite compound has MW of 390 and is a bit less soluble than we would like, it would be extremely unwise to add a hydroxyl group because this will push MW over the edge.
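A minimal Python sketch of the categorisation, as we read it from section 2.6, shows how abrupt the boundary is. The function name, the logP value of 3.0 and the assumption that the hydroxyl replaces a hydrogen (adding one oxygen, roughly 16 mass units) are ours.

```python
def desirability(mw, logp):
    """Categorise a compound: 'desirable' if MW < 400 and logP < 4."""
    return "desirable" if mw < 400 and logp < 4 else "less-desirable"

OXYGEN_MASS = 16.0  # approximate mass gained when H is replaced by OH

before = desirability(390.0, 3.0)                # hypothetical parent compound
after = desirability(390.0 + OXYGEN_MASS, 3.0)   # logP held fixed for simplicity
print(before, "->", after)
```

Holding logP fixed is a simplification, since a hydroxyl would normally lower it, which only makes the change of label look odder.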
Let’s pause for a moment to think about why people dare to make compounds that are so offensive to the Molecular Weight Gestapo and ClogP Thought Police. It isn’t exactly a secret that excessive lipophilicity and MW cause molecules to bind to proteins that you’d prefer they didn’t. Generally people make these compounds in order to increase binding to the primary target. This is usually overlooked when presenting analysis based on questionable (i.e. in which variation is hidden) plots of promiscuity against lipophilicity. When you’re struggling to achieve potency, you really need to have something a little more concrete than a pious statement of the sinfulness of lipophilicity and MW.
It’s now back to Table 3 which tells us that solubility of neutral compounds will change from ‘average’ to ‘lower’ if we let MW go over 400 or ClogP exceed 4. Now the chemist probably has some idea of how much extra potency will result from that structural change that takes ClogP from 3.8 to 4.2. So it’s now over to the rules of thumb. How much solubility are we going to lose? No, you can’t phone a friend. OK, just tell us how much lower than ‘average’ is ‘lower’? We’re sorry but that just wasn’t the answer that we were looking for.
We really owe it to our loyal readers to tell them how different ‘average’ and ‘lower’ really are. For solubility of neutral molecules with MW > 400 or ClogP > 4, ‘lower’ simply means that the average solubility for this group of compounds is significantly lower than the average solubility for all the compounds measured. The average solubility for the group of compounds (MW < 400 and ClogP < 4) does not differ significantly from the average for the full data set and these are labelled ‘average’. However if you have enough data even small differences can become significant.
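A rough Python sketch illustrates that last point. We fix a difference of 0.1 log units between two group means and a standard deviation of 1 log unit in each group (both numbers invented), then watch the two-sided p-value from a simple z-test shrink as the groups get bigger. Comparing two equal-sized groups with a known, common standard deviation is our simplification and not necessarily the test used in the article.

```python
import math

def two_sided_p(diff, sd, n):
    """Two-sided p-value for a difference in means between two groups of
    size n with a common, known standard deviation (z-test)."""
    se = sd * math.sqrt(2.0 / n)      # standard error of the difference
    z = abs(diff) / se
    return math.erfc(z / math.sqrt(2.0))

diff, sd = 0.1, 1.0                   # hypothetical values
for n in (50, 500, 5000, 50000):
    print(f"n = {n:>5}: p = {two_sided_p(diff, sd, n):.2g}")
```

The difference between the means never changes; only the number of compounds does.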
Well it’s now pub time so that’s just about it from us in this post but before we go we’ll share a thought. Did you know that the drug you’re taking daily holds the world record for number of volunteers in a clinical trial? Your reaction is:
a) Great! It must be significantly better than placebo with that number of people.
b) Oh shit! It took THAT many people to see an effect.
Tuesday, July 15, 2008
Desperately seeking significance
Labels:
categorical sin,
data analysis,
gsk,
jmc,
literature reviews,
oral drugs,
stamp collecting

3 comments:
Let’s pause for a moment to think about why people dare to make compounds that are so offensive to the Molecular Weight Gestapo and ClogP Thought Police. It isn’t exactly a secret that excessive lipophilicity and MW cause molecules to bind to proteins that you’d prefer they didn’t.
Hilarious, I am still shaking ... let's be honest, in the modern drug world 'big protein binding' is greedy, mean, and watching out for any innocent drug ;-)
Glad you liked it. There's a lot of BS out there.