Iva recently attended the London Innovation Society’s Big Data Analysis Innovation Awards, where she was selected to present a poster. This has prompted us to ponder whether our recent collaborative work with MedChemica is genuinely “Big Data” or just an analysis that happens to have more data than is normal in the field (Medicinal Chemistry). Moreover, what (if anything) can we learn from the leaps forward in big data analysis taking place in other sectors? A recently published article considers how genomics might compare with astronomy, YouTube and Twitter (if nothing else, we enjoy the juxtaposition of one of mankind’s most primordial obsessions with the obsessions into which we are now regressing).
In terms of sheer scale, medicinal chemistry still seems some way off from having the “zetta” (10 to the power of 21) scale data attributed to genomics or astronomy. Depending on how inclusive one wishes to be, it may compare with the fractions of a billion tweets per year. My back-of-the-envelope guess is that the global medicinal chemistry effort might add some hundreds of millions of data points per year (a single high-throughput screen (HTS) may be of the order of 1 million data points, but few organisations can undertake them; individual compound testing efforts within large companies may add hundreds or thousands of data points per active research project, of which there may be some hundreds). Recent efforts to make and test encoded libraries with billions of compounds in them probably don’t yet add one data point per compound, so are unlikely to shift this in the near term. Indeed, it is not clear whether the number of medicinal chemistry data points being generated per year is currently increasing or decreasing.
It is a thought-provoking contrast that the four-headed beast that is predicted for genomics (data acquisition, storage, distribution and analysis) remains barely relevant to medicinal chemistry: the data instead remain divided amongst a stack of individual companies around the globe.
Databases like ChEMBL (13 million data points) surely represent only a small, though not insignificant, fraction of the medicinal chemistry dataset. Others have recently speculated about the impact big data will have on medicinal chemists. Two aspects that we are particularly interested in are training and culture.
Unless things have changed radically in the last five years, most medicinal chemists come from a background in synthetic organic chemistry. As has been noted recently, this is the discipline of the discrete, the precise and of worrying about how to make things. Medicinal chemistry, on the other hand, deals with the “continuous” properties of biology, which can be measured with much less precision and reproducibility, and should be concerned with what to make (how to make it only kicks in afterwards). Does this training provide the best background to deal with medicinal chemistry in the big data era? What role is there for statisticians, analysts and mathematicians? Particularly in the UK, can we start to bring back some of the brilliant minds that have been lured into the City to do just this sort of analysis? Furthermore, ours is a culture imbued with the beauty of synthesis (a penchant that I still share, but these days more as a guilty indulgence than anything else) and built on caring for individual molecules (I have chosen the verb “to care” with some care: “the process of protecting someone or something and providing what that person or thing needs” describes one aspect of the problem rather well). It is hard not to have your head turned by any molecule (or other thing) in which you have invested many days of hard work, but making the right thing may require exactly that detachment.