Biases of drug–target interaction network data

Twan van Laarhoven and Elena Marchiori
2014, under review

Abstract

Network-based prediction of interactions between drug compounds and target proteins is a core step in the drug discovery process. The availability of drug–target interaction data has boosted the development of machine learning methods for the in silico prediction of drug–target interactions. In this paper we focus on the crucial issue of data bias.

We show that four popular datasets contain a bias because of the way they have been constructed: all drug compounds and target proteins have at least one interaction, and some of them have only a single interaction. We show that prediction methods can exploit this bias to obtain an overly optimistic estimate of generalization performance from cross-validation procedures, in particular leave-one-out cross-validation. We discuss possible ways to mitigate the effect of this bias, in particular by adapting the validation procedure. In general, our results indicate that the data bias should be taken into account when assessing the generalization performance of machine learning methods for the in silico prediction of drug–target interactions.
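
To make the bias concrete, here is a minimal sketch in Python. It is not the code from the paper; the toy matrix `Y` and the function `loo_bias_scores` are hypothetical and chosen only for illustration. The sketch assumes, as in the datasets described above, that every drug (row) and every target (column) has at least one interaction. Under leave-one-out cross-validation, hiding the only interaction of a drug or target leaves an empty row or column, which gives away that the hidden pair was an interaction.

```python
# Toy interaction matrix (hypothetical values): rows are drugs,
# columns are targets. Every row and every column contains at least
# one interaction, mirroring how the datasets were constructed.
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
]


def loo_bias_scores(Y):
    """Leave-one-out scores from a predictor that only uses the bias.

    For each pair (i, j) the entry Y[i][j] is hidden. The pair gets
    score 1 if drug i or target j has no interactions left among the
    visible entries, and score 0 otherwise.
    """
    n_rows, n_cols = len(Y), len(Y[0])
    scores = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            # Interactions of drug i, excluding the hidden pair.
            row_rest = sum(Y[i][k] for k in range(n_cols) if k != j)
            # Interactions of target j, excluding the hidden pair.
            col_rest = sum(Y[k][j] for k in range(n_rows) if k != i)
            if row_rest == 0 or col_rest == 0:
                scores[i][j] = 1
    return scores


scores = loo_bias_scores(Y)
flagged = [
    (i, j)
    for i in range(len(Y))
    for j in range(len(Y[0]))
    if scores[i][j] == 1
]
# Every flagged pair is a true interaction, although the predictor
# never looked at any drug or target features.
print(flagged)
```

In this toy example the flagged pairs are exactly the interactions of the drug and the target that have a single interaction each. The predictor carries no information about new drugs or targets, so any performance it achieves under leave-one-out cross-validation is an artifact of how the data were constructed.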

Downloads

Full text (PDF) (LaTeX source)
Source code (Matlab/Octave). License: free for non-commercial use.
Put the datasets in the ../dataset/drugtarget directory relative to the Makefile, or change the path in load_dataset.m.
Datasets by Yamanishi et al. (local copy)

Contact

Contact the authors at tvanlaarhoven@cs.ru.nl for any questions or comments.