Data-readiness in a World of AI

One of the key points of discussion at the last two PRISME Forum Technical Meetings on the topic of AI was that the limiting factor for AI/ML is neither computing power nor algorithms, but rather the availability of high-quality, fit-for-purpose structured data sets labeled with both appropriate metadata and endpoints. The scarcity of data for training machine learning models is a fundamental feature of AI in the life science industry. Living systems are complex and noisy, and as such require a significant amount of data to model accurately. While substantial amounts of in vitro experimental data exist, in vivo data is much more difficult to collect and, in the case of human data, use is limited by informed consent, privacy regulations and ethical considerations.

The idea that ‘data is more important than algorithms’ has been gaining support since 2001, when Banko and Brill published their paper “Scaling to Very Very Large Corpora for Natural Language Disambiguation”i, which demonstrated that several very different machine learning algorithms performed almost identically well on the complex problem of natural language disambiguation once they were given enough data.

The idea was taken up more recently in the 2009 article “The Unreasonable Effectiveness of Data”ii by Peter Norvig et al., which showed (Figure 1) that it can be relatively easy to reach around 50% accuracy using a variety of algorithms, but that improving further requires disproportionately more data, with accuracy rising only roughly logarithmically with the amount of training data. For AI to be effective, a sufficient amount of high-quality data needs to be readily available.

The biopharmaceutical and healthcare industry as a whole has a great deal of data. However, this data is rarely in a form that can be used to train AI/ML methods without substantial cleanup and labeling with metadata and endpoints.

Additionally, this data is generally widely dispersed, both within individual companies and between companies. This makes it difficult to gain access to the data and, given the diversity of data formats, to read and understand it. Individual biopharmaceutical companies self-evidently have less data than the industry as a whole on which to train AI/ML systems, which makes it harder to produce robust and generalizable results. If companies collaborated to merge their data sets, much larger, more diverse and more effective training data sets could be made available. Despite this, the industry is cautious about sharing its data, not least because companies fear they will compromise or lose their IP. Alternative ways to address the issue include methods that mitigate data shortage and overfitting, such as transfer learning, multi-task learning and the generation of synthetic data.

This PRISME Forum Technical Meeting will set out to explore opportunities for the biopharmaceutical industry to improve timely access to sufficient, high-quality data, on which AI systems can be trained (both within and beyond individual companies) and to best use the available data in the age of AI. A focus will be on practical examples that have been implemented at pharmaceutical companies along with efforts that have been attempted, but failed, and associated lessons learned.

Topics that will be addressed include:

  • The implementation and use of the FAIR data principles (Findable, Accessible, Interoperable, Reusable)iii in industry
  • Current tools and methods for metadata capture, end-state labeling and automated data preparation, both at the point of creation and at the time of use
  • Practical storage, management and access to data from every stage of the R&D process, and examples of data re-use and of models constructed with data federated across multiple domains
  • Examples of the use of methods such as transfer learning to reduce the amount of directly relevant data required to build models for specific tasks
  • Methods that would allow companies to share their data, including the use of “guest algorithms” that can train on data sets without exposing the IP
  • Identification of the most tractable domains within biopharma – both for internal development and where cross-industry data sets for AI training could be created


The PRISME Forum Technical Meeting Advisory Committee (see table) is seeking contributions (e.g. plenary presentations, start-up company ‘pitches’, poster presentations, etc.) from any person or company with an informed and experienced contribution to make in this area.

  • Christian Baber (Chair), Head of R&D IT, Shire
  • Nick Brown, Head of Technology Incubation Lab, AstraZeneca
  • Dan Chapman, Head of IT New Med. Information Management, UCB
  • David Christie, Vice President, Enterprise Applications Group, CSL Behring
  • Lars Greiffenberg, Director – R&D IT and Translational Informatics, AbbVie
  • Carol Rohl, Executive Director, Scientific Information Management, Merck
  • Martin Romacker, Principal Scientist – Data and Information Architecture, Roche
  • Nico Stanculescu, Logistics, PRISME Forum 
  • Jianchao (JC) Yao, Associate Principal Scientist, Merck
  • TBD, Takeda

