2018 FALL - Data-readiness in a World of AI
DATE, LOCATION & HOST
The 2018 fall PRISME Forum Tech Meeting was held Thursday, the 15th of November, 2018, and hosted by Takeda Pharmaceuticals at 1 Takeda Pkwy, Deerfield, IL 60015.
PRISME Forum Technical Meeting Chair: Christian Baber, Shire
PRISME Forum Chair: Dan Chapman, UCB
Fall Business Meeting of the PRISME Forum 2018
Data-readiness in a World of AI
One of the key points of discussion at the last two PRISME Forum Technical Meetings on the topic of AI was that the limiting factor for AI/ML is not computing power, nor indeed algorithms, but the availability of high-quality, fit-for-purpose structured data sets labeled with appropriate metadata and endpoints. The scarcity of training data for machine learning is a fundamental feature of AI in the life science industry. Living systems are complex and noisy and therefore require a significant amount of data to model accurately. While substantial amounts of in vitro experimental data exist, in vivo data is much more difficult to collect and, in the case of human data, its use is limited by informed consent, privacy regulations and ethical considerations.
The idea that ‘data is more important than algorithms’ has been gaining support since 2001, when Banko and Brill published their paper “Scaling to Very Very Large Corpora for Natural Language Disambiguation”i, which demonstrated that several very different machine learning algorithms performed almost identically well on the complex problem of natural language disambiguation once they were given enough data.
The idea was taken up more recently in the 2009 article “The Unreasonable Effectiveness of Data”ii by Peter Norvig et al., which showed (Figure 1) that it can be relatively easy to reach around 50% accuracy using a variety of algorithms, but that improving further requires disproportionately more data, because accuracy grows only roughly logarithmically with the size of the training set. For AI to be effective, a sufficient amount of high-quality data needs to be readily available.
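To make the scaling argument concrete, the following is a minimal sketch that assumes accuracy rises by a fixed amount for every tenfold increase in training data. The numbers (a 50% baseline at 1,000 examples and 10 percentage points per tenfold increase) and the function names are illustrative assumptions, not figures or methods taken from the cited article.

```python
import math

# Hypothetical log-linear learning curve. All constants are illustrative
# assumptions, not values from the cited papers.
BASE_ACCURACY = 0.50      # accuracy reached with N0 examples
N0 = 1_000                # number of examples at the baseline
GAIN_PER_DECADE = 0.10    # accuracy gained per tenfold increase in data


def accuracy(n_examples: float) -> float:
    """Accuracy under the assumed curve: linear in log10 of data size."""
    return BASE_ACCURACY + GAIN_PER_DECADE * math.log10(n_examples / N0)


def examples_needed(target_accuracy: float) -> float:
    """Invert the curve: data size required to reach a target accuracy."""
    return N0 * 10 ** ((target_accuracy - BASE_ACCURACY) / GAIN_PER_DECADE)


if __name__ == "__main__":
    for target in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"target {target:.0%}: about {examples_needed(target):,.0f} examples")
```

Under these assumptions each additional 10 points of accuracy costs ten times as much data as the previous 10 points, which is why data availability, rather than the choice of algorithm, becomes the bottleneck.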
The biopharmaceutical and healthcare industry as a whole has a great deal of data. However, this data is rarely in a form that can be used to train AI/ML methods without substantial cleanup and labeling with metadata and endpoints.
Additionally, this data is generally widely dispersed both within individual companies and between companies. This makes the data hard to access and, given the diversity of data formats, hard to read and understand. An individual biopharmaceutical company self-evidently has less data than the industry as a whole on which to train AI/ML systems to produce robust and generalizable results. If companies collaborated to merge their data sets, much larger, more diverse and more effective training data sets could be made available. Despite this, the industry is cautious about sharing its data, not least because companies fear they will compromise or lose their IP. Alternative ways to address the issue include methods that mitigate data shortage and overfitting, such as transfer learning, multi-task learning and the generation of synthetic data.
This PRISME Forum Technical Meeting will set out to explore opportunities for the biopharmaceutical industry to improve timely access to sufficient, high-quality data on which AI systems can be trained (both within and beyond individual companies) and to make the best use of the available data in the age of AI. A focus will be on practical examples that have been implemented at pharmaceutical companies, along with efforts that were attempted but failed, and the associated lessons learned.
Topics addressed include:
- The implementation and use of the FAIR data principles (Findable, Accessible, Interoperable, Reusable)iii in industry
- Current tools and methods for metadata capture, end-state labeling and automated data preparation, both at the point of creation and at the time of use.
- Practical storage, management and access to data from every stage of the R&D process and examples of data re-use & models constructed with data federated across multiple domains.
- Examples of the use of methods such as transfer learning to reduce the amount of directly relevant data required to build models for specific tasks.
- Methods that would allow companies to share their data, including the use of “guest-algorithms” that can train on data sets without exposing the IP.
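As an illustration of the last topic, the sketch below shows one simple way a “guest algorithm” could work: a function runs inside each company's environment and returns only summed statistics, from which a shared model is fitted. The function names, the two toy data sets and the choice of a one-variable least-squares fit are all assumptions made for this example, not a method described by the meeting. Aggregates can still leak information in practice, so this is a sketch of the idea rather than a privacy guarantee.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Summed statistics that are sufficient for a one-variable linear fit."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def __add__(self, other: "Stats") -> "Stats":
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def guest_algorithm(private_rows) -> Stats:
    """Hypothetical guest algorithm: runs where the data lives and
    returns only aggregates, never the individual (x, y) rows."""
    total = Stats()
    for x, y in private_rows:
        total = total + Stats(1, x, y, x * x, x * y)
    return total


def fit_line(stats: Stats):
    """Ordinary least-squares slope and intercept from pooled aggregates."""
    denom = stats.n * stats.sxx - stats.sx ** 2
    slope = (stats.n * stats.sxy - stats.sx * stats.sy) / denom
    intercept = (stats.sy - slope * stats.sx) / stats.n
    return slope, intercept


# Toy data sets held by two different companies (made up for illustration).
company_a = [(1.0, 3.0), (2.0, 5.0)]
company_b = [(3.0, 7.0), (4.0, 9.0)]

# Each company runs the guest algorithm locally; only the Stats objects are shared.
pooled = guest_algorithm(company_a) + guest_algorithm(company_b)
slope, intercept = fit_line(pooled)

if __name__ == "__main__":
    print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Because the summed statistics are sufficient for this particular model, the fit obtained from the pooled aggregates is the same as the fit that would be obtained by merging the raw rows, without either company handing over its records.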
The PRISME Forum Technical Meeting Advisory Committee:
- Christian Baber (Chair), Head of R&D IT, Shire
- Nick Brown, Head of Technology Incubation Lab, AstraZeneca
- Dan Chapman, Head of IT New Med. Information Management, UCB
- David Christie, Vice President, Enterprise Applications Group, CSL Behring
- Lars Greiffenberg, Director – R&D IT and Translational Informatics, AbbVie
- Carol Rohl, Executive Director, Scientific Information Management, Merck
- Martin Romacker, Principal Scientist – Data and Information Architecture, Roche
- Nico Stanculescu, Logistics, PRISME Forum
- Jason Tetrault, Global Head Data Engineering and Emerging Technologies, Takeda
- Jianchao (JC) Yao, Associate Principal Scientist, Merck
The hotel for this meeting is the Hyatt Regency Deerfield, located at 1750 Lake Cook Rd, Deerfield, IL 60015. The discounted room rate is $169 per night plus tax.
Rates are valid ONLY through October 13, 2018.
Reservations can be made online at https://book.passkey.com/go/PRISMEForum
When reserving a room, please remember to use “PRISME” for the above rate and appropriate allocation to our room block.
DISTANCE TO MEETING VENUE FROM THE AIRPORT
O’Hare International Airport is a 20-minute ride (14.4 mi/24 km) from the meeting venue or conference hotel.
Uber and Lyft remain reliable sources for the transfer between O’Hare and the meeting venue/hotel.
Additional car services will be posted soon!
MEETING AND SOCIAL EVENT VENUE TRANSFERS
Morning and afternoon transfers will be offered between the hotel, the meeting venue and the social/networking events (per program outline).