The NIH is soliciting feedback from the biomedical community. Here's what we think.
NIH recently released a draft of the 2023-2028 Strategic Plan for Data Science and solicited feedback from the broader biomedical community (NOT-OD-24-037). Overall, our impression is that the NIH has developed an impressive and thoughtful 5-year strategic plan for data science in biomedical research. We’ve provided our feedback directly to the NIH, and are now sharing it publicly to spark discussion.
Agree with our feedback? Disagree? We’d love to know!
Below, we have highlighted some benefits of the strategic plan that seem particularly important, as well as gaps and potential opportunities.
Goal 1 of the strategic plan aims to support the 2023 policy for Data Management and Sharing (DMS) by improving the capabilities of the data sharing ecosystem.
Benefits:
The new DMS policy addresses many of the challenges expressed by the data contributors we work with, including a lack of directed funding for data curation.
Objective 1-1: An emphasis on training around data management practices across a range of roles and experience levels will be important.
A focus on standardizing policies and procedures for recombining individual-level data that is housed across multiple repositories is highly valuable.
Gaps and Opportunities:
A potential negative outcome of the DMS policy is that individual institutions may create their own data management platforms, which could further fragment the data ecosystem.
More emphasis on the long-term sustainability of data and resources is needed: for example, the ability to prepay for long-term data storage from NIH grants, mechanisms to easily move data to centralized free repositories for archival, and considerations for remapping archival data to meet current standards (e.g. as common data elements evolve).
Are there ways to mandate rather than simply “encourage usage of open and standardized schemas, ontologies, and data formats…”?
How do individual repositories assess themselves against the criteria defined in the Office of Science and Technology Policy (OSTP) "Desirable Characteristics of Data Repositories" and their alignment with community standards such as the Transparency, Responsibility, User focus, Sustainability, and Technology (TRUST) principles? Should the NIH audit these criteria for federal and non-federal data repositories?
Data management and sharing is complicated and onerous for data generators and data stewards, and manual data curation and annotation is error-prone. Stronger emphasis should be placed on funding or developing tools that automate and streamline the process of extracting, transforming, and loading data, as well as the metadata that describe the data. There are huge opportunities here to develop technologies that reduce the reporting burden on investigators and data stewards while simultaneously improving data quality and reusability.
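To make this concrete, here is a minimal sketch of the kind of automated metadata extraction we have in mind: basic descriptive metadata is derived from a tabular file without manual annotation. The file name and output fields are illustrative assumptions, not an existing NIH tool.

```python
# A minimal sketch of automated metadata extraction from a tabular file.
# The file name and output fields are hypothetical, not an NIH standard.
import csv
import hashlib
import json
from pathlib import Path

def extract_metadata(path: str) -> dict:
    """Derive basic descriptive metadata from a CSV file automatically,
    reducing the manual annotation burden on data contributors."""
    p = Path(path)
    with p.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)               # column names for curation review
        n_rows = sum(1 for _ in reader)     # row count, excluding the header
    return {
        "file_name": p.name,
        "size_bytes": p.stat().st_size,
        "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),  # integrity check
        "columns": header,
        "row_count": n_rows,
        "format": "text/csv",
    }

if __name__ == "__main__":
    print(json.dumps(extract_metadata("assay_results.csv"), indent=2))
```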
For Objective 1-2 to be truly effective, we suggest creating a core standards repository that holds recommended, community-developed standards, vocabularies, and ontologies for use in multiple domains. We hope this would mitigate the proliferation of standards in our domains and ensure reuse and interoperability between domains. Ideally, this repository would include features to enable facile community contribution and management of data standards. In our experience, one driver of standards proliferation is a need to modify, extend, or adapt pre-existing standards.
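For illustration, a single entry in such a core standards repository might look something like the sketch below. The schema and every field name here are our assumptions, not an existing NIH resource.

```python
# A hedged sketch of one entry in a hypothetical core standards repository;
# the schema and all field names are assumptions.
STANDARD_ENTRY = {
    "name": "Example Imaging Metadata Schema",
    "version": "1.2.0",
    "domains": ["neuroimaging", "oncology"],     # intended reuse across domains
    "maintainers": ["imaging-wg@example.org"],   # community working group
    "extends": "Example Core Schema v1",         # extension rather than a fork
    "artifacts": {
        "schema": "https://standards.example.org/imaging/1.2.0/schema.json",
        "changelog": "https://standards.example.org/imaging/1.2.0/CHANGES",
    },
    "status": "recommended",
}
```

Recording an explicit "extends" relationship would let the repository track derived standards instead of losing them to forks, directly addressing the modify-and-adapt driver of proliferation noted above.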
Goal 2 of the strategic plan aims to enhance the collection and value of human-derived data.
Gaps and Opportunities:
Molecular characterization of samples from patients tends to occur in research labs, while the clinical trajectory of patients tends to be recorded in EHR data. Currently, the crosstalk between these two modalities of data is very limited. The strategic plan has not addressed bringing these two modalities together.
Federated data linkage between cohorts is a key requirement for cross-modality data use, especially for social and environmental determinants of health and diverse patient populations. The strategic plan does not address the process or mechanisms by which a federated data linkage spine could be constructed. A possible solution is a Domain Name System (DNS) for healthcare data: this would address the 'F' (Findable) in FAIR, enabling more than one-to-one connections between data assets.
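As a rough illustration of the idea, a "DNS for healthcare data" could be a resolver that maps a persistent dataset identifier to every repository holding linked records. The identifier scheme and endpoints below are hypothetical.

```python
# A toy resolver for a hypothetical "DNS for healthcare data": a persistent
# dataset identifier maps to every repository endpoint holding linked records.
RESOLVER = {
    "phds://cohort/alpha": [
        "https://repo-a.example.org/api",   # e.g. genomic data
        "https://repo-b.example.org/api",   # e.g. EHR-derived phenotypes
    ],
}

def resolve(dataset_id: str) -> list:
    """Return all endpoints linked to this identifier, enabling
    one-to-many connections between data assets."""
    return RESOLVER.get(dataset_id, [])

print(resolve("phds://cohort/alpha"))
```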
We strongly advocate for more funding opportunities to support incremental feature development and enable more data repositories to align with the NIH’s vision for biomedical data repositories. For example, a request for applications like PAR-23-236 or PAR-23-237, but with a reduced focus on innovation and an increased focus on funding the implementation of NIH-specified federation, interoperability, and authentication standards could help unify the data repository landscape.
Goal 3 of the strategic plan aims to develop new opportunities to support the creation and use of cutting-edge software and technology for biomedical research, including AI.
Gaps and Opportunities:
While the objectives strongly support the development of new AI models and technologies, the plan does not address the open sharing of AI models. Open sharing will be critical for maintaining transparency in AI research in healthcare and the biomedical sciences, and it will drive significant improvement of AI in these fields by enabling collaborative, iterative approaches.
Due to the sensitive nature of healthcare and biomedical data, investments should be made to support the development of synthetic datasets on which complex algorithms and AI can be tested. There is an opportunity to generate both synthetic data that is not statistically accurate but is structurally accurate and can be broadly distributed for teaching purposes, and synthetic data that is statistically representative of the source data and rigorously vetted for accuracy.
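As a toy example of the first kind, structurally accurate (and deliberately not statistically faithful) synthetic data can be generated from nothing more than plausible column definitions. All column names and value ranges below are hypothetical.

```python
# A toy generator of structurally accurate synthetic data for teaching and
# pipeline testing; column names and value ranges are hypothetical and the
# output is intentionally NOT statistically representative of any cohort.
import csv
import random

random.seed(0)  # reproducible toy output

COLUMNS = ["patient_id", "age", "smoker", "hba1c"]

def synthetic_row(i: int) -> list:
    return [
        f"SYN-{i:05d}",                       # obviously non-real identifier
        random.randint(18, 90),               # plausible range, not real marginals
        random.choice(["yes", "no"]),
        round(random.uniform(4.0, 12.0), 1),  # plausible lab-value range
    ]

with open("synthetic_cohort.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    writer.writerows(synthetic_row(i) for i in range(100))
```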
For Objective 3-3, beyond releasing accessible software tools, sustainable software development requires robust community maintenance. Supporting user forums and enabling measures of social proof are needed to attract and retain contributors and users for open-source projects. Social proof measures include paper citations, but also project websites, repository engagement metrics (e.g. stars), and accessible benchmarking of methods on standard datasets as evidence of value.
Goal 4 of the strategic plan aims to increase the interoperability between NIH cloud-based data repositories to enable facile federated data use.
Benefits:
Plans for streamlining and semi- or fully-automating requests for controlled access data, including a standardized vocabulary (e.g. GA4GH DUO), would address a critical need, particularly for the AI/ML research community.
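For illustration, machine-readable data use terms could allow a first-pass automated check of an access request against a dataset's consent terms before any human review. The dataset and request below are hypothetical; consult the GA4GH Data Use Ontology (DUO) for authoritative term IDs and definitions.

```python
# A sketch of screening an access request against a dataset's machine-readable
# data use terms. The dataset and request are hypothetical; see the GA4GH Data
# Use Ontology (DUO) for authoritative term IDs and definitions.
DATASET_TERMS = {
    "dataset": "example-cohort-v1",
    "permitted_uses": [
        {"id": "DUO:0000007", "label": "disease specific research"},
    ],
}

REQUEST = {"intended_use": "DUO:0000007"}  # the researcher's stated purpose

def auto_screen(request: dict, dataset: dict) -> bool:
    """First-pass automated check before any human review."""
    allowed = {term["id"] for term in dataset["permitted_uses"]}
    return request["intended_use"] in allowed

print(auto_screen(REQUEST, DATASET_TERMS))  # True
```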
Gaps and Opportunities:
The NIH Cloud Platform Interoperability Effort has made important advances, but needs continued emphasis. A truly federated ecosystem will require training initiatives to use multiple cloud platforms and data enclaves, along with engineering and development work to rewrite data sharing platforms, analysis pipelines, and genomics workflows for different systems. However, the infrastructure and training complexity of implementing a truly federated analytical system may be, in part, offset by eliminating costs associated with transferring massive quantities of data between cloud platforms and regions.
The Researcher Auth Service (RAS) has the potential to be highly valuable to the research community in enabling federated sharing and interoperability, but in our experience it has been challenging to implement due to significant administrative hurdles. It would be helpful to improve the standardization, interoperability, and documentation of the service, reduce its administrative hurdles, and increase the availability of client SDKs to support easier integration.
Notably, the plan does not appear to address interoperability and federation issues with non-NIH data repositories. It is imperative that non-federal repositories, such as the platform we have developed and maintain, Synapse, as well as non-US data repositories such as the European Genome-phenome Archive, are well-integrated into the federated biomedical data ecosystem.
These features should extend beyond federated query and API access to include resources beyond NIH-controlled data. Federated benchmarking protocols (e.g. MedPerf from MLCommons) should be considered to allow analysis across multiple data repositories.
For data interoperability to truly succeed, federated data linkage must be considered the starting point for interoperating across different data cohorts and modalities. For example, consider a research study looking for patients aged 18-35 who are non-smokers, are not pregnant, and have comorbid Alzheimer's disease and diabetes, as evidenced by T1/FLAIR/fMRI brain scans and blood reports. A federated query and API presupposes that each data repository can be queried in a federated manner, which is itself a very high bar. The true power, however, is unleashed when you can not just query, but connect multiple sources of data to expand your sample size.
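A minimal sketch of such a federated query is shown below: the same structured filter is fanned out to multiple repository APIs, and only aggregate counts come back to the researcher. The endpoints and query schema are hypothetical assumptions.

```python
# A hedged sketch of a federated cohort query: one structured filter is fanned
# out to multiple repository APIs, and only aggregate counts come back to the
# researcher. Endpoints and the query schema are hypothetical.
import json
import urllib.request

ENDPOINTS = [
    "https://repo-a.example.org/cohort/count",
    "https://repo-b.example.org/cohort/count",
]

QUERY = {
    "age": {"min": 18, "max": 35},
    "smoker": False,
    "pregnant": False,
    "conditions": ["alzheimers_disease", "diabetes"],
    "evidence": ["T1", "FLAIR", "fMRI", "blood_panel"],
}

def federated_count(query: dict) -> int:
    """Sum per-site cohort counts without moving record-level data."""
    total = 0
    for url in ENDPOINTS:
        req = urllib.request.Request(
            url,
            data=json.dumps(query).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            total += json.load(resp)["count"]  # each site returns a count only
    return total

# Usage: federated_count(QUERY) would return the pooled eligible-patient count.
```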
Requesting access to data held in each data repository is a time-consuming process; see Taylor JA, et al., "The road to hell is paved with good intentions: the experience of applying for national data for linkage and suggestions for improvement"[1]. Researchers currently must wade through a large bureaucratic and administrative burden to request one dataset, let alone multiple datasets. The plan does not address the need to streamline and simplify data access requests both for a single data custodian and across multiple data custodians.
Goal 5 of the strategic plan aims to increase the data science capacity of the biomedical community.
Gaps and Opportunities:
An improvement to the plan would be a strategy for increasing the diversity and availability of cloud-based, no-code analytical tools. In communities that have fewer data science experts, we have observed that there is still a strong desire to derive insights from shared data. However, the learning curve to take raw data and conduct meaningful analyses of those data is steep. This can mean that the individuals who spend immense amounts of effort sharing data generated by their labs then struggle to reap the benefits of the data-sharing efforts of their peers. Instead, data reuse is limited to a smaller subset of data-science-savvy community members. While data science training is imperative (it is clear there are not enough data scientists in biomedical research), the NIH should also consider strategies, tools, and programs to make data more accessible to people without strong computational training.
For Objective 1-1, we suggest considering NIH partnerships with electronic lab notebook companies and data-generation technology companies (e.g. Zeiss, Illumina, 10x Genomics) to facilitate the development of data and metadata generation standards, thus lowering the burden on data generators when extracting and depositing data and metadata.
Many of the challenges we face in biomedical data science, particularly many of those outlined in the strategic plan, are not unique to public biomedical research. The NIH should consider when and how to engage other federal agencies that have to deal with large amounts of data (e.g. NASA) as well as regional agencies and private industry to develop common solutions for common problems.
Beyond the aforementioned gaps and opportunities, we recommend that the strategic plan address the following topics:
Improving search and discovery of biomedical datasets across all platforms. Enhancing data search capabilities is crucial for efficiently locating and accessing relevant datasets. This involves developing more sophisticated search algorithms and indexing methods to improve the precision and speed of data retrieval.
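As a toy illustration, even a simple inverted index over dataset descriptions harvested from multiple repositories allows one query to span all platforms; the records below are hypothetical.

```python
# A toy cross-platform search index: dataset descriptions harvested from
# multiple repositories are folded into one inverted index so that a single
# query spans all platforms. The records are hypothetical.
from collections import defaultdict

DATASETS = [
    {"id": "repoA:ds1", "description": "single-cell RNA-seq of human cortex"},
    {"id": "repoB:ds7", "description": "longitudinal EHR cohort with diabetes"},
]

index = defaultdict(set)
for record in DATASETS:
    for token in record["description"].lower().split():
        index[token].add(record["id"])      # token -> datasets mentioning it

def search(term: str) -> set:
    return index.get(term.lower(), set())

print(search("diabetes"))  # {'repoB:ds7'}
```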
Better methods for generating and sharing synthetic data. There is a growing need for synthetic replica data, which can be used for training AI models without the privacy and security concerns associated with real-world data. This involves creating realistic and representative datasets that can mimic the properties of actual data.
Developing a standard AI-ready biomedical "data card." A standardized format for AI-ready data cards is essential. These data cards should provide comprehensive metadata about datasets, including their origin, contents, and any preprocessing steps taken. This will facilitate better data management and usage in AI projects. Examples of existing resources include the Data Cards Playbook[2] developed by Google.
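A data card for an imaging cohort might look something like the sketch below; the exact fields an NIH standard would require are an open question, so these are our assumptions, loosely inspired by the Data Cards Playbook.

```python
# A hedged sketch of an AI-ready biomedical data card as structured metadata,
# loosely inspired by the Data Cards Playbook; the fields an NIH standard
# would actually require are an open question, so these are assumptions.
DATA_CARD = {
    "name": "example-imaging-cohort-v2",
    "origin": {
        "collected_by": "Example University Medical Center",
        "collection_period": "2019-2022",
        "consent": "broad research use",
    },
    "contents": {
        "n_subjects": 1200,
        "modalities": ["T1 MRI", "FLAIR MRI"],
        "labels": ["diagnosis"],
    },
    "preprocessing": [
        "defaced using a standard de-identification pipeline",
        "intensity-normalized",
    ],
    "known_limitations": ["single site; demographic skew toward ages 60+"],
}
```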
Chat-With-A-Dataset: Exploring the concept of 'chat-with-a-dataset' as a novel way of interacting with data; for example, using large language models to enable users to ask questions and receive answers directly from the dataset, making data exploration more intuitive and accessible.
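A minimal sketch of the idea: dataset metadata plus a user question are combined into a prompt for a large language model. The llm_complete() function below is a hypothetical stand-in for whichever LLM API is actually used, not a specific library.

```python
# A minimal sketch of "chat-with-a-dataset": dataset metadata plus a user
# question become a prompt for a large language model. llm_complete() is a
# hypothetical stand-in for whichever LLM API is actually used.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("wire this to an LLM provider of your choice")

def chat_with_dataset(metadata: dict, question: str) -> str:
    prompt = (
        "Answer the question using only the dataset metadata below.\n\n"
        f"Metadata:\n{metadata}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_complete(prompt)

# Usage (hypothetical metadata):
# chat_with_dataset({"modalities": ["T1 MRI", "FLAIR MRI"], "n_subjects": 1200},
#                   "Which imaging modalities are included?")
```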
Independent benchmarking of AI models: Establishing challenges and benchmarks is important for driving progress in AI research. These should be designed to test the capabilities of AI models in biomedical domains, providing a clear metric for evaluating performance and encouraging innovation. While this was already a need with biomedical machine learning, with the rapid development of generative AI approaches, independent and robust benchmarking of computational models is becoming more critical every day.
Better AI provenance: There is a need to link training datasets directly to AI models to allow for the determination and investigation of model biases. This involves creating transparent and traceable connections between the data used for training and the resulting model behavior. This will provide researchers with the ability to identify and address potential biases and to enable data dignity[3].
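One lightweight way to create such links is to record content fingerprints of the training data alongside the model, as sketched below; all names and placeholder values are hypothetical.

```python
# A sketch of AI provenance: the model record carries content fingerprints of
# its training datasets, so model behavior can be traced back to specific
# data versions. All names and placeholder values are hypothetical.
import hashlib
import json

def dataset_fingerprint(path: str) -> str:
    """Content hash of a training file; it changes whenever the data changes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

MODEL_RECORD = {
    "model": "example-tumor-classifier-v3",
    "training_datasets": [
        {"id": "repoA:ds1", "sha256": "<fingerprint of repoA:ds1>"},
        {"id": "repoB:ds7", "sha256": "<fingerprint of repoB:ds7>"},
    ],
    "training_code_commit": "<git commit hash>",
}

print(json.dumps(MODEL_RECORD, indent=2))
```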