
The Open Science Community Faces Major Barriers to Sharing and Reusing Data. Here’s How We Can Remove Them

Understand the hurdles faced when managing valuable research assets, as well as some new opportunities that can support data-sharing efforts.

Published on Jul 18, 2024

High-quality health and biomedical data is a cornerstone of scientific knowledge and discovery. The power to share and preserve vast amounts of data holds the promise of accelerating breakthroughs, fostering global collaboration, and tackling some of our most difficult health challenges.

Yet the promise and potential of data for secondary research use remain largely untapped due to resource constraints and other barriers. In this article, we explore not only some of the hurdles we face at Sage Bionetworks (Sage) as a steward of valuable research assets, but also some of the solutions that can support data-sharing efforts.

How do we manage data assets?

The entry point for all data at Sage is the Synapse platform. Synapse is an open-source, cloud-based collaborative research platform. It enables researchers to share, organize, and discuss their work, facilitating scientific collaboration. As an NIH-designated generalist repository, Synapse maintains a variety of data types and file formats for the biomedical data community and beyond. 

Sage also develops and maintains specialized data “portals,” such as the AD Knowledge Portal. These portals offer community-centric platforms for storing, sharing, and accessing diverse types of research data, including human specimens, preclinical animal models, multiomics data, and imaging data.

All data is maintained under strict governance controls that aim to safeguard privacy and public trust. Data uploaded to Synapse is classified into data access tiers based on participant consent, legal agreements, institutional policies, and more. Publicly available data may include animal model data and aggregate human data, while controlled-access data includes individual-level human data.
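The tiering rule described above can be illustrated with a minimal sketch. It is not Synapse's actual API or policy engine: the names `AccessTier`, `DatasetInfo`, and `classify` are hypothetical, and the rule is deliberately reduced to the two examples given in the text. Real classification also depends on participant consent, legal agreements, and institutional policies.

```python
from dataclasses import dataclass
from enum import Enum


class AccessTier(Enum):
    """Hypothetical tier labels; not Synapse's real identifiers."""
    PUBLIC = "public"
    CONTROLLED = "controlled"


@dataclass(frozen=True)
class DatasetInfo:
    """Illustrative description of a dataset (assumed fields)."""
    name: str
    involves_humans: bool
    individual_level: bool


def classify(dataset: DatasetInfo) -> AccessTier:
    """Apply the simplified rule from the text.

    Individual-level human data is controlled-access; animal model data
    and aggregate human data may be public. Consent terms, legal
    agreements, and institutional policies are NOT modeled here.
    """
    if dataset.involves_humans and dataset.individual_level:
        return AccessTier.CONTROLLED
    return AccessTier.PUBLIC


if __name__ == "__main__":
    examples = [
        DatasetInfo("mouse_model_rnaseq", involves_humans=False, individual_level=True),
        DatasetInfo("cohort_summary_stats", involves_humans=True, individual_level=False),
        DatasetInfo("participant_genotypes", involves_humans=True, individual_level=True),
    ]
    for ds in examples:
        print(f"{ds.name}: {classify(ds).value}")
```

In practice, a rule like this would only be a first pass, with governance staff making the final determination.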

Despite Sage’s decade-long expertise in responsible data stewardship, the broader ecosystem faces persistent challenges that require collective attention.



Funding Constraints

The NIH funding model has historically encouraged the development of repositories centered on individual NIH Institutes and Centers (ICs) or on specific domains, each tailored to a particular research community. While this approach offers significant advantages in terms of specialized functionality, it can lead to resource silos and limited interoperability.

Generalist repositories like Synapse provide an alternative, offering a single entry point for researchers to access and use multiple data modalities. This can bridge gaps between specialized repositories and provide a standardized environment for maximizing data reuse.

However, maintaining large-scale biomedical datasets incurs substantial costs, especially for genomics, imaging, spatial distribution, and longitudinal studies. Data repositories require expansive storage capacity, robust and secure infrastructure, and efficient retrieval systems. The dynamic nature of biomedical research necessitates scalable solutions that can adapt to increasing data volumes and emerging analytical techniques.

Sustainable funding for data-sharing infrastructure is another major challenge. While policies like the NIH's Data Management and Sharing Policy (DMSP) help advance research, the costs of managing data often exceed available resources. Open science organizations like Sage incur significant, unrecoverable costs from hosting data long term.

The current fragmented funding model falls short of addressing the true costs of long-term preservation and maintenance of valuable research assets, highlighting the need for a more comprehensive approach to supporting data-sharing initiatives.

Human Capital

The variety of data formats and metadata standards used across different repositories presents significant challenges for integrating and utilizing data from multiple sources. This lack of standardization often leads to inconsistencies and incompatibilities, complicating the process of merging datasets for comprehensive analysis. Consequently, researchers must invest considerable time and resources in data transformation and harmonization efforts.
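To make the harmonization burden concrete, here is a minimal sketch of mapping repository-specific metadata field names onto one shared vocabulary. The alias table, field names, and `harmonize` function are assumptions made for illustration; they do not represent any real repository's schema or a Sage tool. Fields without a known mapping are returned separately so a curator can review them.

```python
# Hypothetical alias table: source field name -> canonical field name.
FIELD_ALIASES = {
    "species": "species",
    "organism": "species",
    "tissue": "tissue",
    "sample_tissue": "tissue",
    "assay": "assay",
    "assay_type": "assay",
}


def harmonize(record: dict) -> tuple[dict, dict]:
    """Rename known fields to their canonical names.

    Returns (harmonized, unmapped). Unmapped fields are kept under their
    original keys so that a person can decide how to handle them.
    """
    harmonized = {}
    unmapped = {}
    for key, value in record.items():
        canonical = FIELD_ALIASES.get(key.strip().lower())
        if canonical is None:
            unmapped[key] = value
        else:
            harmonized[canonical] = value
    return harmonized, unmapped


if __name__ == "__main__":
    repo_a = {"Organism": "Mus musculus", "sample_tissue": "cortex"}
    repo_b = {"species": "Homo sapiens", "assay_type": "rnaSeq", "batch": "B7"}
    for record in (repo_a, repo_b):
        clean, leftover = harmonize(record)
        print(clean, leftover)
```

Even this toy version shows why the work is labor-intensive: every new source adds aliases to maintain, and anything the table does not recognize still needs a human decision.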

Effective management and sharing of big biomedical data demand a highly skilled, multidisciplinary workforce. Expertise is needed in data management, data processing, infrastructure implementation, governance, and data integration.

While AI services can support data curation efforts, a “human-in-the-loop” approach remains essential for handling sensitive health and biomedical data appropriately. Investing in this specialized workforce is crucial for driving innovation, alongside investments in tools and technology.

Reviewing data access requests also requires a nuanced, human-centric approach. Expert personnel must carefully evaluate each application to ensure alignment with ethical and legal standards. This labor-intensive vetting process involves interpreting the context and implications of each request. Constraints here can lead to delays in approvals, potentially hindering timely data access and slowing down downstream research.

Supporting data users also goes beyond technical solutions; it requires expertise and strong interpersonal skills. This support includes assisting users with data interpretation, troubleshooting complex queries, and resolving various data-related issues. It’s vital for preventing data misuse or misinterpretation, but it demands significant time and expertise from personnel. Inadequate support can lead to errors in data usage, ultimately reducing the quality and impact of secondary research.

Researchers working with clinical data must navigate complex informed consent requirements, which can restrict data sharing. Sociocultural norms and values also influence study participants' willingness to share data and researchers' ability to do so, adding another layer of complexity to the data-sharing process.

Moreover, the landscape of privacy laws, human subjects regulations, and data protection standards presents significant challenges, especially across multiple jurisdictions. While these laws are crucial for maintaining public trust, compliance can be burdensome, particularly for teams lacking specialized legal expertise.

The rapid pace of technological advancements, such as AI, further complicates the interpretation and implementation of these regulations. For instance, interpretation of guidance like the EU AI Act requires collaboration among research teams, mirroring the collaboration needed among science teams in translational research studies.

In addition to laws and regulations, institutional policies, data-sharing agreements, and contractual obligations can significantly limit data dissemination. The dynamic nature of guidelines and standards requires ongoing reassessment of data-sharing protocols. Variable risk tolerance among institutions leads to diverse interpretations of laws and regulations, complicating efforts to establish consistent data-sharing practices across organizations.

Finally, political dynamics can impede open data-sharing initiatives by politicizing scientific research. Certain data categories, such as genomic data with commercial potential or population data from specific cultural groups, may require additional safeguards. It is crucial to build systems that support responsible and ethical reuse of sensitive data in secondary research, particularly in light of current sociocultural and geographic trends. 

With Challenges Come Opportunities

Addressing these challenges requires strategic investments in infrastructure, clear and flexible legal frameworks, robust ethical guidelines, and ongoing support and training for both data contributors and data users. With this in mind, we identify potential solutions to each set of challenges that could help support data-sharing activities and increase the impact of open science.

Sustainability

  • Challenge: Research data ecosystems face risks to long-term scalability and sustainability without adequate financial support.

  • Opportunity: Develop creative funding models, including public-private partnerships, that contribute to the value proposition for our research communities and address sustainability challenges in the biomedical research ecosystem.

Resource Constraints

  • Challenge: Individual researchers may lack the resources to effectively manage, share, and analyze large datasets.

  • Opportunity: Provide resources, training, and tools for data management and analysis. Invest in workforce development across all career stages and promote team science across research silos.

Data Accessibility

  • Challenge: Biomedical data is often siloed in different repositories with complex access mechanisms, making it difficult for researchers to find and use data.

  • Opportunity: Improve searchability features, such as advanced search engines, and develop open-access repositories and/or platforms to improve data findability and accessibility across multiple platforms.

Data Quality and Standardization

  • Challenge: Secondary data may vary in quality, with inconsistencies affecting research outcomes and complicating dataset integration.

  • Opportunity: Establish rigorous data collection and reporting standards, and develop AI-based tools to translate metadata into common data models.

Legal and Ethical Issues

  • Challenge: Navigating intellectual property, data ownership, and privacy protection can be complex. Together they may deter data sharing.

  • Opportunity: Implement advanced data anonymization techniques, dynamic consent methods, and clear communication of data-sharing rules.

Community Integration

  • Challenge: Biomedical data contributors often lack integration with other research communities, leading to poor idea propagation.

  • Opportunity: Create engagement hubs to foster collaboration across biomedical sciences, similar to GitHub for coding or Hugging Face for AI.

Success Story: Sage, Synapse, & the Alzheimer’s Disease (AD) Knowledge Portal 

Active promotion is an essential facet of encouraging data reuse in the biomedical research ecosystem. The AD Knowledge Portal, managed by Sage, exemplifies this approach through comprehensive outreach, community engagement, documentation, and training efforts.

Based on user surveys, Sage developed a robust support strategy for the AD Knowledge Portal. This includes tutorials, vignettes, videos, and hands-on workshops.

  • Annual virtual workshops introduce participants to the Portal's infrastructure, data discovery tools, metadata integration, and analytical resources.

  • A comprehensive documentation site provides how-to articles, FAQs, a glossary, and technical vignettes with code snippets.

  • An active discussion forum, with over 700 posts to date, facilitates direct interaction between the DCC team and data contributors.

The AD Knowledge Portal also maintains a multi-faceted communication strategy to announce new data and updates. This includes a newsletter reaching over 1,400 subscribers, regular blog posts, active social media engagement (averaging 84,995 engagements per month on X), and a formal webinar series. Sage also organizes information booths at major international conferences, resulting in significant increases (222% in 2022) in Portal website visits.

These community outreach efforts have catalyzed both exploration of Portal resources and data reuse. Notably, 60% of citations attributing AD Knowledge Portal data represent secondary data use. Data contributors have acknowledged the positive impact of data sharing on scientific discovery and validation.

As one AD consortium member put it: “the biggest satisfaction is almost instantaneously seeing your discoveries replicated and validated…consortia like this help the investigator but they also help accelerate the science.”

The success of data repositories like Synapse extends beyond technological investments. Community engagement is key to driving data-sharing strategies that foster collaboration, inspire innovation, and cultivate a culture of open science. This approach not only accelerates scientific progress but also builds a vibrant, interconnected community dedicated to advancing biomedical research for the greater good. This collective effort ensures that the full potential of shared data is realized, leading to breakthroughs that can transform health outcomes and improve lives worldwide.
