Today's sequence data infrastructure is set up for failure in the age of AI.
Building an open and collaborative sequence platform for both human and AI scientists.
Biological sequences are immensely diverse, and we have only begun to understand the functional significance of this diversity. Interpreting these sequences has become a critical step in scientific inquiry, from discovering new therapeutics and engineering organisms to modeling ecosystems. However, the vast majority of sequences remain functionally unannotated (unlabelled) or, worse, misannotated. In the age of AI, we optimistically imagine a future where we can finally begin to understand the language of biology, much as we can now readily predict protein structures with AlphaFold. However, such a fundamental advance will not be possible unless we innovate on the sequence data infrastructure that underpins functional curation.
Sequence-to-function curation is, and has always been, a collaborative process. Given a gene of interest, a scientist first predicts its biological function by looking for similar sequences that have already been studied, in a process called sequence annotation. Sequence annotation is often the first major step in sequence-based analysis, and all subsequent analyses (e.g. target selection and experimental design) therefore depend on its accuracy and sensitivity. Unfortunately, the way we collectively curate sequence data is fundamentally limited, for three main reasons:
Scientific findings cannot be searched directly with sequences. One cannot simply google a DNA or protein sequence to retrieve related information. Instead, one must first map the sequence to a previously indexed identifier (e.g. a UniProt ID) or a descriptor (e.g. a commonly used gene name) using sequence and/or structural similarity. Related scientific findings can then only be discovered if other authors used the same identifier or descriptor in their web-indexed papers. This two-step search process leads to compounding information loss and inaccuracy. (A notable tool addressing this pain point is PaperBLAST, which automates the two-step process using a curated database of sequence-to-paper pairs.)
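To make the two-step process concrete, here is a minimal sketch using Biopython (a hypothetical example: the sequence is a placeholder, the parsing is simplified, and real queries require network access to NCBI):

```python
# Step 1: map a sequence to a known identifier/descriptor via remote BLAST.
# Step 2: search the literature for that descriptor.
# Any paper whose authors used a different name for the same protein is missed.
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact email; placeholder

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder protein fragment

# Step 1: sequence -> identifier (a slow, remote, alignment-based search)
blast_handle = NCBIWWW.qblast("blastp", "nr", sequence)
record = NCBIXML.read(blast_handle)
top_hit = record.alignments[0].hit_def  # free-text descriptor of the best hit

# Step 2: identifier -> literature (finds only papers indexed under this name)
search_handle = Entrez.esearch(db="pubmed", term=top_hit)
pmids = Entrez.read(search_handle)["IdList"]
print(f"Top hit {top_hit!r} -> {len(pmids)} PubMed records")
```

Each hop discards information: the BLAST result is reduced to a free-text descriptor, and the literature search then depends entirely on other authors having used that exact descriptor.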
Databases used for sequence annotation are largely static. Even though scientific knowledge is ever-evolving and the number of scientific outputs is growing exponentially, sequence annotation databases (e.g. SwissProt, MetaCyc) are manually curated by a small number of experts who update them as new papers are published. To make use of our collective scientific practice in interpreting the biological code, we need a more scalable, real-time, and self-improving database that maps sequences to their biological functions.
The experimental data that inform sequence function are rarely discoverable. In most cases, when a research product is published, the critical data mapping sequences to experimental readouts are hidden in zipped supplemental datasets that are rarely findable with standard sequence search tools (e.g. BLAST). Without a data infrastructure that makes these datasets discoverable and reusable, AI models and agents cannot leverage the raw experimental data critical for scientific reasoning and discovery.
What we are building
The next iteration of sequence data infrastructure should be built as a collaborative, self-improving, and real-time platform for sequence curation.
In the near term, this will accelerate research and improve the efficiency of individual research projects. The longer-term vision, however, is to make sequence information interface directly with AI systems, and ultimately to train models that effectively leverage the sequence knowledgebase the scientific community has collectively curated. While we expect our ideas to evolve and be refined as we iterate, here are the three core ideas we are starting with:
The way we search, compile, and propagate scientific findings should be sequence-centric. A data infrastructure that centralizes information in a sequence-centric manner can be optimized for efficient propagation and traversal by leveraging the unique evolutionary relationships between biological sequences. We are doing this by representing sequences as genomic language model (gLM) embeddings, enabling real-time search, efficient indexing, and compression of evolutionary and contextual information.
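As a rough illustration of what embedding-based, sequence-centric search could look like, here is a sketch (not our implementation: the gLM encoder below is a random-vector stand-in, and FAISS is just one possible index backend):

```python
# Sketch: nearest-neighbor search over gLM embeddings with FAISS.
import numpy as np
import faiss

d = 512  # embedding dimension (assumed)

def embed(sequences: list[str]) -> np.ndarray:
    """Stand-in for a real gLM encoder: one d-dim vector per sequence."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(sequences), d)).astype("float32")

# Embed and index a corpus of annotated sequences once...
corpus = ["ATGGCT...", "ATGAAA...", "ATGTTG..."]  # placeholder sequences
xb = embed(corpus)
faiss.normalize_L2(xb)        # normalize so inner product = cosine similarity
index = faiss.IndexFlatIP(d)
index.add(xb)

# ...then any new sequence becomes a query, with no alignment step needed.
xq = embed(["ATGGCA..."])
faiss.normalize_L2(xq)
scores, ids = index.search(xq, 2)  # top-2 nearest annotated sequences
print(ids[0], scores[0])
```

Unlike alignment-based search, such an index can be updated incrementally as newly annotated sequences arrive, which is what makes real-time, self-improving curation plausible.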
Community buy-in starts with addressing critical pain points for scientists. Previous attempts at crowdsourcing scientific curation (e.g. COMBREX) and encouraging data deposition have often failed due to poor community buy-in, unless enforced by mandatory deposition requirements (e.g. the Protein Data Bank). We look for inspiration in successful parallels such as GitHub and the Hugging Face Hub, where community buy-in and cultural shift are driven by the platform addressing a critical need (i.e. software version control and model sharing, respectively).
Data deposition should be seamlessly integrated with sequence analysis. Today, the raw data (both experimental and sequencing data) that lead to sequence interpretation are often deposited only because peer-reviewed journals require it to ensure research integrity and reproducibility. Compiling this supplementary information is typically treated as an afterthought in manuscript preparation, because until now the currency of science has been the scientific narrative that other scientists can cite. In the age of AI, however, we expect the underlying data to become the currency of scientific impact. Enabling this requires data deposition that is both easy and transparent, and that is best achieved by integrating deposition with data analysis. To draw inspiration from GitHub and Hugging Face once again: depositing code, datasets, and models there requires only toggling a repository from private to public. Importantly, in these two examples the upside of making code, data, or models public is immediately clear, because they are already the native currency of impact in software engineering and machine learning, respectively.
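To ground the analogy, here is roughly what that flow looks like with the huggingface_hub Python client (a sketch: the repository name and file are placeholders, and exact method names may vary across client versions):

```python
# Sketch: deposition as a side effect of the normal workflow,
# with publication reduced to a visibility toggle.
from huggingface_hub import HfApi

api = HfApi()  # assumes prior login via `huggingface-cli login`

# Deposit privately as part of day-to-day analysis...
api.create_repo("my-lab/enzyme-assay-2024", repo_type="dataset", private=True)
api.upload_file(
    path_or_fileobj="assay_results.csv",  # placeholder local file
    path_in_repo="assay_results.csv",
    repo_id="my-lab/enzyme-assay-2024",
    repo_type="dataset",
)

# ...and "publishing" is a single toggle, not a separate, unrewarded effort.
api.update_repo_visibility("my-lab/enzyme-assay-2024",
                           private=False, repo_type="dataset")
```

A sequence platform could offer the same shape of workflow: analysis tools that deposit as they run, with publication as a toggle rather than a chore.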
Decoding the language of genomes and proteins can only be achieved through collaboration, not only across the scientific community but also between humans and artificial intelligence. We are building a new type of sequence data infrastructure for both human and AI scientists, so that we can have a proper shot at the profound task of understanding the code that underlies all living things.
We thank Simon Roux and Antonio Camargo for their feedback on this blogpost.