A collection of tasks to probe the effectiveness of protein sequence representations in modeling aspects of protein design
This repository is for the paper submitted to the 2021 NeurIPS Benchmark track.
- collect_splits contains notebooks to process RAW datasets collected from various sources.
- splits contains all splits, a brief description of their processing and the logic behind train/test splits.
- baselines contains code used to compute baselines.

A .gitignore'd folder called data contains RAW data used to produce all splits. As the folder size is substantial, it could not be shipped with GitHub. However, it can be accessed here: http://data.bioembeddings.com/public/FLIP
All FLIP datasets are also available in FASTA format (following the standardization proposed in biotrainer).
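As a rough illustration of how FASTA-formatted datasets could be read, the sketch below parses records with only the Python standard library. It assumes each header carries an identifier followed by space-separated KEY=VALUE annotations. The attribute names used in the toy data (TARGET, SET, VALIDATION) and the helper names `parse_fasta` and `_build_record` are illustrative assumptions, not something this README specifies.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, Dict[str, str], str]


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (sequence_id, attributes, sequence) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _build_record(header, chunks)
            header, chunks = line[1:], []
        else:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield _build_record(header, chunks)


def _build_record(header: str, chunks: List[str]) -> Record:
    # Assumption: the first token is the ID, later KEY=VALUE tokens are annotations.
    tokens = header.split()
    seq_id = tokens[0] if tokens else ""
    attrs = dict(tok.split("=", 1) for tok in tokens[1:] if "=" in tok)
    return seq_id, attrs, "".join(chunks)


# Toy data with hypothetical attribute names.
example = """>Seq1 TARGET=0.5 SET=train VALIDATION=False
MKTAYIAK
QRQISFVK
>Seq2 TARGET=1.2 SET=test VALIDATION=False
MSEQ
""".splitlines()

records = list(parse_fasta(example))
train_ids = [seq_id for seq_id, attrs, _ in records if attrs.get("SET") == "train"]
```

To read a real file, pass an open file handle instead of `example`, for instance `parse_fasta(open(path))`, where `path` points to a downloaded FASTA file.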
The goal of the splits in this repository is to assess how well machine learning models that take protein sequences as input can represent different dimensions relevant for protein design.
The main place to find out about the splits is the splits folder. Each set contains a zip file with one or more "splits", where different splits may be different train/test splits based on biological or statistical intuition.
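Since each set ships its splits inside a zip file, a quick way to see what a set offers is to list the archive members before extracting anything. The following is a minimal sketch using Python's standard `zipfile` module. The member names and CSV columns in the toy archive are hypothetical and only stand in for whatever a real archive contains; `list_split_files` and `read_member_text` are illustrative helper names.

```python
import io
import zipfile


def list_split_files(zip_source):
    """Return the sorted names of all non-directory members of a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(
            info.filename for info in archive.infolist() if not info.is_dir()
        )


def read_member_text(zip_source, member):
    """Read one archive member and decode it as UTF-8 text."""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(member).decode("utf-8")


# Toy archive built in memory; file names and columns are made up for the demo.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("splits/split_a.csv", "sequence,target\nMKT,0.5\n")
    archive.writestr("splits/split_b.csv", "sequence,target\nMSE,1.2\n")

members = list_split_files(buffer)
first_header = read_member_text(buffer, members[0]).splitlines()[0]
```

For an archive on disk, pass its path in place of `buffer`.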
Each split is associated with a semaphore that indicates what it may be used for: