Vdjdb Db Save

🗂️ [vdjdb.cdr3.net is up and running] Git-based TCR database storage & management. Submissions welcome!

Project README

VDJDB: A curated database of T-cell receptor sequences of known antigen specificity

Splash

The primary goal of VDJdb is to facilitate access to existing information on T-cell receptor antigen specificities, i.e. the ability to recognize certain epitopes in certain MHC contexts.

Our mission is to both aggregate the scarce TCR specificity information available so far and to create a curated repository to store such data.

In addition to routine database updates providing the most up-to-date information, we make our best to ensure data consistency and fight irregularities in TCR specificity reporting with a complex database validation scheme:

  • We take into account all available information on experimental setup used to identify antigen-specific TCR sequences and assign a single confidence score to highligh most reliable records at the database generation stage.
  • Each database record is also automatically checked against a database of V/J segment germline sequences to ensure standardized and consistent reporting of V-J junctions and CDR3 sequences that define T-cell clones.

This repository hosts the submissions to database and scripts to check, fix and build the database itself.

To build database directly from submissions, go to src directory and run groovy -cp . BuildDatabase.groovy script (requires Groovy).

To query the database for your immune repertoire sample(s) use the vdjmatch software.

A web-based GUI for the database can be found in VDJdb-web repository.

Citing

Please cite the database using the most recent paper Mikhail Goncharov, Dmitry Bagaev, Dmitrii Shcherbinin, Ivan Zvyagin, Dmitry Bolotin, Paul G. Thomas, Anastasia A. Minervina, Mikhail V. Pogorelyy, Kristin Ladell, James E. McLaren, David A. Price, Thi H. O. Nguyen, Louise C. Rowntree, E. Bridie Clemens, Katherine Kedzierska, Garry Dolton, Cristina Rafael Rius, Andrew Sewell, Jerome Samir, Fabio Luciani, Ksenia V. Zornikova, Alexandra A. Khmelevskaya, Saveliy A. Sheetikov, Grigory A. Efimov, Dmitry Chudakov & Mikhail Shugay. VDJdb in the pandemic era: a compendium of T cell receptors specific for SARS-CoV-2. Nature Methods 2022. doi:10.1038/s41592-022-01578-0.

Submission guide

To submit previously published sequence follow the steps below:

  • Create an issue(s) labeled as paper and named by the paper pubmed id, PMID:XXXXXXX. Note that if paper is a meta-study, you can mark it as meta-paper and link issues for its references in a reply to this issue. Also note that in case submitting unpublished sequences, choose any appropriate issue name with details on submitter (name, organization, etc) in issue comments.

  • Create new branch and add chunk(s) for corresponding papers named as PMID_XXXXXXX. Don't forget to close/reference corresponding issues in the commit message.

  • Create a pull request for the branch and check if it passes the CI build. If there are any issues, modify them by fixing/removing entries as necessary.

The structure of submission chunk is provided below, but first a couple of notes:

STYLE Try avoiding spaces (e.g. TRBV7,TRBV5, not TRBV7, TRBV5) and leave fields that have no information as blank (don't use any placeholder). Stick to listed field values at all cost! In case a critical part of your submission doesn't fit in current specification: 1) Create an issue in the issues section (and tag it as maintainance), 2) provide us with an example (e.g. open a pull request). Do not insert critical information into the comment field.

FORMAT Please ensure that Variable/Joining and MHC names in your submission come from IMGT nomenclature (this does not apply to donor MHC typing fields).

The BuildDatabase routine will be executed during CI tests upon each submission and prior to every database release implements table format checks, CDR3 sequence checks and fixes (if possible), and VDJdb confidence score assignment (see below).

To view the list of papers that were not yet processed follow here.

An XLS template is available here.

CAUTION make sure that nothing is messed up (x/X frequencies are transformed to dates, bad encoding, etc) when importing from XLS template. The format of all fields is pre-set to text to prevent this case.

Database specification

Each database submission in chunks/ folder should have the following header and columns:

Complex information columns (required)

These columns convey full information about TCR:peptide:MHC complex and are mandatory for any submission.

column name description
cdr3.alpha TCR alpha CDR3 amino acid sequence. Complete sequence starting with C and ending with F/W should be provided if possible. Trimmed sequences will be fixed at database building stage in case sufficient V/J germline parts are present
v.alpha TCR alpha Variable (V) segment id, up to best resolution possible (TRAVX*XX, e.g. TRAV7, TRAV7*01, TRAV7*02...). Strictly IMGT nomenclature. Can be left blank if unknown.
j.alpha TCR alpha Joining (J) segment id
cdr3.beta TCR beta CDR3 amino acid sequence
v.beta TCR beta V segment id
j.beta TCR beta J segment id
species TCR parent species (HomoSapiens, MusMusculus,...)
mhc.a First MHC chain allele, to best resolution possible, HLA-X*XX:XX, e.g. HLA-A*02:01
mhc.b Second MHC chain allele (B2M for MHCI)
mhc.class MHCI or MHCII
antigen.epitope Amino acid sequence of the epitope
antigen.gene Parent gene of the epitope sequence (e.g. pp24)
antigen.species Parent species of the antigen, to the best clade resolution possible (e.g. HIV-1, HIV-1*HXB2)
reference.id Pubmed id, doi, etc
submitter Name of submitting person/organization

Notes:

In case given record represents a clonotype with either TCR alpha or beta sequence unknown, missing CDR3/V/(D)/J fields should be left blank.

V/(D)/J fields can be left blank, however this will abrogate CDR3 fixing/verification procedure for a given record.

Any record should have at least one of CDR3 alpha/beta fields that are not blank.

Method information columns (optional)

Optional columns (i.e. it is not required to fill them, but they should be present in table header) that ensure correct confidence ranking of a given entry. Used to calculate a single confidence score based on various factors, e.g. fraction of a given TCRab sequence among tetramer+ clones sequenced and verification experiments performed.

column name description
method.identification tetramer-sort, dextramer-sort, pelimer-sort, pentamer-sort, etc for sorting-based identification. For molecular assays use: antigen-loaded-targets (if T cells specificity was analysed against cells incubatetd with antigenic peptide), antigen-expressing-targets (if T cells specificity was analysed against cells tranformed with antigenic organism, protein or peptide, e.g. BCL transformed with EBV). For magnetic cell separation use beads keyword. Add cultured-T-cells or limiting-dilution-cloning if T cells were cultured before sequencing as in this case method.frequency will have completely different meaning. Use comma to separate phrases. For cases that use UMI-tagged multimers use tetramer-umi, etc.
method.frequency Frequency in isolated antigen-specific population, reported as X/X if possible, e.g. 7/30 if a given V/D/J/CDR3 is encountered in 7 out of 30 tetramer+ clones. Formats X%, X.X% and X.X are also supported.
method.singlecell yes if single cell sequencing was performed, blank otherwise
method.sequencing Sequencing method: sanger, rna-seq or amplicon-seq
method.verification tetramer-stain, dextramer-stain, pelimer-stain, pentamer-stain, etc for methods that include TCR cloning and re-staining with multimers. For magnetic cell separation use beads keyword. restimulation, co-culture, antigen-loaded-targets, antigen-expressing-targets for molecular assays that validate specificity of cloned T-cell receptors. direct in case the affinity of TCRs of specific T-cells to the pMHC is quantified directly in some way. Several comma-separated verification methods can be specified.

Notes:

In case method.identification is left blank, the record is automatically assigned with a lowest confidence score possible.

For special cases such as CD8-null tetramers that utilize HLA with mutated residues that abrogate CD8 binding, specify cd8null-tetramer in method.identification field rather than using mhc.a field.

During database build phase, the information from columns mentioned above is collapsed to a JSON string and stored in a single method column, e.g.:

{
   "identification":"tetramer-sort",
   "frequency":"5/13",
   "sequencing":"sanger",
   "verification":"antigen-loaded-targets"
}

Meta-information columns (optional)

column name description
meta.study.id Internal study id
meta.cell.subset T-cell subset, free style, e.g. CD8+, CD4+CD25+
meta.subset.frequency Frequency of a given TCR sequence in specified cell subset, e.g. 5% means that the TCR sequence represents an expanded clone occupying 5% of CD8+ cells
meta.subject.cohort Subject cohort, free style, e.g. healthy or HIV+. If possible, specify to what extent a healthy donor is healthy, e.g. CMV-seronegative.
meta.subject.id Subject id (e.g. donor1, donor2,...)
meta.replica.id Replicate sample coming from the same donor, also applies for different time points, etc (e.g. 5mo)
meta.clone.id T-cell clone id
meta.epitope.id Epitope id (e.g. FL10)
meta.tissue Tissue used to isolate T-cells: PBMC, spleen, etc. or TCL (T-cell culture) if isolated from re-stimulated T-cells
meta.donor.MHC Donor MHC list if available, blank otherwise. IMGT nomenclature (e.g. HLA-A*02:01) is preferable. Allele group names (e.g. A02, B18) is also acceptable (don't use asterisk in such cases). Use comma to separate alleles.
meta.donor.MHC.method Donor MHC typing method if available, blank otherwise
meta.structure.id PDB structure ID if exists, or blank. Records having a structural data associated with them will automatically get the highest confidence score.
comment Plain text comment, maximum 140 characters

Note:

While these columns are optional, subject identifier, replica identifier, etc are used when scanning submission for duplicates. Normally duplicate records (with identical complex information columns) are not allowed, but they will not be considered as duplicates in case they have distinct id fields mentioned above.

During database build phase, the information from columns mentioned above is collapsed to a JSON string and stored in a single meta column, e.g.:

{
   "cell.subset":"CD8+",
   "subject.cohort":"HSV-2+",
   "subject.id":12,
   "clone.id":46,
   "tissue":"PBMC"
}

Condition association columns (for extended database, TBA)

Condition metadata:

column name description
condition.name natural language terms like T1D, pollen allergy, BRCA or YF vaccination
condition.id ICD-11:5A10 for T1D in ICD-11 or OMIM:114480 for breast cancer in OMIM
condition.type infection, vaccination, cancer, allergy or autoimmune
condition.subtype natural language terms like acute or poor prognosis or grade II

Association metadata:

column name description
condition.freq fraction of samples matching the entry
condition.count number of samples matching the entry (can be blank)
population.freq fraction of controls matching the entry, or Pgen computed by OLGA/IgOR
population.count number of controls matching the entry (can be blank)
association.pvalue Association P-value, e.g. enrichment P-value for Fisher's exact test
association.test Fisher, TCRNET, ALICE or another statistical method

Ambiguous antigens (for extended database, TBA)

Peptide pools, long peptides for T-cell culture expansion, non-peptide ligands

column name description
antigen.epitope.long encompassing protein sequence containing the epitope
antigen.peptide.pool e.g. MIRA COVID19 TBD
antigen.nonpeptide α-GalCer or KRN7000 TBD

Non TRAB columns (for extended database, TBA)

Information for non alpha-beta T-cells, CAR-T, etc

column name description
v.delta ID of Variable segment in delta chain
cdr3.delta CDR3 of delta chain
... ...
v.heavy.shm CIGAR string of hypermutations in the heavy chain Variable segment
... ...

Database processing

CDR3 sequence fixing

At this stage, a series of checks is performed for CDR3 sequence and reported V/J segments:

  • In case of canonical (starting with conserved C and ending with F/W) CDR3 sequences: checks if 5' and 3' germline parts match corresponding V/J segment sequences.
  • In case of truncated CDR3 sequences: adds conserved C/F/W residues. Can add more missing residues in case a relatively large contiguous V/J germline match is present.
  • In case excessive germline part is reported (e.g. FGXG instead of simply F at CDR3 3' part), excessive residues are removed.
  • Can correct mismatches in V/J germline regions in case a reliable non-contiguous V/J match is found.

The main reason behind that is that current immune repertoire sequencing (RepSeq) data processing software reports canonical clonotype sequences, high number antigen-specific TCR sequences present in literature are reported inconsistently. The latter greatly complicates annotation of RepSeq data using known antigen-specific TCR sequences.

In case of good V/J germline matching and errors in CDR3 sequence, the final CDR3 sequence in the database is replaced by its fixed version. The following report of CDR3 fixer is placed under cdr3fix.alpha and cdr3fix.beta columns, e.g.

{
	"fixNeeded":true,
	"good":false,
	"cdr3":"CASSQDVGTGGVFALYF",
	"cdr3_old":"CASSQDVGTGGVFALY",
	"jFixType":"FixAdd",
	"jId":"TRBJ1-6*01",
	"jCanonical":true,
	"jStart":14,
	"vFixType":"FailedBadSegment",
	"vId":null,
	"vCanonical":true,
	"vEnd":-1
	}

and

{
	"fixNeeded":true,
	"good":true,
	"cdr3":"CASSLSRGGNQPQYF",
	"cdr3_old":"CASSLSRGGNQPQY",
	"jFixType":"FixAdd",
	"jId":"TRBJ1-5*01",
	"jCanonical":true,
	"jStart":9,
	"vFixType":"NoFixNeeded",
	"vId":"TRBV14*01",
	"vCanonical":true,
	"vEnd":4
}

Field descriptions:

field description
fixNeeded true if corrected CDR3 sequence differs from the original one, false otherwise
good true if the fix can be applied, false if the fix cannot be applied due to bad V/J entry or no V/J matching
cdr3 Fixed CDR3 sequence
cdr3_old Original CDR3 sequence
jFixType Type of fix applied to CDR3 J germline part
jCanonical true if CDR3 ends with F or W, false otherwise
jId J segment identifier
jStart A 0-based index of first CDR3 amino acid that belongs to J segment
vFixType Type of fix applied to CDR3 V germline part
vCanonical true if CDR3 starts with C, false otherwise
vId V segment identifier
vEnd A 0-based index of the last CDR3 amino acid of V segment plus one

Note:

Possible V and J fix types: NoFixNeeded, FixAdd, FixReplace, FixTrim, FailedReplace (too many mismatches), FailedBadSegment (bad segment entry), FailedNoAlignment (no alignment at all)

VDJdb scoring

At the final stage of database processing, TCR:peptide:MHC complexes are assigned with confidence scores. Scores are computed according to reported method entries.

VDJdb scoring is performed by evaluating TCR sequence, identification and verification confidence based on the following criteria:

  1. Ensuring TCR sequence is correctly identified according to method.sequencing and method.singlecell (1-3 points)
    • sanger - several cells sequenced (2+ cells sequenced according to method.frequency) - 2 points, otherwise 1
    • amplicon-seq - frequency is higher than 0.01 - 2 points, otherwise 0
    • single-cell - 3 points if performed
  2. Initial identification of TCR:pMHC is correct according to method.identification (0-1 point)
    • sort-based - frequency is higher than 0.1 according to method.frequency)
    • culture-based - frequency is higher than 0.5
    • limiting dilution/culture prior to sequencing - the method.frequency becomes somewhat ambigous, check if it is higher than 0.5
  3. Verification T-cell specificity (0-3 points)
    • direct method - 3 points, e.g. has PDB id (meta.structure.id is not empty) or some other method that directly evaluates TCR:pMHC binding
    • target stimulation-based - 2 points
    • staining-based - 1 points
    • If verification is performed, then the TCR sequence is assumed to be known, so score from 1. is set to 3

The final score is then calculated as minimal between score from part 1. and sum of scores from part 2. and part 3..

Maximal score is then selected among different records (independent submissions, replicas, etc) pointing to the same unique complex entry (i.e. set of unique complex fields).

score description
0 Low confidence/no information - a critical aspect of sequencing/specificity validation is missing
1 Moderate confidence - no verification / poor TCR sequence confidence
2 High confidence - has some specificity verification, good TCR sequence confidence
3 Very high confidence - has extensive verification or structural data

Database build contents

The final database assembly can be found in the database/ folder upon execution of BuildDatabase.groovy script:

  • vdjdb_full.txt - combined chunks with TCRalpha/beta records, antigen information, etc. All method and meta information are collapsed into two columns with corresponding names. VDJdb scores and CDR3 fixing information for TCR alpha and beta are given in separate columns. This is the raw version of VDJdb.
  • vdjdb.txt - a collapsed version of database used for annotation of single-chain TCR sequencing data by VDJdb-standalone software. Each line corresponds to either TCR alpha or TCR beta record as specified by the gene column. TCR records coming from the same alpha-beta pair have the same index in complex.id column. In case complex.id is equal to 0 a record doesn't have either TCRalpha or TCRbeta chain information. This table is used by VDJdb-standalone and VDJdb-server.
  • vdjdb.meta.txt - metadata for vdjdb.txt table, used by VDJdb-standalone and VDJdb-server.
  • vdjdb.slim.txt - a slim database used for annotation of single-chain TCR sequencing data by VDJdb-standalone software. This is a collapsed version of vdjdb.txt containing unique records for each CDR3:antigen pair and comma-separated lists of values for other columns (*.segm,mhc.*, complex.id and reference.id). This table can be easily parsed with R and Python/Pandas, it is intended for end users exploring VDJdb.
  • vdjdb.slim.meta.txt - metadata for vdjdb.slim.txt table.
  • motif_pwms.txt and cluster_members.txt - position-weight matrices for antigen-specific TCR motifs and representative sets of TCR sequences that constitute them. These tables are computed separately using code from vdjdb-motifs repository.

Note that some statistics can be generated by running R markdown templates in summary/ folder.

Building database release and generating summary figures

First make sure that you clone both vdjdb-db repo and vdjdb-motifs repo to the same folder, say ~/vcs.

Then navigate to vdjdb-db and run bash release.sh. You can then find the output in ~/vcs/vdjdb-db/database, ~/vcs/vdjdb-db/summary and ~/vcs/vdjdb-motifs folders. Note that you have to check .Rmd files that will be executed and manually install missing R packages, as well as get VDJtools binary and place it in the path specified in ~/vcs/vdjdb-motifs/compute_vdjdb_motifs.Rmd.

Database build process with Docker

The repository contains Dockerfile to simplify the database building process. Dockerfile instantiates the correct environment needed to build the database. If you have Docker Desktop installed and running on your machine use the following command to build local Docker image:

docker build -t vdjdbdb .

In order to build the database using the newly created local Docker image create some folder (e.g. /tmp/output) and use it as a external volume when running Docker image. Docker image always puts the result in /root/output folder within docker container.

NOTE: Host path, e.g. /tmp/output, should be absolute.

NOTE: Database building process requires at least 64GB of RAM.

mkdir -p /tmp/output
docker run -v /tmp/output:/root/output vdjdbdb
Open Source Agenda is not affiliated with "Vdjdb Db" Project. README Source: antigenomics/vdjdb-db
Stars
127
Open Issues
72
Last Commit
3 days ago

Open Source Agenda Badge

Open Source Agenda Rating