Computational protocol: Individual heritable differences result in unique cell lymphocyte receptor repertoires of naïve and antigen-experienced cells

Similar protocols

Protocol publication

[…] The Illumina NGS raw reads were processed using a multi-step pipeline. The pre-processing of the sequencing data was done using the VDJPipe NGS processing software (https://vdjserver.org/software, manuscript in preparation). Briefly, sample IDs were extracted from the external identifier (P5 and P7), UMIs were extracted from the first 10 bases of read 1, and the receptor constant region was identified by alignment with a reference library to identify the isotype of the receptors. Together, the sample index, UMI and constant region primer were used to assign a barcode group for each read. Reads where constant region gene or index could not be successfully matched with the original library were discarded. After assignment to a UMI group, reads in the same group were further subdivided using the ‘usearch' sequence clustering algorithm (v7.0.1090) (ref. ) to account for randomly overlapping UMIs. The final barcodes were defined based on clusters of sequences that shared at least 75% identity in positions 100–200 of read 2. The resulting individual barcode groups are each assumed to represent multiple reads from a single mRNA molecule.Additional processing was performed using the pRESTO toolkit. Low-quality reads (average Phred quality score Q<20) were discarded. Reads within a UMI group were aligned, and a consensus sequence for each UMI group was generated (minimum total quality for unambiguous base assignment was set to 30; groups containing large numbers of reads were randomly subsampled to 2,000 reads before alignment). After building a consensus sequence, paired-ends were assembled using pRESTO's sequence aligner, and pairs where no significant overlap could be found were joined end-to-end. Finally, a second filtering step was performed to remove consensus reads with total length <310 bases or average Phred quality score <30.For each processed sequence, V, D and J genes and alleles, locations of complementarity determining regions (CDRs) and the location, length and nucleotide sequence of the junctions were identified using the IMGT/HighV-QUEST online tool (version 1.3.1). Sequences were further filtered based on the quality of the V gene alignment, with a V gene score cutoff of >900 used to remove low-confidence alignments. For comparison, we also processed our data using the VDJServer analysis pipeline (https://vdjserver.org). VDJServer performs raw read processing and repertoire characterization, and relies on a local installation of IgBLAST for making gene assignments. [...] To mitigate the effects of clonal expansion on the memory repertoires of B and T cells (and to re-combine reads that were incorrectly separated into distinct molecules during the sequence processing step), sequences were grouped by clonality using Change-O, and each clone was only counted once in all future analyses. Sequences from an individual donor were first grouped by V gene, J gene and CDR3 length. For T cells, sequences with similarity less than or equal to 1 nucleotide (nt) mutation (to account for amplification or sequencing error) in their CDR3 regions were considered clonally related, and were grouped into a clonal group. For B cells, where somatic hypermutation must be accounted for, sequences differing from one another by a weighted distance of less than 0.15 within the junction region were defined as clones. Distance was measured as the number of point mutations weighted by a symmetric version of the nucleotide substitution probability as previously described. A distance of 0.15 corresponds to 15 transition mutations per 100 nt, or ∼5 per 100 nt of the least likely mutations. [...] Subject-specific genotypes, including alleles not in the IMGT database, were found using the TIgGER allele finder. Briefly, novel IGHV alleles were identified based on analyses of mutation patterns in IGHV sequences by comparing each sequence to that of the germline allele assigned by IMGT/HighV-QUEST. Sequences that aligned better to novel germline alleles were then reassigned from their initial IMGT/HighV-QUEST assignments, and sequences that perfectly matched a germline allele (novel or known) were determined. Among these sequences, frequencies of allele assignments were calculated for each gene, and the minimum number of alleles required to explain seven out of eight of the observed sequences was determined. These alleles were considered to constitute an individual's genotype. Allele assignments were then re-calculated by alignment with germline sequences from only the individual's genotype.Following genotype determination, V genes with heterozygous alleles were considered for further analysis. The number of reads containing each allele was counted in both the non-productive sequences and in all sequences, and the ratios of matched heterozygous alleles were calculated for each subject. For consistency across donors, the ratio was calculated by setting the numerator to the first allele numerically (that is, IGHV1–69*01/IGHV1–69*02), and the correlation between ratios was measured using the Pearson metric. […]

Pipeline specifications

Software tools VDJServer, USEARCH, pRESTO, IgBLAST, Change-O, TIgGER
Databases IMGT
Application Rep-seq analysis