Computational protocol: Sequence Based Discovery Demonstrates That Fixed Light Chain Human Transgenic Rats Produce a Diverse Repertoire of Antigen Specific Antibodies

Protocol publication

[…] We generated approximately 100,000 paired-end reads for each sample sequenced. To estimate the total number of CDR3 clonotypes present in a sample from the number of CDR3 clonotypes identified at this sequencing depth, we conducted four technical replicate sequencing runs from one lymph node sample. These experiments yielded an average of 112 unique CDR3 clonotypes per experiment. We then measured the overlap of CDR3 sequences between each pair of technical replicates. The average overlap across pairwise comparisons was 96 clonotypes.
Mathematically, these results can be modeled as a twice-replicated counting experiment in which some number of entities (112 in this case) is chosen from a larger population. From the number of entities chosen in both counting experiments, the actual size of the total population can be reasonably inferred. We consider two samplings, $\binom{n}{k_1}$ and $\binom{n}{k_2}$, and want to infer the most likely value of $n$ from the value of $y = |k_1 \cap k_2|$ (the number of objects common to both samples), where the samples consist of uniquely identifiable objects and are of equal size ($k_1 = k_2$).
To achieve this goal, we conducted a computer simulation using Python and the NumPy (Numerical Python) library in which we varied the values of $n$ and $k_1$, $k_2$. For each distinct set of $n$ and $k_1$, $k_2$ values, we repeated the simulation 5,000 times and averaged the resulting overlap. As the population size $n$ varies, the average overlap $y$ changes, so we can view the average overlap $y$ as a function of the population size.
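The simulation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function name `mean_overlap`, its parameters, and the fixed random seed are assumptions introduced here, and both samples are assumed to be drawn uniformly without replacement.

```python
import numpy as np


def mean_overlap(n, k, repeats=5000, seed=0):
    """Average number of objects shared by two independent samples.

    Hypothetical helper (not from the original protocol): each trial draws
    two samples of k distinct objects from a population of n labeled objects
    and counts how many labels appear in both samples.
    """
    rng = np.random.default_rng(seed)
    overlaps = np.empty(repeats)
    for i in range(repeats):
        first = rng.choice(n, size=k, replace=False)
        second = rng.choice(n, size=k, replace=False)
        # intersect1d returns the labels present in both samples
        overlaps[i] = np.intersect1d(first, second).size
    return overlaps.mean()


if __name__ == "__main__":
    # Scan a few population sizes for the observed sample size k = 112.
    for n in (112, 120, 131, 150, 200):
        print(n, mean_overlap(n, 112))
```

With `k = 112`, the simulated average overlap falls as `n` grows; the population size whose average overlap is nearest the observed value of 96 is the estimate of interest.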
Our simulations can be generalized by the following equation: $y(n) = k^2 n^{-1}$, where $y$ = average number of objects found in both repeated samplings; $k$ = the number of objects sampled in each individual experiment; and $n$ = the total size (number of distinct objects) of the population being sampled.
Based on this equation, we determine the appropriate value of $n$ from $y = 96$ (CDR3 clonotypes found in both technical replicates) and $k = 112$ (total CDR3 clonotypes sampled in each experiment). Rearranging gives $n = k^2 / y = 112^2 / 96 \approx 131$. Thus, our technical replicate results suggest a likely starting population of 131 unique CDR3 clonotypes. We therefore calculate our sampling efficiency as follows: 112 sampled / 131 total available = 85.5%.
[...] For the NGS-based clonotype analysis, we downloaded and processed all paired fastq reads for each sample. Each sample was covered by approximately 100,000 paired reads on average. A first-pass quality control step eliminated artifactual sequences containing homopolymer runs of 30 or more bases. We also performed a permissive alignment to all human V-gene framework 1 sequences and kept only those sequences with at least 20 aligned nucleotides derived from the human Ig locus. After applying these QC filters, we merged the forward and reverse reads of each pair using the FLASH package and kept all reads that were successfully merged. For each merged read, we then determined the longest open reading frame and output the amino acid sequence it encodes. We aligned all of the protein sequences to the set of human germline IGHV genes from IMGT using IgBLAST. Based on the protein alignments and the IMGT coordinate system, we then determined the framework and CDR regions of the full heavy chain variable region protein sequence.
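As an illustration of two of the steps above, the homopolymer filter and the longest-open-reading-frame translation might be sketched as below. The function names are hypothetical, and the ORF definition used here (the longest stop-free stretch in any of the three forward frames, with no start codon required) is an assumption; the source does not specify how the open reading frame was defined. Read merging with FLASH and alignment with IgBLAST are not reproduced.

```python
import re

# Standard genetic code, built from the conventional TCAG codon ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO)
)


def has_long_homopolymer(seq, min_run=30):
    """True if seq contains a run of min_run or more identical bases."""
    pattern = re.compile(r"([ACGTN])\1{%d,}" % (min_run - 1))
    return pattern.search(seq.upper()) is not None


def longest_orf_protein(seq):
    """Translate the longest stop-free stretch over the three forward frames.

    Assumption: no start codon is required and the reverse strand is ignored.
    Codons containing ambiguous bases are translated as 'X'.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        protein = "".join(CODON_TABLE.get(codon, "X") for codon in codons)
        # Stop codons ('*') split the frame into candidate ORFs.
        for segment in protein.split("*"):
            if len(segment) > len(best):
                best = segment
    return best
```

In this sketch, a merged read would be discarded when `has_long_homopolymer(read)` is true and otherwise passed to `longest_orf_protein(read)` before alignment.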
Based on the CDR and framework annotation, we then determined the CDR3 sequence contained in each protein sequence derived from the paired sequence reads. We used agglomerative clustering to cluster the full set of CDR3 protein sequences for each sample at an 80% similarity threshold and recorded the total number of reads in each cluster. We define a clonotype as a cluster of CDR3 protein sequences grouped at 80% similarity. We counted as CDR3 clonotypes only those clusters composed of five or more paired sequence reads. A summary of the NGS sequence metrics derived from the samples analyzed can be found in Table in Supplementary Material.
To compare the overlap of CDR3 clonotypes between lymph node samples or between samples from different animals, we performed an all-by-all comparison of the consensus sequences of the CDR3 clonotypes, using the Wagner–Fischer algorithm to calculate the Levenshtein distance between each pair of CDR3 consensus sequences. Two sequences were said to match when the Levenshtein distance between them was less than 20% of the length of the longer sequence. This matching criterion is consistent with the two sequences belonging to the same CDR3 clonotype at 80% similarity.
We quantified the polarization, or skewed representation, of CDR3 clonotypes by calculating the percentage of the total reads in a sample that are contained in each CDR3 clonotype. We then ranked the CDR3 clonotypes by sequence read abundance for each sample and calculated the mean and the 25th and 75th percentile values of correspondingly ranked CDR3 clonotypes across all samples using the BoxPlotR package available at: http://shiny.chemgrid.org/boxplotr/. […]
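The matching criterion for CDR3 consensus sequences can be illustrated with a short sketch. The Wagner–Fischer dynamic program below is the standard textbook form; the function names and the `max_fraction` parameter are hypothetical and are not taken from the original pipeline.

```python
def levenshtein(a, b):
    """Levenshtein distance via the Wagner-Fischer dynamic program.

    Only the previous row of the distance matrix is kept in memory.
    """
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cdr3_match(a, b, max_fraction=0.20):
    """True when the distance is below 20% of the longer sequence's length.

    Hypothetical helper illustrating the matching rule stated in the text.
    """
    return levenshtein(a, b) < max_fraction * max(len(a), len(b))
```

Because the comparison is strict, two 10-residue sequences that differ at exactly two positions do not match under this rule, whereas two that differ at one position do.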

Pipeline specifications

Software tools: NumPy, IgBLAST, BoxPlotR
Databases: IMGT
Applications: Miscellaneous, Rep-seq analysis
Organisms: Rattus norvegicus, Homo sapiens