Computational protocol: Germline mutations in ETV6 are associated with thrombocytopenia, red cell macrocytosis and predisposition to lymphoblastic leukemia

Protocol publication

[…] Reads passing the Illumina chastity filter were subjected to a quality filter step that removed low-quality bases from the 3′ end and retained a read pair only if both trimmed reads were 50 bp or longer. Paired reads that passed the quality filter were mapped to the reference human genome sequence (hg19) with GSNAP (Genomic Short-read Nucleotide Alignment Program, version 2012-07-20). Single-nucleotide polymorphisms (SNPs) and insertions and deletions (indels) were called using GATK (Broad's Genome Analysis Toolkit, v2.1-8-g5efb575). ANNOVAR (Annotate Variation, version 2012-03-08) was used to classify variants and to cross-reference them against several genetic variation databases. The ANNOVAR databases used here identify nonsynonymous and splice site variants (refGene.txt), variants in conserved genomic regions (phastConsElements46way.txt), variants in segmental duplications (genomicSuperDups.txt), and variant frequencies from the 1000 Genomes database (hg19_ALL.sites.2012_02.txt). Variants located outside of conserved regions or with frequencies >1% were excluded from further analysis. Only nonsynonymous changes (SNPs and indels), splice site variants, and/or aberrant stop codon changes were considered for further analysis. All insertion and deletion variants were considered damaging, whereas SNP variants were cross-referenced to dbNSFP (database for nonsynonymous SNPs' functional predictions, version 2.0) to determine whether the changes would be considered tolerable or damaging using four algorithms: Sorting Intolerant From Tolerant (SIFT), PolyPhen-2 (Polymorphism Phenotyping v2), likelihood ratio test (LRT), and MutationTaster. The final filtered variant lists of the affected family members were then intersected to find putative causal variants. [...] RNA was isolated from leukoreduced platelet preparations stored in Trizol as previously described.
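The variant filtering cascade in the exome excerpt above (conservation, population frequency, functional class, predicted damage, then intersection across affected family members) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Variant` record, its field names, the class labels, and the toy coordinates are all hypothetical. The excerpt also does not state how many of the four predictors had to call a SNP damaging, so `min_damaging_calls` is an assumed, adjustable parameter.

```python
from dataclasses import dataclass

# Assumed constants: the 1% cutoff comes from the text, the class labels are illustrative.
MAX_POPULATION_FREQ = 0.01
CANDIDATE_CLASSES = {"nonsynonymous", "splice_site", "stop_change"}


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant record (field names are assumptions)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    functional_class: str
    in_conserved_region: bool
    population_freq: float      # 0.0 if absent from the frequency database
    is_indel: bool = False
    damaging_calls: int = 0     # how many of the four predictors say "damaging"

    @property
    def key(self):
        return (self.chrom, self.pos, self.ref, self.alt)


def passes_filters(v, min_damaging_calls=1):
    """Apply the filters in the order the text describes them."""
    if not v.in_conserved_region:
        return False
    if v.population_freq > MAX_POPULATION_FREQ:
        return False
    if v.functional_class not in CANDIDATE_CLASSES:
        return False
    if v.is_indel:
        return True  # all indels were considered damaging
    # Assumption: a SNP is kept if enough predictors call it damaging.
    return v.damaging_calls >= min_damaging_calls


def shared_candidates(variants_by_member, min_damaging_calls=1):
    """Intersect the filtered variant sets of all affected family members."""
    filtered = [
        {v.key for v in variants if passes_filters(v, min_damaging_calls)}
        for variants in variants_by_member.values()
    ]
    return set.intersection(*filtered) if filtered else set()


# Toy data only: coordinates and alleles are made up for illustration.
shared_snp = Variant("chr1", 1000, "C", "T", "nonsynonymous", True, 0.0, damaging_calls=3)
common_snp = Variant("chr2", 2000, "G", "A", "nonsynonymous", True, 0.05, damaging_calls=4)
private_indel = Variant("chr3", 3000, "A", "AT", "nonsynonymous", True, 0.0, is_indel=True)

family = {
    "member_1": [shared_snp, common_snp, private_indel],
    "member_2": [shared_snp, common_snp],
}
shared = shared_candidates(family)
print(sorted(shared))
```

Keying each variant on chromosome, position, and alleles keeps the intersection exact. A real pipeline would also need to normalize indel representation before comparing individuals.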
RNA-seq libraries were prepared and bar-coded using Illumina TruSeq V2 with oligo dT selection. Single-end 50-cycle reads were generated on a single lane of the HiSeq 2000 and aligned using Novoalign (Novocraft Technologies, Malaysia) to UCSC genome version hg19 with known and shuffled splice junctions included. Normalization of read counts and differential expression analysis were performed with DESeq2 (http://biorxiv.org/). Sample-to-sample variability in the level of leukocyte transcripts, which can significantly alter read counts (JWR, unpublished observations), was corrected for by including the ratio of PTPRC (the leukocyte marker CD45) to ITGA2B (a platelet marker) as a factor in the model for significance testing. Relationship to affected individuals was also included in the model. Euclidean distance computation, clustering (complete linkage), and heatmap analysis of the regularized-log-transformed read counts were performed in R as described in the DESeq2 vignette. A list of 351 expressed transcripts involved in platelet biogenesis or function was curated from Reactome and from transcripts enriched in platelets compared to all other tissues in Illumina's human body map 2.0. Gene Set Enrichment Analysis (GSEA) of the 177 differentially expressed targets included testing against GO BP, BioCarta, KEGG, and Reactome gene sets. [...] Reads passing the Illumina chastity filter were subjected to a quality filter step as described above. Paired reads that passed the quality filter were mapped to the reference human genome sequence (hg19) with GSNAP (Genomic Short-read Nucleotide Alignment Program, version 2012-07-20). The aligned reads were then searched for gene fusions using two separate algorithms: TopHat-Fusion, v2.0.9 and FusionMap, version 7.0.1.25. Intersecting the gene fusion predictions from the two programs yielded a single high-confidence candidate. […]
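The final step above, keeping only fusions predicted by both callers, amounts to a set intersection over gene pairs. The sketch below assumes each tool's output has already been parsed into pairs of gene symbols. The function names and the placeholder symbols (`GENE_A`, `GENE_B`, ...) are hypothetical, and treating a fusion as an unordered pair is an assumption the excerpt does not confirm.

```python
def normalize(fusion):
    """Represent a fusion as an unordered, upper-cased gene pair.

    Assumption: partner order (5' vs 3') is ignored when comparing tools.
    """
    gene_a, gene_b = fusion
    return frozenset((gene_a.upper(), gene_b.upper()))


def high_confidence_fusions(calls_tool_1, calls_tool_2):
    """Return the fusions reported by both callers."""
    set_1 = {normalize(f) for f in calls_tool_1}
    set_2 = {normalize(f) for f in calls_tool_2}
    return set_1 & set_2


# Placeholder calls, not real predictions from either program.
tophat_fusion_calls = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")]
fusionmap_calls = [("gene_b", "gene_a"), ("GENE_E", "GENE_F")]

consensus = high_confidence_fusions(tophat_fusion_calls, fusionmap_calls)
for pair in consensus:
    print(" / ".join(sorted(pair)))
```

If partner order matters for the analysis, replacing the `frozenset` with an ordered tuple makes the comparison strict.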

Pipeline specifications

Software tools: NovoAlign, DESeq2, GSEA, GSNAP, TopHat-Fusion, FusionMap
Databases: Reactome, KEGG
Application: RNA-seq analysis
Diseases: Anemia, Blood Platelet Disorders, Leukemia, Myelodysplastic Syndromes, Thrombocytopenia, Precursor Cell Lymphoblastic Leukemia-Lymphoma
Chemicals: Nucleotides