Computational protocol: Comprehensive genomic analysis of Oesophageal Squamous Cell Carcinoma reveals clinical relevance

Protocol publication

[…] We collected FASTQ or somatic mutation data for 492 paired ESCC samples from seven publications (Supplementary Table), comprising 41 whole-genome sequences and 451 whole-exome sequences. Clinical data were also acquired (Supplementary Table). Of the 492 ESCC cases, the FASTQ data of 323 cases from 4 publications were re-analysed with our standard pipeline, and somatic mutations from the remaining publications were combined for further analysis. To improve the accuracy and comparability of the data, we eliminated one hyper-mutant sample and filtered false-positive mutations using our own panel-of-normals datasets and the Exome Aggregation Consortium (ExAC) database. Finally, 490 cases were used for further analysis.

The FASTQ data from the 323 cases were processed according to the following pipeline. Sequencing adaptors and low-quality reads with more than five unknown bases were removed. The remaining high-quality reads were aligned to the NCBI human reference genome (hg19) using BWA. Picard (http://broadinstitute.github.io/picard/) was used to mark duplicates, and the Genome Analysis Toolkit (v.1.0.6076, GATK IndelRealigner) was used to improve the accuracy of the genome alignment. Somatic point mutations were detected using MuTect. Somatic indels were detected with GATK Somatic Indel Detector. These somatic variants, combined with the somatic mutations from the remaining publications, were annotated with Oncotator.

To further enhance the accuracy of the somatic mutations, we filtered false-positive mutations using a threshold of greater than 5% mutation frequency in normal samples, according to our panel of normal BAMs, and a threshold of greater than 1% in the ExAC database.

Copy number alterations (CNAs) were first detected with SegSeq for 31 WGS samples and with GATK4 Alpha for 283 WES samples. GISTIC2.0 was run to identify significantly amplified or deleted genomic regions. Hierarchical clustering was used to identify sample subtypes.
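
The false-positive filter described above (more than 5% frequency in the panel of normals, more than 1% in ExAC) can be illustrated with a minimal Python sketch. The record layout, the field names `pon_freq` and `exac_af`, and the toy values are hypothetical and are not taken from the original pipeline. The sketch also assumes that exceeding either threshold is enough to discard a call, which the text does not state explicitly.

```python
from dataclasses import dataclass

# Thresholds stated in the protocol text.
PON_MAX_FREQ = 0.05   # fraction of panel-of-normal samples showing the variant
EXAC_MAX_FREQ = 0.01  # frequency reported in ExAC


@dataclass
class Mutation:
    # Hypothetical record layout; field names are illustrative only.
    gene: str
    pon_freq: float
    exac_af: float


def is_likely_false_positive(m: Mutation) -> bool:
    """Flag a call that exceeds either threshold (assumed to be an OR rule)."""
    return m.pon_freq > PON_MAX_FREQ or m.exac_af > EXAC_MAX_FREQ


def filter_mutations(mutations):
    """Keep only the calls that pass both frequency checks."""
    return [m for m in mutations if not is_likely_false_positive(m)]


# Toy input, not real data.
calls = [
    Mutation("GENE_A", pon_freq=0.00, exac_af=0.0000),  # passes both checks
    Mutation("GENE_B", pon_freq=0.12, exac_af=0.0003),  # recurrent in normals
    Mutation("GENE_C", pon_freq=0.01, exac_af=0.0200),  # common in ExAC
]
kept = filter_mutations(calls)
print([m.gene for m in kept])  # prints ['GENE_A']
```

In practice the two frequencies would have to be computed or looked up per variant before a filter like this could run; that step is outside the scope of the sketch.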
Student’s two-sided t-test was used to select significantly differential CNAs between subtype3 and subtype2. P values were adjusted using the R function ‘p.adjust’, and q < 0.001 was defined as statistically significant. We employed CIViC to identify CNA genes associated with response to a targeted therapy. […]
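
The multiple-testing adjustment and the q < 0.001 cut-off can be sketched in Python as follows. The text does not say which adjustment method was used, so the Benjamini-Hochberg procedure here is an assumption, chosen because the result is reported as a q value. The function name `bh_adjust` and the toy p values are hypothetical, and the t-test that would produce the p values is not shown.

```python
Q_THRESHOLD = 0.001  # significance cut-off stated in the protocol text


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order.

    Assumption: BH is used here for illustration; the source does not
    name the adjustment method.
    """
    n = len(pvalues)
    # Visit p values from largest to smallest so that monotonicity can be
    # enforced with a running minimum.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p values, one per CNA region (not real data).
raw_p = [1e-6, 0.2, 2e-4, 0.03]
q_values = bh_adjust(raw_p)
significant = [i for i, q in enumerate(q_values) if q < Q_THRESHOLD]
print(significant)  # prints [0, 2]
```

With four tests, the smallest p value is multiplied by 4 and the second smallest by 2, so regions 0 and 2 remain below the cut-off while the other two do not.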

Pipeline specifications

Software tools: BWA, Picard, GATK, MuTect, Oncotator, SegSeq, GISTIC
Applications: WGS analysis, WES analysis, Nucleotide sequence alignment
Organisms: Homo sapiens