Computational protocol: Detailed Analysis of Focal Chromosome Arm 1q and 6p Amplifications in Urothelial Carcinoma Reveals Complex Genomic Events on 1q, and SOX4 as a Possible Auxiliary Target on 6p

Protocol publication

[…] Raw data was extracted from the scanned images using Agilent Feature Extraction (Agilent Technologies, Santa Clara, CA, USA). Control probes and probes that did not pass Agilent's default "well above background" condition were removed. The remaining probes were corrected for background signal, and log2 ratios (log2(sample signal / reference signal)) were calculated from the adjusted signal intensities for each array. The log2 ratios were normalized and centered using popLowess. The log2 values of replicate probes were merged to their median value. Segmentation was performed on the normalized log2 ratios for each sample using Circular Binary Segmentation (CBS) (settings: 10,000 permutations; significance level for accepting change-points, α, of 0.01; and a minimum of 5 consecutive probes for calling a segment). Gains and losses were called in regions where the segmentation value exceeded a sample adaptive threshold (SAT). The SAT ranged from 0.15 to 0.59, with a median value of 0.20. Copy number gain frequencies were calculated from the segmented data at the individual probe level by dividing the number of samples in which the probe was observed above the SAT by the number of samples investigated. Average copy number gain amplitudes (log2) were calculated for each probe as the sum of its segmentation values in the samples where it was above the SAT, divided by the number of samples in which it was observed above the SAT. RefSeq gene locations were downloaded from the UCSC genome browser (GRCh37/hg19 assembly). MicroRNA (miRNA) data was obtained from miRBase (Release 18). Copy number variant (CNV) data generated by Conrad et al. was used to account for naturally occurring variation. Gene-specific copy number was measured as the mean segmentation value spanning each RefSeq gene position.
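The per-probe gain frequency and average gain amplitude described above can be illustrated with a minimal sketch. The function name `gain_stats`, the nested-list layout (one row per sample, one column per probe), and the toy values are assumptions made for illustration only; they are not taken from the original analysis.

```python
from typing import List, Optional, Tuple


def gain_stats(
    seg: List[List[float]], sat: List[float]
) -> Tuple[List[float], List[Optional[float]]]:
    """Per-probe gain frequency and mean gain amplitude.

    seg[s][p] is the segmentation value (log2) of probe p in sample s.
    sat[s] is the sample adaptive threshold of sample s.
    """
    n_samples = len(seg)
    n_probes = len(seg[0])
    freqs: List[float] = []
    amps: List[Optional[float]] = []
    for p in range(n_probes):
        # Segmentation values of this probe in the samples where it is above the SAT.
        above = [seg[s][p] for s in range(n_samples) if seg[s][p] > sat[s]]
        # Frequency: samples above the SAT divided by all samples investigated.
        freqs.append(len(above) / n_samples)
        # Amplitude: summed values above the SAT divided by the number of such samples.
        amps.append(sum(above) / len(above) if above else None)
    return freqs, amps


# Hypothetical toy data: 4 samples x 3 probes, each sample with its own SAT.
seg = [
    [0.05, 0.40, 0.90],
    [0.00, 0.30, 0.10],
    [0.10, 0.10, 0.70],
    [0.02, 0.50, 0.05],
]
sat = [0.15, 0.20, 0.59, 0.20]

freqs, amps = gain_stats(seg, sat)
print(freqs)
print(amps)
```

A probe that is never above the SAT is given an amplitude of `None` in this sketch, since the average is undefined in that case.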
The correlation between gene-specific copy number and gene expression levels was determined using Spearman correlation in the 58 samples with matched gene expression, and p-values were FDR corrected to account for multiple testing. To determine whether expression levels differed significantly, the gene expression levels in samples with amplifications were compared to those in the remainder of the 212 samples for which expression data was available using the Mann-Whitney test. Raw and processed data, together with array design and sample annotations, are deposited in the Gene Expression Omnibus (GSE40938). [...] Breakpoints were called at positions where the segmentation shifts exceeded the SAT or occurred above the SAT. Breakpoints were manually curated in selected regions to account for outlier probes. To test for an uneven distribution of chromosomal breaks within the 1q and 6p target regions, the observed breakpoint distribution was compared to that of 10,000 random permutations in 50 kb windows. Significance levels were determined by rank statistics. Data on repetitive genomic features (LINE, SINE, and LTR) was downloaded from the UCSC genome browser RepeatMasker track. Locations of segmental duplications were obtained from the UCSC genome browser (Duplications of >1000 Bases of Non-RepeatMasked Sequence). G4 quadruplex locations were obtained using the Quadparser algorithm, which identifies d(G3N1–7G3N1–7G3N1–7G3) sequence motifs postulated to fold into a quadruplex structure. LINE, SINE, LTR, and G4 sequence element content was measured in 50 kb non-overlapping windows across the genome. To assess the association between element content and breakpoint occurrence, the breakpoint frequency in windows with above-median element content was compared to that in windows with below-median element content. Only regions with array coverage were included, and windows with CNVs were excluded.
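As an illustration of the motif search and window counting described above, the following sketch scans a sequence for the d(G3N1–7G3N1–7G3N1–7G3) pattern with a regular expression and tallies motif starts per non-overlapping window. It is not the Quadparser implementation: the names `find_g4` and `count_per_window`, the use of lazy quantifiers, the restriction to the given strand, and the toy sequence are all assumptions of this sketch.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Motif as written in the text: G3 N1-7 G3 N1-7 G3 N1-7 G3.
# Lazy quantifiers ({1,7}?) keep each loop as short as possible; this is a
# choice made for this sketch, not a documented Quadparser setting.
G4_PATTERN = re.compile(
    r"G{3}[ACGT]{1,7}?G{3}[ACGT]{1,7}?G{3}[ACGT]{1,7}?G{3}"
)


def find_g4(sequence: str) -> List[Tuple[int, int]]:
    """Return 0-based (start, end) spans of non-overlapping motif matches.

    Only the given strand is scanned; C-rich motifs on the opposite strand
    are not reported here.
    """
    return [m.span() for m in G4_PATTERN.finditer(sequence.upper())]


def count_per_window(starts: Iterable[int], window: int = 50_000) -> Dict[int, int]:
    """Count motif start positions per non-overlapping window (keyed by window index)."""
    return dict(Counter(start // window for start in starts))


# Hypothetical toy sequence containing two motifs separated by a spacer.
toy = "TTT" + "GGGAGGGTGGGCGGG" + "ACACAC" + "GGGTTAGGGTTAGGGTTAGGG" + "TT"
spans = find_g4(toy)
print(spans)
print(count_per_window(start for start, _ in spans))
```

The same window counter could be applied to LINE, SINE, or LTR start coordinates, after which windows can be split at the median count for the comparison of breakpoint frequencies.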
Fisher's exact test was used to assess the significance of repetitive sequence enrichment in the 1q and 6p amplicon peak regions. […]
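For readers unfamiliar with the test, a two-sided Fisher's exact test on a 2x2 table can be sketched with the standard library alone. The function name `fisher_exact_two_sided` and the example counts are hypothetical; the layout of the table (for instance, peak-region windows versus other windows, split by high versus low repeat content) is an assumption and not the contingency table used in the study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    The p-value is the total hypergeometric probability of all tables with
    the same margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts, for illustration only:
#                  high repeat content   low repeat content
# peak region               8                     2
# other windows             1                     5
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(p_value)
```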

Pipeline specifications

Software tools: Agilent Feature Extraction, RepeatMasker
Databases: UCSC Genome Browser
Application: Genome data visualization
Diseases: Carcinoma