Protocol publication

[…] Except where mentioned, data analysis was performed using the R/Bioconductor environment. Source code for the analysis is available in a GitHub repository. The archived version of the code at the time of publication can be accessed through Zenodo (mislabeled.samples.identification, doi: 10.5281/zenodo.60313).

We identified datasets containing sex information as experimental factors by searching the Gemma database. Out of an initial 121 datasets, we focused on 79 studies run on the Affymetrix HG-U133Plus_2 and HG-U133A platforms (GEO platform identifiers GPL570 and GPL96, respectively), as they have the same sex marker genes. The annotations in Gemma, which originate from GEO sample descriptions augmented with manual annotation, were re-checked against GEO, resulting in the correction of errors for 14 samples. Datasets that contained samples of only one sex, represented data from sex-specific tissues (e.g. ovary or testicle) or contained numerous missing values were excluded (nine datasets). A final set of 70 studies (a total of 4160 samples) met the criteria. A summary of the included data and full details of each study are provided in the original publication. Whenever possible, data were reanalyzed from .CEL files. The signals were summarized using the RMA method from the Affymetrix “power tools”, log2 transformed and quantile normalized as part of the general Gemma pre-processing pipeline.

Probeset selection: The male-specific genes KDM5D and RPS4Y1 are each represented by a single probeset on both platforms included in our analysis. XIST is represented by two probesets on the GPL96 platform and by seven probesets on the GPL570 platform. With the exception of the 221728_x_at probeset, XIST probesets were highly correlated with each other, and negatively correlated with KDM5D and RPS4Y1 expression, in all of the datasets analyzed. The poorly performing XIST probeset (221728_x_at) was excluded from further analysis. The final set comprised four probesets for GPL96 and eight probesets for GPL570.
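The probeset screen described above can be illustrated with a small sketch. The study itself used R; the version below is a Python stand-in using only the standard library, and it is not the authors' code. The probeset labels, the expression values and the `cutoff` of 0.5 are all assumptions made for illustration, since the text reports only that the retained probesets were "highly correlated" and gives no threshold.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def screen_xist_probesets(xist, male, cutoff=0.5):
    """Keep XIST probesets that agree with the other XIST probesets
    and run opposite to the male-specific markers.

    xist, male: dict of probeset label -> log2 values (same sample order).
    cutoff: hypothetical threshold, not a value reported by the study.
    """
    kept = []
    for label, values in xist.items():
        # Median correlation with the remaining XIST probesets.
        r_xist = median(pearson(values, v) for k, v in xist.items() if k != label)
        # Median correlation with the male-specific probesets.
        r_male = median(pearson(values, v) for v in male.values())
        if r_xist >= cutoff and r_male <= -cutoff:
            kept.append(label)
    return kept


# Toy data: three female samples followed by three male samples.
# Labels and values are placeholders, not real probeset IDs or measurements.
xist = {
    "XIST_a": [10.1, 9.8, 10.3, 4.2, 4.0, 4.5],
    "XIST_b": [9.5, 9.9, 9.7, 3.8, 4.1, 3.9],
    "XIST_c": [11.0, 10.6, 10.9, 5.1, 5.0, 5.3],
    "XIST_noisy": [6.0, 5.1, 5.9, 5.8, 5.2, 6.1],
}
male = {
    "KDM5D": [3.9, 4.1, 4.0, 9.2, 9.5, 9.0],
    "RPS4Y1": [4.5, 4.3, 4.6, 11.2, 11.0, 11.4],
}
kept = screen_xist_probesets(xist, male)
print(kept)
```

Using the median rather than the mean of the pairwise correlations keeps one poorly performing probeset from dragging down the score of the well-behaved ones; that choice is a design decision of this sketch, not something stated in the text.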
Assigning gene-based (biological) sex to samples: The expression data for the selected sex markers were extracted from the normalized data for each dataset. For each of these small expression matrices, we applied standard k-means clustering (using the “kmeans” function from the “stats” package in R) to classify the samples into two clusters. We labeled the two clusters “male” or “female” based on the centroid values of each of the probesets: specifically, the cluster with higher XIST probeset centroids and lower KDM5D and RPS4Y1 centroids was labeled “female”. To identify samples with ambiguous sex, we calculated the difference between the median expression level of the XIST probesets and the median expression level of the KDM5D and RPS4Y1 probesets. We compared this difference with the cluster-based sex assignment and verified that the difference was positive for samples assigned as female and negative for samples assigned as male. We excluded 34 samples that showed disagreement in this comparison, since they could not provide a conclusive result for the gene-expression-based sex. We note that 12 (35%) of these would have been assigned to a cluster contradicting their annotated sex if we had retained them.

Manual validation of the discrepancy between the gene-based sex and the meta-data-based sex: For all cases where a discrepancy was found between the gene-expression-based sex and the meta-data-based sex, we manually examined the original studies to check whether the mismatch was due to incorrect annotation of the sample during the data upload to GEO, or was present in the original paper. Since most of the manuscripts only contain summary statistics of the demographic data (13/32), direct sample-by-sample validation was not possible for most studies.
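The sex-assignment procedure described under "Assigning gene-based (biological) sex to samples" (two-cluster k-means, cluster labeling from the centroids, and the marker-difference check) can be sketched as follows. This is a Python stand-in using only the standard library, not the authors' R code. It uses plain Lloyd's iterations rather than the algorithm behind R's `kmeans`, it labels the clusters by a single centroid difference (XIST minus male markers) as a simplification of the two centroid criteria in the text, and all sample names and values are invented for illustration.

```python
import random
from statistics import median


def kmeans_two_clusters(rows, n_iter=100, seed=0):
    """Lloyd's k-means with k=2; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, 2)]
    labels = None
    for _ in range(n_iter):
        # Assign every sample to the nearest centroid (squared Euclidean distance).
        new_labels = [
            min((0, 1), key=lambda k: sum((v - c) ** 2 for v, c in zip(row, centroids[k])))
            for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # keep the previous centroid if a cluster empties
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def marker_difference(row, xist_cols, male_cols):
    """Median XIST expression minus median KDM5D/RPS4Y1 expression."""
    return median(row[i] for i in xist_cols) - median(row[i] for i in male_cols)


def reconcile(cluster_sex, diff):
    """Keep the cluster call only if the sign of the difference agrees with it."""
    agrees = diff > 0 if cluster_sex == "female" else diff < 0
    return cluster_sex if agrees else "ambiguous"


def assign_sex(samples, xist_cols, male_cols):
    """samples: dict of sample label -> log2 values, one per probeset column."""
    ids = list(samples)
    rows = [samples[s] for s in ids]
    labels, centroids = kmeans_two_clusters(rows)
    # Simplification: the "female" cluster is the one whose centroid has the
    # larger XIST-minus-male-marker difference.
    scores = [marker_difference(c, xist_cols, male_cols) for c in centroids]
    female_cluster = 0 if scores[0] > scores[1] else 1
    calls = {}
    for sample_id, row, lab in zip(ids, rows, labels):
        cluster_sex = "female" if lab == female_cluster else "male"
        calls[sample_id] = reconcile(cluster_sex, marker_difference(row, xist_cols, male_cols))
    return calls


# Toy matrix; columns are [XIST_1, XIST_2, KDM5D, RPS4Y1]. Values are invented.
samples = {
    "F1": [10.1, 9.5, 3.9, 4.5],
    "F2": [9.8, 9.9, 4.1, 4.3],
    "F3": [10.3, 9.7, 4.0, 4.6],
    "M1": [4.2, 3.8, 9.2, 11.2],
    "M2": [4.0, 4.1, 9.5, 11.0],
    "M3": [4.5, 3.9, 9.0, 11.4],
}
calls = assign_sex(samples, xist_cols=[0, 1], male_cols=[2, 3])
print(calls)
```

A sample whose cluster call and marker difference disagree comes back as "ambiguous", which corresponds to the samples that the text says were excluded.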
For studies that reported only summary statistics, we used the highest-resolution group summary statistics provided in the publication to validate that the data in the paper corroborate the data in GEO. In addition, for all of the datasets with mismatched samples, we manually evaluated the expression values of the relevant probesets using the GEO2R tool on the GEO website.

Confidence interval estimate for population proportion of studies with misannotated samples: We used the properties of the binomial distribution to compute the confidence interval for the population estimate of affected datasets using the “qbinom” function in R.

Analysis of Stanley Foundation datasets: CEL files and sample metadata were downloaded directly from the Stanley Medical Research Institute genomic database. CEL files were pre-processed, quantile normalized and log2 transformed using the “rma” function from the “affy” package in R/Bioconductor. […]
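One plausible reading of the confidence-interval step is sketched below in Python (standard library only) as a stand-in for R's `qbinom`. The text does not say exactly how `qbinom` was called, so taking the 2.5% and 97.5% binomial quantiles at the observed proportion and dividing by the number of studies is an assumption of this sketch. The counts in the example are made up and are not results from the study.

```python
from math import comb


def binom_quantile(p, n, prob):
    """Smallest k with P(X <= k) >= p for X ~ Binomial(n, prob).

    This follows the definition of the binomial quantile function,
    which is what R's qbinom returns.
    """
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * prob**k * (1 - prob) ** (n - k)
        if cumulative >= p:
            return k
    return n  # guard against floating-point shortfall when p is 1.0


def proportion_interval(affected, total, level=0.95):
    """Quantile-based interval for a proportion (assumed usage, see lead-in)."""
    p_hat = affected / total
    alpha = 1 - level
    lower = binom_quantile(alpha / 2, total, p_hat) / total
    upper = binom_quantile(1 - alpha / 2, total, p_hat) / total
    return p_hat, lower, upper


# Made-up counts for illustration: 20 affected datasets out of 50.
p_hat, lower, upper = proportion_interval(20, 50)
print(f"estimate={p_hat:.2f}, interval=({lower:.2f}, {upper:.2f})")
```

Other interval constructions (for example an exact Clopper-Pearson interval) would give slightly different bounds; the sketch is meant only to show the quantile-based idea.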

Pipeline specifications

Software tools: GEO2R, affy
Application: Transcription analysis