Computational protocol: Rapid Discovery of De Novo Deleterious Mutations in Cattle Enhances the Value of Livestock as Model Species

Similar protocols

Protocol publication

[…] Paired-end libraries with a 250-bp (for BD1, DR, GEA and NC) or a 400-bp (for BD2, OI, and the BD3 trio) insert size were generated using the TruSeq DNA Sample Prep Kit (Illumina; for DR and GEA) or the NEXTflex PCR-Free DNA Sequencing Kit (Biooscientific; for BD1, BD2, NC, OI, and the BD3 trio). Libraries were sequenced on one lane of Illumina HiSeq. 2000, 2500 or 3000 instruments each using 2 × 101 bp (for BD1, DR, GEA and NC) or 2 × 150 bp paired-end reads (for BD2, OI, and the BD3 trio). Reads were mapped on the UMD3.1 bovine sequence assembly using BWA-MEM. Reads with multiple alignments were removed. PCR duplicates were filtered and variants were called using SAMtools rmdup and pileup options. Only variants with a quality score (QUAL) of > = 30 and a mapping quality (MQ) score of > = 30 were considered with the exception of BD2 for which a less stringent quality threshold (QUAL > = 15) was applied due to the lower genome coverage (9 x) as compared with the other defects. For each genetic defect we assumed the causative mutation is dominant and occurred de novo. We therefore retained only heterozygous polymorphisms that were absent from analogous whole genome sequencing data from 1230 control animals (Supplementary Table ). Analysis of pedigree information, demonstrated that none of the 1230 control animals were descended from any of the first mutant animals for the different conditions studied. Within the genome of each affected animal, heterozygous private variants supported by only one read on each strand were removed to avoid possible artefacts created by reading twice an error present on a fragment displaying an insert size that was too small. SNP and Indels were then annotated using Ensembl VEP. Only private frameshift mutations, in-frame insertions or deletions, stop gain or loss variants, polymorphisms affecting splice donor or splice acceptor sites, and deleterious missense polymorphisms were retained. The reliability of the annotation was verified on the UCSC genome browser (http://genome.ucsc.edu/; accessed 2016/05/10) using the “RefSeq Genes”, “Non-cow RefSeq genes”, “Cow mRNAs from GenBank” and “Cow ESTs that have been spliced” tracks. Polymorphisms predicted to affect a transcript that was not supported by the alignment of bovine mRNAs or orthologous genes were considered as non-coding variants. Finally, to predict the phenotypic consequences of the remaining variants, we made a search for the corresponding genes in the Online Mendelian Inheritance in Man (OMIM; www.omim.org) and Mouse Genome Informatics (MGI; www.informatics.jax.org) databases.The same approach was used to detect heterozygous private deleterious mutations in the genomes of healthy AI sires to anticipate the emergence of novel recessive disorders. For this pilot study we selected 43 bulls used in France (to benefit from the network of the French National Observatory of bovine Anomalies for subsequent verifications; http://www.onab.fr/) and born one to four generations before the animals currently in use (since consanguineous mating between descendants of a common ancestor are more likely to occur after the fifth generation). These 43 animals consisted of 24 Holstein, 11 Montbéliarde and five Normande bulls born in 1996 or after, and three Charolais bulls born in 1993 or after. Two generations of 6.5 or 7 years and two subsequent generations of 2.5 or 3.5 years were considered for dairy and beef breeds, respectively, to account for the reduction of generation interval recently allowed by genomic selection. In a first step, we selected the heterozygous deleterious variants that were found only in one of the 43 genomes and not in the other control genomes (Supplementary Table ). These private deleterious variants consists either (i) in recent mutations that occurred on haplotypes that are also observed in their original form (i.e. without these mutations) in other genomes in the dataset or (ii) in rare (and predominantly ancient) polymorphisms located on haplotypes carried by a single animal in the dataset. To discriminate these two groups of variants, and to reduce as much as possible the proportion of false positive results in our screening for deleterious de novo mutations, we calculated the number of private mutations in a +/− 2.5 Mb interval around each private deleterious mutation in the genome of each carrier animal. We considered arbitrarily that private deleterious variants observed in animals having 10 private mutations or less in the investigated intervals were likely recent mutations that occurred on non-rare haplotypes. Therefore, we retained only these mutations for further studies.Because the lack of large databases for structural variations precluded their efficient filtering, we did not systematically investigate these polymorphisms (which are expected to represent only 1% of the total genomic variation). However, we used Pindel, DELLY and IGV to identify structural variants in the BD3 trio and in the GEA, DR and NC mapping intervals (see below). [...] For BD3, the de novo nature of the candidate mutations was confirmed using whole genome sequencing data of the parents of the affected animal, and PCR and Sanger sequencing. Details on the animals and PCR primers used are presented in Supplementary Tables  and respectively. PCR reactions were performed using the Go-Taq Flexi (Promega) according to the manufacturer’s instructions on a Mastercycler pro thermocycler (Eppendorf). The resulting amplicons were purified and bidirectionally sequenced by Eurofins MWG (Germany) using conventional Sanger sequencing. Polymorphisms were detected with the NovoSNP software.For BD1, BD2 and OI, the de novo nature of the candidate mutations was confirmed using PCR and Sanger sequencing in a panel of affected and unaffected animals, and DNA samples extracted from the semen (for BD2 and OI) and/or the blood (for BD1, BD2 and OI) of the mosaic sires.For all other candidate de novo variants, a panel of animals including cases and related controls with both DNA and Illumina BovineSNP50 Beadchip haplotype data available in the French Genomic Selection DNA bank and database was genotyped by PCR and Sanger sequencing. In a first step, the segregation of the putative de novo alleles and haplotypes of 100 markers (5 Mb) surrounding these mutations were studied among the descendants of putative first mutant animals. Haplotypes carrying these mutations were identified and, in a second step, ancestors of the putative primo-mutants were screened. Identification of carriers of the same identical-by-descent haplotypes but not of the mutant alleles was interpreted as a confirmation of the de novo nature of the mutation. For DR, since no ancestor of the first mutant heifer was available in our data set, we mined the French Genomic Selection database to identify AI bulls which (i) shared the same haplotype as DR animals over the whole DR interval but which (ii) were phenotypically black. Among them, JOCKO BESNE (HOLFRAM005694028588), which presented the longest haplotype (interval Chr3:7,928,589–30,728,145) in common with a DR animal (HOLCHEM120093681213), was selected as a control carrying the same IBD haplotype but without the DR mutation. [...] The RNA sequencing data considered in our analyses were obtained from Dorshorst et al. and are available at the following: 10.1371/journal.pone.0128969.s004. Briefly, RNA was extracted from skin samples from three dominant red and three dominant black animals. cDNA libraries were generated and sequenced as paired-end 50 base pair reads on an Illumina HiSeq. 2000 instrument. Differential expression analyses were performed using DESeq. 2 package. For details see Dorshorst et al.. In this study, a threshold of FDR < 0.05 was used to evaluate significant expression changes between dominant red and dominant black skin samples (Supplementary Data ). Transcript annotation was updated using Ensembl release 88 and information from the UCSC Genome Browser (http://genome.ucsc.edu/; accessed 2017/04/01). We used the “Non-cow RefSeq genes,” “Cow mRNAs from GenBank” and “Cow ESTs that have been spliced” tracks to recover the name of protein-coding genes that are predicted to code “unknown protein” according to Ensembl. Gene set enrichment analyses were carried out with two software packages using different methods and sources of information. To provide a first overview of the over-represented groups of genes, we performed two analyses with Enrichr (http://amp.pharm.mssm.edu/Enrichr/), using the list of genes that are significantly upregulated (FDR < 0.05, fold change >1) and downregulated (FDR < 0.05, fold change <1). We focused on Mammalian Phenotype ontology from Mouse Genome Informatics (MGI Mammalian Phenotype Level 3 and 4) to gain information on the possible phenotypic consequences associated with modification of gene expression in dominant red versus dominant black animals. Then we performed two analyses with Ingenuity Pathway Analysis (http://www.ingenuity.com/products/ipa/) using two list of genes (FDR < 0.05 and fold change >2 or <0.5, respectively) to identify canonical pathways that are markedly upregulated or downregulated. Results of gene set enrichment analyses are presented in Supplementary Tables  and . […]

Pipeline specifications

Software tools BWA, SAMtools, Pindel, DELLY, IGV, novoSNP
Applications WGS analysis, Sanger sequencing