Computational protocol: A method for the allocation of sequencing resources in genotyped livestock populations

Protocol publication

[…] We present two algorithms. The first selects ‘focal individuals’ whose genomes collectively represent the maximum possible portion of the haplotype diversity in the population. A focal individual is one that shares a large number of its own haplotypes with a large number of individuals in the population. This individual does not need to be a key ancestor, and may have no offspring. The second allocates a fixed sequencing budget across focal families (i.e. a focal individual, its parents and grandparents) to enable phasing of the most frequent haplotypes in the population at the sequence level. The aim of both algorithms is that the final haplotypes, when phased at the sequence level, represent the maximum possible portion of the haplotype diversity in the population that can be sequenced and phased for a fixed sequencing budget. We implemented our method in a software package called AlphaSeqOpt. Throughout the rest of the paper, AlphaSeqOpt is used when referring to our method. An outline of each algorithm is given below.

[...]

To help phase and resolve the haplotypes of focal individuals, some of the fixed sequencing budget should be allocated to an additional group of individuals that share their haplotypes. In populations with pedigree, the best additional group of individuals is likely to be the parents and grandparents of focal individuals. The advantage of sequencing parents and grandparents is that simple inheritance-based rules can be developed to phase the sequenced haplotypes of focal individuals.
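As an illustration of the focal family described above (a focal individual, its parents and its grandparents), the following minimal sketch collects those individuals from a pedigree. The pedigree format (a dictionary mapping each individual to a `(sire, dam)` pair, with `None` for an unknown parent), the function name `focal_family` and the example identifiers are assumptions made for this illustration; they are not taken from AlphaSeqOpt.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical pedigree format: individual -> (sire, dam); None marks an unknown parent.
Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]


def focal_family(pedigree: Pedigree, focal: str) -> List[str]:
    """Return the focal individual followed by its known parents and grandparents."""
    family = [focal]
    parents = [p for p in pedigree.get(focal, (None, None)) if p is not None]
    family.extend(parents)
    for parent in parents:
        for grandparent in pedigree.get(parent, (None, None)):
            # Skip unknown ancestors and avoid listing a shared ancestor twice.
            if grandparent is not None and grandparent not in family:
                family.append(grandparent)
    return family


# Toy pedigree (illustrative identifiers only).
pedigree = {
    "calf": ("sire", "dam"),
    "sire": ("pgs", "pgd"),
    "dam": ("pgs", None),  # same grandsire on both sides; maternal granddam unknown
}
print(focal_family(pedigree, "calf"))
# ['calf', 'sire', 'dam', 'pgs', 'pgd']
```

Individuals without a pedigree record simply yield a family containing only themselves, which mirrors the case of a focal individual whose relatives cannot be sequenced.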
To maximise the accuracy of phasing the most frequent haplotypes in a population, a larger proportion of the fixed sequencing budget could be allocated to the families of focal individuals whose haplotypes are more frequent in the population; this is addressed by algorithm 2.

The inputs for algorithm 2 are:
- The k focal individuals and the proportion of population haplotypes that each one carries.
- The accuracies of phasing each member of a focal family given a chosen ‘sequencing scenario’, i.e. the selected sequencing coverage for each member of a focal family. Expected phasing accuracies for a sequencing scenario may be calculated using algorithms such as that implemented in AlphaFamSeq (Battagin and Hickey, unpublished), which has been developed by our group specifically for family-based phasing of haplotypes at the sequence level (for a brief description, see Additional file : Figure S1).
- Population pedigree.
- The cost of preparing and sequencing a DNA library at any coverage.
- The total fixed sequencing budget (in monetary terms).
- Information on any historically available sequence data.

The allocation of a fixed sequencing budget in algorithm 2 is addressed using a differential evolution algorithm [] that samples and evaluates different combinations of sequencing scenarios within and across the focal families. Algorithm 2 can be run for any number of rounds until convergence is reached or until no further improvements are made. An outline of algorithm 2 is given below.

1. For each member of a focal family, sample a sequencing coverage and determine the sequencing scenario and haplotype phasing accuracy. Sampling of sequencing coverages is performed based on the multinomial probabilities of sequencing an individual at a defined coverage. The probabilities are obtained by logit transforms of the internal problem representation in the differential evolution algorithm []. Haplotype phasing accuracies given a sequencing scenario were calculated using AlphaFamSeq (Battagin and Hickey, unpublished) (see Additional file : Figure S1 for more information).
2. Calculate the overall cost of the selected set of sequencing scenarios across all focal families, taking into account pre-existing DNA libraries and/or sequence data.
3. Compute a ‘goodness criterion’ for this combination of sequencing scenarios and associated cost. The criterion takes into account:
   - The proportion of sequenced population haplotypes that would be phased at the sequence level.
   - The accuracy of phasing the haplotypes of focal individuals given the sampled sequencing scenarios [AlphaFamSeq (Battagin and Hickey, unpublished)].
   - The fixed sequencing budget. If the total cost is above the budget, then this combination of sequencing scenarios is penalised.
   - Any historically available sequence data.
4. Repeat steps 1 to 3 n times (n is the number of rounds).

The final result is an ordered list of focal individuals that carry the highest proportion of the most frequent haplotypes in the population and the sequencing coverage for each focal individual, its parents and grandparents. For further details on both algorithms, see Additional file .

[...]

Sequence data was generated for 1000 base haplotypes for each of ten chromosomes using the Markovian Coalescent Simulator [] and AlphaSim [, ]. Chromosomes were simulated as 100 centiMorgans (cM) and 10^8 bp in length, with a per site mutation rate of 2.5 × 10^-8 and a per site recombination rate of 1.0 × 10^-8. The effective population size (Ne) was set at specific points during the simulation based on previously estimated Ne values within the Holstein cattle population []. These set points were: 100 in the base generation, 1256 at 1000 years ago, 4350 at 10,000 years ago, and 43,500 at 100,000 years ago, with linear changes in between. The resulting sequence had approximately 650,000 segregating sites across the ten chromosomes.

[...]

To emulate the recent history of modern livestock breeding, ten replicates of five pedigrees were simulated. Pedigrees were 5, 10, 15, 30 or 50 generations for populations 1 to 5, respectively. All pedigrees and replicates were independently simulated and had the following general structure. Each generation comprised 1000 individuals with equal sex ratio, i.e. 500 males and 500 females. In the first generation, chromosomes for each individual were sampled from the 1000 haplotypes in the base generation. In subsequent generations, chromosomes of each individual were sampled from parental chromosomes, assuming recombination with no interference. In each generation, the 25 males with the highest true breeding value (TBV) were selected as sires of the next generation.
No selection was performed on females, and all 500 females were used as parents. The sixth population was simulated to obtain an unrelated population of 100,000 individuals directly from base haplotypes, i.e. individuals were nominally unrelated. This population was simulated to test the performance of the algorithm in extreme circumstances. Circumstances such as these may not typically arise in livestock breeding but could arise in human or other natural populations or in gene bank collections, which are especially topical in plant breeding (e.g. the Seeds of Discovery project, CIMMYT: http://seedsofdiscovery.org/). We assumed that all individuals had genotypes for 10,000 single nucleotide polymorphisms (SNPs) distributed equally across the ten chromosomes, i.e. 1000 SNPs per chromosome. Genotypes of all individuals were phased using AlphaPhase [–], and the phased genotypes were used as input. We also performed the same analysis with a pedigree from a real livestock breeding program. Because the results showed the same trends as those for the simulated data, they are not presented in the interests of brevity.

[...]

In each population, the parameters for selecting the focal individuals were:
- Population haplotype libraries were created using individuals and SNPs with at least 90% phased genotype data.
- Sharing of haplotypes was determined as 100% identity matches. A 100% identity match was chosen to overcome phasing errors and to ensure that haplotypes with small differences are considered as independent haplotypes.
- Haplotype lengths were set to 250 SNPs per chromosome (see Additional file for haplotype length choice).

For each population, we calculated the frequency of all haplotypes of the top 50 and 200 focal individuals selected by algorithm 1 in the population. For populations 1 to 5, we compared this to the frequency of all haplotypes of the top 50 and 200 focal individuals selected by the key ancestors approach and two haplotype-based approaches. We implemented the key ancestors approach using PEDIG [], which selects focal individuals that cumulatively have the largest pedigree-inferred marginal contributions. The two haplotype-based approaches were those by Bickhart et al. [] and Gusev et al. []. The algorithm of Bickhart et al. [] attempts to select the least redundant set of individuals that represent the population haplotypes by scoring haplotypes based on their population frequency, whereas the algorithm of Gusev et al. [] attempts to select individuals that share a large proportion of the population haplotypes identical by descent (IBD) with other individuals. […]
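The comparison described above rests on one quantity: the population frequency of the haplotypes carried by the top-ranked focal individuals. A minimal sketch of that calculation for a single haplotype window is shown here. It assumes each individual's two phased haplotypes are given as string identifiers and that sharing means exact identity; the function name `top_k_coverage`, the data layout and the toy rankings are hypothetical and do not come from AlphaSeqOpt, PEDIG or the other tools named.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def top_k_coverage(haplotypes: Dict[str, Tuple[str, ...]],
                   ranked: Sequence[str],
                   ks: Sequence[int]) -> Dict[int, float]:
    """For each k, return the fraction of all haplotype copies in the population
    whose haplotype is carried by at least one of the first k ranked individuals."""
    # Count the copies of every distinct haplotype across the whole population.
    counts = Counter(h for haps in haplotypes.values() for h in haps)
    total = sum(counts.values())
    result = {}
    for k in ks:
        # Distinct haplotypes carried by the top-k individuals.
        carried = {h for ind in ranked[:k] for h in haplotypes[ind]}
        result[k] = sum(counts[h] for h in carried) / total
    return result


# Toy population: four individuals, two haplotypes each (illustrative only).
haplotypes = {
    "A": ("h1", "h2"),
    "B": ("h1", "h3"),
    "C": ("h2", "h2"),
    "D": ("h4", "h1"),
}
ranking_x = ["A", "B", "C", "D"]  # hypothetical output of one selection method
ranking_y = ["D", "B", "A", "C"]  # hypothetical output of another method

print(top_k_coverage(haplotypes, ranking_x, ks=[1, 2]))  # {1: 0.75, 2: 0.875}
print(top_k_coverage(haplotypes, ranking_y, ks=[1, 2]))  # {1: 0.5, 2: 0.625}
```

In a real analysis the same calculation would be repeated across haplotype windows and chromosomes, with k set to 50 and 200 as in the text.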

Pipeline specifications