Computational protocol: Role of genetic introgression during the evolution of cultivated rice (Oryza sativa L.)

Protocol publication

[…] The complete genotype dataset for 1529 wild and cultivated rice accessions, consisting of ~8 million SNPs from all 12 rice chromosomes, was downloaded from the Rice Haplotype Map Project database. Accessions with intermediate phenotypes (44) and aromatic rice accessions (5) were discarded (Additional file: Table S1). The reduced dataset was split by group membership into 520 indica, 484 japonica, 30 aus and 446 wild rice accessions (using the bash command cut), and the group SNP matrices were converted into a table of allelic frequencies (using basic operations and the COUNTIF function in LibreOffice Calc). Subsequently, only SNP positions with non-missing data for at least one third of the accessions in each analysed group were retained. A total of 705,124 variable positions passed this filter. The data filtering and analysis pipeline is summarized schematically in Additional file: Figure S1.

To identify the progenitor populations of the cultivated groups, we analysed the distribution among wild accessions of non-shared ancestral variants, by which we mean variants that are common in wild rice and in one cultivated group (allelic frequency ≥ 0.05) but absent or negligible in the remaining two cultivated groups (allelic frequency < 0.01). For each cultivated group, the SNP positions meeting these criteria were extracted from the SNP matrix. Subsequently, each wild accession was assessed for the presence of japonica-specific, indica-specific and aus-specific ancestral variants, and the proportions of sites carrying these variants were calculated. The forty wild accessions with the highest proportions were identified for each cultivated group, yielding three non-overlapping groups of 40 wild accessions that we regard as the progenitor populations of indica, japonica and aus. Geographic distributions of the identified progenitors are shown on maps produced with ArcGIS (Esri).

The robustness of the identified relationships was evaluated by a phylogenetic analysis and a principal component analysis (PCA).
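The site filter and the non-shared ancestral variant criteria described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which used bash utilities and a spreadsheet. The 0/1/None genotype encoding, the function names, the toy data and the choice of denominator (sites with a non-missing call in the wild accession) are all assumptions made for illustration. Only the thresholds (one third, 0.05, 0.01) and the top-40 selection come from the text.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Call = Optional[int]          # assumed encoding: 0 = reference, 1 = variant, None = missing
Site = Dict[str, List[Call]]  # group name -> calls of that group's accessions at one SNP

CULTIVATED = ("indica", "japonica", "aus")


def group_stats(calls: Sequence[Call]) -> Tuple[float, float]:
    """Return (variant allele frequency, fraction of non-missing calls)."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0, 0.0
    return sum(observed) / len(observed), len(observed) / len(calls)


def site_passes_missing_filter(site: Site, min_called: float = 1 / 3) -> bool:
    """Keep a site only if every group has at least one third non-missing calls."""
    return all(group_stats(calls)[1] >= min_called for calls in site.values())


def specific_group(site: Site, common: float = 0.05, rare: float = 0.01) -> Optional[str]:
    """Name the cultivated group for which the variant is 'non-shared ancestral'."""
    freq = {group: group_stats(calls)[0] for group, calls in site.items()}
    if freq["wild"] < common:
        return None
    for target in CULTIVATED:
        others = [g for g in CULTIVATED if g != target]
        if freq[target] >= common and all(freq[g] < rare for g in others):
            return target
    return None


def rank_wild_accessions(sites: List[Site], target: str, n_top: int = 40) -> List[int]:
    """Return indices of the wild accessions carrying the largest proportion
    of target-specific variants (ties broken by accession index)."""
    selected = [s for s in sites
                if site_passes_missing_filter(s) and specific_group(s) == target]
    n_wild = len(sites[0]["wild"]) if sites else 0
    scores = []
    for i in range(n_wild):
        called = [s["wild"][i] for s in selected if s["wild"][i] is not None]
        proportion = sum(called) / len(called) if called else 0.0
        scores.append((proportion, i))
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scores[:n_top]]


# Toy data (hypothetical): four SNP sites, four wild and three accessions per cultivated group.
SITES: List[Site] = [
    {"wild": [1, 0, 0, None], "indica": [1, 1, 0], "japonica": [0, 0, 0], "aus": [0, 0, 0]},
    {"wild": [0, 1, 1, 0], "indica": [0, 0, 0], "japonica": [1, 1, 1], "aus": [0, 0, 0]},
    {"wild": [1, 1, 0, 0], "indica": [1, 0, 0], "japonica": [1, 0, 0], "aus": [0, 0, 0]},
    {"wild": [None, None, None, 1], "indica": [1, 1, 1], "japonica": [0, 0, 0], "aus": [0, 0, 0]},
]

print(rank_wild_accessions(SITES, "japonica", n_top=2))
```

Ranking each group independently does not by itself guarantee that the three sets of 40 accessions are non-overlapping, so a real analysis would need to check this separately.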
All format conversions and data extractions were done using bash command-line utilities. For the phylogenetic reconstruction, a subset of the 705,124-SNP dataset was prepared by selecting 150 accessions (40 wild accessions for each progenitor population, plus ten indica, ten japonica and ten aus accessions; the cultivated accessions with the least missing data were chosen), yielding an alignment with 358,218 variable positions and 30.3% missing data points. A maximum likelihood (ML) tree was computed with RAxML using the GTRCAT model, the new rapid hill-climbing algorithm and 200 non-parametric bootstrap replicates.

In the PCA, all variable characters from the original genome-wide SNP matrix were included, regardless of the per-site proportion of missing data, but rice accessions with > 75% missing data points were excluded. The resulting SNP matrix with 701 individuals and 5,759,207 positions was analysed with smartpca, using the lsqproject option, excluding no outliers and inferring genetic distance from physical distance. […]
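The two accession-level selections described above (excluding accessions with > 75% missing data before the PCA, and choosing the cultivated accessions with the least missing data for the phylogeny) can be sketched as follows. This is an illustrative Python sketch, not the authors' bash workflow. The dictionary layout, the function names and the use of None for a missing call are assumptions.

```python
from typing import Dict, List, Optional, Sequence

Call = Optional[int]  # assumed encoding: None marks a missing genotype call


def missing_fraction(calls: Sequence[Call]) -> float:
    """Fraction of missing calls for one accession (1.0 if it has no sites)."""
    if not calls:
        return 1.0
    return sum(c is None for c in calls) / len(calls)


def drop_sparse_accessions(genotypes: Dict[str, List[Call]],
                           max_missing: float = 0.75) -> Dict[str, List[Call]]:
    """Exclude accessions whose missing fraction exceeds max_missing."""
    return {name: calls for name, calls in genotypes.items()
            if missing_fraction(calls) <= max_missing}


def most_complete(genotypes: Dict[str, List[Call]],
                  names: Sequence[str], n: int = 10) -> List[str]:
    """Pick the n accessions with the least missing data (ties broken by name)."""
    return sorted(names, key=lambda name: (missing_fraction(genotypes[name]), name))[:n]


# Toy data (hypothetical accession names and calls).
GENOTYPES: Dict[str, List[Call]] = {
    "A": [0, 1, None, None],        # 50% missing
    "B": [None, None, None, 1],     # 75% missing, kept because the cut-off is > 75%
    "C": [None, None, None, None],  # 100% missing, dropped
    "D": [0, 0, 0, 0],              # complete
}

kept = drop_sparse_accessions(GENOTYPES)
print(sorted(kept))
print(most_complete(GENOTYPES, list(kept), n=2))
```

The threshold is applied as a strict "greater than" test, so an accession with exactly 75% missing data is retained, which follows the wording of the text.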

Pipeline specifications

Software tools: RAxML, EIGENSOFT
Applications: Phylogenetics, Population genetic analysis
Organisms: Oryza sativa