Computational protocol: Mapping Global Potential Risk of Mango Sudden Decline Disease Caused by Ceratocystis fimbriata

Protocol publication

[…] We collected MSD occurrence data from all countries where the disease currently occurs: Brazil, Oman, and Pakistan. Occurrence points covering all regions within these countries were collected. The data points for Brazil and Oman were collected in the field during a study on the phylogenetics of C. fimbriata. These points correspond to locations where mango trees showed symptoms that indicate MSD: branch death, wilting foliage, bark discoloration, small holes in the bark, or sap exudation. At these locations, samples of discolored xylem (a characteristic of an infected tree) were collected from symptomatic mango trees in plantations, small farms, gardens, and along streets and roads to confirm the presence of C. fimbriata. Samples were taken only at locations where the landowner had approved sampling in advance. No specific permissions were required in these countries because the species involved are of agronomic interest and are not endangered or protected.

A total of 219 sites in Brazil, Oman, and Pakistan were confirmed for presence of the pathogen. Because some of these sites were sampled more than once over the sampling period, we removed repeated occurrences, which left 80 unique points from Brazil and Oman. For Pakistan, MSD presence data were collected from published papers that provided the coordinates of the diseased trees. All taxonomic issues for the species were considered, and only records that we were sure represented MSD caused by C. fimbriata or a synonym were retained. In total, 94 unique occurrence records were collected from the three countries where the disease is currently known to occur in mango trees. These records were reduced to 54 after spatial filtering with spThin, an R package (version 3.1.0), to reduce spatial autocorrelation.
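As a rough illustration of the spatial filtering step, the sketch below thins a set of coordinates so that no two retained points are closer than a minimum distance, repeats the randomized procedure several times, and keeps the largest result. It is written in Python rather than R and is not spThin's own code. The function names, the number of repetitions, and the sample coordinates are hypothetical; the 10 km default mirrors the separation reported for the filtered records.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def thin_once(points, min_km, rng):
    """One randomized pass: keep dropping a point with the most close neighbors."""
    kept = list(points)
    while True:
        # For each kept point, count the other points closer than min_km.
        counts = [
            sum(1 for j, q in enumerate(kept)
                if j != i and haversine_km(p, q) < min_km)
            for i, p in enumerate(kept)
        ]
        worst = max(counts, default=0)
        if worst == 0:
            return kept  # every remaining pair is at least min_km apart
        crowded = [i for i, c in enumerate(counts) if c == worst]
        kept.pop(rng.choice(crowded))


def thin(points, min_km=10.0, reps=20, seed=0):
    """Repeat the randomized pass and return the run that keeps the most points."""
    rng = random.Random(seed)
    runs = [thin_once(points, min_km, rng) for _ in range(reps)]
    return max(runs, key=len)


# Hypothetical coordinates (lat, lon): a tight cluster of three, one isolated
# point, and a close pair far away.
points = [(-9.40, -40.50), (-9.41, -40.51), (-9.45, -40.55),
          (-9.90, -40.50), (23.60, 58.50), (23.61, 58.51)]
thinned = thin(points)
```

Repeating the pass matters because the random tie-breaking can leave different numbers of survivors on real data, so the largest run is the one worth keeping.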
spThin was chosen because it retains as many locations as possible and tends to perform better than other methods for reducing spatial autocorrelation. spThin evaluates combinations of points that satisfy a minimum separation distance; from the resulting candidate datasets, the one that keeps the largest number of records is selected for use in the ecological niche model (ENM). Filtered occurrence points were >10 km apart. This distance ensured that each cell could contain only one occurrence point, since the model used climatic data at ~5-km spatial resolution.

[...]

The correlative maximum entropy model, MaxEnt (version 3.3.3k), was used to assess the global potential distribution of MSD. MaxEnt is a machine learning method that estimates the probability distribution of maximum entropy for a species, constrained by the sample data and based on multiple environmental variables in a high-dimensional dataset. MaxEnt was chosen because it uses species presence and background data (absence data are not needed) and works well with small sample sizes. It estimates environmental suitability for a species from presence records and randomly generated background points by finding the maximum entropy distribution and its geographical projection. It produces a suitability index that ranges from 0 (unsuitable) to 1 (most suitable). A total of 50,000 background points were randomly selected from areas where C. fimbriata currently occurs. This number was chosen because it is more appropriate when working at a global scale.

Sampling bias was suspected because the data were collected near roads and in more accessible areas, and from sources where we could not control the sampling process. A bias surface was therefore generated from a kernel density estimate using SDMtoolbox. The bias surface is a raster in which cells with lower values represent places with lower bias.
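The idea of a kernel-density bias surface can be sketched in a few lines. SDMtoolbox runs inside ArcGIS, so the following is not its implementation: it is a minimal Python sketch that assumes a Gaussian kernel, projected (x, y) coordinates, and a hand-picked bandwidth and grid, all of which are hypothetical. Each cell receives the summed kernel weight of the occurrence points, rescaled so the most heavily sampled cell equals 1.

```python
import math


def kde_bias_surface(points, x_centers, y_centers, bandwidth):
    """Gaussian kernel density of occurrence points at each grid cell center.

    points     -- list of (x, y) tuples in the same units as bandwidth
    x_centers  -- cell-center x coordinates (columns)
    y_centers  -- cell-center y coordinates (rows)
    Returns a list of rows, rescaled so that the largest value is 1.
    """
    surface = []
    for y in y_centers:
        row = []
        for x in x_centers:
            density = sum(
                math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * bandwidth ** 2))
                for px, py in points
            )
            row.append(density)
        surface.append(row)
    peak = max(max(row) for row in surface)
    return [[value / peak for value in row] for row in surface]


# Hypothetical example: two nearby records and one isolated record on a 10 x 10 grid.
occurrences = [(2.0, 2.0), (2.5, 2.0), (8.0, 8.0)]
centers = [i + 0.5 for i in range(10)]
surface = kde_bias_surface(occurrences, centers, centers, bandwidth=1.5)
```

Cells near the two clustered records get the highest values, cells near the isolated record get intermediate values, and cells far from any record approach zero, which matches the reading that lower values mean lower sampling bias.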
The bias surface was used to account for sampling intensity and potential sampling bias.

Different MaxEnt settings were tested to find an optimal model for the potential distribution of MSD, since default settings are not always the best. These adjustments consisted of different combinations of regularization multiplier (RM) and feature types, which generated many different models. The RM controls the number of parameters and consequently the model complexity. The RM values used were 1.0, 1.5, and 2.0. An RM value <1 generates very restricted models (not desired for world predictions), whereas values >1 result in simpler models with a broader potential distribution. These values were combined with different sets of MaxEnt features (i.e., linear [L], quadratic [Q], product [P], threshold [T], and hinge [H]). The ‘fade-by-clamping’ option was used to prevent extrapolations outside the environmental range of the training data.

The percent contribution, permutation importance, and ‘jackknife’ (leave-one-out) technique in MaxEnt were used to estimate the predictive power of the environmental predictors. The percent contribution estimates the contribution of a variable to the model, and the permutation importance indicates how much the model depends on that variable. The jackknife procedure in MaxEnt was used to assess the importance of each variable over 10-fold cross-validation. It evaluates models in two situations: using only the variable by itself, and using all other variables excluding the one in question. The results are the training gain and the area under the curve (AUC) for each environmental variable in each situation. MaxEnt also generated response curves, which show how the predicted probability of presence of the disease varies across the range of each environmental variable.
These curves were analyzed, and models showing complex (highly irregular) curves were not considered for further evaluation; models that included predictors with such erratic curves were excluded because they are considered biologically unrealistic. We considered curves complex when they showed highly jagged or multimodal behavior, which does not normally occur in species’ responses to environmental variables. Only thirteen models were considered for further evaluation.

The metrics used to rank model performance were AUCcv (the area under the receiver operating characteristic [ROC] curve, calculated by cross-validation) and the test sensitivity (i.e., the percentage of correctly predicted presences) at 0% and 10% training omission rates (OR). OR was used in addition to AUCcv because AUCcv alone is not the best basis for choosing between models when predicting the invasive potential of a species. The problem is that AUCcv gives the same weight to sensitivity and specificity, whereas sensitivity should receive more attention when predicting invasive potential. Thresholds at 0% and 10% training OR mean that zero and ten percent, respectively, of training presence locations for MSD fall outside the predicted suitable area. We ran a 10-fold cross-validation in MaxEnt to calculate AUCcv and OR. AUCcv measures the ability of the model to discriminate presence from background. An AUCcv value of 0.5 shows that model predictions are no better than random; values below 0.5 are worse than random; values of 0.5–0.7 indicate poor performance; values of 0.7–0.9 indicate reasonable or moderate performance; and values above 0.9 indicate high performance. For OR, the expected test omission rate is 0 at the 0% training OR threshold and 0.10 at the 10% training OR threshold; test ORs higher than expected indicate poor model performance.
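To make the two evaluation metrics concrete, the sketch below computes an AUC from presence and background scores and a test omission rate at thresholds derived from the training presences. MaxEnt reports these quantities itself during cross-validation; this is only a minimal single-fold Python illustration, and the function names and all suitability scores are hypothetical.

```python
import math


def auc(presence_scores, background_scores):
    """Probability that a random presence outscores a random background point.

    Ties count as half a win (the rank-based form of the ROC area).
    """
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


def training_threshold(train_scores, omission):
    """Suitability threshold that omits the given fraction of training presences."""
    ranked = sorted(train_scores)
    n_omitted = math.floor(omission * len(ranked) + 1e-9)
    return ranked[n_omitted]


def omission_rate(test_scores, threshold):
    """Fraction of test presences predicted below the threshold."""
    return sum(s < threshold for s in test_scores) / len(test_scores)


# Hypothetical suitability scores for one cross-validation fold.
train = [0.12, 0.35, 0.40, 0.52, 0.58, 0.61, 0.66, 0.70, 0.81, 0.90]
test = [0.10, 0.30, 0.55, 0.75, 0.88]
background = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60]

t0 = training_threshold(train, 0.0)    # lowest training presence score
t10 = training_threshold(train, 0.10)  # omits 10% of training presences
or0 = omission_rate(test, t0)
or10 = omission_rate(test, t10)
fold_auc = auc(test, background)
```

In this made-up fold the test omission rates (0.2 and 0.4) exceed the expected 0 and 0.10, which is the pattern the text describes as poor performance.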
The best models were ranked based on 10% training OR, 0% training OR, and AUCcv, in that order.

To identify the mango-growing areas at potential risk of MSD establishment, mango yield data were obtained from EarthStat (http://www.earthstat.org/) at 10 × 10 km resolution. These data represent the average mango yield in tons per hectare for the period from 1997 to 2003. The data were reclassified to a binary map using the Reclassify tool in ArcGIS, version 10.2 (ESRI, Redlands, CA). Cells with zero or no-data values were converted to zero, and all other cells were converted to one, producing a map in which zero represents cells with no mango production and one represents areas where mango is produced. This binary layer of mango production reports was then expanded using the Expand tool in ArcGIS, to reduce problems arising because in some areas the reports were just single cells that were difficult to visualize, and because the data for some regions were of low accuracy. Finally, to estimate suitability for the disease only in mango production areas, the MaxEnt predicted output was extracted to mango production areas: the expanded binary map of mango production was multiplied by the MaxEnt output, which keeps the model’s suitability for MSD in cells with mango production reports and sets areas with no mango production to zero. […]
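The reclassify, expand, and multiply steps can be sketched on small grids. The original work used ArcGIS tools, so the following Python sketch is only an approximation under stated assumptions: rasters are plain nested lists, no-data is represented by None, and the expansion grows the mask by one cell in all eight directions (the source does not state the expansion distance). All function names and values are hypothetical.

```python
def reclassify(yield_grid):
    """Binary mask: 0 for zero or no-data (None) cells, 1 for every other cell."""
    return [[0 if value is None or value == 0 else 1 for value in row]
            for row in yield_grid]


def expand(mask, cells=1):
    """Grow the 1-valued area outward by the given number of cells (8 neighbors)."""
    rows, cols = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for _ in range(cells):
        src = [row[:] for row in out]
        for r in range(rows):
            for c in range(cols):
                if src[r][c] == 1:
                    continue
                # Set the cell to 1 if any cell in its 3 x 3 window is already 1.
                window = (src[rr][cc]
                          for rr in range(max(0, r - 1), min(rows, r + 2))
                          for cc in range(max(0, c - 1), min(cols, c + 2)))
                if any(window):
                    out[r][c] = 1
    return out


def mask_suitability(suitability, mask):
    """Cell-by-cell product: keeps suitability where mask is 1, zero elsewhere."""
    return [[s * m for s, m in zip(s_row, m_row)]
            for s_row, m_row in zip(suitability, mask)]


# Hypothetical 4 x 4 rasters: a single reported mango cell and a suitability grid.
mango_yield = [[None, 0, None, None],
               [0, 7.5, 0, None],
               [None, 0, 0, None],
               [None, None, None, None]]
suitability = [[0.8, 0.7, 0.6, 0.5],
               [0.7, 0.9, 0.4, 0.3],
               [0.6, 0.5, 0.2, 0.1],
               [0.5, 0.4, 0.3, 0.9]]

binary = reclassify(mango_yield)
expanded = expand(binary, cells=1)
risk = mask_suitability(suitability, expanded)
```

The single reported cell becomes a 3 × 3 block after expansion, so suitability survives in those nine cells and is set to zero everywhere else, including the highly suitable corner cell that has no mango production.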

Pipeline specifications

Software tools: spThin, SDMtoolbox
Application: Phylogenetics