Computational protocol: The “Bear” Essentials: Actualistic Research on Ursus arctos arctos in the Spanish Pyrenees and Its Implications for Paleontology and Archaeology

Protocol publication

[…] To document relationships among analytical components of the comparative LLB samples, we conducted principal component analyses (PCA), which produce factors by reducing the dimensionality of a data set described by multiple variables. With exploratory PCA, the analyst aims to improve prediction and variance accountability by detecting those variables that do not contribute significantly to sample variance; with confirmatory PCA, the analyst uses selected variables to maximize sample variability and sample component relationships. The use of continuous numerical variables may bias PCA solutions because of the heterogeneity of these values and the overemphasis of variables displaying high numerical values. For this reason, variables are usually centered and scaled prior to statistical analysis. In the present analysis, however, all variables were expressed as percentages and therefore share a similar scale, so there was no need to center and scale them.

As opposed to dimension reduction by orthogonal projection, as performed in PCA, in multidimensional scaling (MDS) points are chosen so that stress (the sum of the squared differences between the inter-sample disparities and the inter-point distances) is minimized. The MDS option we selected for our analyses is the identity transformation, which takes the inter-sample dissimilarities themselves as the inter-sample disparities. This metric MDS approach uses a Pythagorean metric analysis of inter-point distances and an iterative majorization algorithm to find the MDS solution. The algorithm was considered to have converged as soon as the relative decrease in stress was less than 10⁻⁶; it was also stopped once more than 5,000 iterations had been performed. In MDS, points are related in a low-dimensional Euclidean space, with data spatially projected by regression methods admitting non-linearity.
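The stress criterion and the two stopping rules described for metric MDS can be sketched as follows. This is a minimal illustration in plain Python, not the BiplotGUI routine used in the study: it assumes unit weights, the identity transformation (disparities equal to the dissimilarities), and a standard Guttman-transform update as the majorization step. The function names, the starting configuration, and the toy dissimilarities are all hypothetical.

```python
import math


def pairwise_distances(points):
    """Euclidean (Pythagorean) distances between all pairs of points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def raw_stress(dissim, points):
    """Sum over pairs i < j of (dissimilarity - inter-point distance)^2."""
    dist = pairwise_distances(points)
    n = len(points)
    return sum((dissim[i][j] - dist[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def guttman_update(dissim, points):
    """One majorization step (Guttman transform, unit weights assumed)."""
    n, dim = len(points), len(points[0])
    dist = pairwise_distances(points)
    b = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and dist[i][j] > 0:
                b[i][j] = -dissim[i][j] / dist[i][j]
        # Diagonal entry: minus the sum of the off-diagonal entries of row i.
        b[i][i] = -sum(b[i])
    return [[sum(b[i][k] * points[k][d] for k in range(n)) / n
             for d in range(dim)] for i in range(n)]


def metric_mds(dissim, start, tol=1e-6, max_iter=5000):
    """Iterate until the relative decrease in stress falls below tol,
    or until max_iter iterations have been performed."""
    points = [list(p) for p in start]
    stress = raw_stress(dissim, points)
    iterations = 0
    while iterations < max_iter and stress > 0:
        candidate = guttman_update(dissim, points)
        new_stress = raw_stress(dissim, candidate)
        iterations += 1
        relative_decrease = (stress - new_stress) / stress
        points, stress = candidate, new_stress
        if relative_decrease < tol:
            break
    return points, stress, iterations


# Toy dissimilarities (a 3-4-5 triangle), not data from the study.
toy_dissim = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
toy_start = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_points, toy_stress, toy_iterations = metric_mds(toy_dissim, toy_start)
```

Because each majorization step cannot increase stress, the returned stress is never larger than that of the starting configuration; the defaults `tol=1e-6` and `max_iter=5000` mirror the thresholds reported in the text.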
The use of MDS in the present study therefore complements the PCA test.

We also employed a canonical variate analysis (CVA). CVA focuses on data grouped into K classes by transforming the original variables into canonical variables defined by the squared distances between the group means, obtained with Mahalanobis's D². This measure is scale invariant. CVA produces a higher degree of separation between the group means than does either PCA or MDS. The biplot axes for CVA were determined by the group means.

In a separate database, we entered LLB completeness data (see above, point 3) and tooth score/tooth pit ratio data (see above, point 4). Both data sets were analyzed together using a redundancy analysis (RDA), which combines a multiple regression with a PCA. Data were first normalized using a Hellinger transformation. Missing values were replaced with group averages.

Because the results of the multivariate statistical analyses summarized above differentiated our comparative carnivore groups (see Results below), our next step was to create thresholds from which frequencies of different modification types could be used as lines to demarcate the different types of carnivoran taphonomic agents. To this end, we submitted data for all taxa and all variables to a tree-based analysis.

Regression and classification trees usually suffer from high variance, but fitting trees repeatedly to data sets derived from the same original sample and aggregating the results reduces that variance. This is the general purpose of bootstrap aggregation, more commonly known as bagging, which generates multiple data sets by bootstrapping the original training data set (TDS). Similarly, the powerful procedure of random forests (RF) performs a bootstrapping approach, but samples only a subset of variables for each tree. Thus, RF produces a final solution that includes a selection of variables that are important for correct classification of the analytical set.
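The resampling behind bagging and RF can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the randomForest code used in the study: each tree receives one bootstrap resample of the rows and one random subset of variables, as in the description above, whereas widely used implementations (including the randomForest package) draw candidate variables at each split. No trees are fitted here, and the function names, variable names, and sample size are hypothetical.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap resample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def left_out_rows(n_rows, drawn):
    """Rows that were never drawn in this resample."""
    drawn_set = set(drawn)
    return [i for i in range(n_rows) if i not in drawn_set]


def expected_left_out_fraction(n_rows):
    """Probability that a given row is absent from one bootstrap resample."""
    return (1.0 - 1.0 / n_rows) ** n_rows


def plan_forest(n_rows, variables, n_trees, k, seed=0):
    """For each tree, record its bootstrap rows, its left-out rows,
    and a random subset of k variables (per tree, following the text)."""
    rng = random.Random(seed)
    plan = []
    for _ in range(n_trees):
        rows = bootstrap_indices(n_rows, rng)
        plan.append({
            "rows": rows,
            "left_out": left_out_rows(n_rows, rows),
            "variables": rng.sample(variables, k),
        })
    return plan


# Hypothetical variable names and sample size, for illustration only.
toy_variables = ["var_a", "var_b", "var_c", "var_d", "var_e", "var_f"]
toy_plan = plan_forest(n_rows=30, variables=toy_variables, n_trees=500, k=3)
```

The rows left out of a resample are the ones available for validating the tree fitted to it; for moderately sized samples their expected share, `expected_left_out_fraction(n_rows)`, is a little over one third.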
RF produces hundreds of trees that are repeatedly fitted to bootstrapped sets of data. The results are validated against the observations (about one third) not used in the training data set; these are referred to as out-of-bag (OOB) observations. RF also estimates how many iterations are needed to minimize the OOB error. The importance of each variable is determined by the mean decreased error (MDE) for regression trees (RT), whereas the Gini index is more useful for classification trees (CT). The following step is the creation of a classification tree using the most useful variables selected by the RF test.

We conducted an RF test by selecting three-variable sets in each aggregated bootstrap iteration. A total of 500 trees was generated. The test was applied to all variables in . Subsequently, it was applied only to the variables dealing with LLB portion modification. Variables were selected for CT analysis when they showed a mean decreased Gini index of >0.5. The variables with higher MDE values were then selected to obtain single regression trees. CT tests were carried out with a complexity parameter (cp) of 0.005.

The frequency data are shown in . Analyses were performed in R. The PCA, CVA, and MDS tests are graphically displayed as biplots using the R library “BiplotGUI”. RDA was performed with the aid of the “ade4”, “vegan”, “packfor”, “MASS”, “ellipse” and “FactoMineR” libraries, and variables and data were plotted in a triplot. RF analysis was carried out with the R library “rattle”, which uses the “randomForest” library for RF tests and the “rpart” library for RT tests. […]
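The Gini criterion used above to rank variables for classification trees can be illustrated with a short Python sketch. It computes the Gini impurity of a set of class labels and the decrease in impurity produced by one threshold split on a single percentage-valued variable. It is a simplified illustration: how the R packages scale and average such decreases over all splits and trees is not reproduced here, and the function names, group labels, and values are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity
    of the two child nodes created by splitting at the threshold."""
    left = [lab for v, lab in zip(values, labels) if v < threshold]
    right = [lab for v, lab in zip(values, labels) if v >= threshold]
    n = len(labels)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted_children


# Toy percentages and hypothetical group labels, not data from the study.
toy_values = [5, 10, 60, 80]
toy_labels = ["group_a", "group_a", "group_b", "group_b"]
clean_split = gini_decrease(toy_values, toy_labels, threshold=35)
poor_split = gini_decrease(toy_values, toy_labels, threshold=7)
```

In this toy case the threshold of 35 separates the two groups completely, so the decrease equals the full parent impurity of 0.5, while the threshold of 7 yields a smaller decrease. A variable whose splits repeatedly produce large decreases across the forest is the kind of variable retained for the final classification tree.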

Pipeline specifications

Software tools: FactoMineR, randomForest
Application: Miscellaneous
Organisms: Ursus arctos, Crocuta crocuta, Canis lupus, Panthera leo