Ph.D Thesis

Ph.D Student: Gueta Tomer
Subject: Developing Tools and Methodology for Data-Intensive Biodiversity Research
Department: Department of Civil and Environmental Engineering
Supervisors: Prof. Yohay Carmel
Dr. Avi Bar-Massada


The recent availability of massive volumes of species occurrence data from numerous sources, connected within a single portal, may facilitate answering fundamental ecological questions. Yet these big biodiversity databases suffer from serious errors and biases, which may invalidate their use in research. Here, three approaches were taken to mitigate this problem. First, tools were developed and evaluated to help end-users (ecologists) conduct their own advanced data-cleaning, tailored to the aims of their specific research. To that end, bdclean, a novel R package that facilitates data-cleaning through a user-friendly workflow designed specifically for inexperienced R users, was developed.

Second, the value of case-specific, user-level data-cleaning was explored. A new framework was developed to quantify the effect of user-level data-cleaning on data quality using SDMs (Species Distribution Models). The basic assumption is that the change in SDM performance following data-cleaning reflects the change in data quality. Data on Australian mammals served to exemplify this approach. SDMs were constructed for various functional groups at six spatial scales. Data-cleaning resulted in a significant improvement of 5-25% in gain (an SDM performance index) for all functional groups and across all spatial scales.
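As a minimal illustration of the quantity reported above, the relative change in gain can be computed as the percentage difference between the post-cleaning and pre-cleaning values. The function name, the group names, and the gain values below are hypothetical and were invented for this sketch; they are not results from the thesis.

```python
def relative_gain_change(gain_before: float, gain_after: float) -> float:
    """Percent change in SDM gain after data-cleaning, relative to before."""
    if gain_before <= 0:
        raise ValueError("gain_before must be positive")
    return 100.0 * (gain_after - gain_before) / gain_before


# Hypothetical (before, after) gain values, for illustration only.
example_gains = {
    "group_a": (1.20, 1.38),
    "group_b": (0.80, 0.92),
}

for group, (before, after) in example_gains.items():
    print(f"{group}: {relative_gain_change(before, after):+.1f}%")
```

Under the assumption stated in the text, a positive value is read as an improvement in data quality following cleaning.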

Third, a novel means of interpreting results was proposed and evaluated, by binding data-cleaning to data analysis. It is reasonable to assume that even the most advanced cleaning procedures are not perfect. Therefore, the change in signal between the pre- and post-cleaning phases, in addition to the signal itself, was used to evaluate the research question. This approach was explored using a well-known community-ecology question: the debate over the role of environmental factors in determining species distribution (relative to the roles of stochasticity and dispersal).

Three alternative hypotheses (niche, neutral and continuum) were distinguished, using SDM performance as a proxy for the strength of environmental factors over a gradient of species richness. These hypotheses were tested using data downloaded from GBIF (Global Biodiversity Information Facility). Three corresponding datasets were generated using virtual species, in order to validate the predictions and to test various aspects of the analysis. Analyses of the virtual species showed that the niche, continuum, and neutral communities resulted in clear positive, negative, and nonsignificant trends, respectively. In the GBIF data, negative correlations between species richness and the predictive power of environmental factors were more common than positive correlations. The signal was consistent across various thresholds, ensemble techniques and spatial grids, and was supported by the virtual species results. Comparing the results before and after data-cleaning revealed a consistent trend: the signal became stronger and clearer after data-cleaning. The results, therefore, provide strong support for the continuum hypothesis.

This research reveals the merit of incorporating data-cleaning into the data analysis when working with biodiversity big data to answer macro-ecological questions, and builds tools towards best practice in user-level data-cleaning. The tools and methodology developed throughout this research can improve our ability to answer ecological questions, specifically in empirical analyses that build upon data available from large biodiversity databases.