Fit for purpose data
You will almost always need to post-process your GBIF download to make it fit your purpose, and sometimes this will involve difficult judgement calls for your particular use case. When you are dealing with thousands or millions of records, you can never fully know the true quality of the source data. Keep in mind that you are always mitigating data quality issues, not eliminating them.
A GBIF download contains data from a range of sources, so the data will likely vary in correctness and consistency. Correctness and consistency are two measures of data quality: they describe how well the data gatherer was able to capture the true value being investigated. Because of the nature of GBIF’s data publication workflow, both can vary depending on the data publisher and the source of the data. Knowing these properties of your data will help you understand the ways in which you can and cannot clean, validate and process it.
- Correctness (Accuracy) - closeness of measured values, observations or estimates to the real or true value, e.g. has the species or the collection locality been identified correctly?
  For instance, if we are studying plant biogeography in Indonesia and want to run an analysis for only one of the islands within the archipelago, an appropriate question might be: have localities on the island been correctly georeferenced?
- Consistency (Precision) - level of resolution of the data, e.g. precision of coordinates or of the taxonomic determination.
  In the Indonesian example, an appropriate question might be: is the coordinate uncertainty large enough that the occurrence record might not be on the island at all?
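The precision question for the island example can be sketched in code. The following is a minimal illustration, not a GBIF-provided method. It assumes each record is a dictionary with the Darwin Core fields `decimalLatitude`, `decimalLongitude` and `coordinateUncertaintyInMeters` already parsed to numbers, that the record's point lies inside the island, and that the coastline is available as a list of densely spaced (latitude, longitude) vertices. The function names and the coastline coordinates are hypothetical and do not describe a real island.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def could_be_off_island(record, coastline):
    """Return True if the uncertainty radius reaches the nearest coastline vertex.

    Assumption: the record's point is inside the island, and the coastline
    vertices are dense enough that the nearest vertex approximates the
    nearest point on the coast.
    """
    uncertainty = record.get("coordinateUncertaintyInMeters")
    if uncertainty is None:
        # Unknown precision: we cannot rule out an off-island location.
        return True
    nearest = min(
        haversine_m(record["decimalLatitude"], record["decimalLongitude"], lat, lon)
        for lat, lon in coastline
    )
    return uncertainty >= nearest


# Made-up coastline vertices, roughly 11 km from the example record.
example_coastline = [(0.0, 0.1), (0.1, 0.0), (0.0, -0.1), (-0.1, 0.0)]
example_record = {
    "decimalLatitude": 0.0,
    "decimalLongitude": 0.0,
    "coordinateUncertaintyInMeters": 500.0,
}
precise_enough = not could_be_off_island(example_record, example_coastline)
```

Treating a missing uncertainty as "could be off the island" is a conservative choice made for this sketch; whether to keep or drop such records is one of the judgement calls that depends on your analysis. For real work, a geometry library and a proper island polygon would give a more reliable answer than the vertex-distance approximation used here.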
As a general rule, most analyses need highly accurate data, although the level of precision required depends on the analysis. GBIF can help you assess the accuracy and precision of the data, for example through filters and issue flags. However, you must always double-check!
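As an illustration of using issue flags, the sketch below separates records that carry any flag from a chosen set from those that carry none. It assumes each record is a dictionary with an `issue` field holding semicolon-separated flag names, as in GBIF's tabular downloads. The helper names and the particular set of "blocking" flags are choices made for this example, not a GBIF recommendation.

```python
# Flags treated as disqualifying in this sketch; adjust to your own use case.
BLOCKING_FLAGS = {
    "ZERO_COORDINATE",
    "COORDINATE_INVALID",
    "COUNTRY_COORDINATE_MISMATCH",
}


def parse_issues(issue_field):
    """Split a semicolon-separated issue string into a set of flag names."""
    return {flag.strip() for flag in (issue_field or "").split(";") if flag.strip()}


def split_by_issue(records, blocking=BLOCKING_FLAGS):
    """Return (kept, flagged): records without and with any blocking flag."""
    kept, flagged = [], []
    for rec in records:
        if parse_issues(rec.get("issue")) & blocking:
            flagged.append(rec)
        else:
            kept.append(rec)
    return kept, flagged


# Hypothetical records for demonstration only.
example_records = [
    {"gbifID": "1", "issue": "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84"},
    {"gbifID": "2", "issue": "ZERO_COORDINATE"},
    {"gbifID": "3", "issue": ""},
]
kept, flagged = split_by_issue(example_records)
```

Returning the flagged records instead of discarding them keeps them available for manual review, in line with the advice to double-check rather than rely on flags alone. A record with no flags is not guaranteed to be correct; the flags only report problems that GBIF's processing was able to detect.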