12. Data management

In this module, you will review the main concepts, related tools and best practices for data management, particularly, data cleaning and standardization.

12.1. Principles of data management

In this video (09:49), you will review an important set of principles necessary to improve data through the processes of data cleaning. If you are unable to watch the embedded video, you can download it locally. (MP4 - 16.6 MB)

12.2. Data management tools

In this video (06:42), you will learn about a variety of tools that you can use to improve the quality of your data. If you are unable to watch the embedded video, you can download it locally. (MP4 - 10.3 MB)

12.3. OpenRefine

In this video (03:27), you will learn about OpenRefine. You can use OpenRefine to standardize and improve the quality of your data. If you are unable to watch the embedded video, you can download it locally. (MP4 - 3.8 MB)

12.4. Data journey step 7

Complete step 7, tasks 13-15.

12.5. Exercise tips

12.5.1. Validation checks

Technical errors Relatively simple, often able to be automated, checks against the integrity of the data. These may indicate incorrect exports, data mapping, field slippage (e.g. moving 1 column to the right) or data missing at the source.

  • Completeness: Whether all the data and metadata is available – are all fields present, are all fields filled out?

  • Bounds: For example, are days given in the range 1-31 (depending on month)

  • Data type: For example, does the Date field contain a date or a number?

  • Data format: For example, are Dates provided as 01/01/2010 or 01/Jan/10?

Consistency errors

Application of real-world rules to the data. These may indicate incorrect data entry from older records, transcription errors or post processing. Some are complex to implement and require reference data sets to check against. E.g. a list of known collectors and collecting habits. These rules can be gathered from data users and analysts.

  • Taxonomic: For example, if identified to species level, have a binomial scientific name and entries in genus and species fields been provided?

  • Currency: Are dates of collection, identification, update and digitization consistent?

  • Outliers: Detect outliers, but remember that not all outliers are necessarily errors. For example, compare against a known species range, or known environmental range (but remember that outliers may be misidentifications, rather than incorrect coordinates).

  • Geographic: Are the coordinates within the identified locality or region? For example, are there any terrestrial occurrences in the sea or marine occurrences on land?

  • Collecting patterns: Does the occurrence detail match the known collecting patterns of the organization or collector? Do any records appear to have been created after a collector has died (could this possibly be a different collector with a similar name)? For example, are any mammal records attributed to a bird watching group?

  • Accuracy and precision: For example, are any georeferenced records indicating very high precision or accuracy from a pre-GPS (or pre-accurate GPS) collecting period?

  • Collecting methods: Different survey methods (e.g. transects and area surveys) have particular characteristics. Are the records consistent with the method provided?