Exercise 3a-c

For these exercises, you will perform technical and consistency validation checks, improve data with different tools, and learn how to use OpenRefine.

Read USE CASE I (if you haven’t already).

Your institution is part of the “Global Poales Association (GPA)”. This association has secured funding to publish an up-to-date flora of the group and has asked your herbarium to participate by providing any high-quality records you hold for this order of plants. The order is well represented in your collection, so you expect to contribute substantially to this effort.

Exercise 3a

Validation checks

In this exercise we focus on technical errors and perform a basic validation check to identify them. Refer to Validation checks for information on the types of errors.

  1. Download UC1-3ab-data-cleaning.csv. (207.5 KB)

  2. Import the CSV file in Excel using the Excel wizard. See Excel-tips-EN.pdf (PDF, 7 MB) for import instructions for your operating system (Windows, Mac, Linux).

  3. Find and correct the errors manually.

  4. Use the previously downloaded exercise sheet to provide your answers.

Exercise 3b

Other data management tools

The GPA has given you a checklist of data quality elements to verify:

  • All plant names (full name) are correctly spelled

  • All plant names belong to the order Poales

  • All records have coordinates

  • All coordinates fall inside the stated country and are converted to decimal format

  • All dates are in the proper column and in the format YYYY-MM-DD
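Two of the checklist items above (dates in YYYY-MM-DD, records with coordinates) are mechanical enough to test with a short script. The sketch below is one possible approach in Python and is not part of the course material. The column names eventDate, decimalLatitude and decimalLongitude are assumptions borrowed from Darwin Core and may not match the headers in the exercise file. It does not test whether a point falls inside the stated country or whether a name is spelled correctly.

```python
import re
from datetime import datetime

# Strict shape check: four digits, two digits, two digits.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    text = (value or "").strip()
    # strptime alone would accept "2019-3-7", so check the shape first.
    if not ISO_DATE.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        # Right shape but impossible date, e.g. 2019-02-30.
        return False
    return True


def has_valid_coordinates(row, lat_col="decimalLatitude", lon_col="decimalLongitude"):
    """Return True if both coordinates are present, numeric and in range.

    The default column names are assumptions, not taken from the exercise file.
    """
    try:
        lat = float(row.get(lat_col, ""))
        lon = float(row.get(lon_col, ""))
    except (TypeError, ValueError):
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


def checklist_problems(row, date_col="eventDate"):
    """List the checklist items that a single record (a dict) fails."""
    problems = []
    if not is_iso_date(row.get(date_col)):
        problems.append("date not YYYY-MM-DD")
    if not has_valid_coordinates(row):
        problems.append("coordinates missing or out of range")
    return problems
```

Each record is passed as a dictionary, for example a row produced by csv.DictReader. An empty list from checklist_problems means the record passed both checks.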

The three categories of errors are:

  • Nomenclatural errors

  • Format errors

  • Geographic errors / outliers

  1. Refer to Helpful tools to complete the exercise. You are not limited to these tools; you may use any tools you like.

  2. Use the same file as in exercise 3a.

  3. Make corrections ONLY for the Eriocaulaceae family (you may want to filter the data first).

  4. Correct the errors you find using the tools of your choice, and document each change you make in the exercise sheet.

  5. Correct the entire file if you have time.

  6. Use the previously downloaded exercise sheet to provide your answers.
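If you choose a scripting tool for this exercise, filtering to one family and documenting each change could look roughly like the Python sketch below. It is an illustration under assumptions: the column names id, family and scientificName are hypothetical, the three sample rows are invented for the example and are not taken from the exercise file, and collapsing repeated spaces stands in for whatever corrections you actually make.

```python
import csv
import io


def filter_family(rows, family, column="family"):
    """Keep rows whose family matches, ignoring case and stray spaces."""
    target = family.strip().casefold()
    return [r for r in rows if (r.get(column) or "").strip().casefold() == target]


def log_change(log, record_id, field, old, new):
    """Append an entry to the change log only when the value really changed."""
    if old != new:
        log.append({"record": record_id, "field": field, "old": old, "new": new})


# Invented sample data; in practice, open the exercise CSV instead.
SAMPLE = """id,family,scientificName
1,Eriocaulaceae,Eriocaulon  aquaticum
2,Poaceae,Poa annua
3,eriocaulaceae ,Paepalanthus sp.
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
subset = filter_family(rows, "Eriocaulaceae")

changes = []
for row in subset:
    # Example correction: collapse runs of whitespace inside the name.
    cleaned = " ".join(row["scientificName"].split())
    log_change(changes, row["id"], "scientificName", row["scientificName"], cleaned)
    row["scientificName"] = cleaned
```

The changes list then holds one entry per edited value, which can be copied into the exercise sheet as the record of what was modified.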

Exercise 3c

In this video (03:27), you will learn about OpenRefine. You can use OpenRefine to standardize and improve the quality of your data. If you are unable to watch the embedded video, you can download it locally. (MP4 - 3.8 MB)

OpenRefine

In this exercise we use OpenRefine to improve the quality of a dataset using its default features, existing web services and regular expressions.
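The exercise PDF defines the actual OpenRefine steps. To illustrate only what a regular expression can do in data cleaning, here is a small Python sketch, written outside OpenRefine, that recognizes coordinates in degrees, minutes and seconds and converts them to decimal degrees. The accepted input shape (for example 12°30'00"N) and the function name dms_to_decimal are assumptions made for this example; real data may need a more permissive pattern.

```python
import re

# Verbose pattern: degrees, minutes, seconds, then a hemisphere letter.
DMS_PATTERN = re.compile(
    r"""^\s*(?P<deg>\d{1,3})\s*°\s*
        (?P<min>\d{1,2})\s*['′]\s*
        (?P<sec>\d{1,2}(?:\.\d+)?)\s*["″]\s*
        (?P<hem>[NSEW])\s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def dms_to_decimal(text):
    """Convert a DMS string to decimal degrees, or return None if it does not parse."""
    match = DMS_PATTERN.match(text)
    if match is None:
        return None
    degrees = int(match["deg"])
    minutes = int(match["min"])
    seconds = float(match["sec"])
    if minutes >= 60 or seconds >= 60:
        return None
    value = degrees + minutes / 60 + seconds / 3600
    hemisphere = match["hem"].upper()
    # Latitudes stop at 90 degrees, longitudes at 180.
    limit = 90 if hemisphere in "NS" else 180
    if value > limit:
        return None
    # South and west are negative in decimal notation.
    if hemisphere in "SW":
        value = -value
    return round(value, 6)


print(dms_to_decimal("12°30'00\"N"))    # 12.5
print(dms_to_decimal("70° 15' 0\" W"))  # -70.25
print(dms_to_decimal("12.5"))           # None: already decimal, no DMS parts
```

Returning None for anything that does not match keeps unparsed values visible for manual review instead of silently guessing.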

  1. Download UC1-3c-open-refine.csv. (207.5 KB)

  2. Download and complete the exercises in OpenRefine-Exercise3c-EN.pdf. (PDF, 1.1 MB) Also available in French and Spanish.

  3. Use the previously downloaded exercise sheet to provide your answers.