3. Data journey - bumblebee pollinators

This data journey is a series of practical exercises focusing on bumblebee pollinators. Using GBIF and BOLD, you will learn: I. how to find and use existing biodiversity data relevant to your research questions; II. how to efficiently capture and clean these data, i.e. put them in a standard format that is directly relevant and usable for you; III. how to generate and publish new data according to international standards.
This data journey comprises nine steps. Each step (or set of steps) corresponds to a module of the course. After a series of theoretical lectures, you will return to the data journey to complete the practical exercises. The practical exercises follow a path: I. the study system; II. questions and hypotheses; III. availability of data; IV. capture and cleaning of data; and V. generation and publication of data.

3.1. Study system

Bombus-Vicia pollination network

3.1.1. Bumblebee pollinators

  1. What do we know about bumblebees in the Caucasus region?

    1. How many species? Where are they distributed?

    2. What are their preferred plant resources?

  2. Identify a simple and clear research question

    1. What resources and types of data will you need?

    2. How much of this data already exists?

3.1.2. Caucasus ecoregion

  • One of the most species-rich regions in the world

  • High endemism rates

  • Accelerate biodiversity discovery

    • Systematics

    • DNA barcoding

    • Biological metadata

  • Some biodiversity data is already available

  • Identify relevant datasets

  • Mobilize them for addressing scientific & conservation issues

  • Identify data gaps for future research

3.1.3. Public biodiversity databases

You may find that data for the ecoregion are not publicly available in every database, so you will need to reflect on your search strategy.


3.2. Step 1

In Step 1, you will develop your research questions.

In a workshop setting, complete Step 1 as a group. Select a recorder/reporter to report back after task 3 is complete.

3.2.1. Task 1


Take 5 minutes to complete the following:

  1. Form your research question(s)

  2. Where is data available (or not available)?

  3. Draw a schematic of your question and your plan to answer it.

3.2.2. Task 2

Take 25 minutes to complete the following using publicly available databases:

  1. Search for different species based on their names.

    1. Are there any records?

  2. How much overlap in available records do you observe between different databases?

  3. What possible research questions can you formulate?

  4. Do you have the necessary data?

3.2.3. Task 3

Take 10 minutes to complete the following:

  1. Reflect on your experience in task 2 and further refine your project goals.

3.3. Step 2

In Step 2, you will begin to work with species identifications using GBIF and the Georgian Biodiversity Database (GBD).

3.3.1. Task 4

Take 15 minutes to complete the following:

  1. Search Bombus records in both GBIF and GBD.

  2. What species of Bombus are known to occur in Georgia?

  3. In Excel, combine the lists from the two databases into a final list of species names occurring in Georgia.

  4. If you have time, do the same for Vicia records.

In this exercise you will manually edit and clean the list from GBD to prepare for the next task.
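
The merge of the two lists can also be done outside Excel. The following is a minimal Python sketch, assuming each database export has already been reduced to a plain list of name strings. The function name and the sample names are placeholders for illustration, not actual query results from GBIF or GBD.

```python
def merge_species_lists(*lists):
    """Combine name lists, normalising whitespace and dropping duplicates."""
    merged = set()
    for names in lists:
        for name in names:
            cleaned = " ".join(name.split())  # trim and collapse spaces
            if cleaned:  # skip empty cells
                merged.add(cleaned)
    return sorted(merged)


# Placeholder lists standing in for the GBIF and GBD exports.
gbif_names = ["Bombus hortorum", "Bombus terrestris ", "Bombus lapidarius"]
gbd_names = ["Bombus  hortorum", "Bombus pascuorum", ""]

final_list = merge_species_lists(gbif_names, gbd_names)
print(final_list)
```

Normalising whitespace before comparing matters because names that differ only by a stray space would otherwise be kept as two separate entries.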

3.3.2. Task 5

Take 15 minutes to complete the following:

  1. Using the GBIF species lookup, produce a clean list of species with validated taxonomic names.

Synonyms may arise whenever the same taxon is described and named more than once, independently.
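
The task itself uses the interactive GBIF species lookup. As a complement, here is a hedged sketch of how a single name could be checked against the GBIF name-matching web service from Python. The endpoint and the response fields `matchType`, `status` and `synonym` reflect our understanding of the GBIF API and should be verified against its documentation; the helper names are our own.

```python
import json
import urllib.parse
import urllib.request

MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(name):
    """Return the name-matching URL for one scientific name."""
    return MATCH_URL + "?" + urllib.parse.urlencode({"name": name})


def fetch_match(name):
    """Query GBIF (network access required) and return the parsed JSON."""
    with urllib.request.urlopen(build_match_url(name), timeout=30) as resp:
        return json.load(resp)


def summarise_match(result):
    """Classify a match result as accepted, synonym or unmatched."""
    if result.get("matchType", "NONE") == "NONE":
        return "unmatched"
    if result.get("synonym") or result.get("status") == "SYNONYM":
        return "synonym"
    if result.get("status") == "ACCEPTED":
        return "accepted"
    return "check manually"  # e.g. doubtful names
```

A call such as `summarise_match(fetch_match("Bombus hortorum"))` would then label one name; names reported as synonyms or unmatched are the ones to review by hand.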

3.3.3. Task 6

Take 10 minutes to complete the following:

  1. Reflect on what has been learned and experienced during Steps 1 and 2.

  2. Explore the BOLD database in preparation for the next steps.

In a workshop setting, complete task 6 as a group. Select a recorder/reporter to report back (1 minute per group) after task 6 is complete.

3.4. Step 3

3.4.1. Task 7

Take 45 minutes to complete the following:

  1. Explore Bombus-Vicia records in BOLD.

  2. Are there DNA barcodes available for the species of Bombus and Vicia you listed from Georgia?

  3. If yes, download the data.

  4. How many specimens per species?

  5. Which of these species lack a DNA barcode, and how many are there? Mark them in your list.
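
The counting in items 4 and 5 can be sketched in Python as follows, assuming the downloaded BOLD data has been reduced to one species name per barcoded specimen. The function name and the sample values are illustrative only and do not represent real BOLD holdings.

```python
from collections import Counter


def barcode_coverage(species_list, barcode_records):
    """Count barcoded specimens per listed species and find the gaps.

    barcode_records holds one species name per barcoded specimen.
    Returns (counts per species, sorted list of species with no barcode).
    """
    wanted = set(species_list)
    counts = Counter(name for name in barcode_records if name in wanted)
    missing = sorted(wanted - set(counts))
    return dict(counts), missing


# Illustrative inputs, not real data.
species_list = ["Bombus hortorum", "Bombus terrestris", "Bombus pascuorum"]
barcode_records = [
    "Bombus hortorum",
    "Bombus hortorum",
    "Bombus terrestris",
    "Bombus lucorum",  # not on the list, so it is ignored
]

counts, missing = barcode_coverage(species_list, barcode_records)
print(counts)
print(missing)
```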

3.4.2. Task 8

Take 15 minutes to complete the following:

  1. Explore more automated ways of accessing DNA barcoding data in BOLD.

  2. Use the GBIF shortcut for Bombus hortorum (https://www.gbif.org/species/1340542)

  3. Click around for different species

  4. Are you able to access all of the data, and the same types of data, for each species?
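
The taxon key in the GBIF species URL above (1340542) can also be used programmatically. The sketch below builds a query for the GBIF occurrence search web service; note that this is the GBIF API rather than BOLD itself, and the endpoint and the parameter names `taxonKey`, `country` and `limit` should be checked against the GBIF API documentation. The function names are our own.

```python
import json
import urllib.parse
import urllib.request

OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_occurrence_url(taxon_key, country=None, limit=20):
    """Build an occurrence search URL for one taxon key."""
    params = {"taxonKey": taxon_key, "limit": limit}
    if country:
        params["country"] = country  # ISO 3166-1 alpha-2 code, e.g. "GE"
    return OCCURRENCE_SEARCH + "?" + urllib.parse.urlencode(params)


def fetch_occurrences(taxon_key, country=None, limit=20):
    """Run the query (network access required) and return the result records."""
    url = build_occurrence_url(taxon_key, country, limit)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp).get("results", [])


# Key taken from the species page for Bombus hortorum used in this task.
url = build_occurrence_url(1340542, country="GE", limit=5)
print(url)
```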

3.4.3. Task 9

Take 35 minutes to complete the following:

  1. How much information is needed for referencing a DNA barcode?

  2. Find out what data and metadata are available for each DNA barcode for the specimen you just located.

  3. Examine what information is given in BOLD for a specimen and download the available data for one Bombus specimen.

In a workshop setting, complete task 9 as a group. Select a recorder/reporter to report back (2 minutes per group) after task 9 is complete.

3.5. Step 4

3.5.1. Task 10

Take 30 minutes to complete the following:

  1. Download sequences.zip file (ZIP 5 KB)

  2. Get an identification for a sequence by copying-pasting an “unknown” DNA sequence to BOLD.

  3. Enter species names and check whether they have records in BOLD, then examine these sequences with the sequence analysis tools in BOLD.

  4. Download all sequences for all available Bombus species known from Georgia in BOLD.

  5. Select three Bombus species and make a map.

    1. How are species distributed (i.e. located in your area of interest)?

    2. Are all of them useful for your research?

3.6. Step 5

3.6.1. Task 11

Take 30 minutes to complete the following:

  1. Using the BOLD sequence analysis tools:

    1. Make a tree to check how well the sequences group by species

    2. Look at the BINs vs species

    3. Look at barcode gaps

    4. Among the barcodes for Bombus, are all of them useful for your research?

    5. Are there cryptic species or misidentifications?

  2. If you have time, use the GBIF sequence id tool
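
To make the idea of a barcode gap concrete, here is a small Python sketch. It compares the largest distance within species to the smallest distance between species, using simple uncorrected p-distances on toy aligned sequences. This is a simplification chosen for illustration: the BOLD tools may use a different distance model, and the sequences and species labels below are invented.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def barcode_gap(records):
    """records is a list of (species, aligned_sequence) pairs.

    Returns (max intraspecific distance, min interspecific distance).
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    return max(intra, default=0.0), min(inter, default=None)


# Invented toy alignment, 10 sites per sequence.
records = [
    ("species A", "ACGTACGTAC"),
    ("species A", "ACGTACGTAT"),
    ("species B", "ACGAACCTGC"),
    ("species B", "ACGAACCTGC"),
]

max_intra, min_inter = barcode_gap(records)
print(max_intra, min_inter, max_intra < min_inter)
```

A gap exists when the smallest between-species distance exceeds the largest within-species distance. When the two overlap, cryptic species or misidentifications are possible explanations worth checking.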

3.7. Step 6

3.7.1. Task 12

Take 45 minutes to complete the following:

Imagine that you are the person assigned to transcribe the collected field data. You know you will share your data with GBIF, so you decide to start with an occurrence template. However, you also know that you will share more data than the required and recommended GBIF fields cover.

  1. Review the herbarium sheets and determine what information can be captured. Consider that all data should be captured verbatim.

  2. Download the Excel template and add fields to the spreadsheet to accommodate all the data that can be captured.

  3. Transcribe the verbatim data from the two herbarium sheets. Download the Vicia.zip file (ZIP 51 MB).

  4. Consider other fields that you could add to the spreadsheet that can be derived from other known information.
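
One example of a derived field is a decimal coordinate computed from a verbatim reading in degrees, minutes and seconds. The sketch below assumes the verbatim value has been transcribed as four space-separated parts (degrees, minutes, seconds, hemisphere). The sample record is hypothetical and not taken from the herbarium sheets; the field names follow Darwin Core terms.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus N, S, E or W to decimal degrees."""
    hemisphere = hemisphere.strip().upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError("hemisphere must be N, S, E or W")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be at least 0 and below 60")
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    return -value if hemisphere in {"S", "W"} else value


def parse_verbatim(text):
    """Parse a string such as '40 30 00 N' (assumed transcription format)."""
    deg, minute, sec, hemi = text.split()
    return dms_to_decimal(float(deg), float(minute), float(sec), hemi)


# Hypothetical record: the verbatim fields are kept, the decimal fields are derived.
record = {"verbatimLatitude": "40 30 00 N", "verbatimLongitude": "44 45 00 E"}
record["decimalLatitude"] = parse_verbatim(record["verbatimLatitude"])
record["decimalLongitude"] = parse_verbatim(record["verbatimLongitude"])
print(record)
```

Keeping the verbatim columns alongside the derived ones preserves what was actually written on the label.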

3.8. Step 7

Refer to Exercise tips in the data management section for information on types of errors and helpful tools.

3.8.1. Task 13

Take 15 minutes to complete the following:

  1. You are now tasked with performing some standard data quality checks on the data.

  2. Download ViciaForCleaning.txt (ZIP 66 KB)

  3. Import the file in Excel using the Excel wizard. See Excel-tips-EN.pdf (PDF, 7 MB) for import instructions for your operating system (Windows, Mac, Linux).

  4. Find and correct the errors manually.

    1. country - check for blanks

    2. year - do any years seem odd?

    3. countryCode - code should include only letters

    4. month - review the Darwin Core recommendation for month

    5. taxonRank - check for blanks

    6. minimumElevationInMeters and maximumElevationInMeters - for reference, the summit of Mt. Everest is at about 8849 m

    7. kingdom - is kingdom correct?
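
The task asks you to find these errors manually in Excel. For comparison, several of the same checks can be expressed as rules in a short Python sketch, assuming each record is a dictionary of strings keyed by Darwin Core field names. The lower year bound of 1600 and the sample rows are assumptions made for illustration.

```python
import re
from datetime import date

MAX_ELEVATION_M = 8849  # approximate summit elevation of Mt. Everest
MIN_YEAR = 1600  # assumed lower bound for plausible collection years


def check_row(row):
    """Return a list of problems found in one occurrence record."""
    problems = []
    for field in ("country", "taxonRank"):
        if not (row.get(field) or "").strip():
            problems.append(f"{field} is blank")
    if not re.fullmatch(r"[A-Za-z]{2}", row.get("countryCode") or ""):
        problems.append("countryCode is not a two-letter code")
    month = row.get("month") or ""
    if month and not (re.fullmatch(r"[0-9]{1,2}", month) and 1 <= int(month) <= 12):
        problems.append("month is not an integer from 1 to 12")
    year = row.get("year") or ""
    if year and not (
        re.fullmatch(r"[0-9]{4}", year) and MIN_YEAR <= int(year) <= date.today().year
    ):
        problems.append("year is outside the plausible range")
    for field in ("minimumElevationInMeters", "maximumElevationInMeters"):
        value = row.get(field) or ""
        try:
            if value and float(value) > MAX_ELEVATION_M:
                problems.append(f"{field} exceeds {MAX_ELEVATION_M} m")
        except ValueError:
            problems.append(f"{field} is not a number")
    return problems


# Invented rows: one clean, one with several typical errors.
good = {"country": "Armenia", "countryCode": "AM", "taxonRank": "species",
        "month": "6", "year": "1998",
        "minimumElevationInMeters": "1200", "maximumElevationInMeters": "1300"}
bad = {"country": "", "countryCode": "A1", "taxonRank": "species",
       "month": "June", "year": "2998",
       "minimumElevationInMeters": "1200", "maximumElevationInMeters": "12000"}

print(check_row(good))
print(check_row(bad))
```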

3.8.2. Task 14

Take 15 minutes to complete the following:

  1. Continue using the same file. Filter for the cracca species.

  2. Locate and correct the errors using tools of your choice.

    1. species - Are the names valid?

    2. decimalLatitude and decimalLongitude - Are all the coordinates consistent and in decimal format?

    3. decimalLatitude and decimalLongitude - Do all the occurrences with coordinates fall within Armenia?

    4. eventDate - Does the date exist?
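
Two of these checks lend themselves to a short Python sketch: a coarse test of whether coordinates fall near Armenia, and a test of whether a date exists in the calendar. The bounding box values are approximate and are an assumption; a rectangle also covers parts of neighbouring countries, so it only catches gross errors such as swapped latitude and longitude. The date check assumes values written as YYYY-MM-DD.

```python
from datetime import date

# Approximate bounding box for Armenia (assumed values, coarse check only).
ARMENIA_BBOX = {"min_lat": 38.8, "max_lat": 41.4, "min_lon": 43.4, "max_lon": 46.7}


def in_bbox(lat, lon, bbox=ARMENIA_BBOX):
    """True if the point lies inside the rectangular bounding box."""
    return (bbox["min_lat"] <= lat <= bbox["max_lat"]
            and bbox["min_lon"] <= lon <= bbox["max_lon"])


def is_real_date(text):
    """True if text has the form YYYY-MM-DD and names a date that exists."""
    try:
        year, month, day = (int(part) for part in text.split("-"))
        date(year, month, day)  # raises ValueError for e.g. 30 February
    except ValueError:
        return False
    return True


print(in_bbox(40.18, 44.51))       # a point near Yerevan
print(in_bbox(44.51, 40.18))       # the same point with the values swapped
print(is_real_date("2019-02-30"))
```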

3.8.3. Task 15

Take 60 minutes to complete the following:

  1. Use OpenRefine to improve the quality of a dataset by using the default features, existing web services and regular expressions.

  2. Download UC1-3c-open-refine.csv. (207.5 KB)

  3. Download and complete the exercises in OpenRefine-Exercise3c-EN.pdf. (PDF, 1.1 MB)

3.9. Step 8

3.9.1. Task 16

Take 45 minutes to complete the following:

  1. Log in to the IPT and explore it.

  2. Select a dataset published on the IPT and review the metadata.

  3. Download the Darwin Core Archive.

  4. Navigate to the dataset on GBIF.

  5. Review the occurrence records through the GBIF dev portal.

  6. Download the occurrences from GBIF.
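
A Darwin Core Archive is a zip file, so its contents can be inspected with standard tools once downloaded. The following Python sketch reads a tab-delimited core file from an archive held in memory. It assumes the core file is named occurrence.txt and is tab-delimited with a header row; in a real archive the file name, delimiter and column mapping are declared in meta.xml, which this sketch does not parse. The demo archive is a tiny stand-in built in the code, not a real IPT export.

```python
import csv
import io
import zipfile


def read_core_rows(archive_bytes, core_file="occurrence.txt"):
    """Return the rows of a tab-delimited core file inside a zipped archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        with archive.open(core_file) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            return list(reader)


def make_demo_archive():
    """Build a tiny stand-in archive in memory for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "occurrence.txt",
            "occurrenceID\tscientificName\tcountryCode\n"
            "demo-1\tVicia cracca\tAM\n",
        )
    return buffer.getvalue()


rows = read_core_rows(make_demo_archive())
print(rows[0]["scientificName"])
```

For a downloaded archive, the bytes could be obtained with `open(path, "rb").read()`, where `path` points to the zip file on disk.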

If you’d like to try publishing a dataset yourself, download the following files and use the IPT listed above to create an occurrence dataset with associated images:

  1. Download RBGE-occurrence.txt (ZIP 76 KB)

  2. Download RBGE-multimedia.txt (ZIP 65 KB)

3.10. Step 9

3.10.1. Task 17

Take 60 minutes to complete the following:

  1. Upload specimen data

  2. Upload images

  3. Upload sequence & trace files