Part 4 Data preprocessing
We will use the food web network compiled by Planque et al. (2014) which is available from the repository ‘Ecological Archives’ (ID E095-124).
N.B. If needed, the dataset can be retrieved from:
- the Journal “Ecology”,
- published in Year “2014”,
- by the First author “Planque”,
- as a Data Paper (“Yes”).
4.1 Load dependencies
# rm(list = ls())
library(tidyverse)
library(magrittr)
old_opts <- options() # preserve current preferences for after we exist the function
options(box.path = c("./scripts",
"../scripts",
"./scripts/box",
"../scripts/box") )
Load modules
box::use(box/modify_strings[locate_pattern, replace_pattern])
# box::use(box/refresh_box[refresh])
# refresh("box/modify_strings")
4.2 Data importation
The dataset is provided as tables formatted inside text files.
4.2.1 Import species list
species_list <- read.delim("raw_data/Barents_Sea/SpeciesList_2015.txt", header = TRUE)
sp_save <- species_list # keep original copy
head(species_list)
#> TROPHOSPECIES ABBREVIATION PHYLUM_SUBPYLUM CLASS
#> 1 DETRITUS DET_IND Detritus Detritus
#> 2 AUTOTHROPH_FLAGELLAT AUT_FLA Autotroph flagellates Autotroph flagellates
#> 3 BACTERIA_INDET BAC_IND Picoplankton Picoplankton
#> 4 DIATOM DIATOM Microplankton Microplankton
#> 5 HETEROTROPH_FLAGELLAT HET_FLA Heterotroph flagellates Heterotroph flagellates
#> 6 ICE_ALGAE ICE_ALG Ice algae Ice algae
#> ORDER FAMILY GROUP
#> 1 Detritus Detritus 1_Plankton
#> 2 Autotroph flagellates Autotroph flagellates 1_Plankton
#> 3 Picoplankton Picoplankton 1_Plankton
#> 4 Microplankton Microplankton 1_Plankton
#> 5 Heterotroph flagellates Heterotroph flagellates 1_Plankton
#> 6 Ice algae Ice algae 1_Plankton
The list of species is provide as a table that contains the species names along with an abbreviation and the taxonomy (i.e. classification of each organism).
4.2.2 Import interactions list
pairwise_list <- read.delim("raw_data/Barents_Sea/PairwiseList_2015.txt", header = TRUE)
pw_save <- species_list # keep original copy
head(pairwise_list)
#> PWKEY PREDATOR PREY CODE
#> 1 ACA_SPP-ACA_SPP ACARTIA_SPP ACARTIA_SPP 2
#> 2 ACA_SPP-AUT_FLA ACARTIA_SPP AUTOTHROPH_FLAGELLAT 1
#> 3 ACA_SPP-DIATOM ACARTIA_SPP DIATOM 1
#> 4 ACA_SPP-HET_FLA ACARTIA_SPP HETEROTROPH_FLAGELLAT 1
#> 5 ACA_SPP-MIX_FLA ACARTIA_SPP MIXOTROPH_FLAGELLATES 4
#> 6 ACA_SPP-PROZOO ACARTIA_SPP PROTOZOOPLANKTON 1
The list of trophic interactions (a.k.a. relationships of ‘who eats and whom’) is provided as a pairwise list. The first column contains an identifier. The consecutive columns PREDATOR and PREY contains the species names of the predator and prey, respectively. The rows contain the relationships between a predator and a prey.
4.2.3 Import literature references
references <- read.delim("raw_data/Barents_Sea/References_2015.txt")
pairwise_to_references <- read.delim("raw_data/Barents_Sea/Pairwise2References_2015.txt")
Both ‘references’ and ‘pairwise_to_references’ tables contain metadata about the interactions.
4.3 Data cleaning and augmentation
4.3.1 Species list
4.3.1.1 Correct mispelling and non-letter characters
Correct mispelling in column names:
Correct mispelling in species name
- Identify patterns using regular expressions while excluding the “_” in the names
- Check what patterns need replacing
- Replace the patterns
# Helper function to apply to each column
box::use(box/modify_strings[locate_pattern])
strings_to_correct <- locate_pattern(species_list, "[^0-9a-zA-Z_]")
#> NA rows contain the pattern: [^0-9a-zA-Z_]
#> Column: 1
#> Column: 2
#> Column: 3
#> Column: 4
#> Column: 5
#> Column: 6
#> Column: 7
print(strings_to_correct)
#> [1] "BERO\xe8_SP" "OITHONA_SPINIROSTRIS/ATLANTICA"
#> [3] "Autotroph flagellates" "Heterotroph flagellates"
#> [5] "Ice algae" "Mixotroph flagellates"
#> [7] "Amphipoda/Gammarida" "Amphipoda/Gammaridea"
#> [9] "Amphipoda/Hyperiidea"
The organisms names are inconsistent and contain ASCII strings, slashes, and spaces that R won’t handle properly. We will replace them.
# Store the patterns
patterns <- c("\xe8", " ", "/")
# Store their replacements
replacements <- c("E", "_", "_")
# Replace the patterns in all columns
box::use(box/modify_strings[replace_pattern])
species_list <- replace_pattern(x = species_list, vector_pattern = patterns, vector_replacement = replacements )
#> Replacing '<e8>' with 'E'
#> Replacing ' ' with '_'
#> Replacing '/' with '_'
sp_save <- species_list
# species_list <- sp_save
Some of the latin name abbreviations are associated with the wrong organism. The abbreviations follow international fisheries standards (e.g. GAD_MOR for ‘GADUS MORHUA’). We can reconstruct the abbreviations from the organisms’ names.
# "(^[A-Za-z]{1,3})"
# "_+([A-Za-z]{1,3})"
abbr1 <- species_list$TROPHOSPECIES %>% str_extract(., "(^[A-Za-z]{1,3})")
abbr2 <- species_list$TROPHOSPECIES %>% str_extract(., "(?<=_)([A-Za-z]{1,3})")
species_list$ABBREVIATION <- paste(abbr1, ifelse(is.na(abbr2), "IND", abbr2), sep="_" )
4.3.1.2 Data augmentation
- Genus + Species
We can extract more information from the ‘trophospecies’ column that contains both the Genus and the species name. We can use the Genus as additional information for the classification.
N.B. It does not matter if you do not know what Genus and species names are. Just know that they are deeper levels in the taxonomy of an organism.
Genus and species are grouped with ’_’. We can split them.
species_list %<>% separate(., col = TROPHOSPECIES, c("GENUS", "SPECIES"),
remove = FALSE, extra = "drop" )
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 6 rows [1, 4, 7, 10, 80,
#> 111].
Finally, we can drop the column species because all species are different and cannot be used to group information.
4.4 List of interactions: pairwise list
4.4.1 Data cleaning
For the purpose of the analysis, the data in ‘species_list’ and ‘pairwise_list’ need to match. We need to apply the same cleaning to this table.
4.4.1.1 Correct mispellings
# debug(locate_pattern)
box::use(box/modify_strings[locate_pattern])
box::use(box/modify_strings[replace_pattern])
strings_to_correct <- locate_pattern(pairwise_list, "[^0-9a-zA-Z_-]")
print(strings_to_correct)
pairwise_list <- replace_pattern(x = pairwise_list, vector_pattern = patterns, vector_replacement = replacements )
#### Correct PWKEY
"(^[A-Za-z]{1,3})"
"_+([A-Za-z]{1,3})"
# debug(extract_and_merge)
extract_and_merge(pairwise_list$PREDATOR, c("(^[A-Za-z]{1,3})", "_+([A-Za-z]{1,3})_?"), sep = "" )
box::use(box/modify_strings[test_extract])
test_extract(pairwise_list$PREDATOR, "_+([A-Za-z]{1,3})(?(_)([A-Za-z]{1,3}))")
4.4.2 Data transformation
The data was provided as a long pairwise list which is limiting for the purpose of the analysis. We can convert the list to a squared binary matrix where the intersections of rows and columns contain the information for an interaction.
In other words, ‘0s’ means that the interaction is absent (or has yet to be observed in nature); ‘1s’ indicates the presence of an interaction between the organisms listed in the columns and rows interacting.