10  OPTIONAL - Movement Data

Week 10 - Cleaning and plotting movement data

This is an optional tutorial for those who plan to work with telemetry data that tracks animal movement.

Note

THIS TUTORIAL WILL NOT BE IN THE EXAM. It is an optional extra learning tool.

Since movement data from tracked animals include both location and time information, this tutorial draws on skills from both the time series tutorial and the spatial data tutorial.

Sensor Technologies

How do we track animals? Here are some examples of the methods researchers use to track animals in the wild.

Important

How you process, analyse and draw inferences from movement data will depend heavily on the type of tracking technique used.

Various methods for marking and tagging insects. Image from Ruzichova & Elek (2026)

Process GPS Data

The most basic movement data will include a column for the ID of the tracked individual, an x coordinate representing longitude, a y coordinate representing latitude, and a date_time stamp representing the date and time of the location.
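For example, a minimal tracking dataset might look like this (the values below are made up for illustration):

# A hypothetical example of the minimal structure (values are made up)
library(tibble)

tracks <- tribble(
  ~ID,       ~x,      ~y,      ~date_time,
  "bird_01", 72.4712, -7.2356, "2016-06-27 04:00:30",
  "bird_01", 72.4715, -7.2349, "2016-06-27 04:05:30",
  "bird_02", 72.4101, -7.2240, "2016-06-27 04:00:12"
)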

In this tutorial, we will work with GPS data from three GPS-tagged red-footed boobies, Sula sula.

We will loosely follow the cleaning and processing workflow for raw data from the ExMove tutorial, but there are others out there, for example Gupte et al. (2021).

An example of how the ExMove toolkit has been applied to GPS-tracked central-place foraging red-footed boobies, Sula sula.

Clean raw data

The three GPS data files and the metadata file are available in the github repository ECS200, in the data folder called gps_study. You will need to download them to your local machine.

Load multiple GPS files into R

# install.packages("stringr")

# Load libraries
library(tidyverse)
library(stringr)
library(lubridate) # for converting dates and time
library(sf) # work with spatial data

Adjust the IDstart and IDend numbers below to extract the ID from the file name using the stringr R package (e.g. to extract GV37501 from “GV37501_201606_DG_RFB.csv”, we want characters 1-7). NB: this approach only works if all IDs are the same length and in the same position; see the str_sub() documentation for other options.
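For example, to check that those positions pick out the ID correctly:

# Characters 1 to 7 of the example file name give the tag ID
stringr::str_sub("GV37501_201606_DG_RFB.csv", start = 1, end = 7)
[1] "GV37501"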

# List all individual files 
files <- list.files(path = "YOUR/FILE/LOCATION", 
                    pattern = "GV.*\\.csv$", # only extract files with the letters GV for the GPS data
                    full.names = TRUE)

# load and merge all listed files
raw_data <- files %>%
  map_dfr(~ read_csv(.x, 
                     col_types = cols(
                       Date = col_character(),
                       Time = col_character()
                       )
                     ) %>% 
            mutate(file_name = basename(.x),
                   # for files with different date formats
                   Date = lubridate::parse_date_time(Date, orders = c("ymd", "dmy", "mdy"))
                   )
          ) 


IDstart <- 1 # start position of the ID in the filename
IDend <- 7 # end position of the ID in the filename

# Clean raw data
raw_data <- raw_data %>%
  dplyr::mutate(
    # convert character to date and time, and define time zone for tracking data (GMT)
    date_time = lubridate::ymd_hms(paste(Date, Time), tz = "GMT"),
    ID = stringr::str_sub(file_name, start =  IDstart, end = IDend),
    date = lubridate::date(date_time), # extract date only
    x = Longitude, # for spatial analysis
    y = Latitude, # for spatial analysis
  ) %>%
  dplyr::select(-c(Date, Time, Longitude, Latitude, Altitude, Speed, Course, Type, Distance, Essential))


# You can also have additional columns depending on the type of logger used, for example:
# lc = Argos fix quality
# Lat2/Lon2 = additional location fixes from Argos tag
# laterr/lonerr = location error information provided by some GLS processing package

glimpse(raw_data)

Merge with metadata

Metadata are an essential part of any tracking study, as they contain important information about each data file, such as tag ID, animal ID, or deployment information, that we can add back into our raw data when needed.

Format all dates and times, combine them, and specify the timezone.

# Load csv file
df_metadata <- read_csv("YOUR/FILE/LOCATION/RFB_metadata.csv")

# check what the data contains
glimpse(df_metadata)

# Looks like a few columns need their class corrected. Let's do it.

df_metadata_slim <- df_metadata %>%
  dplyr::mutate(DeploymentDate = lubridate::dmy(DeploymentDate),
                RetrievalDate  = lubridate::dmy(RetrievalDate),
                Deploydatetime = lubridate::parse_date_time(
                  paste(DeploymentDate, DeploymentTime),# make deploy datetime
                  orders = c("dmY HMS", "Ymd HMS"), 
                  tz    = "Indian/Chagos"),
                Retrievedatetime = lubridate::parse_date_time(
                  paste(RetrievalDate, RetrievalTime), # make retrieve datetime
                  orders = c("dmY HMS", "Ymd HMS"),
                  tz    = "Indian/Chagos"),
                TagID          = as.character(TagID),
                ID             = BirdID,
                CPY            = NestLat,
                CPX            = NestLong,
                ) %>%
  dplyr::select(-c("DeploymentDate", "DeploymentTime", "RetrievalDate", "RetrievalTime", "BirdID")) %>%
  dplyr::mutate(across(contains('datetime'), # for chosen datetime column
                       ~with_tz(., tzone = "GMT")) #format to different tz
         )

glimpse(df_metadata_slim)

# Here we’ll create a dataframe of the temporal extents of our data to use in the absence of deploy/retrieve times (this is also useful for basic data checks and for writing up methods).

df_temporalextents <- raw_data %>%
  group_by(ID) %>%
  summarise(min_datetime = min(date_time),
            max_datetime = max(date_time))

# Then we use these temporal extents of our data to fill in any NA’s in the deploy/retrieve times.

df_metadata_slim <- df_metadata_slim %>%
  dplyr::left_join(df_temporalextents, by = "ID") %>%
  dplyr::mutate(Deploydatetime = case_when(!is.na(Deploydatetime) ~ Deploydatetime,
                                      is.na(Deploydatetime) ~ min_datetime),
         Retrievedatetime = case_when(!is.na(Retrievedatetime) ~ Retrievedatetime,
                                      is.na(Retrievedatetime) ~ max_datetime)) %>%
  dplyr::select(-c(min_datetime, max_datetime))

df_meta_merged <- raw_data %>%
  dplyr::left_join(df_metadata_slim, by = "ID") 

Have a look at the raw data using ggplot.

df_meta_merged %>%
  ggplot(aes(x = x, y = y, colour = ID)) +
  # geom_path lets you explore how two variables are related over time
  geom_path() + 
  geom_point() +
  theme_bw()

It looks like there are outlier data points for GV37734 (blue). Before we deal with these, let’s first check for duplicates and missing data.

Duplicates

Check for duplicated rows, as you did in week 1’s R prac.

sum(duplicated(df_meta_merged))  # Returns the number of duplicate rows
[1] 0
# View the duplicate rows themselves (an empty tibble means there are none)
df_meta_merged[duplicated(df_meta_merged), ]

In this dataset of three individuals there are no duplicated rows, which is good! It is always worth running this sanity check.

Missing fixes

A small percentage of randomly missing fixes is typically not a problem, but many consecutive missing points are usually not random, e.g. dense canopy creates habitat-biased missing fixes.

Visualise missing rows with the vis_miss() function from week 1’s R prac.

# Visualise missing data pattern
visdat::vis_miss(df_meta_merged)

Great! There are no missing values in this dataset, probably because the tracked boobies forage over open water with no structures that could interfere with GPS signals.
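If you want to put a rough number on missing fixes, one sketch (assuming a hypothetical nominal sampling interval of 300 seconds; replace this with your tag’s actual schedule) is to compare the number of fixes recorded with the number expected over each individual’s tracking period:

nominal_interval <- 300 # hypothetical scheduled seconds between fixes

df_meta_merged %>%
  group_by(ID) %>%
  summarise(n_fixes       = n(),
            duration_secs = as.numeric(difftime(max(date_time), min(date_time), units = "secs")),
            expected_fixes = duration_secs / nominal_interval + 1,
            pct_missing    = 100 * (1 - n_fixes / expected_fixes))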

Outliers

As you saw in the plot of the raw traces, GV37734 clearly has erroneous GPS fixes. Most GPS units will also include a dilution of precision (DOP) column, which measures the potential loss of precision due to satellite geometry. Lower values are better (1-2 is excellent, whereas >8 is poor to very poor).

If your GPS data have a DOP column, you can remove fixes with DOP values above a user-defined threshold. It is also worth checking the sampling interval and the position error, for example as histograms.

Our example dataset does not have a DOP column, but if you would like to know how to check, run this code, replacing data with the name of your own data object.

# Check the distribution of DOP values
data %>%
  ggplot(aes(x = DOP)) +
  geom_histogram()

# Example filter: keep only fixes with DOP values below 10
data %>%
  filter(DOP < 10)

In our dataset, we don’t have a DOP column, so we need to check for erroneous GPS fixes in a different way.

Note

Check projection

We will work frequently with raster data (spatial covariates) and possibly with vector data (i.e., home ranges). One of the challenges is to ensure that both – tracking data and covariates – have a matching coordinate reference system (CRS).

The CRS defines the reference system that is used to explicitly reference a feature in space. Projected CRS flatten the three-dimensional data onto a two-dimensional plane (and introduce some distortion).

There are two classes of CRS: geographic (e.g., WGS84) and projected (e.g., UTM) CRS.

Which CRS is best to use? It depends on the range of the study species. I usually prefer projected CRS, because their units are meters and not degrees.
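As a quick illustration with a made-up point, the same location is expressed in degrees under a geographic CRS and in metres under a projected CRS:

# A made-up point in lon/lat (EPSG:4326, units are degrees)
pt <- sf::st_sfc(sf::st_point(c(72.4, -7.2)), crs = 4326)

# The same point in Spherical Mercator (EPSG:3857, units are metres)
sf::st_transform(pt, crs = 3857)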

First we need to specify the coordinate projection systems for the tracking data and metadata. The default here is lon/lat for both the tracking data and metadata, for which the EPSG code is 4326. For more information, have a look at the EPSG.io database.

tracking_crs <- 4326 # Only change if data are in a different coordinate system
meta_crs <- 4326 # Only change if data are in a different coordinate system

Next we transform the coordinates of the data and perform spatial calculations. It is good practice to run all spatial analyses in a coordinate reference system that uses metres as its unit.

The default CRS for this workflow is the Spherical Mercator projection (crs = 3857), which is used by Google Maps/OpenStreetMap and works worldwide. However, this CRS can over-estimate distance calculations in some cases, so it’s important to consider the location and scale of your data (e.g. equatorial/polar, local/global scale) and choose a projection system to match. Other options include (but are not limited to) UTM and Lambert Azimuthal Equal-Area (LAEA) projections.

transform_crs <- 3857
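If a local UTM projection suits your data better, one rough way to pick it (a sketch only, using the standard UTM zone formula; EPSG codes 326xx are for the northern hemisphere and 327xx for the southern) is to base the zone on the mean coordinates of your data:

# Pick a local UTM zone from the data's mean longitude and latitude
mean_lon <- mean(df_meta_merged$x, na.rm = TRUE)
mean_lat <- mean(df_meta_merged$y, na.rm = TRUE)
utm_zone <- floor((mean_lon + 180) / 6) + 1
utm_epsg <- ifelse(mean_lat >= 0, 32600, 32700) + utm_zone

# transform_crs <- utm_epsg # uncomment to use a local UTM instead of Spherical Mercator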

Here we’ll calculate some useful movement metrics from the tracking data, including distance between fixes, time between fixes, and net displacement from the first fix.

This will involve transforming the data to a projected coordinate system using the st_transform() and st_as_sf() functions from the sf R package.

We can calculate distance with the st_distance() function, with units depending on the coordinate system.

df_diagnostic <- df_meta_merged %>%
  dplyr::ungroup() %>% #need to ungroup to extract geometry of the whole dataset
  dplyr::mutate(
    geometry_GPS = sf::st_transform( # transform X/Y coordinates
    sf::st_as_sf(., coords = c("x","y"), crs = tracking_crs), #from original format 4326
    crs = transform_crs)$geometry # to the new transform_crs format 3857
    ) %>%
  dplyr::group_by(ID) %>% #back to grouping by ID for calculations per individual
  dplyr::mutate(
    # distance travelled from previous fix, 
    dist = sf::st_distance(geometry_GPS, lag(geometry_GPS), by_element = T), # calculations are done by row
    # time passed since previous fix
    diff_time = difftime(date_time, lag(date_time), units = "secs"), # in seconds
    # dist. between 1st and current location
    netdisp = sf::st_distance(geometry_GPS, geometry_GPS[1], by_element = F)[,1], # dense matrix w/ pairwise distances
    speed = as.numeric(dist)/as.numeric(diff_time), # calculate speed (distance/time)
    dX = as.numeric(x)-lag(as.numeric(x)), #diff. in lon  relative to prev. location
    dY = as.numeric(y)-lag(as.numeric(y)), #diff. in lat relative to prev. location
    turnangle = atan2(dX, dY)*180/pi + (dX < 0)*360) %>% # angle from prev. to current location
  dplyr::ungroup() %>% 
  dplyr::select(-c(geometry_GPS, dX, dY)) # ungroup and remove excess geometries

Add latitude and longitude columns; these are useful for plotting, and lon/lat is the coordinate system used in the Shiny app.

df_diagnostic <- sf::st_coordinates(
  sf::st_transform(
    sf::st_as_sf(df_diagnostic, coords = c("x", "y"), 
                 crs = tracking_crs), 
    crs = 4326)) %>% 
                  as.data.frame() %>% 
                  rename("Lon" = "X", "Lat" = "Y") %>% 
                  cbind(df_diagnostic, .)

# Check the distribution of errors
diff_plot <- df_diagnostic %>%
  ggplot(aes(x = as.numeric(diff_time) / 60)) + # convert from seconds to minutes
  geom_histogram(bins = 100) +
  labs(x = "Difference between points (minutes)", y = "Frequency") + 
  theme_bw() +
  facet_wrap(~ ID)

speed_plot <- df_diagnostic %>%
  ggplot(aes(x = speed)) +
  geom_histogram(bins = 100) +
  labs(x = "Speed (m/s)", y = "Frequency") +
  theme_bw() +
  facet_wrap(~ ID)


netdisp_plot <- df_diagnostic %>%
  ggplot(aes(x = as.vector(netdisp) / 1000)) + # convert m to km
  labs(x = "Net displacement (km)", y = "Frequency") +
  geom_histogram(bins = 100) +
  theme_bw() +
  facet_wrap(~ ID)

cowplot::plot_grid(diff_plot, 
                   netdisp_plot, 
                   speed_plot,
                   ncol = 1, align = "v")
Warning: Removed 3 rows containing non-finite outside the scale range (`stat_bin()`).
Removed 3 rows containing non-finite outside the scale range (`stat_bin()`).

Define threshold

For this section, you can either use the code chunks produced by the Shiny app, or manually define the threshold values yourself.

  • If you’re using the Shiny app, you can copy the code chunks from the bottom of each page to replace the user input section below.

  • If you’re manually defining the threshold values, you can edit each of the variables below as you would for any of the user input sections (the tutorial has suggested some for the RFB dataset).

First we define a cutoff period after tag deployment: all points recorded before the deployment time plus this period will be removed (e.g. to remove potentially unnatural behaviour following the tagging event).

# We define this period using the as.period function, by providing an integer value and time unit (e.g. hours/days/years). This code below specifies a period of 30 minutes:
filter_cutoff <- as.period(30, unit = "minutes") 

Next we define a net displacement (distance from first point) threshold and specify units. Any points further away from the first tracking point will be removed (see commented code for how to retain all points):

# Create net displacement filter using distance and units
filter_netdisp <- units::as_units(300, "km") # e.g., "m", "km"

#If you want to retain points no matter the net displacement value, use this value instead:
#filter_netdisp <- max(df_diagnostic$netdisp)

Filter the data based on your defined thresholds.

df_filtered <- df_diagnostic %>%
  filter(Deploydatetime + filter_cutoff < date_time, # keep times after cutoff
         speed < 20, # keep speeds slower than speed filter (here is 20 m/s)
         netdisp <= filter_netdisp) # keep distances less than net displacement filter (here is 300 km)

Have a look at the filtered data using ggplot and compare with the original raw data.

raw_plot <- df_meta_merged %>%
  ggplot(aes(x = x, y = y, colour = ID)) +
  geom_path() + 
  geom_point() +
  labs(title = "Raw data") +
  theme_bw() +
  theme(legend.position = "bottom")

filtered_plot <- df_filtered %>%
  ggplot(aes(x = x, y = y, colour = ID)) +
  geom_path() + 
  geom_point() +
  labs(title = "Filtered data") +
  theme_bw() +
  theme(legend.position = "bottom")

cowplot::plot_grid(raw_plot, 
                   filtered_plot, 
                   nrow = 1, align = "v")

Plotting

Create a version of the data for plotting by converting the required columns to numeric and creating time-elapsed columns.

df_plotting <- df_filtered %>%
  group_by(ID) %>%
  mutate(diffsecs    = as.numeric(diff_time),
         secs_elapsed = cumsum(replace_na(diffsecs, 0)),
         time_elapsed = as.duration(secs_elapsed),
         days_elapsed = as.numeric(time_elapsed, "days")) %>%
  mutate(across(c(dist, speed, Lat, Lon), as.numeric))

Other Data Types

When cleaning other types of telemetry data, take the following issues into consideration:

Argos satellite tags

  • Data are labelled with a location class rather than DOP.

  • Error distributions have been studied for each location class (often bivariate t-distributions).

  • Often modelled with a state-space model (SSM); a minimal sketch follows this list.
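A minimal sketch of fitting such a model (well beyond this unit, and shown only for illustration), assuming the aniMotum R package is installed and that argos_df is a hypothetical data frame with columns id, date, lc, lon and lat:

# Assumes the aniMotum package (not used elsewhere in this tutorial) is installed
library(aniMotum)

# argos_df is a hypothetical Argos data frame with columns: id, date, lc, lon, lat
fit <- fit_ssm(argos_df,
               vmax = 5,       # plausible maximum speed (m/s) used to pre-filter fixes
               model = "crw",  # continuous-time correlated random walk
               time.step = 1)  # predict locations at regular 1-hour intervals

# Extract the regularised, error-corrected locations
pred <- grab(fit, what = "predicted")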

Acoustic telemetry

Acoustic signals are more easily corrupted than electromagnetic waves, resulting in the wrong tag ID being recorded.

  • Type A false detections are of tag IDs not in the study (easy to discard)

  • Type B false detections are of tag IDs in the study (hard to distinguish from real data)

Most approaches rely on speed filters (using SDR) or summary statistics at coarse temporal scales.

Analysis

You are now ready to analyse movement data. However, movement analyses are beyond the scope of this second-year unit, because they require more sophisticated models to work with messy data: movement data are multidimensional (x, y, time), highly correlated, complex and heterogeneous, follow circular and skewed distributions, and are often missing or irregularly sampled. But they still offer important information.

Here are some of the analyses you can do with movement data:

  • Home range analysis: Understanding animal space use patterns has profound implications across multiple domains of ecology, conservation, and wildlife management. Example applications include conservation and management, disease transmission, human-wildlife conflict, and tourism, hunting and recreation activities. Methods for estimating home range include the simple Minimum Convex Polygon (MCP), Kernel Density Estimation (KDE), Local Convex Hull (LoCoH), and Autocorrelated Kernel Density Estimation (AKDE); the table below compares them, and a minimal MCP sketch follows the table.
Method   Complexity    Data Requirements       Intensity Info        Boundaries   Best For
MCP      Very simple   Low (≥30 points)        None                  Hard edge    Coarse-scale, presence/absence questions
KDE      Moderate      Moderate (≥50 points)   Yes (UD)              Smooth       Core area identification, intensity-based analyses
LoCoH    Moderate      Moderate (≥50 points)   Yes (local density)   Complex      Complex landscapes, hard boundaries
AKDE     High          High (≥100+ points)     Yes (UD)              Smooth       Rigorous statistical inference, intensive GPS data

  • UD = utilization distribution
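As a minimal illustration of the simplest method in the table, the sketch below draws a 100% MCP around each booby’s filtered fixes using sf; for a real analysis you would normally use a dedicated home range package (e.g. adehabitatHR or ctmm), so treat this only as a rough sketch.

mcp_sketch <- df_filtered %>%
  sf::st_as_sf(coords = c("x", "y"), crs = tracking_crs) %>% # lon/lat points
  sf::st_transform(transform_crs) %>%                        # project to metres
  dplyr::group_by(ID) %>%
  dplyr::summarise(geometry = sf::st_combine(geometry)) %>%  # one multipoint per bird
  sf::st_convex_hull() %>%                                   # 100% minimum convex polygon
  dplyr::mutate(area_km2 = as.numeric(sf::st_area(geometry)) / 1e6)

ggplot(mcp_sketch) +
  geom_sf(aes(fill = ID), alpha = 0.3) +
  theme_bw()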
Important

The term home range can be broadly defined as “that area traversed by an individual in its normal activities of food gathering, mating and caring for young” (Burt, 1943). However, the movement data collected can’t explicitly tell us whether this area is an animal’s home range without additional information on its behaviour, such as food gathering, mating and caring for young.

The most conservative term for what these methods (MCP, KDE, etc.) estimate is space use. Consider an animal tracked in one study for a single season, say summer, while another researcher tracked the same animal in winter and found an area hundreds of kilometres away: does that mean the animal has two home ranges? For a migratory species like this, is there a clear home range at all?

The term home range must therefore be used carefully, and inferences from the movement data must be made in the context of the research question and study design (e.g. studies done over one week vs multiple months).

  • Behavioural states: Fine temporal-resolution movement data (often seconds) can reveal behavioural states, such as when an animal is resting or active. Together with accelerometers, such movement data can be used to infer detailed behaviours such as jumping, digging, mating and fighting. Statistical analyses for classifying behaviours include hidden Markov models and self-organising map networks.

Classifying behaviour via raw accelerometer traces and inferring behaviour across space and time. Study from Clemente et al. (2016)
  • Social networks: You can transform spatio-temporal data into edge lists (who is connected to whom) and “gambit-of-the-group” data (who is in which group at what time) to build network models; a minimal edge-list sketch is shown after this list.

  • Disease surveillance: A clear example is the COVID-19 pandemic, where the spread of cases was tracked and future locations of spread were predicted by integrating case data with human population density and movement.
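As a rough sketch of the edge-list idea using the booby data, the code below pairs individuals whose fixes fall in the same hourly window and within a hypothetical 1 km of each other; both thresholds are arbitrary choices for illustration, and the distance calculation assumes the geosphere package is installed.

# Assign each fix to an hourly window
cooccur <- df_plotting %>%
  dplyr::ungroup() %>% # df_plotting is grouped by ID; drop the grouping for the self-join
  dplyr::mutate(window = lubridate::floor_date(date_time, "1 hour")) %>%
  dplyr::select(ID, window, Lon, Lat)

# Pair individuals seen in the same window, keep pairs within 1 km, and count associations
edge_list <- cooccur %>%
  dplyr::inner_join(cooccur, by = "window", suffix = c("_a", "_b"),
                    relationship = "many-to-many") %>% # requires dplyr >= 1.1
  dplyr::filter(ID_a < ID_b) %>% # keep each pair once and drop self-pairs
  dplyr::mutate(dist_m = geosphere::distHaversine(cbind(Lon_a, Lat_a),
                                                  cbind(Lon_b, Lat_b))) %>%
  dplyr::filter(dist_m < 1000) %>% # hypothetical 1 km association threshold
  dplyr::count(ID_a, ID_b, name = "n_associations")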

Additional resource