4  Descriptive Stats

Week 4 - Descriptive stats and plotting data

In this workshop, you will learn how to summarise different types of ecological data using appropriate statistical measures and visualisations. Choosing the right measures of central tendency and spread is critical before conducting any formal hypothesis testing. Workshop materials are available in the GitHub repository ECS200.

Background reading

Before you can decide on the appropriate descriptive statistics, it helps to understand the scale of your variable. The table below outlines appropriate measures for both qualitative and quantitative variables:

Data Scale | Type | Definition | Examples | Central Tendency | Spread/Variation
--- | --- | --- | --- | --- | ---
Nominal | Qualitative | Named categories with no intrinsic order | sex (male/female), species name | Mode | Count, Proportion
Ordinal | Qualitative | Named categories with a natural order | life stage (egg, juvenile, adult) | Median, Mode | Range, IQR
Interval | Quantitative | Equal intervals between values but no true zero | temperature, pH, dates | Mean, Median | SD, SE, CI
Ratio | Quantitative | Equal intervals and a true zero | length, age, body mass | Mean, Median | SD, SE, CV, IQR, CI, variance

When laying out the datasheet for your study, it is important to know what data type each column will have when imported into R. Here is an example datasheet that is colour-coded according to the type of variable: nominal, continuous, ordinal, and binary.

Key Terms used in this workshop:

  • Mean: Average value. Use with symmetric, normally distributed data.
  • Median: Middle value. More robust to outliers and skewed data.
  • Mode: Most frequent value (useful for categories).
  • Standard Deviation (SD): How spread out the data is.
  • Standard Error (SE): SD divided by the square root of n (how precise the mean is).
  • Coefficient of Variation (CV): SD / Mean. Compares spread across variables.
  • Interquartile Range (IQR): Range between 25th and 75th percentiles.
  • Confidence Interval (CI): A range that likely contains the true mean.
  • Variance: The average squared deviation of values from the mean (the square of the SD).
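
As a minimal sketch of how the key terms above map onto R, the block below computes each one in base R for a small made-up vector. The vector `x` and all object names are hypothetical and only serve as an illustration. The confidence interval shown is one common choice (a 95% interval based on the t distribution), not the only way to build one.

```r
# Hypothetical example data: body lengths (mm) of ten individuals
x <- c(12.1, 13.4, 11.8, 14.0, 12.9, 13.1, 12.5, 15.2, 12.9, 13.6)

n        <- length(x)        # sample size
mean_x   <- mean(x)          # mean
median_x <- median(x)        # median
var_x    <- var(x)           # sample variance (divides by n - 1)
sd_x     <- sd(x)            # standard deviation (square root of the variance)
se_x     <- sd_x / sqrt(n)   # standard error of the mean
cv_x     <- sd_x / mean_x    # coefficient of variation (unitless)
iqr_x    <- IQR(x)           # 75th percentile minus 25th percentile

# Base R has no built-in statistical mode, so tabulate the values
# and take the most frequent one (the first one if there are ties)
mode_x <- as.numeric(names(which.max(table(x))))

# 95% confidence interval for the mean, assuming a t distribution
ci_95 <- mean_x + c(-1, 1) * qt(0.975, df = n - 1) * se_x
```

Note that `var()` and `sd()` use the sample formulas (dividing by n - 1), which is usually what you want for field data.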

If you are still unsure about the terminology, the CrashCourse Statistics videos Mean, Median, and Mode: Measures of Central Tendency (11:22 min long) and Measures of Spread (11:46 min long) explain the terms visually.

Descriptive Statistics

Exercise 1 (5 min)

🧪 Complete the following tasks using the excerise_survey_data data frame below

excerise_survey_data <- data.frame(
  site = c("A","A","A","B","B","B","C","C","C"),
  plot = c(1,2,3,1,2,3,1,2,3),
  species_1 = c(12, 7, 15, 4, 9, 6, 20, 18, 25),
  species_2 = c(5, 2, 1, 10, 7, 4, 3, 1, 0),
  species_3 = c(0, 3, 2, 6, 4, 5, 0, 1, 2),
  species_4 = c(8, 6, 3, 1, 4, 2, 10, 7, 5)
)
  • Calculate the mean, standard deviation, and sample size of species 3. You can calculate the sample size with the length() function.
  • What is the mean number of species 3? Does the number make sense for species count data?
  • What is the standard deviation of species 3?
  • What is the sample size of species 3?
Show answer
excerise_survey_data %>%
  summarise(mean = mean(species_3),
            sd   = sd(species_3),
            n    = length(species_3))  

# 1. 2.56
# 2. 2.13
# 3. 9

Part 2 - Figures with ggplot

R really shines when it comes to data visualisation, and with some tweaks you can make eye-catching plots that make it easier for people to understand your science. The ggplot2 package, part of the tidyverse collection of packages, and its many extension packages are great tools for data visualisation.

The gg in ggplot2 stands for grammar of graphics. Writing the code for your graph is like constructing a sentence made up of different parts that logically follow from one another. In a more visual way, it means adding layers that take care of different elements of the plot. Your plotting workflow will therefore be something like creating an empty plot, adding a layer with your data points, then your measure of uncertainty, the axis labels, and so on.
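
The layering idea can be sketched as follows. This is an illustrative example only: the data frame `plot_data` and its column names are made up, and the error bars here show mean plus or minus one SD as an assumed measure of uncertainty.

```r
library(ggplot2)

# Hypothetical summary data: mean and SD of a measurement in two groups
plot_data <- data.frame(
  group = c("Control", "Treatment"),
  mean  = c(4.2, 6.1),
  sd    = c(0.8, 1.1)
)

p <- ggplot(plot_data, aes(x = group, y = mean)) +        # 1. empty plot with mapped axes
  geom_point(size = 3) +                                  # 2. layer with the data points
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd),  # 3. measure of uncertainty
                width = 0.1) +
  labs(x = "Group", y = "Measurement (mean +/- SD)")      # 4. axis labels

p  # display the plot
```

Each `+` adds one layer, so you can build the figure step by step and run the code after every addition to see what changed.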

Independent vs dependent variables

In this example experiment data sheet, we have one independent and three dependent variables.

The other variables in the sheet can’t be classified as independent or dependent, but they do contain data that you will need in order to interpret your independent and dependent variables.

Plot Types

There are many more ways to visualise your data than we can cover in this workshop, so we will focus on a few common ones you will likely use in your report.

Let’s say we are interested in visualising differences in stem length between calcium-treated (W1) and untreated (Reference) watersheds. There are a couple of ways we can do this.

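# Load packages (read_csv(), mutate(), recode() and filter() come from the tidyverse)
library(tidyverse)

# Load the CSV file, then clean it
plant_task_clean_data <- read_csv("https://raw.githubusercontent.com/nicholaswunz/ECS200-Workshop/refs/heads/main/data/plant_calcium_task.csv") %>%
  dplyr::mutate(watershed = recode(watershed, "W2" = "W1"), # relabel W2 as W1
                watershed = as.factor(watershed),           # treat as a categorical variable
                transect  = as.factor(transect)) %>%
  dplyr::filter(stem_dry_mass < 1)                          # keep rows with stem dry mass below 1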
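
As a rough sketch of two common plot types for comparing groups, the example below uses simulated data so that it runs without downloading anything. The data frame `toy` and the column name `stem_length` are assumptions made for illustration and are not taken from the workshop dataset.

```r
library(ggplot2)

# Simulated (hypothetical) stem lengths for two watersheds
set.seed(1)
toy <- data.frame(
  watershed   = rep(c("Reference", "W1"), each = 30),
  stem_length = c(rnorm(30, mean = 80, sd = 10), rnorm(30, mean = 90, sd = 12))
)

# Option A: boxplot (median and IQR) with the raw points on top
p_box <- ggplot(toy, aes(x = watershed, y = stem_length)) +
  geom_boxplot(outlier.shape = NA) +      # hide outliers, the jittered points show them
  geom_jitter(width = 0.1, alpha = 0.5)

# Option B: overlaid histograms to compare the shape of each distribution
p_hist <- ggplot(toy, aes(x = stem_length, fill = watershed)) +
  geom_histogram(bins = 15, alpha = 0.6, position = "identity")

p_box
p_hist
```

A boxplot is a compact way to compare groups, while a histogram is more useful when you want to judge the shape of the distribution.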

Tweaking plots

To make your figures publication ready, here is some advice on changing the labels, adding colour, changing the shape and size of points, and combining multiple plots.
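
The tweaks listed above could look something like the sketch below. The data are simulated, the column names (`watershed`, `year`, `mass`) and the colours are arbitrary choices for illustration, and `facet_wrap()` is used here as one simple way to show multiple panels.

```r
library(ggplot2)

# Simulated (hypothetical) data: dry mass in two watersheds over two years
set.seed(2)
toy <- data.frame(
  watershed = rep(c("Reference", "W1"), each = 20),
  year      = factor(rep(c(2003, 2004), times = 20)),
  mass      = runif(40, min = 5, max = 50)
)

p <- ggplot(toy, aes(x = watershed, y = mass, colour = watershed)) +
  geom_jitter(width = 0.1, size = 2.5, shape = 17) +           # change point size and shape
  scale_colour_manual(values = c("Reference" = "grey40",       # choose your own colours
                                 "W1"        = "darkorange")) +
  labs(x = "Watershed", y = "Dry mass", colour = "Watershed") + # change the labels
  facet_wrap(~ year) +                                          # one panel per year
  theme_classic()                                               # cleaner background

p
```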

Assessment

Task to complete before the end of the workshop.

Load your R file from week 2 that contains your plant_task_data_clean. Under the header “Week 4 workshop - Descriptive statistics”, complete the following tasks:

1. Check the distribution of the total dry mass. What kind of distribution is this data?

  • You do not need to make the plot pretty here.
# Week 4 workshop - Descriptive stats
# Written by Nicholas Wu, 06/11/2024, Murdoch University

# Load packages
library(tidyverse)
library(visdat)

# Set your working directory
setwd("YOUR-DIRECTORY")

# Load your data
plant_task_data_clean <- read_csv("plant_task_data_clean.csv")

# If you are not confident with your original plant_task_data_clean file, here is one with all the errors fixed
plant_task_data_clean <- read_csv("plant_calcium_task.csv") %>%
  dplyr::mutate(watershed = recode(watershed, "W2" = "W1"),
                watershed = as.factor(watershed),
                transect  = as.factor(transect),
                total_mass_g = (leaf_dry_mass + stem_dry_mass) * 1000) %>% # calculates total dry mass 
  dplyr::filter(stem_dry_mass < 1)

2. Calculate the central tendency, spread, and sample size of total_mass_g by watershed and year, based on your interpretation of the distribution

  • Decide on calculations of central tendency and spread you will apply based on the distribution observed.
  • What are the numbers and sample size for year 2003 W1?
  • Are they similar to year 2004 W1? If not, what’s the difference?
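
A grouped summary of this kind can be sketched as below. The data frame `toy` is made up, and the median and IQR are used only because the toy values are skewed. Choose the measures that suit the distribution you actually observe in your own data.

```r
library(dplyr)

# Hypothetical data with the same grouping columns as the task
toy <- data.frame(
  watershed    = rep(c("Reference", "W1"), each = 6),
  year         = rep(c(2003, 2004), times = 6),
  total_mass_g = c(10, 12, 15, 11, 40, 13, 20, 22, 25, 21, 90, 24)
)

toy_summary <- toy %>%
  group_by(watershed, year) %>%               # one row per watershed-year combination
  summarise(median = median(total_mass_g),    # central tendency (robust to skew)
            iqr    = IQR(total_mass_g),       # spread
            n      = n(),                     # sample size per group
            .groups = "drop")

toy_summary
```

Inside `summarise()`, `n()` counts the rows in each group, which does the same job as `length()` on a column.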

3. Plot the central tendency and spread with the raw data in a figure by watershed and year

  • Decide on which figure to create based on the workshop.
  • Make a publication ready figure (include colour).

Try to create a figure similar to option 1 or option 2 from your plant_task_data_clean.csv dataset.
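
One possible way to draw a summary on top of the raw data is `stat_summary()`, sketched below on made-up data. The data frame `toy` is hypothetical, and the median with the 25th and 75th percentiles is just one choice of central tendency and spread.

```r
library(ggplot2)

# Hypothetical data: total dry mass in two watersheds
toy <- data.frame(
  watershed    = rep(c("Reference", "W1"), each = 8),
  total_mass_g = c(10, 12, 15, 11, 40, 13, 14, 12,
                   20, 22, 25, 21, 90, 24, 23, 26)
)

p <- ggplot(toy, aes(x = watershed, y = total_mass_g, colour = watershed)) +
  geom_jitter(width = 0.1, alpha = 0.5) +                     # raw data
  stat_summary(fun     = median,                              # central tendency
               fun.min = function(y) quantile(y, 0.25),       # lower end of the spread
               fun.max = function(y) quantile(y, 0.75),       # upper end of the spread
               geom    = "pointrange", colour = "black") +
  labs(x = "Watershed", y = "Total dry mass", colour = "Watershed") +
  theme_classic()

p
```

Adding `facet_wrap(~ year)` to a plot like this would split it into one panel per year if your data contain a year column.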

For the adventurous folks

We will use a new dataset provided by the lterdatasampler package, called pie_crab. The pie_crab dataset contains body size data of the Atlantic marsh fiddler crab (Minuca pugnax) from salt marshes along the eastern coast of the United States. It can be used to test Bergmann’s rule, which predicts that organisms at higher latitudes are larger than those at lower latitudes.

13 marshes were sampled on the Atlantic coast of the United States in summer 2016, spanning > 12 degrees of latitude, from northeast Florida to northeast Massachusetts. Between 25 and 37 adult male fiddler crabs were collected, and their sizes recorded, from each marsh between 2016-07-24 and 2016-08-13.

Here is the metadata of the pie_crab data:

  • date: Date of collection
  • latitude: Latitude of the collection site in degrees
  • site: The site ID
  • size: Mean carapace width (mm)
  • air_temp: The mean air temperature of the day
  • air_temp_sd: The standard deviation of the measured air temperature. Also an indicator of seasonality
  • water_temp
  • water_temp_sd
  • name: Names of the sampled marshes

#install.packages("lterdatasampler")

# Load R package
library(lterdatasampler)

# Inspect 'pie_crab', which becomes available once the package is loaded
glimpse(pie_crab)
Rows: 392
Columns: 9
$ date          <date> 2016-07-24, 2016-07-24, 2016-07-24, 2016-07-24, 2016-07…
$ latitude      <dbl> 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, …
$ site          <chr> "GTM", "GTM", "GTM", "GTM", "GTM", "GTM", "GTM", "GTM", …
$ size          <dbl> 12.43, 14.18, 14.52, 12.94, 12.45, 12.99, 10.32, 11.19, …
$ air_temp      <dbl> 21.792, 21.792, 21.792, 21.792, 21.792, 21.792, 21.792, …
$ air_temp_sd   <dbl> 6.391, 6.391, 6.391, 6.391, 6.391, 6.391, 6.391, 6.391, …
$ water_temp    <dbl> 24.502, 24.502, 24.502, 24.502, 24.502, 24.502, 24.502, …
$ water_temp_sd <dbl> 6.121, 6.121, 6.121, 6.121, 6.121, 6.121, 6.121, 6.121, …
$ name          <chr> "Guana Tolomoto Matanzas NERR", "Guana Tolomoto Matanzas…

1. Do a data quality check

  • Any missing data?
  • Are there duplicates?
  • Check the data structure and types and correct any mistakes
  • Check the distribution of carapace size. Are there outliers?
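
The checks listed above can be sketched in base R as follows. The data frame `toy` is made up and deliberately contains one missing value, one duplicated row, and one extreme value. The 1.5 x IQR rule used by `boxplot.stats()` is only one convention for flagging possible outliers.

```r
# Hypothetical data with deliberate problems
toy <- data.frame(
  site = c("A", "A", "B", "B", "B", "C"),
  size = c(12.4, 12.4, 15.1, NA, 14.8, 40.0)
)

# Missing data: number of NA values in each column
n_missing <- colSums(is.na(toy))

# Duplicates: number of rows that repeat an earlier row exactly
n_dupes <- sum(duplicated(toy))

# Structure and types: check each column, then correct where needed
str(toy)
toy$site <- as.factor(toy$site)   # site is categorical, so store it as a factor

# Possible outliers: values beyond 1.5 x IQR from the box (NA values are ignored)
outliers <- boxplot.stats(toy$size)$out

n_missing
n_dupes
outliers
```

A flagged value is not automatically an error, so check it against your field notes before removing anything.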

2. Calculate the central tendency, spread, and sample size of carapace size (size) by site and latitude

  • Decide on calculations of central tendency and spread you will apply based on the distribution observed.
  • Which site has the highest average carapace size?
  • Which site has the lowest sample size?

3. Plot the relationship between latitude and mean carapace width (plus raw carapace width) to test Bergmann’s rule in fiddler crabs

  • Make a publication ready figure (include colour).
  • What relationship can you see?
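
A plot that combines raw values with group means might be built along these lines. The data below are simulated and are not the pie_crab values. The data frame `toy_crab`, the latitudes, and the built-in trend are assumptions made purely so the example runs on its own.

```r
library(ggplot2)

# Simulated (hypothetical) data: 10 crabs at each of four latitudes
set.seed(42)
toy_crab <- data.frame(
  latitude = rep(c(30, 34, 38, 42), each = 10),
  size     = rnorm(40, mean = rep(c(12, 14, 16, 18), each = 10), sd = 2)
)

# Mean size at each latitude
site_means <- aggregate(size ~ latitude, data = toy_crab, FUN = mean)

p <- ggplot(toy_crab, aes(x = latitude, y = size)) +
  geom_point(alpha = 0.3, colour = "steelblue") +                # raw carapace widths
  geom_point(data = site_means, size = 3, colour = "darkblue") + # mean per latitude
  labs(x = "Latitude (degrees)", y = "Carapace width (mm)") +
  theme_classic()

p
```

Passing a second data frame to a layer with `data =` lets the raw values and the summary share the same axes.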

Extra Stuff

Beautify figures

If you want to go beyond what we learnt and further enhance your figure production skills, here is a link that provides a tutorial to convert default ggplots to publication ready plots including this one on histograms.

Additional resource