5  Statistical Inference

Week 5 - A refresher of regression, two independent samples t-test and one-way ANOVA

In this workshop, you will review three statistical inference techniques used to analyse simple data sets: simple regression with a single numeric (and preferably continuous) explanatory variable, the two independent samples t-test, and one-way ANOVA. Workshop materials are available in the GitHub repository ECS200.

Today’s task will be a refresher on these three methods.

For each, you will need to understand intercepts, slopes, p-values, and the assumptions of equal variances, independence, and normally distributed residuals.
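
As a quick reminder of what these three analyses look like in R, here is a minimal sketch. It uses data sets that ship with base R (cars, mtcars and PlantGrowth) purely as stand-ins; they are not the workshop data, and the variable choices are only for illustration.

```r
# Simple linear regression: stopping distance against speed (built-in 'cars' data)
fit_reg <- lm(dist ~ speed, data = cars)
coef(fit_reg)      # intercept and slope
confint(fit_reg)   # 95% confidence intervals for both

# Two independent samples t-test: fuel economy by transmission type (built-in 'mtcars')
# var.equal = TRUE requests the pooled-variance test, which assumes equal variances
fit_t <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
fit_t$p.value

# One-way ANOVA: plant weight across three treatment groups (built-in 'PlantGrowth')
fit_aov <- aov(weight ~ group, data = PlantGrowth)
summary(fit_aov)
```

Each fitted object can be passed to the usual diagnostic tools to check the assumptions listed above.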

Background

Why log scale?

In many ecological systems, variables such as island size are strongly skewed (many small islands versus few large islands), and their relationships with other variables are often non-linear, which does not meet the assumptions of linear models (like those used in this prac).

If you try to fit a straight line to species richness against island size, it won't work well because the relationship is curved (left figure). If you log transform island size (island_area_log) and species richness (species_richness_log), the relationship looks more linear (right figure).

By log transforming:

  • It stabilises variance: Ecological data often have variance that increases with the mean (e.g., larger islands have both more species and more variability). Raw data have a “fan-shaped” spread, while log transforming evens out the spread.
  • It reduces the influence of extreme values.
  • It improves interpretability of parameters: On the log-log scale, the slope z becomes a scaling exponent.
  • It often makes data more “normal”: Many statistical methods assume normally distributed residuals. Log transformation can reduce skewness.
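
To make the scaling-exponent point concrete, here is a small simulation sketch. It assumes a power-law species-area relationship, S = c * A^z, with multiplicative noise; the values of c_true, z_true and the sample size are arbitrary choices for illustration and are not taken from the workshop data.

```r
set.seed(1)  # make the simulation reproducible

# Hypothetical parameter values, chosen only for illustration
n      <- 200
c_true <- 5
z_true <- 0.25

# Areas are spread evenly on the log scale, so on the raw scale
# there are many small islands and few large ones
area    <- exp(runif(n, min = 0, max = 10))
species <- c_true * area^z_true * exp(rnorm(n, sd = 0.2))

# On the log-log scale the model is linear: log(S) = log(c) + z * log(A)
fit_log <- lm(log(species) ~ log(area))
coef(fit_log)  # intercept estimates log(c_true); slope estimates z_true
```

Plotting species against area and then log(species) against log(area) shows the curved and the straightened versions of the same relationship.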

When not to log-transform:

  • Your data contain zeros, as count data often do (the log of zero is undefined)
  • The relationship is already linear
  • Interpretation on the original scale is critical
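
The problem with zeros is easy to see in R. The sketch below uses a made-up vector of counts (hypothetical values, not workshop data) and shows a simple check you might run before log transforming a column.

```r
# Hypothetical count data containing a zero (values made up for illustration)
counts <- c(0, 3, 12, 40)

# R returns -Inf for log(0), which will break or distort a later model fit
log(counts)

# Quick check to run before log transforming: are there any zero or negative values?
any(counts <= 0)
```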

If you need to log transform your data, here is how to do it:

# new_data is whatever data object you have
# x_value is the name of the data column you want to log transform

library(tidyverse) # provides %>% and mutate()

new_data <- new_data %>%
  mutate(x_value_log = log(x_value)) # log() is the natural logarithm

Assessment

Task to complete before the end of the workshop.

Island size may be associated with the species richness of herbivores. This simulated data set lets us follow the statistical inference process we would use to investigate this relationship, in a similar manner to Ross et al. (2019), Figure 3, top-right panel.

First, download the island.csv file from the ‘data’ folder on GitHub and load it in R.

# Week 5 workshop - Inference
# Written by Nicholas Wu, 06/11/2024, Murdoch University

# Load packages
library(EcoData)
library(tidyverse)
library(broom)
library(psych)

# Set your working directory
setwd("YOUR-DIRECTORY")

# Load your data
island <- read_csv("island.csv")

Note that island area in island.csv is already on the log scale.

Your tasks are to:

1. Create a scatterplot and check whether the data suggest a linear relationship

2. Create the other visualisations needed to check the model diagnostics

  • Create one figure to look at the assumption of equal variances.
  • Create one figure to look at the assumption that the residuals have a normal distribution
  • Optional: Plot leverage against the standardised residuals and note any points with concerningly high leverage.
  • Decide if simple linear regression is suitable for this data or not

3. Provide a summary of the model and visualise the model

  • Report the p-value in context (you can write this as a comment in the code)
  • Report the slope and confidence interval (you can write this as a comment in the code too)
  • Overlay the linear model on the scatterplot
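
The three tasks above can be sketched as follows. This is one possible workflow, not a model answer: it runs on a simulated stand-in data frame (island_sim, a hypothetical name) so that it is self-contained, and it assumes the columns are called island_area_log and species_richness_log as mentioned in the Background. Check the actual column names in island.csv before adapting it.

```r
library(tidyverse)
library(broom)

set.seed(42)

# Simulated stand-in for island.csv (assumed column names, made-up values)
island_sim <- tibble(
  island_area_log      = runif(50, min = 0, max = 8),
  species_richness_log = 1 + 0.3 * island_area_log + rnorm(50, sd = 0.3)
)

# Task 1: scatterplot to judge whether a straight line is plausible
ggplot(island_sim, aes(x = island_area_log, y = species_richness_log)) +
  geom_point()

# Task 2: fit the model, then draw the base R diagnostic plots
fit <- lm(species_richness_log ~ island_area_log, data = island_sim)
plot(fit, which = 1)  # residuals vs fitted: equal variances
plot(fit, which = 2)  # normal Q-Q: normality of residuals
plot(fit, which = 5)  # standardised residuals vs leverage (optional)

# Task 3: slope, confidence interval and p-value in one table
fit_summary <- tidy(fit, conf.int = TRUE)
fit_summary

# Overlay the fitted line (with its confidence band) on the scatterplot
ggplot(island_sim, aes(x = island_area_log, y = species_richness_log)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
```

The row of fit_summary for island_area_log holds the slope (estimate), its confidence interval (conf.low, conf.high) and the p-value, which you can then report in context as a code comment.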