Statistics Lecture 02, 30/01/2025

Normal Distribution

# standard normal curve: 200 x values between -4 and +4 and their densities
x <- seq(-4, 4, length=200)
y <- dnorm(x, mean=0, sd=1)
plot(x,y, type = "l", lwd = 2, xlim = c(-3.5,3.5))

Dataset samples

#include sample height data in data frame
height<-c(126,178,181,167,166)
gender<-c("female","female","male","female","male")
lat<-c(-43.67327,43.42426,51.79624,0.84228,-49.05650)
long<-c(-114.51627,-95.23205,-43.43058,-91.71106,-23.16819)
data<-data.frame(height,gender,lat,long)

library(lattice)   # histogram() with a conditioning formula (~ y | group) comes from the lattice package
histogram(~height|gender,data=data, main="", col="red")

Plot Geodata
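No code was shown for this section; a minimal sketch using the lat/long columns of the data frame built above (base R graphics only - a mapping package could add coastlines):

# plot the sample locations, coloured by gender
plot(data$long, data$lat, xlab = "Longitude", ylab = "Latitude",
     pch = 19, col = ifelse(data$gender == "male", "blue", "red"))
legend("topright", legend = c("male", "female"), col = c("blue", "red"), pch = 19)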

Statistics Lecture 03, 06/02/2025

Hypothesis Testing

Are two samples similar or different?

Example: Agricultural Land Use

# percentage of land under meadow at ten sample sites on each rock type
lowland_clay_sample <- c(10,20,25,25,30,35,45,50,50,60)
upland_limestone_sample <- c(25,30,30,40,40,50,50,60,60,65)
percentage_meadow <- data.frame(lowland_clay_sample, upland_limestone_sample)

Statistical hypothesis testing gives an objective method for deciding whether an observed difference is due to chance or is genuinely statistically significant.

Six Stages

  1. State null hypothesis
  2. State alternative hypothesis
  3. Decide upon a rejection level (α)
  4. Conduct statistical test (Do the Maths)
  5. Determine whether to accept or reject the null hypothesis
  6. Draw a physical process-based conclusion from the results

1. State null hypothesis

There is no significant difference between meadow percentage in lowland and upland areas. Hₒ: μ(lowland) = μ(upland)

2. State alternative hypothesis

There is a significant difference between meadow percentage in lowland and upland areas. Hₐ: μ(lowland) ≠ μ(upland)

Non-directional or Directional Tests
Non-directional (two-tailed) test: do NOT state which variable is greater in Hₐ.
Directional (one-tailed) test: state which variable is greater in Hₐ, e.g. μ(lowland) > μ(upland) OR μ(lowland) < μ(upland).

3. Decide upon a rejection level

Probability of Hₒ being true = 0.024

Probability of Hₐ being true = 0.976

The purpose of this test is to find out which hypothesis is most likely to be correct. (Conclusions are probabilistic not absolute).

Statistical significance levels
  • Usually scientists use α = 0.05 - the 95% statistical significance level (confidence level).
  • Leaves room for 5% error in our conclusion.
  • Second most commonly used is α = 0.01 - the 99% confidence level.
Types of Error

Type I Error - Should have retained Hₒ

Type II Error - Should have rejected Hₒ

The 95% level is generally the best trade-off between Type I and Type II errors.

(At the 90% level, Type I errors become more likely; at the 99.9% level, Type II errors become more likely.)

4. Conduct Statistical Test

Example: Ag. Land Use Problem: Comparing means between two samples

I.e. Is there a significant difference between the two samples?

The Student’s t test

Assumptions

Before you do the test, have you satisfied the assumptions?

  1. Each sample is taken from a population that is normally distributed

  2. Interval/ratio data, cannot use data in categories

  3. Each observation is independent of the other observations in the same sample

  4. The samples are taken from populations which have similar variances

1. Normality

  1. Plot the data (Plot a histogram of each variable)
# NB. the lattice call below splits the plot because the variable after '|' is treated
# as a conditioning factor; the two samples are separate columns here, so it is
# simpler to plot each column with hist()
# histogram(~lowland_clay_sample|upland_limestone_sample,data=percentage_meadow,main="")
hist(percentage_meadow$lowland_clay_sample)

hist(percentage_meadow$upland_limestone_sample)

  2. Look at skewness values
library(pastecs)   # stat.desc() comes from the pastecs package
stat.desc(percentage_meadow$lowland_clay_sample)
##      nbr.val     nbr.null       nbr.na          min          max        range 
##    10.000000     0.000000     0.000000    10.000000    60.000000    50.000000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##   350.000000    32.500000    35.000000     5.000000    11.310786   250.000000 
##      std.dev     coef.var 
##    15.811388     0.451754
stat.desc(percentage_meadow$upland_limestone_sample)
##      nbr.val     nbr.null       nbr.na          min          max        range 
##   10.0000000    0.0000000    0.0000000   25.0000000   65.0000000   40.0000000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##  450.0000000   45.0000000   45.0000000    4.4721360   10.1166744  200.0000000 
##      std.dev     coef.var 
##   14.1421356    0.3142697
  3. Calculate Z score of skewness statistic
  4. The Kolmogorov-Smirnov Test (K-S Test)
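A minimal R sketch of steps 2-4 above for one of the samples, assuming the common approximation SE(skewness) ≈ √(6/n) for the Z score:

# skewness and its Z score (computed manually, no extra package needed)
lowland <- percentage_meadow$lowland_clay_sample
n <- length(lowland)
skew <- sum((lowland - mean(lowland))^3) / (n * sd(lowland)^3)
z_skew <- skew / sqrt(6 / n)   # |z| < 1.96 suggests no significant skew at the 0.05 level
z_skew
# (stat.desc(lowland, norm = TRUE) from pastecs also reports skewness and skew.2SE)

# Kolmogorov-Smirnov test against a normal distribution with the sample mean and sd
ks.test(lowland, "pnorm", mean(lowland), sd(lowland))

# Shapiro-Wilk test (often preferred for small samples)
shapiro.test(lowland)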

4. Similarity of Variances

The t test -> comparing arithmetic means of two samples

So the two samples need similar variances

Rule: Variances must differ by less than a factor of two

Formal test for this -> Levene’s Test

For the example: the variances are 250 (lowland clay) and 200 (upland limestone), a ratio of 1.25, which is well within the factor-of-two rule.
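A quick check in R (leveneTest() lives in the car package, which is an assumption here - the lecture did not show this code):

# variance ratio - the rule of thumb says this should be less than 2
var(lowland_clay_sample) / var(upland_limestone_sample)   # 250 / 200 = 1.25

# formal tests of equal variances
var.test(lowland_clay_sample, upland_limestone_sample)    # F test (base R)
stacked <- stack(percentage_meadow)                       # one value column, one group column
car::leveneTest(values ~ ind, data = stacked)             # Levene's Test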

Pick which form of test

One- or two-sample test (comparing a sample to a population mean, or two samples to one another)

Independent samples or Paired samples (samples are independent of each other or are paired - measured at the same time)

\[ t = \frac{\text{difference between the means}}{\text{standard error of the difference}} \]

Lowland Clay: Arithmetic Mean = 35%, Standard Deviation = 15.81%, Number of Observations = 10

Upland Limestone: Arithmetic Mean = 45%, Standard Deviation = 14.14%, Number of Observations = 10
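The notes stop before the calculation itself; a minimal sketch of the test in R, assuming equal variances (checked above) and independent samples:

# two-sample Student's t test on the meadow percentages
t.test(percentage_meadow$lowland_clay_sample,
       percentage_meadow$upland_limestone_sample,
       var.equal = TRUE)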

Statistics Lecture 04, 13/02/2025

Analysis of Variance (ANOVA)

ANOVA determines whether the means of several populations are equal.

NB. Student’s t test compares the means of two samples only.

Two forms of ANOVA

  1. One-way ANOVA
     i. Independent samples (THIS LECTURE)
     ii. Paired samples (Repeated Measures ANOVA)
  2. Two-way ANOVA

One-way ANOVA: Independent Samples

  • Each observation is classified into one sample or another based on ONE criterion.

E.g. crop yield per acre obtained on different soil types

# crop yield per acre for five fields on each of three soil types
soil_type <- c('A','A','A','A','A','B','B','B','B','B','C','C','C','C','C')
as.factor(soil_type)   # NB: assign the result (soil_type <- as.factor(soil_type)) if the conversion should persist
##  [1] A A A A A B B B B B C C C C C
## Levels: A B C
crop_yield <- c(24,27,21,22,26,17,25,24,19,28,19,18,22,24,23)
table1 <- data.frame(soil_type, crop_yield)

ANOVA separates the total variance into two components:

  1. Variance between the samples

  2. Variance within the samples

\[ \text{Total variance} = \text{Variance between samples} + \text{Variance within samples} \newline \text{TSS} = \text{SSC} + \text{SSE} \]

Assumptions of ANOVA

  1. The errors (residuals) are normally distributed. This is more likely to be achieved when random samples from normal populations are used. For each sample:
     • Plot a histogram
     • Calculate the skewness value
     • Calculate the Z score of skewness
     • Perform either the K-S test or the Shapiro-Wilk test
  2. The normal populations all have an equal or COMMON VARIANCE.
     • Variance ratio
     • Levene’s Test (an R sketch of both checks follows the note below)

NB. If you cannot satisfy these assumptions, you need to do another test (see end of lecture).
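A minimal sketch of both checks for the soil-type example (the aov() fit mirrors the one run later in the lecture; leveneTest() is in the car package, an assumption here):

# fit the one-way ANOVA so the residuals can be inspected
res <- aov(crop_yield ~ soil_type, data = table1)

# 1. normality of the errors (residuals)
hist(residuals(res))
shapiro.test(residuals(res))

# 2. common variance across the three soil types
tapply(table1$crop_yield, table1$soil_type, var)          # compare the sample variances
car::leveneTest(crop_yield ~ soil_type, data = table1)    # Levene's Test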

Reminder: What is Variance?

\[ \text{Variance} = (\text{Standard Deviation})^2 = σ^2 \]

Calculating the Total Sum of Squares

The Total Sum of Squares (TSS) is the sum of the squared differences between each observation and the overall mean.
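In symbols, with \(X_{ij}\) the \(j\)-th observation in sample \(i\) and \(\bar{X}\) the overall mean:

\[ \text{TSS} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X})^2 \]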

Calculating the Sum of Squares Between the Columns (SSC)

SSC -> The variation between the samples

\[ \text{SSC} = \sum_{i=1}^{k} n_i(\bar{X}_i - \bar{X})^2 \]

If SSC is small, the samples have similar means.

If SSC is large, the samples have different means.

If SSC is zero, the samples have the same mean.

Calculating the Error Sum of Squares (SSE)

SSE -> The variation within the samples

\[ \text{SSE} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 \]

A high SSE means a lot of variation (spread) within the samples.

A low SSE means little variation within the samples - the observations lie close to their own sample means.

Worked Example

\[ \text{TSS} = \text{SSC} + \text{SSE} \newline 153.6 = 19.6 + 134.0 \]

19.6 + 134.0
## [1] 153.6

ANOVA Table

Need to use Degrees of Freedom to calculate the Mean Sum of Squares (between samples: k - 1; within samples: N - k)

Then we can run Snedecor’s Variance Ratio Test (F)

\[ F = \frac{\text{Mean Sum of Squares between samples}}{\text{Mean Sum of Squares within samples}} = \frac{\text{Systematic Differences}}{\text{Random Variations}} \]

F value          Meaning
Less than 1      More variance within samples than between samples
Exactly 1        Variance within samples equals variance between samples
Greater than 1   More variance between samples than within samples
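For the worked example (degrees of freedom: 2 between the samples, 12 within, matching the aov() output below):

\[ F = \frac{19.6 / 2}{134.0 / 12} = \frac{9.80}{11.17} \approx 0.88 \]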

Conclusions

  1. The null hypothesis must be retained.
  2. There is no significant difference in crop yields between the three soil types (F = 0.878, p = 0.441).
  3. The differences observed between the samples are a product of random chance and do not reflect a real difference in crop yield with soil type.

Arranging In R

Values in one column, factor in another (see table1)

print(table1)
##    soil_type crop_yield
## 1          A         24
## 2          A         27
## 3          A         21
## 4          A         22
## 5          A         26
## 6          B         17
## 7          B         25
## 8          B         24
## 9          B         19
## 10         B         28
## 11         C         19
## 12         C         18
## 13         C         22
## 14         C         24
## 15         C         23
# one-way ANOVA: does mean crop yield differ between the soil types?
res<-aov(crop_yield~soil_type, data = table1)
summary(res)
##             Df Sum Sq Mean Sq F value Pr(>F)
## soil_type    2   19.6    9.80   0.878  0.441
## Residuals   12  134.0   11.17
boxplot(crop_yield~soil_type, data = table1)

What if Assumptions are not met?

  1. Transform all of the samples (e.g. log base 10) and then test assumptions again
  2. Use a non-parametric test (e.g. Kruskal-Wallis Test)

Kruskal-Wallis Test

Only use if you CANNOT use a parametric test

kruskal.test(crop_yield~soil_type, data = table1)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  crop_yield by soil_type
## Kruskal-Wallis chi-squared = 1.7639, df = 2, p-value = 0.414

The larger the Kruskal-Wallis statistic, the greater the difference between the samples

Post-hoc Tests

If you have more than two samples, you need to do a post-hoc test to see which samples are different from each other

Tukey Test

TukeyHSD(res)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = crop_yield ~ soil_type, data = table1)
## 
## $soil_type
##     diff       lwr      upr     p adj
## B-A -1.4 -7.038394 4.238394 0.7890417
## C-A -2.8 -8.438394 2.838394 0.4089635
## C-B -1.4 -7.038394 4.238394 0.7890417

Statistics Lecture 05, 20/02/2025

The Chi-Square Test

What if your data are in categories? -> the Chi-Square test (‘Chi’ is pronounced “kai”)

Purpose

  • Assessing the similarity of two or more frequency distributions

Example - Expansion plans for Luton Airport

  • Random sample of 100 people living in the vicinity of the airport
  • Questionnaires usually begin with background questions, e.g. age, sex, and socio-economic status (ask occupation as opposed to salary)
  • Scale from 1 to 5 (1 = strongly agree, 5 = strongly disagree) -> Likert scale


  • Chi-Square is all about cross-tabulation (Comparing two or more categorical variables)

(E.g. Attitudes to Airport expansion cross-tabulated by age group)

Assumptions

  1. A non-parametric test (no assumptions about distribution shape)
  2. The data must be in the form of frequencies (counts) - the test is never applied to continuous numerical data
  3. The samples are mutually independent (each respondent falls into only one category).
  4. No more than 20% of the categories should have expected frequencies of less than five (mainly to ensure sample size is large enough - will come back to later on).
  5. Random and representative sampling

Worked Example

# create data frame
categories <- c("1 person", "2 people", "3 people", "4 or more")
exeter <- c(44, 81, 45, 30)
exmouth <- c(33, 40, 18, 9)
table2 <- data.frame(categories, exeter, exmouth)

Stage 1: Write down hypotheses

Null Hypothesis

The frequency distributions are identical.

Alternative Hypothesis

The frequency distributions are different.

Stage 2: Calculate expected counts

The expected counts assume that the two frequency distributions are identical (H₀ is true).

E.g. 1 person households

# total observed 1-person households across both towns
sum_oneperson <- sum(table2$exeter[1], table2$exmouth[1])
# sample sizes
exeter_size <- sum(table2$exeter)
exmouth_size <- sum(table2$exmouth)
total_size <- exeter_size + exmouth_size
# overall percentage of 1-person households (assuming identical distributions)
expected_oneperson <- (sum_oneperson/total_size) * 100
# expected counts: apply that percentage to each town's sample size
expected_oneperson_exeter <- (expected_oneperson / 100) * exeter_size
expected_oneperson_exmouth <- (expected_oneperson / 100) * exmouth_size

Expected vs observed count
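The same calculation extended to all four categories, reusing the objects defined above:

# expected counts for every household-size category, assuming identical distributions
row_totals <- table2$exeter + table2$exmouth
expected_exeter <- row_totals * exeter_size / total_size
expected_exmouth <- row_totals * exmouth_size / total_size
data.frame(categories,
           observed_exeter = table2$exeter, expected_exeter,
           observed_exmouth = table2$exmouth, expected_exmouth)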

Stage 3: Calculate the Chi-Square Statistic

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

Stage 4: Interpreting the Chi-Square Statistic

Chi-Square < 0 -> You have made a pig’s ear of your arithmetic.

Chi-Square = 0 -> The two frequency distributions are identical and/or there is no association between the two variables.

As Chi-Square becomes larger and more positive -> The two frequency distributions are becoming more dissimilar (different) and/or there is a strong association between the two variables.

Stage 5: Compare the Chi-Square Statistic to the Critical Value

\[ \chi^2 = 5.657 \]

  1. Size of table: larger tables have higher critical values \[ \text{Degrees of freedom} = (\text{rows} - 1) * (\text{columns} - 1) \newline \text{df} = (4 - 1) * (2 - 1) = 3 \newline \text{df} = 3 \]

  2. Rejection level: to reject H₀ at the 5% level with df = 3, the critical value is 7.815

Since 5.657 < 7.815, we retain the null hypothesis here.
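The critical value can be checked in R:

# upper-tail critical value of the chi-square distribution at the 5% level with 3 df
qchisq(0.95, df = 3)   # 7.815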

Stage 6: Draw a conclusion

Here we have retained the null hypothesis.

Therefore, there is no difference between the distributions of household sizes in the two towns.

The distributions are similar.

The differences between the two towns are small and could have been generated by chance alone.

Note: if you reject H₀, remember to state at what level.

Representing the data

  • Clustered bar chart
### produce clustered bar chart with table2 (household sizes on the x-axis, one bar per town)
barplot(t(as.matrix(table2[, c("exeter", "exmouth")])), beside = TRUE,
        names.arg = table2$categories, col = c("blue", "red"),
        legend.text = c("Exeter", "Exmouth"))

What if the 20% rule is violated?

  • The test is only valid when no more than 20% of the cells have EXPECTED COUNTS of less than five.
  • If this rule is violated, you could aggregate data or combine categories together

Chi-Square Test with R

# chisq.test() expects a matrix or table of observed counts (drop the text column)
chisq.test(as.matrix(table2[, c("exeter", "exmouth")]))

The formula is only an approximation (breaks down at low cell counts)

Statistics Lecture 06, 27/02/2025

Measures of Association: Correlation

(A bridge between Lectures 1-5 and the following lectures on linear regression)

  • The degree of association between two variables

  • No implied directionality

  • Correlation does NOT imply causation

Correlation Coefficient

  • A number between -1 and 1
Correlation Coefficient   Meaning
+1                        Perfect positive correlation
 0                        No correlation
-1                        Perfect negative correlation

Always plot a scatter graph before calculating correlation

Pearson’s product-moment correlation coefficient (r)

Assumptions

  1. The two variables are interval or ratio data (not nominal or ordinal)
  2. The relationship between the two variables is linear
  3. Parametric test: the two variables are normally distributed
  4. Observations are independent of each other and in pairs

If the assumptions are not met, use a non-parametric test ⬇️

ASIDE: Spearman’s Rank Correlation Coefficient (rs)

Assumptions

  1. The two variables are ordinal, ratio or interval data
  2. The relationship between the two variables is monotonic (not necessarily linear)
  3. Non-parametric test: the two variables are not normally distributed
  4. Observations are independent of each other and in pairs

But you must always use Pearson where appropriate

Pearson’s (continued)

Hypotheses

  • H₀: There is no correlation between the two variables
  • Hₐ: There is a correlation between the two variables (two-tailed) or
  • Hₐ: There is a positive correlation between the two variables (one-tailed)

Equation

\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \]

Significance testing for r

  • Rare for r = ±1
  • Degrees of freedom is \((n-2)\) not \((n-1)\)
Example

\[ r = 0.8 \newline n = 20 \newline t = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} \newline t = \frac{0.8}{\sqrt{\frac{1-0.8^2}{20-2}}} \newline t = 5.66 \newline \text{Critical value } (df = 18, \alpha = 0.05) = 2.101 \newline \text{so reject H₀ at the 0.05 level} \]
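The same calculation in R (qt() returns the two-tailed critical value for df = n - 2 = 18):

r <- 0.8
n <- 20
t_stat <- r / sqrt((1 - r^2) / (n - 2))
t_stat                   # 5.66
qt(0.975, df = n - 2)    # critical value at alpha = 0.05 (two-tailed), about 2.10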

Writing up your results

There is a negative (inverse) association (relationship) between variable A and variable B, which is not statistically significant at the 95% confidence level (0.05 level) (r = -0.021, t = -0.127, p = 0.899, n = 40).

Correlation in R

# create data frame
age <- c(25, 30, 35, 40, 45, 50, 55, 60, 65, 70)
blood_pressure <- c(122, 137, 140, 152, 165, 178, 181, 196, 202, 213)
table3 <- data.frame(age, blood_pressure)

# plot scatter graph
plot(table3$age, table3$blood_pressure, xlab = "Age", ylab = "Blood Pressure", main = "Scatter plot of Age vs Blood Pressure")

# calculate correlation
res<-cor.test(table3$age, table3$blood_pressure, method = "pearson")
res
## 
##  Pearson's product-moment correlation
## 
## data:  table3$age and table3$blood_pressure
## t = 31.615, df = 8, p-value = 1.09e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9826145 0.9990945
## sample estimates:
##      cor 
## 0.996022

Sample Size Matters!

  • The larger the sample size, the more likely you are to find a significant correlation

  • Even a weak correlation can be statistically significant with a large enough sample size

Coefficient of Determination (R²)

\[ R^2 = r \cdot r \]

The meaning of the R² value

R² varies between 0 and 1, but is commonly expressed as a percentage.

E.g. 0.545 -> 54.5%

54.5% of the variation in \(y\) is ‘statistically explained’ by variations in \(x\).

R² measures the Common Variance between the two variables.

There is no common variance between the two variables if r and R² = 0

Multicollinearity

  • When two or more independent variables are highly correlated with each other
  • Occurs when \(y\) is calculated directly from \(x\).
    • These variables are “autocorrelated”, and the correlation between them is false or spurious.

Closed Number Sets

  • If you have a closed number set, the two variables will always be perfectly (and spuriously) correlated
  • E.g. if \(y + x = 100\%\), then \(y = 100 - x\) and \(r = -1\)
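A quick demonstration with made-up percentages:

# a closed number set: the two variables always sum to 100%
x <- runif(10, 0, 100)
y <- 100 - x
cor(x, y)   # exactly -1: a perfect, and entirely spurious, correlation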

Statistics Lecture 07, 06/03/2025

Linear Regression

E.g. Question: Does a relationship exist between average annual rainfall and altitude?

Three Solutions

1. Scatter Graph

  • Plot the data
  • Look for a pattern

2. Correlation Coefficient

  • Calculate the correlation coefficient
  • Look for a relationship
  • Unexplained variance

3. Regression Analysis

  • Regression analysis produces an equation that uses one (or more) variable(s) to help explain the observed variation in another variable.
  • Unlike a correlation co-efficient, there is DIRECTION of causation, i.e. \(y = f(x)\)
  • NB. Don’t throw away marks by getting this the wrong way around!
  • Line of best fit -> ‘regression line’

Example

Does a relationship exist between house prices and distance from the City Centre?

\(\text{House Price} = f(\text{Distance})\)

\(y = f(x)\)

  • Scatter plot
  • Correlation Coefficient -> if \(R² = 0.8\), 80% of the variation in house prices is explained by distance from the City Centre
  • Regression analysis -> line of best fit

Regression is important for prediction of values

E.g. Where should I build a new supermarket to gain the highest revenue?

Reminder: \(y=mx+c\)

NB. In statistics the equation is written \(y = a + bx\)

\(y\) -> dependent variable (response variable)

\(a\) -> y-intercept (\(c\))

\(b\) -> gradient of the line (\(m\)) \(= \frac{\Delta y}{\Delta x}\)

\(x\) -> independent variable (predictor variable)

(\(a\) and \(b\) -> the regression coefficients)

Two types of regression

Simple -> one predictor variable

Multiple -> two or more predictor variables

(Coursework tip: 3 or 4 predictors in your final model)

Observations in rows; variables in columns

Assumptions (Simple)

1. Normality of errors (residuals)

  • The assumption of normally distributed errors is more likely to be satisfied when the variables are normally distributed in the first place.
  • Plot a histogram (e.g. altitude is positively skewed, rainfall is more normally distributed)
  • Calculate the skewness value / Z score of skewness
  • Perform the K-S test or Shapiro-Wilk test

2. Linearity

  • The relationship between the two variables is linear

3. Outliers

  • Outliers can have a large effect on the regression line (very high or very low value)
  • Z-scores are commonly used to identify outliers (\(Z = \frac{X - \bar{X}}{\sigma}\))

Assumptions (Multiple)

1. Normality of errors (residuals)

2. Linearity

3. Outliers

4. Independence of predictors (no multicollinearity)

The ‘independent’ (predictor) variables should be precisely that - independent of one another.

The predictor variables should NOT be significantly correlated, or show high multicollinearity.

E.g. Supermarket store size and parking spaces are significantly correlated; the second variable is said to be redundant.

Ideally, predictor variables are perfectly uncorrelated. But some tolerance allowed.

Making predictions from regression equations

E.g. Predicting rainfall from altitude

\[ y = a + bx \newline \text{Rainfall} = 18.14 + (0.015 * \text{Altitude}) \newline \text{Altitude} = 2000 \text{ft} \newline \text{Rainfall} = 18.14 + (0.015 * 2000) = 18.14 + 30 = 48.14 \text{in} \]
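The same prediction as a small R sketch, using the coefficients from the equation above:

# rainfall (inches) predicted from altitude (feet)
predict_rainfall <- function(altitude) 18.14 + 0.015 * altitude
predict_rainfall(2000)   # 48.14 inches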

Defining the line of best fit

  • We need an objective (mathematical) method of determining the line of best fit.
  • The most commonly used technique is called the LEAST SQUARES METHOD.

Bi-variate mean centre

  • The means of the \(x\) and \(y\) values are subtracted from each value.

The line is placed in the position that minimises \(\sum (Y_o - Y_p)^2\)

(\(Y_o\) = observed value, \(Y_p\) = predicted value)
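In R the least squares line is fitted with lm(); a minimal sketch with hypothetical altitude/rainfall values (illustrative numbers, not the lecture's data):

# hypothetical data: rainfall (in) at ten sites of varying altitude (ft)
altitude <- c(100, 300, 500, 800, 1000, 1200, 1500, 1800, 2000, 2500)
rainfall <- c(20, 22, 25, 29, 33, 35, 41, 44, 47, 55)
fit <- lm(rainfall ~ altitude)   # least squares: rainfall = a + b * altitude
coef(fit)                        # a (intercept) and b (gradient)
sum(residuals(fit)^2)            # the quantity the least squares method minimises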

How good a fit is a regression model?

(Next lecture -> how to do this mathematically)

A good model:

  • The regression line has a steep gradient
  • The points are close to the regression line
  • High R² value (close to 1)
  • \(y\) is responsive to variations in \(x\)
  • Good predictor variable

Or, a bad model:

  • The regression line is nearly horizontal
  • The points are far from the regression line
  • Low R² value (close to 0)
  • \(y\) is not responsive to variations in \(x\)
  • Poor predictor variable

Statistics Lecture 08, 07/03/2025

Linear Regression (continued)

Assessing goodness of fit (mathematically!)

We want to get R² as close to 1 as possible (if R² = 1, that is a perfect model)

The Common Theory

  • The t test, Analysis of Variance (ANOVA) and the F test are all used to test the significance of the regression model.

\(\text{TSS} = \text{SSC} + \text{SSE}\)

The Total Sum of Squares (TSS)

In regression, the equivalent of TSS is the variability in the dependent (\(y\)) variable.

If TSS = 0, then all the values of \(y\) are the same.

The higher the TSS value, the more variability in \(y\).

\(\text{TSS} = \sum (y - \bar{y})^2\)

The Sum of Squares Between the Samples (SSC)

In regression, the equivalent of SSC is the variability in the \(y\) variable that is explained by the regression line.

If SSC (the regression sum of squares, RSS) = 0, there is no trend in the data.

The higher the SSC value, the stronger the trend in the data.

\(\text{SSC} = \sum (y_p - \bar{y})^2\)

(The closer the regression line is to the mean, the smaller the SSC value)

The Sum of Squares Within the Samples (SSE)

NB. Residual -> the difference between the observed value and the predicted value (error)

In regression, the equivalent of SSE is the variation in the \(y\) variable that is unexplained by the regression line.
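Following the same notation (with \(y_p\) the value predicted by the regression line):

\[ \text{SSE} = \sum (y - y_p)^2 \]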

High SSE means the model is poor at predicting \(y\)

Low SSE means the model is good at predicting \(y\) -> this is what we want!

The goodness of fit

\[ R^2 = \frac{\text{Regression sum of squares}}{\text{Total sum of squares}} \newline R^2 = \frac{\text{RSS}}{\text{TSS}} \newline R^2 = \frac{\text{Variation in } y \text{ that is explained by the regression line}}{\text{Total variation in the } y \text{ variable}} \]

Significance testing

1. The regression model as a whole

ANOVA is used to test the statistical significance of R².

Use ANOVA to find RSS and TSS.

\(R² = \frac{RSS}{TSS}\)

(N.B. R² is called Multiple R-squared in R output)

The regression and residual (error) sum of squares are affected by the number of variables and observations respectively.

So we need to remove these effects by calculating the variance terms.

\[ \text{Variance term} = \frac{\text{Sum of Squares}}{\text{Degrees of Freedom}} \]

Degrees of Freedom for RSS and SSE

\[ \text{df (RSS)} = \text{Number of variables } (x + y) - 1 \newline \text{df (SSE)} = \text{Number of observations} - \text{Number of variables } (x + y) \]

Snedecor’s Variance Ratio (F)

\[ F = \frac{\text{Variance term for Regression sum of squares}}{\text{Variance term for Error sum of squares}} \]

E.g. if the p-value = 0.03374, H₀ can be rejected at the 0.05 level but not at the 0.01 level

The larger the F ratio the better the model

2. The regression coefficients (\(a\) and \(b\))

Are the two regression coefficients (\(a\) and \(b\)) statistically significant?

The t test is used to test the significance of the regression coefficients.

\[ t = \frac{\text{Value of coefficient}}{\text{Standard Error}} \]

Standard error (SE): how far is the coefficient likely to be from its true value?

We want t to be as high (positive or negative) as possible

N.B. R Significance Codes

p < 0.001 -> ‘***’ (99.9% level)
p < 0.01  -> ‘**’  (99% level)
p < 0.05  -> ‘*’   (95% level)
p < 0.1   -> ‘.’   (90% level)
p ≥ 0.1   -> ‘ ’   (not significant)
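All of these quantities appear in one summary() call; a sketch using the age/blood-pressure data frame (table3) from Lecture 06:

fit <- lm(blood_pressure ~ age, data = table3)
summary(fit)
# In the output:
#   Multiple R-squared            -> R² (goodness of fit)
#   F-statistic and its p-value   -> the ANOVA test of the model as a whole
#   Coefficients table            -> Estimate, Std. Error, t value and Pr(>|t|)
#                                    for a and b, with the significance codes above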

Statistics Lecture 09, 13/03/2025

Multiple Linear Regression

Introduction

  • Multiple linear regression is used when you have more than one predictor variable.

N.B. You will need this for your coursework - it explains more of the variance than a simple regression model.

\(y=a+b_1x_1+b_2x_2\)

A good model needs to be “parsimonious” - it needs to be efficient

A high R² value and a small number of \(x\) variables (i.e. highest R² with fewest variables)

R provides Akaike’s Information Criterion (AIC)

N.B. No more than four predictor variables for coursework model

Four Methods

1. Enter method (not very good)

  • Forces all variables into model, but retains the insignificant predictors

2. Forward method

  • Start with most significant variable first (based on p-value, strongest Pearson’s correlation coefficient)
  • Add the next most significant variable etc.
  • Only retains significant predictors

3. Backward method (opposite of Forward)

  • Start with all variables in the model
  • Remove the least significant variable first
  • Remove the next least significant variable etc.
  • Only retains significant predictors

4. Stepwise method

  • A variation of the Forward method
  • Start with the most significant variable (closest to +1 or -1)
  • Add the next most significant variable, the one with the strongest correlation with \(y\) that has not already been explained by the first variable
  • You want an R² value > 0.8
  • Removal test - may remove variables as the computer tries to find the combination which will give the R² value closest to 1
  • Choice of variables is done statistically (\(\alpha = 0.05\))
Disadvantages

  • Correlation does not imply causation (a variable could be included even if there is not a logical relationship between \(x\) and \(y\))
  • Football manager analogy - you don’t want all strikers, you need a balanced team

Examples of Stepwise Regression

Example 1
  • Percentage of wet days during the winter at 127 sites in Devon and Cornwall
  • Predictor variables: altitude, distance from the coast, latitude, longitude
  • Altitude is chosen first, so must be best predictor
  • Then latitude is chosen, but other predictors not significant
  • N.B. R only provides statistics about the final model
  • Altitude explains 23.3%, Latitude increases this to 29.8%, the rest is unexplained
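In R, stepwise selection is usually driven by AIC via step() rather than the p-value rules described above; a sketch assuming a hypothetical data frame wetdays_df with columns wetdays, altitude, dist_coast, latitude and longitude (names are illustrative, not the lecture's file):

# candidate models: intercept-only up to all four predictors
null <- lm(wetdays ~ 1, data = wetdays_df)
full <- lm(wetdays ~ altitude + dist_coast + latitude + longitude, data = wetdays_df)

# stepwise selection in both directions, scored by AIC
best <- step(null, scope = formula(full), direction = "both")

summary(best)   # R only reports statistics for the final model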
Adjusted R² Value
  • R² value is the sample coefficient of determination

  • Adjusted R² value is the population coefficient of determination

  • The adjusted R² value indicates the ‘shrinkage’ (loss of predictive power) when the model is applied to the population

  • A large difference (>5%) between R² and adjusted R² indicates that the model is poor

    • Could be due to a small sample size (overfitting)
    • Could be due to too many predictor variables used in the model
  • You don’t want the model to “collapse” when applied to the population
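For reference, the usual adjustment formula (with \(n\) observations and \(k\) predictor variables):

\[ R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1} \]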

The standard error of the estimate (SEE)
  • A measure of average model error
  • The lower the SEE the better the model
  • N.B. This is called the residual standard error in R
  • Use ANOVA to calculate the SEE

\[ \text{SEE} = \sqrt{\frac{\text{Error Sum of Squares}}{n - 2}} \]

Unstandardised Coefficients (called Estimate in R)

\(t\)-value - the higher the better

Use this to write down the regression equation

\[ y = a \text{ (intercept)} + b_1x_1 + b_2x_2 \newline \text{Rainfall} = 334.378 + (0.02995 \times \text{Altitude}) - (5.438 \times \text{Latitude}) \]

Checks to perform on regression coefficients

  • The coefficients should be statistically significant
  • The coefficient for each variable should be in the expected direction, i.e. the same sign as that variable’s correlation coefficient with \(y\)

Unexpected outcomes in Multiple Regression
  • A variable that has a significant correlation with \(y\) may not be significant in the regression model
    • This particular \(x\) variable is probably correlated with one of the other \(x\) variables that is already in the model
    • The \(x\) variable is said to be “redundant” - only the most significant variable is retained
  • A variable that is not significantly correlated with \(y\) may be significant in the regression model
    • The variable is said to be “confounded” - it is correlated with one of the other \(x\) variables that is already in the model
    • The variable is retained in the model
  • N.B. Remove \(W\) from the model in the rainfall coursework (not NAOI)

Examples of Stepwise Regression (continued)

Example 2
  • Variations in overcrowding (\(y\)) in 37 sites around Birmingham

  • Predictor variables: Percentage unemployment rate, Percentage of part-time workers, Percentage renting from council/housing association, Percentage of terraced housing, Percentage with access to good amenities, Percentage with no car

  • The model is built up stepwise

  • The computer selects % unemployment, % part-time workers, % in social housing

  • % No car is skipped because it is correlated with % unemployment, which is already in the model (it is redundant)

  • R² value is 0.960

\[ y = a + b_1x_1 + b_2x_2 + b_3x_3 \newline \text{Overcrowding} = 10.337 + (2.016 \times \text{Unemployment}) + (0.917 \times \text{Part-time workers}) + (0.109 \times \text{Social housing}) \]

Checks on regression coefficients -> all significant, but not all in the expected direction (to be explained next week)

Visualising Multiple Regression Model

  • The regression line is in a higher dimension
  • The regression plane is the equivalent of the regression line in 3D


Last updated: 2025-03-13