x <- seq(-4, 4, length=200)
y <- dnorm(x, mean=0, sd=1)
plot(x,y, type = "l", lwd = 2, xlim = c(-3.5,3.5))
# include sample height data in a data frame
height<-c(126,178,181,167,166)
gender<-c("female","female","male","female","male")
lat<-c(-43.67327,43.42426,51.79624,0.84228,-49.05650)
long<-c(-114.51627,-95.23205,-43.43058,-91.71106,-23.16819)
data<-data.frame(height,gender,lat,long)
library(lattice)  # histogram() is from the lattice package
histogram(~height|gender, data=data, main="", col="red")
Are two samples similar or different?
lowland_clay_sample <- c(10,20,25,25,30,35,45,50,50,60)
upland_limestone_sample <- c(25,30,30,40,40,50,50,60,60,65)
percentage_meadow <- data.frame(lowland_clay_sample, upland_limestone_sample)
Statistical hypothesis testing gives an objective way to decide whether an observed difference is due to chance or is genuinely statistically significant
There is no significant difference between meadow percentage in lowland and upland areas. Hₒ: μ(lowland) = μ(upland)
There is a significant difference between meadow percentage in lowland and upland areas. Hₐ: μ(lowland) != μ(upland)
Non-directional | Directional |
---|---|
Do NOT state which variable is greater in Hₐ | State which variable is greater in Hₐ: μ(lowland) > μ(upland) OR μ(lowland) < μ(upland) |
Two-tailed test | One-tailed test |
The test returns a p-value, e.g. p = 0.024: the probability of observing a difference at least this large if H₀ were true.
A small p-value (here 0.024 < 0.05) is therefore evidence against H₀ and in favour of Hₐ.
The purpose of this test is to find out which hypothesis is better supported by the data. (Conclusions are probabilistic, not absolute.)
Type I Error - H₀ was rejected when it should have been retained
Type II Error - H₀ was retained when it should have been rejected
The 95% level is the usual compromise in the trade-off between Type I and Type II errors
(at the 90% level a Type I error is more likely; at the 99.9% level a Type II error is more likely)
Example: Ag. Land Use Problem: Comparing means between two samples
I.e. Is there a significant difference between the two samples?
Before you do the test, have you satisfied the assumptions?
Each sample is taken from a population that is normally distributed
Interval/ratio data (categorical data cannot be used)
Each observation is independent of the other observations in the same sample
The samples are taken from populations which have similar variances
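A rough check of these assumptions for the meadow data (a minimal sketch; the formal variance test, Levene's Test, is introduced below):
shapiro.test(percentage_meadow$lowland_clay_sample)      # normality of sample 1
shapiro.test(percentage_meadow$upland_limestone_sample)  # normality of sample 2
var(percentage_meadow$lowland_clay_sample) /
  var(percentage_meadow$upland_limestone_sample)         # variance ratio: 250 / 200 = 1.25, i.e. less than 2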
# In lattice, the variable after | is a conditioning variable, so the call below conditions one
# numeric sample on the other and splits the plot into many panels. To compare the two samples with
# histogram(), the data would need a single value column plus a grouping factor (long format).
# histogram(~lowland_clay_sample|upland_limestone_sample,data=percentage_meadow,main="")
hist(percentage_meadow$lowland_clay_sample)
hist(percentage_meadow$upland_limestone_sample)
library(pastecs)  # stat.desc() is from the pastecs package
stat.desc(percentage_meadow$lowland_clay_sample)
## nbr.val nbr.null nbr.na min max range
## 10.000000 0.000000 0.000000 10.000000 60.000000 50.000000
## sum median mean SE.mean CI.mean.0.95 var
## 350.000000 32.500000 35.000000 5.000000 11.310786 250.000000
## std.dev coef.var
## 15.811388 0.451754
stat.desc(percentage_meadow$upland_limestone_sample)
## nbr.val nbr.null nbr.na min max range
## 10.0000000 0.0000000 0.0000000 25.0000000 65.0000000 40.0000000
## sum median mean SE.mean CI.mean.0.95 var
## 450.0000000 45.0000000 45.0000000 4.4721360 10.1166744 200.0000000
## std.dev coef.var
## 14.1421356 0.3142697
The t test -> comparing arithmetic means of two samples
So the samples need (roughly) equal variances
Rule: Variances must differ by less than a factor of two
Formal test for this -> Levene’s Test
For the example:
One- or two-sample test (comparison to a population mean, or to one another)
Independent samples or Paired samples (samples are independent of each other or are paired - measured at the same time)
\[ t = \frac{\text{difference between the means}}{\text{standard error of the difference}} \]
Lowland Clay: Arithmetic Mean = 35%, Standard Deviation = 15.81%, Number of Observations = 10
Upland Limestone: Arithmetic Mean = 45%, Standard Deviation = 14.14%, Number of Observations = 10
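The notes give the summary statistics but not the R call itself; a minimal sketch of the corresponding independent two-sample t test (with var.equal = TRUE, since the variances differ by less than a factor of two) might look like this:
# Independent two-sample t test on the meadow data (sketch)
# A formal check of equal variances could use car::leveneTest(), which needs the data in long format
t.test(percentage_meadow$lowland_clay_sample,
       percentage_meadow$upland_limestone_sample,
       var.equal = TRUE)   # two-tailed by default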
ANOVA determines whether the means of several populations are equal.
NB. Student’s t test compares the means of two samples only.
E.g. crop yield per acre obtained on different soil types
soil_type <- c('A','A','A','A','A','B','B','B','B','B','C','C','C','C','C')
soil_type <- as.factor(soil_type)  # store as a factor (aov() would also convert it automatically)
soil_type
## [1] A A A A A B B B B B C C C C C
## Levels: A B C
crop_yield <- c(24,27,21,22,26,17,25,24,19,28,19,18,22,24,23)
table1 <- data.frame(soil_type, crop_yield)
ANOVA separates the total variance into two components:
Variance between the samples
Variance within the samples
\[ \text{Total variance} = \text{Variance between samples} + \text{Variance within samples} \newline \text{TSS} = \text{SSC} + \text{SSE} \]
NB. If you cannot satisfy these assumptions, you need to do another test (see end of lecture).
\[ \text{Variance} = (\text{Standard Deviation})^2 = σ^2 \]
The Total Sum of Squares (TSS) is the sum of the squared differences between each observation and the overall mean.
SSC -> The variation between the samples
\[ \text{SSC} = \sum_{i=1}^{k} n_i(\bar{X}_i - \bar{X})^2 \]
If SSC is small, the samples have similar means.
If SSC is large, the samples have different means.
If SSC is zero, the samples have the same mean.
SSE -> The variation within the samples
\[ \text{SSE} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 \]
Low SSE means the observations within each sample lie close to their sample mean (little spread within the samples).
High SSE means a lot of variation (spread) within the samples.
\[ \text{TSS} = \text{SSC} + \text{SSE} \newline 153.6 = 19.6 + 134.0 \]
19.6 + 134.0
## [1] 153.6
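A sketch of the same decomposition computed by hand from table1 (defined above), to show where 19.6 and 134.0 come from:
# ANOVA decomposition by hand (sketch; uses table1 from above)
grand_mean  <- mean(table1$crop_yield)
group_means <- tapply(table1$crop_yield, table1$soil_type, mean)
group_sizes <- tapply(table1$crop_yield, table1$soil_type, length)
TSS <- sum((table1$crop_yield - grand_mean)^2)                      # total: 153.6
SSC <- sum(group_sizes * (group_means - grand_mean)^2)              # between samples: 19.6
SSE <- sum((table1$crop_yield - group_means[table1$soil_type])^2)   # within samples: 134.0
c(TSS = TSS, SSC = SSC, SSE = SSE)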
Need to use Degrees of Freedom to calculate the Mean Sum of Squares
Then we can run Snedecor’s Variance Ratio Test (F)
\[ F = \frac{\text{Mean Sum of Squares between samples}}{\text{Mean Sum of Squares within samples}} = \frac{\text{Systematic Differences}}{\text{Random Variations}} \]
F value | Meaning |
---|---|
Less than 1 | More variance within samples than between samples |
1 | Variance within samples equals variance between samples |
Greater than 1 | More variance between samples than within samples |
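Using the sums of squares quoted above, the mean squares and F ratio can be checked by hand (a quick sketch; the aov() output below reports the same values):
SSC <- 19.6
SSE <- 134.0
df_between <- 3 - 1     # k - 1 soil types
df_within  <- 15 - 3    # n - k observations
(SSC / df_between) / (SSE / df_within)   # F = 9.8 / 11.17 = 0.878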
Values in one column, factor in another (see table1)
print(table1)
## soil_type crop_yield
## 1 A 24
## 2 A 27
## 3 A 21
## 4 A 22
## 5 A 26
## 6 B 17
## 7 B 25
## 8 B 24
## 9 B 19
## 10 B 28
## 11 C 19
## 12 C 18
## 13 C 22
## 14 C 24
## 15 C 23
res<-aov(crop_yield~soil_type, data = table1)
summary(res)
## Df Sum Sq Mean Sq F value Pr(>F)
## soil_type 2 19.6 9.80 0.878 0.441
## Residuals 12 134.0 11.17
boxplot(crop_yield~soil_type, data = table1)
The Kruskal-Wallis test: only use it if you CANNOT use a parametric test (ANOVA)
kruskal.test(crop_yield~soil_type, data = table1)
##
## Kruskal-Wallis rank sum test
##
## data: crop_yield by soil_type
## Kruskal-Wallis chi-squared = 1.7639, df = 2, p-value = 0.414
The bigger the Kruskal-Wallis chi-squared value, the bigger the difference between the samples
If you have more than two samples, you need to do a post-hoc test to see which samples are different from each other
TukeyHSD(res)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = crop_yield ~ soil_type, data = table1)
##
## $soil_type
## diff lwr upr p adj
## B-A -1.4 -7.038394 4.238394 0.7890417
## C-A -2.8 -8.438394 2.838394 0.4089635
## C-B -1.4 -7.038394 4.238394 0.7890417
What if your data is categorized? -> the Chi-square test (“Chi” is pronounced “kai”)
(E.g. Attitudes to Airport expansion cross-tabulated by age group)
# create data frame
categories <- c("1 person", "2 people", "3 people", "4 or more")
exeter <- c(44, 81, 45, 30)
exmouth <- c(33, 40, 18, 9)
table2 <- data.frame(categories, exeter, exmouth)
H₀: The frequency distributions are identical.
Hₐ: The frequency distributions are different.
The expected counts assume that the two frequency distributions are identical (i.e. that H₀ is true).
E.g. 1-person households
# observed 1-person households across both towns
sum_oneperson <- sum(table2$exeter[1], table2$exmouth[1])
# total households surveyed in each town
exeter_size <- sum(table2$exeter)
exmouth_size <- sum(table2$exmouth)
total_size <- exeter_size + exmouth_size
# percentage of all households that are 1-person
expected_oneperson <- (sum_oneperson/total_size) * 100
# expected counts if both towns followed the same overall distribution
expected_oneperson_exeter <- (expected_oneperson / 100) * exeter_size
expected_oneperson_exmouth <- (expected_oneperson / 100) * exmouth_size
Compare the expected and observed counts:
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
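A sketch computing the full table of expected counts and the chi-square statistic from table2, cross-checked with R's built-in chisq.test() (any small difference from the value quoted below is rounding):
# Chi-square by hand and via chisq.test() (sketch; uses table2 from above)
observed <- as.matrix(table2[, c("exeter", "exmouth")])
rownames(observed) <- table2$categories
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
sum((observed - expected)^2 / expected)   # chi-square statistic
chisq.test(observed)                      # same statistic, plus df and p-value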
Chi-Square < 0 -> You have made a pig’s ear of your arithmetic.
Chi-Square = 0 -> The two frequency distributions are identical and/or there is no association between the two variables.
As Chi-Square becomes larger and more positive -> The two frequency distributions are becoming more dissimilar (different) and/or there is a strong association between the two variables.
\[ \chi^2 = 5.657 \]
Size of table: larger tables have higher critical values \[ \text{Degrees of freedom} = (\text{rows} - 1) * (\text{columns} - 1) \newline \text{df} = (4 - 1) * (2 - 1) = 3 \newline \text{df} = 3 \]
Rejection level: to reject H₀ at the 5% level, the critical value is 7.815
Here χ² = 5.657 < 7.815, so we retain the null hypothesis.
Therefore, there is no significant difference between the distributions of household sizes in the two towns.
The distributions are similar.
The differences between the two towns are small and could have been generated by chance alone.
Note: Remember to mention at what level if you are rejecting H₀
# produce a clustered bar chart from table2 (towns side by side within each household-size category)
barplot(t(as.matrix(table2[, 2:3])), beside = TRUE, col = c("blue", "red"),
        names.arg = table2$categories, legend.text = c("Exeter", "Exmouth"))
The formula is only an approximation (breaks down at low cell counts)
(A bridge between Lectures 1-5 and the following lectures on linear regression)
The degree of association between two variables
No implied directionality
Correlation does NOT imply causation
Correlation Coefficient | Meaning |
---|---|
+1 | Perfect positive correlation |
0 | No correlation |
-1 | Perfect negative correlation |
Always plot a scatter graph before calculating correlation
If the assumptions are not met, use a non-parametric test ⬇️
But use Pearson's test whenever its assumptions are met
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \]
\[ r = 0.8 \newline n = 20 \newline t = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} \newline t = \frac{0.8}{\sqrt{\frac{1-0.8^2}{20-2}}} \newline t = 5.66 \newline \text{Critical value} \ (df = n - 2 = 18) = 2.101 \newline \text{if $\alpha$ = 0.05, reject H₀} \]
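The same calculation in R (a minimal sketch; qt() returns the two-tailed critical value for n - 2 = 18 degrees of freedom):
r <- 0.8
n <- 20
t_stat <- r / sqrt((1 - r^2) / (n - 2))   # 5.66
t_crit <- qt(0.975, df = n - 2)           # 2.10 at the 0.05 level, two-tailed
t_stat > t_crit                           # TRUE, so reject H0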
There is a negative (inverse) association (relationship) between variable A and variable B, which is not statistically significant at the 95% confidence level (0.05 level) (r = -0.021, t = -0.127, p = 0.899, n = 40).
# create data frame
age <- c(25, 30, 35, 40, 45, 50, 55, 60, 65, 70)
blood_pressure <- c(122, 137, 140, 152, 165, 178, 181, 196, 202, 213)
table3 <- data.frame(age, blood_pressure)
# plot scatter graph
plot(table3$age, table3$blood_pressure, xlab = "Age", ylab = "Blood Pressure", main = "Scatter plot of Age vs Blood Pressure")
# calculate correlation
res<-cor.test(table3$age, table3$blood_pressure, method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: table3$age and table3$blood_pressure
## t = 31.615, df = 8, p-value = 1.09e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9826145 0.9990945
## sample estimates:
## cor
## 0.996022
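If Pearson's assumptions are not met, Spearman's rank correlation is the usual non-parametric alternative; a minimal sketch, reusing table3:
# Spearman's rank correlation: rank-based, so no normality assumption
cor.test(table3$age, table3$blood_pressure, method = "spearman")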
The larger the sample size, the more likely you are to find a significant correlation
Even a weak correlation can be statistically significant with a large enough sample size
\[ R^2 = r \cdot r \]
R² varies between 0 and 1, but is commonly expressed as a percentage.
E.g. 0.545 -> 54.5%
54.5% of the variation in \(y\) is ‘statistically explained’ by variations in \(x\).
R² measures the common variance between the two variables.
There is no common variance between the two variables if r and R² are 0
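A quick sketch of R² for the age/blood-pressure example above:
r <- cor(table3$age, table3$blood_pressure)   # 0.996, as in the cor.test() output above
r^2                                           # ~0.992, i.e. about 99.2% common variance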
E.g. Question: Does a relationship exist between average annual rainfall and altitude?
Does a relationship exist between house prices and distance from the City Centre?
\(\text{House Price} = f(\text{Distance})\)
\(y = f(x)\)
Regression is important for prediction of values
E.g. Where should I build a new supermarket to gain the highest revenue?
NB. In statistics the equation of the line is written \(y = a + bx\) (rather than \(y = mx + c\))
\(y\) -> dependent variable (response variable)
\(a\) -> y-intercept (\(c\))
\(b\) -> gradient of the line (\(m\)) \(= \frac{\Delta y}{\Delta x}\)
\(x\) -> independent variable (predictor variable)
(\(a\) and \(b\) are the regression coefficients)
Simple -> one predictor variable
Multiple -> two or more predictor variables
(Coursework tip: 3 or 4 predictors in your final model)
Observations in rows; variables in columns
The ‘independent’ (predictor) variables should be precisely that - independent of one another.
The predictor variables should NOT be significantly correlated, or show high multicollinearity.
E.g. Supermarket store size and parking spaces are significantly correlated; the second variable is said to be redundant.
Ideally, the predictor variables are perfectly uncorrelated, but some tolerance is allowed.
\[ y = a + bx \newline \text{Rainfall} = 18.14 + (0.015 * \text{Altitude}) \newline \text{Altitude} = 2000 \text{ft} \newline \text{Rainfall} = 18.14 + (0.015 * 2000) = 18.14 + 30 = 48.14 \text{in} \]
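A sketch wrapping the fitted equation above in a small prediction function (the coefficients come straight from the rainfall example, not from a new model fit):
predict_rainfall <- function(altitude_ft) 18.14 + 0.015 * altitude_ft
predict_rainfall(2000)   # 48.14 inches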
The line is placed in the position that minimises \(\sum (Y_o - Y_p)^2\)
(\(Y_o\) = observed value, \(Y_p\) = predicted value)
(Next lecture -> how to do this mathematically)
Or, a bad model: [scatter plot figure omitted]
We want to get R² as close to 1 as possible (if R² = 1, that is a perfect model)
\(\text{TSS} = \text{SSC} + \text{SSE}\)
In regression, the equivalent of TSS is the variability in the dependent (\(y\)) variable.
If TSS = 0, then all the values of \(y\) are the same.
The higher the TSS value, the more variability in \(y\).
\(\text{TSS} = \sum (y - \bar{y})^2\)
In regression, the equivalent of SSC is the Regression Sum of Squares (RSS): the variability in the \(y\) variable that is explained by the regression line.
If RSS = 0, there is no trend in the data.
The higher the RSS value, the stronger the trend in the data.
\(\text{RSS} = \sum (y_p - \bar{y})^2\)
(The closer the regression line is to the mean, the smaller the RSS value)
NB. Residual -> the difference between the observed value and the predicted value (error)
In regression, the equivalent of SSE is the variation in the \(y\) variable that is unexplained by the regression line.
High SSE means the model is poor at predicting \(y\)
Low SSE means the model is good at predicting \(y\) -> this is what we want!
\[ R^2 = \frac{\text{Regression sum of squares}}{\text{Total sum of squares}} \newline R^2 = \frac{\text{RSS}}{\text{TSS}} \newline R^2 = \frac{\text{Variation in } y \text{ that is explained by the regression line}}{\text{Total variation in the } y \text{ variable}} \]
ANOVA is used to test the statistical significance of R².
Use ANOVA to find RSS and TSS.
\(R² = \frac{RSS}{TSS}\)
(N.B. R² is called Multiple R-squared in R output)
The regression and residual (error) sum of squares are affected by the number of variables and observations respectively.
So we need to remove these effects by calculating the variance terms.
\[ \text{Variance term} = \frac{\text{Sum of Squares}}{\text{Degrees of Freedom}} \]
\[ \text{df (Regression SS)} = \text{Number of variables } (x \text{ and } y) - 1 \newline \text{df (Error SS)} = \text{Number of observations} - \text{Number of variables } (x \text{ and } y) \]
##### Snedecor’s Variance Ratio (F)
\[ F = \frac{\text{Variance term for Regression sum of squares}}{\text{Variance term for Error sum of squares}} \]
E.g. p-value = 0.03374, so H₀ can be rejected at the 0.05 level but not at the 0.01 level
The larger the F ratio the better the model
Are the two regression coefficients (\(a\) and \(b\)) statistically significant?
The t test is used to test the significance of the regression coefficients.
\[ t = \frac{\text{Value of coefficient}}{\text{Standard Error}} \]
Standard error (SE): how far the estimated coefficient is likely to be from its true value
We want t to be as high (positive or negative) as possible
Significance codes in R output, with the corresponding confidence level in brackets: 0 (99.99%) ‘***’ 0.001 (99.9%) ‘**’ 0.01 (99%) ‘*’ 0.05 (95%) ‘.’ 0.1 (90%) ‘ ’ 1 (0%)
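A sketch, using the age/blood-pressure data from earlier, of where these quantities appear in R's output:
# Simple regression on the age / blood pressure data (sketch; table3 from earlier)
model <- lm(blood_pressure ~ age, data = table3)
summary(model)   # Estimates, standard errors, t values, significance codes,
                 # residual standard error (the SEE), Multiple and Adjusted R-squared, F-statistic
anova(model)     # regression and residual (error) sums of squares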
N.B. You will need this for your coursework - it explains more of the variance than a simple regression model.
\(y=a+b_1x_1+b_2x_2\)
A good model needs to be “parsimonious” - it needs to be efficient
A high R² value and a small number of \(x\) variables (i.e. highest R² with fewest variables)
R provides Akaike’s Information Criteria (AIC)
N.B. No more than four predictor variables for coursework model
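A minimal sketch of how AIC comparisons work in R, reusing table3 and comparing the simple age model with an intercept-only model purely to show the mechanics (with several candidate predictors, step() automates the same comparison):
m1 <- lm(blood_pressure ~ age, data = table3)   # model with one predictor
m0 <- lm(blood_pressure ~ 1, data = table3)     # intercept-only model
AIC(m0, m1)   # lower AIC = better balance of fit and parsimony
# step(m1) would add/drop terms automatically based on AIC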
R² value is the sample coefficient of determination
Adjusted R² value estimates the population coefficient of determination
The adjusted R² value indicates the ‘shrinkage’ (loss of predictive power) when the model is applied to the population
A large difference (>5%) between R² and adjusted R² indicates that the model is poor
You don’t want the model to “collapse” when applied to the population
\[
\text{SEE} = \sqrt{\frac{\text{Error Sum of Squares}}{n - 2}}
\]
Unstandardised Coefficients (called Estimate in R)
\(t\)-value - the higher (positive or negative) the better
Use this to write down the regression equation
\[ y = a \text{ (intercept)} + b_1x_1 + b_2x_2 \newline \text{Rainfall} = 334.378 + (0.02995 \times \text{Altitude}) - (5.438 \times \text{Latitude}) \]
Variations in overcrowding (\(y\)) in 37 sites around Birmingham
Predictor variables: Percentage unemployment rate, Percentage of part-time workers, Percentage renting from council/housing association, Percentage of terraced housing, Percentage with access to good amenities, Percentage with no car
The model is built up stepwise
The computer selects % unemployment, % part-time workers, % in social housing
% No car skipped because it is confounded with % unemployment
R² value is 0.960
\[ y = a + b_1x_1 + b_2x_2 + b_3x_3 \newline \text{Overcrowding} = 10.337 + (2.016 \times \text{Unemployment}) + (0.917 \times \text{Part-time workers}) + (0.109 \times \text{Social housing}) \]
Checks on regression coefficients -> all significant, but not all in the expected direction (to be explained next week)
Last updated: 2025-03-13