Statistics with R, Course Six, Multiple Regression
Foreword
- Output options: the 'tango' syntax highlighting and the 'readable' theme.
- Snippets only.
A gentle introduction to the principles of multiple regression
Multiple regression: visualization of the relationships
# Perform the two single regressions and save them in a variable
#model_years <- lm(salary ~ years, data = fs)
#model_pubs <- lm(salary ~ pubs, data = fs)
model_years <- lm(fs$salary ~ fs$years)
model_pubs <- lm(fs$salary ~ fs$pubs)
# Plot both enhanced scatter plots in one plot matrix of 1 by 2
par(mfrow = c(1, 2))
#plot(fs$years, fs$salary, main = 'plot_years', xlab = 'years', ylab = 'salary')
plot(fs$salary ~ fs$years, main = 'plot_years', xlab = 'years', ylab = 'salary')
abline(model_years)
#plot(fs$pubs, fs$salary, main = 'plot_pubs', xlab = 'pubs', ylab = 'salary')
plot(fs$salary ~ fs$pubs, main = 'plot_pubs', xlab = 'pubs', ylab = 'salary')
abline(model_pubs)
Multiple regression: model selection
# Do a single regression of salary onto years of experience and check the output
model_1 <- lm(fs$salary ~ fs$years)
summary(model_1)
# Do a multiple regression of salary onto years of experience and number of publications and check the output
model_2 <- lm(fs$salary ~ fs$years + fs$pubs)
summary(model_2)
# Save the R squared of both models in preliminary variables
preliminary_model_1 <- summary(model_1)$r.squared
preliminary_model_2 <- summary(model_2)$r.squared
# Round them off while you save them in new variables
r_squared <- c()
r_squared[1] <- round(preliminary_model_1, 3)
r_squared[2] <- round(preliminary_model_2, 3)
# Print out the vector to see both R squared coefficients
r_squared
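R squared never decreases when a predictor is added, so for comparing models of different sizes the adjusted R squared is the fairer yardstick. It is available from the same summary object:

```r
# Adjusted R squared penalizes extra predictors; compare it across models
summary(model_1)$adj.r.squared
summary(model_2)$adj.r.squared
```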
Multiple regression: beware of redundancy
# Do multiple regression and check the regression output
model_3 <- lm(fs$salary ~ fs$years + fs$pubs + fs$age)
summary(model_3)
# Round off the R squared coefficients and save the result in the vector (in one step!)
r_squared[3] <- round(summary(model_3)$r.squared, 3)
# Print out the vector in order to display all R squared coefficients simultaneously
r_squared
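The redundancy is easiest to see by checking how strongly the predictors correlate with one another. A quick sketch, assuming fs contains numeric columns years, pubs and age:

```r
# A high correlation between age and years signals redundancy:
# age adds little information that years does not already carry
cor(fs[, c("years", "pubs", "age")])
```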
Intuition behind estimation of multiple regression coefficients
Definition of matrices
# Construction of 3 by 8 matrix r that contains the numbers 1 up to 24
r <- matrix(seq(1,24), 3)
# Construction of 3 by 8 matrix s that contains the numbers 21 up to 44
s <- matrix(seq(21,44), 3)
# Take the transpose of matrix r (note: naming it `t` masks base R's t() function)
t <- t(r)
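Since r was built from 24 values with 3 rows, it is 3 by 8; transposing swaps the dimensions:

```r
dim(r)     # 3 8
dim(t(r))  # 8 3
```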
Addition, subtraction and multiplication of matrices
# Compute the sum of matrices r and s
operation_1 <- r + s
# Compute the difference between matrices r and s
operation_2 <- r - s
# Multiply matrices t and s
operation_3 <- t %*% s
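Matrix multiplication with %*% only works when the inner dimensions agree: t is 8 by 3 and s is 3 by 8, so t %*% s is defined and yields an 8 by 8 result. Element-wise * would fail here because the shapes differ:

```r
dim(t)        # 8 3
dim(s)        # 3 8
dim(t %*% s)  # 8 8
```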
Row vector of sums
# The raw dataframe `X` is already loaded in.
X
# Construction of 1 by 10 matrix I of which the elements are all 1
I <- matrix(rep(1,10), 1, 10)
# Compute the row vector of sums
t_mat <- I %*% X
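Premultiplying the 10 by k data matrix by a 1 by 10 row of ones sums each column. The same numbers come out of colSums(), which is a handy cross-check (assuming X is numeric):

```r
# Each entry of the 1-by-k result is a column total of X
colSums(as.matrix(X))  # should match the entries of t_mat
```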
Row vector of means and matrix of means
# The data matrix `X` and the row vector of sums (`t_mat`) are saved and can be used.
# Number of observations
n <- 10
# Compute the row vector of means
# each column total is a sum over n rows, so dividing by n gives the means
M <- t_mat / n
# Construction of 10 by 1 matrix J of which the elements are all 1
J <- matrix(rep(1,10), 10, 1)
# Compute the matrix of means
MM <- J %*% M
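MM stacks the row vector of means ten times, so every row of MM equals the column means of X. colMeans() gives the same values directly:

```r
colMeans(as.matrix(X))  # should match any single row of MM
```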
Matrix of deviation scores
# The previously generated matrices X, M and MM do not need to be constructed again but are saved and can be used.
# Matrix of deviation scores D
D <- X - MM
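Subtracting the matrix of means centers each column at zero. Base R's scale() does the same centering in one call (it attaches extra attributes, hence the comparison on the plain values):

```r
D_check <- scale(as.matrix(X), center = TRUE, scale = FALSE)
# The centered values should match D (ignoring scale()'s attributes)
all.equal(as.numeric(D_check), as.numeric(as.matrix(D)))
```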
Sum of squares and sum of cross products matrix
# The previously generated matrices X, M, MM and D do not need to be constructed again but are saved and can be used.
# Sum of squares and sum of cross products matrix
S <- t(D) %*% D
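S holds sums of squares on its diagonal and sums of cross products off it. Dividing S by n − 1 gives the familiar sample variance-covariance matrix, so cov() offers a cross-check:

```r
# cov() uses the n - 1 divisor, so S/(n - 1) should reproduce it (n = 10 here)
all.equal(unname(S / (10 - 1)), unname(cov(as.matrix(X))))
```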
Calculating the correlation matrix
# The previously generated matrices X, M, MM, D and S do not need to be constructed again but are saved and can be used.
n <- 10
# Construct the variance-covariance matrix (using the population divisor n)
C <- S / n
# Generate the standard deviations matrix
SD <- diag(x = diag(C)^(1/2), nrow = 3, ncol = 3)
# Compute the correlation matrix
R <- solve(SD) %*% C %*% solve(SD)
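Because both C and SD were built from the same 1/n divisor, the divisor cancels in the standardization and R should equal the ordinary correlation matrix from cor():

```r
# Sanity check: the hand-built R matches cor() up to dimnames
all.equal(unname(R), unname(cor(as.matrix(X))))
```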
Dummy coding
Starting off
# Summary statistics by department (describeBy() comes from the psych package)
library(psych)
describeBy(fs, fs$dept)
A system to code categorical predictors in a regression analysis
Suppose we have a categorical vector of observations containing 4 distinct groups. With dummy coding, one group serves as the reference (all dummies 0) and each of the other three groups gets its own indicator variable:
ProfID | Group | Pubs | D1 | D2 | D3 |
---|---|---|---|---|---|
NU | Cognitive | 83 | 0 | 0 | 0 |
ZH | Clinical | 74 | 1 | 0 | 0 |
MK | Developmental | 80 | 0 | 1 | 0 |
RH | Social | 68 | 0 | 0 | 1 |
Summary statistics:
Group | M | SD | N |
---|---|---|---|
Cognitive | 93.31 | 29.48 | 13 |
Clinical | 60.67 | 11.12 | 8 |
Developmental | 103.5 | 23.64 | 6 |
Social | 70.13 | 21.82 | 9 |
Model:
Y = β0 + β1(D1) + β2(D2) + β3(D3) + ϵ
Coefficients:
 | B | SE | β | t | p |
---|---|---|---|---|---|
Intercept (Cognitive) | 93.31 | 6.5 | 0 | 14.37 | <.001 |
D1 (Clinical) | -32.64 | 10.16 | -0.51 | -3.21 | 0.003 |
D2 (Devel) | 10.19 | 11.56 | 0.14 | 0.88 | 0.384 |
D3 (Social) | -23.18 | 10.52 | -0.35 | -2.2 | 0.035 |
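With dummy coding, the coefficients and the group means tie together directly: the intercept is the reference group's mean, and each slope is a group's deviation from that mean. Checking against the summary statistics above:

```r
# Intercept = mean of the reference group (Cognitive)
93.31                # Cognitive mean
93.31 + (-32.64)     # = 60.67, the Clinical mean
93.31 + 10.19        # = 103.50, the Developmental mean
93.31 + (-23.18)     # = 70.13, the Social mean
```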
Creating dummy variables (1)
# Create the dummy variables (dummy.code() comes from the psych package)
dept_code <- dummy.code(fs$dept)
dept_code
# Merge the dataset in an extended dataframe
extended_fs <- cbind(fs, dept_code)
# Look at the extended dataframe
extended_fs
# Provide summary statistics
summary(extended_fs)
Creating dummy variables (2)
# Regress salary against years and publications
model <- lm(fs$salary ~ fs$years + fs$pubs)
# Apply the summary function to get summarized results for model
summary(model)
# Compute the confidence intervals for model
confint(model)
# Create dummies for the categorical variable fs$dept by using the C() function
dept_code <- C(fs$dept, treatment)
# Regress salary against years, publications and department
model_dummy <- lm(fs$salary ~ fs$years + fs$pubs + dept_code)
# Apply the summary function to get summarized results for model_dummy
summary(model_dummy)
# Compute the confidence intervals for model_dummy
confint(model_dummy)
Model selection: ANOVA
# Compare the nested models with an F test: model (without dept) vs. model_dummy (with dept)
anova(model, model_dummy)
Discrepancy between actual and predicted means
# Actual means of fs$salary
tapply(fs$salary, fs$dept, mean)
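To make any discrepancy visible you can put the model's fitted values next to the actual means. A sketch, assuming model_dummy (salary regressed on years, pubs and department) is still in the workspace:

```r
# Compare the observed department means with the model's fitted means
actual <- tapply(fs$salary, fs$dept, mean)
predicted <- tapply(predict(model_dummy), fs$dept, mean)
round(cbind(actual, predicted), 2)
```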
Unweighted effects coding
Consult the PDF for ‘Unweighted Effects Coding’.
# Inspect the categorical variable fs$dept and its distinct levels
fs$dept
# Factorize the categorical variable fs$dept and name the factorized variable dept.f
dept.f <- factor(fs$dept)
# Assign sum-to-zero (unweighted effects) contrasts to dept.f
contrasts(dept.f) <- contr.sum(3)
# Regress salary against dept.f
model_unweighted <- lm(fs$salary ~ dept.f)
# Apply the summary() function
summary(model_unweighted)
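contr.sum() builds sum-to-zero contrasts: each of the first k − 1 levels gets its own column and the last level is coded −1 in every column, so the intercept becomes the unweighted mean of the group means. For three levels:

```r
contr.sum(3)
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1
# [3,]   -1   -1
```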
Weighted effects coding
Consult ‘Weighted Effects Coding’.
# Factorize the categorical variable fs$dept and name the factorized variable dept.g
dept.g <- factor(fs$dept)
# Assign the weights matrix to dept.g
contrasts(dept.g) <- weights
# (the `weights` matrix is assumed to be precomputed and available in the workspace)
# Regress salary against dept.g
model_weighted <- lm(fs$salary ~ dept.g)
# Apply the summary() function
summary(model_weighted)