Statistics with R, Course Six, Multiple Regression


Foreword

  • Output options: the ‘tango’ syntax and the ‘readable’ theme.
  • Snippets only.

A gentle introduction to the principles of multiple regression

Multiple regression: visualization of the relationships

# Perform the two single regressions and save them in a variable
# (equivalently: lm(salary ~ years, data = fs))
model_years <- lm(fs$salary ~ fs$years)
model_pubs <- lm(fs$salary ~ fs$pubs)

# Plot both enhanced scatter plots in one plot matrix of 1 by 2
par(mfrow = c(1, 2))

plot(fs$salary ~ fs$years, main = 'plot_years', xlab = 'years', ylab = 'salary')
abline(model_years)

plot(fs$salary ~ fs$pubs, main = 'plot_pubs', xlab = 'pubs', ylab = 'salary')
abline(model_pubs)

Multiple regression: model selection

# Do a single regression of salary onto years of experience and check the output
model_1 <- lm(fs$salary ~ fs$years)
summary(model_1)

# Do a multiple regression of salary onto years of experience and numbers of publications and check the output
model_2 <- lm(fs$salary ~ fs$years + fs$pubs)
summary(model_2)

# Save the R squared of both models in preliminary variables
preliminary_model_1 <- summary(model_1)$r.squared
preliminary_model_2 <- summary(model_2)$r.squared

# Round them off while you save them in new variables
r_squared <- c()
r_squared[1] <- round(preliminary_model_1, 3)
r_squared[2] <- round(preliminary_model_2, 3)

# Print out the vector to see both R squared coefficients
r_squared

Multiple regression: beware of redundancy

# Do multiple regression and check the regression output
model_3 <- lm(fs$salary ~ fs$years + fs$pubs + fs$age)
summary(model_3)

# Round off the third R squared coefficient and append it to the vector (in one step!)
r_squared[3] <- round(summary(model_3)$r.squared, 3)

# Print out the vector in order to display all R squared coefficients simultaneously
r_squared
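R squared can only go up as predictors are added, which is why it rewards even a redundant term like age. A minimal sketch on simulated data (not the course's fs dataframe; all names here are hypothetical) contrasting R squared with adjusted R squared, which charges for the extra parameter:

```r
# Simulate a salary-like dataset where age is almost collinear with years
set.seed(1)
years  <- rnorm(50, 10, 3)
pubs   <- 2 * years + rnorm(50, 0, 2)
age    <- years + 25 + rnorm(50, 0, 1)   # redundant given years
salary <- 3000 * years + 500 * pubs + rnorm(50, 0, 5000)

m2 <- lm(salary ~ years + pubs)
m3 <- lm(salary ~ years + pubs + age)

# R squared never decreases when a predictor is added...
summary(m2)$r.squared <= summary(m3)$r.squared
# ...but adjusted R squared penalizes the extra parameter
c(summary(m2)$adj.r.squared, summary(m3)$adj.r.squared)
```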

Intuition behind estimation of multiple regression coefficients

Definition of matrices

# Construction of 3 by 8 matrix r that contains the numbers 1 up to 24
r <- matrix(seq(1,24), 3)

# Construction of 3 by 8 matrix s that contains the numbers 21 up to 44
s <- matrix(seq(21,44), 3)

# Take the transpose t of matrix r
# (note: this masks the base transpose function t() -- harmless here, but avoid it in real scripts)
t <- t(r)

Addition, subtraction and multiplication of matrices

# Compute the sum of matrices r and s
operation_1 <- r + s

# Compute the difference between matrices r and s
operation_2 <- r - s

# Multiply matrices t and s
operation_3 <- t %*% s

Row vector of sums

# The raw dataframe `X` is already loaded in.
X

# Construction of 1 by 10 matrix I of which the elements are all 1
I <- matrix(rep(1,10), 1, 10)

# Compute the row vector of sums
t_mat <- I %*% X
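Pre-multiplying by a row of ones is just the matrix-algebra way of summing each column. A quick self-contained check (X_demo is a hypothetical stand-in for the preloaded X):

```r
X_demo <- matrix(1:30, nrow = 10, ncol = 3)  # 10 observations, 3 variables
I_demo <- matrix(rep(1, 10), 1, 10)          # 1 by 10 row of ones
# The matrix product reproduces colSums()
all(I_demo %*% X_demo == colSums(X_demo))    # TRUE
```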

Row vector of means and matrix of means

# The data matrix `X` and the row vector of sums (`t_mat`) are saved and can be used.
# Number of observations
n <- 10

# Compute the row vector of means:
# the column sums divided by the number of rows give the column averages
M <- t_mat / n

# Construction of 10 by 1 matrix J of which the elements are all 1
J <- matrix(rep(1,10), 10, 1)

# Compute the matrix of means 
MM <- J %*% M

Matrix of deviation scores

# The previously generated matrices X, M and MM do not need to be constructed again but are saved and can be used.

# Matrix of deviation scores D 
D <- X - MM

Sum of squares and sum of cross products matrix

# The previously generated matrices X, M, MM and D do not need to be constructed again but are saved and can be used.

# Sum of squares and sum of cross products matrix
S <- t(D) %*% D

Calculating the correlation matrix

# The previously generated matrices X, M, MM, D and S do not need to be constructed again but are saved and can be used.
n <- 10

# Construct the variance-covariance matrix
# (dividing by n gives the population form; R's cov() divides by n - 1)
C <- S / n

# Generate the standard deviations matrix 
SD <- diag(x = diag(C)^(1/2), nrow = 3, ncol = 3)

# Compute the correlation matrix
R <- solve(SD) %*% C %*% solve(SD)
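The whole pipeline (sums, means, deviations, S, C, R) can be sanity-checked against R's built-in cor(): the 1/n versus 1/(n - 1) scaling cancels when standardizing, so the hand-built correlation matrix matches cor() exactly. A self-contained sketch on simulated data (X_demo stands in for the preloaded X):

```r
set.seed(2)
n <- 10
X_demo <- matrix(rnorm(n * 3), n, 3)

I_row <- matrix(1, 1, n)                 # 1 by n row of ones
M_d   <- (I_row %*% X_demo) / n          # row vector of means
MM_d  <- matrix(1, n, 1) %*% M_d         # n by 3 matrix of means
D_d   <- X_demo - MM_d                   # deviation scores
S_d   <- t(D_d) %*% D_d                  # sums of squares and cross products
C_d   <- S_d / n                         # variance-covariance matrix
SD_d  <- diag(sqrt(diag(C_d)), 3, 3)     # diagonal matrix of standard deviations
R_d   <- solve(SD_d) %*% C_d %*% solve(SD_d)

all.equal(R_d, cor(X_demo), check.attributes = FALSE)  # TRUE
```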

Dummy coding

Starting off

# Summary statistics per department (describeBy() comes from the psych package)
describeBy(fs, fs$dept)

A system to code categorical predictors in a regression analysis

Suppose we have a categorical vector of observations containing 4 distinct groups. With dummy coding, one group (here Cognitive) serves as the reference and scores 0 on every dummy variable, while each remaining group scores 1 on exactly one of them:

ProfID  Group          Pubs  D1  D2  D3
NU      Cognitive      83    0   0   0
ZH      Clinical       74    1   0   0
MK      Developmental  80    0   1   0
RH      Social         68    0   0   1
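The coding in the table can be built by hand with ifelse() — a small sketch using a hypothetical 4-element group vector, with Cognitive as the reference group:

```r
group <- c("Cognitive", "Clinical", "Developmental", "Social")
D1 <- ifelse(group == "Clinical", 1, 0)
D2 <- ifelse(group == "Developmental", 1, 0)
D3 <- ifelse(group == "Social", 1, 0)
data.frame(group, D1, D2, D3)  # reproduces the dummy columns above
```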

Summary statistics:

Group          M       SD     N
Cognitive      93.31   29.48  13
Clinical       60.67   11.12   8
Developmental  103.50  23.64   6
Social         70.13   21.82   9

Model:

Y = β0 + β1(D1) + β2(D2) + β3(D3) + ϵ

Coefficients:

               B       SE B   β      t      p
Constant       93.31    6.50   0.00  14.37  <.001
D1 (Clinical)  -32.64  10.16  -0.51  -3.21   .003
D2 (Devel)      10.19  11.56   0.14   0.88   .384
D3 (Social)    -23.18  10.52  -0.35  -2.20   .035
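Reading the two tables together: with dummy coding the intercept is the reference group's mean, and each dummy coefficient is that group's deviation from it, so the group means can be recovered by addition:

```r
intercept <- 93.31                       # Cognitive (reference) mean
b <- c(Clinical = -32.64, Developmental = 10.19, Social = -23.18)
intercept + b
# Clinical 60.67, Developmental 103.50, Social 70.13 --
# matching the summary-statistics table above
```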

Creating dummy variables (1)

# Create the dummy variables (dummy.code() comes from the psych package)
dept_code <- dummy.code(fs$dept)
dept_code

# Merge the dataset in an extended dataframe
extended_fs <- cbind(fs, dept_code)

# Look at the extended dataframe
extended_fs

# Provide summary statistics
summary(extended_fs)

Creating dummy variables (2)

# Regress salary against years and publications
model <- lm(fs$salary ~ fs$years + fs$pubs)

# Apply the summary function to get summarized results for model
summary(model)

# Compute the confidence intervals for model
confint(model) 

# Create dummies for the categorical variable fs$dept by using the C() function
dept_code <- C(fs$dept, treatment)

# Regress salary against years, publications and department 
model_dummy <- lm(fs$salary ~ fs$years + fs$pubs + dept_code)

# Apply the summary function to get summarized results for model_dummy
summary(model_dummy)

# Compute the confidence intervals for model_dummy
confint(model_dummy)

Model selection: ANOVA

# Compare model (without department) with model_dummy (with department)
anova(model, model_dummy)

Discrepancy between actual and predicted means

# Actual means of fs$salary
tapply(fs$salary, fs$dept, mean)

Unweighted effects coding

Consult the PDF for ‘Unweighted Effects Coding’.

# Inspect fs$dept to see how many distinct levels it has
fs$dept

# Factorize the categorical variable fs$dept and name the factorized variable dept.f
dept.f <- factor(fs$dept)

# Assign unweighted effects (sum) contrasts for the 3 levels of dept.f
contrasts(dept.f) <- contr.sum(3)

# Regress salary against dept.f
model_unweighted <- lm(fs$salary ~ dept.f)

# Apply the summary() function
summary(model_unweighted)
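The defining property of unweighted effects (sum) coding is that the intercept estimates the unweighted mean of the group means, not the overall sample mean — a distinction that matters when group sizes differ. A self-contained sketch on simulated unbalanced data:

```r
set.seed(3)
g <- factor(rep(c("A", "B", "C"), times = c(5, 10, 20)))  # unbalanced groups
y <- c(rnorm(5, 10), rnorm(10, 20), rnorm(20, 30))
contrasts(g) <- contr.sum(3)

fit <- lm(y ~ g)
unname(coef(fit)[1])      # the intercept...
mean(tapply(y, g, mean))  # ...equals the unweighted mean of the group means
```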

Weighted effects coding

Consult ‘Weighted Effects Coding’.

# Factorize the categorical variable fs$dept and name the factorized variable dept.g
dept.g <- factor(fs$dept)

# Assign the weights matrix (preloaded in the exercise environment) to dept.g
contrasts(dept.g) <- weights

# Regress salary against dept.g
model_weighted <- lm(fs$salary ~ dept.g)

# Apply the summary() function
summary(model_weighted)
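Weighted effects coding is the complementary scheme: with contrast weights scaled by group sizes, the intercept estimates the overall (weighted) sample mean. A self-contained sketch that builds such a weights matrix by hand for 3 unbalanced groups (the course's preloaded `weights` matrix plays this role for fs$dept):

```r
set.seed(4)
sizes <- c(5, 10, 20)
g <- factor(rep(c("A", "B", "C"), times = sizes))
y <- c(rnorm(5, 10), rnorm(10, 20), rnorm(20, 30))

# The last group gets -n_j / n_last in each column, chosen so that each
# design-matrix column sums to zero across all observations
w <- rbind(c(1, 0),
           c(0, 1),
           c(-sizes[1] / sizes[3], -sizes[2] / sizes[3]))
contrasts(g) <- w

fit <- lm(y ~ g)
all.equal(unname(coef(fit)[1]), mean(y))  # TRUE: intercept is the grand mean
```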