Statistics with R, Course Six, Multiple Regression
Foreword
- Output options: the 'tango' syntax highlighting and the 'readable' theme.
- Snippets only.
A gentle introduction to the principles of multiple regression
Multiple regression: visualization of the relationships
# Perform the two single regressions and save them in a variable
#model_years <- lm(salary ~ years, data = fs)
#model_pubs <- lm(salary ~ pubs, data = fs)
model_years <- lm(fs$salary ~ fs$years)
model_pubs <- lm(fs$salary ~ fs$pubs)
# Plot both enhanced scatter plots in one plot matrix of 1 by 2
par(mfrow = c(1, 2))
#plot(fs$years, fs$salary, main = 'plot_years', xlab = 'years', ylab = 'salary')
plot(fs$salary ~ fs$years, main = 'plot_years', xlab = 'years', ylab = 'salary')
abline(model_years)
#plot(fs$pubs, fs$salary, main = 'plot_pubs', xlab = 'pubs', ylab = 'salary')
plot(fs$salary ~ fs$pubs, main = 'plot_pubs', xlab = 'pubs', ylab = 'salary')
abline(model_pubs)
Multiple regression: model selection
# Do a single regression of salary onto years of experience and check the output
model_1 <- lm(fs$salary ~ fs$years)
summary(model_1)
# Do a multiple regression of salary onto years of experience and number of publications and check the output
model_2 <- lm(fs$salary ~ fs$years + fs$pubs)
summary(model_2)
# Save the R squared of both models in preliminary variables
preliminary_model_1 <- summary(model_1)$r.squared
preliminary_model_2 <- summary(model_2)$r.squared
# Round them off while you save them in new variables
r_squared <- c()
r_squared[1] <- round(preliminary_model_1, 3)
r_squared[2] <- round(preliminary_model_2, 3)
# Print out the vector to see both R squared coefficients
r_squared
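R squared never decreases when a predictor is added, so for comparing models of different sizes the adjusted R squared is the fairer yardstick. It is available from the same summary object:

```r
# Adjusted R squared penalizes extra predictors; compare it across models
summary(model_1)$adj.r.squared
summary(model_2)$adj.r.squared
```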
Multiple regression: beware of redundancy
# Do multiple regression and check the regression output
model_3 <- lm(fs$salary ~ fs$years + fs$pubs + fs$age)
summary(model_3)
# Round off the R squared coefficients and save the result in the vector (in one step!)
r_squared[3] <- round(summary(model_3)$r.squared, 3)
# Print out the vector in order to display all R squared coefficients simultaneously
r_squared
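The redundancy is easiest to see by checking how strongly the predictors correlate with one another. A quick sketch, assuming fs contains numeric columns years, pubs and age:

```r
# A high correlation between age and years signals redundancy:
# age adds little information that years does not already carry
cor(fs[, c("years", "pubs", "age")])
```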
Intuition behind estimation of multiple regression coefficients
Definition of matrices
# Construction of 3 by 8 matrix r that contains the numbers 1 up to 24
r <- matrix(seq(1,24), 3)
# Construction of 3 by 8 matrix s that contains the numbers 21 up to 44
s <- matrix(seq(21,44), 3)
# Take the transpose of matrix r (note: naming it `t` masks base R's t() function)
t <- t(r)
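Since r was built from 24 values with 3 rows, it is 3 by 8; transposing swaps the dimensions:

```r
dim(r)     # 3 8
dim(t(r))  # 8 3
```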
Addition, subtraction and multiplication of matrices
# Compute the sum of matrices r and s
operation_1 <- r + s
# Compute the difference between matrices r and s
operation_2 <- r - s
# Multiply matrices t and s
operation_3 <- t %*% s
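Matrix multiplication with %*% only works when the inner dimensions agree: t is 8 by 3 and s is 3 by 8, so t %*% s is defined and yields an 8 by 8 result. Element-wise * would fail here because the shapes differ:

```r
dim(t)        # 8 3
dim(s)        # 3 8
dim(t %*% s)  # 8 8
```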
Row vector of sums
# The raw dataframe `X` is already loaded in.
X
# Construction of 1 by 10 matrix I of which the elements are all 1
I <- matrix(rep(1,10), 1, 10)
# Compute the row vector of sums
t_mat <- I %*% X
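Premultiplying the 10 by k data matrix by a 1 by 10 row of ones sums each column. The same numbers come out of colSums(), which is a handy cross-check (assuming X is numeric):

```r
# Each entry of the 1-by-k result is a column total of X
colSums(as.matrix(X))  # should match the entries of t_mat
```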
Row vector of means and matrix of means
# The data matrix `X` and the row vector of sums (`t_mat`) are saved and can be used.
# Number of observations
n <- 10
# Compute the row vector of means
# each column total is a sum over n rows, so dividing by n gives the means
M <- t_mat / n
# Construction of 10 by 1 matrix J of which the elements are all 1
J <- matrix(rep(1,10), 10, 1)
# Compute the matrix of means
MM <- J %*% M
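MM stacks the row vector of means ten times, so every row of MM equals the column means of X. colMeans() gives the same values directly:

```r
colMeans(as.matrix(X))  # should match any single row of MM
```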
Matrix of deviation scores
# The previously generated matrices X, M and MM do not need to be constructed again but are saved and can be used.
# Matrix of deviation scores D
D <- X - MM
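Subtracting the matrix of means centers each column at zero. Base R's scale() does the same centering in one call (it attaches extra attributes, hence the comparison on the plain values):

```r
D_check <- scale(as.matrix(X), center = TRUE, scale = FALSE)
# The centered values should match D (ignoring scale()'s attributes)
all.equal(as.numeric(D_check), as.numeric(as.matrix(D)))
```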
Sum of squares and sum of cross products matrix
# The previously generated matrices X, M, MM and D do not need to be constructed again but are saved and can be used.
# Sum of squares and sum of cross products matrix
S <- t(D) %*% D
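S holds sums of squares on its diagonal and sums of cross products off it. Dividing S by n − 1 gives the familiar sample variance-covariance matrix, so cov() offers a cross-check:

```r
# cov() uses the n - 1 divisor, so S/(n - 1) should reproduce it (n = 10 here)
all.equal(unname(S / (10 - 1)), unname(cov(as.matrix(X))))
```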
Calculating the correlation matrix
# The previously generated matrices X, M, MM, D and S do not need to be constructed again but are saved and can be used.
n <- 10
# Construct the variance-covariance matrix (using the population divisor n)
C <- S / n
# Generate the standard deviations matrix
SD <- diag(x = diag(C)^(1/2), nrow = 3, ncol = 3)
# Compute the correlation matrix
R <- solve(SD) %*% C %*% solve(SD)
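Because both C and SD were built from the same 1/n divisor, the divisor cancels in the standardization and R should equal the ordinary correlation matrix from cor():

```r
# Sanity check: the hand-built R matches cor() up to dimnames
all.equal(unname(R), unname(cor(as.matrix(X))))
```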
Dummy coding
Starting off
# Summary statistics by department (describeBy() comes from the psych package)
library(psych)
describeBy(fs, fs$dept)
A system to code categorical predictors in a regression analysis
Suppose we have a categorical vector of observations containing 4 distinct groups. With dummy coding, one group serves as the reference (all dummies 0) and each of the other three groups gets its own indicator variable:
ProfID | Group | Pubs | D1 | D2 | D3 |
---|---|---|---|---|---|
NU | Cognitive | 83 | 0 | 0 | 0 |
ZH | Clinical | 74 | 1 | 0 | 0 |
MK | Developmental | 80 | 0 | 1 | 0 |
RH | Social | 68 | 0 | 0 | 1 |
Summary statistics:
Group | M | SD | N |
---|---|---|---|
Cognitive | 93.31 | 29.48 | 13 |
Clinical | 60.67 | 11.12 | 8 |
Developmental | 103.5 | 23.64 | 6 |
Social | 70.13 | 21.82 | 9 |
Model:
Y = β0 + β1(D1) + β2(D2) + β3(D3) + ϵ
Coefficients:
 | B | SE | β | t | p |
---|---|---|---|---|---|
Intercept (Cognitive) | 93.31 | 6.5 | 0 | 14.37 | <.001 |
D1 (Clinical) | -32.64 | 10.16 | -0.51 | -3.21 | 0.003 |
D2 (Devel) | 10.19 | 11.56 | 0.14 | 0.88 | 0.384 |
D3 (Social) | -23.18 | 10.52 | -0.35 | -2.2 | 0.035 |
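With dummy coding, the coefficients and the group means tie together directly: the intercept is the reference group's mean, and each slope is a group's deviation from that mean. Checking against the summary statistics above:

```r
# Intercept = mean of the reference group (Cognitive)
93.31                # Cognitive mean
93.31 + (-32.64)     # = 60.67, the Clinical mean
93.31 + 10.19        # = 103.50, the Developmental mean
93.31 + (-23.18)     # = 70.13, the Social mean
```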
Creating dummy variables (1)
# Create the dummy variables (dummy.code() comes from the psych package)
dept_code <- dummy.code(fs$dept)
dept_code
# Merge the dataset in an extended dataframe
extended_fs <- cbind(fs, dept_code)
# Look at the extended dataframe
extended_fs
# Provide summary statistics
summary(extended_fs)
Creating dummy variables (2)
# Regress salary against years and publications
model <- lm(fs$salary ~ fs$years + fs$pubs)
# Apply the summary function to get summarized results for model
summary(model)
# Compute the confidence intervals for model
confint(model)
# Create dummies for the categorical variable fs$dept by using the C() function
dept_code <- C(fs$dept, treatment)
# Regress salary against years, publications and department
model_dummy <- lm(fs$salary ~ fs$years + fs$pubs + dept_code)
# Apply the summary function to get summarized results for model_dummy
summary(model_dummy)
# Compute the confidence intervals for model_dummy
confint(model_dummy)
Model selection: ANOVA
# Compare the nested models with an F test: model (without dept) vs. model_dummy (with dept)
anova(model, model_dummy)
Discrepancy between actual and predicted means
# Actual means of fs$salary
tapply(fs$salary, fs$dept, mean)
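To make any discrepancy visible you can put the model's fitted values next to the actual means. A sketch, assuming model_dummy (salary regressed on years, pubs and department) is still in the workspace:

```r
# Compare the observed department means with the model's fitted means
actual <- tapply(fs$salary, fs$dept, mean)
predicted <- tapply(predict(model_dummy), fs$dept, mean)
round(cbind(actual, predicted), 2)
```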
Unweighted effects coding
Consult the PDF for ‘Unweighted Effects Coding’.
# Inspect the categorical variable fs$dept and its distinct levels
fs$dept
# Factorize the categorical variable fs$dept and name the factorized variable dept.f
dept.f <- factor(fs$dept)
# Assign sum-to-zero (unweighted effects) contrasts to dept.f
contrasts(dept.f) <- contr.sum(3)
# Regress salary against dept.f
model_unweighted <- lm(fs$salary ~ dept.f)
# Apply the summary() function
summary(model_unweighted)
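contr.sum() builds sum-to-zero contrasts: each of the first k − 1 levels gets its own column and the last level is coded −1 in every column, so the intercept becomes the unweighted mean of the group means. For three levels:

```r
contr.sum(3)
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1
# [3,]   -1   -1
```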
Weighted effects coding
Consult ‘Weighted Effects Coding’.
# Factorize the categorical variable fs$dept and name the factorized variable dept.g
dept.g <- factor(fs$dept)
# Assign the weights matrix to dept.g
contrasts(dept.g) <- weights
# (the `weights` matrix is assumed to be precomputed and available in the workspace)
# Regress salary against dept.g
model_weighted <- lm(fs$salary ~ dept.g)
# Apply the summary() function
summary(model_weighted)