Linear models are one of the core tools statisticians use to understand and analyze relationships between variables. For this section assignment I work through two textbook exercises, using a real data set and simulated dummy variables, to show the versatility of linear models.
## From our textbook: pp. 188, Questions 10.1 and 10.3
library(ISwR)
## Warning: package ‘ISwR’ was built under R version 4.2.3
data(ashina)
head(ashina)  # inspect the wide-format data before reshaping
# Reshaping the ashina data
# The code below reshapes the original ashina dataset from wide format to long format
# and adds several new factor variables (subject, treat, grp, and period) to the reshaped dataset.
ashina.long <- reshape(ashina, direction = "long", varying = 1:2, timevar = "treat")
# This code block transforms the ashina.long data frame by creating new factor variables
# (subject, treat, grp, and period) and then removing the temporary m matrix.
# The period variable is derived from the combination of grp and treat via the m matrix:
# in this crossover design, group 1 received placebo first (active in period 2) and
# group 2 received the active treatment first.
ashina.long <- within(ashina.long, {
  m <- matrix(c(2, 1, 1, 2), 2)
  subject <- factor(id)
  treat <- factor(treat)
  grp <- factor(grp)
  period <- factor(m[cbind(grp, treat)])
  rm(m)
})
ashina.long
## grp treat vas id period subject
## 1.active 1 active -167 1 2 1
## 2.active 1 active -127 2 2 2
## 3.active 1 active -58 3 2 3
## 4.active 1 active -103 4 2 4
## 5.active 1 active -35 5 2 5
## 6.active 1 active -164 6 2 6
## 7.active 1 active -3 7 2 7
## 8.active 1 active 25 8 2 8
## 9.active 1 active -61 9 2 9
## 10.active 1 active -45 10 2 10
## 11.active 2 active -38 11 1 11
## 12.active 2 active 29 12 1 12
## 13.active 2 active 2 13 1 13
## 14.active 2 active -18 14 1 14
## 15.active 2 active -74 15 1 15
## 16.active 2 active -72 16 1 16
## 1.plac 1 plac -102 1 1 1
## 2.plac 1 plac -39 2 1 2
## 3.plac 1 plac 32 3 1 3
## 4.plac 1 plac 28 4 1 4
## 5.plac 1 plac 16 5 1 5
## 6.plac 1 plac -42 6 1 6
## 7.plac 1 plac -27 7 1 7
## 8.plac 1 plac -30 8 1 8
## 9.plac 1 plac -47 9 1 9
## 10.plac 1 plac 8 10 1 10
## 11.plac 2 plac 12 11 2 11
## 12.plac 2 plac 11 12 2 12
## 13.plac 2 plac -9 13 2 13
## 14.plac 2 plac -1 14 2 14
## 15.plac 2 plac 3 15 2 15
## 16.plac 2 plac -36 16 2 16
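# As a quick sanity check (a sketch beyond the original exercise), cross-tabulating the new
# factors confirms the crossover coding: within each group, each treatment occurs in a
# different period.
with(ashina.long, table(grp, treat, period))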
# Here we fit a linear regression model using the lm() function. The response variable is
# vas, and the predictors are subject, period, and treat.
# The data for the regression analysis come from the ashina.long data frame.
fit.ashina <- lm(vas ~ subject + period + treat, data=ashina.long)
# The drop1() function removes each term in turn from the full model and tests whether
# dropping it significantly worsens the fit; these are marginal (type II-style) tests,
# in contrast to the sequential tests produced by anova() below. The test = "F" argument
# requests an F-test for each deletion, and the output reports the change in residual sum
# of squares, the AIC, and the F statistic and p-value for each dropped term.
drop1(fit.ashina, test = "F")
## Single term deletions
##
## Model:
## vas ~ subject + period + treat
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 19679 241.49
## subject 15 51137 70816 252.47 2.4254 0.05287 .
## period 1 1505 21184 241.85 1.0709 0.31830
## treat 1 11603 31282 254.32 8.2550 0.01228 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# In contrast, anova() produces a sequential (type I) analysis-of-variance table: each term
# is tested against the model containing only the terms listed before it in the formula,
# so the results depend on the order in which the terms appear.
anova(fit.ashina)
## Analysis of Variance Table
##
## Response: vas
## Df Sum Sq Mean Sq F value Pr(>F)
## subject 15 51137 3409.2 2.4254 0.05287 .
## period 1 4608 4608.0 3.2783 0.09171 .
## treat 1 11603 11603.3 8.2550 0.01228 *
## Residuals 14 19679 1405.6
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
###### Interpretation of the drop1() results
# For subject, the F value is 2.4254 with a p-value of 0.05287: removing the subject terms
# is borderline significant (the p-value is just above 0.05).
# For period, the F value is 1.0709 with a p-value of 0.31830: removing period does not
# significantly worsen the model.
# For treat, the F value is 8.2550 with a p-value of 0.01228: removing treat significantly
# worsens the model (p < 0.05), so there is evidence of a treatment effect.
###### Interpretation of the anova() results
# For subject, the F value is 2.4254 with a p-value of 0.05287, again borderline.
# For period, the F value is 3.2783 with a p-value of 0.09171, not statistically significant.
# This differs from the drop1() result (p = 0.31830) because anova() tests period
# sequentially (after subject but before treat), while drop1() adjusts for all other terms;
# with 10 subjects in one group and 6 in the other, the design is unbalanced, so the two
# tests do not coincide.
# For treat, the F value is 8.2550 with a p-value of 0.01228, statistically significant.
# The Residuals row represents the variation left unexplained by the model.
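# To see the order dependence concretely, here is a minimal sketch (not part of the
# original assignment): refitting the same model with treat and period swapped changes
# their sequential sums of squares, while the drop1() results would be unchanged.
fit.reordered <- lm(vas ~ subject + treat + period, data = ashina.long)
anova(fit.reordered)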
###### Testing the period and treatment effects with t-tests and comparing them to the additive model
attach(ashina)
aa <- vas.active - vas.plac
t.test(aa[grp==1], -aa[grp==2], var.equal = TRUE)
##
## Two Sample t-test
##
## data: aa[grp == 1] and -aa[grp == 2]
## t = -2.8731, df = 14, p-value = 0.01228
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -137.39089 -19.94244
## sample estimates:
## mean of x mean of y
## -53.50000 25.16667
t.test(aa[grp==1], aa[grp==2], var.equal = TRUE)
##
## Two Sample t-test
##
## data: aa[grp == 1] and aa[grp == 2]
## t = -1.0348, df = 14, p-value = 0.3183
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -87.05756 30.39089
## sample estimates:
## mean of x mean of y
## -53.50000 -25.16667
## Results interpretation, test 1
# This test compares the within-subject differences (active minus placebo) in group 1 with
# the negated differences in group 2, which isolates the treatment effect.
# The t statistic is -2.8731 on 14 degrees of freedom, and the p-value is 0.01228, matching
# the p-value for treat in the additive model above.
# Since p < 0.05, we reject the null hypothesis and conclude that there is a statistically
# significant treatment effect.
# The 95 percent confidence interval for the true difference in means is -137.39 to -19.94.
# The sample estimates are -53.5 for group 1 (mean of x) and 25.17 for the negated group 2
# differences (mean of y).
## Results interpretation, test 2
# This test compares the unmodified differences between the two groups, which isolates the
# period effect.
# The t statistic is -1.0348 on 14 degrees of freedom, and the p-value is 0.3183, matching
# the p-value for period from drop1() above.
# Since p > 0.05, we fail to reject the null hypothesis: there is no statistically
# significant period effect.
# The 95 percent confidence interval for the true difference in means is -87.06 to 30.39.
# The sample estimates are -53.5 for group 1 (mean of x) and -25.17 for group 2 (mean of y).
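# For comparison, a simple paired t-test that ignores the period effect entirely could be
# run as below; this is a sketch, not part of the assignment output. detach() is included
# to undo the attach() call above.
t.test(vas.active, vas.plac, paired = TRUE)
detach(ashina)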
################## Second assignment
# 10.3. Consider the following definitions.
# Your assignment:
# Generate the model matrices for the models z ~ a*b, z ~ a:b, etc., and discuss the
# implications in your blog posting. Carry out the model fits and note which models
# contain singularities.
a <- gl(2, 2, 8)
b <- gl(2, 4, 8)
x <- 1:8
y <- c(1:4, 8:5)
z <- rnorm(8)
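# Before building the model matrices, it helps to see how gl() laid out the factors; this
# is a quick sketch beyond the assignment. Note that z is random, so the coefficients shown
# below will differ between runs unless a seed is set beforehand (e.g., set.seed(1)).
table(a, b)  # the 2 x 2 crossing of a and b, with two observations per cell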
# Model 1: z ~ a*b
model_matrix_1 <- model.matrix(~ a*b)
# This model includes the main effects of both a and b as well as their interaction. It
# allows the effect of a on z to differ between the levels of b, and vice versa.
# Model 2: z ~ a:b
model_matrix_2 <- model.matrix(~ a:b)
# This model includes only the interaction term. Together with the implicit intercept, it
# produces five columns for only four (a, b) cells, so the design matrix cannot be of full
# rank and one cell coefficient is redundant.
# Model 3: z ~ a:x
model_matrix_3 <- model.matrix(~ a:x)
# This model fits a common intercept with a separate slope of x within each level of a;
# there is no main effect of a, so both groups share the same intercept.
# Model 4: z ~ a*x
model_matrix_4 <- model.matrix(~ a*x)
# This model includes the main effect of a, the slope of x, and their interaction, so both
# the intercept and the slope of x are allowed to differ between the levels of a.
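# Printing a model matrix shows exactly which dummy columns each formula generates; this is
# a quick sketch beyond what the assignment asks. Model 2 is the interesting case: an
# intercept plus four cell-indicator columns for only four distinct cells.
model_matrix_2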
################## Fitting the models
model1 <- lm(z ~ a*b)
model2 <- lm(z ~ a:b)
model3 <- lm(z ~ a:x)
model4 <- lm(z ~ a*x)
model1
## Call:
## lm(formula = z ~ a * b)
##
## Coefficients:
## (Intercept) a2 b2 a2:b2
## -0.6023 1.1236 0.1919 -1.9871
model2
##
## Call:
## lm(formula = z ~ a:b)
##
## Coefficients:
## (Intercept) a1:b1 a2:b1 a1:b2 a2:b2
## -1.2739 0.6716 1.7952 0.8635 NA
model3
##
## Call:
## lm(formula = z ~ a:x)
##
## Coefficients:
## (Intercept) a1:x a2:x
## 0.2590 -0.1686 -0.1474
model4
##
## Call:
## lm(formula = z ~ a * x)
##
## Coefficients:
## (Intercept) a2 x a2:x
## -0.42094 2.10175 -0.02441 -0.34962
# In simple terms, we can determine which of the fitted models contain singularities by
# looking at the coefficient values:
#
# 1. model1 (z ~ a * b):
#    - Coefficients: Intercept, a2, b2, a2:b2
#    - No singularity: all coefficients have numerical values, with no NA (Not Available)
#      entries.
#
# 2. model2 (z ~ a:b):
#    - Coefficients: Intercept, a1:b1, a2:b1, a1:b2, a2:b2
#    - This model contains a singularity: the coefficient for a2:b2 is NA. The four cell
#      indicators sum to the intercept column, so the model matrix is rank-deficient and
#      one parameter cannot be estimated.
#
# 3. model3 (z ~ a:x):
#    - Coefficients: Intercept, a1:x, a2:x
#    - No singularity: all coefficients have numerical values.
#
# 4. model4 (z ~ a * x):
#    - Coefficients: Intercept, a2, x, a2:x
#    - No singularity: all coefficients have numerical values.
#
# In summary, model2 (z ~ a:b) is the model that contains a singularity, signalled by the
# NA coefficient for a2:b2. Singularities make parameters non-identifiable and results
# harder to interpret, so it is important to watch for them when specifying linear models
# in R.
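# Rather than scanning coefficient tables for NA values, the rank deficiency can also be
# checked directly from the model matrices; this is a sketch beyond the assignment.
sapply(list(m1 = model_matrix_1, m2 = model_matrix_2,
            m3 = model_matrix_3, m4 = model_matrix_4),
       function(m) ncol(m) - qr(m)$rank)  # 0 = full rank; > 0 indicates a singularity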