Multivariate Regression in R

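Regression with several predictors is worth learning in R because it lets us analyze how one outcome relates to multiple variables at once. (Strictly speaking, a model with one response and several predictors is a multiple regression; the assignment calls it multivariate.) This time I am tasked with analyzing two datasets from the ISwR package, “cystfibr” and “secher”, each of which has multiple variables.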

# 9.1
#Conduct ANOVA (analysis of variance) and compute regression coefficients for the cystfibr data (> data("cystfibr")).
# You can choose any variable you like. In your report, you need to state the result of Coefficients (intercept) for any variables you like, both under ANOVA and multivariate analysis.
# The model code:
 library(ISwR)

## Warning: package ‘ISwR’ was built under R version 4.2.3

data("cystfibr")

attach(cystfibr)

## The following object is masked from package:ISwR:
##
##     tlc

lm(formula = cystfibr$pemax ~ age+weight+bmp+fev1, data=cystfibr)

##
## Call:
## lm(formula = cystfibr$pemax ~ age + weight + bmp + fev1, data = cystfibr)
##
## Coefficients:
## (Intercept)          age       weight          bmp         fev1 
##     179.296       -3.418        2.688       -2.066        1.088

summary(lm(formula = cystfibr$pemax ~ age+weight+bmp+fev1, data=cystfibr))

##
## Call:
## lm(formula = cystfibr$pemax ~ age + weight + bmp + fev1, data = cystfibr)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -42.521 -10.885   3.003  15.488  41.767
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 179.2957    61.8855   2.897  0.00891 **
## age          -3.4181     3.3086  -1.033  0.31389  
## weight        2.6882     1.1727   2.292  0.03287 *
## bmp          -2.0657     0.8198  -2.520  0.02036 *
## fev1          1.0882     0.5139   2.117  0.04695 *
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 23.4 on 20 degrees of freedom
## Multiple R-squared:  0.5918, Adjusted R-squared:  0.5101
## F-statistic: 7.248 on 4 and 20 DF,  p-value: 0.0008891

The regression analysis models the relationship between the dependent variable cystfibr$pemax and four independent variables: age, weight, bmp, and fev1. Here’s how to interpret the results:

1. Coefficients (Intercept):

   Estimate: The estimated intercept (constant) is 179.2957.

   Std. Error: The standard error associated with the intercept estimate is 61.8855.

   t value: The t-value for the intercept is 2.897.

   Pr(>|t|): The p-value associated with the intercept is 0.00891, which is below 0.01 (the ‘**’ in the Signif. codes section marks p-values between 0.001 and 0.01). The intercept is therefore statistically significant, although it describes the expected pemax when age, weight, bmp, and fev1 are all zero, which is not a realistic patient.

2. Coefficients (age, weight, bmp, fev1):

 Each of these coefficients represents the estimated change in the dependent variable cystfibr$pemax associated with a one-unit change in the respective independent variable, holding all other variables constant.

    For example:

    The coefficient for “age” is -3.4181. This suggests that, on average, for each additional year of age, cystfibr$pemax is estimated to decrease by 3.4181 units, although it’s not statistically significant (p-value > 0.05).

     The coefficient for “weight” is 2.6882. This suggests that, on average, for each additional unit of weight, cystfibr$pemax is estimated to increase by 2.6882 units, and this change is statistically significant (p-value < 0.05).

     The coefficient for “bmp” is -2.0657. This suggests that, on average, for each additional unit of “bmp”, cystfibr$pemax is estimated to decrease by 2.0657 units, and this change is statistically significant (p-value < 0.05).

     The coefficient for “fev1” is 1.0882. This suggests that, on average, for each additional unit of “fev1”, cystfibr$pemax is estimated to increase by 1.0882 units, and this change is statistically significant (p-value < 0.05).

3. Residuals:

    The residuals represent the differences between the observed values of cystfibr$pemax and the values predicted by the regression model. They range from -42.521 to 41.767, with a median of 3.003 and quartiles of -10.885 and 15.488, so they are spread roughly symmetrically around zero.

4. Model Fit:

   The “Multiple R-squared” value is 0.5918, which indicates that approximately 59.18% of the variability in cystfibr$pemax is explained by the model.

   The “Adjusted R-squared” value is 0.5101, which adjusts the R-squared value for the number of predictors in the model.

   The F-statistic tests the overall significance of the regression model, and its associated p-value is 0.0008891, indicating that the model is statistically significant.

Overall, the model suggests that weight, bmp, and fev1 are statistically significant predictors of cystfibr$pemax, while age is not statistically significant at the 0.05 significance level. The intercept is also statistically significant.
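The quantities in the summary are tied together by simple formulas, and it can help to recompute a few of them by hand. The sketch below uses only numbers copied from the output above (the variable names are mine, and n = 25 is inferred from 20 residual degrees of freedom plus 5 estimated coefficients), so small differences from the printed values are rounding.

```r
# Numbers copied from the summary() output above
est <- 179.2957   # intercept estimate
se  <- 61.8855    # its standard error
n   <- 25         # observations (20 residual df + 5 coefficients)
p   <- 4          # number of predictors
r2  <- 0.5918     # multiple R-squared
df_res <- n - p - 1

t_val  <- est / se                                  # t value = estimate / std. error
p_val  <- 2 * pt(-abs(t_val), df = df_res)          # two-sided p-value
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res           # adjusted R-squared
f_stat <- (r2 / p) / ((1 - r2) / df_res)            # overall F-statistic
f_p    <- pf(f_stat, p, df_res, lower.tail = FALSE) # its p-value

print(round(c(t = t_val, p = p_val, adj_r2 = adj_r2, F = f_stat, F_p = f_p), 4))
```

The results should agree with the printed t value (2.897), adjusted R-squared (0.5101), and F-statistic (7.248) up to rounding.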

anova(lm(cystfibr$pemax ~ age + weight + bmp + fev1, data=cystfibr))

## Analysis of Variance Table
##
## Response: cystfibr$pemax
##           Df  Sum Sq Mean Sq F value    Pr(>F)   
## age        1 10098.5 10098.5 18.4385 0.0003538 ***
## weight     1   945.2   945.2  1.7258 0.2038195   
## bmp        1  2379.7  2379.7  4.3450 0.0501483 . 
## fev1       1  2455.6  2455.6  4.4836 0.0469468 * 
## Residuals 20 10953.7   547.7                     
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

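The analysis of variance (ANOVA) table breaks the variation in the dependent variable cystfibr$pemax into a part attributed to each of the independent variables age, weight, bmp, and fev1, plus a residual part. Because anova() on a single lm fit uses sequential (Type I) sums of squares, each row tests a variable after the variables listed above it have already been included, so the order of the terms matters. Here is a simple interpretation: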

1. Response: cystfibr$pemax:

   – This section identifies the dependent variable under study, which is cystfibr$pemax.

2. Df (Degrees of Freedom):

   – For each independent variable and the residual (error) term:

     – “age” has 1 degree of freedom.

     – “weight” has 1 degree of freedom.

     – “bmp” has 1 degree of freedom.

     – “fev1” has 1 degree of freedom.

     – “Residuals” have 20 degrees of freedom.

3. Sum Sq (Sum of Squares):

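   For each independent variable, this is the reduction in the residual sum of squares when that variable is added to a model that already contains the variables listed above it. For the residuals, it is the sum of the squared differences between the observed and the predicted values.

    For example, for “age,” the sum of squares is 10098.5.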

4. Mean Sq (Mean Sum of Squares):

   The mean square is obtained by dividing the sum of squares by the degrees of freedom.

   For example, for “age,” the mean square is 10098.5 / 1 = 10098.5.

5. F value (F-statistic):

   The F-statistic in each row tests whether adding that variable to the variables listed above it explains a significant amount of variation in the dependent variable. It’s calculated by dividing the mean square for the variable by the mean square for the residuals (547.7).

   – For “age,” the F value is 18.4385.

6. Pr(>F) (p-value):

   The p-value associated with each F-statistic tells you whether that variable adds significantly to a model containing the variables listed above it. It does not test the model as a whole.

   For “age,” the p-value is very small (0.0003538). Since “age” is entered first, this says that age on its own explains a significant share of the variation in cystfibr$pemax. The ‘***’ next to it marks a p-value below 0.001.

7. Signif. Codes:

   These codes provide a quick way to read off the range each p-value falls in: ‘***’ means p < 0.001, ‘**’ means p < 0.01, ‘*’ means p < 0.05, ‘.’ means p < 0.1, and a blank means p ≥ 0.1.
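The columns of the ANOVA table can be rebuilt from the Sum Sq and Df columns alone, which also shows how the table connects to the R-squared and overall F-statistic in the summary() output. This is a minimal sketch using values copied from the table above; the variable names are my own.

```r
# Values copied from the anova() table above
ss <- c(age = 10098.5, weight = 945.2, bmp = 2379.7, fev1 = 2455.6)
df <- c(age = 1, weight = 1, bmp = 1, fev1 = 1)
ss_res <- 10953.7
df_res <- 20

ms     <- ss / df                                   # Mean Sq = Sum Sq / Df
ms_res <- ss_res / df_res                           # residual mean square
f_val  <- ms / ms_res                               # F value for each row
p_val  <- pf(f_val, df, df_res, lower.tail = FALSE) # Pr(>F) for each row

print(round(cbind(MeanSq = ms, F = f_val, p = p_val), 4))

# The model sums of squares add up to the explained variation
r2        <- sum(ss) / (sum(ss) + ss_res)           # multiple R-squared
f_overall <- (sum(ss) / sum(df)) / ms_res           # overall F on 4 and 20 df
print(round(c(R2 = r2, F_overall = f_overall), 4))
```

The last two numbers should match the Multiple R-squared (0.5918) and F-statistic (7.248) reported by summary() up to rounding, which confirms that the overall model test lives in the summary output rather than in any single row of this table.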

Interpretation:

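 The ANOVA table does not report a single F-test for the overall regression model. That test appears in the summary() output: F = 7.248 on 4 and 20 degrees of freedom, p-value 0.0008891, so the model with “age,” “weight,” “bmp,” and “fev1” is statistically significant as a whole. The p-value 0.0003538 (indicated by ‘***’) belongs to “age” alone.

  Among the individual independent variables, tested in the order they were entered:

  “age” is highly significant when entered first (p-value: 0.0003538).

  “weight” adds nothing significant once “age” is in the model (p-value: 0.2038195).

  “bmp” falls just short of the 0.05 level after “age” and “weight” (p-value: 0.0501483, indicated by ‘.’).

  “fev1” is significant at the 0.05 level after the other three (p-value: 0.0469468, indicated by ‘*’).

In summary, the regression model is statistically significant, but the ANOVA table and the coefficient table answer different questions. “Age” looks like the strongest predictor in the ANOVA table because it is entered first and gets credit for the variation it shares with “weight” and the other predictors. In the coefficient table, where each variable is adjusted for all the others, “age” is not significant (p-value: 0.31389) while “weight,” “bmp,” and “fev1” are. Only the last row, “fev1,” gives the same p-value in both tables (0.047).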
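Because anova() on one lm fit tests the terms in the order they were entered, refitting the same model with the predictors reversed is a quick way to see the order dependence. The sketch below assumes the ISwR package is installed; the object names fit and fit_rev are mine. It also names the columns directly in the formula, which works through data = and makes attach() unnecessary.

```r
library(ISwR)  # provides the cystfibr data frame

# Same model twice; only the order of the predictors differs
fit     <- lm(pemax ~ age + weight + bmp + fev1, data = cystfibr)
fit_rev <- lm(pemax ~ fev1 + bmp + weight + age, data = cystfibr)

print(anova(fit))      # sequential tests with age entered first
print(anova(fit_rev))  # sequential tests with age entered last

# Marginal F tests: each term is dropped from the full model in turn,
# so the result does not depend on the order of the terms
print(drop1(fit, test = "F"))
```

With “age” entered last, its row tests age after adjusting for the other three predictors, which is the same question the t-test in the coefficient table answers. The drop1() p-values should likewise match the Pr(>|t|) column of summary(fit), since for a single-degree-of-freedom term the F value is the square of the t value.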

# 9.2
# The secher data (> data("secher")) are best analyzed after log-transforming birth weight as well as the abdominal and biparietal diameters.
# Fit a prediction equation for birth weight.
# How much is gained by using both diameters in a prediction equation? The sum of the two regression coefficients is almost exactly 3.
# Can this be given a nice interpretation to our analysis? Please provide step by step on your analysis and code you use to find out the result.
                
# Model10 <- lm(log(bwt) ~ I(log(ad)), data = secher)
# summary(Model10)
data("secher")
Model10 <- lm(log(bwt) ~ log(bpd) + log(ad), data=secher)
summary(Model10)

##
## Call:
## lm(formula = log(bwt) ~ log(bpd) + log(ad), data = secher)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.35074 -0.06741 -0.00792  0.05750  0.36360
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -5.8615     0.6617  -8.859 2.36e-14 ***
## log(bpd)      1.5519     0.2294   6.764 8.09e-10 ***
## log(ad)       1.4667     0.1467   9.998  < 2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 0.1068 on 104 degrees of freedom
## Multiple R-squared:  0.8583, Adjusted R-squared:  0.8556
## F-statistic: 314.9 on 2 and 104 DF,  p-value: < 2.2e-16

Let us analyze the regression results step by step and answer the questions:

Step 1: Create the Regression Model

Multiple linear regression model (Model10) with the following formula:

log(bwt) ~ log(bpd) + log(ad)

This model uses the natural logarithm of birth weight (log(bwt)) as the dependent variable and the natural logarithms of both “bpd” and “ad” as independent variables.

Step 2: Interpret the Coefficients and Statistics

Now, let us interpret the coefficients and statistics from the summary of Model10:

Coefficients:

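  Intercept: The intercept coefficient represents the estimated value of log(bwt) when both log(bpd) and log(ad) are equal to zero, that is, when both diameters equal 1 in their original units. It is estimated to be approximately -5.8615, but it has no practical meaning on its own because such diameters lie far outside the data.

  log(bpd): The coefficient for log(bpd) is approximately 1.5519. This means that, holding “ad” constant, a one-unit increase in the natural logarithm of “bpd” is associated with an increase of approximately 1.5519 units in log(bwt). Put differently, a 1% increase in “bpd” goes with roughly a 1.55% increase in birth weight.

  log(ad): The coefficient for log(ad) is approximately 1.4667. This means that, holding “bpd” constant, a one-unit increase in the natural logarithm of “ad” is associated with an increase of approximately 1.4667 units in log(bwt), or roughly a 1.47% increase in birth weight for a 1% increase in “ad”.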

Residual Standard Error: The residual standard error is approximately 0.1068. It measures the typical size of the residuals, which are the differences between the observed and predicted values. Smaller values indicate a better fit of the model to the data.

Multiple R-squared: The multiple R-squared value is 0.8583, indicating that approximately 85.83% of the variability in log(bwt) is explained by the model.

F-statistic: The F-statistic tests the overall significance of the model. With a very low p-value (< 2.2e-16), the model is highly statistically significant, suggesting that at least one of the independent variables is significant in predicting the natural logarithm of birth weight.
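Since the model is fitted on the log scale, turning it into a prediction equation for birth weight means exponentiating the fitted value. Here is a minimal sketch using the coefficients printed above; predict_bwt is a hypothetical helper of mine, and the example measurements are illustrative rather than taken from the data. I am assuming the units documented in ISwR (grams for bwt, millimetres for bpd and ad).

```r
# Coefficients copied from summary(Model10) above
b0    <- -5.8615
b_bpd <- 1.5519
b_ad  <- 1.4667

# Hypothetical helper: predicted birth weight from the two diameters.
# exp() undoes the log transform of the response.
predict_bwt <- function(bpd, ad) {
  exp(b0 + b_bpd * log(bpd) + b_ad * log(ad))
}

# Illustrative measurements (not a row of the secher data)
w <- predict_bwt(bpd = 90, ad = 100)
print(round(w))
```

For these inputs the equation gives roughly 2.6 kg. Note that exponentiating a prediction made on the log scale estimates a typical (median-like) weight rather than the arithmetic mean.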

Step 3: Calculate the Sum of Coefficients

To calculate the sum of the two regression coefficients (log(bpd) and log(ad)), add them together:

Sum of coefficients = 1.5519 (log(bpd)) + 1.4667 (log(ad)) = 3.0186
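To see why this sum matters, it helps to write the fitted model back on the original scale. The derivation below uses only the coefficients printed above; c stands for an arbitrary common scaling factor applied to both diameters, a symbol I introduce for illustration.

```latex
\begin{aligned}
\log(\mathrm{bwt}) &= -5.8615 + 1.5519\,\log(\mathrm{bpd}) + 1.4667\,\log(\mathrm{ad}) \\
\mathrm{bwt} &= e^{-5.8615}\;\mathrm{bpd}^{1.5519}\;\mathrm{ad}^{1.4667} \\
\mathrm{bwt}(c\,\mathrm{bpd},\; c\,\mathrm{ad}) &= c^{\,1.5519 + 1.4667}\;\mathrm{bwt}(\mathrm{bpd},\,\mathrm{ad})
  = c^{\,3.0186}\;\mathrm{bwt}(\mathrm{bpd},\,\mathrm{ad})
\end{aligned}
```

So the sum of the two coefficients is the power by which predicted weight grows when both diameters are scaled by the same factor.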

Step 4: Interpretation

 For a prediction equation for birth weight, how much is gained by using both diameters in a prediction equation?

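   Both coefficients are highly significant when the other diameter is already in the model (t = 6.764 for log(bpd) and t = 9.998 for log(ad)), so each diameter carries information about birth weight that the other does not. Putting a number on the gain requires comparing Model10 with the two single-diameter models, for example through their residual standard errors. The sum of the coefficients (approximately 3.0186) has a nice interpretation: on the original scale the model says that birth weight is proportional to bpd^1.55 × ad^1.47, so if both diameters grow by the same factor, the predicted weight grows by that factor to the power of about 3. Weight therefore scales like a volume, the cube of a length, which is what we would expect if babies of different sizes have roughly the same shape and density.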
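One way to quantify the gain is to fit the two single-diameter models and set them beside the two-diameter model. The sketch below assumes the ISwR package is installed; the object names m_ad, m_bpd, m_both, and comparison are mine, and I have not reproduced the output here.

```r
library(ISwR)  # provides the secher data frame

# Candidate prediction models on the log scale
m_ad   <- lm(log(bwt) ~ log(ad), data = secher)
m_bpd  <- lm(log(bwt) ~ log(bpd), data = secher)
m_both <- lm(log(bwt) ~ log(bpd) + log(ad), data = secher)

# Residual standard error and R-squared for each model
comparison <- data.frame(
  model = c("ad only", "bpd only", "both"),
  sigma = c(sigma(m_ad), sigma(m_bpd), sigma(m_both)),
  r2    = c(summary(m_ad)$r.squared,
            summary(m_bpd)$r.squared,
            summary(m_both)$r.squared)
)
print(comparison)

# F test for adding log(bpd) to the ad-only model
print(anova(m_ad, m_both))
```

On the log scale the residual standard error reads approximately as a relative error, so the drop in sigma from the better single-diameter model to the two-diameter model is a direct measure of how much prediction accuracy is gained.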
