Statistical Analysis of Data

Learning Objectives

Conducting t-tests

Basic model fitting

Structure of the Vitamin Data

This datasets contains information about actual intake and daily recommended intake for calories and various other vitamin. The relationship is to be investigated across gender and the potential impact of BMI. Each row holds information for a single person, and the columns represent:

Column	Description
RowID	Unique row id for each person
Gender	Gender of person
BMI	Body mass index
<variable>_Actual	Actual amount for <variable>
<variable>_DRI	Daily recommended intake for <variable>

Download Vitamin Data to Your Local Machine

Link: Data

Data Preparation

vitamin <- read.csv('data/vitamin.csv', header=TRUE, sep=",")

Using head() function to view first few rows of data.

head(vitamin)

#>   RowID Gender  BMI Energy_Actual VitaminA_Actual VitaminC_Actual
#> 1     1      F 18.0          1330             206              15
#> 2     2      F 25.0          1792             469              59
#> 3     3      F 21.6          1211             317              18
#> 4     4      F 23.9          1072             654              24
#> 5     5      F 24.3          1534             946             118
#> 6     6      F 23.5          1893             321              35
#>   Calcium_Actual Iron_Actual Energy_DRI VitaminA_DRI VitaminC_DRI
#> 1            827          22       1604          700           65
#> 2            900          12       2011          700           65
#> 3            707           7       2242          700           75
#> 4            560          11       1912          700           75
#> 5            851          12       1895          700           65
#> 6           1024          16       2114          700           75
#>   Calcium_DRI Iron_DRI
#> 1        1300       15
#> 2        1300       15
#> 3        1000        8
#> 4        1000       18
#> 5        1300       15
#> 6        1000       18

The initial investigation will focus on Energy (i.e. Calorie intake). The select() function from the dplyr package will be used to create a smaller data.frame from which analysis will be done.

library(dplyr)

#> 
#> Attaching package: 'dplyr'

#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag

#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

vitamin_energy <- select(vitamin, RowID, Gender, BMI, Energy_Actual, Energy_DRI)

The resulting vitamin_energy data.frame from the above command.

head(vitamin_energy)

#>   RowID Gender  BMI Energy_Actual Energy_DRI
#> 1     1      F 18.0          1330       1604
#> 2     2      F 25.0          1792       2011
#> 3     3      F 21.6          1211       2242
#> 4     4      F 23.9          1072       1912
#> 5     5      F 24.3          1534       1895
#> 6     6      F 23.5          1893       2114

Goal: Compare actual intake to daily recommended intake

The goal is to compare the actual intake to the daily recommended intake on an individual-by-individual level. The mutate() function will be used to obtain a difference column for each individual.

\[Difference = ( Energy\_Actual - Energy\_DRI )\]

vitamin_energy <- vitamin_energy %>% mutate('Difference' = Energy_Actual - Energy_DRI)

The differences for the first few observations.

head(vitamin_energy)

#>   RowID Gender  BMI Energy_Actual Energy_DRI Difference
#> 1     1      F 18.0          1330       1604       -274
#> 2     2      F 25.0          1792       2011       -219
#> 3     3      F 21.6          1211       2242      -1031
#> 4     4      F 23.9          1072       1912       -840
#> 5     5      F 24.3          1534       1895       -361
#> 6     6      F 23.5          1893       2114       -221

Understanding the Difference Variable

What does a negative difference mean in the context of this problem?

What would a difference near 0 imply?

Comparing Differences

The following use of piping and the filter() function can be used to obtain the number of individuals that are taking in less calories than their daily recommended intake.

vitamin_energy %>% filter(Difference < 0) %>% summarize('n' = n())

#>    n
#> 1 55

There are a total of \(71\) observations in this data.frame thus, about \(55/71 = 77\%\) of individuals in this dataset are not taking in enough calories relative to their daily recommended intake.

Using ggplot() to obtain a plot of the difference column for all individuals.

  library(ggplot2)

#> Warning: package 'ggplot2' was built under R version 3.3.3

  ggplot(data=vitamin_energy, aes(x="",y=Difference)) + geom_jitter(width=0.03) + geom_hline(yintercept=0) + labs(x="")

Getting the basic descriptive summaries using summarize().

    vitamin_energy %>% summarize('Mean Difference' = mean(Difference), 'Std'=sd(Difference), 'n'=n())

#>   Mean Difference      Std  n
#> 1       -346.0563 486.3541 71

These summaries can be used to conduct the following statistical test. This test will statistically determine whether or not the average difference across all WSU students is less than zero (i.e. on average are WSU students not taking in enough calories).

\[\begin{array} {l} H_{O}: \mu_{Difference} = 0 \\ H_{A}: \mu_{Difference} < 0 \\ \end{array}\]

The test statistic for this particular statistical test is computed here.

\[\frac{(\bar{x} - \mu)}{\frac{s}{\sqrt{n}}} = \frac{(-346.1 - 0)}{\frac{486.4}{\sqrt{71}}} = -5.996\]

The apprpriate t-test can be done directly in R as follows.

   t.test(vitamin_energy$Difference, mu=0, alternative=c('less'))

#> 
#>  One Sample t-test
#> 
#> data:  vitamin_energy$Difference
#> t = -5.9955, df = 70, p-value = 3.99e-08
#> alternative hypothesis: true mean is less than 0
#> 95 percent confidence interval:
#>       -Inf -249.8427
#> sample estimates:
#> mean of x 
#> -346.0563

Comparing Differences Across Gender

Using ggplot() to obtain a plot of the difference column for all individuals by Gender.

  ggplot(data=vitamin_energy, aes(x=Gender,y=Difference, color=Gender)) + geom_boxplot() + geom_jitter(width=0.05) + geom_hline(yintercept=0)

Getting the basic descriptive summaries using summarize() by Gender.

    vitamin_energy %>% group_by(Gender) %>% summarize('Mean Difference' = mean(Difference), 'Std'=sd(Difference), 'n'=n())

#> # A tibble: 2 × 4
#>   Gender `Mean Difference`      Std     n
#>   <fctr>             <dbl>    <dbl> <int>
#> 1      F         -391.8596 432.1122    57
#> 2      M         -159.5714 650.1473    14

The statistical test here would be to compare the average deficiency in calorie intake across genders.

\[\begin{array} {l} H_{O}: \mu_{Difference:Females} = \mu_{Difference:Males}\\ H_{A}: \mu_{Difference:Females} \ne \mu_{Difference:Males} \\ \end{array}\]

Getting the basic descriptive summaries using summarize() by Gender.

    FemaleData <- vitamin_energy %>% filter(Gender=="F") %>% select(Difference)
    MaleData <- vitamin_energy %>% filter(Gender=="M") %>% select(Difference)
    t.test(FemaleData, MaleData, alternative=c("two.sided"))

#> 
#>  Welch Two Sample t-test
#> 
#> data:  FemaleData and MaleData
#> t = -1.2697, df = 15.93, p-value = 0.2224
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -620.2472  155.6708
#> sample estimates:
#> mean of x mean of y 
#> -391.8596 -159.5714

Impact of BMI on Calorie Differences

Consider the following plot that allows one to understand the relationship between BMI and \((Energy\_Actual - Energy\_DRI)\) across Gender.

  ggplot(data=vitamin_energy, aes(x=BMI, y=Difference, color=Gender)) + geom_point(aes(color=Gender)) + geom_hline(yintercept = 0) + geom_smooth(se=FALSE) + labs(y="Calorie Difference")

#> `geom_smooth()` using method = 'loess'

Understanding Relationship between BMI and Differences

What general trends exist for female students?

What is the impact of BMI for females students?

Is the impact of BMI the same across Gender?

Impact of BMI on Vitamin A

 vitaminA_difference <- vitamin %>% mutate('Difference' = VitaminA_Actual - VitaminA_DRI) 

 ggplot(vitaminA_difference, aes(x=BMI, y=Difference, color=Gender)) + geom_point(aes(color=Gender)) + geom_hline(yintercept = 0) + geom_smooth(se=FALSE) + labs(y="VitaminA Difference")

#> `geom_smooth()` using method = 'loess'

Impact of BMI on Vitamin C

 vitaminC_difference <- vitamin %>% mutate('Difference' = VitaminC_Actual - VitaminC_DRI) 

 ggplot(vitaminC_difference, aes(x=BMI, y=Difference, color=Gender)) + geom_point(aes(color=Gender)) + geom_hline(yintercept = 0) + geom_smooth(se=FALSE) + labs(y="VitaminC Difference")

#> `geom_smooth()` using method = 'loess'

Impact of BMI on Calcium

 Calcium_difference <- vitamin %>% mutate('Difference' = Calcium_Actual - Calcium_DRI) 

 ggplot(Calcium_difference, aes(x=BMI, y=Difference, color=Gender)) + geom_point(aes(color=Gender)) + geom_hline(yintercept = 0) + geom_smooth(se=FALSE) + labs(y="Calcium Difference")

#> `geom_smooth()` using method = 'loess'

Impact of BMI on Iron

 Iron_difference <- vitamin %>% mutate('Difference' = Iron_Actual - Iron_DRI) 

 ggplot(Iron_difference, aes(x=BMI, y=Difference, color=Gender)) + geom_point(aes(color=Gender)) + geom_hline(yintercept = 0) + geom_smooth(se=FALSE) + labs(y="Iron Difference")

#> `geom_smooth()` using method = 'loess'

Example Regression Model

The smoothing function used in these plots can be changed. For example, the following will smooth the scatterplot using a linear regression equation.

 ggplot(Iron_difference, aes(x=BMI, y=Difference, color=Gender)) + geom_point(aes(color=Gender)) + geom_hline(yintercept = 0) + geom_smooth(method="lm",se=FALSE) + labs(y="Iron Difference")

The analogous linear regressio model can be fit in R as follows. The summary() function provides a summary of the resulting fit.

 lmfit <- lm(Difference ~ BMI + Gender + BMI*Gender, data=Iron_difference)
 summary(lmfit)

#> 
#> Call:
#> lm(formula = Difference ~ BMI + Gender + BMI * Gender, data = Iron_difference)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -10.333  -4.373  -1.792   2.457  24.809 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  14.2589    10.6377   1.340    0.185
#> BMI          -0.7142     0.4744  -1.505    0.137
#> GenderM     -10.3763    21.2704  -0.488    0.627
#> BMI:GenderM   0.9563     0.9576   0.999    0.322
#> 
#> Residual standard error: 7.455 on 67 degrees of freedom
#> Multiple R-squared:  0.2824, Adjusted R-squared:  0.2503 
#> F-statistic:  8.79 on 3 and 67 DF,  p-value: 5.398e-05

Page build on: 2017-08-15 08:36:26

Data Carpentry, 2017. License. Contributing.
Questions? Feedback? Please file an issue on GitHub.
On Twitter: @datacarpentry