The purpose of this project is to analyze data for a prototype car. We are tasked with reviewing the production data for insights into vehicle performance. Using the programming language R, we will fit a multiple linear regression, compute summary statistics, and run t-tests to produce statistical interpretations.
The `MechaCar_mpg.csv` dataset contains mpg test results for 50 prototype MechaCars. The MechaCar prototypes were produced using multiple design specifications to identify ideal vehicle performance. We will fit a multiple linear regression model to predict the `mpg` of MechaCar prototypes. In our analysis we will use the following predictors:
- `AWD`
- `ground_clearance`
- `spoiler_angle`
- `vehicle_weight`
- `vehicle_length`
Multiple linear regression estimates the relationship between a quantitative dependent variable and two or more independent variables by modeling the dependent variable as a linear combination of the independent variables. The mathematical equation we will be using is represented below:
$$ mpg = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n $$

R Code:
mecha <- read.csv("MechaCar_mpg.csv")  # load the mpg test results; assumes the CSV is in the working directory
model <- lm(mpg ~ AWD + ground_clearance + spoiler_angle + vehicle_weight + vehicle_length, data = mecha)  # fit the regression
summary(model)  # coefficients, p-values, and R-squared
Output:
Call:
lm(formula = mpg ~ AWD + ground_clearance + spoiler_angle + vehicle_weight +
vehicle_length, data = mecha)
Residuals:
Min 1Q Median 3Q Max
-19.4701 -4.4994 -0.0692 5.4433 18.5849
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.040e+02 1.585e+01 -6.559 5.08e-08 ***
AWD -3.411e+00 2.535e+00 -1.346 0.1852
ground_clearance 3.546e+00 5.412e-01 6.551 5.21e-08 ***
spoiler_angle 6.877e-02 6.653e-02 1.034 0.3069
vehicle_weight 1.245e-03 6.890e-04 1.807 0.0776 .
vehicle_length 6.267e+00 6.553e-01 9.563 2.60e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.774 on 44 degrees of freedom
Multiple R-squared: 0.7149, Adjusted R-squared: 0.6825
F-statistic: 22.07 on 5 and 44 DF, p-value: 5.35e-11
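Substituting the estimates from the coefficient table above into the regression equation gives the fitted model. This is only a restatement of the output, with coefficients shown to four significant figures, and not an additional result:

$$ \widehat{mpg} = -104.0 - 3.411\,\text{AWD} + 3.546\,\text{ground\_clearance} + 0.06877\,\text{spoiler\_angle} + 0.001245\,\text{vehicle\_weight} + 6.267\,\text{vehicle\_length} $$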
In our analysis, we can see that `ground_clearance` and `vehicle_length` have p-values of 5.21e-08 and 2.60e-12, respectively. Since both values are smaller than the significance level of 0.05, both variables are statistically significant and we reject the null hypothesis that their slopes are zero. In other words, these variables account for a non-random amount of variance in the `mpg` values in the dataset. The intercept is also statistically significant (p-value of 5.08e-08), while `AWD`, `spoiler_angle`, and `vehicle_weight` are not significant at the 0.05 level. Additionally, the R-squared value of this multiple linear regression model is 0.71 (adjusted R-squared of 0.68). The coefficient of determination is the proportion of variance in the dependent variable that the model explains, so the model accounts for roughly 71% of the variability in `mpg`, and the overall p-value of 5.35e-11 indicates that the model as a whole is statistically significant. In our case, the model approximates the `mpg` values in the dataset reasonably well.
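The significance screening described above can be sketched in R. The p-values are copied from the regression output rather than recomputed, and the names `p_values`, `alpha`, and `adj_r2` are illustrative, not part of the original analysis. The last step recomputes the adjusted R-squared from the reported (rounded) R-squared, assuming n = 50 observations and 5 predictors as in the output.

```r
# p-values copied from the summary() output above (not recomputed here)
p_values <- c(
  "(Intercept)"    = 5.08e-08,
  AWD              = 0.1852,
  ground_clearance = 5.21e-08,
  spoiler_angle    = 0.3069,
  vehicle_weight   = 0.0776,
  vehicle_length   = 2.60e-12
)

alpha <- 0.05  # significance level used throughout this analysis

# Terms whose p-value falls below alpha
significant <- names(p_values)[p_values < alpha]
print(significant)

# Adjusted R-squared from the reported R-squared:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.7149
n  <- 50  # prototypes in the dataset
p  <- 5   # predictors in the model
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The recomputed adjusted R-squared should agree with the value in the output up to rounding, which is a quick way to confirm that the degrees of freedom were read correctly.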
The MechaCar `Suspension_Coil.csv` dataset contains the results from multiple production lots. In this dataset, the weight capacities of multiple suspension coils were tested to determine whether the manufacturing process is consistent across production lots. We created summary statistics for the suspension coils' PSI values across all manufacturing lots combined and for each lot individually.
R Code:
library(dplyr)  # provides %>%, group_by(), and summarize()
total_summary <- suspension %>% summarize(Mean = mean(PSI), Median = median(PSI), Variance = var(PSI), SD = sd(PSI), .groups = 'keep')
Mean | Median | Variance | SD |
---|---|---|---|
1498.78 | 1500 | 62.29356 | 7.892627 |
In the summary of the entire suspension coil dataset, the mean PSI is 1498.78 and the median PSI is 1500. The variance is 62.29 and the standard deviation is 7.89. Since the overall variance does not exceed the design limit of 100 pounds per square inch, the suspension coils as a whole meet the MechaCar design specifications.
R Code:
lot_summary <- suspension %>% group_by(Manufacturing_Lot) %>% summarize(Mean=mean(PSI),Median=median(PSI),Variance=var(PSI),SD=sd(PSI), .groups = 'keep')
Lot | Mean | Median | Variance |
---|---|---|---|
Lot1 | 1500.00 | 1500.00 | 0.9795918 |
Lot2 | 1500.20 | 1500.00 | 7.4693878 |
Lot3 | 1496.14 | 1498.50 | 170.2861224 |
In the summary of the individual lots, the means and medians are similar across lots, although Lot 3 is slightly lower on both. The variance is 0.98 for Lot 1, 7.47 for Lot 2, and 170.29 for Lot 3. Since the variances for Lot 1 and Lot 2 are below 100 pounds per square inch, these lots meet the design specifications for the MechaCar suspension coils. However, the Lot 3 variance exceeds 100 pounds per square inch, so Lot 3 does not meet the design specifications.
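The per-lot check against the variance limit can be sketched as follows. The variances are copied from the table above; `variance_limit` and `lot_check` are illustrative names, and the standard deviations are derived here as the square root of each variance rather than taken from the original output.

```r
# Per-lot variances copied from the lot summary table above
lot_check <- data.frame(
  Lot      = c("Lot1", "Lot2", "Lot3"),
  Variance = c(0.9795918, 7.4693878, 170.2861224),
  stringsAsFactors = FALSE
)

variance_limit <- 100  # design limit stated in the text

lot_check$SD         <- sqrt(lot_check$Variance)             # derived, not from the table
lot_check$Meets_Spec <- lot_check$Variance <= variance_limit  # TRUE when within the limit
print(lot_check)
```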
To further our analysis, we will perform one-sample t-tests to determine whether the PSI across all manufacturing lots, and for each lot individually, is statistically different from the population mean of 1,500 pounds per square inch. The mathematical equation is represented below:
$$ t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}} $$

R Code:
t.test((suspension$PSI),mu=1500)
Output:
One Sample t-test
data: (suspension$PSI)
t = -1.8931, df = 149, p-value = 0.06028
alternative hypothesis: true mean is not equal to 1500
95 percent confidence interval:
1497.507 1500.053
sample estimates:
mean of x
1498.78
Based on the t-test that compares all manufacturing lots against the population mean PSI of 1,500, the p-value is 0.06, which is greater than our significance level of 0.05. The result is not statistically significant, so we fail to reject the null hypothesis: there is not enough evidence that the observed sample mean (1498.78) differs from the presumed population mean.
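As a cross-check, the t statistic and p-value reported for all lots can be reproduced from the summary statistics alone using the formula for t. This is a minimal sketch: the mean and standard deviation are the values from the summary table, and the variable names are illustrative.

```r
# Summary statistics for all lots combined (from the summary table)
x_bar <- 1498.78   # sample mean PSI
s     <- 7.892627  # sample standard deviation
n     <- 150       # df = 149 in the t-test output, so n = 150
mu    <- 1500      # presumed population mean

t_stat  <- (x_bar - mu) / (s / sqrt(n))      # one-sample t statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value

print(round(t_stat, 4))
print(round(p_value, 4))
```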
R Code:
t.test(subset(suspension, Manufacturing_Lot == "Lot1")$PSI, mu=1500)
Output:
One Sample t-test
data: subset(suspension, Manufacturing_Lot == "Lot1")$PSI
t = 0, df = 49, p-value = 1
alternative hypothesis: true mean is not equal to 1500
95 percent confidence interval:
1499.719 1500.281
sample estimates:
mean of x
1500
Based on the t-test that compares manufacturing Lot 1 against the population mean PSI of 1,500, the p-value is 1, which is greater than our significance level of 0.05. The result is not statistically significant, so we fail to reject the null hypothesis: the Lot 1 sample mean (1500) is not distinguishable from the presumed population mean.
R Code:
t.test(subset(suspension, Manufacturing_Lot == "Lot2")$PSI, mu=1500)
Output:
One Sample t-test
data: subset(suspension, Manufacturing_Lot == "Lot2")$PSI
t = 0.51745, df = 49, p-value = 0.6072
alternative hypothesis: true mean is not equal to 1500
95 percent confidence interval:
1499.423 1500.977
sample estimates:
mean of x
1500.2
Based on the t-test that compares manufacturing Lot 2 against the population mean PSI of 1,500, the p-value is 0.61, which is greater than our significance level of 0.05. The result is not statistically significant, so we fail to reject the null hypothesis: there is not enough evidence that the Lot 2 sample mean (1500.2) differs from the presumed population mean.
R Code:
t.test(subset(suspension, Manufacturing_Lot == "Lot3")$PSI, mu=1500)
Output:
One Sample t-test
data: subset(suspension, Manufacturing_Lot == "Lot3")$PSI
t = -2.0916, df = 49, p-value = 0.04168
alternative hypothesis: true mean is not equal to 1500
95 percent confidence interval:
1492.431 1499.849
sample estimates:
mean of x
1496.14
Based on the t-test that compares manufacturing Lot 3 against the population mean PSI of 1,500, the p-value is 0.04, which is smaller than our significance level of 0.05. The result is statistically significant, so we reject the null hypothesis: the Lot 3 sample mean (1496.14) differs from the presumed population mean.
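The three lot-level tests can also be compared side by side. The sketch below recomputes each t statistic and two-sided p-value from the per-lot means and variances reported earlier, assuming 50 coils per lot (df = 49 in each output). It works from the summary values rather than the raw data, so small rounding differences from the `t.test()` output are possible; the column names are illustrative.

```r
# Per-lot summary values copied from the lot summary table
lots <- data.frame(
  Lot      = c("Lot1", "Lot2", "Lot3"),
  Mean     = c(1500.00, 1500.20, 1496.14),
  Variance = c(0.9795918, 7.4693878, 170.2861224),
  stringsAsFactors = FALSE
)

n     <- 50    # coils per lot (df = 49 in each t-test output)
mu    <- 1500  # presumed population mean
alpha <- 0.05  # significance level

lots$t           <- (lots$Mean - mu) / sqrt(lots$Variance / n)  # one-sample t per lot
lots$p_value     <- 2 * pt(-abs(lots$t), df = n - 1)            # two-sided p-value
lots$Significant <- lots$p_value < alpha
print(lots)
```

Only the lot flagged in `Significant` differs from the population mean at the 0.05 level, and it is the same lot that exceeded the variance limit.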
To further our analysis, we will design a statistical study to compare the vehicle performance of the MechaCar vehicles against vehicles from other manufacturers.
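Since we are determining whether the means of two samples (MechaCar vs. competitors) are statistically different, we will use a two-sample t-test for this analysis. We would run the test twice, once for each dependent variable: fuel efficiency and vehicle price. In each case, the null hypothesis is that the MechaCar mean equals the competitor mean, and the alternative hypothesis is that the two means differ.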
$$ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$

Since data would be difficult to obtain before the release of competitor cars, we can collect or scrape the data from outside sources on previous car models in the same class as the MechaCar. The data could potentially be found through competitor webpages since the information is public after the car's release. We would need the fuel efficiency and vehicle price data from the competitors.
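To make the proposed study concrete, the sketch below runs the two-sample test on simulated placeholder data, since no competitor data exists in this project. All numbers (means, standard deviations, sample sizes) are made up purely for illustration and say nothing about real vehicles. R's `t.test()` uses the Welch (unequal variance) form by default, which matches the statistic above when the hypothesized difference in means is zero.

```r
set.seed(42)  # reproducible placeholder data

# HYPOTHETICAL fuel efficiency samples; replace with collected data
mechacar_mpg   <- rnorm(50, mean = 45, sd = 15)
competitor_mpg <- rnorm(60, mean = 40, sd = 12)

# Welch two-sample t-test (var.equal = FALSE is the default)
result <- t.test(mechacar_mpg, competitor_mpg)

# Same statistic computed directly from the formula, with mu_1 - mu_2 = 0
manual_t <- (mean(mechacar_mpg) - mean(competitor_mpg)) /
  sqrt(var(mechacar_mpg) / length(mechacar_mpg) +
       var(competitor_mpg) / length(competitor_mpg))

print(result)
print(manual_t)
```

The same call would be repeated with vehicle price in place of fuel efficiency to cover the second dependent variable.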