Logistic regression is a statistical method that predicts the probability of an event occurring by fitting the data to a logistic curve. This regression analysis predicts the outcome of a categorical dependent variable based on one or more predictor variables. The logistic function models the probability of a possible outcome of a single trial as a function of the explanatory variables. The dependent variable in a logistic regression can be binary (e.g., 1/0, yes/no, pass/fail), nominal (e.g., blue/yellow/green), or ordinal (e.g., satisfied/neutral/dissatisfied). The independent variables can be either continuous or discrete.

The logistic function is f(z) = 1 / (1 + e^(−z)), where z can be any value ranging from negative infinity to positive infinity.

The value of f(z) ranges from 0 to 1, which matches exactly the nature of probability (i.e., 0 ≤ P ≤ 1).

Logistic Regression Equation

Based on the logistic function, we define f(z) as the probability of an event occurring, where z is the weighted sum of the significant predictive variables.

Another way of representing f(z) is to replace z with the weighted sum of the predictive variables. In that form, Y is the probability of an event occurring and the x's are the significant predictors.
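The mapping from the weighted sum z to a probability can be sketched in Python (a minimal illustration; the intercept and weights below are hypothetical, purely for demonstration):

```python
import math

def logistic(z):
    """Logistic function: maps any real-valued z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# z is the weighted sum of the predictors; the intercept b0 and weights
# b1, b2 here are made-up values for illustration only.
def event_probability(x1, x2, b0=-1.0, b1=0.5, b2=0.25):
    z = b0 + b1 * x1 + b2 * x2
    return logistic(z)

print(logistic(0))              # 0.5: z = 0 maps to probability 0.5
print(event_probability(3, 2))  # probability for a hypothetical observation
```

Note how the output always lies strictly between 0 and 1, matching the nature of probability described above.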

Notes:

- When building the regression model, we use the actual Y, which is discrete (e.g., binary, nominal, ordinal).
- After the model is built, the fitted Y calculated by the logistic regression equation is a probability ranging from 0 to 1. To transform the probability back to a discrete value, we need input from subject matter experts (SMEs) to select the probability cut point.
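Applying a cut point to turn the fitted probability back into a discrete class can be sketched as follows (the 0.5 cut point is only an illustrative default; in practice the value comes from SME input):

```python
def classify(probability, cut_point=0.5):
    """Map a fitted probability back to a discrete outcome."""
    return "pass" if probability >= cut_point else "fail"

print(classify(0.73))                 # pass with the default cut point
print(classify(0.73, cut_point=0.8))  # fail under a stricter SME-chosen cut point
```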

The logistic curve for binary logistic regression with one continuous predictor is illustrated by the following Figure.

Odds are the probability of an event occurring divided by the probability of the event not occurring: Odds = P / (1 − P).

Odds range from 0 to positive infinity.

Probability can be calculated from the odds: P = Odds / (1 + Odds).

Because probability can be expressed by the odds, and we can express probability through the logistic function, we can equate probability, odds, and ultimately the sum of the independent variables.

Since in the logistic regression model P = 1 / (1 + e^(−z)), therefore Odds = P / (1 − P) = e^(z), and ln(Odds) = z; the log-odds equals the weighted sum of the independent variables.
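The identity Odds = e^z can be checked numerically with a quick sketch:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 1.7               # an arbitrary weighted sum of predictors
p = logistic(z)
odds = p / (1.0 - p)  # odds of the event occurring

print(round(odds, 6))            # matches e^z
print(round(math.exp(z), 6))
print(round(math.log(odds), 6))  # the log-odds recovers z
```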

- Binary Logistic Regression
  - Binary response variable
  - Example: yes/no, pass/fail, female/male
- Nominal Logistic Regression
  - Nominal response variable
  - Example: set of colors, set of countries
- Ordinal Logistic Regression
  - Ordinal response variable
  - Example: satisfied/neutral/dissatisfied

All three logistic regression models can use multiple continuous or discrete independent variables and can be developed in SigmaXL using the same steps.

We want to build a logistic regression model using the potential factors to predict the probability that the person measured is female or male.

Data File: “Logistic Regression” tab in “Sample Data.xlsx”

- Response (Y): Female/Male
- Potential Factors (Xs):
  - Age
  - Weight
  - Oxy
  - Runtime
  - RunPulse
  - RstPulse
  - MaxPulse

Step 1:

- Select the entire range of data (“Name”, “Sex”, “Age”, “Weight”, “Oxy”, “Runtime”, “RunPulse”, “RstPulse”, “MaxPulse” columns)
- Click SigmaXL -> Statistical Tools -> Regression -> Binary Logistic Regression
- A new window named “Binary Logistic Regression” appears with the selected range of data appearing in the box under “Please select your data”

- Click “Next>>”
- A new window also called “Binary Logistic Regression” pops up.
- Select “Sex” as the “Binary Response (Y)”

Select “Age”, “Weight”, “Oxy”, “Runtime”, “RunPulse”, “RstPulse”, “MaxPulse” as the “Continuous Predictors (X)”.

- The reference event is set as “M” by default.
- Click “OK”

Step 2:

- Check the p-values of all the independent variables in the model.
- Remove one insignificant independent variable at a time from the model and rerun the model.
- Repeat the two sub-steps above until all of the independent variables in the model are statistically significant.

Since the p-values of all the independent variables are higher than the alpha level (0.05), we need to remove the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. Runtime has the highest p-value (0.9897), so it would be removed from the model first. Re-run the binary logistic regression but this time exclude Runtime from the “Continuous Predictors (X)” in the Binary Logistic Regression dialog box.

After removing Runtime from the model, the p-values of all the independent variables are still higher than the alpha level (0.05). We need to continue removing the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. Age has the highest p-value (0.9773), so it would be removed from the model next.

After removing Age from the model, the p-values of all the independent variables are still higher than the alpha level (0.05). We need to continue removing the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. RstPulse has the highest p-value (0.8017) so it would be removed from the model next.

After removing RstPulse from the model, the p-values of all the independent variables are still higher than the alpha level (0.05). We need to continue removing the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. Weight has the highest p-value (0.242), so it would be removed from the model next.

After removing Weight from the model, the p-values of all the independent variables are still higher than the alpha level (0.05). We need to continue removing the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. RunPulse has the highest p-value (0.1604), so it would be removed from the model next.

After removing RunPulse from the model, the p-values of all the independent variables are still higher than the alpha level (0.05). We need to continue removing the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. MaxPulse has the highest p-value (0.2290), so it would be removed from the model next.

After removing MaxPulse from the model, the p-value of the only remaining independent variable, Oxy, is just above the alpha level (0.05). There is no need to remove Oxy from the model; we will accept the slightly elevated risk of falsely rejecting the null hypothesis at this p-value (0.0556). But before we do that, let’s check the validity of the model as a whole.

Step 3:

Analyze the binary logistic report and check the performance of the logistic regression model. Although the p-value here is slightly greater than the alpha level (0.05), we accept the slight risk discussed above and conclude that at least one of the slope coefficients is not equal to zero. The pseudo R-squared is 10.55%; the R-squared of a logistic regression is in general lower than that of a traditional multiple linear regression model. The p-value of the lack-of-fit test is higher than the alpha level (0.05), so we conclude that the model fits the data. Also, 62.50% of the predicted outcomes match the observed outcomes.

Step 4: Enter the setting of Oxy into the cell highlighted in yellow and the predicted event probability appears automatically. In this case, if we set the Oxy value to 50, the probability that the person measured is male is 41%.

- Data transformations are usually applied so that the data more closely meet the assumptions of a statistical inference procedure to be applied, or to improve the interpretability or appearance of graphs.
- A power transformation is a class of transformation functions that raise the response to some power; for example, a square root transformation converts X to X^(1/2).
- The Box-Cox transformation is a popular power transformation method developed by George E. P. Box and David Cox.

The formula of the Box-Cox transformation is:

y = (x^λ − 1) / λ when λ ≠ 0
y = ln(x) when λ = 0

Where:

- y is the transformation result
- x is the variable under transformation
- λ is the transformation parameter

SigmaXL provides the best Box-Cox transformation with an optimal λ that minimizes the model SSE (sum of squared error). Here is an example of how we transform the non-normally distributed response to normal data using Box-Cox method.
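The transformation itself is simple to sketch in Python (a minimal illustration; SigmaXL's search for the optimal λ is not reproduced here, and λ = 0.12 is taken from this example's SigmaXL output):

```python
import math

def box_cox(x, lam):
    """Box-Cox power transformation of a single positive observation."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(x)  # limiting case as lambda -> 0
    return (x ** lam - 1.0) / lam

# lambda = 0.12 is the optimal value SigmaXL reports for this example's data;
# the x values below are arbitrary sample inputs.
print([round(box_cox(x, 0.12), 4) for x in (1.0, 2.5, 10.0)])
```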

Data File: “Box-Cox” tab in “Sample Data.xlsx”

Step 1: Test the normality of the original data set.

- Select the entire range of “Y” in column H
- Click SigmaXL -> Graphical Tools -> Histograms & Descriptive Statistics
- A new window named “Histograms & Descriptive” pops up and the selected range automatically appears in the box below “Please select your data”.

- Click “Next >>”
- A new window named “Histograms & Descriptive Statistics” pops up.
- Select “Y” as “Numeric Data Variables (Y)”

- Click “OK>>”
- The analysis results are shown automatically in the new spreadsheet “Hist Descript(1)”

Normality Test:

- H₀: The data are normally distributed.
- H₁: The data are not normally distributed.

If p-value > alpha level (0.05), we fail to reject the null hypothesis. Otherwise, we reject the null. In this example, p-value = 0.029 < alpha level (0.05). The data are not normally distributed.

Step 2: Run the Box-Cox Transformation:

- Select the entire range of Y in column H
- Click SigmaXL -> Process Capability -> Nonnormal -> Box-Cox Transformation
- A new window named “Box-Cox Transformation” pops up and the selected range appears automatically in the box under “Please select your data”

- Click “Next >>”
- A new window also named “Box-Cox Transformation” pops up.
- Select “Y” as “Numeric Data Variables (Y)”

- Click “OK>>”
- The analysis results are shown automatically in the new spreadsheet “Box-Cox (1)”

The software looks for the optimal value of lambda that minimizes the SSE (sum of squared error); in this case the optimal lambda is 0.12. The transformed Y is listed in Column G in the newly generated tab “Box-Cox (1)”.

Use the Anderson–Darling test to assess the normality of the transformed data.

- H₀: The data are normally distributed.
- H₁: The data are not normally distributed.

Model summary: If p-value > alpha level (0.05), we fail to reject the null. Otherwise, we reject the null. In this example, p-value = 0.327 > alpha level (0.05). The data are normally distributed.

Multiple linear regression is a statistical technique to model the relationship between one dependent variable and two or more independent variables by fitting the data set to a linear equation.

The difference between simple linear regression and multiple linear regression:

- Simple linear regression has only one predictor.
- Multiple linear regression has two or more predictors.

The multiple linear regression equation takes the form Y = β + α₁X₁ + α₂X₂ + . . . + αₚXₚ + e.

Where:

- Y is the dependent variable (response).
- X₁, X₂, . . ., Xₚ are the independent variables (predictors); there are p predictors in total.

Both dependent and independent variables are continuous.

- β is the intercept, indicating the Y value when all the predictors are zero.
- α₁, α₂, . . ., αₚ are the coefficients of the predictors; they reflect the contribution of each independent variable in predicting the dependent variable.
- e is the residual term, indicating the difference between the actual and the fitted response value.

Case study: We want to see whether the scores in exam one, two, and three have any statistically significant relationship with the score in final exam. If so, how are they related to final exam score? Can we use the scores in exam one, two, and three to predict the score in final exam?

Data File: “Multiple Linear Regression” tab in “Sample Data.xlsx.”

Step 1: Determine the dependent and independent variables; all should be continuous. Y (dependent variable) is the score of the final exam. X₁, X₂, and X₃ (independent variables) are the scores of exams one, two, and three respectively. All X variables are continuous.

Step 2: Start building the multiple linear regression model

- Select the range of independent and dependent variables in Excel.
- Click SigmaXL -> Statistical Tools -> Regression -> Multiple Regression
- A new window named “Multiple Regression” pops up and the selected range appears automatically in the box below “Please select your data”

- Click “Next >>”
- A new window also named “Multiple Regression” pops up
- Select “FINAL” as “Numeric Response (Y)” and “EXAM1”, “EXAM2” and “EXAM3” as “Continuous Predictor (X)”

- Click “OK>>”
- The regression analysis results appear in the newly generated spreadsheet “Multiple Regression” and the residual analysis results appear in another new spreadsheet “Mult Reg Residuals (1)”.

Step 3: Check whether the whole model is statistically significant. If not, we need to re-examine the predictors or look for new predictors before continuing.

- H₀: The model is not statistically significant (i.e., none of the predictor parameters are significantly different from zero).
- H₁: The model is statistically significant (i.e., at least one predictor parameter is significantly different from zero).

In this example, p-value is much smaller than alpha level (0.05), hence we reject the null hypothesis; the model is statistically significant.

Step 4: Check whether multicollinearity exists in the model.

The VIF information is automatically generated in the table of parameter estimates.

We use the VIF (Variance Inflation Factor) to determine if multicollinearity exists.

*Multicollinearity* is the situation when two or more independent variables in a multiple regression model are correlated with each other. Although multicollinearity does not necessarily reduce the predictability for the model as a whole, it may mislead the calculation for individual independent variables. To detect multicollinearity, we use VIF (Variance Inflation Factor) to quantify its severity in the model.

VIF quantifies the degree of multicollinearity for each individual independent variable in the model.

VIF calculation:

Assume we are building a multiple linear regression model using p predictors. Two steps are needed to calculate the VIF for X₁.

Step 1: Build a multiple linear regression model for X₁ by using X₂, X₃, . . ., Xₚ as predictors.

Step 2: Use the R² generated by the linear model in Step 1 to calculate the VIF for X₁: VIF(X₁) = 1 / (1 − R²).

Apply the same methods to obtain the VIFs for other Xs. The VIF value ranges from one to positive infinity.
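The VIF formula can be sketched directly (a minimal illustration of the two-step recipe; the R² values below are hypothetical):

```python
def vif(r_squared):
    """Variance Inflation Factor from the R^2 of regressing one predictor
    on all of the other predictors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

print(vif(0.0))            # 1.0: no multicollinearity at all
print(round(vif(0.8), 6))  # 5.0: the medium-multicollinearity threshold
print(round(vif(0.9), 6))  # 10.0: the large-multicollinearity threshold
```

As R² approaches 1 (the predictor is almost fully explained by the others), VIF grows without bound, which is why it ranges from one to positive infinity.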

Rules of thumb to analyze variance inflation factor (VIF):

- If VIF = 1, there is no multicollinearity.
- If 1 < VIF < 5, there is small multicollinearity.
- If 5 ≤ VIF < 10, there is medium multicollinearity.
- If VIF ≥ 10, there is large multicollinearity.

Common remedies for multicollinearity:

- Increase the sample size.
- Collect samples with a broader range for some predictors.
- Remove the variable with high multicollinearity and high p-value.
- Remove variables that are included more than once.
- Combine correlated variables to create a new one.

In this section, we will focus on removing variables with high VIF and high p-value.

Step 5: Deal with multicollinearity:

- Identify a list of independent variables with VIF higher than 5. If no variable has VIF higher than 5, go to Step 6 directly.
- Among the variables identified in Step 5.1, remove the one with the highest p-value.
- Run the model again, check the VIFs, and repeat Step 5.1.

Note: we only remove one independent variable at a time.

In this example, all three predictors have VIF higher than 5. Among them, EXAM1 has the highest p-value. We will remove EXAM1 from the equation and run the model again.

Run the new multiple linear regression with only two predictors (i.e., EXAM2 and EXAM3).

Check the VIFs of EXAM2 and EXAM3. They are both smaller than 5; hence, there is little multicollinearity in the model.

Step 6: Identify the statistically insignificant predictors. Remove one insignificant predictor at a time and run the model again. Repeat this step until all the predictors in the model are statistically significant.

Insignificant predictors are the ones with p-value higher than alpha level (0.05). When p > alpha level, we fail to reject the null hypothesis; the predictor is not significant.

- H₀: The predictor is not statistically significant.
- H₁: The predictor is statistically significant.

As long as the p-value is greater than 0.05, remove the insignificant variables one at a time in the order of the highest p-value. Once one insignificant variable is eliminated from the model, we need to run the model again to obtain new p-values for other predictors left in the new model. In this example, both predictors’ p-values are smaller than alpha level (0.05). As a result, we do not need to eliminate any variables from the model.

Step 7: Interpret the regression equation

The multiple linear regression equation appears automatically at the top of the session window. “Parameter Estimates” section provides the estimates of parameters in the linear regression equation. Now that we have removed multicollinearity and all of the insignificant predictors, we have the parameters for the regression equation.

Rsquare Adj = 98.4%

- 98.4% of the variation in FINAL can be explained by the predictor variables EXAM2 & EXAM3.

P-value of the F-test = 0.000

- We have a statistically significant model.

Variables’ p-values:

- Both are significant (less than 0.05).

VIF

- EXAM2 and EXAM3 are both below 5; we’re in good shape!

Equation: −4.34 + 0.722*EXAM2 + 1.34*EXAM3

- −4.34 is the Y intercept; all predictions start with −4.34.
- 0.722 is the EXAM2 coefficient; multiply it by the EXAM2 score.
- 1.34 is the EXAM3 coefficient; multiply it by the EXAM3 score.

Let us say you are the professor again, and this time you want to use your prediction equation to estimate what one of your students might get on their final exam.

Assume the following:

- Exam 2 results were: 84
- Exam 3 results were: 102

Use your equation: −4.34 + 0.722*EXAM2 + 1.34*EXAM3

Predict your student’s final exam score:

−4.34 + (0.722 × 84) + (1.34 × 102) = −4.34 + 60.648 + 136.68 = 192.988

Model summary: Nice work again! Now you can use your “magic” as the smart and efficient professor and allocate your time to other students because this one projects to perform much better than the average score of 162. Now that we know that exams two and three are statistically significant predictors, we can plug them into the regression equation to predict the results of the final exam for any student.
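The prediction step above can be wrapped in a small function (coefficients taken from the fitted equation in this section):

```python
def predict_final(exam2, exam3):
    """Predicted final exam score from the fitted equation
    FINAL = -4.34 + 0.722 * EXAM2 + 1.34 * EXAM3."""
    return -4.34 + 0.722 * exam2 + 1.34 * exam3

print(round(predict_final(84, 102), 3))  # 192.988, as computed above
```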


Pearson’s correlation coefficient, also called Pearson’s r, the coefficient of correlation, or Pearson’s product moment correlation coefficient, is a statistic measuring the linear relationship between two variables.

Correlation is a statistical technique that describes whether and how strongly two or more variables are related.

Correlation analysis helps to understand the direction and degree of association between variables, and it suggests whether one variable can be used to predict another. Of the different metrics to measure correlation, Pearson’s correlation coefficient is the most popular. It measures the linear relationship between two variables.

Correlation coefficients range from −1 to 1.

- If r = 0, there is no linear relationship between the variables.
- The sign of r indicates the direction of the relationship: if r < 0, there is a negative linear correlation; if r > 0, there is a positive linear correlation.
- The absolute value of r describes the strength of the relationship:
  - If |r| ≤ 0.5, there is a weak linear correlation.
  - If |r| > 0.5, there is a strong linear correlation.
  - If |r| = 1, there is a perfect linear correlation.

When the correlation is strong, the data points on a scatter plot will be close together (tight). The closer r is to −1 or 1, the stronger the relationship:

- r near −1: strong inverse relationship
- r near +1: strong direct relationship

When the correlation is weak, the data points are spread apart more (loose). The closer the correlation is to 0, the weaker the relationship.

*Fig 1.0 Examples of Types of Correlation*

This Figure demonstrates the relationships between variables as the Pearson r value ranges from 1 to 0 and to −1. Notice that at −1 and 1 the points form a perfectly straight line.

- At 0 the data points are completely random.
- At 0.8 and −0.8, notice how you can see a directional relationship, but there is some noise around where a line would be.
- At 0.4 and −0.4, it looks like the scattering of data points is leaning to one direction or the other, but it is more difficult to see a relationship because of all the noise.

Pearson’s correlation coefficient is only sensitive to the linear dependence between two variables. It is possible that two variables have a perfect non-linear relationship when the correlation coefficient is low. Notice the scatter plots below with correlation equal to 0. There are clearly relationships but they are not linear and therefore cannot be determined with Pearson’s correlation coefficient.

*Fig 1. 1 Examples of Types of Relationships*

Correlation does not imply causation.

If variable A is highly correlated with variable B, it does not necessarily mean A causes B or vice versa. It is possible that an unknown third variable C is causing both A and B to change. For example, if ice cream sales at the beach are highly correlated with the number of shark attacks, it does not imply that increased ice cream sales cause increased shark attacks. They are triggered by a third factor: summer.

This example demonstrates a common mistake that people make: assuming causation when they see correlation. In this example, it is hot weather that is a common factor. As the weather is hotter, more people consume ice cream and more people swim in the ocean, making them susceptible to shark attacks.

If two variables are independent, the correlation coefficient is zero.

WARNING! If the correlation coefficient of two variables is zero, it does not imply they are independent. The correlation coefficient only indicates the linear dependence between two variables. When variables are non-linearly related, they are not independent of each other but their correlation coefficient could be zero.

The correlation coefficient indicates the direction and strength of the linear dependence between two variables but it does not cover all the existing relationship patterns. With the same correlation coefficient, two variables might have completely different dependence patterns. A scatter plot or X-Y diagram can help to discover and understand additional characteristics of the relationship between variables. The correlation coefficient is not a replacement for examining the scatter plot to study the variables’ relationship.

The correlation coefficient by itself does not tell us everything about the relationship between two variables. Two relationships could have the same correlation coefficient, but completely different patterns.

The correlation coefficient could be high or low by chance (randomness). It may have been calculated based on two small samples that do not provide good inference on the correlation between two populations.

In order to test whether there is a statistically significant relationship between two variables, we need to run a hypothesis test to determine whether the correlation coefficient is statistically different from zero.

Hypothesis Test Statements

- H₀: r = 0 (null hypothesis: there is no correlation).
- H₁: r ≠ 0 (alternative hypothesis: there is a correlation).

Hypothesis tests will produce p-values as a result of the statistical significance test on r. When the p-value for a test is low (less than 0.05), we can reject the null hypothesis and conclude that r is significant; there is a correlation. When the p-value for a test is > 0.05, then we fail to reject the null hypothesis; there is no correlation.

We can also use the t statistic to draw the same conclusions regarding our test for significance of the correlation coefficient. To use the t-test to determine the statistical significance of the Pearson correlation, calculate the t statistic using the Pearson r value and the sample size, n.

Test Statistic

t = r × √(n − 2) / √(1 − r²)

Critical Statistic

The critical statistic is the t value in the t-table with (n − 2) degrees of freedom.

- If |t| ≤ t_critical, we fail to reject the null. There is no statistically significant linear relationship between X and Y.
- If |t| > t_critical, we reject the null. There is a statistically significant linear relationship between X and Y.

We are interested in understanding whether there is linear dependence between a car’s MPG and its weight and if so, how they are related. The MPG and weight data are stored in the “Correlation Coefficient” tab in “Sample Data.xlsx.” We will discuss three ways to get the results.

The formula CORREL in Excel calculates the sample correlation coefficient of two data series. The correlation coefficient between the two data series is −0.83, which indicates a strong negative linear relationship between MPG and weight. In other words, as weight gets larger, gas mileage gets smaller.

Fig 1.3 Correlation coefficient in Excel

How do we interpret results and make decisions based on Pearson’s correlation coefficient (r) and p-values?

Let us look at a few examples:

- r = −0.832, p = 0.000 (previous example). The two variables are inversely related and the linear relationship is strong. Also, this conclusion is significant as supported by p-value of 0.00.
- r = −0.832, p = 0.71. Based on r, you should conclude the linear relationship between the two variables is strong and inversely related. However, with a p-value of 0.71, you should then conclude that r is not significant and that your sample size may be too small to accurately characterize the relationship.
- r = 0.5, p = 0.00. Moderately positive linear relationship, r is statistically significant.
- r = 0.92, p = 0.61. Strong positive linear relationship but r is not statistically significant. Get more data.
- r = 1.0, p = 0.00. The two variables have a perfect linear relationship and r is significant.

Population Correlation Coefficient (ρ):

ρ = cov(X, Y) / (σ_X σ_Y)

Sample Correlation Coefficient (r):

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² Σ(yᵢ − ȳ)²)

The correlation coefficient is only defined when the standard deviations of both X and Y are non-zero and finite. When the covariance of X and Y is zero, the correlation coefficient is zero.

*Simple linear regression* is a statistical technique to fit a straight line through the data points. It models the quantitative relationship between two variables. It is simple because only one predictor variable is involved. It describes how one variable changes according to the change of another variable. Both variables need to be continuous; there are other types of regression to model discrete data.

The simple linear regression analysis fits the data to a regression equation of the form Y = αX + β + e.

Where:

- *Y* is the dependent variable (the response) and *X* is the single independent variable (the predictor).
- *α* is the slope describing the steepness of the fitted line; β is the intercept indicating the *Y* value when *X* is equal to 0.
- *e* stands for error (residual). It is the difference between the actual *Y* and the fitted *Y* (i.e., the vertical distance between the data point and the fitted line).

The *ordinary least squares* is a statistical method used in linear regression analysis to find the best fitting line for the data points. It estimates the unknown parameters of the regression equation by minimizing the sum of squared residuals (i.e. the vertical difference between the data point and the fitting line).

In mathematical language, we look for the α and β that minimize the sum of squared residuals:

min Σᵢ (yᵢ − ŷᵢ)²

The actual value of the dependent variable: yᵢ = αxᵢ + β + eᵢ, where *i = 1, 2 . . . n*.

The fitted value of the dependent variable: ŷᵢ = αxᵢ + β, where *i = 1, 2 . . . n*.

By using calculus, it can be shown that the sum of squared errors is minimal when

α = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)²

and

β = ȳ − αx̄

- X: the independent variable that we use to predict;
- Y: the dependent variable that we want to predict.
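The closed-form least-squares estimates for α and β can be sketched in pure Python (illustrative data points, not the case-study data set):

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares for y = alpha * x + beta.

    Returns (alpha, beta): the slope and intercept that minimize
    the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    alpha = sxy / sxx             # slope
    beta = y_bar - alpha * x_bar  # intercept
    return alpha, beta

# Points lying exactly on y = 2x + 1 recover the true parameters
alpha, beta = fit_simple_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha, beta)  # 2.0 1.0
```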

The variance in simple linear regression can be expressed as a relationship between the actual value, the fitted value, and the grand mean—all in terms of Y.

- Total Variation = Total Sum of Squares = Σ(yᵢ − ȳ)²
- Explained Variation = Regression Sum of Squares = Σ(ŷᵢ − ȳ)²
- Unexplained Variation = Error Sum of Squares = Σ(yᵢ − ŷᵢ)²

Regression follows the same methodology as ANOVA and the hypothesis tests behind it use the same assumptions.

*Variation Components*

i.e. Total Sums of Squares = Regression Sums of Squares + Error Sums of Squares

*Degrees of Freedom Components*

i.e. n – 1 = (k – 1) + (n – k), where n is the number of data points, k is the number of predictors

Whether the overall model is statistically significant can be tested by using F-test of ANOVA.

- H₀: The model is not statistically significant.
- Hₐ: The model is statistically significant.

*Test Statistic*

F = (SSR / (k − 1)) / (SSE / (n − k))

*Critical Statistic*

The critical statistic is the F value in the F table with (k − 1) degrees of freedom in the numerator and (n − k) degrees of freedom in the denominator.

- If F ≤ F_critical (the calculated F is less than or equal to the critical F), we fail to reject the null. There is no statistically significant relationship between X and Y.
- If F > F_critical, we reject the null. There is a statistically significant relationship between X and Y.

*R-squared* or R² (also called the coefficient of determination) measures the proportion of variability in the data that can be explained by the model.

- R² ranges from 0 to 1. The higher R² is, the better the model fits the actual data.
- R² can be calculated with the formula: R² = SSR / SST = 1 − SSE / SST
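R² can be computed directly from actual and fitted values (a small sketch; the numbers below are made up to show a nearly perfect fit):

```python
def r_squared(ys, fitted):
    """Coefficient of determination: 1 - SSE / SST."""
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)              # total variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    return 1.0 - sse / sst

# Actual observations vs. hypothetical fitted values
ys = [1.1, 2.9, 5.2, 6.8]
fitted = [1.0, 3.0, 5.0, 7.0]
print(round(r_squared(ys, fitted), 4))  # 0.9947
```

A perfect fit (fitted values equal to the actual values) gives R² = 1, matching the range described above.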

Case study: We want to see whether the score on exam one has any statistically significant relationship with the score on the final exam. If yes, how much impact does exam one have on the final exam?

Data File: “Simple Linear Regression” tab in “Sample Data.xlsx”

Step 1: Determine the dependent and independent variables. Both should be continuous variables.

- Y (dependent variable) is the score of final exam.
- X (independent variable) is the score of exam one.

Step 2: Create a scatter plot to visualize whether there seems to be a linear relationship between X and Y.

- Select the range of both independent and dependent variables in Excel.
- Click SigmaXL -> Graphical Tools -> Scatter Plots
- A new window named “Scatter Plots” pops up and the selected range appears automatically in the box below “Please select your data”.

- Click “Next >>”
- A new window also named “Scatter Plots” pops up.
- Select “FINAL” as “Numeric Response (Y)” and “EXAM1” as “Numeric Predictor (X1) >>”

- Click “OK>>”
- A scatter plot is generated in a new spreadsheet “Scatterplot(1)”.

Based on the scatter plot, the relationship between exam one and final seems linear. The higher the score on exam one, the higher the score on the final. It appears you could “fit” a line through these data points.

Step 3: Run the simple linear regression analysis.

- Select the range of both independent and dependent variables in Excel.
- Click SigmaXL -> Statistical Tools -> Regression -> Multiple Regression
- A new window named “Multiple Regression” pops up and the selected range appears automatically in the box below “Please select your data”

- Click “Next >>”
- A new window also named “Multiple Regression” pops up
- Select “FINAL” as “Numeric Response (Y)” and “EXAM1” as “Continuous Predictor (X)”

- Click “OK>>”
- The regression analysis results appear in the newly generated spreadsheet “Multiple Regression” and the residual analysis results appear in another new spreadsheet “Mult Reg Residuals (1)”.

Step 4: Check whether the model is statistically significant. If not, we will need to re-examine the predictor or look for new predictors before continuing. R² measures the percentage of variation in the data set that can be explained by the model; here, 89.5% of the variability in the data can be accounted for by this linear regression model. The “Analysis of Variance” section provides an ANOVA table covering degrees of freedom, sum of squares, and mean square information for total, regression, and error. The p-value of the F-test (0.0001) is lower than the alpha level (0.05); therefore, we reject the null and claim the model is statistically significant.

Step 5: Understand regression equation

The estimates of slope and intercept are shown in the equation at the top of the output. In this example, Y = 15.622 + 1.852 × Exam 1. Y is the predicted final exam score. A one unit increase in the score of Exam1 would increase the final score by 1.852.

Let us say you are the professor and you want to use this prediction equation to estimate what two of your students might get on their final exam.

Rsquare Adj = 89.0%

- 89% of the variation in FINAL can be explained by EXAM1.

P-value of the F-test = 0.000

- We have a statistically significant model.

Prediction Equation: 15.6 + 1.85 × EXAM1

- 15.6 is the Y intercept; all predictions start with 15.6.
- 1.85 is the EXAM1 coefficient; multiply it by the EXAM1 score.

Because the model is significant, and it explains 89% of the variability, we can use the model to predict final exam scores based on the results of Exam1.

Let us assume the following:

- Student “A” exam 1 results were: 79
- Student “B” exam 1 results were: 94

Remember our prediction equation 15.6 + 1.85 × Exam1?

Now apply the equation to each student:

Student “A” Estimate: 15.6 + (1.85 × 79) = 161.75 ≈ 161.8

Student “B” Estimate: 15.6 + (1.85 × 94) = 189.5
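These two estimates can be reproduced with a one-line function (coefficients taken from the fitted equation above):

```python
def predict_final(exam1):
    """Predicted final exam score: 15.6 + 1.85 * EXAM1."""
    return 15.6 + 1.85 * exam1

print(round(predict_final(79), 2))  # 161.75 (reported as 161.8 above)
print(round(predict_final(94), 2))  # 189.5
```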

Model summary: By simply plugging exam 1 scores into the equation, we can predict students’ final exam scores. But the key thing about the model is whether it is useful. In this case, the professor can use the results to figure out where to spend his time helping students.
