Improve Phase – New Horizons

Fractional Factorial Designs with SigmaXL

Anthony Bhawani — Wed, 03 Feb 2016 22:09:24 +0000

What Are Fractional Factorial Experiments?

In simple terms, a fractional factorial experiment is a subset of a full factorial experiment.

[unordered_list style=”star”]

Fractional factorials use fewer treatment combinations and runs.
Fractional factorials are less able to determine effects because of fewer degrees of freedom available to evaluate higher order interactions.
Fractional factorials can be used to screen a larger number of factors.
Fractional factorials can also be used for optimization.

[/unordered_list]

Why Fractional Factorial Experiments?

To run a full factorial experiment for k factors, we need 2^k unique treatments. In other words, we need resources that can afford at least 2^k runs.

With k increasing, the number of runs required in full factorial experiments rises dramatically even without any replications, and the percentage of degrees of freedom spent on the main effects decreases. However, the higher order interactions (3 or 4 factor interactions) can typically be ignored, which allows us to run fewer trials to understand the main effects and two-way interactions.

The main effects and two-way interactions are the key effects we need to evaluate. The higher the order of an interaction, the more safely it can usually be ignored.

Notice the number of treatments increases dramatically as factors are added.
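The growth in runs can be tabulated with a short sketch. The function names below are illustrative and are not part of SigmaXL; the share calculation assumes one degree of freedom per effect in a two-level design.

```python
def full_factorial_runs(k: int) -> int:
    """Unique treatments in a two-level full factorial with k factors."""
    return 2 ** k


def main_effect_df_share(k: int) -> float:
    """Share of the 2^k - 1 effect degrees of freedom used by the k main effects."""
    return k / (2 ** k - 1)


# Print the run count and the main-effect share for 2 to 7 factors.
for k in range(2, 8):
    runs = full_factorial_runs(k)
    share = main_effect_df_share(k)
    print(f"k={k}: {runs:>3} treatments, main effects use {share:.0%} of effect df")
```

With seven factors, 128 treatments are needed and only 7 of the 127 effect degrees of freedom go to main effects, which is the motivation for running a fraction.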

How Does a Fractional Factorial Work?

We are trying to find the cause-and-effect relationship between a response (Y) and three factors (A, B, and C) and their interactions (AB, BC, AC, and ABC). The 2³ full factorial design (two levels, three factors) follows. There are eight treatment combinations (2 * 2 * 2).

To perform a 2³ full factorial experiment, we need to run at least eight unique treatments (2 * 2 * 2).
What if we only have enough resources to run four treatments?
In that case, we need to carefully select a subset of the eight treatments so that all of our main effects can be evaluated and the design is kept balanced and orthogonal.

Example of an invalid design

This design is invalid because only the low setting of factor C is tested. We cannot evaluate the main effect of factor C using this design. Remember orthogonality?

This design is also invalid because it is neither balanced nor orthogonal. Checking orthogonality: the sum of AC interaction signs should equal zero (0).

[unordered_list style=”star”]

Run 1 (−)
Run 2 (−)
Run 3 (−)
Run 4 (+)
Sum (−2)

[/unordered_list]

This design has a low and high setting for each factor, but is not orthogonal. To select the four treatments run in the 2³⁻¹ fractional factorial experiment, we start from the 2² full factorial design of experiment. If we replace the two-way interaction (AB) column with the factor C column, the design will be valid.

Imagine a two-factor full factorial with factors A and B. We also learn about the interaction of A and B. In a fractional factorial, we sacrifice learning about the two-way interaction between A and B, and substitute factor C.

2³⁻¹ Fractional Factorial Design Pattern

This pattern implies three factors and four treatments.

Note: We also call this kind of design a half-factorial design since we only have half of the treatments that we would have in a full factorial design. In 2³⁻¹ fractional factorial design of experiment, the effect of three-way interaction (ABC) is not measurable since it only has “+1”.
In four runs, we are able to run high and low settings for each of the three factors. The three-way interaction ABC is only at the high setting. In the 2³⁻¹ fractional factorial design, we notice that the column of each main effect has identical “+1” and “−1” values with one two-way interaction column.

[unordered_list style=”star”]

A and BC
B and AC
C and AB

[/unordered_list]

In this situation, we say that A is aliased with BC or A is the alias of BC. By multiplying any column with itself, we obtain the identity (I).

A*A=I

The product of any column and the identity is the column itself.

A*I=A

Column ABC is called the generator. By multiplying any column with the generator, we obtain its alias.

A*ABC=(A*A)*BC=I*BC=BC
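The construction and the alias pairs above can be checked with a minimal sketch. It assumes the usual −1/+1 coding and builds the half fraction by setting C = AB; the helper name `column` is illustrative only.

```python
from itertools import product

# 2^2 full factorial in A and B, standard order (A alternates fastest).
base = [(a, b) for b, a in product((-1, 1), repeat=2)]

# Put factor C in the AB interaction column (C = AB, so I = ABC).
design = [(a, b, a * b) for a, b in base]

INDEX = {"A": 0, "B": 1, "C": 2}


def column(effect: str) -> list:
    """Sign column of an effect such as 'A' or 'BC' (product of its factor columns)."""
    signs = []
    for run in design:
        sign = 1
        for letter in effect:
            sign *= run[INDEX[letter]]
        signs.append(sign)
    return signs


# Each main effect has the same sign column as one two-way interaction.
for effect, alias in (("A", "BC"), ("B", "AC"), ("C", "AB")):
    print(effect, column(effect), "same column as", alias, column(effect) == column(alias))

# The three-way interaction is constant, so it cannot be estimated.
print("ABC", column("ABC"))

# Balance: each factor column sums to zero.
# Orthogonality: the product of any two factor columns sums to zero.
for f in "ABC":
    print(f, "balanced:", sum(column(f)) == 0)
for pair in ("AB", "AC", "BC"):
    print(pair, "orthogonal:", sum(column(pair)) == 0)
```

Because A and BC share one column, the analysis cannot tell their effects apart, which is why this design is listed as Resolution III in SigmaXL.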

Use SigmaXL to Run a Fractional Factorial Experiment

Step 1: Initiate the experiment design

Click SigmaXL -> Design of Experiments -> 2-Level Factorial/Screening -> 2-Level Factorial/Screening Designs
A window named “2-Level Factorial/Screening Design of Experiments” pops up.

Step 2: Enter the response and factors information in the window “2-Level Factorial/Screening Design of Experiments”

One response: Y
Three factors: A, B and C
Two-level design: each factor has two settings
Select “4-Run, 2**(3-1), ½ Fraction, Res III”.
Select the number of replicates and make the design. We will assume there are sufficient resources to allow each treatment to be run twice, so enter “2” into the “Number of Replicates” box.
Click “OK>>”
The DOE template appears in the newly generated tab “3 Factor DOE”.

Step 3: Implement the experiment and record the results in the cells labeled “Y” in the DOE table. The data has been provided for you in the DOE Fractional data table in your Sample Data.xlsx file. Carefully enter the “Y” values into your newly generated DOE template, paying close attention to the “A”, “B”, and “C” settings so that each “Y” result is mapped to the correct treatment.

Step 4: Fit the model using the experiment results

Click SigmaXL -> Design of Experiments -> 2-Level Factorial/Screening -> Analyze 2-Level Factorial/Screening Design
A new window named “Analyze 2-Level Factorial/Screening Design” appears with the response and factors pre-populated.
Check the checkbox “Show Residual Plots”
Click “OK>>”

Step 5: Analyze the model results

Check whether the model is statistically significant.
Check which factors are insignificant.
If any independent variables are not significant, remove them one at a time and rerun the model until all the independent variables in the model are significant.

The p-value of factor B is greater than the alpha level (0.05), so it is not statistically significant. In this example, since factor B is not statistically significant, it needs to be removed from the model.

After factor B is removed and the model is rerun, the p-values of all the remaining independent variables are smaller than 0.05. There is no need to remove any more independent variables from the model.

Step 6: Conduct residual analysis to ensure that the residuals of the model satisfy the following criteria. Because you already elected to check the box to “Show Residual Plots”, the residuals are stored in your data table (shown below). Check whether residuals are normally distributed with mean equal to zero.

Select the entire range of the residuals in the residual report
Click SigmaXL -> Graphical Tools -> Histograms & Descriptive Statistics
A new window named “Histograms & Descriptive” appears and the selected range is automatically populated into the box below “Please select your data”
Click “Next>>”
A new window named “Histograms & Descriptive Statistics” pops up.
Select “Residuals” as the “Numeric Data Variables (Y)”
Click “OK>>”
The histogram and the normality test of the residuals appear in the newly generated tab “Hist Descript (1)”

If the p-value of the normality test is greater than the alpha level (0.05), we fail to reject the hypothesis that the residuals are normally distributed. Here the p-value is larger than the alpha level (0.05), so the residuals can be treated as normally distributed. The mean of the residuals is zero.

Step 7: Check whether the residuals are independent.

Select the entire range of the residuals in the residual report.
Click SigmaXL -> Control Charts -> Individuals & Moving Range
A new window named “Individuals and Moving Range” appears with the selected range automatically populated into the box below “Please select your data”.
Click “Next>>”
A new window named “Individuals and Moving Range Chart” pops up.
Select the “Residuals” as the “Numeric Data Variables (Y)”
Click “OK>>”
The control charts appear in the newly generated tab “Indiv & MR Charts (1)”.

If no data points on the control charts fail any tests, the residuals are in control and independent of each other. Note: A prerequisite for plotting the I-MR chart of the residuals is that the residuals are in time order. These are the control charts for assessing whether the residuals are independent.

Step 8: Check whether the residuals have equal variance across the predicted response values. Close to the bottom of the tab “Analyze – 3 Factor DOE (1)” is the residual by predicted plot.

Model summary: We look for patterns in which the residuals tend to have even variation across the entire range of the fitted response values.

Logistic Regression with SigmaXL

Michael Parker — Tue, 02 Feb 2016 15:05:08 +0000

What is Logistic Regression?

Logistic regression is a statistical method to predict the probability of an event occurring by fitting the data to a logistic curve using the logistic function. It is the regression analysis used for predicting the outcome of a categorical dependent variable based on one or more predictor variables. The logistic function used to model the probabilities describes the possible outcome of a single trial as a function of explanatory variables. The dependent variable in a logistic regression can be binary (e.g. 1/0, yes/no, pass/fail), nominal (blue/yellow/green), or ordinal (satisfied/neutral/dissatisfied). The independent variables can be either continuous or discrete.

Logistic Function

f(z) = 1 / (1 + e^(−z))

Where: z can be any value ranging from negative infinity to positive infinity.
The value of f(z) ranges from 0 to 1, which matches the nature of probability (i.e., 0 ≤ P ≤ 1).

Logistic Regression Equation

Based on the logistic function, we define f(z) as the probability of an event occurring, and z is the weighted sum of the significant predictive variables.

z = β₀ + β₁x₁ + β₂x₂ + . . . + β_p x_p

Where: z represents the weighted sum of all of the predictive variables.

Logistic Regression

Another way of representing f(z) is by replacing z with the weighted sum of the predictive variables.

Y = 1 / (1 + e^(−(β₀ + β₁x₁ + β₂x₂ + . . . + β_p x_p)))

Where: Y is the probability of an event occurring and the x’s are the significant predictors.
Notes:

[unordered_list style=”star”]

When building the regression model, we use the actual Y, which is discrete (e.g. binary, nominal, ordinal).
After the model is built, the fitted Y calculated using the logistic regression equation is a probability ranging from 0 to 1. To convert the probability back to a discrete value, we need SMEs’ inputs to select the probability cut point.

[/unordered_list]

Logistic Curve

The logistic curve for binary logistic regression with one continuous predictor is illustrated by the following Figure.

Odds

Odds is the probability of an event occurring divided by the probability of the event not occurring.

Odds = P / (1 − P)

Odds range from 0 to positive infinity.
Probability can be calculated using odds.

P = Odds / (1 + Odds)

Because probability can be expressed by the odds, and we can express probability through the logistic function, we can equate probability, odds, and ultimately the sum of the independent variables.
Since in the logistic regression model

P = 1 / (1 + e^(−z))

therefore

ln(P / (1 − P)) = z, the weighted sum of the independent variables
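The relationship between probability, odds, and the weighted sum z can be verified numerically. This is a minimal sketch using the standard definitions; the function names are illustrative and the z values are arbitrary.

```python
import math


def logistic(z: float) -> float:
    """Logistic function: maps any real z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p: float) -> float:
    """Probability of the event divided by the probability of no event."""
    return p / (1.0 - p)


def prob_from_odds(o: float) -> float:
    """Recover the probability from the odds."""
    return o / (1.0 + o)


def logit(p: float) -> float:
    """Natural log of the odds; this recovers z from a logistic probability."""
    return math.log(odds(p))


for z in (-2.0, 0.0, 2.0):
    p = logistic(z)
    print(f"z={z:+.1f}  P={p:.4f}  odds={odds(p):.4f}  ln(odds)={logit(p):+.4f}")
```

The last column reproduces z, which is why the log of the odds is linear in the predictors even though the probability itself is not.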

Three Types of Logistic Regression

[unordered_list style=”star”]

Binary Logistic Regression
- Binary response variable
- Example: yes/no, pass/fail, female/male
Nominal Logistic Regression
- Nominal response variable
- Example: set of colors, set of countries
Ordinal Logistic Regression
- Ordinal response variable
- Example: satisfied/neutral/dissatisfied

[/unordered_list]

All three logistic regression models can use multiple continuous or discrete independent variables and can be developed in SigmaXL using the same steps.

How to Run a Logistic Regression in SigmaXL

We want to build a logistic regression model using the potential factors to predict the probability that the person measured is female or male.

Data File: “Logistic Regression” tab in “Sample Data.xlsx”

Response and Potential Factors

[unordered_list style=”star”]

Response (Y): Female/Male
Potential Factors (Xs):
- Age
- Weight
- Oxy
- Runtime
- RunPulse
- RstPulse
- MaxPulse

[/unordered_list]

Step 1:

Select the entire range of data (“Name”, “Sex”, “Age”, “Weight”, “Oxy”, “Runtime”, “RunPulse”, “RstPulse”, “MaxPulse” columns)
Click SigmaXL -> Statistical Tools -> Regression -> Binary Logistic Regression
A new window named “Binary Logistic Regression” appears with the selected range of data appearing in the box under “Please select your data”
Click “Next>>”
A new window also called “Binary Logistic Regression” pops up.
Select “Sex” as the “Binary Response (Y)”
Select “Age”, “Weight”, “Oxy”, “Runtime”, “RunPulse”, “RstPulse”, “MaxPulse” as the “Continuous Predictors (X)”.
The reference event is set as “M” by default.
Click “OK”

Step 2:

Check the p-values of all the independent variables in the model.
Remove the insignificant independent variables one at a time from the model and rerun the model.
Repeat until all of the independent variables in the model are statistically significant.

Since the p-values of all the independent variables are higher than the alpha level (0.05), we need to remove the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. Runtime has the highest p-value (0.9897), so it would be removed from the model first. Re-run the binary logistic regression but this time exclude Runtime from the “Continuous Predictors (X)” in the Binary Logistic Regression dialog box.
After removing Runtime from the model, the p-values of all the independent variables are still higher than the alpha level (0.05). We need to continue removing the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. Age has the highest p-value (0.9773), so it would be removed from the model next.

After removing Age from the model, the p-values of all the independent variables are still higher than the alpha level (0.05). We need to continue removing the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. RstPulse has the highest p-value (0.8017) so it would be removed from the model next.

After removing RstPulse from the model, the p-values of all the independent variables are still higher than the alpha level (0.05). We need to continue removing the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. Weight has the highest p-value (0.242), so it would be removed from the model next.
After removing Weight from the model, the p-values of all the independent variables are still higher than the alpha level (0.05). We need to continue removing the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. RunPulse has the highest p-value (0.1604), so it would be removed from the model next.
After removing RunPulse from the model, the p-values of all the independent variables are still higher than the alpha level (0.05). We need to continue removing the insignificant independent variables one at a time from the model, starting from the one with the highest p-value. MaxPulse has the highest p-value (0.2290), so it would be removed from the model next.
After removing MaxPulse from the model, the p-value of the only remaining independent variable, “Oxy”, is 0.0556, just above the alpha level (0.05). We choose to keep “Oxy” in the model and accept the slightly higher risk of wrongly rejecting the null at this p-value. But before we do that, let’s check the validity of the model as a whole.

Step 3:

Analyze the binary logistic report and check the performance of the logistic regression model. Compare the p-value of the overall model with the alpha level (0.05): only when it is smaller than alpha can we conclude that at least one of the slope coefficients is not equal to zero. The pseudo R-squared is 10.55%. The R-squared of logistic regression is in general lower than the R-squared of the traditional multiple linear regression model. The p-value of the lack of fit test is higher than the alpha level (0.05), so we conclude that the model fits the data. Also, 62.50% of the predicted outcomes match the observed outcomes.

Step 4: Enter the setting of Oxy into the cell highlighted in yellow and the predicted event probability will appear automatically. In this case, if we set the Oxy value to 50, the probability that the person measured is male is 41%.

Full Factorial DOE with SigmaXL

Anthony Bhawani — Thu, 28 Jan 2016 21:36:24 +0000

What is a Full Factorial DOE?

In a full factorial experiment, all of the possible combinations of factors and levels are created and tested. For example, for a two-level design (i.e., each factor has two levels) with k factors, there are 2^k possible scenarios or treatments.

Two factors, each with two levels, we have 2²= 4 treatments
Three factors, each with two levels, we have 2³= 8 treatments
k factors, each with two levels, we have 2^k treatments

2^k Full Factorial DOE

Full factorial DOE is used to discover the cause-and-effect relationship between the response and both individual factors and the interaction of factors. It generates an equation to describe the relationship between Y and the important Xs:

Y = α₀ + α₁X₁ + α₂X₂ + . . . + α_kX_k + α_(k+1)X₁X₂ + . . . + α_pX₁X₂ . . . X_k + ε

Where:

[unordered_list style=”star”]

Y is the response and X₁, X₂. . . X_k are the factors
α₀ is the intercept and α₁, α₂. . . α_p are the coefficients of the factors and interactions
ε is the error of the model

[/unordered_list]

Two-Level Two-Factor Full Factorial

Below is a design pattern of a two-level two-factor full factorial experiment.

2 (level) raised to 2 (factors) = 4 treatment combinations.

Two-Level Three-Factor Full Factorial

Below is a design pattern of a two-level three-factor full factorial experiment.

2 (levels) raised to 3 (factors) = 8 treatment combinations.

Two-Level Four-Factor Full Factorial

Below is a design pattern of a two-level four-factor full factorial experiment

2 (levels) raised to 4 (factors) = 16 treatment combinations

Two-Level Five-Factor Full Factorial

Below is a design pattern of a two-level five-factor full factorial experiment

2 (levels) raised to 5 (factors) = 32 treatment combinations

Order to Run Experiments

The four design patterns shown earlier are listed in the standard order. Standard order is used to design the combinations/treatments before experiments start. When actually running the experiments, randomizing the standard order is recommended to minimize the noise.
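Standard order and a randomized run order can be generated with a short sketch. It assumes standard order means the first factor alternates fastest (Yates order) and uses the usual −1/+1 coding; the fixed seed is there only to make the example reproducible.

```python
import random
from itertools import product


def standard_order(k: int) -> list:
    """Two-level full factorial in standard order: the first factor alternates fastest."""
    return [tuple(reversed(levels)) for levels in product((-1, 1), repeat=k)]


design = standard_order(3)

# Shuffle the standard-order positions to obtain a random run order.
rng = random.Random(2016)  # illustrative seed, not a recommendation
order = list(range(len(design)))
rng.shuffle(order)

print("Run  Std  (A, B, C)")
for run_number, std_index in enumerate(order, start=1):
    print(f"{run_number:>3}  {std_index + 1:>3}  {design[std_index]}")
```

The design table is still planned in standard order; only the sequence in which the treatments are executed changes.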

Replication in Experiments

Each treatment can be tested multiple times in an experiment in order to increase the degrees of freedom and improve the capability of analysis. We call this method replication.
Replicates are the number of repetitions of running an individual treatment, which increase the power of the experimental responses. The order to run the treatments in an experiment should be randomized to minimize the noise.
Advantages of replication include: helps to better identify the true sources of variation, helps estimate the true impacts of the factors on the response, and overall improves the reliability and validity of the experimental results.

2² Full Factorial DOE

Case study: We are running a 2² full factorial DOE to discover the cause-and-effect relationship between the cake tastiness and two factors: temperature of the oven and time length of baking. Each factor has two levels and there are four treatments in total.
We decide to run each treatment twice so that we have enough degrees of freedom to measure the impact of two factors and the interaction between two factors. Therefore, there are eight observations in response eventually.

The objective is to understand the main effects and the interactions of these factors on the response variable. After running the four treatments twice in a random order, we obtain the following results.

There are two factors and two levels, so there are 2² = 4 treatment combinations. With replication, each treatment combination is run twice; therefore, there are 8 runs in total in this experiment. The experiment results are consolidated into the following table.

The main effect of factor A is computed by averaging the difference between combinations where A was at its high settings and where A was at its low settings.
Main effect of factor A (temperature of the oven):

Where:

[unordered_list style=”star”]

k is the number of factors
r is the number of times individual treatments are being run

[/unordered_list]

Using the formula provided, the main effect of temperature is −6.25: increasing the temperature of the oven decreases the tastiness of the cake by 6.25.

The main effect of factor B, similar to A, is computed by averaging the difference between combinations where B was at its high settings and where B was at its low settings.
Main effect of factor B (time length of baking):

Where:

[unordered_list style=”star”]

k is the number of factors
r is the number of times individual treatments are being run

[/unordered_list]

Using the formula provided, the main effect of baking time is −1.75: increasing the baking time decreases the tastiness of the cake by 1.75.

The interaction effect is computed as the difference between the average response where A and B were at the same settings (both low or both high) and the average response where A and B were at opposite settings (one low, one high).

Interaction (i.e. A*B) effect:

Where:

[unordered_list style=”star”]

k is the number of factors
r is the number of times individual treatments are being run

[/unordered_list]

Using the formula provided, the interaction effect of the temperature and time variables on tastiness was −3.25.

Sum of squares of factors and interaction

Where:

[unordered_list style=”star”]

k is the number of factors
r is the number of times individual treatments are being run

[/unordered_list]

The sum of squares tells us the relative strength of each main effect and interaction. A has the strongest effect as indicated by the high SS value. The degrees of freedom are necessary to determine the mean squares value.

Degrees of freedom of factors and interaction:

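Three degrees of freedom are needed for the effects because there are three effects we are looking to understand: factor A, factor B, and the interaction between them. Each effect takes one degree of freedom; with eight observations (seven total degrees of freedom), the remaining four go to error.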

Mean squares of factors and interaction:
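As a hedged illustration, the sums of squares and mean squares can be recovered from the three effects reported in this case study (−6.25, −1.75, −3.25) with k = 2 and r = 2. The sketch assumes the standard 2^k relations: effect = contrast / (r·2^(k−1)) and SS = contrast² / (r·2^k). The raw data table is not reproduced here, so only the effect terms are computed.

```python
k = 2  # number of factors
r = 2  # number of times each treatment is run

# Effects as reported in the case study text.
effects = {"A (Temp)": -6.25, "B (Time)": -1.75, "A*B": -3.25}


def sum_of_squares(effect: float, k: int, r: int) -> float:
    """SS of an effect in a replicated 2^k design, via its contrast."""
    contrast = effect * r * 2 ** (k - 1)
    return contrast ** 2 / (r * 2 ** k)


for name, effect in effects.items():
    ss = sum_of_squares(effect, k, r)
    df = 1          # each two-level effect uses one degree of freedom
    ms = ss / df    # mean square = sum of squares / degrees of freedom
    print(f"{name:<9} effect={effect:>6}  SS={ss:>7.3f}  df={df}  MS={ms:>7.3f}")

total_df = r * 2 ** k - 1
error_df = total_df - len(effects)
print("total df =", total_df, " error df =", error_df)
```

Because every effect has one degree of freedom, each mean square equals its sum of squares, and factor A has by far the largest value.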

Use SigmaXL to Run a 2^k Full Factorial DOE

Step 1: Initiate the experiment design

Click SigmaXL -> Design of Experiments -> 2-Level Factorial/Screening -> 2-Level Factorial/Screening Designs
A new window named “2-Level Factorial/Screening Design of Experiments” pops up
Select “1” as the Number of Responses.
Enter “Tastiness” into the “Response Name” box.
Select “2” as the Number of Factors
Select “4-Run, 2**2, Full-Factorial” as the design
Select “2” as the Number of Replicates
Enter “Temp” as the name for factor A
Enter “Time” as the name for factor B
Click “OK>>”
The 2² full factorial DOE template appears in the newly generated tab “2 Factor DOE”.

Step 2: Run the experiment and record the response in the table created by SigmaXL. The data has been provided for you in the DOE Full Factorial data table in your Sample Data.xlsx file. Carefully (paying close attention to using the “Temp” and “Time” settings to map your “Tastiness” results) enter the “Tastiness” values into your newly generated DOE template.

Step 3: Analyze the experiment results

Click SigmaXL -> Design of Experiments -> 2-Level Factorial/Screening -> Analyze 2-Level Factorial/Screening Design
A new window named “Analyze 2-Level Factorial/Screening Design” appears in which “Tastiness” is automatically selected as the response variable and three terms (the two factors and their interaction) are automatically selected as the independent variables.
Click “OK”
The DOE analysis results appear in the newly generated tab “Analyze – 2 Factor DOE”.

These are the results. Since the p-values of all the independent variables in the model are smaller than the alpha level (0.05), both factors and their interaction have a statistically significant impact on the response. The high R² value shows that around 98% of the variation in the response can be explained by the model (very good results).

Enter the actual settings of the independent variables into the yellow cells on the “Analyze – 2 Factor DOE” tab. The predicted response will be calculated automatically.

Model summary: These are the software outputs for expected/predicted results, as well as the residuals for each combination.

Box Cox Transformation with SigmaXL

Michael Parker — Tue, 26 Jan 2016 21:17:48 +0000

Box Cox Transformation

[unordered_list style=”star”]

Data transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference model to be applied or to improve the interpretability or appearance of graphs.
Power transformation is a class of transformation functions that raise the response to some power. For example, a square root transformation converts X to X^(1/2).
Box Cox transformation is a popular power transformation method developed by George E. P. Box and David Cox.

[/unordered_list]

Box Cox Transformation Formula

The formula of the Box Cox transformation is:

y = (x^λ − 1) / λ when λ ≠ 0
y = ln(x) when λ = 0

Where:

[unordered_list style=”star”]

y is the transformation result
x is the variable under transformation
λ is the transformation parameter

[/unordered_list]
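A minimal sketch of the transformation for a single value follows, assuming the standard Box-Cox form: (x^λ − 1)/λ for λ ≠ 0 and ln(x) for λ = 0. Some software reports the simpler power x^λ instead, so treat this as an illustration rather than a description of SigmaXL's internals. The sample values are made up.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Standard Box-Cox transform of one positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if abs(lam) < 1e-12:
        return math.log(x)           # limiting case as lam approaches 0
    return (x ** lam - 1.0) / lam


sample = [0.5, 1.0, 4.0, 9.0]        # hypothetical positive observations
for lam in (0.0, 0.5, 1.0):
    transformed = [round(box_cox(x, lam), 4) for x in sample]
    print(f"lambda={lam}: {transformed}")
```

With λ = 1 the data are only shifted by one unit, λ = 0.5 behaves like a square root, and λ = 0 is the natural log.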

Use SigmaXL to Perform a Box-Cox Transformation

SigmaXL provides the best Box-Cox transformation with an optimal λ that minimizes the model SSE (sum of squared error). Here is an example of how we transform the non-normally distributed response to normal data using Box-Cox method.
Data File: “Box-Cox” tab in “Sample Data.xlsx”

Step 1: Test the normality of the original data set.

Select the entire range of “Y” in column H
Click SigmaXL -> Graphical Tools -> Histograms & Descriptive Statistics
A new window named “Histograms & Descriptive” pops up and the selected range automatically appears in the box below “Please select your data”.
Click “Next >>”
A new window named “Histograms & Descriptive Statistics” pops up.
Select “Y” as “Numeric Data Variables (Y)”
Click “OK>>”
The analysis results are shown automatically in the new spreadsheet “Hist Descript(1)”

Normality Test:

[unordered_list style=”star”]

H₀: The data are normally distributed.
H₁: The data are not normally distributed.

[/unordered_list]

If p-value > alpha level (0.05), we fail to reject the null hypothesis. Otherwise, we reject the null. In this example, p-value = 0.029 < alpha level (0.05). The data are not normally distributed.

Step 2: Run the Box-Cox Transformation:

Select the entire range of Y in column H
Click SigmaXL -> Process Capability -> Nonnormal -> Box-Cox Transformation
A new window named “Box-Cox Transformation” pops up and the selected range appears automatically in the box under “Please select your data”
Click “Next >>”
A new window also named “Box-Cox Transformation” pops up.
Select “Y” as “Numeric Data Variables (Y)”
Click “OK>>”
The analysis results are shown automatically in the new spreadsheet “Box-Cox (1)”

The software looks for the optimal value of lambda that minimizes the SSE (Sum of Squares of Error). In this case the minimum value is 0.12. The transformed Y can also be saved in another column; it is listed in Column G in the newly generated tab “Box-Cox (1)”.
Use the Anderson–Darling test to test the normality of the transformed data

[unordered_list style=”star”]

H₀: The data are normally distributed.
H₁: The data are not normally distributed.

[/unordered_list]

Model summary: If p-value > alpha level (0.05), we fail to reject the null. Otherwise, we reject the null. In this example, p-value = 0.327 > alpha level (0.05). The data are normally distributed.

Multiple Linear Regression

Michael Parker — Tue, 26 Jan 2016 21:08:23 +0000

What is Multiple Linear Regression?

Multiple linear regression is a statistical technique to model the relationship between one dependent variable and two or more independent variables by fitting the data set into a linear equation.
The difference between simple linear regression and multiple linear regression:

[unordered_list style=”star”]

Simple linear regression only has one predictor.
Multiple linear regression has two or more predictors.

[/unordered_list]

Multiple Linear Regression Equation

Y = β + α₁X₁ + α₂X₂ + . . . + α_pX_p + e

Where:

[unordered_list style=”star”]

Y is the dependent variable (response)
X₁, X₂ . . . X_p are the independent variables (predictors). There are p predictors in total

[/unordered_list]

Both dependent and independent variables are continuous.

[unordered_list style=”star”]

β is the intercept indicating the Y value when all the predictors are zero
α₁, α₂ . . . α_p are the coefficients of predictors. They reflect the contribution of each independent variable in predicting the dependent variable.
e is the residual term indicating the difference between the actual and the fitted response value.

[/unordered_list]

Use SigmaXL to Run a Multiple Linear Regression

Case study: We want to see whether the scores in exam one, two, and three have any statistically significant relationship with the score in final exam. If so, how are they related to final exam score? Can we use the scores in exam one, two, and three to predict the score in final exam?
Data File: “Multiple Linear Regression” tab in “Sample Data.xlsx.”

Step 1: Determine the dependent and independent variables; all should be continuous. Y (dependent variable) is the score of the final exam. X₁, X₂, and X₃ (independent variables) are the scores of exams one, two, and three respectively. All x variables are continuous.

Step 2: Start building the multiple linear regression model

Select the range of independent and dependent variables in Excel.
Click SigmaXL -> Statistical Tools -> Regression -> Multiple Regression
A new window named “Multiple Regression” pops up and the selected range appears automatically in the box below “Please select your data”
Click “Next >>”
A new window also named “Multiple Regression” pops up
Select “FINAL” as “Numeric Response (Y)” and “EXAM1”, “EXAM2” and “EXAM3” as “Continuous Predictor (X)”
Click “OK>>”
The regression analysis results appear in the newly generated spreadsheet “Multiple Regression” and the residual analysis results appear in another new spreadsheet “Mult Reg Residuals (1)”.

Step 3: Check whether the whole model is statistically significant. If not, we need to re-examine the predictors or look for new predictors before continuing.

[unordered_list style=”star”]

H₀: The model is not statistically significant (i.e., none of the predictor parameters is significantly different from zero).
H₁: The model is statistically significant (i.e., at least one predictor parameter is significantly different from zero).

[/unordered_list]

In this example, p-value is much smaller than alpha level (0.05), hence we reject the null hypothesis; the model is statistically significant.

Step 4: Check whether multicollinearity exists in the model.

The VIF information is automatically generated in table of parameter estimates.
We use the VIF (Variance Inflation Factor) to determine if multicollinearity exists.

Multicollinearity

Multicollinearity is the situation when two or more independent variables in a multiple regression model are correlated with each other. Although multicollinearity does not necessarily reduce the predictability for the model as a whole, it may mislead the calculation for individual independent variables. To detect multicollinearity, we use VIF (Variance Inflation Factor) to quantify its severity in the model.

Variance Inflation Factor (1)

VIF quantifies the degree of multicollinearity for each individual independent variable in the model.

VIF calculation:

Assume we are building a multiple linear regression model using p predictors.

Two steps are needed to calculate VIF for X₁.

Step 1: Build a multiple linear regression model for X₁ by using X₂, X₃ . . . X_p as predictors.

Step 2: Use the R² generated by the linear model in step 1 to calculate the VIF for X₁:

VIF₁ = 1 / (1 − R₁²)

Apply the same methods to obtain the VIFs for other Xs. The VIF value ranges from one to positive infinity.

Variance Inflation Factor (2)

Rules of thumb to analyze variance inflation factor (VIF):

[unordered_list style=”star”]

If VIF = 1, there is no multicollinearity.
If 1 < VIF < 5, there is small multicollinearity.
If 5 ≤ VIF < 10, there is medium multicollinearity.
If VIF ≥ 10, there is large multicollinearity.

[/unordered_list]
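As a rough illustration of the two-step VIF calculation and the rules of thumb above, the following Python sketch regresses each predictor on the remaining ones and converts the resulting R² into a VIF with the standard definition VIF = 1 / (1 − R²). The exam scores are made-up values for illustration, not the course data set, and the helper names (`solve_linear_system`, `r_squared`, `vif`, `describe_vif`) are hypothetical. SigmaXL reports VIF directly, so this is only meant to show what the number represents.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the caller's lists are untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(y, predictors):
    """R² from regressing y on the predictor columns plus an intercept."""
    n = len(y)
    columns = [[1.0] * n] + predictors  # intercept column first
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(c * col[i] for c, col in zip(coef, columns)) for i in range(n)]
    mean_y = sum(y) / n
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_error = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1.0 - ss_error / ss_total


def vif(predictors, index):
    """VIF of predictors[index]: regress it on all the other predictors."""
    others = [col for j, col in enumerate(predictors) if j != index]
    return 1.0 / (1.0 - r_squared(predictors[index], others))


def describe_vif(value):
    """Map a VIF value to the rule-of-thumb categories listed above."""
    if value >= 10:
        return "large"
    if value >= 5:
        return "medium"
    if value > 1:
        return "small"
    return "none"


# Hypothetical exam scores, for illustration only.
exam1 = [62, 71, 75, 80, 84, 88, 90, 95]
exam2 = [60, 74, 72, 83, 80, 91, 87, 97]
exam3 = [65, 70, 79, 78, 88, 85, 94, 96]
predictors = [exam1, exam2, exam3]

vifs = [vif(predictors, i) for i in range(len(predictors))]
for name, value in zip(["EXAM1", "EXAM2", "EXAM3"], vifs):
    print(name, round(value, 2), describe_vif(value))
```

The normal-equation solver is adequate for a handful of predictors; for larger or nearly singular problems a dedicated numerical library would be the safer choice.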

How to Deal with Multicollinearity

Increase the sample size.
Collect samples with a broader range for some predictors.
Remove the variable with high multicollinearity and high p-value.
Remove variables that are included more than once.
Combine correlated variables to create a new one.

In this section, we will focus on removing variables with high VIF and high p-value.

Step 5: Deal with multicollinearity:

5.1 Identify a list of independent variables with VIF higher than 5. If no variable has VIF higher than 5, go to Step 6 directly.
5.2 Among variables identified in Step 5.1, remove the one with the highest p-value.
5.3 Run the model again, check the VIFs and repeat Step 5.1.

Note: we only remove one independent variable at a time.

In this example, all three predictors have VIF higher than 5. Among them, EXAM1 has the highest p-value. We will remove EXAM1 from the equation and run the model again.

Run the new multiple linear regression with only two predictors (i.e., EXAM2 and EXAM3).
Check the VIFs of EXAM2 and EXAM3. They are both smaller than 5; hence, there is little multicollinearity left in the model.

Step 6: Identify the statistically insignificant predictors. Remove one insignificant predictor at a time and run the model again. Repeat this step until all the predictors in the model are statistically significant.

Insignificant predictors are the ones with p-value higher than alpha level (0.05). When p > alpha level, we fail to reject the null hypothesis; the predictor is not significant.

[unordered_list style=”star”]

H₀: The predictor is not statistically significant.
H₁: The predictor is statistically significant.

[/unordered_list]

As long as the p-value is greater than 0.05, remove the insignificant variables one at a time in the order of the highest p-value. Once one insignificant variable is eliminated from the model, we need to run the model again to obtain new p-values for other predictors left in the new model. In this example, both predictors’ p-values are smaller than alpha level (0.05). As a result, we do not need to eliminate any variables from the model.

Step 7: Interpret the regression equation

The multiple linear regression equation appears automatically at the top of the session window. The “Parameter Estimates” section provides the estimates of the parameters in the linear regression equation. Now that we have removed multicollinearity and all of the insignificant predictors, we have the parameters for the regression equation.

Interpreting the Results

Rsquare Adj = 98.4%

[unordered_list style=”star”]

98% of the variation in FINAL can be explained by the predictor variables EXAM2 & EXAM3.

[/unordered_list]

P-value of the F-test = 0.000

[unordered_list style=”star”]

We have a statistically significant model.

[/unordered_list]

Variables p-value:

[unordered_list style=”star”]

Both are significant (less than 0.05).

[/unordered_list]

VIF

[unordered_list style=”star”]

EXAM2 and EXAM3 are both below 5; we’re in good shape!

[/unordered_list]

Equation: −4.34 + 0.722*EXAM2 + 1.34*EXAM3

[unordered_list style=”star”]

−4.34 is the Y intercept; every prediction starts from −4.34.
0.722 is the EXAM2 coefficient; multiply it by the EXAM2 score.
1.34 is the EXAM3 coefficient; multiply it by the EXAM3 score.

[/unordered_list]

Let us say you are the professor again, and this time you want to use your prediction equation to estimate what one of your students might get on their final exam.

Assume the following:

[unordered_list style=”star”]

Exam 2 results were: 84
Exam 3 results were: 102

[/unordered_list]

Use your equation: −4.34 + 0.722*EXAM2 + 1.34*EXAM3

Predict your student’s final exam score:

−4.34 + (0.722*84) + (1.34*102) = −4.34 + 60.648 + 136.68 = 192.988
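The same arithmetic can be wrapped in a small Python function so that it is easy to repeat for any student. This is only a sketch: the coefficients are the rounded values from the equation above, and the function name `predict_final` is a hypothetical choice.

```python
def predict_final(exam2, exam3):
    """Predicted final score from the fitted equation in the text."""
    intercept = -4.34
    exam2_coefficient = 0.722
    exam3_coefficient = 1.34
    return intercept + exam2_coefficient * exam2 + exam3_coefficient * exam3


# The student from the example: Exam 2 = 84, Exam 3 = 102.
print(round(predict_final(84, 102), 3))
```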

Model summary: Nice work again! Now you can use your “magic” as the smart and efficient professor and allocate your time to other students because this one projects to perform much better than the average score of 162. Now that we know that exams two and three are statistically significant predictors, we can plug them into the regression equation to predict the results of the final exam for any student.

Correlation Coefficient with SigmaXL

Michael Parker — Tue, 26 Jan 2016 20:53:03 +0000

Pearson’s Correlation Coefficient

Pearson’s correlation coefficient, also called Pearson’s r, the coefficient of correlation, or Pearson’s product-moment correlation coefficient, is a statistic (r) that measures the linear relationship between two variables.

What is Correlation?

Correlation is a statistical technique that describes whether and how strongly two or more variables are related.
Correlation analysis helps to understand the direction and degree of association between variables, and it suggests whether one variable can be used to predict another. Of the different metrics to measure correlation, Pearson’s correlation coefficient is the most popular. It measures the linear relationship between two variables.
Correlation coefficients range from −1 to 1.

If r = 0, there is no linear relationship between the variables.
The sign of r indicates the direction of the relationship:
If r < 0, there is a negative linear correlation. If r > 0, there is a positive linear correlation.
The absolute value of r describes the strength of the relationship:
If |r| ≤ 0.5, there is a weak linear correlation.
If |r| > 0.5, there is a strong linear correlation.
If |r| = 1, there is a perfect linear correlation.
When the correlation is strong, the data points on a scatter plot will be close together (tight). The closer r is to −1 or 1, the stronger the relationship.
Near −1: strong inverse relationship
Near +1: strong direct relationship
When the correlation is weak, the data points are spread apart more (loose). The closer the correlation is to 0, the weaker the relationship.

Fig 1.0 Examples of Types of Correlation

This figure demonstrates the relationships between variables as the Pearson r value ranges from 1 to 0 and to −1. Notice that at −1 and 1 the points form a perfectly straight line.

At 0 the data points are completely random.
At 0.8 and −0.8, notice how you can see a directional relationship, but there is some noise around where a line would be.
At 0.4 and −0.4, it looks like the scattering of data points is leaning to one direction or the other, but it is more difficult to see a relationship because of all the noise.

Pearson’s correlation coefficient is only sensitive to the linear dependence between two variables. Two variables can have a perfect non-linear relationship even when the correlation coefficient is low. Notice the scatter plots below with correlation equal to 0. There are clearly relationships, but they are not linear and therefore cannot be detected with Pearson’s correlation coefficient.

Fig 1.1 Examples of Types of Relationships

Correlation and Causation

Correlation does not imply causation.
If variable A is highly correlated with variable B, it does not necessarily mean A causes B or vice versa. It is possible that an unknown third variable C is causing both A and B to change. For example, if ice cream sales at the beach are highly correlated with the number of shark attacks, it does not imply that increased ice cream sales cause increased shark attacks. They are triggered by a third factor: summer.
This example demonstrates a common mistake that people make: assuming causation when they see correlation. In this example, it is hot weather that is a common factor. As the weather is hotter, more people consume ice cream and more people swim in the ocean, making them susceptible to shark attacks.

Correlation and Dependence

If two variables are independent, the correlation coefficient is zero.
WARNING! If the correlation coefficient of two variables is zero, it does not imply they are independent. The correlation coefficient only indicates the linear dependence between two variables. When variables are non-linearly related, they are not independent of each other but their correlation coefficient could be zero.
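A quick way to see this warning in action is to compute r for a relationship that is perfectly determined but not linear. The sketch below uses a made-up data set where Y = X² on values of X that are symmetric around zero; `pearson_r` is a hypothetical helper that applies the usual sample formula.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Y depends entirely on X, but the relationship is a parabola, not a line.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

print(pearson_r(x, y))  # zero: no linear relationship is detected
```

Y is fully dependent on X here, yet r is zero because the positive and negative halves of the parabola cancel each other out.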

Correlation Coefficient and X-Y Diagram

The correlation coefficient indicates the direction and strength of the linear dependence between two variables but it does not cover all the existing relationship patterns. With the same correlation coefficient, two variables might have completely different dependence patterns. A scatter plot or X-Y diagram can help to discover and understand additional characteristics of the relationship between variables. The correlation coefficient is not a replacement for examining the scatter plot to study the variables’ relationship.
The correlation coefficient by itself does not tell us everything about the relationship between two variables. Two relationships could have the same correlation coefficient, but completely different patterns.

Statistical Significance of the Correlation Coefficient

The correlation coefficient could be high or low by chance (randomness). It may have been calculated based on two small samples that do not provide good inference on the correlation between two populations.
In order to test whether there is a statistically significant relationship between two variables, we need to run a hypothesis test to determine whether the correlation coefficient is statistically different from zero.
Hypothesis Test Statements

H₀: r = 0: Null Hypothesis: There is no correlation.
H₁: r ≠ 0: Alternate Hypothesis: There is a correlation.

Hypothesis tests will produce p-values as a result of the statistical significance test on r. When the p-value for a test is low (less than 0.05), we can reject the null hypothesis and conclude that r is significant; there is a correlation. When the p-value for a test is greater than 0.05, we fail to reject the null hypothesis; there is not enough evidence to conclude that a correlation exists.
We can also use the t statistic to draw the same conclusions regarding our test for significance of the correlation coefficient. To use the t-test to determine the statistical significance of the Pearson correlation, calculate the t statistic using the Pearson r value and the sample size, n.
Test Statistic

t = r × √(n − 2) / √(1 − r²)

Critical Statistic
The critical statistic is the t-value in the t-table with (n – 2) degrees of freedom.
If the absolute value of the calculated t value is less than or equal to the critical t value, then we fail to reject the null and claim no statistically significant linear relationship between X and Y.

If |t| ≤ t_critical, we fail to reject the null. There is no statistically significant linear relationship between X and Y.
If |t| > t_critical, we reject the null. There is a statistically significant linear relationship between X and Y.
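The decision rule can be sketched in a few lines of Python, using the standard test statistic t = r × √(n − 2) / √(1 − r²). The sample size of 20 is an assumption made for illustration only, and the critical value 2.101 is the two-tailed t-table value for 18 degrees of freedom at alpha = 0.05. The function names are hypothetical.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing whether a Pearson r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def is_significant(r, n, t_critical):
    """Reject the null hypothesis when |t| exceeds the critical t value."""
    return abs(correlation_t_statistic(r, n)) > t_critical


r = -0.832          # correlation coefficient from the example
n = 20              # assumed sample size, for illustration only
t_critical = 2.101  # t-table value, df = n - 2 = 18, alpha = 0.05, two-tailed

t = correlation_t_statistic(r, n)
print(round(t, 2), is_significant(r, n, t_critical))
```

With these assumed inputs, t is roughly −6.36, so |t| is well above the critical value and the null hypothesis is rejected.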

Using Software to Calculate the Correlation Coefficient

We are interested in understanding whether there is linear dependence between a car’s MPG and its weight and if so, how they are related. The MPG and weight data are stored in the “Correlation Coefficient” tab in “Sample Data.xlsx.” We will discuss three ways to get the results.

Use Excel to Calculate the Correlation Coefficient

The formula CORREL in Excel calculates the sample correlation coefficient of two data series. The correlation coefficient between the two data series is −0.83, which indicates a strong negative linear relationship between MPG and weight. In other words, as weight gets larger, gas mileage gets smaller.

Fig 1.3 Correlation coefficient in Excel

Interpreting Results

How do we interpret results and make decisions based on Pearson’s correlation coefficient (r) and p-values?
Let us look at a few examples:

r = −0.832, p = 0.000 (previous example). The two variables are inversely related and the linear relationship is strong. Also, this conclusion is significant as supported by p-value of 0.00.
r = −0.832, p = 0.71. Based on r, you should conclude the linear relationship between the two variables is strong and inversely related. However, with a p-value of 0.71, you should then conclude that r is not significant and that your sample size may be too small to accurately characterize the relationship.
r = 0.5, p = 0.00. Moderately positive linear relationship, r is statistically significant.
r = 0.92, p = 0.61. Strong positive linear relationship but r is not statistically significant. Get more data.
r = 1.0, p = 0.00. The two variables have a perfect linear relationship and r is significant.

Correlation Coefficient Calculation

Population Correlation Coefficient (ρ)

ρ = cov(X, Y) / (σ_X × σ_Y)

Sample Correlation Coefficient (r)

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)²]

It is only defined when the standard deviations of both X and Y are non-zero and finite. When covariance of X and Y is zero, the correlation coefficient is zero.
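The relationship between covariance, standard deviation and r can be sketched with Python's standard `statistics` module. The weight and MPG values below are invented for illustration and are not the values in “Sample Data.xlsx”; the function names are hypothetical.

```python
import statistics


def sample_covariance(x, y):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    products = ((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sum(products) / (len(x) - 1)


def sample_correlation(x, y):
    """r = covariance divided by the product of the standard deviations."""
    sd_x = statistics.stdev(x)
    sd_y = statistics.stdev(y)
    if sd_x == 0 or sd_y == 0:
        # Matches the condition in the text: r is undefined in this case.
        raise ValueError("correlation is undefined when a standard deviation is zero")
    return sample_covariance(x, y) / (sd_x * sd_y)


# Hypothetical car data, for illustration only.
weight = [2100, 2400, 2800, 3100, 3500, 3900, 4300]
mpg = [38, 34, 30, 27, 23, 20, 17]

print(round(sample_correlation(weight, mpg), 3))
```

In Excel the same number comes from `CORREL`, so a sketch like this is mainly useful for checking a spreadsheet result or for seeing why a zero covariance forces r to zero.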

Simple Linear Regression

Michael Parker — Tue, 26 Jan 2016 20:43:25 +0000

What is Simple Linear Regression?

Simple linear regression is a statistical technique to fit a straight line through the data points. It models the quantitative relationship between two variables. It is simple because only one predictor variable is involved. It describes how one variable changes according to the change of another variable. Both variables need to be continuous; there are other types of regression to model discrete data.

Simple Linear Regression Equation

The simple linear regression analysis fits the data to a regression equation in the form

Y = α × X + β + e

Where:

[unordered_list style=”star”]

Y is the dependent variable (the response) and X is the single independent variable (the predictor)
α is the slope describing the steepness of the fitting line. β is the intercept indicating the Y value when X is equal to 0
e stands for error (residual). It is the difference between the actual Y and the fitted Y (i.e. the vertical difference between the data point and the fitting line).

[/unordered_list]

Ordinary Least Squares

The ordinary least squares is a statistical method used in linear regression analysis to find the best fitting line for the data points. It estimates the unknown parameters of the regression equation by minimizing the sum of squared residuals (i.e. the vertical difference between the data point and the fitting line).

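In mathematical language, we look for the α and β that minimize the sum of squared residuals:

min Σ eᵢ² = min Σ (yᵢ − ŷᵢ)², summed over i = 1, 2 . . . n

The actual value of the dependent variable:

yᵢ = α × xᵢ + β + eᵢ

Where: i = 1, 2 . . . n.

The fitted value of the dependent variable:

ŷᵢ = α × xᵢ + β

Where: i = 1, 2 . . . n.

By using calculus, it can be shown that the sum of squared errors is minimal when

α = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

and

β = ȳ − α × x̄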
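The closed-form least squares solution is short enough to sketch directly in Python. Following the naming used in this article, the slope is called alpha and the intercept beta; the standard formulas are slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope × x̄. The data points and the function name `fit_line` are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = alpha * x + beta.

    Returns (alpha, beta), where alpha is the slope and beta the intercept.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    alpha = sxy / sxx                 # slope
    beta = mean_y - alpha * mean_x    # intercept
    return alpha, beta


# Hypothetical points that lie exactly on the line y = 2x + 1.
alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta)
```

Because the points sit exactly on a line, the fit recovers a slope of 2 and an intercept of 1 with zero residuals; real data would leave some unexplained error.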

ANOVA in Simple Linear Regression

[unordered_list style=”star”]

X: the independent variable that we use to predict;
Y: the dependent variable that we want to predict.

[/unordered_list]

The variance in simple linear regression can be expressed as a relationship between the actual value, the fitted value, and the grand mean—all in terms of Y.

[unordered_list style=”star”]

Total Variation = Total Sums of Squares = Σ(yᵢ − ȳ)²
Explained Variation = Regression Sums of Squares = Σ(ŷᵢ − ȳ)²
Unexplained Variation = Error Sums of Squares = Σ(yᵢ − ŷᵢ)²

[/unordered_list]

Regression follows the same methodology as ANOVA and the hypothesis tests behind it use the same assumptions.

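Variation Components

Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²

i.e. Total Sums of Squares = Regression Sums of Squares + Error Sums of Squares

Degrees of Freedom Components

df_total = df_regression + df_error

i.e. n – 1 = (k – 1) + (n – k), where n is the number of data points and k is the number of estimated parameters (the predictors plus the intercept, so k = 2 in simple linear regression)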

Whether the overall model is statistically significant can be tested by using F-test of ANOVA.

[unordered_list style=”star”]

H₀: The model is not statistically significant.
H_a: The model is statistically significant.

[/unordered_list]

Test Statistic

F = MS_regression / MS_error = [SS_regression / (k – 1)] / [SS_error / (n – k)]

Critical Statistic

The critical statistic is the F value in the F table with (k – 1) degrees of freedom in the numerator and (n – k) degrees of freedom in the denominator.

[unordered_list style=”star”]

If F ≤ F_critical (calculated F is less than or equal to the critical F), we fail to reject the null. There is no statistically significant relationship between X and Y.
If F > F_critical, we reject the null. There is a statistically significant relationship between X and Y.

[/unordered_list]

Coefficient of Determination

R-squared or R² (also called the coefficient of determination) measures the proportion of variability in the data that can be explained by the model.

[unordered_list style=”star”]

R² ranges from 0 to 1. The higher R² is, the better the model can fit the actual data.
R² can be calculated with the formula: R² = SS_regression / SS_total = 1 − SS_error / SS_total

[/unordered_list]
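To tie the ANOVA quantities and R² together, here is a minimal Python sketch that fits a simple linear regression and then computes the three sums of squares, the F statistic and R². It assumes k = 2 estimated parameters (slope and intercept), uses the standard definitions F = (SS_regression / (k − 1)) / (SS_error / (n − k)) and R² = SS_regression / SS_total, and runs on a small made-up data set. The function name `simple_regression_anova` is hypothetical.

```python
def simple_regression_anova(x, y):
    """Sums of squares, F statistic and R² for a simple linear regression."""
    n = len(x)
    k = 2  # parameters estimated: slope and intercept
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    fitted = [slope * a + intercept for a in x]

    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    ss_error = sum((b - f) ** 2 for b, f in zip(y, fitted))

    ms_regression = ss_regression / (k - 1)
    ms_error = ss_error / (n - k)
    return {
        "ss_total": ss_total,
        "ss_regression": ss_regression,
        "ss_error": ss_error,
        "f": ms_regression / ms_error,
        "r_squared": ss_regression / ss_total,
    }


# Hypothetical data, for illustration only.
result = simple_regression_anova([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in result.items():
    print(name, round(value, 3))
```

For this data set the total variation of 6 splits into 3.6 explained and 2.4 unexplained, giving R² = 0.6 and F = 4.5. The F value would still need to be compared with the critical F for 1 and 3 degrees of freedom before calling the model significant.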

Use SigmaXL to Run a Simple Linear Regression

Case study: We want to see whether the score on exam one has any statistically significant relationship with the score on the final exam. If yes, how much impact does exam one have on the final exam?

Data File: “Simple Linear Regression” tab in “Sample Data.xlsx”

Step 1: Determine the dependent and independent variables. Both should be continuous variables.

[unordered_list style=”star”]

Y (dependent variable) is the score of final exam.
X (independent variable) is the score of exam one.

[/unordered_list]

Step 2: Create a scatter plot to visualize whether there seems to be a linear relationship between X and Y.

Select the range of both independent and dependent variables in Excel.
Click SigmaXL -> Graphical Tools -> Scatter Plots
A new window named “Scatter Plots” pops up and the selected range appears automatically in the box below “Please select your data”.
Click “Next >>”
A new window also named “Scatter Plots” pops up.
Select “FINAL” as “Numeric Response (Y)” and “EXAM1” as “Numeric Predictor (X1) >>”
Click “OK>>”
A scatter plot is generated in a new spreadsheet “Scatterplot(1)”.

Based on the scatter plot, the relationship between exam one and final seems linear. The higher the score on exam one, the higher the score on the final. It appears you could “fit” a line through these data points.

Step 3: Run the simple linear regression analysis.

Select the range of both independent and dependent variables in Excel.
Click SigmaXL -> Statistical Tools -> Regression -> Multiple Regression
A new window named “Multiple Regression” pops up and the selected range appears automatically in the box below “Please select your data”
Click “Next >>”
A new window also named “Multiple Regression” pops up
Select “FINAL” as “Numeric Response (Y)” and “EXAM1” as “Continuous Predictor (X)”
Click “OK>>”
The regression analysis results appear in the newly generated spreadsheet “Multiple Regression” and the residual analysis results appear in another new spreadsheet “Mult Reg Residuals (1)”.

Step 4: Check whether the model is statistically significant. If not significant, we will need to re-examine the predictor or look for new predictors before continuing. R² measures the percentage of variation in the data set that can be explained by the model. 89.5% of the variability in the data can be accounted for by this linear regression model. The “Analysis of Variance” section provides an ANOVA table covering degrees of freedom, sum of squares, and mean square information for total, regression and error. The p-value of the F-test is lower than the α level (0.05), indicating that the model is statistically significant.

The p-value is 0.0001; therefore, we reject the null and claim the model is statistically significant. The R square value says that 89.5% of the variability can be explained by this model.

Step 5: Understand regression equation

The estimates of slope and intercept are shown in the equation at the top of the output. In this example, Y = 15.622 + 1.852 × Exam 1, where Y is the predicted final exam score. Each one-point increase in the Exam 1 score raises the predicted final score by 1.852 points.

Interpreting the Results

Let us say you are the professor and you want to use this prediction equation to estimate what two of your students might get on their final exam.

Rsquare Adj = 89.0%

[unordered_list style=”star”]

89% of the variation in FINAL can be explained by EXAM1

[/unordered_list]

P-value of the F-test = 0.000

[unordered_list style=”star”]

We have a statistically significant model

[/unordered_list]

Prediction Equation: 15.6 + 1.85 × EXAM1

[unordered_list style=”star”]

15.6 is the Y intercept; every prediction starts from 15.6
1.85 is the EXAM1 coefficient; multiply it by the EXAM1 score

[/unordered_list]

Because the model is significant, and it explains 89% of the variability, we can use the model to predict final exam scores based on the results of Exam1.

Let us assume the following:

[unordered_list style=”star”]

Student “A” exam 1 results were: 79
Student “B” exam 1 results were: 94

[/unordered_list]

Remember our prediction equation 15.6 + 1.85 × Exam1?

Now apply the equation to each student:

Student “A” Estimate: 15.6 + (1.85 × 79) = 161.8

Student “B” Estimate: 15.6 + (1.85 × 94) = 189.5
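The two estimates can be reproduced with a short Python sketch. It uses the rounded coefficients from the prediction equation above; the function name `predict_final_from_exam1` and the dictionary of students are hypothetical.

```python
def predict_final_from_exam1(exam1):
    """Predicted final score from the rounded equation 15.6 + 1.85 * EXAM1."""
    return 15.6 + 1.85 * exam1


# Exam 1 results for the two students in the example.
exam1_scores = {"A": 79, "B": 94}

for student, score in exam1_scores.items():
    print(student, round(predict_final_from_exam1(score), 2))
```

Student “A” comes out at 161.75 before rounding, which is the 161.8 shown above.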

Model summary: By simply substituting the exam 1 scores into the equation, we can predict the students’ final exam scores. But the key thing about the model is whether or not it is useful. In this case, the professor can use the results to figure out where to spend his time helping students.
