# How to Use Scatter Plots for Problem Solving and Optimization

**The Scatter Plot**

In my last post, I promised a discussion of another very useful tool in the Six Sigma tool kit, the *Scatter Plot.* Just like the rest of the tools presented in this series, the Scatter Plot is an excellent tool for solving problems. But as you will see, the real value in the Scatter Plot is that it helps you identify the strength of the relationship between two variables and their corresponding cause and effect.

**Background**

A Scatter Plot, also known as a Scatter Diagram, is a tool used to analyze relationships between two variables. One variable (e.g. Variable 1) is plotted on the horizontal axis (x) and the other variable (e.g. Variable 2) is plotted on the vertical axis (y). Most of the time, a Scatter Plot is used to investigate suspected cause-and-effect relationships between variables. But while the diagram shows relationships, it does not, by itself, prove that one variable *causes* the other variable.

Scatter Plots have a very specific purpose and are similar to other graphs in that they use both the horizontal and vertical axes to plot data points. Scatter Plots are used to demonstrate just how much one variable might be affected by another variable. This relationship between two variables is referred to as a *correlation*, and the strength of this correlation can be measured both visually and mathematically. When you plot one variable against the other (i.e. “x” versus “y”), if the data points form a perfectly straight line, the two variables are said to have a *perfect correlation*. If this straight-line relationship moves in an upward direction from left to right, then the two variables are said to have a *positive correlation*. Conversely, if the line moves in a downward direction from left to right, then the variables have a *negative correlation*. Let’s take a look at this visually.

| Variable 1 | Variable 2 |
| --- | --- |
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |
| 4 | 8 |
| 5 | 10 |
| 6 | 12 |
| 7 | 14 |
| 8 | 16 |
| 9 | 18 |
| 10 | 20 |

Suppose you collected data on two variables (i.e. Variables 1 and 2 above) and you wanted to see a visual representation of the strength of their relationship. You could do so in Excel by performing a *regression analysis* with the Analysis ToolPak add-in. You enter the numbers into the spreadsheet under Variables 1 and 2, as with the data above, and then ask Excel to give you a fitted line plot of the relationship between Variable 1 and Variable 2 (here is a good YouTube video that walks you through graphing a scatterplot with a line of best fit in Excel: https://youtu.be/_FOmUskHzPA).

The figure below represents what a positive correlation looks like. In this example, as you increase Variable 1’s numbers, there is a very predictable increase in Variable 2.

Excel calculates a number that reflects how strong this correlation is: the R² value (i.e. the *coefficient of determination*) listed on the plot below. A value of 1.0 for R² tells us that a perfect correlation exists between Variables 1 and 2. Excel also provides you with the equation that describes the relationship mathematically, which in this case is y = 2x. This equation tells us that each one-unit increase in Variable 1 increases Variable 2 by 2 units, in linear fashion.
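If you want to see roughly what happens behind the fitted line, the following is a minimal Python sketch of an ordinary least-squares fit applied to the Variable 1 and Variable 2 data above. The helper name `fit_line` is my own (it is not an Excel function), and the sketch assumes a straight-line model of the form y = slope × x + intercept.

```python
# Variable 1 and Variable 2 from the table above.
variable_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
variable_2 = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points.

    Hypothetical helper written for this illustration.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(variable_1, variable_2)
print(f"y = {slope:.2f}x + {intercept:.2f}")  # prints: y = 2.00x + 0.00
```

The slope of 2 and intercept of 0 agree with the y = 2x equation described above.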

R² is a statistical measure of how close the data are to the fitted regression line. The definition of R² is very straightforward; it is the percentage of the response variable variation that is explained by a linear model. Or:

R² = Explained variation / Total variation

R² is always between 0 and 100% (or 0 and 1.0 if you are not using percentages):

- 0% (or 0.00) indicates that the model explains none of the variability of the response data around its mean.
- 100% (or 1.00) indicates that the model explains all the variability of the response data around its mean. In our perfect, positive correlation plot above, the R² value is 1.0, so the model explains all of the variation.
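To make the two extremes concrete, here is a small Python sketch of the "explained variation / total variation" ratio. The function name `r_squared` is hypothetical, and the ratio is computed the simple way shown here on the assumption that the predictions come from a least-squares line (or from the mean itself). It reuses the Variable 1 and Variable 2 data from earlier.

```python
def r_squared(observed, predicted):
    """Explained variation divided by total variation, both measured around the mean.

    Hypothetical helper; assumes the predictions come from a least-squares fit.
    """
    mean_y = sum(observed) / len(observed)
    explained = sum((p - mean_y) ** 2 for p in predicted)
    total = sum((y - mean_y) ** 2 for y in observed)
    return explained / total


variable_1 = list(range(1, 11))            # 1 through 10
variable_2 = [2 * x for x in variable_1]   # 2 through 20

# Model 1: the fitted line y = 2x reproduces every point.
line_predictions = [2 * x for x in variable_1]

# Model 2: ignore Variable 1 and always predict the mean of Variable 2.
mean_value = sum(variable_2) / len(variable_2)
mean_predictions = [mean_value] * len(variable_2)

r2_line = r_squared(variable_2, line_predictions)
r2_mean = r_squared(variable_2, mean_predictions)
print(r2_line, r2_mean)  # prints: 1.0 0.0
```

The line explains all of the variation (100%), while always predicting the mean explains none of it (0%).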

A perfect positive correlation gives an R² value of 1.0, with the fitted line moving upward from left to right. Because R² measures only how much of the variation is explained, and not the direction of the relationship, a perfect negative correlation also gives an R² value of 1.0, with the fitted line moving downward from left to right, as in the figure below.
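One way to check this is to compare R² with the correlation coefficient r, which does carry a sign. The Python sketch below is only an illustration: `pearson_r` is a hypothetical helper, and the "falling" data set is one I made up as a mirror image of the Variable 1 and Variable 2 data, not data from the original plot. For a straight-line fit, R² equals r squared.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, between -1 and +1 (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = list(range(1, 11))
rising = [2 * x for x in xs]        # perfect positive correlation
falling = [22 - 2 * x for x in xs]  # assumed mirror image: perfect negative correlation

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)

# The sign of r gives the direction; squaring it removes the sign.
print(f"rising:  r = {r_rising:+.2f}, R^2 = {r_rising ** 2:.2f}")
print(f"falling: r = {r_falling:+.2f}, R^2 = {r_falling ** 2:.2f}")
```

Both data sets give an R² of 1.0; only the sign of r (and of the slope) tells you which way the line runs.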

If there is absolutely no correlation present, then the value of R² would be 0 (or close to it, as in the plot below). The conclusion from all of these plots and calculations is that the closer the value of R² is to 1.0, the stronger the linear relationship between the two variables. Conversely, the closer the value of R² is to 0, the weaker the correlation.

**When Do You Use a Scatter Plot?**

Most of the time you use a Scatter Plot to help you explain cause-and-effect relationships and to search for root causes of an identified problem, but that’s not its only use. You can also use it when you are trying to determine if two variables are related after listing the potential root causes and possible effects using a Fishbone Diagram. You could even use a Scatter Plot to “optimize” the best settings on a production machine by locating a data point on the y-axis and finding the setting required to achieve this value in production.

**An Example**

Suppose you were setting up a new production line and you wanted to determine what speed to run an extruder to give you a thickness of the plastic material close to 7.0 mm. You collect data by varying the speed of the extruder and measuring the thickness of the extruded plastic. Your data set looks like the following:

| Speed | Thickness (mm) |
| --- | --- |
| 100 | 4.5 |
| 200 | 4.6 |
| 300 | 4.8 |
| 400 | 5.0 |
| 500 | 7.4 |
| 600 | 8.0 |
| 700 | 9.3 |
| 800 | 9.8 |
| 900 | 9.8 |
| 1000 | 9.7 |

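You then use Excel to create a Scatter Plot with a fitted line, arriving at the following plot, which has both an equation and an R² value. It’s important to note what an R² value of 0.908 means: almost 91% of the variation in thickness is explained by the linear relationship with extruder speed. It does not mean that the speed you select will be 91% accurate.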
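For readers who want to reproduce the numbers without Excel, here is a hedged Python sketch that fits a straight line to the speed and thickness data above, computes R², and then solves the fitted equation for the speed that should give 7.0 mm. The variable names are my own, the speed units are whatever your extruder uses (the data set does not state them), and the straight-line model is an assumption.

```python
# Extruder data from the table above.
speeds = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
thickness_mm = [4.5, 4.6, 4.8, 5.0, 7.4, 8.0, 9.3, 9.8, 9.8, 9.7]

n = len(speeds)
mean_x = sum(speeds) / n
mean_y = sum(thickness_mm) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(speeds, thickness_mm))
sxx = sum((x - mean_x) ** 2 for x in speeds)
syy = sum((y - mean_y) ** 2 for y in thickness_mm)

# Least-squares line: thickness = slope * speed + intercept
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# For a straight-line fit, R^2 is the squared correlation coefficient.
r2 = sxy ** 2 / (sxx * syy)

# Rearrange the fitted equation to find the speed for the target thickness.
target_mm = 7.0
speed_for_target = (target_mm - intercept) / slope

print(f"thickness = {slope:.5f} * speed + {intercept:.3f}")
print(f"R^2 = {r2:.3f}")
print(f"estimated speed for {target_mm} mm: {speed_for_target:.0f}")
```

On this data the sketch gives an R² of about 0.908 and an estimated speed of roughly 511. Treat that as a starting point rather than a final setting: the measured thickness at a speed of 500 was already 7.4 mm, so a confirmation run near the estimate is worth doing.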