
What Is Single Linear Regression?


  1. 1. Single Linear Regression Conceptual Explanation
  2. 2. • Welcome to this explanation of Single Linear Regression.
  3. 3. • Welcome to this explanation of Single Linear Regression. • Single linear regression is an extension of correlation.
  4. 4. • Welcome to this explanation of Single Linear Regression. • Single linear regression is an extension of correlation. [Diagram: Correlation extends to Single Linear Regression]
  5. 5. • Correlation is designed to render a single coefficient that represents the degree of coherence between two variables
  6. 6. • Correlation is designed to render a single coefficient that represents the degree of coherence between two variables
  7. 7. • Correlation is designed to render a single coefficient that represents the degree of coherence between two variables
  8. 8. • Correlation is designed to render a single coefficient that represents the degree of coherence between two variables +.99 As one variable increases the other increases
  9. 9. • Correlation is designed to render a single coefficient that represents the degree of coherence between two variables +.99 As one variable increases the other increases This coefficient represents an almost perfect positive correlation or relationship between these two variables.
  10. 10. • Correlation is designed to render a single coefficient that represents the degree of coherence between two variables Ave Daily Temp 50° 60° 70° 80° 90°
  11. 11. • Correlation is designed to render a single coefficient that represents the degree of coherence between two variables Ave Daily Temp 50° 60° 70° 80° 90° As one variable decreases the other increases
  12. 12. • Correlation is designed to render a single coefficient that represents the degree of coherence between two variables Ave Daily Temp 50° 60° 70° 80° 90° -.99 As one variable decreases the other increases
  13. 13. • Correlation is designed to render a single coefficient that represents the degree of coherence between two variables Ave Daily Temp 50° 60° 70° 80° 90° -.99 As one variable decreases the other increases Almost a perfect negative correlation or relationship between these two variables.
  14. 14. • Single linear regression uses that information to predict the value of one variable based on the given value of the other variable.
  15. 15. • Single linear regression uses that information to predict the value of one variable based on the given value of the other variable.
  16. 16. • Single linear regression uses that information to predict the value of one variable based on the given value of the other variable. • For example:
  17. 17. • For example: If the following data set were real, what would you predict ice cream sales would be when the temperature reaches 100°?
  18. 18. • For example: If the following data set were real, what would you predict ice cream sales would be when the temperature reaches 100°? Ave Daily Ice Cream Sales ? 560 480 350 320 230 Ave Daily Temp 100° 90° 80° 70° 60° 50°
  19. 19. • Single linear regression uses that information to predict the value of one variable (ice cream) based on the given value of the other variable (temperature).
  20. 20. • Single linear regression uses that information to predict the value of one variable (ice cream) based on the given value of the other variable (temperature).
  21. 21. If the following data set were real, what would you predict ice cream sales would be when the temperature reaches 100°? • Rather than simply examining the relationship between the variables (as is the case with the Pearson Product Moment Correlation), one variable will be used as the predictor (temperature) and the other value will be used as the outcome or predicted (ice cream sales). Ave Daily Ice Cream Sales 630? 560 480 350 320 230 Ave Daily Temp 100° 90° 80° 70° 60° 50°
  22. 22. If the following data set were real, what would you predict ice cream sales would be when the temperature reaches 100°? • Rather than simply examining the relationship between the variables (as is the case with the Pearson Product Moment Correlation), one variable will be used as the predictor (temperature) and the other value will be used as the outcome or predicted (ice cream sales). • Linear Regression makes it possible to estimate a value like 630 Ave Daily Ice Cream Sales 630? 560 480 350 320 230 Ave Daily Temp 100° 90° 80° 70° 60° 50°
  23. 23. • In some cases which variable is considered predictor or outcome is arbitrary.
  24. 24. • In some cases which variable is considered predictor or outcome is arbitrary. • Like measures of depression and anxiety
  25. 25. • In some cases which variable is considered predictor or outcome is arbitrary. • Like measures of depression and anxiety Composite Depression Score 33 26 22 14 12 6 Composite Anxiety Score 103 100 92 74 52 26
  26. 26. • In some cases which variable is considered predictor or outcome is arbitrary. • Like measures of depression and anxiety • It’s not clear which influences which. Most likely depression and anxiety mutually influence one another. Composite Depression Score 33 26 22 14 12 6 Composite Anxiety Score 103 100 92 74 52 26
  27. 27. • In some cases, either by theory or by the nature of the research design, one variable will be rationally defined as the predictor and the other as the outcome.
  28. 28. • In some cases, either by theory or by the nature of the research design, one variable will be rationally defined as the predictor and the other as the outcome. Ave Daily Exposure to Sunlight 3.3 hrs 2.6 hrs 2.2 hrs 1.4 hrs 1.2 hrs 0.6 hrs
  29. 29. • In some cases, either by theory or by the nature of the research design, one variable will be rationally defined as the predictor and the other as the outcome. Ave Daily Exposure to Sunlight 3.3 hrs 2.6 hrs 2.2 hrs 1.4 hrs 1.2 hrs 0.6 hrs Levels of Vitamin E after two months 10.3 units 8.1 units 7.3 units 7.0 units 6.8 units 5.7 units
  30. 30. • In some cases, either by theory or by the nature of the research design, one variable will be rationally defined as the predictor and the other as the outcome. Ave Daily Exposure to Sunlight 3.3 hrs 2.6 hrs 2.2 hrs 1.4 hrs 1.2 hrs 0.6 hrs Levels of Vitamin E after two months 10.3 units 8.1 units 7.3 units 7.0 units 6.8 units 5.7 units In this example, exposure to sunlight may impact levels of Vitamin E. But, levels of Vitamin E would not impact the amount of sunlight one gets.
  31. 31. • An easy way to conceptualize single linear regression is to create a scatterplot in Cartesian space.
  32. 32. • An easy way to conceptualize single linear regression is to create a scatterplot in Cartesian space. Let’s plot the following data set:
  33. 33. • An easy way to conceptualize single linear regression is to create a scatterplot in Cartesian space. Let’s plot the following data set: Composite Depression Score 33 26 22 14 12 6 Composite Anxiety Score 103 100 92 74 52 26
  34. 34. • First, we assign the predictor variable along the X axis, which in this case we’ll arbitrarily say is depression.
  35. 35. • First, we assign the predictor variable along the X axis, which in this case we’ll arbitrarily say is depression. [Scatterplot axes: Relationship between Depression & Anxiety]
  36. 36. •... and the outcome variable along the Y axis we’ll arbitrarily say is Anxiety.
  37. 37. • ... and the outcome variable along the Y axis we’ll arbitrarily say is Anxiety. [Scatterplot axes: Relationship between Depression & Anxiety]
  38. 38. • Now, let’s identify or plot each point or dot
  39. 39. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26
  40. 40. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety]
  41. 41. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety, point (33, 103) plotted]
  42. 42. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26
  43. 43. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety, point (26, 100) plotted]
  44. 44. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26
  45. 45. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety, point (22, 92) plotted]
  46. 46. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26
  47. 47. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety, point (14, 74) plotted]
  48. 48. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26
  49. 49. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety, point (12, 52) plotted]
  50. 50. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26
  51. 51. • Now, let’s identify or plot each point or dot Depression 33 26 22 14 12 6 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety, point (6, 26) plotted]
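The scatterplot itself takes only a few lines of code. A minimal sketch, assuming Python with matplotlib available (axis limits chosen to match the slides):

```python
import matplotlib.pyplot as plt

# The six (depression, anxiety) pairs from the table above
depression = [33, 26, 22, 14, 12, 6]
anxiety = [103, 100, 92, 74, 52, 26]

plt.scatter(depression, anxiety)
plt.xlabel("Depression")   # predictor on the X axis
plt.ylabel("Anxiety")      # outcome on the Y axis
plt.title("Relationship between Depression & Anxiety")
plt.xlim(0, 40)
plt.ylim(0, 120)
plt.show()
```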
  52. 52. • Visually, one can see in the plotted space whether there is a tendency for the variables to be related and in what direction they are related.
  53. 53. • Visually, one can see in the plotted space whether there is a tendency for the variables to be related and in what direction they are related.
  54. 54. • Visually, one can see in the plotted space whether there is a tendency for the variables to be related and in what direction they are related. [Scatterplot: Relationship between Depression & Anxiety] In this case there is a strong tendency to relate and the relationship is positive
  55. 55. • With this data set the tendency for the variables to relate is strong and the direction is negative:
  56. 56. • With this data set the tendency for the variables to relate is strong and the direction is negative: Depression 6 12 14 22 26 33 Anxiety 103 100 92 74 52 26
  57. 57. • With this data set the tendency for the variables to relate is strong and the direction is negative: Depression 6 12 14 22 26 33 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety]
  58. 58. • With this data set the tendency for the variables to relate is strong and the direction is negative: Depression 6 12 14 22 26 33 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety] Strong and Negative
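To put a number on that impression, here is a small Python sketch that computes the Pearson correlation coefficient by hand for the data set above; it comes out strongly negative, at roughly -.97:

```python
import math

depression = [6, 12, 14, 22, 26, 33]
anxiety = [103, 100, 92, 74, 52, 26]

# Pearson r = sum((x - x̄)(y - ȳ)) / sqrt(sum((x - x̄)²) * sum((y - ȳ)²))
mx = sum(depression) / len(depression)
my = sum(anxiety) / len(anxiety)
num = sum((x - mx) * (y - my) for x, y in zip(depression, anxiety))
den = math.sqrt(sum((x - mx) ** 2 for x in depression) *
                sum((y - my) ** 2 for y in anxiety))
print(round(num / den, 2))  # -0.97: strong and negative
```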
  59. 59. • When no relationship exists the scatter plot tends to look like a big circle.
  60. 60. • When no relationship exists the scatter plot tends to look like a big circle. Depression 22 33 12 6 14 26 Anxiety 103 100 92 74 52 26
  61. 61. • When no relationship exists the scatter plot tends to look like a big circle. Depression 22 33 12 6 14 26 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety]
  62. 62. • When no relationship exists the scatter plot tends to look like a big circle. Depression 22 33 12 6 14 26 Anxiety 103 100 92 74 52 26 [Scatterplot: Relationship between Depression & Anxiety]
  63. 63. • When no relationship exists the scatter plot tends to look like a big circle. Depression 22 6 33 26 14 12 Anxiety 103 100 92 74 52 26
  64. 64. [Scatterplot: Relationship between Depression & Anxiety] • When no relationship exists the scatter plot tends to look like a big circle. Depression 22 6 33 26 14 12 Anxiety 103 100 92 74 52 26
  65. 65. [Scatterplot: Relationship between Depression & Anxiety] • When no relationship exists the scatter plot tends to look like a big circle. Depression 22 6 33 26 14 12 Anxiety 103 100 92 74 52 26 Weak and Positive
  66. 66. • When no relationship exists the scatter plot tends to look like a big circle. Depression 6 14 33 26 12 22 Anxiety 103 100 74 92 52 26
  67. 67. [Scatterplot: Relationship between Depression & Anxiety] • When no relationship exists the scatter plot tends to look like a big circle. Depression 6 14 33 26 12 22 Anxiety 103 100 74 92 52 26
  68. 68. [Scatterplot: Relationship between Depression & Anxiety] • When no relationship exists the scatter plot tends to look like a big circle. Depression 6 14 33 26 12 22 Anxiety 103 100 74 92 52 26 Weak and Negative
  69. 69. • You might have noticed that as the variables are related either positively or negatively, the plot looks more like an oval tilted one way or the other.
  70. 70. • You might have noticed that as the variables are related either positively or negatively, the plot looks more like an oval tilted one way or the other. [Two scatterplots: Relationship between Depression & Anxiety]
  71. 71. • You might have noticed that as the variables are related either positively or negatively, the plot looks more like an oval tilted one way or the other. [Two scatterplots: Relationship between Depression & Anxiety, one Weak and Negative and one Weak and Positive]
  72. 72. • As mentioned before, Linear Regression is used to predict one variable (ice cream sales) from another related variable (temperature).
  73. 73. • As mentioned before, Linear Regression is used to predict one variable (ice cream sales) from another related variable (temperature). • The stronger the relationship (e.g., +.99 or -.99) the more accurate the prediction.
  74. 74. • As mentioned before, Linear Regression is used to predict one variable (ice cream sales) from another related variable (temperature). • The stronger the relationship (e.g., +.99 or -.99) the more accurate the prediction. • The weaker the relationship (e.g., +.14 or -.03) the less accurate the prediction.
  75. 75. • As mentioned before, Linear Regression is used to predict one variable (ice cream sales) from another related variable (temperature). • The stronger the relationship (e.g., +.99 or -.99) the more accurate the prediction. • The weaker the relationship (e.g., +.14 or -.03) the less accurate the prediction.
  76. 76. • As mentioned before, Linear Regression is used to predict one variable (ice cream sales) from another related variable (temperature). • The stronger the relationship (e.g., +.99 or -.99) the more accurate the prediction. • The weaker the relationship (e.g., +.14 or -.03) the less accurate the prediction. • One of the ways to represent those relationships is of course with the coefficients (e.g., +.99, +.14, -.03, -.99).
  77. 77. • As mentioned before, Linear Regression is used to predict one variable (ice cream sales) from another related variable (temperature). • The stronger the relationship (e.g., +.99 or -.99) the more accurate the prediction. • The weaker the relationship (e.g., +.14 or -.03) the less accurate the prediction. • One of the ways to represent those relationships is of course with the coefficients (e.g., +.99, +.14, -.03, -.99). • Another way to represent it is by graphing the relationship.
  78. 78. • Recall that a line in Cartesian space is defined by its slope and its Y intercept (the value of Y when X equals 0).
  79. 79. • Recall that a line in Cartesian space is defined by its slope and its Y intercept (the value of Y when X equals 0). [Y= intercept + (slope ∙ X)]
  80. 80. • Recall that a line in Cartesian space is defined by its slope and its Y intercept (the value of Y when X equals 0). [Y = intercept + (slope ∙ X)] [Plot: a line on axes running 0 to 6]
  81. 81. • In this case the slope would be 1. You may remember that this value is derived by taking what is called the “rise” over the “run”.
  82. 82. [Plot: rise 1 over run 1] • In this case the slope would be 1. You may remember that this value is derived by taking what is called the “rise” over the “run”.
  83. 83. [Plot: rise 1 over run 1] • In this case the slope would be 1. You may remember that this value is derived by taking what is called the “rise” over the “run”. • So the equation for this line so far would look like this:
  84. 84. [Plot: rise 1 over run 1] • In this case the slope would be 1. You may remember that this value is derived by taking what is called the “rise” over the “run”. • So the equation for this line so far would look like this: 𝒚 = 0 + (1/1)𝒙
  85. 85. [Plot: rise 1 over run 1] 𝒚 = 0 + (1/1)𝒙
  86. 86. [Plot: rise 1 over run 1] 𝒚 = 0 + (1/1)𝒙 The 0 is where the line crosses the Y axis.
  87. 87. [Plot: rise 1 over run 1] 𝒚 = 0 + (1/1)𝒙 The 1/1 is the slope, which is the rise over the run.
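In code, that line is nothing more than a function of X. A minimal Python sketch (the function name is illustrative, not from the slides):

```python
def line(x, intercept=0.0, slope=1.0):
    """Y = intercept + (slope * X): the equation of a line in Cartesian space."""
    return intercept + slope * x

# With intercept 0 and slope 1 (rise 1 over run 1), Y simply equals X:
print([line(x) for x in range(7)])  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```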
  88. 88. • A line represents the functional relationship between variable X and variable Y, therefore, that line can be used to predict a Y value from any given X value.
  89. 89. • A line represents the functional relationship between variable X and variable Y, therefore, that line can be used to predict a Y value from any given X value. Feb Mar Apr May Jun Ave Monthly Temperature 50° 60° 70° 80° 90° Ave Monthly Ice Cream Sales 239 320 400 480 560
  90. 90. • In this case the two variables (temperature and ice cream sales) have a perfect linear relationship. This is rarely ever seen among variables such as these in the real world, but for illustrative purposes we have created a perfect relationship.
  91. 91. • In this case the two variables (temperature and ice cream sales) have a perfect linear relationship. This is rarely ever seen among variables such as these in the real world, but for illustrative purposes we have created a perfect relationship. [Chart: Average Monthly Ice Cream Sales plotted against Ave Monthly Temperature]
  92. 92. • Now let’s say we have data for the average temperature during the month of July. But, we don’t have the data for the average ice cream sales for July
  93. 93. • Now let’s say we have data for the average temperature during the month of July. But, we don’t have the data for the average ice cream sales for July. Feb Mar Apr May Jun JUL Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 ?
  94. 94. • Now let’s say we have data for the average temperature during the month of July. But, we don’t have the data for the average ice cream sales for July. • Using single linear regression we can predict the average ice cream sales for July. Here is the formula we will use for the prediction: Feb Mar Apr May Jun JUL Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 ?
  95. 95. • Now let’s say we have data for the average temperature during the month of July. But, we don’t have the data for the average ice cream sales for July. • Using single linear regression we can predict the average ice cream sales for July. Here is the formula we will use for the prediction: Feb Mar Apr May Jun JUL Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 ? ŷ = y-intercept + slope(x)
  96. 96. • Now let’s say we have data for the average temperature during the month of July. But, we don’t have the data for the average ice cream sales for July. • Using single linear regression we can predict the average ice cream sales for July. Here is the formula we will use for the prediction: • There are many ways to write this equation. Here is one way: Feb Mar Apr May Jun JUL Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 ? ŷ = y-intercept + slope(x)
  97. 97. • Now let’s say we have data for the average temperature during the month of July. But, we don’t have the data for the average ice cream sales for July. • Using single linear regression we can predict the average ice cream sales for July. Here is the formula we will use for the prediction: • There are many ways to write this equation. Here is one way: Feb Mar Apr May Jun JUL Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 ? ŷ = y-intercept + slope(x) ŷ = b + m(x)
  98. 98. • Using this data set we can create a formula for a straight line that represents that relationship:
  99. 99. • Using this data set we can create a formula for a straight line that represents that relationship: Feb Mar Apr May Jun Ave Monthly Temperature 50° 60° 70° 80° 90° Ave Monthly Ice Cream Sales 239 320 400 480 560
  100. 100. • Using this data set we can create a formula for a straight line that represents that relationship: Feb Mar Apr May Jun Ave Monthly Temperature 50° 60° 70° 80° 90° Ave Monthly Ice Cream Sales 239 320 400 480 560 [Chart: Average Monthly Ice Cream Sales plotted against Ave Monthly Temperature] ŷ = -162 + 8(x)
  101. 101. • With this equation we can now plug in the average temperature for July (1000) and see what the predicted average ice cream sales would be:
  102. 102. • With this equation we can now plug in the average temperature for July (100°) and see what the predicted average ice cream sales would be: Feb Mar Apr May Jun Jul Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 ŷ
  103. 103. • With this equation we can now plug in the average temperature for July (100°) and see what the predicted average ice cream sales would be: Feb Mar Apr May Jun Jul Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 ŷ [Chart: Average Monthly Ice Cream Sales plotted against Ave Monthly Temperature] ŷ = -162 + 8(100)
  104. 104. • With this equation we can now plug in the average temperature for July (100°) and see what the predicted average ice cream sales would be: Feb Mar Apr May Jun Jul Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 ŷ [Chart: Average Monthly Ice Cream Sales plotted against Ave Monthly Temperature] ŷ = -162 + 8(100)
  105. 105. • With this equation we can now plug in the average temperature for July (100°) and see what the predicted average ice cream sales would be: Feb Mar Apr May Jun Jul Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 ŷ [Chart: Average Monthly Ice Cream Sales plotted against Ave Monthly Temperature] ŷ = -162 + 800
  106. 106. • With this equation we can now plug in the average temperature for July (100°) and see what the predicted average ice cream sales would be: Feb Mar Apr May Jun Jul Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 ŷ [Chart: Average Monthly Ice Cream Sales plotted against Ave Monthly Temperature] ŷ = 638
  107. 107. • With this equation we can now plug in the average temperature for July (100°) and see what the predicted average ice cream sales would be: Feb Mar Apr May Jun Jul Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 638 [Chart: Average Monthly Ice Cream Sales plotted against Ave Monthly Temperature] ŷ = 638
  108. 108. • So, based on our single linear regression analysis we would predict that in the month of July the average monthly ice cream sales will be 638. Feb Mar Apr May Jun Jul Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 638 [Chart: Average Monthly Ice Cream Sales plotted against Ave Monthly Temperature] ŷ = 638
  109. 109. • So, based on our single linear regression analysis we would predict that in the month of July the average monthly ice cream sales will be 638. • This is a simple demonstration of how regression works. Feb Mar Apr May Jun Jul Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 638 [Chart: Average Monthly Ice Cream Sales plotted against Ave Monthly Temperature] ŷ = 638
  110. 110. • So, based on our single linear regression analysis we would predict that in the month of July the average monthly ice cream sales will be 638. • This is a simple demonstration of how regression works. • In reality, however, most variables will not correlate as perfectly as these did: Feb Mar Apr May Jun Jul Ave Monthly Temperature 50° 60° 70° 80° 90° 100° Ave Monthly Ice Cream Sales 239 320 400 480 560 638 [Chart: Average Monthly Ice Cream Sales plotted against Ave Monthly Temperature] ŷ = 638
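That prediction is a single line of arithmetic. A minimal Python sketch using the deck’s equation ŷ = -162 + 8(x) (the function name is illustrative):

```python
def predict_sales(temp):
    """Predicted ave monthly ice cream sales from the line ŷ = -162 + 8(x)."""
    return -162 + 8 * temp

print(predict_sales(100))  # July at 100° -> 638, matching the slides
```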
  111. 111. • Most will look like this:
  112. 112. • Most will look like this:
  113. 113. • Most will look like this: • This line is called the best fitting line because it minimizes the distance between the line and all of the points. You will notice again that we have a linear equation for that line:
  114. 114. • Most will look like this: • This line is called the best fitting line because it minimizes the distance between the line and all of the points. You will notice again that we have a linear equation for that line: ŷ = -50.93 + 7.21(x)
  115. 115. • Most will look like this: • This equation is calculated by using the standard deviations and means of the two variables. For brevity’s sake we will not go into this here. ŷ = -50.93 + 7.21(x)
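The deck skips that calculation, but for the curious, here is a sketch of the standard least-squares formulas (slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept = ȳ − slope·x̄), applied to the twelve months of data introduced a few slides below; it reproduces the deck’s coefficients:

```python
temps = [40, 50, 60, 70, 80, 90, 100, 90, 80, 60, 40, 20]   # (X) ave monthly temp
sales = [300, 320, 370, 480, 560, 640, 720, 600, 400, 300, 200, 122]  # (y) actual sales

mx = sum(temps) / len(temps)
my = sum(sales) / len(sales)

# Least-squares slope and intercept for the best fitting line
slope = sum((x - mx) * (y - my) for x, y in zip(temps, sales)) / \
        sum((x - mx) ** 2 for x in temps)
intercept = my - slope * mx

print(round(slope, 2), round(intercept, 2))  # 7.21 and -50.93
```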
  116. 116. • Given the infinite number of possible lines fitting through a scatterplot, the one that comes closest to representing the functional relationship between X and Y is the line that results in the least cumulative squared error between the predicted values of Y and the true observed values of Y for each given X.
  117. 117. • Given the infinite number of possible lines fitting through a scatterplot, the one that comes closest to representing the functional relationship between X and Y is the line that results in the least cumulative squared error between the predicted values of Y and the true observed values of Y for each given X.
  118. 118. • Given the infinite number of possible lines fitting through a scatterplot, the one that comes closest to representing the functional relationship between X and Y is the line that results in the least cumulative squared error between the predicted values of Y and the true observed values of Y for each given X. This line is the predicted values of Y calculated from the equation ŷ = b + mx
  119. 119. • Given the infinite number of possible lines fitting through a scatterplot, the one that comes closest to representing the functional relationship between X and Y is the line that results in the least cumulative squared error between the predicted values of Y and the true observed values of Y for each given X. These dots represent the actual data This line is the predicted values of Y calculated from the equation ŷ = b + mx
  120. 120. • We don’t have to actually plot the coordinates and lines. We can operate solely on the equations to generate predicted values and errors in prediction. In this way we can determine if temperature is a statistically significant predictor of ice cream sales.
• So here are the actual data we plotted:

  Month   (X) Ave Monthly Temp   (Y) Actual Ave Monthly Ice Cream Sales
  Jan     40                     300
  Feb     50                     320
  Mar     60                     370
  Apr     70                     480
  May     80                     560
  Jun     90                     640
  Jul     100                    720
  Aug     90                     600
  Sep     80                     400
  Oct     60                     300
  Nov     40                     200
  Dec     20                     122

• We can now plot the predicted Y using the equation for the best-fitting line between these two variables: ŷ = -50.93 + 7.21(x)
• Plugging each month's temperature (X) into the equation gives that month's predicted value (ŷ):

  (X) Temp   ŷ = -50.93 + 7.21(x)    (ŷ) Predicted Ave Monthly Ice Cream Sales
  40         -50.93 + 7.21(40)  =    237.47
  50         -50.93 + 7.21(50)  =    309.57
  60         -50.93 + 7.21(60)  =    381.67
  70         -50.93 + 7.21(70)  =    453.77
  80         -50.93 + 7.21(80)  =    525.87
  90         -50.93 + 7.21(90)  =    597.97
  100        -50.93 + 7.21(100) =    670.07
  90         -50.93 + 7.21(90)  =    597.97
  80         -50.93 + 7.21(80)  =    525.87
  60         -50.93 + 7.21(60)  =    381.67
  40         -50.93 + 7.21(40)  =    237.47
  20         -50.93 + 7.21(20)  =    93.27

• Note that the equation takes the X values (temperatures) as input, not the Y values. With this information we can now determine whether X (temperature) is a statistically significant predictor of Y (ice cream sales).
• To begin, we need to determine the total sum of squares, just as we would with analysis of variance.
• This is done by subtracting the mean ice cream sales for the whole year from each actual Y (ice cream sales) value.
• The mean is calculated by adding up the values and dividing by how many there are:
• (300+320+370+480+560+640+720+600+400+300+200+122) / 12 ≈ 417 average ice cream sales
• We then subtract the mean from each Y value:

  (Y) Actual Sales   −   Mean   =   Difference
  300                −   417    =   -117
  320                −   417    =   -97
  370                −   417    =   -47
  480                −   417    =   63
  560                −   417    =   143
  640                −   417    =   223
  720                −   417    =   303
  600                −   417    =   183
  400                −   417    =   -17
  300                −   417    =   -117
  200                −   417    =   -217
  122                −   417    =   -295

• Note: if we did not know the functional relationship between X and Y, our best prediction of any one month's Y value would simply be the mean of Y.
• Because we are calculating the total sum of squares, we square each difference (otherwise the positive and negative differences would cancel out) and then sum the results:

  Difference:  -117   -97    -47    63     143    223    303    183    -17   -117   -217   -295
  Squared:     13689  9409   2209   3969   20449  49729  91809  33489  289   13689  47089  87025

  SUM = 372,844

• (Dividing this sum of squares by the number of scores would give the variance of all of the scores.)
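To make the same computation concrete, here is a minimal sketch (Python with numpy assumed; the variable names are ours, not the deck's). It also shows why the squaring step is needed: the raw deviations from the mean always cancel out.

```python
import numpy as np

sales = np.array([300, 320, 370, 480, 560, 640, 720, 600, 400, 300, 200, 122])

deviations = sales - sales.mean()      # actual Y minus the mean (417.67)
print(round(deviations.sum(), 6))      # 0.0 -- deviations from a mean always cancel

ss_total = (deviations ** 2).sum()
print(round(ss_total))                 # 372839; the deck's 372,844 comes from
                                       # rounding the mean to 417 before subtracting
```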
• Now we partition that total into regression (good) and residual (bad). For better predictive power we want the regression sums of squares to be large and the residual, or error, sums of squares to be small.
• Let's see whether the residual or the regression is greater.
• We know that the total sums of squares is 372,844.
• Now we will calculate the residual (error) and the regression sums of squares, which will add up to 372,844:

  Source             Sum of Squares   df   Mean Square   F-ratio   Significance
  Regression         ?
  Residual (error)   ?
  Total              372,844
• Before we calculate residual and regression, let's see visually how we calculated the total sums of squares (372,844).
• Once again, we subtract the mean of the actual Y values (417) from each actual Y value. The mean would be our best prediction if we did not know the relationship between X (temperature) and Y (ice cream sales):

  (Y) Actual Sales:  300  320  370  480  560  640  720  600  400  300  200  122
  Mean:              417  417  417  417  417  417  417  417  417  417  417  417

[Scatterplot: the actual data points with a horizontal line drawn at the mean of Y (417)]
• Here is a graphic depiction of subtracting the mean (417) from each data point:
  122 − 417 = −295
  200 − 417 = −217
  720 − 417 = +303
• Now we have the difference between each actual value of Y (ice cream sales) and the mean of the Y values (417):

  Difference:  -117  -97  -47  63  143  223  303  183  -17  -117  -217  -295
• As we showed previously, we have to square these differences, because if we don't they will sum to zero:

  Difference:  -117   -97    -47    63     143    223    303    183    -17   -117   -217   -295    (SUM = 0)
  Squared:     13689  9409   2209   3969   20449  49729  91809  33489  289   13689  47089  87025   (SUM = 372,844)

• We are doing all this once again to give a visual depiction of what the total sums of squares are:

  Source   Sum of Squares   df   Mean Square   F-ratio   Significance
  Total    372,844
• Now that we've seen a visual depiction of how we calculated the total sums of squares, we can compare the sums of squares associated with error (residual) and those associated with regression.
• Let's calculate the error, or residual, sums of squares now.
• The error or residual sums of squares are computed by subtracting each predicted Y value from each actual Y value.
• Here are the actual Y values (average monthly ice cream sales):

  (Y) Actual Sales:  300  320  370  480  560  640  720  600  400  300  200  122

• And here are the predicted values, from plugging each temperature into the linear regression formula ŷ = -50.93 + 7.21(x):

  (ŷ) Predicted Sales:  237.47  309.57  381.67  453.77  525.87  597.97  670.07  597.97  525.87  381.67  237.47  93.27
• From these predicted points and the linear regression formula, a line can be drawn.

[Scatterplot: the actual data points with the regression line drawn through the predicted values]
• The difference between each actual value (the data points) and the predicted value (the regression line) is what is called error, or residual. The closer these two values are to each other, the smaller the error. The farther apart they are, the larger the error and the weaker the predictive power of the regression line.
• Let’s subtract each predicted value (the line) from each actual value (the points):

  (Y) Actual   −   (ŷ) Predicted   =   Difference
  300          −   237.47          =   62.53
  320          −   309.57          =   10.43
  370          −   381.67          =   -11.67
  480          −   453.77          =   26.23
  560          −   525.87          =   34.13
  640          −   597.97          =   42.03
  720          −   670.07          =   49.93
  600          −   597.97          =   2.03
  400          −   525.87          =   -125.87
  300          −   381.67          =   -81.67
  200          −   237.47          =   -37.47
  122          −   93.27           =   28.73
• We then square those differences (deviations) and sum them up:

  Difference:  62.53    10.43   -11.67  26.23   34.13    42.03    49.93    2.03  -125.87   -81.67   -37.47   28.73
  Squared:     3910.00  108.78  136.19  688.01  1164.86  1766.52  2493.00  4.12  15843.26  6669.99  1404.00  825.41

  SUM = 35,014
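The same residual bookkeeping in code form, a sketch under the same assumptions as the earlier snippet (numpy, names ours):

```python
import numpy as np

temps = np.array([40, 50, 60, 70, 80, 90, 100, 90, 80, 60, 40, 20])
sales = np.array([300, 320, 370, 480, 560, 640, 720, 600, 400, 300, 200, 122])

y_hat = -50.93 + 7.21 * temps        # predicted values from the best-fitting line
residuals = sales - y_hat            # actual minus predicted, per month
ss_residual = (residuals ** 2).sum()
print(round(ss_residual))            # 35014
```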
  Source       Sum of Squares   df   Mean Square   F-ratio   Significance
  Regression
  Residual     35,014
  Total        372,844
• We will now calculate the regression sums of squares.
• Our hope is that this value will be much bigger than the residual (35,014).
• The regression sums of squares is calculated by subtracting the mean from each predicted value.
• Let's see what this looks like visually. The regression line is the predicted values for Y; the horizontal line is the mean (417), which is the best predictor absent anything else.

[Scatterplot: the regression line (predicted values) and a horizontal line at the mean of Y (417)]

• You can probably already tell that the regression sums of squares will be bigger, because a simple way to calculate it is to subtract the residual (35,014) from the total (372,844).
• However, we will calculate it the long way so you can see what is happening.
• We subtract the mean of the actual Y values from each predicted value:

  (ŷ) Predicted   −   Mean    =   Difference
  237.47          −   417.7   =   -180.2
  309.57          −   417.7   =   -108.1
  381.67          −   417.7   =   -36.0
  453.77          −   417.7   =   36.1
  525.87          −   417.7   =   108.2
  597.97          −   417.7   =   180.3
  670.07          −   417.7   =   252.4
  597.97          −   417.7   =   180.3
  525.87          −   417.7   =   108.2
  381.67          −   417.7   =   -36.0
  237.47          −   417.7   =   -180.2
  93.27           −   417.7   =   -324.4

• For example: 93.27 − 417.7 = −324.4, and 670.07 − 417.7 = +252.4
• Then we square the differences (or deviations) and sum them up:

  Difference:  -180.2   -108.1   -36.0    36.1     108.2  180.3    252.4    180.3    108.2  -36.0    -180.2   -324.4
  Squared:     32470.8  11684.9  1295.76  1303.45  11708  32509.3  63707.4  32509.3  11708  1295.76  32470.8  105233

  SUM = 337,830

  Source       Sum of Squares   df   Mean Square   F-ratio   Significance
  Regression   337,830
  Residual     35,014
  Total        372,844
• Now we have all of the information we need to test for significance.
• The degrees of freedom (df) for the regression are the number of parameters being estimated (the Y intercept and the slope in the equation) minus 1:
• 2 parameters − 1 = 1
• The degrees of freedom for the residual are the number of cases (12) minus the number of parameters (2):
• 12 months − 2 parameters (slope / Y intercept) = 10
• We now have the information we need to calculate the Mean Square values. They are calculated by dividing the sums of squares by the degrees of freedom:

  Source       Sum of Squares   df   Mean Square
  Regression   337,830          1    337,830
  Residual     35,014           10   3,501
• The F-ratio is computed by dividing the Regression Mean Square by the Residual Mean Square:
• 337,830 / 3,501 = 96.5

  Source       Sum of Squares   df   Mean Square   F-ratio   Significance
  Regression   337,830          1    337,830       96.5
  Residual     35,014           10   3,501
  Total        372,844
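Putting the whole table together in code, as a sketch (scipy is assumed for the F distribution, and the sums of squares are taken from the slides rather than recomputed):

```python
from scipy import stats

ss_total, ss_residual = 372_844, 35_014
ss_regression = ss_total - ss_residual        # 337,830

df_regression = 2 - 1                         # parameters (intercept, slope) minus 1
df_residual = 12 - 2                          # cases minus parameters

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual
print(round(f_ratio, 1))                      # 96.5

f_critical = stats.f.ppf(0.95, df_regression, df_residual)
print(round(f_critical, 2))                   # 4.96 at the .05 alpha level
p_value = stats.f.sf(f_ratio, df_regression, df_residual)   # far below .05
```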
• With this information we can turn to the F-distribution table to determine the significance value.
• The regression degrees of freedom (1) is represented by the columns of the table; the residual degrees of freedom (10) is represented by the rows.
• Put them together and we find the critical F value at the .05 alpha level to be 4.96.
• Because the F-ratio (96.5) exceeds the F-critical value (4.96), we reject the null hypothesis and conclude that temperature is a statistically significant predictor of ice cream sales.
In Summary

• The whole point of this demonstration was to
(1) explain that linear regression is used to predict the value of one variable (ice cream sales) based on another variable (temperature),
(2) show that the total variance in Y can be partitioned into regression (prediction power) and residual (error), and
(3) show how this can be used to test whether the prediction is better than chance.

Multiple linear regression analysis is an extension of simple linear regression analysis, used to assess the association between two or more independent variables and a single continuous dependent variable. The multiple linear regression equation is as follows:

 Ŷ = b0 + b1X1 + b2X2 + ... + bpXp,

 where Ŷ is the predicted or expected value of the dependent variable, X1 through Xp are p distinct independent or predictor variables, b0 is the value of Y when all of the independent variables (X1 through Xp) are equal to zero, and b1 through bp are the estimated regression coefficients. Each regression coefficient represents the change in Y relative to a one unit change in the respective independent variable. In the multiple regression situation, b1, for example, is the change in Y relative to a one unit change in X1, holding all other independent variables constant (i.e., when the remaining independent variables are held at the same value or are fixed). Again, statistical tests can be performed to assess whether each regression coefficient is significantly different from zero.
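As an illustration of how such coefficients are estimated, here is a minimal ordinary-least-squares sketch in Python; the small arrays are made-up numbers purely for demonstration, not data from this module:

```python
import numpy as np

# Hypothetical outcome and two predictors (X1, X2), one row per subject
y = np.array([120.0, 131.0, 125.0, 140.0, 133.0, 129.0])
X = np.array([[24, 45], [28, 52], [26, 48], [31, 60], [27, 55], [25, 50]], dtype=float)

design = np.column_stack([np.ones(len(y)), X])     # leading column of 1s estimates b0
coefs, *_ = np.linalg.lstsq(design, y, rcond=None) # least-squares fit

b0, b1, b2 = coefs
y_hat = design @ coefs                             # predicted/expected values of Y
```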

Controlling for Confounding With Multiple Linear Regression

Multiple regression analysis is also used to assess whether confounding exists. Since multiple linear regression analysis allows us to estimate the association between a given independent variable and the outcome holding all other variables constant, it provides a way of adjusting for (or accounting for) potentially confounding variables that have been included in the model.

Suppose we have a risk factor or an exposure variable, which we denote X1 (e.g., X1=obesity or X1=treatment), and an outcome or dependent variable which we denote Y. We can estimate a simple linear regression equation relating the risk factor (the independent variable) to the dependent variable as follows:

 Ŷ = b0 + b1X1,

where b1 is the estimated regression coefficient that quantifies the association between the risk factor and the outcome.

Suppose we now want to assess whether a third variable (e.g., age) is a confounder. We denote the potential confounder X2, and then estimate a multiple linear regression equation as follows:

Ŷ = b0 + b1X1 + b2X2.

In the multiple linear regression equation, b1 is the estimated regression coefficient that quantifies the association between the risk factor X1 and the outcome, adjusted for X2 (b2 is the estimated regression coefficient that quantifies the association between the potential confounder and the outcome). As noted earlier, some investigators assess confounding by assessing how much the regression coefficient associated with the risk factor (i.e., the measure of association) changes after adjusting for the potential confounder. In this case, we compare b1 from the simple linear regression model to b1 from the multiple linear regression model. As a rule of thumb, if the regression coefficient from the simple linear regression model changes by more than 10%, then X2 is said to be a confounder.

Once a variable is identified as a confounder, we can then use multiple linear regression analysis to estimate the association between the risk factor and the outcome adjusting for that confounder. The test of significance of the regression coefficient associated with the risk factor can be used to assess whether the association between the risk factor and the outcome is statistically significant after accounting for one or more confounding variables. This is also illustrated below.

 

Example - The Association Between BMI and Systolic Blood Pressure 

Suppose we want to assess the association between BMI and systolic blood pressure using data collected in the seventh examination of the Framingham Offspring Study. A total of n=3,539 participants attended the exam, and their mean systolic blood pressure was 127.3 with a standard deviation of 19.0. The mean BMI in the sample was 28.2 with a standard deviation of 5.3.

A simple linear regression analysis reveals the following:

The simple linear regression model is:

Ŷ = b0 + 0.67 (BMI)

where

Ŷ is the predicted or expected systolic blood pressure, and b0 is the estimated intercept. The regression coefficient associated with BMI is 0.67, suggesting that each one unit increase in BMI is associated with a 0.67 unit increase in systolic blood pressure. The association between BMI and systolic blood pressure is also statistically significant (p=0.0001).

Suppose we now want to assess whether age (a continuous variable, measured in years), male gender (yes/no), and treatment for hypertension (yes/no) are potential confounders, and if so, appropriately account for these using multiple linear regression analysis. For analytic purposes, treatment for hypertension is coded as 1=yes and 0=no. Gender is coded as 1=male and 0=female. A multiple regression analysis reveals the following:

 The multiple regression model is:

Ŷ = 68.15 + 0.58 (BMI) + 0.65 (Age) + 0.94 (Male gender) + 6.44 (Treatment for hypertension).

Notice that the association between BMI and systolic blood pressure is smaller (0.58 versus 0.67) after adjustment for age, gender and treatment for hypertension. BMI remains statistically significantly associated with systolic blood pressure (p=0.0001), but the magnitude of the association is lower after adjustment. The regression coefficient decreases by 13%.


Using the informal rule (i.e., a change in the coefficient in either direction by 10% or more), we meet the criteria for confounding. Thus, part of the association between BMI and systolic blood pressure is explained by age, gender and treatment for hypertension.
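The 10% rule of thumb is easy to express in code; this sketch simply re-checks the comparison above (crude coefficient 0.67 versus adjusted coefficient 0.58):

```python
b1_crude = 0.67      # coefficient for BMI from the simple linear regression model
b1_adjusted = 0.58   # coefficient for BMI after adjusting for age, gender, treatment

pct_change = abs(b1_crude - b1_adjusted) / b1_crude * 100
print(f"{pct_change:.1f}% change")   # 13.4% -> exceeds 10%, so confounding is present
```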

This also suggests a useful way of identifying confounding. Typically, we try to establish the association between a primary risk factor and a given outcome after adjusting for one or more other risk factors. One useful strategy is to use multiple regression models to examine the association between the primary risk factor and the outcome before and after including possible confounding factors. If the inclusion of a possible confounding variable in the model causes the association between the primary risk factor and the outcome to change by 10% or more, then the additional variable is a confounder.

Relative Importance of the Independent Variables 

Assessing only the p-values suggests that several of these independent variables are equally statistically significant. The magnitude of the t statistics provides a means to judge the relative importance of the independent variables. In this example, age is the most significant independent variable, followed by BMI, treatment for hypertension and then male gender. In fact, male gender does not reach statistical significance (p=0.1133) in the multiple regression model.

Some investigators argue that regardless of whether an important variable such as gender reaches statistical significance it should be retained in the model. Other investigators only retain variables that are statistically significant.


This is yet another example of the complexity involved in multivariable modeling. The multiple regression model produces an estimate of the association between BMI and systolic blood pressure that accounts for differences in systolic blood pressure due to age, gender and treatment for hypertension.

A one unit increase in BMI is associated with a 0.58 unit increase in systolic blood pressure holding age, gender and treatment for hypertension constant. Each additional year of age is associated with a 0.65 unit increase in systolic blood pressure, holding BMI, gender and treatment for hypertension constant.

Men have higher systolic blood pressures, by approximately 0.94 units, holding BMI, age and treatment for hypertension constant and persons on treatment for hypertension have higher systolic blood pressures, by approximately 6.44 units, holding BMI, age and gender constant. The multiple regression equation can be used to estimate systolic blood pressures as a function of a participant's BMI, age, gender and treatment for hypertension status. For example, we can estimate the blood pressure of a 50 year old male, with a BMI of 25 who is not on treatment for hypertension as follows:

 Ŷ = 68.15 + 0.58 (25) + 0.65 (50) + 0.94 (1) + 6.44 (0) = 116.09

We can estimate the blood pressure of a 50 year old female, with a BMI of 25 who is on treatment for hypertension as follows:

Ŷ = 68.15 + 0.58 (25) + 0.65 (50) + 0.94 (0) + 6.44 (1) = 121.59
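Both predictions follow mechanically from the fitted equation; here is a small sketch (the function name is ours, not part of the source):

```python
def predicted_sbp(bmi, age, male, treated):
    """Expected systolic blood pressure from the multiple regression model above."""
    return 68.15 + 0.58 * bmi + 0.65 * age + 0.94 * male + 6.44 * treated

print(round(predicted_sbp(bmi=25, age=50, male=1, treated=0), 2))   # 116.09
print(round(predicted_sbp(bmi=25, age=50, male=0, treated=1), 2))   # 121.59
```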

Evaluating Effect Modification With Multiple Linear Regression 

On page 4 of this module we considered data from a clinical trial designed to evaluate the efficacy of a new drug to increase HDL cholesterol. One hundred patients enrolled in the study and were randomized to receive either the new drug or a placebo. The investigators were at first disappointed to find very little difference in the mean HDL cholesterol levels of treated and untreated subjects.

 

            Sample Size   Mean HDL   Standard Deviation of HDL
New Drug    50            40.16      4.46
Placebo     50            39.21      3.91

However, when they analyzed the data separately in men and women, they found evidence of an effect in men, but not in women. We noted that when the magnitude of association differs at different levels of another variable (in this case gender), it suggests that effect modification is present.

WOMEN       Sample Size   Mean HDL   Standard Deviation of HDL
New Drug    40            38.88      3.97
Placebo     41            39.24      4.21

MEN         Sample Size   Mean HDL   Standard Deviation of HDL
New Drug    10            45.25      1.89
Placebo     9             39.06      2.22

Multiple regression analysis can be used to assess effect modification. This is done by estimating a multiple regression equation relating the outcome of interest (Y) to independent variables representing the treatment assignment, sex and the product of the two (called the treatment by sex interaction variable). For the analysis, we let T = the treatment assignment (1=new drug and 0=placebo), M = male gender (1=yes, 0=no) and TM, i.e., T * M or T x M, the product of treatment and male gender. In this case, the multiple regression analysis revealed the following: 

 

The multiple regression model is:

Ŷ = 39.24 − 0.36 T − 0.18 M + 6.55 (T × M)

The details of the test are not shown here, but note that in this model the regression coefficient associated with the interaction term, b3, is statistically significant (i.e., H0: b3 = 0 versus H1: b3 ≠ 0). The fact that this is statistically significant indicates that the association between treatment and outcome differs by sex.

The model shown above can be used to estimate the mean HDL levels for men and women who are assigned to the new medication and to the placebo. In order to use the model to generate these estimates, we must recall the coding scheme (i.e., T = 1 indicates new drug, T=0 indicates placebo, M=1 indicates male sex and M=0 indicates female sex).

The expected or predicted HDL for men (M=1) assigned to the new drug (T=1) can be estimated as follows:

Ŷ = 39.24 − 0.36 (1) − 0.18 (1) + 6.55 (1) = 45.25

The expected HDL for men (M=1) assigned to the placebo (T=0) is:

 Ŷ = 39.24 − 0.18 (1) = 39.06

Similarly, the expected HDL for women (M=0) assigned to the new drug (T=1) is:

Ŷ = 39.24 − 0.36 (1) = 38.88

The expected HDL for women (M=0) assigned to the placebo (T=0) is:

Ŷ = 39.24

Notice that the expected HDL levels for men and women on the new drug and on placebo are identical to the means shown in the table summarizing the stratified analysis. Because there is effect modification, separate simple linear regression models are estimated to assess the treatment effect in men and women:
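A quick sketch confirming that the interaction model reproduces the four cell means (coefficients as reconstructed above; the function name is ours):

```python
def predicted_hdl(t, m):
    """Mean HDL from the treatment-by-sex interaction model."""
    return 39.24 - 0.36 * t - 0.18 * m + 6.55 * t * m

print(round(predicted_hdl(t=1, m=1), 2))   # 45.25 -- men on the new drug
print(round(predicted_hdl(t=0, m=1), 2))   # 39.06 -- men on placebo
print(round(predicted_hdl(t=1, m=0), 2))   # 38.88 -- women on the new drug
print(round(predicted_hdl(t=0, m=0), 2))   # 39.24 -- women on placebo
```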

MEN
               Regression Coefficient   t       P-value
Intercept      39.08                    57.09   0.0001
T (Treatment)  6.19                     6.56    0.0001

WOMEN
               Regression Coefficient   t       P-value
Intercept      39.24                    61.36   0.0001
T (Treatment)  -0.36                    -0.40   0.6927

 The regression models are:

In Men:

Ŷ = 39.08 + 6.19 T

In Women:

Ŷ = 39.24 − 0.36 T

In men, the regression coefficient associated with treatment (b1=6.19) is statistically significant (details not shown), but in women, the regression coefficient associated with treatment (b1= -0.36) is not statistically significant (details not shown).

Multiple linear regression analysis is a widely applied technique. In this section we showed how it can be used to assess and account for confounding and to assess effect modification. The techniques we described can be extended to adjust for several confounders simultaneously and to investigate more complex effect modification (e.g., three-way statistical interactions).

There is an important distinction between confounding and effect modification. Confounding is a distortion of an estimated association caused by an unequal distribution of another risk factor. When there is confounding, we would like to account for it (or adjust for it) in order to estimate the association without distortion. In contrast, effect modification is a biological phenomenon in which the magnitude of association differs at different levels of another factor, e.g., a drug that has an effect in men, but not in women. In the example presented above, it would be inappropriate to pool the results in men and women. Instead, the goal should be to describe the effect modification and report the different effects separately.

 

There are many other applications of multiple regression analysis. A popular application is to assess the relationships between several predictor variables simultaneously, and a single, continuous outcome. For example, it may be of interest to determine which predictors, in a relatively large set of candidate predictors, are most important or most strongly associated with an outcome. It is always important in statistical analysis, particularly in the multivariable arena, that statistical modeling is guided by biologically plausible associations.

"Dummy" Variables in Regression Models 

Independent variables in regression models can be continuous or dichotomous, and regression models can also accommodate categorical independent variables. For example, it might be of interest to assess whether there is a difference in total cholesterol by race/ethnicity. The module on Hypothesis Testing presented analysis of variance as one way of testing for differences in means of a continuous outcome among several comparison groups; regression analysis can also be used. However, the investigator must create indicator variables to represent the different comparison groups (e.g., different racial/ethnic groups). The set of indicator variables (also called dummy variables) is considered in the multiple regression model simultaneously as a set of independent variables.

For example, suppose that participants indicate which of the following best represents their race/ethnicity: White, Black or African American, American Indian or Alaskan Native, Asian, Native Hawaiian or Pacific Islander, or Other Race. This categorical variable has six response options. To consider race/ethnicity as a predictor in a regression model, we create five indicator variables (one less than the total number of response options) to represent the six different groups. To create the set of indicators, or set of dummy variables, we first decide on a reference group or category; in this example, the reference group is the racial group that we will compare the other groups against. Indicator variables are created for the remaining groups, coded 1 for participants who are in that group (e.g., are of the specific race/ethnicity of interest), with all others coded 0.

In the multiple regression model, the regression coefficient associated with each dummy variable (representing, in this example, each race/ethnicity group) is interpreted as the expected difference in the mean of the outcome variable for that race/ethnicity as compared to the reference group, holding all other predictors constant. The example below uses an investigation of risk factors for low birth weight to illustrate this technique as well as the interpretation of the regression coefficients in the model.
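Here is a minimal sketch of building such a set of indicator variables with pandas; the column names and tiny data frame are invented purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "race_ethnicity": ["White", "Black", "Asian", "White", "Other"],
    "total_cholesterol": [182, 201, 176, 190, 188],
})

# Fix the category order so the chosen reference group ("White") comes first
df["race_ethnicity"] = pd.Categorical(
    df["race_ethnicity"], categories=["White", "Black", "Asian", "Other"])

# drop_first=True omits the reference category, leaving one 0/1 indicator per
# remaining group -- one fewer dummy variable than there are response options
dummies = pd.get_dummies(df["race_ethnicity"], prefix="race", drop_first=True)

y = df["total_cholesterol"]   # continuous outcome
X = dummies                   # other predictors would be joined alongside these
```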

Example of the Use of Dummy Variables

An observational study is conducted to investigate risk factors associated with infant birth weight. The study involves 832 pregnant women. Each woman provides demographic and clinical data and is followed through the outcome of pregnancy. At the time of delivery, the infant's birth weight is measured in grams, as is the gestational age in weeks. Birth weights vary widely, ranging from 404 to 5400 grams. The mean birth weight is 3367.83 grams with a standard deviation of 537.21 grams. Investigators wish to determine whether there are differences in birth weight by infant gender, gestational age, mother's age and mother's race. In the study sample, 421/832 (50.6%) of the infants are male and the mean gestational age at birth is 39.49 weeks with a standard deviation of 1.81 weeks (range 22-43 weeks). The mean mother's age is 30.83 years with a standard deviation of 5.76 years (range 17-45 years). Approximately 49% of the mothers are white; 41% are Hispanic; 5% are black; and 5% identify themselves as other race. A multiple regression analysis is performed relating infant gender (coded 1=male, 0=female), gestational age in weeks, mother's age in years and three dummy or indicator variables reflecting mother's race. The results are summarized in the table below.

 

Many of the predictor variables are statistically significantly associated with birth weight. Male infants are approximately 175 grams heavier than female infants, adjusting for gestational age, mother's age and mother's race/ethnicity. Gestational age is highly significant (p=0.0001), with each additional gestational week associated with an increase of 179.89 grams in birth weight, holding infant gender, mother's age and mother's race/ethnicity constant. Mother's age does not reach statistical significance (p=0.6361). Mother's race is modeled as a set of three dummy or indicator variables. In this analysis, white race is the reference group. Infants born to black mothers have lower birth weight by approximately 140 grams (as compared to infants born to white mothers), adjusting for gestational age, infant gender and mother's age. This difference is marginally significant (p=0.0535). There are no statistically significant differences in birth weight in infants born to Hispanic versus white mothers or to women who identify themselves as other race as compared to white.


Single Regression

Advanced techniques can be used when there is trend or seasonality, or when other factors (such as price discounts) must be considered.





  • Develops a line equation y = a + b(x) that best fits a set of historical data points (x,y)
  • Ideal for picking up trends in time series data
  • Once the line is developed, x values can be plugged in to predict y (usually demand)

  • For time series models, x is the time period for which we are forecasting
  • For causal models (described later), x is some other variable that can be used to predict demand:
    • Promotions
    • Price changes
    • Economic conditions
    • Etc.
  • Software packages like Excel can quickly and easily estimate the a and b values required for the single regression model (a sketch of the same calculation follows below)
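As an illustration of what such a package does under the hood, numpy's polyfit estimates the same a and b by least squares. The demand series here is simulated around the document's trend line rather than taken from real data:

```python
import numpy as np

periods = np.arange(1, 17)                 # x: time periods 1..16
rng = np.random.default_rng(0)
demand = 188.55 + 69.43 * periods + rng.normal(0, 40, size=16)   # trend plus noise

b, a = np.polyfit(periods, demand, deg=1)  # polyfit returns slope first, then intercept
print(f"y = {a:.2f} + {b:.2f}(x)")

forecast_next = a + b * 17                 # plug in the next period to forecast demand
```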


There is a clear upward trend, but also some randomness.




Forecasted demand = 188.55 + 69.43*(Time Period)


Notice how well the regression line fits the historical data,
BUT we aren’t interested in forecasting the past…

Forecasts for May ’05 and June ’05:

May: 188.55 + 69.43*(17) = 1368.86
June: 188.55 + 69.43*(18) = 1438.29

  • The regression forecasts suggest an upward trend of about 69 units a month.
  • These forecasts can be used as-is, or as a starting point for more qualitative analysis.


Quarter Period Demand
Winter 04 1 80
Spring 2 240
Summer 3 300
Fall 4 440
Winter 05 5 400
Spring 6 720
Summer 7 700
Fall 8 880

Regression picks up the trend, but not seasonality effects

Calculating the seasonal index: Winter quarter

  • (Actual / Forecast) for Winter quarters:
    • Winter ‘04: (80 / 90) = 0.89
    • Winter ‘05: (400 / 524.3) = 0.76
  • Average of these two = 0.83
  • Interpretation: for Winter quarters, actual demand has been, on average, 83% of the unadjusted forecast

Seasonally adjusted forecast model

For the Winter quarter:

[ -18.57 + 108.57*Period ] * 0.83

Or more generally:

[ -18.57 + 108.57*Period ] * Seasonal Index
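The full seasonal-index calculation for every quarter can be sketched as follows (numpy assumed; the data and trend equation are the ones given above):

```python
import numpy as np

periods = np.arange(1, 9)
demand = np.array([80, 240, 300, 440, 400, 720, 700, 880])
quarters = ["Winter", "Spring", "Summer", "Fall"] * 2

trend = -18.57 + 108.57 * periods          # unadjusted regression forecast
ratios = demand / trend                    # actual / forecast, per period

# Average the ratios across the years for each quarter to get its seasonal index
season_index = {q: np.mean([r for qq, r in zip(quarters, ratios) if qq == q])
                for q in ("Winter", "Spring", "Summer", "Fall")}
print(round(season_index["Winter"], 2))    # 0.83

# Seasonally adjusted forecast = trend forecast * that quarter's seasonal index
adjusted = trend * np.array([season_index[q] for q in quarters])
```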

[Charts: seasonally adjusted forecasts; comparison of the adjusted regression model to historical demand]

Single regression and causal forecast models

  • Time series assume that demand is a function of time. This is not always true.
  • Examples:
    • Demand as a function of advertising dollars spent
    • Demand as a function of population
    • Demand as a function of other factors (e.g., a flu outbreak)
  • Regression analysis can be used in these situations as well; we simply need to identify the x and y values
Month Price per unit Demand
1 $1.50 7,135
2 $1.50 6,945
3 $1.25 7,535
4 $1.40 7,260
5 $1.65 6,895
6 $1.65 7,105
7 $1.75 6,730
8 $1.80 6,650
9 $1.60 6,975
10 $1.60 6,800

Two possible x variables: Month or Price

Which would be a better predictor of demand?

Demand seems to be trending down over time, but the relationship is weak. There may be a better model...

… Demand shows a strong negative relationship to price. Using Excel to develop a regression model results in the following:

  • Demand = 9328 – 1481 * (Price)
  • Interpretation: For every dollar the price increases, we would expect demand to fall 1481 units.
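Fitting that causal model is the same mechanical step as before, just with price as x instead of the time period. This sketch reproduces the document's coefficients from the table above:

```python
import numpy as np

price = np.array([1.50, 1.50, 1.25, 1.40, 1.65, 1.65, 1.75, 1.80, 1.60, 1.60])
demand = np.array([7135, 6945, 7535, 7260, 6895, 7105, 6730, 6650, 6975, 6800])

slope, intercept = np.polyfit(price, demand, deg=1)
print(f"Demand = {intercept:.0f} {slope:.0f} * (Price)")   # Demand = 9328 - 1481 * (Price)
```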