Linear regression in Excel

Regression Analysis

In a linear regression model, a "dependent" variable is predicted by an additive straight-line function of one or more "independent" ones. In the regression procedure in RegressIt, the dependent variable is chosen from a drop-down list and the independent variables are chosen by checking their boxes on the input panel. All of the output for a given model is organized on a single worksheet, which includes a bitmapped date/time/username stamp at the top. Furthermore, every table and chart is labeled with the model name, the dependent variable name, the number of independent variables, and the sample size. This may appear very redundant at first glance, but it provides an audit trail for any table or chart that is copied to other documents or used for presentations, and it also keeps you oriented when flipping back and forth among model worksheets, especially when focusing on what is happening down near the bottom. Default model names are of the form "Model n" but custom names can be entered at run time. In the example below, the name "Linear price-demand model" was used. If the regression procedure is re-run while positioned on a previous model's output worksheet, the specifications of that model are the starting point for the next one, allowing models to be elaborated and refined in a systematic fashion.

A variety of output options and diagnostic statistics and plots are available, and the confidence level to be used for confidence limits in tables and charts (whose default value is 0.95) can be interactively changed on the output worksheet after the model is fitted. (All confidence-limit cells on the worksheet contain formulas, not values, and many other displayed statistics are also computed from live formulas for educational purposes.) Forecasts can be automatically generated for missing values of the dependent variable or they can be generated manually after fitting the model by entering additional values for the independent variables in the forecast table and then using the forecasting button on the toolbar. Here are the results of fitting a simple regression model (not a very good one!) to predict sales of 18-packs from price of 18-packs using the beer sales data set whose descriptive analysis was shown on the previous page.

The usual regression model summary statistics appear at the top of the model worksheet. In all text output, numbers are formatted by default to fit the scale of the values in order to avoid showing too many or too few decimal places and to keep decimal points lined up as well as possible. They are stored (or computed) in the cells with full precision, of course, so they can be reformatted as desired.

The value of -93 for the slope coefficient in this model means that a $1 change in the price of 18-packs should be predicted to lead to a 93-case change in sales of 18-packs in the opposite direction.

A number of presentation-quality charts are produced for each regression model by default. Markers are scaled to fit the size of the data set and they are semi-transparent on all charts where points may overlap, in order to highlight local densities. Excel's default axis scaling is NOT used. Axis scales are intelligently chosen to fit the data. There is always a grid line at zero if zero is in the X or Y data range, and no more decimal places are shown than necessary. Grid lines are light gray to let the data stand out.

For a simple (1-variable) regression model, a line fit plot appears just below the regression statistics and summary tables, and its title includes the model equation. If a value other than 95% is subsequently typed in the confidence level cell in the regression statistics table, the width and labeling of the confidence bands on this chart will be updated instantly.

The remainder of the table and chart output appears farther down on the model sheet, as pictured below. Why is this not a good model, despite the impressive value of R-squared (75%)? The errors do not have the same variance for large and small predictions, as is apparent on the line fit plot and the residual-vs-predicted plot. They are not normally distributed either, as indicated by the Anderson-Darling statistic and the residual histogram and normal quantile plots. The line fit plot also shows that the model predicts negative values of sales for prices greater than $19.50 per case, and the lower 95% confidence limits for predictions are negative for prices greater than $16.50 per case!

Click here to proceed to the next page: Variable Transformations.

The optional residual table appears at the bottom of model output sheet. In addition to actual and predicted values and residuals, it shows standardized and absolute standardized residuals and two measures of influence: leverage and Cook's D.

The Filter tool on the RegressIt ribbon can be used to filter and interactively sort the table on any of these statistics. For example, sorting the table on the value of absolute standardized residual brings the following observations to the top: