Variable Transformations

Linear regression models make very strong assumptions about the nature of patterns in the data: the predicted value of the dependent variable is a straight-line function of each of the independent variables, holding the others fixed; the slope of this line does not depend on the fixed values of the other variables; the effects of different independent variables on the predictions are additive; and the unexplained variations are independently and identically normally distributed. (Click

**here** for a discussion of these assumptions.) If a model's output indicates that some of them are badly violated, its predictions and inferences may be under-informative or biased or even illogical, as in the linear model's negative sales predictions for price levels only slightly higher than those in the sample. In such cases, better results are often obtained by applying **nonlinear transformations** (log, power, etc.) or **time transformations** (period-to-period change, percent change, etc.) to some of the variables prior to fitting a linear model to them. RegressIt includes a versatile and easy-to-use variable transformation procedure that can be launched by hitting its button in the lower right of the data analysis or regression dialog boxes. The list of available transformations includes time transformations if the "time series data" box has been checked.

The natural logarithm function (the LN function in Excel) is often used to transform prices and quantities in the modeling of consumer behavior. A linear relation between the natural log of price and the natural log of quantity-sold means that there is a constant "price elasticity of demand", i.e., the marginal percent change in quantity-sold per percent change in price is the same at all starting price levels. Here the variable transformation procedure has been used to apply it to the two 18-pack variables that were used in the regression model shown on the

**previous page**: The transformed variables are automatically assigned descriptive names and are stored in new columns next to the original variables:
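In spreadsheet terms the transformation is just `=LN(...)` applied to each value. As a minimal Python sketch of the same bookkeeping (the column names and sample values here are hypothetical, and RegressIt's actual naming convention may differ):

```python
import math

# Hypothetical sample of the two 18-pack variables (price and cases sold);
# these are illustrative numbers, not the tutorial's actual beer-sales data.
data = {
    "PRICE_18PK": [18.99, 16.49, 14.99],
    "CASES_18PK": [290.0, 540.0, 1000.0],
}

# Store each transformed variable in a new, descriptively named column
# alongside the original, mimicking the variable transformation procedure.
for name in list(data):
    data["LN." + name] = [math.log(v) for v in data[name]]

print(sorted(data))  # the two originals plus an LN. version of each
```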

The results of fitting a simple regression model to the logged variables are shown below. The model has been given the name "Log-log price-demand model" rather than the default "Model 2". The slope coefficient of -6.705 is the estimated price elasticity of demand: on the margin a 1% change in the price of 18-packs is predicted to yield a 6.7% change in the number of cases of 18-packs sold, in the opposite direction.

In addition to its theoretical support and the simple and unit-free interpretation of its slope coefficient, this model fits the data much better than the original one in a number of ways:

1. As seen in the line fit plot and residual-vs-predicted plot, the variance of the errors (in log units) is approximately the same for large and small predictions.
2. As seen in the value of the Anderson-Darling statistic and the appearance of the residual histogram and normal quantile plots, the error distribution is satisfactorily close to a normal distribution. Together with the first point, this indicates that confidence limits for predictions are not systematically biased in this model. (Roughly speaking, it assumes that unexplained variations in demand are identically normally distributed in percentage rather than absolute terms at all price levels, which is more realistic.)
3. There is less of a time pattern in the errors of this model: the lag-1 autocorrelation of the errors is only 0.092, compared to -0.222 for Model 1. (If there is a strong time pattern in the errors, it means the model has overlooked or misidentified some property of the time pattern in the data, leaving room for improvement.)
4. This model cannot make negative predictions for sales: when the forecasts for log sales are transformed back into real units of cases by applying the exponential (EXP) function to them, they are necessarily positive numbers at all price levels.
5. Last but not least, the forecast errors of this model in real units are smaller on average than those of the original model in both mean-square and mean-absolute-percentage terms, as shown by some additional calculations in the accompanying spreadsheet file.
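The lag-1 autocorrelation cited above is just the correlation of the residual series with itself shifted by one period. A small sketch of the computation, using made-up residuals rather than the tutorial's actual model errors:

```python
from statistics import mean

def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation of a series: a quick check for leftover
    time patterns that the model has not captured."""
    m = mean(residuals)
    dev = [r - m for r in residuals]
    num = sum(a * b for a, b in zip(dev[:-1], dev[1:]))
    den = sum(d * d for d in dev)
    return num / den

# Illustrative residual series (alternating signs, so the lag-1
# autocorrelation should come out negative).
resid = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.1]
ac = lag1_autocorrelation(resid)
print(round(ac, 3))
```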

It may also be noted that there is a stronger linear pattern in the logged data than in the original data: the correlation between log sales and log price is greater than the correlation between sales and price. Hence R-squared (the fraction of variance explained) is larger for this model, because in a simple regression model it is just the square of the correlation. However, in general a model should not be judged merely on the basis of R-squared, especially when comparing models fitted to different transformations of the dependent variable. Sometimes a transformation of the dependent variable "explains" much of the variance all by itself, in which case the best model might be one with a relatively low value of R-squared. (That often happens when time transformations such as differences are involved.) In any case it is important to examine the rest of the output to determine whether the model's assumptions are good ones for the problem at hand. A good model should explain the variance in a way that is intuitive, no more complicated than needed, useful for inferences, and accurate and unbiased in its predictions out-of-sample as well as in-sample.

At the end of the day, it is necessary to "unlog" the forecasts for logged beer sales in order to convert them back into units of real beer. This can be done by applying the exponential function (EXP in Excel) to the forecasts and confidence limits that have been calculated in log units by this model. The details of how to do this are shown **in this Excel file**.

Click **here** to proceed to the next page: the Model Summary worksheet.
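The unlogging step can be sketched as follows, with hypothetical forecast numbers: a regression forecast interval is symmetric in log units, so the interval in real units comes out asymmetric around the point forecast.

```python
import math

# Hypothetical values, not taken from the tutorial's model output:
log_forecast = 6.90     # predicted LN(cases sold)
log_half_width = 0.25   # half-width of the confidence interval in log units

# Apply EXP (math.exp here) to the point forecast and both limits.
lower = math.exp(log_forecast - log_half_width)
point = math.exp(log_forecast)
upper = math.exp(log_forecast + log_half_width)

# exp() stretches the upper side more than the lower side, so the
# unlogged interval is asymmetric and always strictly positive.
print(round(lower), round(point), round(upper))
```

Note that because the errors are assumed normal in log units, exponentiating the log point forecast gives the median, rather than the mean, of the implied distribution of sales in real units.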