Variable Transformations
Linear regression models make very strong assumptions about the nature of patterns in the data: (i) the predicted value of the dependent variable is a straight-line function of each of the independent variables, holding the others fixed; (ii) the slope of this line does not depend on the values at which the other variables are held fixed; (iii) the effects of the different independent variables on the predictions are additive; (iv) the errors (unexplained variations) are statistically independent; and (v) they are often assumed to be identically normally distributed as well. (Click here for a discussion of these assumptions.) If a model's output indicates that some of these assumptions are badly violated, its predictions and inferences may be under-informative, biased, or even illogical, as in the linear regression model for weekly beer sales that yields negative sales predictions for price levels only slightly higher than those in the sample. In such cases, better results are often obtained by applying nonlinear transformations (log, power, etc.) or time transformations (period-to-period change, percent change, etc.) to some of the variables before fitting a linear model to them. RegressIt includes a versatile and easy-to-use variable transformation procedure that can be launched by clicking its button in the lower right of the data analysis or regression dialog boxes. The list of available transformations includes time transformations if the "time series data" box has been checked.
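The negative-prediction problem is easy to reproduce. The sketch below (Python, with made-up numbers rather than the actual beer-sales data) fits a straight line to sales that fall off in a curved fashion as price rises; extrapolating that line only slightly beyond the sample prices yields an impossible negative sales forecast.

```python
import numpy as np

# Made-up weekly sales at five observed price points (illustrative only --
# not the actual beer-sales data, but the same qualitative shape: sales
# drop off steeply, then level out, as price rises).
price = np.array([16.99, 17.49, 17.99, 18.49, 18.99])
sales = np.array([520.0, 310.0, 180.0, 110.0, 65.0])

# Fit sales = intercept + slope * price by ordinary least squares.
slope, intercept = np.polyfit(price, sales, 1)

# Extrapolate to a price only slightly above the sample range.
prediction = intercept + slope * 20.49
print(f"predicted sales at $20.49: {prediction:.0f} cases")  # negative!
```

Because the straight line must keep descending at the same rate, it inevitably crosses zero at some price not far above the sample, which is the illogical behavior described above.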
The natural logarithm function (the LN function in Excel) is often used to transform prices and quantities in the modeling of consumer behavior. A linear relation between the natural log of price and the natural log of quantity-sold means that there is a constant "price elasticity of demand", i.e., the marginal percent change in quantity-sold per percent change in price is the same at all starting price levels. Here the variable transformation procedure has been used to apply the natural log to all of the price and sales variables in the beer sales example:
The transformed variables are automatically assigned descriptive names and are stored in new columns next to the original variables:
The results of fitting a simple regression model to the logged variables are shown below. The model has been given the name "Log-log model" rather than the default "Model 2". The slope coefficient of -6.705 is the estimated price elasticity of demand: on the margin, a 1% change in the price of 18-packs is predicted to yield a 6.7% change in the number of cases of 18-packs sold, in the opposite direction.
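To make the constant-elasticity interpretation concrete, here is a small illustrative sketch in Python (synthetic data, not the actual workbook figures): a regression of log quantity on log price recovers the elasticity built into the data, and the fitted coefficient translates a finite percent price change into a percent quantity change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demand data with a true constant price elasticity of -5
# (illustrative numbers only, not the beer-sales workbook data).
price = rng.uniform(16.0, 20.0, size=200)
quantity = 1e8 * price ** -5.0 * np.exp(rng.normal(0.0, 0.1, size=200))

# Regressing log quantity on log price recovers the elasticity as the slope.
slope, intercept = np.polyfit(np.log(price), np.log(quantity), 1)
print(f"estimated elasticity: {slope:.2f}")  # close to the true value of -5

# The "1% -> 6.7%" reading of a fitted coefficient such as -6.705 is a
# marginal (derivative) statement.  The exact effect of a finite price
# change is (1 + pct)**elasticity - 1, which drifts away from the marginal
# value as the change gets larger:
elasticity = -6.705
print(f"+1% price:  {1.01 ** elasticity - 1:+.2%} quantity")  # about -6.45%
print(f"+10% price: {1.10 ** elasticity - 1:+.2%} quantity")  # about -47%
```

The near-agreement at 1% and the gap at 10% show why the elasticity is best described as a marginal rate: for small price changes the percent-for-percent reading is accurate, while large changes require the exact power-law calculation.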
The rest of the output appears below. In addition to its theoretical support and the simple and unit-free interpretation of its slope coefficient, this model fits the data much better than the original one in a number of ways. First, this model cannot make illogical negative predictions for sales: when the forecasts for log sales are transformed back into real units of cases by applying the exponential (EXP) function to them, they are necessarily positive numbers at all price levels. Second, the forecast errors of this model in real units are smaller on average than those of the original model, in root-mean-square terms, as shown by some additional calculations in the accompanying Excel file. (See columns M and O there.) Third, there is less of a time pattern in the errors of this model: the lag-1 autocorrelation of the errors is only 0.092, compared to -0.222 for Model 1. (If there is a strong time pattern in the errors, it means the model has overlooked or misidentified some property of the time pattern in the data, leaving room for improvement.) Fourth, as seen in the line fit plot and residual-vs-predicted plot, the variance of the errors (in log units) is approximately the same for large and small predictions. Finally, as seen in the value of the Anderson-Darling statistic and the appearance of the residual histogram and normal quantile plots, the error distribution (in log units) is satisfactorily close to a normal distribution. The latter two facts indicate that confidence limits for predictions are not systematically biased in this model. Roughly speaking, the model assumes that unexplained variations in demand are identically normally distributed in percentage rather than absolute terms at all price levels, which is more realistic.
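The second point, comparing forecast errors in real units, can be sketched as follows, with synthetic data standing in for the workbook calculation: when the unexplained variation really is multiplicative (percentage-like), the unlogged forecasts of a log-log model beat a straight-line model in root-mean-square terms.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the workbook comparison: power-law demand with
# multiplicative (percentage) errors, as the log-log model assumes.
price = rng.uniform(16.0, 20.0, size=300)
sales = 1e8 * price ** -5.0 * np.exp(rng.normal(0.0, 0.05, size=300))

# Model 1: straight line fitted in the original units.
b1, a1 = np.polyfit(price, sales, 1)
pred_linear = a1 + b1 * price

# Log-log model: fit in log units, then unlog the forecasts with exp().
b2, a2 = np.polyfit(np.log(price), np.log(sales), 1)
pred_loglog = np.exp(a2 + b2 * np.log(price))

def rmse(errors):
    """Root-mean-square error in the original (real) units."""
    return float(np.sqrt(np.mean(errors ** 2)))

print(f"linear RMSE (cases):  {rmse(sales - pred_linear):.1f}")
print(f"log-log RMSE (cases): {rmse(sales - pred_loglog):.1f}")
```

The straight-line model pays a systematic penalty for forcing a line through a curved relationship, on top of the random noise, while the log-log model's errors in real units are essentially just the noise.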
It may also be noted that there is a stronger linear pattern in the logged data than in the original data: the correlation between log sales and log price is greater than the correlation between sales and price. Hence R-squared (the fraction of variance explained) is larger for this model, because in a simple regression model it is just the square of the correlation. However, in general a model should not be judged merely on the basis of R-squared, especially when comparing models fitted to different transformations of the dependent variable. Sometimes a transformation of the dependent variable "explains" much of the variance all by itself, in which case the best model might be one with a relatively low value of R-squared. (That often happens when time transformations such as differences are involved.) In any case it is important to examine the rest of the output to determine whether the model's assumptions are good ones for the problem at hand. A good model should explain the variance in a way that is intuitive, no more complicated than needed, useful for inferences, and accurate and unbiased in its predictions out-of-sample as well as in-sample.
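The claim that R-squared equals the squared correlation in a simple regression is easy to verify numerically; here is a quick check with arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(size=100)

# Correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

# R-squared from a simple regression fit: 1 - (error variance / total variance).
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
r_squared = 1 - resid.var() / y.var()

print(f"r^2 = {r**2:.4f},  R-squared = {r_squared:.4f}")  # identical
```

The two numbers agree to machine precision, which is exactly why a log transformation that strengthens the linear correlation also raises R-squared in a simple regression.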
At the end of the day, it is necessary to "unlog" the forecasts for logged beer sales in order to convert them back into units of real beer. This can be done by applying the exponential (EXP) function to the forecasts and confidence limits that have been calculated in log units by this model, as shown in the Excel file.
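A minimal sketch of the unlogging step (with illustrative log-unit numbers, not the ones in the workbook):

```python
import numpy as np

# Hypothetical model output in log units (illustrative numbers only):
# a point forecast for log sales and its 95% confidence limits.
log_forecast, log_lower, log_upper = 5.20, 4.90, 5.50

# Unlog with exp() to convert back into cases of real beer.
forecast = np.exp(log_forecast)
lower, upper = np.exp(log_lower), np.exp(log_upper)

# Note: limits that are symmetric in log units become asymmetric in real
# units, and when the errors are normal in log units, exp() of the point
# forecast estimates the median of sales rather than the mean.
print(f"forecast: {forecast:.0f} cases, 95% limits: ({lower:.0f}, {upper:.0f})")
```

The asymmetry of the unlogged limits (the upper limit is farther from the forecast than the lower one) is the real-unit expression of the model's assumption that variations in demand are percentage-like.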
Click here to proceed to the next page: the Model Summary worksheet.