Example 2: GLOW study

The variables are FRACTURE (yes or no), PRIORFRACTURE (prior fracture, yes or no), AGE (years), WEIGHT (kilograms), HEIGHT (cm), BMI (body mass index in kg/m2), PREMENO (premenopausal, yes or no), MOMFRAC (mother had hip fracture, yes or no) ARMASSIST (arms are needed to stand from chair, yes or no) SMOKE (former or current smoker, yes or no), and RATERISK (self-reported risk of fracture relative to others of same age, on a scale from 1 to 3).

In Hosmer and Lemeshow's analysis, interactions are also considered, leading the construction of 15 additional variables formed from pairwise products of the others. The final model includes two of these--AGExPRIORFRAC and MOMFRACxARMASSIST--as well as AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, and RATERISK3. The same model and a reformulated version of it will be demonstrated in RegressIt.

First, let's look at the correlation matrix of the variables in the model:

**In the Global Longitudinal study of Osteoporosis in Women (GLOW), data was collected from tens of thousands of women worldwide concerning risk factors for osteoporosis, and they were followed up to determine whether they suffered a bone fracture of any kind within the subsequent year. In the logistic regression model analyzed in the text by Hosmer and Lemeshow, the objective is to predict whether a participant will suffer a fracture, based on the risk factor data. The sample consists of a subset of 500 records for participants from the U.S., and those who suffered fractures were over-sampled so that their fraction is exactly 25%, compared to an actual fraction of 4% within the larger population.**

The variables are FRACTURE (yes or no), PRIORFRACTURE (prior fracture, yes or no), AGE (years), WEIGHT (kilograms), HEIGHT (cm), BMI (body mass index in kg/m2), PREMENO (premenopausal, yes or no), MOMFRAC (mother had hip fracture, yes or no) ARMASSIST (arms are needed to stand from chair, yes or no) SMOKE (former or current smoker, yes or no), and RATERISK (self-reported risk of fracture relative to others of same age, on a scale from 1 to 3).

In Hosmer and Lemeshow's analysis, interactions are also considered, leading the construction of 15 additional variables formed from pairwise products of the others. The final model includes two of these--AGExPRIORFRAC and MOMFRACxARMASSIST--as well as AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, and RATERISK3. The same model and a reformulated version of it will be demonstrated in RegressIt.

First, let's look at the correlation matrix of the variables in the model:

The variables most strongly correlated with FRACTURE are AGE, AGExPRIORFRAC, and PRIORFRAC, all with correlations sllightly greater than 0.2, not large in absolute terms. (A big value of R-squared is not to be expected.) The most interesting number in the matrix is the correlation of 0.99 between PRIORFRAC and AGExPRIORFRAC. To see what is going on, let's look at series plots of these variables, with the data sorted on PRIORFRAC so that all of its zeros come first. The following pictures are obtained:

The values of AGE are all positive and bounded far away from zero relative to their range (55 to 90). The variable AGExPRIORFRAC therefore looks quite a lot like PRIORFRAC. In fact, its pattern is the same except for the addition of some noise in the string of positive values at the end, hence the very high correlation between them.

The summary statistics and coefficient table for Hosmer and Lemeshow's model are shown below. R-squared and adjusted R-squared (McFadden's version) are pretty small (0.11 and 0.078) as anticipated. The coefficient estimates are all (barely) significantly different from zero by the usual standard, with z-stats (analogous to t-stats in a linear regression model) roughly between 2.0 and 3.5 magnitude.

A red flag is waving, though: the variance inflation factors for the two highly correlated variables are in the upper 60's! Normally a VIF greater than 10 is considered as an indicator of significant multicollinearity, so this is pathological multicollinearity. An independent variable's VIF is the value of 1-divided-by-1-minus-R-squared in a linear regression of itself on the other independent variables. A VIF in the upper 60's corresponds to an R-squared of greater than 98% in such a regression, and this could have been anticipated from the correlation of 0.99 between these two variables alone. In the correlation matrix of coefficients for this model (not shown here), the correlation between the coefficient estimates for these two variables is greater than 0.99.

The summary statistics and coefficient table for Hosmer and Lemeshow's model are shown below. R-squared and adjusted R-squared (McFadden's version) are pretty small (0.11 and 0.078) as anticipated. The coefficient estimates are all (barely) significantly different from zero by the usual standard, with z-stats (analogous to t-stats in a linear regression model) roughly between 2.0 and 3.5 magnitude.

A red flag is waving, though: the variance inflation factors for the two highly correlated variables are in the upper 60's! Normally a VIF greater than 10 is considered as an indicator of significant multicollinearity, so this is pathological multicollinearity. An independent variable's VIF is the value of 1-divided-by-1-minus-R-squared in a linear regression of itself on the other independent variables. A VIF in the upper 60's corresponds to an R-squared of greater than 98% in such a regression, and this could have been anticipated from the correlation of 0.99 between these two variables alone. In the correlation matrix of coefficients for this model (not shown here), the correlation between the coefficient estimates for these two variables is greater than 0.99.

A related problem appears in the table of exponentiated coefficients: the exponentiated coefficient of PRIORFRAC is 100!

A variable's exponentiated coefficient is the factor by the which the predicted odds in favor of a positive outcome are multiplied when that variable's value is increased by one unit, holding the other variables fixed. Here this means that the predicted odds in favor of a fracture are increased by a factor of 100 by the presence of a prior fracture (i.e., by changing PRIORFRAC from 0 to 1). This seems rather surprising at first glance, given that the correlation between PRIORFRAC and FRACTURE is not very large. What's going on? The catch here is "holding the other variables fixed." Because AGE is always positive, the value of AGExPRIORFRAC cannot be held fixed as the value of PRIORFRAC is changed from 0 to 1. It must go up by an amount that is somewhere between 55 and 90. Therefore the exponentiated coefficient of PRIORFRAC cannot be evaluated independently of the exponentiated coefficient of AGExPRIORFRAC. What one gives the other takes away.

The classification table and ROC chart for the model look like this with the default cutoff value of 0.5:

The classification table and ROC chart for the model look like this with the default cutoff value of 0.5:

With this cutoff value, the model achieves a very low

*false*positive rate (fraction of negative outcomes that are incorrectly predicted to be positive) but also a very low*true*positive rate (fraction of positive outcomes that are correctly predicted to be positive), and its position on the ROC curve is therefore down in the lower left corner of the chart. (The false positive rate is 1 minus the true negative rate.) A lower value for the cutoff level will increase the number of positive predictions, leading to an increase in the true positive rate at the expense of the false positive rate, i.e., a movement to a higher point on the curve. Here's the picture for a cutoff value of 0.3:The percent of correct predictions has gone down very slightly (76% to 73%), and the model has achieved a very great increase in the true positive rate (18% to 62%) without nearly as great an increase in the false positive rate (5% to 23%). Whether this new cutoff value should be chosen in preference to the default value of 0.5 ought to depend on the relative consequences of error in decisions that may depend on the predictions. The ROC chart is agnostic about the cutoff level: it merely shows you the whole frontier of what is achievable.

**What can be done about the multicollinearity issue?**It can be addressed by constructing the interaction variables in a different but logically equivalent way in order to reduce their correlations with other variables. Here's one way to do it. First, apply a standardization transformation to AGE to convert it to units of standard deviations from its own mean. This can be done in a few seconds by using the variable transformation tool in RegressIt, yielding a new variable called AGE_STDZ. Then create two new interaction variables, one of which is PRIORFRAC times AGE_STDZ and the other of which is (1- minus PRIORFRAC) times AGE_STDZ. When these variables are plotted alongside PRIORFRAC, they look like this:The values of the two transformed variables are now centered around zero, so that neither of them has an extremely large correlation with the other or with PRIORFRAC. More importantly, they are constructed in such a way as to treat the two prior-fracture conditions symmetrically, and the coefficients of the new interaction variables will be relatively independent measures of the age effect in both scenarios. This set of three variables is equivalent to the original set in terms predictive capability, because the three variables in the second set can be formed from linear combinations of those in the first set, and vice versa, but they are much more useful for inference. Here is the result that is obtained when the model is re-run with the two new interaction variables used in place of AGE and AGExPRIORFAC:

The variance inflation factors are now very satisfactory and the exponentiated coefficient of PRIORFRAC no longer has a crazy value. Most interestingly,

Notice that the error measures and the coefficients of all the other variables have been unchanged by this transformation of the interaction variables, reflecting the fact that the models are logically equivalent.

**the coefficient of AGE_STDZxPRIOR is not significantly different from zero in this model**(P=0.922)**.**This means that age does not add very much predictive value concerning a future bone fracture for someone who has already had a bone fracture. The latter fact by itself evidently carries much more useful information. (The t-statistic of PRIORFRAC's coefficient is 3.2) However, it does have predictive value for someone who has not yet experienced a bone fracture. This is very much in line with intuition.Notice that the error measures and the coefficients of all the other variables have been unchanged by this transformation of the interaction variables, reflecting the fact that the models are logically equivalent.