- Free add-in for Excel that performs multivariate descriptive data analysis and linear and logistic regression analysis.
- Written in Visual Basic, the program file is less than 1M in size and it can be launched on an as-needed basis or placed on the Excel add-ins menu.
- The linear regression version now runs on Macs (in the native Mac version of Excel) as well as on PCs. A separate version that includes logistic regression runs only in Windows but can be used on Macs with virtualization software such as VMware.
- The new versions feature a 40-button ribbon interface, pictured above, whose features are unique among Excel add-ins (and regression software more generally) in terms of the support that they provide for easy navigation of the model space, comparing the fine details of models, documenting and presenting the results, and taking full advantage of touchscreen technology. A fully detailed analysis can be driven with one finger on a tablet computer, and the output is easily viewed and navigated on smart phones.
- The program is designed to support and and encourage best practices of data analysis by making it easy (even enjoyable) to do the right things. It makes a good complement to other statistical software for anyone who is at least an occasional Excel user.
- It provides abundant table and chart output that is presentation-quality and can be generated either in editable Excel format or in low- or high-resolution fixed formats.
- RegressIt can be operated simultaneously with other Excel statistics add-ins such as XLSTAT, Analyse-it, StatTools, and SigmaXL, effectively adding more menu options and better-designed tables and charts to their regression procedures. (See these slides for a direct comparison of regression output produced by the different add-ins.)
- RegressIt also reads and writes R code for linear regression, so it can be used as a front end for running models in R (with no typing of code) or a back end for producing additional table and chart output in Excel format.
- Charts are formatted in accordance with good principles of statistical graphics, they are uniquely titled and descriptively labeled, they have properly scaled axes, and their point sizes are adjusted for the size of the data set. (Excel's default formatting is not used.) As Excel chart objects, they can be instantly enlarged to fill the screen by moving them to individual worksheets, and they are subject to further editing or active linking after being copied to Word or Powerpoint files.
- Numbers are formatted with enough decimal places but not too many, and they are lined up to the best degree possible in tables. An advantage of the Excel environment is that numbers are stored in full precision within cells, so they do not need to be formatted to show many more digits than are useful.
- Variables are defined as named column ranges, they can be up to 1M rows in length on a PC and up to 120K rows long on a Mac, and they can have descriptive names up to 30 characters in length. All tables, charts, and dialog boxes are formatted to accommodate long variable names with full visibility.
- The output is well-organized and well-labeled for navigation, presentation, and audit trail purposes. The results of each data analysis or regression analysis are stored on a separate worksheet with an identifying analysis name that appears in all table and chart titles. Default names indicate the chronological sequence (Stats 1, Stats 2, ..., Model 1, Model 2, etc.) but customized names and numbering schemes can be assigned when procedures are executed. For example, a model called Product 1.2 is followed by Product 1.3 by default.
- A separate model summaries worksheet provides an audit trail for all regression analyses performed in the same workbook, including a table of side-by-side comparisons of model statistics.
Data analysis procedure
- Produces a table of descriptive statistics and correlation matrix and optional series plots and scatterplots and histogram plots. Autocorrelations at user-selected lags (not necessarily consecutive) can also be included. Click here for an example.
- Variables are sorted alphabetically by default in the table and chart arrays, but there is an option to designate a variable to list first, say, the dependent variable in a multiple regression analysis or a common predictor for several dependent variables.
- The series plots, which are stacked one above the other, can show the data as points, lines, points-and-lines, or vertical bars.
- The correlation matrix uses colors and font shading to highlight the most and least significant values, and column names are staggered down the diagonal to allow long variable names to be fully visible. When the dependent variable for a regression model is selected as the variable to list first, the first column of the correlation matrix shows its correlations with all the other variables right next to their names.
- The scatterplot matrix can be either a full square array or it can be limited to those plots in which the variable-to-list-first appears on either the X-axis or Y-axis. Each plot in the matrix is a separate Excel chart that can be independently moved, copied, re-sized, and edited.
- The titles of scatterplots include the correlation and either its squared value or the slope coefficient.
- Scatterplot axes are scaled so that their limits are the minimum and maximum values of the variables, to spread the points out as fully as possible, and those values (plus the midpoints of the ranges) are shown on the axis scales to provide additional summary statistics for the viewer.
- The scatterplots may also optionally include regression lines and/or center-of-mass points.
- Histogram plots also include some descriptive statistics printed on them: the minimum, maximum, and midpoint of each variable.
- The PC version can handle up to 250 independent and up to 1M rows of data, subject to a maximum of around 10M cells in the data matrix. The Mac version can handle up to 125 variables and up to 120K rows of data.
- The regression output worksheet is formatted to place the most important information at the top with as much as possible in view at one time. Click here for an example of the default view of the top of the worksheet for a simple regression model.
- The title of every regression table and graph is automatically labeled with the analysis name, dependent variable name, sample size, and number of independent variables. This is helpful for presentation purposes and makes it easy to trace any piece of output back to its source if it is copied elsewhere.
- R-squared and adjusted R-squared, standardized coefficients, t-statistics, F-statistics, and P-values are calculated with live formulas for educational purposes.
- Confidence limits in tables and charts are also calculated with live formulas, allowing the confidence level (stored in cell I10) to be interactively adjusted after fitting the model, which is useful for demonstrations and preparing reports.
- Standard chart output includes plots of actual and predicted values vs. row number (with confidence limits for forecasts, if any), residuals vs. observation number, residuals vs. predicted values, and a residual histogram chart. A line fit plot with confidence bands is also produced for a simple (1-variable) regression model. Click here for an annotated example.
- The residuals vs. observation number plot is formatted as a column chart to highlight signs and time patterns.
- Other optional chart output includes a normal quantile plot and residuals vs. independent variable plots.
- Statistical tests of residuals include the adjusted Anderson-Darling statistic and/or Jarque-Bera statistic, and, for time series data, residual autocorrelations for any lags specified by the user The significance of the normality test statistic is printed on the residual histogram and normal quantile plots, and the autocorrelation at the first listed lag can be shown on the residual-vs-observation# chart.
- When the regression procedure is launched from an existing regression model sheet, the starting variable selections are the ones used in that model, making it easy to tweak any previously fitted model and explore the model space in a systematic way.
- When the data analysis procedure is launched from an existing regression model sheet, the starting variable selections are the ones used in that model, with the dependent variable designated as the first variable, so that a descriptive data analysis report can be instantly produced to match the variables and sample used in the regression model.
- The regression model sheet includes an optional residual table which shows not only the actual and predicted values and residuals, but also standardized and absolute standardized residuals, leverage, and Cook's D. Excel's filtering tool, which is included on the RegressIt ribbon, can be used to interactively sort the table on any of the statistics in order to identify the most unusual or influential cases.
Forecasting outside the sample
- If the "Forecasts for missing or additional values of the dependent variable" box is checked, a forecast table will be included on the regression model output worksheet, and forecasts for any rows where the independent variables were all present and the dependent variable was missing will be produced automatically.
- Forecasts are displayed in a table and also plotted, together with confidence limits for means and forecasts. The actual-and-predicted-versus-observation-number chart will also show the forecasts and their confidence limits if the forecast table and chart are not hidden.
- The confidence limits shown in the forecast table respond interactively to changes of the value in the confidence-level cell at the top of the worksheet (which can be adjusted hitting buttons on the ribbon) , and they also are interactive on the charts the editable option was used,
- If the option to export predictions and residuals back to the data sheet is used, the out-of-sample forecasts are included in the column of predicted values. This can be used for later out-of-sample testing and comparisons of models on that basis.
Variable transformation procedure
- Allows variables to be easily transformed with a variety of commonly used functions and, in the case of time series data, time transformations.
- Mathematical transformations include natural log, exponential, square root, power, standardization with either the sample or population standard deviation, and creation of dummy variables from either numeric or text data.
- Time transformations include lag by k periods, difference from k periods ago, difference of natural log from k periods ago, percentage change from k periods ago, deflation at a fixed rate, and centered and trailing moving averages.
- Each created variable is automatically assigned a descriptive name that concatenates the variable name and the transformation, such as X.LN for natural log of X or X.LAG1 for X lagged by 1 period, and it is stored in a new column that is inserted immediately to the right of its parent variable. This provides a clear audit trail and also ensures that the transformed variables appear adjacent to their parent variables when their names are alphabetically sorted in tables of statistics or coefficient estimates or in chart arrays.
Model summaries worksheet
- Contains side-by-side comparisons of summary statistics and coefficient estimates (with P-values) for all regression models fitted in the same workbook. This serves as an audit trail for the contents of the workbook as well as a tool for easily comparing model statistics.
- Provides a well-formatted presentation of comparative results from multiple models that is suitable for directly copying into a slide or report.
- Results for models with different dependent variables are arranged in separate blocks of rows, i.e., statistics are arranged side-by-side only for models fitted to the same dependent variable.
- The documented regression models can be re-visited or re-launched from the specifications on this sheet with a couple of mouse clicks.