Chapter 7 Regression estimation
iNZight can do simple regression and association analyses from the main interface screen. More complex regression analyses can be run from the Model Fitting screen.
Make sure the Survey Design is correctly specified for your data first.
Then go to Advanced > Model Fitting.

- Choose the outcome variable in `Variable`.
- In the `Framework` box choose from `Least Squares` for continuous data, `Binary Regression` for categorical data, and `Poisson Regression` for count data.
- Choose a transformation (if needed) for the outcome variable: the choices are none, log, exp, square root, and inverse.
- Choose the explanatory variables (`Variables of Interest`) to go into the model: drag from the list at left to the box at right.
- Click on a variable you have selected and then click the spanner button above to add transformations and interactions.
- Double-click a variable you have selected to remove it. (There is no need to use the `Confounding Variables` section.)
- Look at the `Model Plots` page to see whether the residuals look reasonable (residuals \(e_i = y_i - \widehat{y}_i\) plotted against fitted values \(\widehat{y}_i\)); check for any systematic patterns (curvature, outliers, etc.).
- Look at the `Model Output` page to see the parameter estimates, standard errors, confidence intervals, and \(p\)-values for tests of each parameter being zero.

To get back to the main screen after Model Fitting, click `Home` at bottom left.
7.1 Example
The `apiclus2` data set is a two-stage cluster sample of schools (population units) within school districts (clusters). Carry out a regression analysis of the association of Academic Performance (`api00`) with the percentage of students receiving school meals (`meals`).

- Define the Survey Design (1st stage clustering variable = `dnum`, 2nd stage clustering variable = `snum`, finite population correction = `fpc1` + `fpc2`, the total number of clusters and the cluster sizes).
- Go to Advanced > Model Fitting.
- Set `Variable` = `api00` and `Framework` = `Least Squares`, with no Transformation.
- Drag `meals` into `Variables of Interest`.
Model Output:
--------------------------------------------------------------------------------
# Summary of Model 1: api00 ~ meals
Survey Generalised Linear Model for: api00
Survey design:
survey::svydesign(id = ~ dnum + snum, fpc = ~ fpc1+fpc2, data = dataSet)
Coefficients:
Estimate Std. Error t value p-value 2.5 % 97.5 %
(Intercept) 8.217e+02 2.524e+01 3.256e+01 < 2e-16 *** 772.24 871.172
meals -2.872e+00 4.176e-01 -6.876e+00 3.62e-08 *** -3.69 -2.053
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 8803.949)
Null deviance: 2340306 on 125 degrees of freedom
Residual deviance: 1100494 on 38 degrees of freedom
AIC: 1555.4
Number of Fisher Scoring iterations: 2
--------------------------------------------------------------------------------
The percentage of students receiving school meals is a highly significant predictor (slope \(-2.87\), \(p\)-value \(3.6\times 10^{-8}\)): each additional percentage point of students receiving school meals is associated with a drop of about 2.87 points in `api00`.
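The printed confidence intervals appear to come from a normal approximation, estimate \(\pm 1.96 \times\) standard error. A quick sketch checking this against the `meals` row of the output above:

```python
# Check (assumption: a normal approximation was used for the CI):
# the printed 95% CI for `meals`, (-3.69, -2.053), should match
# estimate +/- 1.96 * standard error.
est, se = -2.872, 0.4176  # values from the Model Output table
lo, hi = est - 1.96 * se, est + 1.96 * se
print(round(lo, 2), round(hi, 2))
```

The reproduced endpoints agree with the printed interval to the precision shown.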
Add in some other predictors: e.g. `grad.sch` (percentage of parents with postgraduate education) and `ell` (percentage of English Language Learners).
--------------------------------------------------------------------------------
# Summary of Model 1: api00 ~ meals + grad.sch + ell
Survey Generalised Linear Model for: api00
Survey design:
survey::svydesign(id = ~ dnum + snum, fpc = ~ fpc1+fpc2, data = dataSet)
Coefficients:
Estimate Std. Error t value p-value 2.5 % 97.5 %
(Intercept) 6.840e+02 3.935e+01 1.738e+01 < 2e-16 *** 606.871 761.1327
meals 3.539e-01 1.094e+00 3.236e-01 0.7481 -1.789 2.4973
grad.sch 4.704e+00 9.865e-01 4.768e+00 3.04e-05 *** 2.770 6.6371
ell -3.351e+00 1.304e+00 -2.569e+00 0.0145 * -5.907 -0.7944
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 5919.871)
Null deviance: 2340306 on 125 degrees of freedom
Residual deviance: 739984 on 36 degrees of freedom
AIC: 1509.4
Number of Fisher Scoring iterations: 2
--------------------------------------------------------------------------------
With these two predictors present, school meals is no longer a significant predictor of `api00`.
7.2 Improvement of population estimates
As well as allowing investigation of the relationships between variables within survey data sets, regression estimation can be used to improve population-level estimates.
For example, if a numerical variable \(Y_i\) can be modelled well by a set of \(J\) variables \(\mathbf{X}_i=(X_{i1}, \ldots, X_{iJ})\):
\[ Y_i = \mathbf{X}_i^T\beta + \varepsilon_i = \sum_{j=1}^J X_{ij}\beta_j + \varepsilon_i \]
then the simple Horvitz-Thompson estimate of the population total \(Y\),
\[ \widehat{Y}_{\mathrm{HT}} = \sum_{k=1}^n w_k y_k, \]
may be improved by replacing the observed values \(y_k\) with the fitted values \(\widehat{y}_k\) from the regression:
\[\begin{eqnarray*} \widehat{Y}_{\mathrm{reg}} &=& \sum_{k=1}^n w_k \widehat{y}_k\\ &=& \sum_{k=1}^n \sum_{j=1}^J w_k X_{kj}\widehat{\beta}_j\\ &=& \sum_{j=1}^J \widehat{X}_j\widehat{\beta}_j \end{eqnarray*}\]
where \(\widehat{X}_j=\sum_{k=1}^n w_k X_{kj}\) is the Horvitz-Thompson estimate of the population total of the \(j^{\mathrm{th}}\) variable \(X_j\).

If the true values of these totals are known (from some external source) then an even more precise estimate of \(Y\) may be created by replacing the estimates \(\widehat{X}_j\) with the true values \(X_j\):
\[ \widehat{Y}_{\mathrm{R}} = \sum_{j=1}^J X_j\widehat{\beta}_j. \]

A particular example is the ratio estimator, where there is a single predictor variable \(X\) and the regression relationship is a simple proportionality:
\[ Y_i = \beta X_i + \varepsilon_i, \]
which for a SRSWOR leads to \(\widehat{\beta}=\bar{y}/\bar{x}\) and thus to \(\widehat{Y}_{\mathrm{R}}=X\bar{y}/\bar{x}\).
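As a concrete numerical illustration, here is a small Python sketch of the Horvitz-Thompson and ratio estimates of a total under SRSWOR. All the numbers are made up for illustration; nothing here comes from the `apiclus2` example.

```python
# Hypothetical setup: a SRSWOR of n = 5 units from a population of N = 1000,
# with the population total of the predictor X known from an external source.
N = 1000          # population size (assumed known)
X_total = 52_000  # known population total of X (hypothetical)

# Hypothetical sample: (x_k, y_k) pairs.
sample = [(48, 95), (55, 110), (40, 82), (60, 118), (52, 104)]
n = len(sample)
w = N / n  # under SRSWOR every sampled unit gets the same weight N/n

xbar = sum(x for x, _ in sample) / n
ybar = sum(y for _, y in sample) / n

# Horvitz-Thompson estimate of the population total of Y: sum of w_k * y_k.
Y_ht = sum(w * y for _, y in sample)

# Ratio estimate: beta-hat = ybar/xbar, then Y-hat_R = X_total * beta-hat,
# i.e. the known X-total scaled by the sample ratio of y to x.
beta_hat = ybar / xbar
Y_ratio = X_total * beta_hat

print(Y_ht, round(Y_ratio, 1))
```

When \(x\) and \(y\) are strongly proportional, as in this toy sample, the ratio estimate typically has a much smaller variance than the plain Horvitz-Thompson estimate.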
The generalised regression (GREG) estimator is yet another alternative: \[\begin{eqnarray*} \widehat{Y}_{\mathrm{greg}} &=& \sum_{i=1}^N \widehat{y}_i + \sum_{k=1}^n w_k(y_k - \widehat{y}_k)\\ &=& \sum_{j=1}^J X_j\widehat{\beta}_j + (\widehat{Y}_{\mathrm{HT}} - \widehat{Y}_{\mathrm{reg}}) \end{eqnarray*}\]
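A minimal numeric sketch of the GREG estimator, again with hypothetical numbers, for a model with an intercept and one predictor, so that \(\mathbf{X}_i = (1, x_i)\) and the known population totals are \((N, X)\):

```python
# Hypothetical SRSWOR sample and known population totals (made-up numbers).
N, X_total = 1000, 52_000
sample = [(48, 95), (55, 110), (40, 82), (60, 118), (52, 104)]
n = len(sample)
w = N / n  # equal SRSWOR weights

xbar = sum(x for x, _ in sample) / n
ybar = sum(y for _, y in sample) / n

# Weighted least-squares slope and intercept; with equal weights this
# reduces to ordinary least squares.
sxy = sum((x - xbar) * (y - ybar) for x, y in sample)
sxx = sum((x - xbar) ** 2 for x, _ in sample)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# GREG: population total of fitted values (using the known totals N and
# X_total) plus the weighted sum of the sample residuals.
resid_term = sum(w * (y - (b0 + b1 * x)) for x, y in sample)
Y_greg = N * b0 + X_total * b1 + resid_term

print(round(Y_greg, 1))
```

With equal weights the least-squares residuals sum to zero, so the residual correction term vanishes and the GREG estimate reduces to the regression estimate \(\sum_j X_j\widehat{\beta}_j\); with unequal weights the correction term is generally non-zero.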
iNZight is not (yet) set up for this kind of estimation.