Chapter 7 Regression estimation

iNZight can do simple regression and association analyses from the main interface screen. More complex regression analyses can be run from the Model Fitting screen.

Make sure the Survey Design is correctly specified for your data first.

Then:

  • Advanced > Model Fitting
  • Choose the outcome variable in Variable
  • In the Framework box, choose Least Squares for a continuous outcome, Binary Regression for a binary (two-category) outcome, or Poisson Regression for a count outcome.
  • Choose a transformation for the outcome variable if needed: none, log, exp, square root, or inverse
  • Choose the explanatory variables (Variables of Interest) to go into the model: drag from the list at left to the box at right.
  • Click on a variable you have selected and then click the spanner button above to add transformations and interactions
  • Double click a variable you have selected to remove it. (No need to use the Confounding Variables section.)
  • Look at the Model Plots page to see if the residuals look reasonable (residuals \(e_i=y_i-\widehat{y}_i\) plotted against fitted values \(\widehat{y}_i\)) - check for any systematic patterns (curvature, outliers, etc.)
  • Look at the Model Output page to see the parameter estimates, standard errors, confidence intervals, and \(p\)-values for tests of each parameter being zero.

To get back to the main screen after Model Fitting, click Home at bottom left.
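Behind the scenes, iNZight fits these models with R's survey package (as the Model Output listings below show). For intuition about what the Least Squares framework computes, here is a minimal numpy sketch of survey-weighted least-squares point estimates on synthetic data. The function name `weighted_ls`, the weights, and the data are all illustrative assumptions; design-based standard errors (which require the survey design, not just the weights) are not computed here.

```python
import numpy as np

def weighted_ls(X, y, w):
    """Solve the weighted normal equations (X'WX) beta = (X'W) y,
    giving survey-weighted least-squares point estimates."""
    XtW = X.T * w                      # broadcast weights across rows
    return np.linalg.solve(XtW @ X, XtW @ y)

# Synthetic illustration: outcome generated as 800 - 3*x + noise,
# with made-up design weights (these data are NOT apiclus2).
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 100, n)
w = rng.uniform(1, 5, n)               # hypothetical design weights
y = 800 - 3 * x + rng.normal(0, 10, n)

X = np.column_stack([np.ones(n), x])   # intercept + one predictor
beta = weighted_ls(X, y, w)
print(beta)                            # roughly [800, -3]
```

The estimates recover the generating coefficients; in iNZight the same point estimates come with design-correct standard errors from the survey package.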

7.1 Example

The apiclus2 data set is a two-stage cluster sample of schools (population units) within school districts (clusters). Carry out a regression analysis of Academic Performance (api00) on the percentage of students receiving school meals (meals).

  • Define the Survey Design (1st stage clustering variable = dnum, 2nd stage clustering variable = snum, finite population correction = fpc1 + fpc2, where fpc1 gives the number of clusters in the population and fpc2 the number of population units in each sampled cluster)
  • Advanced > Model Fitting
  • Set Variable = api00, Framework = Least Squares, no Transformation
  • Drag meals into Variables of Interest

Model Output:


--------------------------------------------------------------------------------

# Summary of Model 1: api00 ~ meals

Survey Generalised Linear Model for: api00

Survey design:
survey::svydesign(id = ~ dnum + snum, fpc = ~ fpc1+fpc2, data = dataSet)

Coefficients:
              Estimate Std. Error    t value    p-value      2.5 %  97.5 %
(Intercept)  8.217e+02  2.524e+01  3.256e+01  < 2e-16   *** 772.24 871.172
meals       -2.872e+00  4.176e-01 -6.876e+00 3.62e-08   ***  -3.69  -2.053
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for gaussian family taken to be 8803.949)

    Null deviance: 2340306  on 125  degrees of freedom
Residual deviance: 1100494  on  38  degrees of freedom
AIC: 1555.4

Number of Fisher Scoring iterations: 2


--------------------------------------------------------------------------------

The percentage receiving school meals is a highly significant predictor (coefficient \(-2.87\) with \(p\)-value 3.6e-08).

Add in some other predictors: e.g. grad.sch (percentage of parents with postgraduate education) and ell (percentage of English Language Learners).


--------------------------------------------------------------------------------

# Summary of Model 1: api00 ~ meals + grad.sch + ell

Survey Generalised Linear Model for: api00

Survey design:
survey::svydesign(id = ~ dnum + snum, fpc = ~ fpc1+fpc2, data = dataSet)

Coefficients:
              Estimate Std. Error    t value    p-value       2.5 %   97.5 %
(Intercept)  6.840e+02  3.935e+01  1.738e+01  < 2e-16   *** 606.871 761.1327
meals        3.539e-01  1.094e+00  3.236e-01   0.7481        -1.789   2.4973
grad.sch     4.704e+00  9.865e-01  4.768e+00 3.04e-05   ***   2.770   6.6371
ell         -3.351e+00  1.304e+00 -2.569e+00   0.0145   *    -5.907  -0.7944
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for gaussian family taken to be 5919.871)

    Null deviance: 2340306  on 125  degrees of freedom
Residual deviance:  739984  on  36  degrees of freedom
AIC: 1509.4

Number of Fisher Scoring iterations: 2


--------------------------------------------------------------------------------

With these two predictors present, school meals is no longer a significant predictor of api00.

7.2 Improvement of population estimates

As well as allowing investigation of the relationships between variables within a survey data set, regression estimation allows the improvement of population-level estimates.

For example, if a numerical variable \(Y_i\) can be modelled well by a set of \(J\) variables \({\mathbf{X}}_i=(X_{i1}, \ldots, X_{iJ})\): \[ Y_i = \mathbf{X}_i^T\beta + \varepsilon_i = \sum_{j=1}^J X_{ij}\beta_j + \varepsilon_i \] then the simple Horvitz-Thompson estimate of the population total \(Y\) \[ \widehat{Y}_{\mathrm{HT}} = \sum_{k=1}^n w_k y_k \] may be improved by replacing the observed values \(y_k\) by the fitted values \(\widehat{y}_k\) from the regression \[\begin{eqnarray*} \widehat{Y}_{\mathrm{reg}} &=& \sum_{k=1}^n w_k \widehat{y}_k\\ &=& \sum_{k=1}^n \sum_{j=1}^J w_kX_{kj}\widehat{\beta}_j\\ &=& \sum_{j=1}^J \widehat{X}_j\widehat{\beta}_j \end{eqnarray*}\] where \(\widehat{X}_j=\sum_{k=1}^n w_kX_{kj}\) is an estimate of the population total of the \(j^{\mathrm{th}}\) variable \(\mathbf{X}_j\). If the true values of these totals are known (from some external source) then an even more precise estimate of \(Y\) may be created by replacing \(\widehat{X}_j\) with the true values \(X_j\): \[\begin{eqnarray*} \widehat{Y}_{\mathrm{R}} &=& \sum_{j=1}^J X_j\widehat{\beta}_j \end{eqnarray*}\]

A particular example is the ratio estimator, where there is a single predictor variable \(X\) and the regression relationship is a simple proportionality: \[ Y_i = \beta X_i + \varepsilon_i \] which for a SRSWOR leads to \(\widehat{\beta}=\bar{y}/\bar{x}\) and thus to \(\widehat{Y}=X\bar{y}/\bar{x}\).
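The gain from the ratio estimator can be seen numerically. The following sketch compares the Horvitz-Thompson and ratio estimates of a population total under SRSWOR with equal weights \(w_k = N/n\); the population, sample sizes, and coefficients are all synthetic illustrative assumptions.

```python
import numpy as np

# Toy population: x known for every unit, y observed only in the sample.
rng = np.random.default_rng(42)
N, n = 1000, 50
x_pop = rng.uniform(10, 50, N)
y_pop = 2.5 * x_pop + rng.normal(0, 3, N)    # y roughly proportional to x

idx = rng.choice(N, n, replace=False)        # SRSWOR
x_s, y_s = x_pop[idx], y_pop[idx]
w = N / n                                    # equal design weights

Y_ht = w * y_s.sum()                         # Horvitz-Thompson estimate
X_total = x_pop.sum()                        # known population total of x
Y_ratio = X_total * y_s.mean() / x_s.mean()  # ratio estimator X * ybar/xbar

print(Y_ht, Y_ratio, y_pop.sum())
```

Because \(y\) is close to proportional to \(x\), the ratio estimator is typically far closer to the true total than the Horvitz-Thompson estimate.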

The generalised regression (GREG) estimator is yet another alternative; it adds to the regression prediction a design-weighted sum of the sample residuals: \[\begin{eqnarray*} \widehat{Y}_{\mathrm{greg}} &=& \sum_{i=1}^N \widehat{y}_i + \sum_{k=1}^n w_k(y_k - \widehat{y}_k)\\ &=& \sum_{j=1}^J X_j\widehat{\beta}_j + (\widehat{Y}_{\mathrm{HT}} - \widehat{Y}_{\mathrm{reg}}) \end{eqnarray*}\]
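A numerical sketch of the GREG estimator, again under synthetic SRSWOR assumptions (all names and numbers are illustrative). A useful check falls out of the algebra: when the model includes an intercept and the fitting weights equal the design weights, the weighted residuals sum to zero, so \(\widehat{Y}_{\mathrm{HT}}=\widehat{Y}_{\mathrm{reg}}\) and the correction term vanishes.

```python
import numpy as np

rng = np.random.default_rng(7)
N, n = 1000, 80
X_pop = np.column_stack([np.ones(N), rng.uniform(0, 100, N)])
y_pop = X_pop @ np.array([50.0, 2.0]) + rng.normal(0, 5, N)

idx = rng.choice(N, n, replace=False)        # SRSWOR
Xs, ys = X_pop[idx], y_pop[idx]
w = np.full(n, N / n)                        # equal design weights

# Survey-weighted regression coefficients (weighted normal equations)
XtW = Xs.T * w
beta = np.linalg.solve(XtW @ Xs, XtW @ ys)

X_totals = X_pop.sum(axis=0)                 # known population totals of X
Y_ht = (w * ys).sum()                        # Horvitz-Thompson
Y_reg = (w * (Xs @ beta)).sum()              # totals estimated from sample
Y_greg = X_totals @ beta + (Y_ht - Y_reg)    # GREG estimator

# With an intercept and design-weight fitting, Y_ht equals Y_reg,
# so Y_greg reduces to the known-totals estimate X_totals @ beta.
print(Y_greg, y_pop.sum())
```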

iNZight is not (yet) set up for this kind of estimation.