svystatB {ReGenesees}    R Documentation

Estimation of Population Regression Coefficients

Description

Computes estimates, standard errors and confidence intervals for Multiple Regression Coefficients.

Usage

svystatB(design, model,
         vartype = c("se", "cv", "cvpct", "var"),
         conf.int = FALSE, conf.lev = 0.95, deff = FALSE,
         na.rm = FALSE)

## S3 method for class 'svystatB'
coef(object, ...)
## S3 method for class 'svystatB'
SE(object, ...)
## S3 method for class 'svystatB'
VAR(object, ...)
## S3 method for class 'svystatB'
cv(object, ...)
## S3 method for class 'svystatB'
deff(object, ...)
## S3 method for class 'svystatB'
confint(object, ...)
## S3 method for class 'svystatB'
summary(object, ...)

Arguments

design

Object of class analytic (or inheriting from it) containing survey data and sampling design metadata.

model

Formula specifying the linear model whose coefficients have to be estimated.

vartype

Character vector specifying the desired variability estimators. It is possible to choose one or more of: standard error ('se', the default), coefficient of variation ('cv'), percent coefficient of variation ('cvpct'), or variance ('var').

conf.int

Compute confidence intervals for the estimates? The default is FALSE.

conf.lev

Probability specifying the desired confidence level: the default value is 0.95.

deff

Should the design effect be computed? The default is FALSE (see ‘Details’).

na.rm

Should missing values (if any) be removed from the variables of interest? The default is FALSE (see ‘Details’).

object

An object of class svystatB.

...

Additional arguments to coef, ..., confint methods (if any).

Details

This function computes weighted estimates for Multiple Regression Coefficients using suitable weights depending on the class of design: calibrated weights for class cal.analytic and direct weights otherwise. Standard errors are calculated using the Taylor linearization technique.

The mandatory argument model identifies the regression model whose population coefficients have to be estimated (for details on model specification, see e.g. lm). The design variables referenced by model should be numeric or factor (variables of other types, e.g. character, will be coerced).
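
As a minimal sketch (assuming a design object sbsdes built on the sbs data via e.svydesign, as in the 'Examples' section below), a model mixing a numeric predictor and a factor predictor could be specified along these lines:

svystatB(sbsdes, va.imp2 ~ emp.num + nace.macro)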

The conf.int argument allows one to request confidence intervals for the estimates. By default conf.int = FALSE, i.e. confidence intervals are not provided.

Whenever confidence intervals are requested (i.e. conf.int = TRUE), the desired confidence level can be specified by means of the conf.lev argument. The conf.lev value must be a probability (0 <= conf.lev <= 1); the default is 0.95.
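
For instance (with sbsdes as in the sketch above), 90% confidence intervals for a single-predictor, no-intercept model could be requested as follows (the 0.9 level is just an illustrative choice):

svystatB(sbsdes, y ~ emp.num - 1, conf.int = TRUE, conf.lev = 0.9)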

The optional argument deff allows one to request the design effect [Kish 1995] for the estimates. By default deff = FALSE, i.e. the design effect is not provided. The design effect of an estimator is defined as the ratio between the variance of the estimator under the actual sampling design and the variance that would be obtained for an 'equivalent' estimator under a hypothetical simple random sampling without replacement of the same size. To obtain an estimate of the design effect relative to simple random sampling "with replacement", one must use deff = "replace".
For nonlinear estimators, the design effect is estimated on the linearized version of the estimator (that is, for the estimator of the total of the linearized variable, aka the "Woodruff transform").
When dealing with domain estimation, the design effects referring to a given subpopulation are currently computed by taking the ratios between the actual variance estimates and those that would have been obtained if a simple random sampling were carried out within that subpopulation. This is the same as the srssubpop option of Stata's function estat.
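
As a further sketch (again with sbsdes as above), design effects with respect to simple random sampling without and with replacement could be requested as follows:

svystatB(sbsdes, y ~ emp.num - 1, deff = TRUE)
svystatB(sbsdes, y ~ emp.num - 1, deff = "replace")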

Missing values (NA) in model variables should be avoided. If na.rm = FALSE (the default) they generate an error. If na.rm = TRUE, observations containing NAs in model variables are dropped, and estimates get computed on non-missing values only. This implicitly assumes that missing values affect the interest variables completely at random: should this not be the case, the computed estimates would be biased.

The summary method, when invoked on regression coefficients (say b) estimated via svystatB, gives p-values and significance codes for the component-wise tests b = 0. Such values are computed assuming that the distribution of the regression coefficient estimators is normal (which is asymptotically true for large scale surveys). This assumption has the advantage of overcoming the problem of choosing the "right" statistic and assessing its "right" number of degrees of freedom when using data from a complex survey (see e.g. [Korn, Graubard 1990]).

Value

An object inheriting from the data.frame class, whose detailed structure depends on input parameters' values.

Author(s)

Diego Zardetto

References

Sarndal, C.E., Swensson, B., Wretman, J. (1992) "Model Assisted Survey Sampling", Springer-Verlag.

Kish, L. (1995) "Methods for design effects". Journal of Official Statistics, Vol. 11, pp. 55-77.

Korn, E.L., Graubard, B.I. (1990) "Simultaneous testing of regression coefficients with complex survey data: Use of Bonferroni t statistics". The American Statistician, Vol. 44, pp. 270-276.

See Also

Estimators of Totals and Means svystatTM, Ratios between Totals svystatR, Shares svystatS, Ratios between Shares svystatSR, Quantiles svystatQ, Complex Analytic Functions of Totals and/or Means svystatL, and all of the above svystat.

Examples

######################################################
# A simple regression model with a single predictor. #
# Let's compare the estimated regression coefficient #
# to its true value computed on the sampling frame.  #
######################################################

# Load sbs data:
data(sbs)

# Create a design object:
sbsdes<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,
        fpc=~fpc)
 
# The population scatterplot of y vs emp.num reveals a linear
# behaviour:
plot(sbs.frame$emp.num,sbs.frame$y,
     col=rgb(50,205,50,100,maxColorValue=255),pch=16)

# Compute the population fit of the linear regression
# model y~emp.num-1 (no intercept):
pop.fit<-lm(y~emp.num-1,data=sbs.frame)
abline(pop.fit,col="red",lwd=2,lty=2)

# The obtained population R-squared is quite high
# (greater than 0.7):
pop.R2<-summary(pop.fit)$r.squared
pop.R2

# The population regression coefficient is:
B<-coef(pop.fit)
B

# Now let's estimate B on the basis of the sbs sample and
# let's build a 95% confidence interval for the obtained estimate:
svystatB(sbsdes,y~emp.num-1,conf.int=TRUE)

# Thus, the confidence interval covers the true value of B.

# Notice that, using the ReGenesees complex estimators function
# svystatL, you would have obtained exactly the same results:
sbsdes<-des.addvars(sbsdes,y4emp.num=y*emp.num,
                           emp.num.sq=emp.num^2)
svystatL(sbsdes,expression(y4emp.num/emp.num.sq),
         conf.int=TRUE)


##################################
# A multiple regression example. #
##################################

# Let's estimate the coefficients of a model describing
# value added (variable va.imp2) as a linear function
# of number of employees by region and of nace.macro:
b <- svystatB(sbsdes,va.imp2~emp.num:region+nace.macro,vartype="cvpct")
b

# To obtain p-values and significance codes for the
# component-wise tests b = 0, you can exploit the
# summary method:
summary(b)

# Notice that normality of the estimators is assumed.


##########################################
# Obtaining domain means via regression. #
##########################################

# The domain mean of a numeric variable can be thought
# of as a regression coefficient. Suppose you need the
# average number of employees by macro-sector; you can
# proceed as follows:
svystatB(sbsdes,emp.num~nace.macro-1)
 
# ...which, indeed, gives exactly the same results as:
svystatTM(sbsdes,y=~emp.num,by=~nace.macro,estimator="Mean")


##########################
# Handling collinearity. #
##########################

# Function svystatB overcomes problems arising from exact
# collinearity between model variables via 'aliasing'.
# To understand how aliasing works, let's build a manifestly
# redundant linear model:
svystatB(sbsdes,y~emp.num+I(2*emp.num)+I(3*va.imp2)+va.imp2-1)

# The warning message shows that order definitely matters
# in aliasing; indeed:
svystatB(sbsdes,y~emp.num+I(2*emp.num)+va.imp2+I(3*va.imp2)-1)

# Notice also that aliasing gives exact estimates and standard errors
# for non-aliased regression coefficients (i.e. the same results that
# would be obtained with a reduced - no collinearity - model):
svystatB(sbsdes,y~emp.num+va.imp2-1)


###############################################
# Handling missing values in model variables. #
###############################################

# Load fpcdat:
data(fpcdat)

# Now, let's introduce some NAs in survey data:
fpcdat$y[c(1,3)]<-NA
fpcdat$x[c(3,5)]<-NA

# Create a design object:
fpcdes<-e.svydesign(data=fpcdat,ids=~psu+ssu,strata=~stratum,weights=~w,
        fpc=~fpc1+fpc2)

# Let's estimate regression coefficients of model z~y+x

# na.rm=FALSE (the default) leads to an error:
## Not run: 
svystatB(fpcdes,z~y+x)
## End(Not run)

# whereas na.rm=TRUE simply drops all the cases
# with missing data in model variables:
svystatB(fpcdes,z~y+x,na.rm=TRUE)


##################################
# Handling non positive weights. #
##################################

# Non-positive direct weights are not allowed; however, some
# calibrated weights can sometimes turn out to be <= 0. The
# corresponding observations would be dropped by svystatB.

# Prepare a template for population totals:
pop<-pop.template(fpcdes,~z+pl.domain-1)

# Fill it with silly values in order to obtain some negative g-weights:
pop[1,]<-c(20000,90,10,90)

# Calibrate:
fpccal<-e.calibrate(fpcdes,pop)

# We got 2 negative calibrated weights:
g.range(fpccal)
sum(weights(fpccal)<=0)

# Now, let's estimate the regression coefficients of model z~y+x
# and pay attention to the warnings:
svystatB(fpccal,z~y+x,na.rm=TRUE)

