Exercises for day 3 of the course in Survey methodology

 

March 2, 2004

 

You can find this handout on www.uib.no/people/mihtr

(useful for copying commands)

 

Nutritional survey among fertile Nepalese women aged 13 to 35 years living in the municipality or living (short term) in the carpet factories

  1. Divided the municipality into 120 geographical clusters.
  2. Selected 28 clusters using simple random sampling.
  3. Generated a list of all eligible women in these 28 selected clusters.
  4. Within each cluster, selected women at random. The number of women selected was proportional to the size of the cluster: one fifth of all women within each cluster were selected.
  5. We had a list of 50 carpet factories (where the workers were residing). We selected carpet factories, and women within the carpet factories, using the same procedure as for the municipality.
  6. Carpet factories and municipality are the two strata in the dataset (denoted by the variable carpet).
  7. The clusters are denoted by toleno or tole. There are 23 clusters in the municipality stratum (carpet==0) and 5 in the carpet stratum (carpet==1). In total there are 500 observations in the dataset.
  8. The women from the carpet factories constitute 5% of the total population and about 20% (19.4%, 97/500) of the sample.
  9. We collected information about
    1. Hemoglobin level
    2. Plasma B12, folate, albumin, iron, zinc, copper and transferrin receptor.
  10. After the primary sampling units (PSUs) had been selected, we generated lists of all women residing in these clusters, from which we selected the women to be enrolled in the study. One problem was that the women in the carpet factory stratum migrated frequently. Thus, it became impossible to do a random selection within these 5 clusters.

 

The objective of the study was to describe the prevalence of anemia, micronutrient deficiencies and the mean values of hemoglobin and individual micronutrient concentrations.

 

QUESTION: Is the likelihood of being selected equal for all the women in the sampling frame, i.e. is the sample self-weighting?
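
One way to approach this question is to multiply the selection probabilities of the two sampling stages. The sketch below takes the figures in steps 1, 2 and 4 at face value for the municipality stratum (28 of 120 clusters, then one fifth of the women in each selected cluster); it is an illustration of the arithmetic, not the calculation used to construct the variable weight.

```stata
* Two-stage selection probability for a woman in the municipality stratum,
* assuming 28 of 120 clusters and one fifth of the women in each cluster
display (28/120) * (1/5)
```

The result is just under 0.047. The same product can be worked out for the carpet factory stratum from its own figures, and the two probabilities compared.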

 

 

______________________________________________________________________


 

 

  1. Load the data-set survey1 into STATA from the directory temp on C:\
  2. Type describe to list the various variables
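
From the command line, the two steps above might look like the sketch below. The exact path is an assumption based on the description in step 1; adjust it to wherever survey1.dta is stored on your machine.

```stata
* Assumed file location; change the path if the file is stored elsewhere
use "C:\temp\survey1.dta", clear
* List the variables with their storage types and labels
describe
```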

 

describe

 

Contains data from c/Temp/survey1.dta
  obs:           500
 vars:            15                          2 Mar 2004 18:25
 size:        42,500 (99.5% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
idno            float  %3.0f                  ID number
age             float  %8.0f                  Age of woman
hemoglobin1     float  %8.0g                  Hemoglobin level, method 1
salbumin        byte   %8.0g                  S-Albumin
sferritin       int    %8.0g                  S-Ferritin
stransferresep  float  %9.0g                  S-Transfer.resep
zn              float  %9.0g                  Zinc concentration
BMI             float  %9.0g
B12             float  %9.0g
FOLAT           float  %9.0g                  Folat concentration
tole            str30  %30s                   Tole
carpet          float  %9.0g                  Stratum (0=residents, 1=carpet)
toleno          float  %9.0g                  Tole number
hemoglobin2     float  %9.0g                  Hemoglobin level, method 2
weight          float  %9.0g
-------------------------------------------------------------------------------

 

 

 

Two of the variables (BMI and B12) do not have labels.

Create labels this way:

Type edit and the editor opens.

Double-click on the column heading (the variable name) and a dialog box appears.

 

Type: “Body mass index” in the Label field

 

Repeat this for the variable B12 where you type “B12 concentration”

 

Type describe again to see that the labels have been added to the variables.
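
The same labels can also be attached from the command line with label variable, which is convenient when the commands are kept in a do-file. A minimal sketch, using the label texts suggested above:

```stata
* Attach variable labels without opening the editor
label variable BMI "Body mass index"
label variable B12 "B12 concentration"
* Check that the labels are now shown
describe BMI B12
```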

 

 

 

 

 

Exercise 1 – the effect of variability on prevalence.

 

Commands to be used:

 

generate  generates a new variable

tabulate  tabulates

summarize outputs summary statistics of a variable

 

Because we were concerned that the standard method of assessing the hemoglobin concentration was biased, we assessed the hemoglobin concentration in two ways. The hemoglobin concentrations were recorded into these two variables:

 

hemoglobin1 and hemoglobin2

 

Use the command summarize to see whether one of the variables might give a biased estimate of the hemoglobin concentration.

 

summarize hemoglobin1

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
 hemoglobin1 |       500     13.2296    1.289096          7         16

 

summarize hemoglobin2

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
 hemoglobin2 |       500    13.24792    1.733713   6.317462    18.0518

 

 

QUESTION: Is method number 2 (hemoglobin2) inaccurate or imprecise compared to method 1 (hemoglobin1)?
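
One way to explore this question is to look at the difference between the two measurements for each woman. In the sketch below, hbdiff is a variable name chosen for this illustration. A mean difference far from zero would point to a systematic difference between the methods, while a large spread around the mean would point to imprecision.

```stata
* Within-woman difference between the two methods
* (hbdiff is a name chosen for this illustration)
generate hbdiff = hemoglobin2 - hemoglobin1
summarize hbdiff
* Plot method 2 against method 1 to see how closely they agree
scatter hemoglobin2 hemoglobin1
```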

 

 

 

 

 

Look at the distribution of the two variables using the command histogram

 

 

histogram hemoglobin1, xlab(5 10 15)

 

 

histogram hemoglobin2, xlab(5 10 15)

 

 

With the commands generate and tabulate, calculate the prevalence of anemia using the two methods of measuring hemoglobin. We defined women with a hemoglobin concentration of less than 12 grams per deciliter (g/dL) as having anemia.

A logical expression in parentheses returns 0 if it is false and 1 if it is true. Thus, in the commands below the variable anemia takes the value 0+0 = 0 if hemoglobin ≥ 12 g/dL and the value 0+1 = 1 if hemoglobin < 12 g/dL.

 

generate anemia1=0+(hemoglobin1<12)

generate anemia2=0+(hemoglobin2<12)

 

tabulate anemia1

tabulate anemia2
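
A caution when coding variables this way: Stata treats a missing numeric value as larger than any number, so the expression (hemoglobin1<12) evaluates to 0 for a woman whose hemoglobin value is missing, and she would be counted as not anemic. Both hemoglobin variables are complete here (500 observations each), so the results do not change, but a more defensive version could look like the sketch below. The name anemia1_chk is chosen for this illustration.

```stata
* Same definition, but left missing where hemoglobin1 is missing
* (anemia1_chk is a name chosen for this illustration)
generate anemia1_chk = (hemoglobin1 < 12) if hemoglobin1 < .
* Cross-check against the variable created earlier in this exercise
tabulate anemia1 anemia1_chk, missing
```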

 

 

. tabulate anemia1

    anemia1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        442       88.40       88.40
          1 |         58       11.60      100.00
------------+-----------------------------------
      Total |        500      100.00

. tabulate anemia2

    anemia2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        388       77.60       77.60
          1 |        112       22.40      100.00
------------+-----------------------------------
      Total |        500      100.00

 

QUESTION:

 

What is the prevalence of anemia (anemia ==1) according to

 

Method 1 (hemoglobin1, anemia1)?_________________________________

 

Method 2 (hemoglobin2, anemia2)?_________________________________

 

Can you explain what is going on?

 

 

____________________________________________________________________

 

 

 

In this dataset the primary sampling unit (PSU) is tole (which is Nepali for neighborhood). The stratification variable is carpet (whether the PSU is a carpet factory or not).

To view the distribution of individuals across the PSUs, type:

 

tabulate tole carpet

 

                      |     Stratfication
                      |       variable
              Tole    |         0          1 |     Total
----------------------+----------------------+----------
           Bansagopal |        19          0 |        19
Barahi Carpet Factory |         0         12 |        12
            Bhelukhel |        18          0 |        18
           Bholachhen |        42          0 |        42
            Bolachhen |        10          0 |        10
      Charikot Carpet |         0         19 |        19
            Chasukhel |        24          0 |        24
              Chorcha |        12          0 |        12
       Des Raj Carpet |         0         27 |        27
                Dogan |        10          0 |        10
               Gahiti |         8          0 |         8
               Jagati |        18          0 |        18
 Jagati (R.L. Carpet) |         0         20 |        20
                 Jela |        77          0 |        77
       Jela Pangracha |         8          0 |         8
              Kamicha |        11          0 |        11
               Khauma |         9          0 |         9
             Khichhen |        11          0 |        11
          Lakulachhen |        17          0 |        17
                 Mako |        18          0 |        18
         Mangalachhen |         8          0 |         8
              Nagacha |        26          0 |        26
             Palikhel |        11          0 |        11
           Sallaghari |        11          0 |        11
  Suryabinayak Carpet |         0         19 |        19
           Tibukchhen |         9          0 |         9
            Tulachhen |        12          0 |        12
            Yathapath |        14          0 |        14
----------------------+----------------------+----------
                Total |       403         97 |       500

 

 

 

We used two-stage cluster sampling in two strata in this survey. Set up the data for design-based analysis with svyset:

svyset sets the survey design variables. The svy estimation commands no longer allow setting these variables; thus, as with tsset and stset, you must svyset your data prior to using svy estimation commands.

In STATA version 8 the data can be set using a dialog box.

 

In the menu

Statistics/Survey data analysis/Setup & utilities,

select Set variables for survey data

 

This dialog box will appear:

 

 

 

 

 

 

SVYSET DIALOG BOX:

 

Set the data for survey analysis by specifying the strata, the PSU and the weights:

PSU is        tole

Strata is     carpet

Weights is    weight
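
The dialog box settings above correspond to a single command. The sketch below assumes the command syntax of STATA version 8, which is the version this handout refers to; later versions use a different form, noted in the comment.

```stata
* STATA 8 syntax: weights in brackets, strata and PSU given as options
svyset [pweight=weight], strata(carpet) psu(tole)
* In later versions the PSU comes first instead:
*   svyset tole [pweight=weight], strata(carpet)
* Describe the strata and PSUs that have been set
svydes
```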

 

When STATA adjusts for the design effect in surveys that used two-stage cluster sampling, the set-up is identical to that for one-stage cluster sampling. STATA cannot handle more stages.

Because our sampling strategy failed in the stratum of women living in carpet factories, it makes no sense to include these women in the design-based analysis. They do not constitute a random sample, so from here on we will not use them in the analysis.

All analyses will be followed by the if carpet==0 statement. When we add this statement to a command, we select only the observations that belong to clusters that are not carpet factories.

 

NOTE: The use of “==” and “=”

It may seem confusing that we sometimes use “==” and sometimes “=”. Both mean “equal to”, but they are used differently.

==   is used in logical expressions, i.e. when we compare values

Example: ci anemia1 if carpet==1, bi. In this command line we restrict the analysis to observations where the variable carpet has the value 1.

=     is used when we assign a value to something

Example: generate agecat = floor(age/10). Here we generate a new variable that takes the value floor(age/10).
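
The two operators can be seen side by side in a short sketch that uses the agecat example from the note:

```stata
* "=" assigns: the new variable agecat is computed from age
generate agecat = floor(age/10)
* "==" compares: only observations where carpet equals 0 are tabulated
tabulate agecat if carpet==0
```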

 

QUESTION: The mean hemoglobin level differed between the two strata. If our sampling strategy had succeeded and this difference were still there, what would the stratification do to the standard error of the mean hemoglobin concentration?
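
To see the difference between the strata that the question refers to, the mean hemoglobin level can be tabulated by stratum. A minimal sketch, using method 1 as the outcome:

```stata
* Mean, standard deviation and number of observations in each stratum
tabulate carpet, summarize(hemoglobin1)
```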

 

 

 

 

QUESTION: Because the sampling strategy in the stratum “carpet factory workers” failed, should we not report the findings from this stratum? How would you deal with this in a publication?

 

 

________________________________________________________________________

 

 

 

QUESTION: The stratum of carpet factory workers is oversampled compared to the other residents. The mean hemoglobin concentration in this stratum was lower than that of the other stratum. How will this affect the overall mean hemoglobin concentration in an unadjusted analysis?

 

 

________________________________________________________________________
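
The effect of oversampling can be checked by comparing an unweighted and a weighted mean over all 500 women. The sketch below assumes that the variable weight holds the sampling weights; it only illustrates the comparison and is not part of the design-based analysis that follows.

```stata
* Unweighted mean: every sampled woman counts equally
summarize hemoglobin1
* Weighted mean: each woman counts in proportion to the variable weight
* (assumed here to hold the sampling weights)
summarize hemoglobin1 [aweight=weight]
```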

 

 

 

Analyzing data where we take the design effect into account

Commands to be used

Not adjusting for design effect      Adjusting for design effect
ci anemia1, binomial                 svyprop anemia1
ci FOLAT BMI salbumin                svymean BMI salbumin

ci   calculates the confidence interval (CI) and the standard error (SE). Several variables can be selected. When the outcome variable is dichotomous, the option bi (binomial) should be specified after a comma.

svyprop   calculates a proportion taking the design effect into account

svymean   calculates a mean, standard error, confidence interval and the design effect (DEFF)
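
For a dichotomous outcome, the two approaches might be compared as in the sketch below. It assumes that anemia1 has been generated as in Exercise 1, that the data have already been svyset, and that the STATA version 8 commands are available.

```stata
* Prevalence of anemia among municipality residents, ignoring the design
ci anemia1 if carpet==0, binomial
* The same prevalence, taking strata, PSUs and weights into account
svyprop anemia1 if carpet==0
```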

 

Type:

 

ci hemoglobin1 salbumin sferritin stransferresep zn BMI FOLAT if carpet==0

 

 

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
 hemoglobin1 |        403     13.3794      .05591        13.26949    13.48932
    salbumin |        403    42.77171    .1538168        42.46933     43.0741
   sferritin |        402    37.38806     1.58008        34.28178    40.49433
stransferr~p |        403    1.522903    .0343049        1.455464    1.590343
          zn |        401    8.506733    .1013473        8.307493    8.705973
-------------+---------------------------------------------------------------
         BMI |        403    21.94542    .1577441        21.63531    22.25552
       FOLAT |        402    23.29647     .917373        21.49301    25.09993

 

 

 

svymean hemoglobin1 salbumin sferritin stransferresep zn BMI FOLAT if carpet==0

 

Survey mean estimation

pweight:  weight                                  Number of obs(*) =       403
Strata:   carpet                                  Number of strata =         1
PSU:      tole                                    Number of PSUs   =        23
                                                  Population size  =   1563.64

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
hemogl~1 |    13.3794    .0692858    13.23571    13.52309    1.535712
salbumin |   42.77171     .147411      42.466    43.07742    .9184427
sferri~n |   37.38806    1.862001    33.52651    41.24961    1.388678
strans~p |   1.522903     .046201    1.427088    1.618718    1.813802
      zn |   8.506733    .0869605    8.326388    8.687078    .7362415
     BMI |   21.94542     .203732     21.5229    22.36793    1.668062
   FOLAT |   23.29647     1.20418    20.79915    25.79378    1.723023
------------------------------------------------------------------------------
(*) Some variables contain missing values.

 

 


Exercise: Fill in the table below with the values that you obtained from the last two analyses.

Variable       Std. Err. from svymean   Std. Err. from ci   SE(svymean)/SE(ci)   Square of the ratio (A/B)
               A                        B                   A/B                  (A/B)*(A/B)
hemoglobin1    .069                     .056                1.232                1.52
salbumin
sferritin
BMI
FOLAT
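
The ratios in the table can be computed directly in STATA with display. The sketch below uses the unrounded standard errors for hemoglobin1 from the two outputs above, so the results differ slightly from the rounded figures in the first row of the table.

```stata
* Ratio of the two standard errors for hemoglobin1 (A/B)
display .0692858/.05591
* Square of the ratio, to compare with the Deff column of the svymean output
display (.0692858/.05591)^2
```

With the unrounded values the ratio is about 1.24 and its square about 1.54, which agrees with the Deff of 1.535712 that svymean reports for hemoglobin1.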

 

 

 

 

 

 

QUESTION: If you wanted to do the same survey again, with the same precision as we got from these analyses of 403 women, how many women would you have to sample using simple random sampling to get the same precision of the estimates for:

 

hemoglobin1

 

salbumin

 

sferritin

 

BMI

 

FOLAT
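
A common way to approach this question is through the effective sample size, that is, the number of observations divided by the design effect. The sketch below shows the arithmetic for hemoglobin1 only, using the number of observations and the Deff from the svymean output; it assumes that Deff is read as the ratio of the design-based variance to the variance under simple random sampling.

```stata
* Effective sample size for hemoglobin1: observations divided by Deff
display 403/1.535712
```

This gives about 262 women. For the other variables, use each variable's own Deff and its own number of non-missing observations from the ci output.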