- How do I open the ACS data extracts? Can I open them in Excel?
- Why do the variable names not match the ACS documentation from the Census?
- What set of income variables should I use?
- How do I identify counties and metropolitan areas in the ACS?
- What weights should I use?
Opening CEPR ACS Extracts
CEPR’s uniform ACS data extracts are available in compressed Stata (dta) and Comma Separated Values (csv) formats. The Stata (dta) files are designed to be opened in Stata, though they can also be opened in SAS, and in R using the foreign or haven packages. The Comma Separated Values (csv) files can be opened in Stata, SAS, R, and other software. The ACS datasets are far too large to be used in Excel, and we strongly recommend using statistical software (Stata, R, SAS, etc.) to work with them.
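As a minimal sketch of loading an extract in Stata: the file names below are hypothetical placeholders, so substitute the name of the file you actually downloaded and decompressed. Because the files are large, it can help to load only the variables you need.

```stata
* Hypothetical file names: replace with the extract you downloaded and unzipped.
* Load the full Stata-format extract and list a short summary of its contents.
use "cepr_acs_2015.dta", clear
describe, short

* Alternatively, load only selected variables to save memory.
use perwgt rincp_all_adj using "cepr_acs_2015.dta", clear

* The CSV version can be read with import delimited.
import delimited using "cepr_acs_2015.csv", clear
```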
Our Stata programs are the primary source of information on our extracts. For the ACS, the programs can be found here. Our Stata code shows the changes we made to the original raw ACS variables in order to create our extract. Lists of variable names and value labels for our ACS extracts can also be viewed on our documentation page.
The CEPR preferred income variables are those that take into account the Census Bureau’s internal constant calendar year inflation adjustment factor as well as CEPR’s real wage program, which uses the CPI-U-RS to convert dollar amounts to current year dollars. These variables have both an “r” prefix and an “_adj” suffix (for example, rincp_all_adj is a person’s real, adjusted total income).
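One way to find and inspect the preferred income variables, assuming an extract is already loaded in memory, is to list everything matching the naming pattern and then summarize a variable with the person weight. This is only an illustrative sketch; the wildcard may also match variables unrelated to income if any share the same prefix and suffix.

```stata
* List all variables with an "r" prefix and an "_adj" suffix.
describe r*_adj

* Weighted summary of adjusted total personal income.
* aweights are used here because summarize does not accept pweights.
summarize rincp_all_adj [aw=perwgt], detail
```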
Identifying Counties and Metropolitan Areas
The ACS does not contain a variable for county. It does, however, have variables for state, and for what are known as Public Use Microdata Areas, or PUMAs. PUMAs are the only sub-state geography available in the ACS PUMS, and represent populations of 100,000 or larger. Boundaries for both residential PUMAs and place-of-work/migration PUMAs were redrawn in accordance with new guidelines and using population estimates from the 2010 census, and those new PUMA delineations were incorporated into the microdata in 2012. The 2010 vintage PUMAs are built on county boundaries and nest accordingly. However, depending on county population size, in some cases a single county may encompass several PUMAs, while in other cases a single PUMA may contain multiple counties. The PUMA names file will tell you which areas are covered by each vintage 2010 PUMA.
To uniquely identify PUMAs in our ACS files, combine the state variable (2 digits) with the 5-digit PUMA variable: puma00 for the 2005-2011 samples, or puma10 for 2012 and later samples.
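A combined identifier can be built along the following lines. This sketch assumes the state variable is named state and that both state and puma10 are stored as numeric variables; check with describe first. The new variable name stpuma10 is made up for illustration. For 2005-2011 samples, use puma00 in place of puma10.

```stata
* ASSUMPTION: state and puma10 are numeric. stpuma10 is a hypothetical name.
* Shift the 2-digit state code left by five digits and add the 5-digit PUMA code.
generate long stpuma10 = state*100000 + puma10

* Display with leading zeros so every identifier shows 7 digits.
format stpuma10 %07.0f

* Count the number of distinct state-PUMA combinations in the data.
egen byte puma_tag = tag(stpuma10)
count if puma_tag
```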
While in some cases it is possible to perform crosswalks between PUMAs and metropolitan areas, doing so may produce mismatch errors of various sizes, depending on how well the PUMA boundaries and metro boundaries line up. Unlike counties, metro areas and PUMAs do not nest, so any crosswalk between the two will contend with errors of omission and commission.
More information on PUMAs can be found on the Census website.
Generally, you’ll want to use the person weight (perwgt) if you’re trying to determine the characteristics of individuals, and the housing weight (hsgwgt) if you’re trying to determine the characteristics of households.
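For example, person-level point estimates using the full-sample person weight might look like the sketch below. It assumes the extract is loaded and uses the educ and female variables that appear in the examples later in this section.

```stata
* Weighted mean using the person weight as a sampling (probability) weight.
mean educ [pw=perwgt]

* Weighted population counts by category; tabulate accepts iweights but not pweights.
tabulate female [iw=perwgt]
```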
However, to generate more accurate standard error estimates for hypothesis testing and confidence intervals, replicate weights should be used. The ACS includes 160 replicate weights: 80 for the analysis of individuals, and 80 for the analysis of households. The commands below apply to individual replicate weights, but can easily be adapted for estimations of households.
To use replicate weights in Stata, you must first describe the survey design to Stata using the svyset command:
svyset [iw=perwgt], sdr(pwgtp1-pwgtp80) vce(sdr)
Because replicate weights are used, the data can be treated as a single stratum, so no Primary Sampling Unit (PSU) needs to be specified. The full sample weight, perwgt, must be identified. Once the survey design has been described, place the svy prefix before estimation commands to use the replicate weights. For example:
svy: reg rincp_all female
svy: mean educ
The examples above use the successive difference replication (SDR) method, but bootstrap and balanced repeated replication (BRR) methods can also be applied. For the examples above, the SDR method produces the largest standard errors, almost twice the size of those estimated using bootstrapping and BRR. Not all commands can be used with the svy prefix; see help svy_estimation within Stata for a list.
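To adapt the svyset command shown earlier to household-level analysis, swap in the housing weight and the household replicate weights. The replicate weight names wgtp1-wgtp80 below are an assumption based on the Census Bureau's PUMS naming; confirm the actual names in your extract before running this. If your file has one record per person, you would also need to restrict the analysis to one record per household.

```stata
* ASSUMPTION: household replicate weights are named wgtp1-wgtp80 in the extract.
* Check first with: describe wgtp*
svyset [iw=hsgwgt], sdr(wgtp1-wgtp80) vce(sdr)

* Typing svyset with no arguments displays the current survey settings.
svyset
```

After this, the svy prefix is used in the same way as in the person-level examples.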
For more information on replicate weights in the ACS, see the IPUMS page.