Version 0.9.8

There were several notable changes to the files in Version 0.9.8. First, starting in 2014, we use the raw March CPS data as the sole source of our extract. For 1980-2013, we still use Unicon’s extract as the underlying source.

In addition, the 2014 March CPS included redesigned health insurance and income questions. All of the addresses received the redesigned health insurance questions. However, the income questions were fielded using a split panel design, where 5/8ths of the sample (approximately 68,000 addresses) were given the same income questions as the previous year, while 3/8ths (approximately 30,000 addresses) were part of the “Research File” and were given redesigned income questions. Those who were asked the redesigned income questions gave responses that were significantly different from those who were given the original questions. Therefore, the Census Bureau urges researchers not to combine the two files.

We provide the files separately here. There is a regular CPS March 2014 file (cps_march_2014.zip) that is based on the 5/8ths sample with the original income questions; this is the sample the Census Bureau has used for its normal publications on income and poverty. The 3/8ths sample, or Research File (cepr_march_2014_research.zip), has the redesigned income questions.

We recommend that you use the 5/8ths sample with original income questions for your publications.

For more on these changes, and the Census Bureau’s recommendations on how to handle the data, please see here.

Variable Names

Our Stata programs are the main source of information on our extracts. For the March CPS, the programs can be found here. The Stata code shows the changes we made to the original raw March CPS variables in order to create our extract.

How to Uniquely Identify Households

Within a single file, the best way to uniquely identify households is to use the hhseq variable.

Here’s an explanation from Unicon Research Corporation (which provides the underlying source of our extract from 1980-2013) – “Beginning in 1994 through the current year, there is a problem with duplicate HHIDs for different household units. The problem is particularly severe starting with the SCHIP-expanded sample in 2001 (the 2001s data)… When identifying a household within a file, the variable hhseq should be used. This variable appears to have no problem with duplicate values. However, when matching records across files [if you’re trying to track the same household in multiple years], it is necessary to use HHID. Adding geographic variables to the sort (state, county) may aid in uniquely identifying household units. When that is not enough, we suggest that household units be identified using both hhid and hhseq, then use demographic variables (sex, age, race) to match up individuals within the household, thus insuring that the proper hhid/hhseq units are matched across years.”
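The cross-year matching that Unicon describes can be sketched roughly as follows. This is only an illustration in Python, not the procedure used by Unicon or in our Stata programs. The field names state, sex, race and age are hypothetical stand-ins for whatever the corresponding variables are called in your files, the toy records are invented, and the age tolerance (max_age_gap) is an assumption of this sketch.

```python
from collections import defaultdict

# Invented toy records for two consecutive March files (not real CPS data).
# Two different households share hhid "A100" in both years.
year1 = [
    {"hhid": "A100", "hhseq": 1, "state": 6, "sex": 1, "race": 1, "age": 40},
    {"hhid": "A100", "hhseq": 1, "state": 6, "sex": 2, "race": 1, "age": 38},
    {"hhid": "A100", "hhseq": 2, "state": 6, "sex": 2, "race": 2, "age": 67},
]
year2 = [
    {"hhid": "A100", "hhseq": 10, "state": 6, "sex": 1, "race": 1, "age": 41},
    {"hhid": "A100", "hhseq": 10, "state": 6, "sex": 2, "race": 1, "age": 39},
    {"hhid": "A100", "hhseq": 11, "state": 6, "sex": 2, "race": 2, "age": 68},
]

def match_people(rows_a, rows_b, max_age_gap=2):
    """Pair persons across two files on hhid + state, then sex, race and age.

    max_age_gap is an assumed tolerance: a person should be the same age or
    slightly older in the later file.
    """
    candidates = defaultdict(list)
    for row in rows_b:
        candidates[(row["hhid"], row["state"])].append(row)

    matches = []
    used = set()  # ids of later-year records that are already paired
    for a in rows_a:
        for b in candidates.get((a["hhid"], a["state"]), []):
            if id(b) in used:
                continue
            age_gap = b["age"] - a["age"]
            same_person = (
                a["sex"] == b["sex"]
                and a["race"] == b["race"]
                and 0 <= age_gap <= max_age_gap
            )
            if same_person:
                matches.append((a["hhseq"], b["hhseq"]))
                used.add(id(b))
                break
    return matches

# Collapse person-level matches into household-level links (hhseq pairs).
household_links = set(match_people(year1, year2))
print(sorted(household_links))
```

In this toy case hhid alone cannot separate the two households, but the person-level demographic match links each year-one hhseq to a single year-two hhseq. Real files would need more care, for example for households whose members change between years.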

While hhid can be duplicated across different households, hhseq has no duplicates as long as you look at it within a single year.
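One way to check this on your own data is sketched below in Python. It is a minimal illustration rather than part of our programs: the records are invented, the year field is an assumed name for the survey-year variable, and the sketch assumes hhseq numbering restarts in each year's file, which is why households are keyed on year and hhseq together.

```python
from collections import defaultdict

# Invented person-level records (not real CPS data).
records = [
    {"year": 2001, "hhseq": 1, "hhid": "A100"},
    {"year": 2001, "hhseq": 1, "hhid": "A100"},  # second person, same household
    {"year": 2001, "hhseq": 2, "hhid": "A100"},  # same hhid, different household
    {"year": 2001, "hhseq": 3, "hhid": "B200"},
    {"year": 2002, "hhseq": 1, "hhid": "C300"},  # hhseq assumed to restart each year
]

def duplicated_hhids(rows):
    """Return {(year, hhid): [hhseq, ...]} for hhids attached to more than one hhseq."""
    seen = defaultdict(set)
    for row in rows:
        seen[(row["year"], row["hhid"])].add(row["hhseq"])
    return {key: sorted(seqs) for key, seqs in seen.items() if len(seqs) > 1}

def household_key(row):
    """Within-file household identifier: year plus hhseq."""
    return (row["year"], row["hhseq"])

dupes = duplicated_hhids(records)
households = {household_key(row) for row in records}

print(dupes)
print(len(households))
```

Here the hhid A100 is flagged because it is attached to two different hhseq values in 2001, while the combination of year and hhseq identifies four distinct households.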