Who's Counting?

[Image: Census Bureau Seal]

Members: Gabriel Miller, Amara Amadi, Matthew Rash, Noah Williams, Adrian Partain

 

Summary: This project focuses on the Adult Dataset, extracted from the 1994 US Census database. The dataset can be described in general terms as census data on financial success across the US adult population, based on demographic and employment attributes expected to predict whether an individual makes over or under $50,000 a year. Each row represents an individual, and the columns contain 15 variables, one of which is the individual's income class (over/under $50K/yr). The remaining attributes, such as race, sex, marital-status, and education, are used to predict that income class. The variable "fnlwgt" represents the "final weight" of the record: an estimate of how many people in the population that individual is taken to represent, based on the survey's sampling design.
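As a rough sketch of how "fnlwgt" is meant to be used (the records below are hypothetical, with made-up weights in the style of the real column), population-level statistics should weight each record by fnlwgt rather than counting raw rows:

```python
# Hypothetical mini-sample: each record's fnlwgt is the number of people
# in the population that record is taken to represent, so population-level
# statistics should be weighted sums, not raw row counts.
rows = [
    {"education": "Bachelors", "income": ">50K",  "fnlwgt": 77516},
    {"education": "HS-grad",   "income": "<=50K", "fnlwgt": 83311},
    {"education": "Bachelors", "income": "<=50K", "fnlwgt": 215646},
]

# Unweighted share of high earners (treats every record equally).
unweighted = sum(r["income"] == ">50K" for r in rows) / len(rows)

# Weighted share: each record counts for fnlwgt people.
total_weight = sum(r["fnlwgt"] for r in rows)
weighted = sum(r["fnlwgt"] for r in rows if r["income"] == ">50K") / total_weight

print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
# → unweighted: 0.333, weighted: 0.206
```

The two estimates differ because the one high-earning record happens to carry a relatively small weight; ignoring fnlwgt would overstate that group's share of the population.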

 

We noticed that certain categories of individuals were missing from the variables, which led us to question what kinds of information might be absent from the dataset due to its collection methods. We aimed to answer this question and to determine how this incomplete picture might affect how people use and interpret the dataset, as well as its broader effects on the equitable treatment of individuals.

 

We took a connotative view of the dataset to determine the reasons it was created and how it was collected and cleaned. We found that the dataset was compiled by Barry Becker, who extracted the data from the census. The dataset was then given to Ron Kohavi, who used it to study the effectiveness of a new machine learning algorithm called the NBTree. After the completion of Kohavi's paper in 1996, the dataset was donated to, and is now hosted by, the Machine Learning Repository at UC Irvine. The data was pulled from the Current Population Survey (CPS) conducted by the United States Bureau of the Census, which involved personal interviews of a nationwide sample of 57,000 housing units across 1,973 counties. Specific extraction rules were applied, such as including only individuals over age 16 who worked more than zero hours per week. Processing involved identifying 3,620 samples with missing values (often labeled as "?") in fields like occupation and work class. These were typically removed for consistency, leaving 45,222 instances for analysis. Additionally, approximately 24 duplicate records were identified.
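The extraction and cleaning steps described above can be sketched as follows. The sample records are hypothetical, but the filters mirror the stated rules: keep only individuals over age 16 with positive working hours, drop records containing "?" values, and remove duplicates:

```python
# Hypothetical records illustrating the described cleaning pipeline
# (field names match the real dataset; these specific rows are made up).
raw = [
    {"age": 39, "workclass": "State-gov", "occupation": "Adm-clerical", "hours_per_week": 40},
    {"age": 14, "workclass": "Private",   "occupation": "Sales",        "hours_per_week": 10},  # too young
    {"age": 50, "workclass": "?",         "occupation": "?",            "hours_per_week": 45},  # missing values
    {"age": 39, "workclass": "State-gov", "occupation": "Adm-clerical", "hours_per_week": 40},  # duplicate
]

# Extraction rules: adults over 16 with positive working hours.
kept = [r for r in raw if r["age"] > 16 and r["hours_per_week"] > 0]

# Drop records with missing values, which the source encodes as "?".
kept = [r for r in kept if "?" not in r.values()]

# Remove exact duplicate records while preserving order.
seen, deduped = set(), []
for r in kept:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

print(len(deduped))  # → 1
```

On the real data, the same sequence of filters is what reduces the extract to the 45,222 complete instances mentioned above.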

 

We then looked at what might be missing and how that might affect a person's interpretation of the data. Some people were ineligible to participate in the survey for various reasons: anyone under 15, members of the armed forces, and people in institutions such as prisons or nursing homes. We also noticed that certain sampling techniques changed between the 1980 and 1990 censuses, when definitions and some geographic boundaries were redrawn, leading to many people being excluded from the data entirely. Specifically, the 1995 CPS report skips year-over-year comparisons because its mixed sample design doesn't align with either the 1980 or the 1990 census definitions. Racial and ethnic subgroup data are particularly unreliable due to coverage gaps, geographic coding errors, and shifts in sample areas. Additionally, the definition of unemployment changed frequently and was constantly being revised in search of an accurate measure.

 

These gaps raise important concerns about how the dataset reflects, or fails to reflect, the full diversity of the adult population. Because the dataset was designed to train and evaluate machine learning algorithms rather than to serve as a comprehensive sociological record, its limitations can easily be overlooked by practitioners who treat it as neutral ground truth. When models trained on this data are used to make predictions about income or financial success, the underrepresentation of incarcerated individuals, military personnel, racial and ethnic minorities, and people in non-standard living situations can lead to systematic bias in those predictions.

 

Furthermore, removing records with missing values is a standard cleaning step, but the missingness may not be random. If certain groups are more likely to have incomplete records due to language barriers, distrust of government surveys, or geographic inaccessibility, then their exclusion skews the dataset away from accurately representing marginalized groups. This means that models built on this data could unintentionally learn historical patterns of underrepresentation instead of correcting for them.
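A small numerical sketch, using made-up group sizes and missing-data rates, shows how dropping incomplete records shrinks the share of any group whose records are more often incomplete:

```python
# Hypothetical counts: group A has 800 records with 5% incomplete;
# group B has 200 records with 30% incomplete (all numbers invented
# purely to illustrate the skew from listwise deletion).
group_sizes = {"A": 800, "B": 200}
missing_rate = {"A": 0.05, "B": 0.30}

# Records surviving the "drop incomplete rows" cleaning step.
complete = {g: round(n * (1 - missing_rate[g])) for g, n in group_sizes.items()}

before = group_sizes["B"] / sum(group_sizes.values())
after = complete["B"] / sum(complete.values())

print(f"group B share: {before:.2%} before cleaning, {after:.2%} after")
# → group B share: 20.00% before cleaning, 15.56% after
```

Even though no one set out to exclude group B, its representation drops simply because its records were harder to collect completely, which is exactly the dynamic described above.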

 

In conclusion, the Adult Dataset is a widely used benchmark in machine learning, but its limitations and biases must be taken into consideration, especially in high-stakes settings such as hiring or social services. When interpreting findings based on it, we must be aware of the social and institutional forces that shaped the data's creation and understand the blind spots they create.


Term and Year
Winter 2026
Category
Bias & Equality
Short Summary

Our project looks at the Adult Dataset from the 1994 US Census database, which is used to predict whether an individual's income exceeds $50K/yr based on a variety of demographic and employment attributes. In our analysis, we found that many groups were underrepresented or misrepresented in the data, which may support inaccurate assumptions if the dataset is interpreted without its full context.

Files