-
Introduction from Group Project Prospectus.
Our dataset contains 48,842 records, each with 14 demographic and employment-related features and a binary income label. Our central question was: how do the dataset’s binary income classification and fixed demographic categories shape what forms of inequality can be represented, and which dimensions of identity are excluded?
This data is derived from the 1994 U.S. Census and is often used in data analysis and machine learning to predict income. It includes information such as age, education level, job type, and hours worked per week, which is used to estimate whether a person earns more than $50K per year. Even though the dataset may look neutral, the way the information is organized can affect how inequality and identity appear in the data. By examining that organization, we can see how some parts of people’s lives and social experiences are simplified or missing from the data.
-
Methods.
To analyze the dataset, we used two methods: Poirier’s Reading Datasets method and Koopman’s Format Anatomies method. First, we used Poirier’s method to study how the dataset presents information. With denotative reading, we looked at what the dataset directly includes, such as its demographic variables and binary income label. With connotative reading, we considered whether the available categories are adequate and how the dataset’s purpose shaped them. With deconstructive reading, we looked at what the dataset leaves out and asked what kinds of information or identities might be missing.
Second, we used Koopman’s format anatomies method to examine how the dataset is structured. At the meso level, we analyzed how identity is represented through categories such as race, sex, education level, and occupation. At the micro level, we looked closely at how income is formatted as a binary variable, meaning people are placed into only two income groups.
-
Work Performed.
The work we performed applied both methods to the dataset. We also examined the data itself to see what patterns it shows.
Our group began by looking at the structure of the Census Income dataset and identifying its main variables. We reviewed the demographic and employment features included in the dataset and analyzed how income is classified. Next, we used Poirier’s reading methods to interpret the data. We first identified the formal definitions of the variables in the dataset. After that, we identified what information might be missing, such as broader social or economic factors that affect people’s income.
Finally, we used Koopman’s format anatomies framework to study how the dataset organizes information about individuals. We examined how people are placed into fixed demographic categories and how income is simplified into only two groups. This allowed us to see how the dataset’s structure influences the way inequality is represented.
-
Findings.
By combining Koopman’s format anatomies and Poirier’s reading strategies, we found that this dataset encodes inequality not just in the data itself but also through the structural decisions that determine what can and cannot appear in the dataset.
The findings from Poirier’s Reading Datasets method are separated into three categories.
Denotative Reading: The units of observation in this dataset were adult individuals recorded in the 1994 U.S. Census database. The primary variables were age, workclass, education level, marital status, occupation, race, sex, hours per week, and native country. The variable definitions came from the U.S. Census Bureau. They include standardized labels for race, a binary sex classification, and a binary income label. This means that there are only two options for sex (male/female) and only two options for income (above $50K a year, or not).
Connotative Reading: The connotative reading found that for many of the variables, the available responses were outdated or insufficient. For race, people could select only one category, which is insufficient for multiracial individuals, and some categories were missing altogether. Sex classification was strictly binary, and the employment categories did not include modern arrangements such as hybrid or remote work or jobs in the gig economy. Some of these restrictive categories can be explained by the fact that this dataset was specifically designed for machine learning applications. It is used to train models to predict whether someone makes more than $50,000 a year. This helps explain why the dataset uses rigid categories, why income is a binary threshold, and why certain things are missing: these choices make the data easiest for a model to use.
Deconstructive Reading: This dataset does not represent structural inequality that comes from factors beyond the listed variables but still influences income. It is also missing a more specific classification of income beyond the binary $50,000 threshold. Many personal details are missing as well, such as racial and gender identity, disability, experiences of discrimination, and local living circumstances.
Using Koopman’s format anatomies method, we found that the structure of the dataset shapes how identity and inequality are represented. At the meso level, the dataset organizes people into fixed categories such as race, sex, education, and occupation. Because these categories are predefined, individuals whose identities do not fit these options cannot be fully represented in the dataset.
At the micro level, we examined how income is formatted as a binary variable: either above $50K or not. This simplifies income into only two groups and erases the many differences within each group.
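The effect of this binary format can be illustrated with a minimal Python sketch. The income values below are hypothetical and are not taken from the Census Income dataset, which does not record dollar amounts; the function name `to_binary_label` is also an assumption made for illustration. The two label strings follow the dataset’s `>50K` and `<=50K` convention.

```python
# Hypothetical annual incomes in dollars (illustrative only; the dataset
# itself stores just the two labels, not the underlying amounts).
incomes = [12_000, 49_500, 50_000, 50_001, 250_000]

THRESHOLD = 50_000  # cutoff between the dataset's two income labels


def to_binary_label(income):
    """Collapse a dollar amount into one of the two income labels."""
    return ">50K" if income > THRESHOLD else "<=50K"


labels = [to_binary_label(x) for x in incomes]
distinct_incomes = len(set(incomes))
distinct_labels = len(set(labels))

print(labels)
print(distinct_incomes, "distinct incomes become", distinct_labels, "labels")
```

In this sketch, someone earning $12,000 and someone earning $50,000 receive the same label, while someone earning $50,001 is grouped with someone earning $250,000. Once the data is stored this way, those differences cannot be recovered.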
The data itself suggests that the strongest predictors of income are education, hours worked, age, and job type. The more education an individual has and the more hours they work, the more likely they are to make above $50,000. Age also predicts income, with people between 35 and 55 having the highest probability of making above $50,000.
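One way to check patterns like these is to compute, for each group, the share of records labeled above $50K. The sketch below shows the idea on a handful of hypothetical records; the records, the function name `high_income_share`, and the resulting numbers are assumptions for illustration and do not reproduce our results or the real dataset.

```python
from collections import defaultdict

# Hypothetical (education, income label) pairs, for illustration only.
records = [
    ("HS-grad", "<=50K"),
    ("HS-grad", "<=50K"),
    ("HS-grad", ">50K"),
    ("Bachelors", "<=50K"),
    ("Bachelors", ">50K"),
    ("Masters", ">50K"),
]


def high_income_share(rows):
    """Return the fraction of rows labeled '>50K' within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [high count, total count]
    for group, label in rows:
        counts[group][1] += 1
        if label == ">50K":
            counts[group][0] += 1
    return {group: high / total for group, (high, total) in counts.items()}


shares = high_income_share(records)
for group, share in shares.items():
    print(f"{group}: {share:.2f}")
```

The same function could be applied to age ranges or hours-worked ranges by replacing the first element of each pair. Note that it can only compare the categories the dataset already defines, which is the limitation the rest of this section describes.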
This project analyzes the Census Income dataset using Poirier’s Reading Datasets method and Koopman’s Format Anatomies method to show that the dataset’s binary income classification and fixed demographic categories simplify complex social realities and limit how inequality and identity can be represented.