What factors would affect income?
- margo
- Sep 13, 2022
- 3 min read

Intro
In this project, I want to find out whether gender or race really affects a person’s wage, or whether, as my parents believed, education is the significant factor in salary.
If we can predict income, which factor is most strongly related to it?
Data Source
The Adult Data Set was extracted by Barry Becker from the 1994 Census database, and I downloaded it from the UCI Machine Learning Repository. It is a Census dataset that includes variables such as age, workclass, fnlwgt (final weight), education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country, plus a binary label indicating whether income is above 50K or not.
Preliminary Analysis
1. Excel
I first opened the file in Excel. Most of the variables are character (text) columns, and there are 4162 “?” entries in the dataset. Since the dataset has 32561 rows, I will consider dropping the rows that contain a “?”.
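The "?" check can also be done in code. Here is a minimal Python sketch (not the Excel steps used in this post) that counts the "?" cells and drops the affected rows. The sample rows are made up for illustration, the header row is an assumption about the file layout, and the helper names `load_rows`, `count_missing`, and `drop_missing` are hypothetical.

```python
import csv
import io

# Made-up sample in a simplified Adult-style layout (illustration only).
# Assumption: the file has a header row; add one if yours does not.
SAMPLE = """age,workclass,occupation,native-country,income
39,State-gov,Adm-clerical,United-States,<=50K
54,?,?,United-States,<=50K
31,Private,Sales,?,>50K
45,Private,Exec-managerial,United-States,>50K
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, stripping surrounding whitespace."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [{key: value.strip() for key, value in row.items()} for row in reader]

def count_missing(rows, marker="?"):
    """Count every cell that equals the missing-value marker."""
    return sum(1 for row in rows for value in row.values() if value == marker)

def drop_missing(rows, marker="?"):
    """Keep only the rows in which no cell equals the marker."""
    return [row for row in rows if marker not in row.values()]

rows = load_rows(SAMPLE)
clean_rows = drop_missing(rows)
print(count_missing(rows), "missing cells;", len(clean_rows), "of", len(rows), "rows kept")
```

Comparing the number of rows kept with the original row count is a quick way to judge how much data the drop costs.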

From the pivot table, the difference in average hours per week between the two income groups is about 7 hours. This raises the possibility that people share a similar hourly rate, and that working hours determine whether a person earns more than 50K per year.
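The same kind of pivot (average hours per week for each income group) can be sketched in Python. The records are made up for illustration and do not come from the Adult dataset, and `mean_by_group` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up (hours_per_week, income) pairs, for illustration only.
records = [(40, "<=50K"), (35, "<=50K"), (45, "<=50K"), (50, ">50K"), (46, ">50K")]

def mean_by_group(pairs):
    """Return the mean of the first element for each distinct group label."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for value, group in pairs:
        totals[group][0] += value
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

avg_hours = mean_by_group(records)
gap = avg_hours[">50K"] - avg_hours["<=50K"]
print(avg_hours, gap)
```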

To see how the proportions of males and females relate to income, I built a pivot table by percentage. The grand total shows that the Adult dataset contains fewer female records than male records, so comparing income status by gender can be inaccurate.
However, if we suppose that the final weights are evenly distributed and that the dataset reflects the same proportions as the real Census, we can conclude that only about 10% of females earn over 50K, while around 30% of males do.
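A percentage pivot like this boils down to two numbers per gender: its share of all rows and the share of its rows above 50K. Here is a minimal Python sketch, using made-up rows and the hypothetical helper names `group_share` and `share_above`. It ignores fnlwgt, which is the same simplifying assumption as treating the final weights as even.

```python
from collections import Counter

# Made-up (sex, income) pairs, for illustration only.
people = [
    ("Male", ">50K"), ("Male", "<=50K"), ("Male", "<=50K"),
    ("Male", ">50K"), ("Male", "<=50K"), ("Male", "<=50K"),
    ("Female", "<=50K"), ("Female", ">50K"),
    ("Female", "<=50K"), ("Female", "<=50K"),
]

def group_share(pairs):
    """Share of all rows that belongs to each sex (the 'grand total' column)."""
    totals = Counter(sex for sex, _ in pairs)
    return {sex: count / len(pairs) for sex, count in totals.items()}

def share_above(pairs, label=">50K"):
    """Within each sex, the share of rows whose income equals the given label."""
    totals = Counter(sex for sex, _ in pairs)
    above = Counter(sex for sex, income in pairs if income == label)
    return {sex: above[sex] / totals[sex] for sex in totals}

print(group_share(people))
print(share_above(people))
```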
2. R

Apart from age, fnlwgt, education_num, capital_gain, capital_loss, and hours_per_week, all the variables are character columns, which means I need to one-hot encode them.
To keep things simple, I chose a few factors that are either easier to clean or that I am curious about. I will therefore use race, sex, and hours_per_week as my independent variables, and income as my dependent variable.
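For readers new to one-hot encoding, here is a minimal Python sketch of the idea (the analysis in this post was done in R, so this is not the code behind it). The helper `one_hot` and its `drop_first` option are hypothetical names, and the input values are made up.

```python
def one_hot(values, drop_first=False):
    """Return (column_names, rows) with one 0/1 column per distinct category.

    With drop_first=True the first category (in sorted order) is left out and
    acts as the baseline, which avoids perfectly redundant columns.
    """
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

sex = ["Male", "Female", "Female", "Male"]  # made-up values

columns, encoded = one_hot(sex)
print(columns, encoded)

columns_dropped, encoded_dropped = one_hot(sex, drop_first=True)
print(columns_dropped, encoded_dropped)
```

Dropping one category per variable matters for regression: with both a male and a female column, one is fully determined by the other.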

According to the heatmap, the correlation between female and male is approximately -1, which is why one of them should be dropped.
On the other hand, the relationship between each independent variable and >50K is fairly clear:
- Positive: white, male, hours.per.week
- Negative: female, black
- No relation: Eskimo, islander, other
From the heatmap colors, we can tell that hours.per.week and gender are the two factors most strongly associated with income.
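The -1 between the female and male columns is what any pair of complementary 0/1 columns produces. Here is a minimal Python sketch of the Pearson correlation behind a heatmap cell, with a made-up male column and a hypothetical helper name `pearson`.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

male = [1, 0, 0, 1, 1]            # made-up dummy column
female = [1 - m for m in male]    # the complement: 1 exactly where male is 0

r = pearson(male, female)
print(round(r, 6))
```

Because each column is fully determined by the other, keeping both adds no information to the model.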
Logistic Regression

Except for islander, all the factors are statistically significant. Most of the independent variables above have a negative coefficient, that is, a negative association with income.
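To make the idea of a coefficient concrete, here is a minimal Python sketch of a one-variable logistic regression fitted by plain gradient ascent. It is not the R model from this post: the data are made up, only hours per week is used, and `sigmoid` and `fit_logistic` are hypothetical helper names.

```python
from math import exp

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        errors = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y)]
        b0 += lr * sum(errors) / n
        b1 += lr * sum(e * xi for e, xi in zip(errors, x)) / n
    return b0, b1

# Made-up data: hours per week and whether income is above 50K (1) or not (0).
hours = [20, 30, 35, 40, 40, 45, 50, 60]
above_50k = [0, 0, 0, 0, 1, 0, 1, 1]

# Standardize hours so the fixed learning rate behaves well.
mean_h = sum(hours) / len(hours)
sd_h = (sum((h - mean_h) ** 2 for h in hours) / len(hours)) ** 0.5
hours_std = [(h - mean_h) / sd_h for h in hours]

b0, b1 = fit_logistic(hours_std, above_50k)
odds_ratio = exp(b1)  # multiplicative change in odds per one standard deviation of hours
print(b0, b1, odds_ratio)
```

A positive coefficient means the odds of earning over 50K rise with the variable, and a negative one means they fall. Statistical significance is a separate question that this sketch does not compute.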
After using the cor() function, I realized there is multicollinearity between the variables “black” and “white”. In this case, if we want to know the coefficient of “white”, we need to drop the column “black”. The result looks odd at first, because there are multiple races in the U.S. and “black” and “white” shouldn’t be highly related. However, this can happen when “black” and “white” make up most of the records: almost every row that is not “white” is then “black”, so the two columns are strongly negatively correlated. I will therefore drop “black” and look at the estimated coefficient of “white”.
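How strongly two race dummies correlate depends only on their shares of the rows. For two mutually exclusive categories with shares p and q, the Pearson correlation between their 0/1 columns is -sqrt(p*q / ((1-p)*(1-q))). Here is a minimal Python sketch with hypothetical shares (not measured from the dataset) and a hypothetical helper name `dummy_correlation`.

```python
from math import sqrt

def dummy_correlation(p, q):
    """Pearson correlation between the 0/1 indicators of two mutually
    exclusive categories that make up shares p and q of the rows."""
    return -sqrt((p * q) / ((1 - p) * (1 - q)))

# Hypothetical shares: two categories covering 95% of the rows.
r_two_dominant = dummy_correlation(0.85, 0.10)

# Hypothetical shares: two of several similarly sized categories.
r_balanced = dummy_correlation(0.20, 0.20)

print(round(r_two_dominant, 3), round(r_balanced, 3))
```

When two categories cover nearly all rows, the correlation moves toward -1, which is the situation described above. With several similarly sized categories it stays small.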

Conclusion
First, the conclusion. Gender and race do affect income, and working hours per week has a slightly higher influence on wages. I did not include education as one of my factors, but judging from the results of a past project, education is a significant factor compared with the other variables.
Actually, this was one of my school projects. Looking back, I realized I made some fundamental mistakes from the beginning, which basically means the whole project was badly flawed. This also confirms the saying “garbage in, garbage out.” From this experience, I will think things through more carefully during the data-cleaning stage, and make sure I have enough knowledge to analyze data with specific attributes.
Since I only built one model this time, it’s a shame that there are no other models to compare and evaluate it against. Still, whatever I did this time, it was a precious experience, and I am excited to build multiple models and evaluate them in my next project.
To learn more about me, feel free to shoot me an email or connect with me on LinkedIn. Hope to see you soon!