It's well established that incomes are not evenly distributed by race and gender. However, it can be hard to relate absolute differences in income to our everyday interactions, or to compare incomes across different cities and countries. Here, I wanted to see if I could get a more intuitive measure of income inequality, one that reflects the kinds of social judgments we make every day: If I meet someone of Race X and Gender A, what is the probability that my income is higher?

I found some census data that lists the frequency of people in different income brackets. People here are divided into White, Black, Asian and Hispanic, and there's one male and one female dataset. Annoyingly, we don't have people's individual incomes, only counts per bracket, so we can't directly compare one person's income with another's. Consequently, I decided to use resampling to estimate those probabilities.

After resampling and finding the distribution of incomes, I calculated the probability of income inequality between different groups: by race, by gender, and by race and gender together. These are followed by a **conclusion**.

First, let's quickly load our packages.

In [41]:

```
%matplotlib inline
import csv
import numpy as np
import matplotlib.pyplot as plt
import os
import scipy.stats as spstats
import pandas as pd
import seaborn as sns
sns.set_style("whitegrid")
```

In [38]:

```
def load_income_freq(fname,disp=False):
    # Read one income-frequency CSV: rows are income brackets, columns are races
    with open(fname,'r') as stereo_file:
        stereo_read=csv.reader(stereo_file)
        if disp:
            print('\n');print(fname);print(next(stereo_read))
        all_freq=[]
        for row in stereo_read:
            if disp:print(row)
            try:
                # First column is the bracket label; the rest are counts
                to_int=list(map(int,row[1:]))
                all_freq.append(to_int)
            except ValueError:
                # Skip the header and any other non-numeric rows
                pass
    return np.array(all_freq)

all_count_female=load_income_freq('income_freq_female.csv',disp=False)
all_count_male=load_income_freq('income_freq_male.csv',disp=False)
races=['White', 'Black', 'Asian', 'Hispanic']
```

For the time being, I'm going to collapse the male and female sets; I'll get back to them later.

In [3]:

```
all_count=all_count_female+all_count_male
```

In [4]:

```
all_sum=np.sum(all_count,0)
print(all_sum)
```

So let's convert these to proportions

In [40]:

```
num_row=np.size(all_count,0)
all_prop=all_count.astype(float)/np.tile(all_sum.astype(float),(num_row,1))
print(all_prop)
```
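The same conversion can be written with NumPy broadcasting instead of `np.tile`. Here is a minimal sketch on a made-up count table (`toy_count` is hypothetical and is not the census data):

```python
import numpy as np

# Hypothetical counts: rows are income brackets, columns are groups.
toy_count = np.array([[10, 30],
                      [20, 50],
                      [70, 20]])

# Dividing by the column totals broadcasts across the rows,
# so each column becomes a set of proportions that sums to 1.
toy_prop = toy_count.astype(float) / toy_count.sum(axis=0)

print(toy_prop)
print(toy_prop.sum(axis=0))
```

Each column of `toy_prop` can then be used as the `p` argument when sampling brackets.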

Now let's get to answering the question I had in mind in the first place: What is the probability that a person from race X has a higher income than a person from race Y? To measure this, I'm going to use the proportions we just calculated to resample.

If you're unfamiliar with resampling, the code below will

- Use the proportions we calculated above to generate a random person of race X and a random person of race Y.
- See if the income bracket of the person from race X is higher
- And then repeat this process many times so we can calculate the mean and confidence intervals
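The steps above can be sketched in a compact, vectorized form. This is only an illustration under stated assumptions: `prob_higher`, `group_x` and `group_y` are hypothetical names with made-up proportions, ties are dropped, and the interval is a simple normal approximation that I am assuming here rather than taking from the notebook.

```python
import numpy as np

def prob_higher(p1, p2, num_samp=20000, seed=0):
    """Estimate P(group 1 bracket > group 2 bracket), ignoring ties."""
    rng = np.random.default_rng(seed)
    brackets = np.arange(len(p1))
    # Draw all simulated people at once instead of looping.
    draw1 = rng.choice(brackets, size=num_samp, p=p1)
    draw2 = rng.choice(brackets, size=num_samp, p=p2)
    # Drop ties, then score 1 when group 1 lands in the higher bracket.
    untied = draw1 != draw2
    wins = (draw1[untied] > draw2[untied]).astype(float)
    mean = wins.mean()
    # Normal-approximation 95% interval for a proportion (an assumption).
    half_width = 1.96 * np.sqrt(mean * (1 - mean) / len(wins))
    return mean, (mean - half_width, mean + half_width)

# Hypothetical bracket proportions (low, middle, high) for two groups.
group_x = [0.2, 0.3, 0.5]
group_y = [0.5, 0.3, 0.2]
est, (lo, hi) = prob_higher(group_x, group_y)
print(est, lo, hi)
```

Fixing the seed makes the estimate reproducible; changing it shows how much the estimate moves from run to run.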

In [6]:

```
def compare_race(race1,race2,n1,n2,fold_name):
    # Pairwise race comparison; results are cached on disk under fold_name
    if not os.path.exists(fold_name):
        os.makedirs(fold_name)
    fname='comp'+str(n1)+'_'+str(n2)+'.npy'
    full_fname=os.path.join(fold_name,fname)
    if not os.path.isfile(full_fname):
        print('Fitting '+fname)
        num_income=range(0,len(race1))
        num_samp=20000
        race_comp=[]
        for its in range(0,num_samp):
            # Draw one income bracket per group, weighted by that group's proportions
            race1_choice=np.random.choice(num_income,p=race1)
            race2_choice=np.random.choice(num_income,p=race2)
            # Ties are skipped, so the mean is conditional on the brackets differing
            if race1_choice>race2_choice:
                race_comp.append(1)
            elif race1_choice<race2_choice:
                race_comp.append(0)
        race_comp_array=np.array(race_comp)
        np.save(full_fname,race_comp_array)
    else:
        race_comp_array=np.load(full_fname)
    return race_comp_array

def grid_prop(all_prop,fold_name):
    # Run compare_race for each race-race pairing
    num_race = np.size(all_prop,1)
    race_comp=np.zeros((num_race,num_race))
    for race1 in range(0,num_race):
        for race2 in range(0,num_race):
            curr_race1=all_prop[:,race1]
            curr_race2=all_prop[:,race2]
            race_comp_array=compare_race(curr_race1,curr_race2,race1,race2,fold_name)
            # Entry [row, column] is the probability that row earns more than column
            race_comp[race1,race2]=np.mean(race_comp_array)
    return race_comp

all_grid_greater=grid_prop(all_prop,'comp_race')
```
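Because the resampling only uses bracket proportions, the same quantity can also be computed exactly, which gives a quick sanity check on the simulated values. The sketch below assumes the two people are drawn independently and, like `compare_race`, ignores ties; `exact_prob_higher` and the example proportions are hypothetical.

```python
import numpy as np

def exact_prob_higher(p1, p2):
    """Exact P(group 1 bracket > group 2 bracket | brackets differ)."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # joint[i, j] = P(person 1 in bracket i and person 2 in bracket j),
    # assuming the two people are drawn independently.
    joint = np.outer(p1, p2)
    p_greater = np.tril(joint, k=-1).sum()  # entries with i > j
    p_tie = np.trace(joint)                 # entries with i == j
    return p_greater / (1.0 - p_tie)

# Hypothetical bracket proportions (low, middle, high) for two groups.
group_x = [0.2, 0.3, 0.5]
group_y = [0.5, 0.3, 0.2]
print(exact_prob_higher(group_x, group_y))
```

If the resampled estimate for a pair of groups sits far from this exact value, that would suggest too few samples or a stale cached `.npy` file.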

And now let's make it pretty

In [20]:

```
def heatmap_prop(all_grid,ttle,group_names):
    # Plot the pairwise probability grid as a heatmap
    fig1 = plt.figure(2)
    fig1.set_facecolor('white')
    plt.clf()
    prop_plot=plt.imshow(all_grid, aspect='auto', interpolation='nearest')
    prop_plot.set_cmap('RdBu_r')
    xlabels = group_names
    plt.xticks(range(len(group_names)), xlabels, rotation='horizontal')
    ylabels = group_names
    plt.yticks(range(len(group_names)), ylabels, rotation='vertical')
    plt.title(ttle)
    plt.colorbar()
    plt.show()

heatmap_prop(all_grid_greater,'Figure 1: Probability Row greater than Column',races)
```

*very slightly* more than Asians and Hispanics make a bit less than Blacks.

Of course, we can also use this same procedure to compare people of different genders.

In [42]:

```
count_gender=np.transpose(np.array([np.sum(all_count_male,1),np.sum(all_count_female,1)]))
genders=['Male','Female']
```

In [11]:

```
num_row=np.size(count_gender,0)
sum_gender=np.sum(count_gender,0)
prop_gender=count_gender.astype(float)/np.tile(sum_gender.astype(float),(num_row,1))
all_grid_greater_gender=grid_prop(prop_gender,'comp_gender')
```

In [23]:

```
heatmap_prop(all_grid_greater_gender,'Probability Row greater than Column',genders)
```

And males have a much higher probability of making more money than females.

Some of these inequalities may be exacerbated when race and gender interact.

In [13]:

```
def compare_gender_race(data,fold_name,acr=['Male','Female'],within=['White', 'Black', 'Asian', 'Hispanic']):
    # Given a brackets x races x genders array of proportions,
    # run the inequality resampling analysis for every race-gender pairing
    num_gender=np.size(data,2)
    num_race=np.size(data,1)
    mean_great=[]
    label_great=[]
    for ig1 in range(num_gender):
        for ir1 in range(num_race):
            for ig2 in range(num_gender):
                for ir2 in range(num_race):
                    curr_1=data[:,ir1,ig1]
                    curr_2=data[:,ir2,ig2]
                    n1=acr[ig1]+within[ir1]
                    n2=acr[ig2]+within[ir2]
                    cc_race=compare_race(curr_1,curr_2,n1,n2,fold_name)
                    mean_great.append(np.mean(cc_race))
                    label_great.append(n1+'>'+n2)
    # Returns the probability that group 1 earns more than group 2, with a label for each pairing
    return mean_great,label_great

# Assumption: the per-gender proportions are not defined earlier in the notebook,
# so they are computed here in the same way as all_prop
all_prop_male=all_count_male.astype(float)/np.sum(all_count_male,0)
all_prop_female=all_count_female.astype(float)/np.sum(all_count_female,0)
combined_gender=np.dstack((all_prop_male,all_prop_female))
mean_great,label_great=compare_gender_race(combined_gender,'gender_race')
```

In [49]:

```
def point_prop(means,ttle,group_names):
    # Scatter plot of the probabilities, one point per group pairing
    fig1 = plt.figure(2,figsize=(16, 6))
    fig1.set_facecolor('white')
    plt.clf()
    prop_plot=plt.scatter(range(len(means)),means,s=200,color='teal')
    xlabels = group_names
    plt.xticks(range(len(group_names)), xlabels, rotation=90,fontsize=14)
    plt.xlim((-1,len(group_names)))
    plt.ylabel('Probability Group 1 > Group 2',fontsize=22)
    plt.title(ttle)
    plt.grid(True, axis='y',which='major', color='grey', linestyle='-')
    plt.show()

# Sort the pairings by probability and plot only the upper part of the ranking
rankings=np.argsort(mean_great)
mean_great2=[mean_great[i] for i in rankings]
label_great2=[label_great[i] for i in rankings]
num_include=(len(mean_great2)//2)+4
point_prop(mean_great2[num_include:],'Figure 5: Race-Gender Inequalities',label_great2[num_include:])
```

People are typically very bad at estimating people's average income, making it difficult to conceptualize the magnitude of socioeconomic inequality in our society. Using this measure allows us to express these social differences more intuitively.

Similarly, in some of my other work (Lew & Vul, 2016, Knowledge and use of price distributions by populations and individuals, submitted to CogSci 2016), we found that people were really awful at estimating how much the absolute prices of objects vary but had pretty accurate knowledge of the relative dispersion of prices. That is, people know which objects have more variable prices but not how much they vary. Our probability measure of inequality may allow us to evaluate deficits in people's knowledge of income distributions and improve people's awareness of income inequality.

whether people themselves know *something* about how much incomes vary in our society.

In [15]:

```
def compare_gender_race2(data,fold_name,acr=['Male','Female'],within=['White', 'Black', 'Asian', 'Hispanic']):
    # Given a brackets x races x genders array of proportions, run the inequality
    # resampling analysis and keep every individual comparison in long format
    num_gender=np.size(data,2)
    num_race=np.size(data,1)
    race1=[]
    gender1=[]
    race2=[]
    gender2=[]
    comp_res=[]
    for ig1 in range(num_gender):
        for ir1 in range(num_race):
            for ig2 in range(num_gender):
                for ir2 in range(num_race):
                    curr_1=data[:,ir1,ig1]
                    curr_2=data[:,ir2,ig2]
                    n1=acr[ig1]+within[ir1]
                    n2=acr[ig2]+within[ir2]
                    cc_race=compare_race(curr_1,curr_2,n1,n2,fold_name)
                    # Repeat the group labels once per resampled comparison
                    race1=race1+([within[ir1]]*len(cc_race))
                    race2=race2+([within[ir2]]*len(cc_race))
                    gender1=gender1+([acr[ig1]]*len(cc_race))
                    gender2=gender2+([acr[ig2]]*len(cc_race))
                    comp_res=comp_res+list(cc_race)
    income_df=pd.DataFrame({'race1':race1,'race2':race2,'gender1':gender1,'gender2':gender2,'p1greater':comp_res})
    # Returns one row per comparison: p1greater is 1 if person 1 was in the higher bracket, else 0
    return income_df

income_df=compare_gender_race2(combined_gender,'gender_race_csv')
income_df=income_df.loc[:,['gender1','race1','gender2','race2','p1greater']]
```

In [16]:

```
income_df.to_csv('income_comparison.csv')
```