Quantifying socioeconomic inequality by race and gender

It's well established that incomes are not evenly distributed by race and gender. However, absolute differences in income can be hard to conceptualize in everyday life, and hard to compare across different cities and countries. Here, I wanted to see if I could get a more intuitive measure of income inequality, one that reflects the kinds of social judgments we make every day: If I meet someone of Race X and Gender A, what is the probability that my income is higher?

I found some census data that lists the number of people in different income brackets. People here are divided into White, Black, Asian and Hispanic, and there's one male and one female dataset. Annoyingly, we don't have people's individual incomes, only counts per bracket, so we can't compare individuals directly. Consequently, I decided to use resampling to estimate those probabilities.

After resampling and finding the distribution of incomes, I calculated the probability of income inequality between different groups:

  1. Between races
  2. Between genders
  3. By race and gender

Followed by a conclusion

First let's quickly load our packages

In [41]:
%matplotlib inline
import csv
import numpy as np
import matplotlib.pyplot as plt
import os
import scipy.stats as spstats
import pandas as pd

import seaborn as sns
sns.set_style("whitegrid")

Load data

The original Excel files had a lot of information that I didn't want to focus on for this first pass (mean incomes, mixed race people, etc.). I ended up creating two csv files with race and income information: income_freq_male.csv and income_freq_female.csv. Let's take a look at the datasets

In [38]:
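def load_income_freq(fname,disp=False):
    # Read one income-frequency csv: one row per income bracket, one count column per race
    all_freq=[]
    with open(fname,'r') as stereo_file:
        stereo_read=csv.reader(stereo_file)
        header=next(stereo_read)  # assumes the first row holds the column names
        if disp: print('\n');print(fname);print(header)
        for row in stereo_read:
            if disp: print(row)
            try:
                # First column is the bracket label; the rest are counts
                all_freq.append([int(x) for x in row[1:]])
            except ValueError:
                # Skip rows that are not purely numeric
                pass

    return np.array(all_freq)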


all_count_female=load_income_freq('income_freq_female.csv',disp=False)    
all_count_male=load_income_freq('income_freq_male.csv',disp=False)
races=['White', 'Black', 'Asian', 'Hispanic']

For the time being, I'm going to collapse these two sets; I'll get back to them later

In [3]:
all_count=all_count_female+all_count_male

For each of these income brackets (rows) we have the number of White, Black, Asian and Hispanic people within that bracket. Counting whether there are more wealthy/poor people of a given race isn't going to give a meaningful measure of inequality--the varying number of people in each race precludes that.

In [4]:
all_sum=np.sum(all_count,0)
print(all_sum)
[204168  33904  15655  40379]

So let's convert these to proportions

In [40]:
# Divide each column by its total; broadcasting applies all_sum to every row
all_prop=all_count.astype(float)/all_sum.astype(float)
print(all_prop)
[[ 0.17282336  0.22413285  0.24196742  0.26355284]
 [ 0.02494514  0.03232657  0.02395401  0.03075856]
 [ 0.03106755  0.03902194  0.03570744  0.03940167]
 [ 0.04023647  0.05828221  0.03404663  0.0470294 ]
 [ 0.04968457  0.06285394  0.0440115   0.05822333]
 [ 0.03484385  0.03958235  0.02593421  0.03464672]
 [ 0.04174993  0.04692662  0.03251357  0.05071943]
 [ 0.03223326  0.03276899  0.02459278  0.03355705]
 [ 0.04450746  0.04674965  0.03577132  0.06121994]
 [ 0.02713941  0.02902312  0.01769403  0.02583026]
 [ 0.03619078  0.03949387  0.03008623  0.04086282]
 [ 0.0224521   0.02288815  0.01577771  0.01961416]
 [ 0.03873771  0.04026074  0.033855    0.04343842]
 [ 0.01742682  0.01377419  0.01111466  0.01428961]
 [ 0.03094021  0.02887565  0.02753114  0.02850492]
 [ 0.01584479  0.01501298  0.01117854  0.01099581]
 [ 0.03183163  0.03055687  0.03219419  0.02939647]
 [ 0.0125338   0.00996933  0.00811242  0.00792491]
 [ 0.02166353  0.01875885  0.01545832  0.01669184]
 [ 0.01247012  0.01176852  0.00938997  0.00869264]
 [ 0.02591003  0.02333058  0.02625359  0.02043141]
 [ 0.01056973  0.00616446  0.00562121  0.00586939]
 [ 0.01598194  0.01294832  0.01577771  0.01087199]
 [ 0.00811097  0.00575153  0.00670712  0.00354144]
 [ 0.01966028  0.01398065  0.0189077   0.01416578]
 [ 0.0073224   0.00392284  0.00402427  0.00406152]
 [ 0.01234767  0.00814063  0.01290323  0.00777632]
 [ 0.00611261  0.00309698  0.00440754  0.00225365]
 [ 0.0132391   0.00828811  0.01520281  0.00728101]
 [ 0.0054367   0.00306748  0.00472692  0.00208029]
 [ 0.01139258  0.00645941  0.01290323  0.00554744]
 [ 0.00440813  0.0023891   0.00453529  0.002427  ]
 [ 0.01023177  0.00719679  0.01456404  0.00668664]
 [ 0.00412895  0.00315597  0.00249122  0.00245177]
 [ 0.0061322   0.00430628  0.00772916  0.00284802]
 [ 0.00336977  0.0017697   0.00274673  0.00123827]
 [ 0.00700893  0.00424729  0.0101565   0.00351668]
 [ 0.00280651  0.00162223  0.00319387  0.00096585]
 [ 0.00404079  0.00247758  0.00434366  0.00173357]
 [ 0.00295345  0.00171071  0.00389652  0.00101538]
 [ 0.04711316  0.02277017  0.06809326  0.01859878]
 [ 0.01652561  0.00525012  0.02063239  0.00480448]
 [ 0.00690118  0.00159273  0.00657937  0.00188217]
 [ 0.008973    0.00333294  0.01271159  0.00260036]]

Great! From a cursory glance, we can note some interesting features at the extremes. In the lowest income bracket (<\$2500) there is a lower proportion of Whites (minimum difference ~.05) while the other races have pretty similar proportions. In the highest income bracket (>\$250,000) there is a higher proportion of Asians (minimum difference ~.004) followed by Whites and then Blacks and Hispanics.
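
A quick way to sanity-check the proportions is to confirm that each race's column sums to 1. The sketch below uses made-up toy counts (`toy_count` is hypothetical, not the census data) and divides by the column totals directly, letting NumPy broadcast the totals across rows.

```python
import numpy as np

# Toy counts (hypothetical): rows are income brackets, columns are groups.
toy_count = np.array([[30, 10],
                      [50, 20],
                      [20, 70]])

# Dividing by the column totals broadcasts the totals across every row.
toy_prop = toy_count.astype(float) / toy_count.sum(axis=0)

# Each column should now be a probability distribution over brackets.
assert np.allclose(toy_prop.sum(axis=0), 1.0)
print(toy_prop)
```

The same check, `np.allclose(all_prop.sum(axis=0), 1.0)`, should hold for the real proportions, since `np.random.choice` needs probabilities that sum to 1.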

Probability of inequality across races

Now let's get to answering the question I had in mind in the first place: What is the probability of a person from race X earning more than a person from race Y? To measure this, I'm going to use the proportions we just calculated to resample.

If you're unfamiliar with resampling, the code below will

  1. Use the proportions we calculated above to generate a random person of race X and a random person of race Y.
  2. See if the income bracket of the person from race X is higher (ties, where both people land in the same bracket, are discarded)
  3. And then repeat this process many times (20,000 here) so we can calculate the mean and, from the saved outcomes, confidence intervals
In [6]:
def compare_race(race1,race2,n1,n2,fold_name):
    # Pairwise comparison: resample a bracket for each group and record who is higher.
    # Results are cached on disk so each pairing is only simulated once.
    if not os.path.exists(fold_name):
        os.makedirs(fold_name)

    fname='comp'+str(n1)+'_'+str(n2)+'.npy'

    full_fname=os.path.join(fold_name,fname)

    if not os.path.isfile(full_fname):
        print('Fitting '+fname)
        num_income=len(race1)  # number of income brackets
        num_samp=20000
        race_comp=[]
        for its in range(0,num_samp):
            # Draw one bracket index per group, weighted by that group's proportions
            race1_choice=np.random.choice(num_income,p=race1)
            race2_choice=np.random.choice(num_income,p=race2)
            if race1_choice>race2_choice:
                race_comp.append(1)
            elif race1_choice<race2_choice:
                race_comp.append(0)
            # Ties (same bracket) are dropped, so the mean excludes them
        race_comp_array=np.array(race_comp)
        np.save(full_fname,race_comp_array)
    else:
        race_comp_array=np.load(full_fname)

    return race_comp_array

def grid_prop(all_prop,fold_name):
    # Run compare_race for each race-race pairing
    num_race = np.size(all_prop,1)

    # race_comp[i,j] holds the probability that group i is in a higher bracket than group j
    race_comp=np.zeros((num_race,num_race))

    for race1 in range(0,num_race):
        for race2 in range(0,num_race):
            curr_race1=all_prop[:,race1]
            curr_race2=all_prop[:,race2]
            race_comp_array=compare_race(curr_race1,curr_race2,race1,race2,fold_name)
            race_comp[race1,race2]=np.mean(race_comp_array)
    
    return race_comp

all_grid_greater=grid_prop(all_prop,'comp_race')
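
Step 3 mentions confidence intervals, but the cell above only takes the mean of each saved 0/1 array. One way to add an interval is a normal-approximation interval for a proportion; the sketch below is an assumption about how that could be done, not part of the original analysis. `proportion_ci` and `fake_comp` are hypothetical names, and `fake_comp` stands in for an array returned by `compare_race`.

```python
import numpy as np

def proportion_ci(outcomes, z=1.96):
    """Mean and normal-approximation CI (95% for z=1.96) of 0/1 outcomes."""
    outcomes = np.asarray(outcomes, dtype=float)
    n = outcomes.size
    p_hat = outcomes.mean()
    # Standard error of a sample proportion, scaled by the z value
    half_width = z * np.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, p_hat - half_width, p_hat + half_width

# Hypothetical stand-in for an array returned by compare_race
rng = np.random.RandomState(0)
fake_comp = (rng.rand(20000) < 0.6).astype(int)

p_hat, ci_low, ci_high = proportion_ci(fake_comp)
print(p_hat, ci_low, ci_high)
```

With 20,000 draws the interval is narrow (well under .02 wide for probabilities near .5 or .6), so differences of a few hundredths between cells of the grid are unlikely to be sampling noise alone.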

And now let's make it pretty

In [20]:
def heatmap_prop(all_grid,ttle,group_names):
    fig1 = plt.figure(2)
    fig1.set_facecolor('white')
    plt.clf()
    prop_plot=plt.imshow(all_grid, aspect='auto', interpolation='nearest')
    prop_plot.set_cmap('RdBu_r')
    xlabels = group_names
    plt.xticks(range(len(group_names)), xlabels, rotation='horizontal')

    ylabels = group_names
    plt.yticks(range(len(group_names)), ylabels, rotation='vertical')

    plt.title(ttle)

    plt.colorbar()
    plt.show()

heatmap_prop(all_grid_greater,'Figure 1: Probability Row greater than Column',races)

And we get a really fascinating summary of inequality across races. There's a good amount going on here. Along the diagonal we have Probability=.5, since we're comparing two samples from the same race and ties are dropped. Essentially, White and Asian incomes are roughly on par (lime green at (0,2)), but both groups tend to make more than Blacks and Hispanics (red squares). Looking at a finer level of detail, Whites make very slightly more than Asians and Hispanics make a bit less than Blacks.
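
As a cross-check on the resampled values, the same tie-excluded probability can also be computed exactly from two bracket distributions, with no sampling at all. This is a sketch of an alternative calculation, not the method used above; `exact_prob_greater`, `toy_a` and `toy_b` are hypothetical names and toy values.

```python
import numpy as np

def exact_prob_greater(p1, p2):
    """P(bracket1 > bracket2 | brackets differ) for two bracket distributions."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Probability that each group falls strictly below bracket i
    below_1 = np.cumsum(p1) - p1
    below_2 = np.cumsum(p2) - p2
    p_greater = np.sum(p1 * below_2)  # group 1 in a higher bracket
    p_less = np.sum(p2 * below_1)     # group 2 in a higher bracket
    # Condition on "not a tie", matching the resampling code that drops ties
    return p_greater / (p_greater + p_less)

toy_a = [0.2, 0.3, 0.5]
toy_b = [0.5, 0.3, 0.2]
print(exact_prob_greater(toy_a, toy_b))
print(exact_prob_greater(toy_a, toy_a))  # identical distributions give 0.5
```

Calling `exact_prob_greater(all_prop[:,0], all_prop[:,2])` would give the White versus Asian value that the resampled grid estimates at position (0,2).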

What about gender?

Of course, we can also use this same procedure to compare people of different genders

In [42]:
count_gender=np.transpose(np.array([np.sum(all_count_male,1),np.sum(all_count_female,1)]))
genders=['Male','Female']
In [11]:
num_row=np.size(count_gender,0)
sum_gender=np.sum(count_gender,0)
prop_gender=count_gender.astype(float)/np.tile(sum_gender.astype(float),(num_row,1))
all_grid_greater_gender=grid_prop(prop_gender,'comp_gender')
In [23]:
heatmap_prop(all_grid_greater_gender,'Probability Row greater than Column',genders)

And males have a much higher probability (~.63) of making more money than females

What about race-gender interactions?

Some of these inequalities may be exacerbated further when race and gender interact

In [13]:
def compare_gender_race(data,fold_name,acr=['Male','Female'],within=['White', 'Black', 'Asian', 'Hispanic']):
    # data is a 3D array of proportions (income bracket x race x gender).
    # Run the inequality resampling analysis for every gender-race pairing.
    num_gender=np.size(data,2)
    num_race=np.size(data,1)
    
    mean_great=[]
    label_great=[]
    for ig1 in range(num_gender):
        for ir1 in range(num_race):
            for ig2 in range(num_gender):
                for ir2 in range(num_race):
                    curr_1=data[:,ir1,ig1]
                    curr_2=data[:,ir2,ig2]
                    n1=acr[ig1]+within[ir1]
                    n2=acr[ig2]+within[ir2]                    
                    cc_race=compare_race(curr_1,curr_2,n1,n2,fold_name)
                    mean_great.append(np.mean(cc_race))
                    label_great.append(n1+'>'+n2)
    
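    # Returns, for every pairing, the tie-excluded probability that group 1
    # is in a higher bracket than group 2, plus a 'group1>group2' label
    return mean_great,label_great

# all_prop_male/all_prop_female are not defined in the cells shown; assumed here to be
# the per-gender proportions, normalized by column totals the same way as all_prop
all_prop_male=all_count_male.astype(float)/np.sum(all_count_male,0)
all_prop_female=all_count_female.astype(float)/np.sum(all_count_female,0)
combined_gender=np.dstack((all_prop_male,all_prop_female))
mean_great,label_great=compare_gender_race(combined_gender,'gender_race')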
In [49]:
def point_prop(means,ttle,group_names):
    fig1 = plt.figure(2,figsize=(16, 6))
    fig1.set_facecolor('white')
    plt.clf()
    prop_plot=plt.scatter(range(len(means)),means,s=200,color='teal')
    xlabels = group_names
    plt.xticks(range(len(group_names)), xlabels, rotation=90,fontsize=14)
    plt.xlim((-1,len(group_names)))
    plt.ylabel('Probability Group 1 > Group 2',fontsize=22)
    plt.title(ttle)
    plt.grid(True, axis='y',which='major', color='grey', linestyle='-')

    plt.show()

rankings=np.argsort(mean_great)

mean_great2=[mean_great[i] for i in rankings]
label_great2=[label_great[i] for i in rankings]

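# With 8 groups there are 64 pairings. Skip the bottom 28 of the ranking (converses)
# and the 8 same-same comparisons, which sit around .5 in the middle of the ranking
num_include=(len(mean_great2)//2)+4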
point_prop(mean_great2[num_include:],'Figure 5: Race-Gender Inequalities',label_great2[num_include:])

For ease of viewing, I've rank-ordered the comparisons and removed same-same comparisons and the converses of the ones shown here. Accounting for race-gender interactions reveals some whopping disparities well beyond what we've seen before. Previously we found that Whites were more likely to make more than Hispanics (~.6) and Males were more likely to make more than Females (~.63). But the combination of the two (that is, comparing White Males to Hispanic Females) resulted in an increased probability of almost .75!
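
The slicing used to drop same-same comparisons and converses relies on their position in the ranking, which sampling noise could shift slightly. A more explicit filter, sketched below under the assumption that labels follow the 'group1>group2' format built above, keeps each pair only in its higher-probability direction. `keep_top_direction` and the toy inputs are hypothetical.

```python
def keep_top_direction(means, labels):
    """Drop self-comparisons; keep each pair once, in its higher-probability direction."""
    best = {}
    for mean, label in zip(means, labels):
        left, right = label.split('>')
        if left == right:
            continue  # same-same comparison
        key = tuple(sorted((left, right)))  # same key for A>B and B>A
        if key not in best or mean > best[key][0]:
            best[key] = (mean, label)
    # Sort ascending by probability, as in the ranked plot
    return sorted(best.values())

toy_means = [0.5, 0.7, 0.3, 0.5]
toy_labels = ['MaleWhite>MaleWhite', 'MaleWhite>FemaleWhite',
              'FemaleWhite>MaleWhite', 'FemaleWhite>FemaleWhite']
kept = keep_top_direction(toy_means, toy_labels)
print(kept)
```

The returned `(mean, label)` pairs can be unzipped and passed to `point_prop` in place of the sliced lists.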

Conclusions

People are typically very bad at estimating people's average income, making it difficult to conceptualize the magnitude of socioeconomic inequality in our society. Using this measure allows us to express these social differences more intuitively.

Relatedly, in some of my other work (Lew & Vul, 2016, Knowledge and use of price distributions by populations and individuals, submitted to CogSci 2016), we found that people were really awful at estimating how much the absolute prices of objects vary but had pretty accurate knowledge of the relative dispersion of prices; that is, people know which objects have more variable prices but not how much they vary. Our probability measure of inequality may allow us to evaluate whether people themselves know something about how much incomes vary in our society, identify deficits in that knowledge, and improve people's awareness of income inequality.

Creating a csv for this

In [15]:
def compare_gender_race2(data,fold_name,acr=['Male','Female'],within=['White', 'Black', 'Asian', 'Hispanic']):
    # data is a 3D array of proportions (income bracket x race x gender).
    # Run the resampling for every pairing and keep every individual outcome.
    num_gender=np.size(data,2)
    num_race=np.size(data,1)
    
    race1=[]
    gender1=[]
    race2=[]
    gender2=[]
    comp_res=[]
    for ig1 in range(num_gender):
        for ir1 in range(num_race):
            for ig2 in range(num_gender):
                for ir2 in range(num_race):
                    curr_1=data[:,ir1,ig1]
                    curr_2=data[:,ir2,ig2]
                    n1=acr[ig1]+within[ir1]
                    n2=acr[ig2]+within[ir2]                    
                    cc_race=compare_race(curr_1,curr_2,n1,n2,fold_name)
                    race1=race1+([within[ir1]]*len(cc_race))
                    race2=race2+([within[ir2]]*len(cc_race))                    
                    gender1=gender1+([acr[ig1]]*len(cc_race))
                    gender2=gender2+([acr[ig2]]*len(cc_race))
                    comp_res=comp_res+list(cc_race)
    
    income_df=pd.DataFrame({'race1':race1,'race2':race2,'gender1':gender1,'gender2':gender2,'p1greater':comp_res})
    
    # Returns one row per resampled comparison: p1greater is 1 if group 1 was in the higher bracket, else 0
    return income_df

income_df=compare_gender_race2(combined_gender,'gender_race_csv')
income_df=income_df.loc[:,['gender1','race1','gender2','race2','p1greater']]
In [16]:
income_df.to_csv('income_comparison.csv')
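
Since the csv stores one row per resampled comparison, the pairwise probabilities can be recovered by averaging `p1greater` within each pairing. The sketch below uses a small made-up DataFrame with the same column names (`toy_df` and its values are hypothetical) rather than the real file.

```python
import pandas as pd

# Toy stand-in for income_comparison.csv (values are made up)
toy_df = pd.DataFrame({
    'gender1': ['Male'] * 4 + ['Female'] * 4,
    'race1': ['White'] * 8,
    'gender2': ['Female'] * 4 + ['Male'] * 4,
    'race2': ['Hispanic'] * 8,
    'p1greater': [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of the 0/1 outcomes per pairing = estimated probability group 1 earns more
summary = (toy_df
           .groupby(['gender1', 'race1', 'gender2', 'race2'])['p1greater']
           .mean()
           .reset_index())
print(summary)
```

The same `groupby` applied to `income_df` should reproduce the values in `mean_great` up to sampling noise, because this cell resamples into a separate cache folder.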