Recently, I was reading this New York Times article about the current US cabinet being the least diverse in terms of its members' race and gender in a long time. Of its 24 members, 18 are white men. What is more, the other six members hold some of the lowest-ranking positions.

I was wondering: How can we quantify the current US cabinet's representativeness (or rather non-representativeness) in terms of race and gender? And I came up with the following specific question:

When randomly picking 24 US citizens, how likely is it that we will end up with a group that consists of at least 18 white men?

Looking for some suitable census data to use for calculating that probability, I found this 2010 census report. For the census, people living in the US are asked about their “race and Hispanic origin”, which means that the data is based on the individuals' self-identification. According to the census report, 72.4 percent of people living in the US identified as “White” in 2010. Looking at another 2010 census report, one can find out that the percentage of people in the US that self-identify as “male” is 49.2 percent across all age groups in total.

Concepts like race, ethnicity, and gender tend to get used in an oversimplified way, e.g. by dividing gender into two groups (male and female). That contributes to the marginalization and social erasure of minorities, and is also reflected in the census reports.

As we only want to show the overrepresentation of white males here, we only need to know what fraction of the US population they represent, and don't need to look at the data about other population groups. We can estimate the fraction of white males by multiplying the fraction of people who identify as White by the fraction of people who identify as male. This assumes that the two attributes are statistically independent, i.e. that the share of males among people who identify as White is roughly the same as in the population as a whole. The result is 0.724 * 0.492 ≈ 0.356.
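The multiplication above can be written out as a tiny sketch. The variable names are made up for illustration, and the product only approximates the true share under the assumption that race and gender self-identification are independent of each other:

```python
# Shares quoted from the 2010 census reports mentioned above.
share_white = 0.724
share_male = 0.492

# Assumption: being "White" and being "male" are independent attributes,
# so the joint share is simply the product of the two individual shares.
share_white_male = share_white * share_male

print(round(share_white_male, 3))  # 0.356
```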

Now, what formula do we use for our calculation?

Let's first describe exactly what we want to know: when we run a series of 24 experiments, each of which picks a random US citizen, we'd like to know the probability of ending up with exactly 0, 1, 2, […], 17 white males across the whole series. So how do we find a formula for calculating those probabilities?

When trying to find a good mathematical model for something, it's usually helpful to generalize as much as possible (but not more!). Let's try this with the following assumptions:

  1. We know that we have a fixed, finite number of individual experiments, which means that the number of white men we count is a discrete quantity.

  2. We can assume that our experiments are statistically independent. That is, they can't influence each other.

  3. There are only two possible outcomes for our individual experiments: either we picked a white man, or we didn't. In other words, each experiment has its own boolean-valued outcome.

  4. The probability for a particular outcome is the same for each experiment.

From the first of our above assumptions we can infer that we need to look for a discrete probability distribution. But what discrete probability distribution matches the other assumptions as well?

Check out the binomial distribution:

[…] the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question […]

Sounds like exactly what we were looking for, right? Let's find the corresponding distribution function that we can use to calculate our answer!

Looking at the specification of the binomial distribution, we can see that there are two functions, namely a probability mass function, and a cumulative distribution function (CDF), where the latter is a summation of the former over a discrete interval:

$$F(k; n, p) = P(X \le k) = \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} p^i (1-p)^{n-i}$$
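To make the relationship between the two functions concrete, here is a minimal sketch that uses only Python's standard library. The function names `binomial_pmf` and `binomial_cdf` are hypothetical helpers chosen for this illustration, and the fair-coin numbers are a toy example rather than our census values:

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """Probability of at most k successes: the PMF summed over 0..k."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# Toy example: at most one head in four fair coin flips.
# (1 + 4) of the 16 equally likely outcomes qualify.
print(binomial_cdf(1, 4, 0.5))  # 0.3125
```

Summing the probability mass function over every possible outcome, from 0 to n, always gives 1, which is a quick way to sanity-check such a helper.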

Before trying to apply that to our problem, let's first get a general understanding of what a cumulative distribution function is:

In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x.

…which translates to the following formula:

$$F_X(x) = P(X \le x)$$

Assuming that X is the random number of white men among 24 randomly picked US citizens, that it is supposed to be less than or equal to x, and that x is 17, we can use that formula to answer the following question:

When randomly picking 24 US citizens, how likely is it that we will end up with a group that consists of at most 17 white men?

See how that question is complementary to our former question of how likely it is that we end up with a group of at least 18 white men? That means that if we calculate that value and subtract it from 1, we're done :)
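Written out for our case, the complement step looks as follows. This is just the binomial cumulative distribution function with n = 24 and k = 17 plugged in, where p stands for the fraction of white males estimated earlier:

```latex
P(X \ge 18) = 1 - P(X \le 17)
            = 1 - \sum_{i=0}^{17} \binom{24}{i} \, p^{i} (1 - p)^{24 - i},
\qquad p = 0.356
```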

Now, let's do our calculation by translating our previous formula into Python code:

SciPy, which is a Python library for scientific computing, provides a binom object in its scipy.stats module with a cdf method (among many other things), which is the binomial distribution's cumulative distribution function. We can call that method with the values k = 17, n = 24, p = 0.356, and calculate the probability in question in a Python shell like this:

>>> from scipy.stats import binom
>>> 1 - binom.cdf(17, 24, 0.356)
9.7307617324737805e-05

We conclude:

When randomly picking 24 US citizens, the probability of ending up with a group that consists of at least 18 white men is approximately 0.01 percent.
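As a rough plausibility check, the experiment can also be simulated directly instead of using the formula. The following is only a sketch: the function name, the number of trials, and the fixed seed are arbitrary choices made for this illustration, and because the event is so rare, the estimate is noisy and should only be expected to land in the same order of magnitude as the exact value.

```python
import random


def estimate_tail_probability(trials, n=24, p=0.356, threshold=18, seed=0):
    """Estimate P(X >= threshold) by repeating the n-pick experiment."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        # One simulated group: each pick is a white man with probability p.
        white_men = sum(rng.random() < p for _ in range(n))
        if white_men >= threshold:
            hits += 1
    return hits / trials


# 200,000 simulated groups of 24 people (takes a few seconds in pure Python).
estimate = estimate_tail_probability(200_000)
print(estimate)
```

With roughly one qualifying group expected per ten thousand simulated groups, only a few dozen hits occur in a run of this size, so the printed value will scatter around the exact result rather than match it.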