Guest post: another critical look at provocative paper claiming to identify the “most discriminatory” federal sentencing judges
I expressed concerns in this recent post about a new empirical paper making claims regarding the “most discriminatory” federal sentencing judges. Upon seeing this Twitter thread by Prof. Jonah Gelbach about the work, I asked the good professor if he might turn his thread into a guest post. He obliged with this impressive essay:
—————
This post will comment on the preprint of The Most Discriminatory Federal Judges Give Black and Hispanic Defendants At Least Double the Sentences of White Defendants, by Christian Michael Smith, Nicholas Goldrosen, Maria-Veronica Ciocanel, Rebecca Santorella, Chad M. Topaz, and Shilad Sen. Doug Berman blogged about it here, and I’m grateful to him for the opportunity to publish this post here.
As I explained in a Twitter thread over the weekend, I have serious concerns about the study. The most important concerns I raised in that thread fall into the following categories:
- Incomplete data
- Endogeneity of included regressors
- Small numbers of observations per judge
- Use of most extreme judge-specific disparity estimates
I’ll take these in turn.
(1) Incomplete Data. It’s complicated to explain the data, whose construction involves merging multiple large data sets. In fact, a subset of the authors have a whole other paper about data construction. In brief, the data are constructed by linking US Sentencing Commission data files to those in the Federal Judicial Center’s Integrated Data Base, which gives them enough to form docket numbers. They then use the Free Law Project’s Juriscraper tool (https://free.law/projects/juriscraper/) to query PACER, which yields dockets with attached judges’ initials for most of the cases matched earlier in the authors’ pipeline. The authors use those initials to identify the judge they believe handled sentencing, using public lists of judges by district.
As involved as the data construction is, my primary concern is simple: the share of cases included in the data set the authors use is very low. For 2001-2018, there were 1.27 million sentences in USSC data and 1.46 million in FJC data (these figures come from the data-construction paper, which is why they apply to the 2001-2018 period rather than the 2006-2019 period used in the estimation of “Most Discriminatory” judges). Of these records, the authors were able to match 860k sentences, of which they matched 809k to dockets via Juriscraper. After using initials to match judges, they have 596k cases they think are matched. That’s a match rate of less than 50% based on the USSC data and barely 40% based on the FJC data. The authors can’t tell us much about the characteristics of missing cases, and it’s clear to me from reading the newer paper that the match rate varies substantially across districts.
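A quick back-of-the-envelope check of those match rates, using the rounded counts just quoted (the exact figures are in the authors’ data-construction paper):

```python
# Rounded record counts for 2001-2018, as quoted above
ussc_total = 1_270_000     # sentences in USSC data
fjc_total = 1_460_000      # sentences in FJC IDB data
final_matched = 596_000    # cases matched all the way through to a judge

print(f"match rate vs USSC: {final_matched / ussc_total:.1%}")  # just under 47%
print(f"match rate vs FJC:  {final_matched / fjc_total:.1%}")   # about 41%
```

Either way the denominator is sliced, a majority-to-near-majority of sentences never make it into the estimation sample.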
I think this much alone is enough to make it irresponsible to report estimates that purport to measure individually named judges’ degrees of discrimination. As a thought experiment, suppose that (i) the authors have half the data, and (ii) if they were able to include the other half of the data they would find that there was no meaningful judge-level variation in estimated racial disparities in sentencing. By construction, that would render any discussion of the “Most Discriminatory” judges pointless. Because the authors can’t explain why cases are missed, they have no way to rule out even such an extreme possibility. Nor do they determine what share of cases they miss for any judge in the data, because they have no measure of the denominator (perhaps they could do this with Westlaw or similar searches for some individual judges). Their approach to the issue of missing data is to simply assume that missing cases are missing at random:
“One unknown potential source of error is that we cannot determine what percentage of each judge’s cases were matched in the JUSTFAIR database. If this missingness is as-if random with respect to sentencing variables of interest, that should not bias our results, but we have little way of determining this.” (Pages 18-19, emphasis added.)
I believe it is irresponsible to name individual judges as “The Most Discriminatory” on the basis of data as incomplete as these.
(2) Endogeneity. The authors include as controls in their model each defendant’s guideline-minimum sentence, variables accounting for the type of charge, and various defendant characteristics. They argue that these variables are enough to deal not only with the enormous amount of missing data (with an unknown selection mechanism; see above) but also with any concerns that would arise even if all cases were available. As Doug Berman previously noted here, if prosecutors offer plea deals of differing generosity to defendants of different races, then the guideline minimum doesn’t account for heterogeneity in cases. And note that if that happens in general, it’s a problem for all the model’s estimates. In other words, even if the particular mechanism Doug hypothesized (sweet plea deals for Black defendants in the EDPA) doesn’t hold, the whole model is suspect if the guidelines variable is substantially endogenous.
There are other endogeneity concerns, e.g., the study includes as regressors variables that capture reasons why a sentence departed from the guidelines — an outcome that is itself partly a function of the sentence whose (transformed) value is on the left hand side of the model. And as a friend suggested to me after I posted my Twitter thread, the listed charges are often the result of plea bargains, whose consummation can be expected to depend on the expected sentence. So the guideline minimum variable, too, is potentially endogenous.
(3) Small numbers of observations per judge. The claims about particular judges’ putatively discriminatory sentencing rest primarily on what are known as random-effect coefficients on race dummies. It’s lengthy to explain all the machinery here, but I’ll take a crack at a simplified description.
The key model output on which the authors make their “Most Discriminatory” designations are judge-level estimated Black-White disparities (the same type of analysis applies for Hispanic-White disparity). Very roughly speaking, you can think of the estimated disparity for Judge J as an average of two things: (i) the overall observed Black-White disparity across all judges — call this the “overall disparity”, and (ii) the average disparity in the subset of cases in which Judge J did the sentencing — call this the “judge-specific raw disparity”.
For example, suppose that over all defendants, the average (transformed) sentence is 9% longer among Black defendants than among White ones; then the overall disparity would be 9%. Now suppose that among defendants assigned to Judge J, average sentences were 20% longer for Black than White defendants; then the judge-specific raw disparity would be 20%.
The judge-level estimated disparity that results from the kind of model the authors use is a weighted average of the overall disparity and the judge-specific raw disparity. So in our example, the estimated disparity for Judge J would be a weighted average of 9% (overall disparity) and 20% (judge-specific raw disparity). What are the weights used to form this average? They depend on the variance across judges in the true judge-specific disparity and the “residual” variance of individual sentences — the variance that is unassociated with factors that the model indicates help explain variation in sentences.
The greater the residual variance, the less weight will be put on the judge-specific raw disparity. This is what’s known as the “shrinkage” property of mixed models — they shrink the weight placed on judge-specific raw disparities in order to reduce the noisiness of the model’s estimated disparity for each judge. (I noted this property in a follow-up tweet to part of my thread.)
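Very roughly, the shrinkage mechanics can be sketched as follows. This is the textbook empirical-Bayes/BLUP weighting, not the authors’ exact estimator, and the variance numbers below are invented purely for illustration:

```python
def shrunk_disparity(raw, overall, tau2, sigma2, n_black, n_white):
    """Weighted average of a judge's raw disparity and the overall disparity.

    tau2:   variance across judges of true judge-level disparities
    sigma2: residual variance of an individual (transformed) sentence
    """
    # Sampling variance of the judge's raw Black-White contrast
    var_raw = sigma2 * (1 / n_black + 1 / n_white)
    # Greater residual noise -> less weight on the judge's own cases
    w = tau2 / (tau2 + var_raw)
    return w * raw + (1 - w) * overall

# Stylized example from the text: 9% overall, 20% raw for Judge J,
# with invented variances and 40 cases in each racial group
est = shrunk_disparity(raw=0.20, overall=0.09,
                       tau2=0.05, sigma2=1.6, n_black=40, n_white=40)
print(f"{est:.3f}")  # about 0.132: pulled well back from 0.20 toward 0.09
```

With these illustrative numbers the weight on the judge’s own raw disparity is under 40%, so the estimate lands much closer to the overall disparity than to the judge-specific raw one.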
However, all else equal, greater residual variance also means that variation in judge-specific raw disparities will be driven more by randomness in the composition of judges’ caseloads. Because these raw disparities contribute to the model-estimated disparity, residual variance creates a luck-of-the-draw effect in the model estimates: a judge who happens to have been assigned 40 Black defendants convicted of very serious offenses and 40 White defendants convicted of less serious ones will have a high raw disparity due to this luck factor, and that will be transmitted to the model’s estimated disparity.
How important this effect of residual variance is depends on context. The key relevant factors are likely to be the number of cases assigned to each judge for each racial group and the size of the residual variance relative to the variance across judges in true judge-level disparities.
As I wrote in my Twitter thread, I used the authors’ posted code and data to determine that Hon. C. Darnell Jones II, the judge named by the authors as the “Most Discriminatory”, had a total of 103 cases with Black (non-Hispanic) defendants, 37 cases with Hispanic defendants, and 67 with White defendants. Hon. Timothy J. Savage, the judge named as the second “Most Discriminatory”, sentenced 155 Black (non-Hispanic) defendants included in the estimation, 58 Hispanic defendants, and 93 White defendants. These don’t strike me as very large numbers of observations, which is another way of saying that I’m concerned residual variance may play a substantial role in driving the model-estimated disparities for these judges.
My replication of the authors’ model shows that true judge-specific disparities in the treatment of Black and White defendants have an estimated variance of 0.055, whereas the estimated residual variance is nearly 30 times higher: 1.59 for a single defendant. For a judge who sentenced 40 Black and 40 White defendants, the residual-variance contribution to the raw disparity would be 2(1.59)/40 ≈ 0.08, which is larger than the 0.055 estimated variance in true judge-level disparities. It’s more complicated to assess the pattern for judges with different numbers of defendants by race, but I would not be surprised if the residual-variance component is roughly the same size as the variance in judge-level effects.
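The arithmetic behind that comparison is short enough to write out. The quantity 2(1.59)/40 is the sampling variance of a raw Black-White contrast for a judge with 40 defendants in each group:

```python
sigma2 = 1.59   # estimated residual variance, single defendant (my replication)
tau2 = 0.055    # estimated variance of true judge-level disparities

# Sampling variance of the raw disparity with 40 Black and 40 White defendants
var_raw = sigma2 * (1 / 40 + 1 / 40)
print(round(var_raw, 4))   # 0.0795, larger than tau2 = 0.055
print(var_raw > tau2)      # True
```

So for a caseload of this size, noise alone contributes more variance to the raw disparity than the entire estimated spread of true judge-level disparities.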
In other words, even given the effect of shrinkage, I suspect that “bad luck” in terms of the draw of defendants might well be quite important in driving the judge-specific estimates the authors provide. Even leaving aside the missing-data problem, I think that makes the authors’ choice to name individual judges as “Most Discriminatory” problematic.
Another issue is that the judge-specific estimated disparity (remember, this is the model’s output, formed by taking the weighted average of overall and judge-specific raw disparities) is itself only an estimate, and thus a random variable. So if one picked a judge at random from the authors’ data, it would be inappropriate to assume that the estimated disparity for that judge was the true value. To compare the judge-specific estimated disparity to other judges’ estimated disparities, or to some absolute standard, would require one to take into account the randomness in the estimated disparity. The authors report no such uncertainty measures. Nor does the replication code they posted along with their data indicate that they calculated standard errors of the judge-specific estimated disparities. I can find no indication in either the code or the paper that they investigated this issue before posting their preprint.
(4) The many-draws problem. Consider a simple coin toss experiment. We take a fair coin and flip it 150 times. Roughly 98% of the time, this experiment will yield a heads share of 41.6% or greater (in other words, 41.6% is the approximate 2nd percentile for a fair coin flipped 150 times). So if we flipped a fair coin once, it would be quite surprising to observe a heads share of 41.6% or lower. But now imagine we take 760 fair coins and flip each of them 150 times. Common sense suggests it would be a lot less surprising to observe some really low heads shares, because we’re repeating the experiment many times.
To illustrate this point, I used a computer to do a simulation of exactly the just-described experiment — 760 fair coins each flipped 150 times. In this single meta-experiment I found that there were 13 “coins” with heads shares of less than 41.6%, just under two percent of the 760 “coins”, roughly as expected. Given that we know all 760 “coins” are fair, it would make no sense to say that “the most biased coin is coin number 561”, even though in my meta-experiment it had the lowest heads share (36.7%, more than 3 standard deviations below the mean). We know the coin is fair; it’s just that we did 760 multi-toss experiments, and with that much randomness we’re going to see some things that would be very unlikely with only one experiment.
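The meta-experiment is easy to reproduce. Here is a minimal sketch; the seed below is my own arbitrary choice, so the exact tally of low-share coins will differ from the 13 in my run, but it should land near the roughly 2% expected under fairness:

```python
import random

random.seed(2024)  # arbitrary seed; a different seed shifts the tally
n_coins, n_flips, cutoff = 760, 150, 0.416

# Heads share for each of 760 fair coins flipped 150 times
shares = [sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips
          for _ in range(n_coins)]

low = [s for s in shares if s < cutoff]
print(len(low), f"{min(shares):.3f}")
```

Every coin here is fair by construction; the "most biased" coin in any run is simply the one that drew the worst luck.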
Leaving aside differences across judges in the number of cases heard, this is not that different from what the authors’ approach entails. If all judges had the same number of sentences, then they’d all have the same weights on their raw disparities, and so differences across judges would be entirely due to variation in those raw disparities. If the residual variance component of these raw disparities is substantial (see above), then computing judge-specific model-estimated disparities for each of 760 judges would involve an important component related to idiosyncratic variation. Taking the most extreme out of 760 model-estimated disparities is a lot like focusing on “coin” number 561 in my illustrative experiment above.
Another way to say this is that even if there were zero judge-specific disparity — even if all judges were perfectly fair — we might not be surprised to see substantial variation in the authors’ model-estimated disparities.
Now, it’s not really the case that all judges gave the same number of sentences, so there’s definitely some heterogeneity due to shrinkage as discussed above, which complicates the simpler picture I just painted for illustrative purposes. But I suspect there is still a nontrivial many-draws problem here. Note that this is really an instance of a problem sometimes referred to as “multiple testing” in various statistics literatures; as responders to my Twitter thread noted, one place it comes up is in attempts to measure teachers’ value added in education research, and another is in ranking hospitals and/or physicians. In other words, this isn’t a problem I’ve made up or newly discovered.
* * *
In sum, I think the paper has several serious problems. I do not think anyone should use its reported findings as a basis for deciding which judges are discriminatory, or how much. This is as true for people who lack confidence in the fairness of the system as for people who doubt there is discrimination. In other words, the criticisms I offer do not require one to believe federal criminal sentencing is pure and fair. These criticisms are about the quality of the data and the analysis.
I want to make one final point, as I did in my Twitter thread. Like the authors of the study, I believe that PACER should be made available to researchers. Indeed, I have recently written a whole paper taking that position. But I am very concerned about the impact of their work on that prospect. The work involves problematic methods and choices, and then calls out individual judges for shaming. In my experience there’s nontrivial opposition to data openness within the federal judiciary, and I fear this paper will only harden it.