The Spirit Level and its Critics


   Extremes of financial inequality, their possible sources and their potential dangers, have lately become the focus of an increasing number of books and articles. This has been particularly true since the western financial crisis of 2008-2009. One such book was titled The Spirit Level, written by Richard Wilkinson and Kate Pickett, who are public health epidemiologists in the U.K.  A synopsis of their book, discussing some of its implications, was posted on this blog in September 2013.

     Not long after the publication of The Spirit Level, criticisms of that book began to appear in the U.K. These critiques generally focused on the statistical analyses Wilkinson & Pickett had used, and on the conclusions those analyses were said to support. In general, their critics argued that Wilkinson & Pickett had been inappropriately selective in choosing the countries and the U.S. states that they did, and that in consequence the countries, states, or counties with high levels of undesirable social outcomes or problems were also seen to be those with higher levels of income inequality. It was argued that with more appropriate units of analysis, levels of inequality could be seen to be quite unrelated to those same undesirable outcomes.

     One of the first such critiques was in some ways the strongest. This early critique was published by the U.K. think tank Policy Exchange. It bore the title Beware False Prophets. Its author was Peter Saunders, a retired Professor of Sociology. Just below Saunders’ name on the title page, and in the same large font size, an “Editor” was also listed: Natalie Evans, who was the Deputy Director of Policy Exchange at the time. In what follows I will speak of Dr. Saunders as the sole author of Beware False Prophets, although some of what is argued in that book suggests that Ms. Evans may have drafted certain portions of it. She also published her own critical review of The Spirit Level in The Guardian on July 8, 2010, echoing and praising Saunders’ report, while making no mention of her role in that report or at Policy Exchange.

      At about the same time as the Saunders publication, the U.K. think tank Democracy Institute published a book titled The Spirit Level Delusion: Fact-Checking the Left’s New Theory of Everything. The author was Christopher Snowdon, a freelance journalist and research fellow at the Institute of Economic Affairs (IEA). Over the years, the IEA has received considerable financial support from a number of large corporations, including a number of major tobacco companies. Mr. Snowdon is also the author of the book Velvet Glove: Iron Fist, which is a history that looks critically at past anti-smoking movements and legislation. Snowdon’s criticisms of Wilkinson and Pickett are sometimes similar to those of Saunders, but tend to be rather more strident in tone and scattered in approach.

      A third critique of Wilkinson & Pickett’s book was made by the U.K. Taxpayers’ Alliance. Rather like the Policy Exchange and the Democracy Institute, the Taxpayers’ Alliance is a well-funded free-market lobby group supported by a few large corporations and devoted to encouraging both small government and very low taxes. Clearly there are many wealthy corporations and individuals that are challenged by, and worried about, the Wilkinson & Pickett findings, including the potential effects of those findings on governmental economic policies. In the case of the Taxpayers’ Alliance, their critique was similar to that of Christopher Snowdon. Often it took the form of questions implying that a different study, and another author’s different conclusions, were sufficient to disprove the Wilkinson & Pickett findings. There might still be reasons to withhold judgment on Wilkinson & Pickett’s contributions, but no one study, or even four, can ever be said to fully disprove any scientific conjecture.

 *   *   *

     Wilkinson & Pickett have made detailed and well-reasoned responses to each of their critics. And they have made these responses public on the web. Their responses to their critics can currently be accessed at  In each of these responses they begin by pointing out that the research published in The Spirit Level was carried out in an epidemiological and scientific setting, using data provided by governments and NGOs including the U.N. and the World Bank. Their conclusions were also based on previously published studies carried out by many others, not simply on their own research. Although The Spirit Level was written for a lay audience, Wilkinson & Pickett argue that scientific criticism and scientific debate about their book deserve a scientific arena in which to take place. They write:

Almost all of the research we present and synthesize in The Spirit Level had previously been peer-reviewed, and is fully referenced therein. In order to distinguish between well founded criticism and unsubstantiated claims made for political purposes, all future debate should take place in peer-reviewed publications.

However, Wilkinson & Pickett do go on to address each of the questions about their work raised by their critics.

     Wilkinson & Pickett begin their response by emphasizing the restricted scope of their work and of their conclusions. They write that their work has specifically been concerned with

…a theory of problems which have social gradients—[i.e.] problems which become more common further down the social ladder. So, for example, we would not theorize that alcohol use would be related to inequality, [because] it does not have a social gradient, but that alcohol abuse would be, because it does have a social gradient…. …Our aim was to see if there was a consistent tendency among…countries for [the] health and social problems [that exhibit] social gradients to be more common in societies with bigger income differences.

…In contrast to our approach, much the most common strategy used by our critics has been to selectively remove or add countries to our analyses in an attempt to make the damaging effects of inequality disappear. But it is important to note that the criticisms are entirely ad hoc criticisms of each relationship between inequality and a social outcome. This means that they are irrelevant to almost all of the very many other demonstrations of similar relationships in different settings published in academic journals by other researchers.

     Wilkinson & Pickett then refer to studies showing relationships between levels of income inequality and certain health problems across the regions of Russia, the provinces in China and in Japan, and the counties in Chile. They conclude by saying:

Our analysis suggests that the social gradients which exist in health and [for] many social problems cannot be the result simply of a tendency for social mobility to move the resilient up the social ladder and the vulnerable down. No amount of sorting would explain why problems with social gradients may be anything from twice to ten times as common in more unequal societies. …What the evidence does suggest is that problems which become more common further down the social ladder are substantially a response to social status differentiation itself, and that when greater inequality increases the scale of social differentiation, the problems get worse. Our critics provide no alternative account of why so many problems have social gradients.

 [Additional recent commentary, for other Spirit Level critics, can also be found at ]

*   *   *

      At the beginning of his book, Beware False Prophets, Saunders, speaking about The Spirit Level, writes: “As soon as the book is subjected to even a fairly cursory examination, it becomes obvious that it is deeply flawed.” Unfortunately, the “flaws” that Saunders will begin describing soon afterward are not “obvious” at all, and they are only “apparent” to him after some non-cursory statistical analyses that first require adding to, or much more often deleting, some of the supposedly “inappropriate” data that were being analyzed by Wilkinson & Pickett. Most of The Spirit Level data came from scientific journal articles, vetted by statistically sophisticated (and often skeptical) others. Those reviewers accepted the data in the final form in which the authors published them. If there were flaws in those data and those statistical analyses, they were anything but “obvious.”

     This form of dismissive exaggeration by Saunders reappears frequently throughout Beware False Prophets. And it is true as well for the statistical conclusions he draws from his re-analyses of selected portions of the data in The Spirit Level. For the layman, when evaluating the statistical arguments made by Wilkinson & Pickett, and particularly when evaluating those made by Saunders and other critics, it may appear that a reader requires advanced statistical training to do so. That is partially, yet not entirely, true. A general appreciation of what statistics can and cannot do is not particularly hard to understand. However, a specific understanding of when statistical conclusions and inferences are reasonable often does require advanced training. It should not surprise you if different expert statisticians disagree about the use and interpretation of statistical techniques. The fact of any such disagreement does not mean that one of two opposite points of view must necessarily be wrong. Or even that one of those views is right.

     Saunders’ statistical re-analyses of some findings reported in The Spirit Level generally capitalize on techniques and conventions that currently change what is called a “statistically significant” finding into one that is called “not significant.” A layperson can certainly understand these conventions, even if not the mathematical techniques they rely upon. Here then is a very brief introduction to statistics, and to the conventions that Saunders exploits, written here for the lay reader. [I taught Statistics to university undergraduates and graduate students for many years. I also taught courses in Tests and Measurement Theory.]

*   *   *

     “Statistics” is the name of a special tool, one that is intended to assist us in getting a clearer understanding of confusing data sets and some possible errors in those data. Statistics are of two sorts: descriptive and inferential. Descriptive statistics consist of rules and conventions for describing presumed traits that are common to one or more large or complicated data sets. Some of the better-known descriptive statistics describe “central tendency” (e.g. the average; the median), “variability” (e.g. the standard deviation; the variance), and the “degree of association” (e.g. the correlation coefficient; the “common variance”). All descriptive statistics can sometimes be helpful to understanding, but only after making certain prior assumptions about each data set, e.g. assuming that all the data in any one subset are reflecting much the same thing (even if we don’t know what that thing is), and assuming that, in general, “errors” in measuring each datum have been essentially random. Such assumptions are invariably useful, often justified, and sometimes quite inappropriate. As tools, descriptive statistics do have their limits.

     Inferential statistics are generally tools for understanding how frequently we might observe the same or a greater degree of difference, or pattern-strength, as the one that we have seen in our particular data set, if and when what we have seen actually happens to have been produced only by chance. Suppose, then, that we examine hundreds of similar data sets where we know for certain that all these data came from a world where only random “errors” (i.e. chance) produced the values of each datum in those sets. We then ask, is our particular finding commonly seen, or is it rather unusual, in these control studies where we are confident that only random noise would produce the same suggestive effects as those we saw in our particular study? If our finding is quite uncommon when only chance is producing the outcomes, then we can feel more comfortable about concluding that the effects we have seen are real, and are unlikely to be due to chance.

     With inferential statistics too, however, there are conventions and assumptions. In general, if only random events are at work, and if seeing such an impressive difference or pattern in the data (like the one seen in our study) would only be observed once in every 20 or more studies from randomly produced data, then the convention has become that we will “reject” the idea that mere chance has produced the suggestive difference or pattern that we found in our study. We then say that the outcome in our study is “statistically significant.” But if our observed “suggestive” outcome would happen once in every 19 or 18 random data sets (or even more frequently), then the convention is to label our observed result as “statistically insignificant.” (Perhaps a better designation, or convention, would have been to call these latter results statistically undecided.)
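     For readers who like to see an idea worked through, the reasoning of the last two paragraphs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten pairs of numbers are invented for the example (they are not data from The Spirit Level), the function names are hypothetical, and shuffling one column of numbers is just one simple way of building a world in which only chance links the two measures.

```python
import random

def pearson_r(xs, ys):
    """Correlation coefficient: a descriptive measure of association."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """How often does pure chance produce a pattern at least as strong as ours?"""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Count the observed data set as one of the chance worlds (avoids p = 0).
    return (hits + 1) / (n_shuffles + 1)

# Invented, purely illustrative numbers (NOT data from The Spirit Level).
inequality = [3.4, 3.9, 4.3, 4.8, 5.2, 5.6, 6.1, 6.8, 7.2, 8.5]
problems = [0.2, -0.4, 0.1, 0.5, 0.3, 0.9, 0.6, 1.4, 1.1, 1.9]

p = permutation_p_value(inequality, problems)
label = "statistically significant" if p <= 0.05 else "not significant"
print(f"chance alone matches our pattern in about {p:.4f} of shuffles: {label}")
```

     A value at or below 0.05 corresponds to the “once in every 20” convention described above; any larger value would conventionally be labeled “not significant,” however suggestive the pattern might look.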

     Whenever we call an observed effect “statistically significant” we have temporarily committed ourselves to the idea that some non-random, consistent forces or links have played some part in producing the differences or patterns that we see in our data. But when inferential statistics are used to declare that our data are “not significant,” we cannot logically assert that no consistent forces or links are producing any part of the differences or patterns that still appear to remain in our data set. The “strengths of the effects” seen (i.e. the magnitudes of group differences, or the strengths of various trends seen) are simply dismissed as being “non-significant” and therefore as if “due to chance alone.”

     Many statisticians have recognized that the true “strength of the effect” is almost always what we are really interested in determining. A true but weak effect may still be very informative and interesting, even if the power of our technique for detecting that effect and declaring it “statistically significant,” is not yet strong enough to accomplish that detection. Measuring accurate “effect sizes,” and estimating the likely range of errors for such measurements, constitutes a difficult and as-yet under-appreciated branch of statistics. But to say that there is no statistical significance in a data set does not logically imply that the effect size is zero, or near zero, or negligible. Nor does it mean that the effect size cannot potentially turn out to be important or informative and possibly rather greater than is yet apparent. Random measurement error is, in theory, equally likely to produce underestimates of, or overestimates of, any effect size that we are trying to assess.
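     A small sketch may make this point concrete. It assumes a hypothetical observed correlation of 0.40 based on only 12 data points, and it uses the standard Fisher z approximation to estimate a 95% range of plausible values for the true correlation. Both the numbers and the function name are illustrative assumptions; they are not taken from either book.

```python
import math

def correlation_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation (Fisher z method)."""
    z = math.atanh(r)                       # move r onto a roughly normal scale
    half_width = z_crit / math.sqrt(n - 3)  # the interval widens as n shrinks
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Hypothetical values chosen only for illustration.
low, high = correlation_interval(r=0.40, n=12)
crosses_zero = low < 0 < high

print(f"plausible range for the true correlation: {low:.2f} to {high:.2f}")
print("conventionally labeled 'not significant':", crosses_zero)
```

     Under these assumptions the plausible range includes zero, so the result would be called “not significant.” Yet the same range also includes quite strong positive correlations. Such data have not shown that the effect is absent; they have only failed to pin it down.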

     There are two main conditions that will reduce the power of inferential statistics to detect true effects. One is when there are “large” amounts of random measurement error (“noise”) distorting each datum in the sets being examined. “Cleaner” (i.e. better) measurements, reducing random errors, may help to overcome this condition. The second condition is if the number of data in each set is too small to allow us to pick out the signal hiding amid the noise being produced by random measurement errors. Unfortunately, when you reduce the number of data points, by omitting data that you have decided do not belong, then the “power” of inferential statistics to detect true effects declines accordingly. Thus, after deleting some data it becomes easier to conclude that the trends in the data have not yet reached “statistical significance.”
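     Both conditions can be illustrated with a short simulation, offered here as a rough sketch rather than a re-analysis of anyone’s data. It builds many artificial studies in which a real effect is known to exist, then counts how often that effect is declared “significant.” The sample sizes (23 and 12), the noise levels, the strength of the effect, and the function names are all hypothetical choices, and the significance test uses the Fisher z approximation rather than an exact method.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def looks_significant(xs, ys, z_crit=1.96):
    """Approximate two-sided test at the 1-in-20 level (Fisher z approximation)."""
    z = math.atanh(pearson_r(xs, ys)) * math.sqrt(len(xs) - 3)
    return abs(z) > z_crit

def estimate_power(n_points, noise_sd, slope=0.5, n_sims=2000, seed=1):
    """Fraction of simulated studies in which a TRUE effect is detected."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sims):
        xs = [rng.gauss(0, 1) for _ in range(n_points)]
        # The effect is real by construction: y depends on x, plus random noise.
        ys = [slope * x + rng.gauss(0, noise_sd) for x in xs]
        if looks_significant(xs, ys):
            detected += 1
    return detected / n_sims

full = estimate_power(n_points=23, noise_sd=1.0)     # hypothetical full data set
trimmed = estimate_power(n_points=12, noise_sd=1.0)  # same effect, fewer points
noisy = estimate_power(n_points=23, noise_sd=2.0)    # same effect, more noise

print(f"real effect detected with 23 points:     {full:.0%} of studies")
print(f"real effect detected with 12 points:     {trimmed:.0%} of studies")
print(f"real effect detected with noisier data:  {noisy:.0%} of studies")
```

     In every simulated study the effect is genuinely present, yet it is detected less often once data points are removed or the measurements become noisier. A verdict of “not significant” after trimming is therefore exactly what one would expect to see even when the underlying relationship is real.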

     Too often Saunders dismisses as “insignificant,” trends that become harder and harder to detect after he has trimmed the data set. Too often he treats his newly “insignificant” findings as constituting a sound proof that no effect or no relationship was ever there at all. He does not doubt himself. He does not see the questions he is addressing as being particularly complex or deserving of further study before we might conclude that inequality is not a problem for society.

     For me, Saunders’ apparent impatience with those who see the world differently than he does, with those scientists whose cultural traditions and views of what constitute “good science” include an even-handed respect for others who examine the same phenomena, is an unhelpful and regrettable aspect of his writing. Time has a way of humbling most scientists and policy thinkers, bringing fresh viewpoints and new counter-examples to falsify their former beliefs. That is one reason why scientists should never imply, as Saunders so forcefully does, that we have finally arrived at the ultimate and “correct” interpretation of some portion of the world, and thereby have exposed its former false prophets.

*   *   *

     Saunders ends his long critique by concluding not only that inequality is almost certainly unimportant for any useful discussion of social policy, but also that it would actually be particularly dangerous to initiate any steps designed to reduce inequality. He sees grave dangers, mostly of the severe economic sort, if ever governments were to return to the regulatory and taxation policies that were in place for many years during and after the Second World War. And yet, during those same years economic growth in the U.K. and the U.S.A. was sustained and strong, while financial inequality remained at a much lower level.

     Sadly, however, after insisting that his statistical evidence shows that Wilkinson & Pickett’s analyses were highly flawed and their evidence lacking, Saunders does not offer us, anywhere in Beware False Prophets, any evidence at all to support his own rather doubtful prophecy of grave economic consequences for any society that flirts with policies intended to reduce inequality. True, policies to reduce inequality might possibly prove rather inconvenient for some financial supporters of the U.K. Policy Exchange who provided Saunders with part of his income for his work on their behalf. But Wilkinson & Pickett’s detailed rebuttal of Saunders’ arguments, with their many references to additional work published in scientific journals, articles reporting findings that replicate many of the trends that Wilkinson & Pickett reported in The Spirit Level, is telling. We are left with no good reason to dismiss their work, nor to condemn their provisional interpretations and conclusions drawn from it. Only time and closer scientific study can determine to what extent inequality amplifies social problems that are associated with social gradients, or under what conditions inequality might not increase those problems.

     I submit that modest-scale social and political experiments, testing policy interventions designed to constrain inequalities of wealth, of access to education, and of access to justice, could become very helpful for eventually deciding where the truth about inequality really lies. Sadly, however, the existing large inequalities in social and political power appear likely to prevent such experiments from ever taking place, particularly in those countries where they appear to be needed most.

*   *   *

© J. Barnard Gilmore    Kaslo, British Columbia    September 2014
