The problem with Galton's problem

Fri, Mar 6, 2020

We need to talk about “Galton’s Problem”.

That is, the statistical non-independence of cultural data points due to a common source - famously identified by Francis Galton during his critique of Edward Tylor’s analysis of marriage practices in 1888. This post is not about the importance of this idea when analysing cross-cultural data. The consideration and explicit analysis of non-independence has advanced the quantitative analysis of cultural data considerably, leading to many interesting discoveries about our past. This post is a call to reconsider what we name this phenomenon.

The last few years have highlighted how often society places people with disreputable ideas in positions of admiration (although this problem has a much longer history). The naming of scientific observations after notable scientists who highlight an idea is no exception to this. Francis Galton is widely considered the father of eugenics and actively petitioned for the selective breeding of people within the UK. There are few better examples of disreputable ideas than this. Throughout the academic world there is a reconsideration of how we should relate to historical figures - UCL commendably renamed lecture theaters after reassessing their relationship with their previously esteemed alumni.

The argument often made for maintaining the names of people like Galton in prominent positions is that we are admiring their scientific discoveries and not their politics. This is a complicated argument to take up and a bigger debate than I hope to address here. But if our goal is championing the importance of science and communicating great ideas of the past, naming them after leading figures is an awkward and opaque route to take. Taken at face value, “Galton’s problem” gives no indication of what the problem is and, given his eugenic beliefs, has quite sinister undertones.

Statistics is notorious for attributing ideas to the discoverer: the Gaussian distribution, Pearson’s correlation, Student’s t-test (which has its own interesting history). More recently, however, statistical methods have tended to be named descriptively: linear regression, neural networks, classification trees. Descriptive names offer transparency and are historically unproblematic, and this is an approach we should apply retrospectively.

I am not writing this to admonish people who have used this term. In part, this is a letter to myself, having spent much of my academic career thinking about the implications of interdependencies in cultural data and writing using the given terminology. There is a need in both my home fields of statistics and anthropology to reassess the use of nomenclature and to be more considerate towards those who might be ostracized by the honorifics used towards people like Galton.

Moving forward, I will describe this phenomenon as forms of autocorrelation. Autocorrelation is the statistical term for the relationship between data points within a variable. While this typically refers more specifically to serial autocorrelation (the correlation of a signal with a delayed copy of itself, as a function of the delay), prefixing autocorrelation with the acting process is a descriptive and transparent way to name the type of autocorrelation at hand. For patterns of shared ancestry: phylogenetic autocorrelation; when describing geographical relationships: spatial autocorrelation.
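For readers who prefer to see the idea in code, here is a minimal sketch of two of these forms. It is an illustration under my own assumptions rather than a recommended analysis: the function names, the toy values, and the four-society chain of neighbours are all hypothetical. The first function computes serial autocorrelation at a given lag; the second computes Moran's I, a common index of spatial autocorrelation, from a matrix of neighbour weights. In principle the same weight matrix could encode relatedness on a phylogeny instead of geographic adjacency, which is one way to think about phylogenetic autocorrelation.

```python
from statistics import fmean


def serial_autocorrelation(x, lag):
    """Correlation of a series with a copy of itself delayed by `lag` steps."""
    n = len(x)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    mean = fmean(x)
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series is constant, autocorrelation is undefined")
    # Sum of products of deviations that are `lag` steps apart.
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom


def morans_i(values, weights):
    """Moran's I for `values` given a square matrix of neighbour `weights`."""
    n = len(values)
    mean = fmean(values)
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    denom = sum(d * d for d in dev)
    if total_weight == 0 or denom == 0:
        raise ValueError("weights or values carry no variation")
    # Weighted cross-products of deviations between every pair of units.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / total_weight) * cross / denom


# Hypothetical example: four societies along a line, each neighbouring the next.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [1, 2, 8, 9]    # neighbours tend to be similar
alternating = [1, 9, 1, 9]  # neighbours tend to differ

print(morans_i(clustered, chain))    # positive (about 0.4)
print(morans_i(alternating, chain))  # negative (about -1.0)
```

A positive value means neighbouring units resemble each other more than chance would suggest, which is exactly the non-independence this post is about; a value near zero would suggest the units can reasonably be treated as independent with respect to that weight matrix.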

As an added bonus, this terminology improves interdisciplinarity by aligning with other fields concerned with autocorrelation, such as biology and epidemiology. Some more technically minded people might argue that autocovariance is a more appropriate term, and this is a debate I am happy to have.

Statistics and cross-cultural research have a sordid history with racism. There is much more history to address, many more ideas to rethink, and much more work to do. But this change is a small and achievable step in the right direction.