The Heisenberg Uncertainty Principle of Social Science Modeling

Arguably, all of scientific inquiry in modern times begins with some sort of model. A model takes different parameters you are studying and uses them to make some claim about how our world works. It is a reduction of reality with the aim of reconstructing a picture of truth, whether about the spread of disease, the population of a toad species or the number of people who will move in 2020.

But as the number of things we’re trying to study grows, the chance of getting even close to the target reality falls. The reason for this trade-off is the “curse of dimensionality.” It is not a rule of thumb or a limit due to measurement errors, but as much a mathematical fact as the Pythagorean theorem—and it puts fundamental limits on what economics and other social sciences can describe. The curse of dimensionality is why our estimates of how a disease will behave will always have imprecision.

The word “dimensions” most commonly refers to the space and time we occupy, but it can also mean any set of measurable things that are independent of each other. For example, let’s say we want a model of how a public health campaign would affect COVID-19 spread. We might use factors such as the estimated incubation period of the disease under given conditions (call it X), the percentage of people who would wear masks in public given a public health campaign (Y), an estimate of person-to-person transmission likelihood (Z), and so on, to estimate patterns of spread. To make predictions about the effects of our campaign, we will need to find numeric values for X, Y and Z (that is, fit the model to data).

The model has three independent dimensions, so the model parameters X, Y and Z can be read as points in 3-D space. If we get our best fit for what X, Y and Z should be using our real-world data and modeling techniques, will our estimate come close to the true value (which we would only be able to observe directly if we were omniscient)?

To answer this question, we need to think about how shapes behave in different dimensions.

If you have some solid shape with a thin shell around it, the shell holds a surprising amount of the volume. Get an orange from the supermarket nine centimeters in diameter, whose peel is just 0.45 cm thick. The flesh inside has 90 percent of the orange’s radius and so 0.9 cubed, or about 73 percent, of its volume. The remaining 27 percent or so is in the peel.
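The peel arithmetic can be checked with a few lines of Python. This is a minimal sketch that treats the orange as a perfect ball; the function name `shell_fraction` and its `dims` parameter are illustrative choices, not anything taken from the article. It shows that a thin peel holds a bit more than a quarter of a three-dimensional orange, and a much larger share in higher dimensions.

```python
def shell_fraction(diameter, thickness, dims=3):
    """Share of a ball's volume lying within `thickness` of its surface.

    Volume scales with radius ** dims, so the inner ball's share is
    (inner_radius / outer_radius) ** dims and the shell holds the rest.
    """
    outer_radius = diameter / 2
    inner_radius = outer_radius - thickness
    return 1 - (inner_radius / outer_radius) ** dims


# The supermarket orange: 9 cm across with a 0.45 cm peel.
print(f"3-D peel share: {shell_fraction(9.0, 0.45):.1%}")

# The same proportions in nine dimensions (a hypothetical nine-dimensional orange).
print(f"9-D peel share: {shell_fraction(9.0, 0.45, dims=9):.1%}")
```

Because the inner share is a ratio raised to the power of the dimension, the shell's share grows quickly as dimensions are added.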

What if your orange came in a gift box, exactly big enough that the fruit touches all sides? Stepping back to two dimensions for a moment, a circle takes up 78.5 percent of the area of its tightest-fitting square. In three dimensions the orange fills 52.4 percent of the volume of its box, and the remaining 47.6 percent is empty air. As the number of dimensions rises, the percent of inside-the-box volume that is the fruit itself shrinks further. A four-dimensional ball is 30.8 percent of the volume of the box. By nine dimensions, the tightest-fitting box is 99.36 percent empty. Or if you are an optimist, the box is 0.64 percent full.
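The fruit-in-a-box percentages follow from the standard formula for the volume of an n-dimensional ball, pi^(n/2) / Gamma(n/2 + 1) times the radius to the power n. The sketch below divides that by the volume of the tightest-fitting box. The function name `ball_fraction_of_box` is made up for illustration, and the box is assumed to have sides of length 1, so the ball has radius 0.5.

```python
import math


def ball_fraction_of_box(dims):
    """Volume of a ball of diameter 1 divided by the volume of its unit box."""
    radius = 0.5
    unit_ball_volume = math.pi ** (dims / 2) / math.gamma(dims / 2 + 1)
    # The box has side 1, so its volume is 1 in every dimension.
    return unit_ball_volume * radius ** dims


for dims in (2, 3, 4, 9):
    share = ball_fraction_of_box(dims)
    print(f"{dims} dimensions: ball fills {share:.2%} of the box")
```

Running the loop over other values of `dims` shows how quickly the ball's share collapses toward zero.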

Now, let’s think about the COVID-19 model as if it existed as a shape in three-dimensional space. Imagine the center of the box to be the true value of X, Y and Z, and the tight-fitting box to be the range of our best guesses regarding each parameter by itself. Define “close” as being inside the sphere that sits at the center of the box and just touches its sides. The word “close” has the obvious physical interpretation, but also makes sense in information space, where we need our estimate of X, Y and Z to be within a short distance of the true value. The fact that a point chosen haphazardly in the higher-dimensional box has such a small chance of being close to the center is an example of the curse of dimensionality.
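One way to see this is to simulate it. The sketch below assumes, purely for illustration, that an estimate is equally likely to land anywhere in the box (one reading of “chosen haphazardly”), and counts how often it falls within the central ball. The function name `chance_close`, the sample size and the seed are arbitrary choices.

```python
import random


def chance_close(dims, trials=50_000, seed=0):
    """Estimate the chance that a uniform random point in the box
    [-1, 1]^dims lies within distance 1 of the center."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    hits = 0
    for _ in range(trials):
        # Squared distance from the center, which stands in for the true values.
        dist_sq = sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(dims))
        if dist_sq <= 1.0:
            hits += 1
    return hits / trials


print(f"3 parameters: {chance_close(3):.1%} of estimates are close")
print(f"9 parameters: {chance_close(9):.2%} of estimates are close")
```

Real estimation errors are rarely uniform, so the exact figures should not be taken literally; the point is only how sharply the share of close estimates falls as parameters are added.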

Say we want our model to be more descriptive. A public health campaign could cause people to go to the supermarket A percent less often, and induce B percent to work from home, and C percent to stop taking public transportation. Adding these parameters brings us to a six-dimensional model (A, B, C, X, Y, Z), and we could easily brainstorm three or four more. If we could put good bounds on the numeric range of each variable, we could put our estimate in a tight-fitting box around the true nine-dimensional parameter value, which gives us a 0.64 percent chance of our full model with all nine moving parts being close to truth.

This is the balancing act of model design. We want to make our models more descriptive by adding more interacting elements, but the curse of dimensionality all but guarantees that if you try to fit a model with a large number of parameters to data, your fit won’t be close. We can get good estimates of the effect of a public health campaign in a broad context with few details, or we can get an imprecise estimate in a focused and detailed setting, but a high level of detail and a precise estimate for all those parameters is near impossible.

The solutions for a researcher are to avoid simultaneously estimating parameter sets, to accept models with limited scope and few moving parts, to structure models with many more assumptions to reduce informational dimensions, or to put extraordinary work into pegging down every parameter with great precision. In short, resist the desire to fit the latest data set to a model of everything. The solution for readers of research is to accept the limitations of models that do not try to be a theory of everything, and maintain skepticism of models that seem to flout the curse.
