The Lexwerks

Answering the Wrong Question

The things we learn from Big Data are only going to be as true as the model we inject into our data structure to make the data readable. We need a check on possible flaws and blind-spots in our models.

Given all of the families with two children that can be identified as a boy or a girl, what percentage of the families that have a girl also have a boy? You may well guess 50%, but if we put together a truth table involving the possibilities of pairs of children — BB, BG, GB, GG — and then eliminate the one we don’t care about (BB, because it doesn’t have our minimum of 1 girl) we see that there are 3 options left, 2 of which involve boys, so our answer is about 66%. Which sounds cool because it’s not what you expected, but the question seems a bit complex so let’s simplify it: given a girl with 1 sibling, what are the chances that she’s got a brother? Truth table says 2-out-of-3.
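For anyone who wants to check the family-level count by hand, here is a minimal sketch. It assumes boys and girls are equally likely and that the two births are independent, which the truth table quietly assumes as well; the variable names are mine, not part of any standard treatment.

```python
from fractions import Fraction
from itertools import product

# All equally likely (older, younger) pairs: BB, BG, GB, GG.
families = ["".join(pair) for pair in product("BG", repeat=2)]

# Family-level question: keep only families with at least one girl.
with_girl = [f for f in families if "G" in f]
with_girl_and_boy = [f for f in with_girl if "B" in f]

family_level = Fraction(len(with_girl_and_boy), len(with_girl))
print(with_girl, family_level)  # ['BG', 'GB', 'GG'] 2/3
```

Note that the unit being counted here is the family, not the child.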

And now we know that the truth table is lying to us. The dodgy bit is where the question shifts from a family that contains children down to the child-to-child relationship. See, we previously said “any” girl may have an older brother or a younger brother, but since we’re looking at “any” girl we can’t distinguish between sisters. But once we’re looking at “this” girl, then she may have an older brother or sister, or she may have a younger brother or sister. The corrected truth table is BX, GX, XB, XG. The possibility of two brothers doesn’t enter into our set because we know we’re talking about this girl so we can’t create and eliminate new false options just to throw the probabilities off. We can be pretty certain that her sibling is not a vampire, werewolf, zombie, or Mi-go from Pluto, without reducing the chances that her sibling is a boy or a girl.
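The corrected table can be sketched the same way, under the same assumption of equally likely, independent births. The only change is the unit of counting: one record per girl rather than one record per family. The tuple layout below is just an illustrative choice.

```python
from fractions import Fraction
from itertools import product

families = ["".join(pair) for pair in product("BG", repeat=2)]

# Child-level question: one record per girl, paired with her sibling.
# family[1 - position] is the other child in a two-child family.
girls = [
    (family, position, family[1 - position])
    for family in families
    for position, child in enumerate(family)
    if child == "G"
]

has_brother = [g for g in girls if g[2] == "B"]
child_level = Fraction(len(has_brother), len(girls))
print(len(girls), child_level)  # 4 1/2
```

The four records line up with BX, GX, XB, XG: the GG family contributes two girls, each with a sister, which is exactly what the family-level count collapses into a single row.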

If you are familiar with SQL, you may be re-formulating the first problem into a query with a “having” clause while the second query uses a “where” clause. The difference is, to simplify things a bit, a question of the order of operations in terms of where you start counting for the initial set. It seems like a little thing, but as Cathy O’Neil nails exactly:

Don’t be fooled by the mathematical imprimatur: behind every model and every data set is a political process that chose that data and built that model and defined success for that model.
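To make the earlier SQL aside concrete, here is a rough sketch using Python's built-in sqlite3 module. The table name, column names, and the four-family toy data set are all hypothetical, chosen only to mirror the truth table; a real schema would look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE children (family TEXT, birth_order INTEGER, sex TEXT)")
# One toy family per equally likely pair: BB, BG, GB, GG.
rows = [(fam, i, sex) for fam in ("BB", "BG", "GB", "GG") for i, sex in enumerate(fam)]
conn.executemany("INSERT INTO children VALUES (?, ?, ?)", rows)

# HAVING: group into families first, then keep the families with a girl.
having_query = """
    SELECT AVG(has_boy) FROM (
        SELECT MAX(sex = 'B') AS has_boy
        FROM children
        GROUP BY family
        HAVING MAX(sex = 'G') = 1
    )
"""

# WHERE: keep individual girls first, then look up each girl's sibling.
where_query = """
    SELECT AVG(sib.sex = 'B')
    FROM children AS girl
    JOIN children AS sib
      ON sib.family = girl.family AND sib.birth_order <> girl.birth_order
    WHERE girl.sex = 'G'
"""

family_share = conn.execute(having_query).fetchone()[0]
girl_share = conn.execute(where_query).fetchone()[0]
print(round(family_share, 3), girl_share)  # 0.667 0.5
```

Same table, same rows, two different answers, and the only thing that changed is whether the filter runs after the grouping or before it.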

When we’re talking about abstractions of siblings it seems pretty minor. I mean, so what if statisticians prefer feeling clever to knowing how actual people relate to each other? It’s not like they’re some Maestros of Economics that are using a flawed model which both depends on and ignores greed to manage monetary policy for a nation, are they? Well, except when they are.

But my actual concern isn’t that the model might be flawed — after all, the first truth table did appear to answer a question. My concern is that the very clever people will take their very clever answers and attach them to very basic questions so that other people can quickly understand just how clever the very clever people are, without realizing that they’ve also become very wrong. As Tavris and Aronson discuss at great length in Mistakes Were Made (but not by me), when people have a well-loved adjective — “clever!” — at the core of their identity — “I am clever!” — it becomes difficult for them to admit mistakes that would repudiate that adjective. If they know themselves, and they know that they are clever, and they prove their cleverness by this model that says “66% brothers,” then disproving the model would disprove their cleverness and thus disprove their very identity. In these cases where the cleverness of their problem-solving is core to their identities, the possibility of error isn’t just being guilty of a singular mistake but is a shameful shortfall in the essence of their being.

Of course this is common among people who have a strong sense of identity. Rudy Giuliani claimed to be a great mayor as demonstrated by reductions in crime stemming from his commitment to fixing broken windows. Levitt and Dubner then claimed to be great thinkers and authors, writing that Giuliani’s reduced crime rate was actually just a side effect of legalized abortions. But then Rick Nevin comes back and says they’re both effectively wrong: the elimination of lead from gasoline reduced lead poisoning — and correlated crime — in large cities in that time-frame. And that’s an increasingly fascinating way to challenge the identity of a politician and challenge the identity of “a rogue economist” because those identities were based on an interpretation of reality that didn’t include all of reality. I knew about the first two claims but didn’t find out about the last one — the leaded gasoline — until Alistair Croll (who has a better site name than mine) mentioned it in a very interesting presentation on Big Data where he also mentioned that a girl with one sibling will have a 66% chance of having a brother.

The upshot — or one of the upshots — of Croll’s presentation was that as data becomes cheap, context becomes valuable. And this is what the siblings example demonstrates: if we misread — or even just mis-repeat — the context for our selected data and model, we will apply the wrong model and arrive at the wrong answer. And then our belief in ourselves as rational and competent people (demonstrated by our very intentional selection of data and model and such) blinds us to the possibility that we made a mistake because it’s easier to believe that they don’t even know how wrong they are than that we don’t even know how wrong we are.

Thus, as Nietzsche said, we can “put an end to many brief follies, with a single long stupidity” by identifying ourselves with a bad model built around the wrong data. Now Nietzsche was talking about marriage, but Tavris and Aronson talk about it too: when arguments in a relationship move from things a person does (guilt) to the way a person is (shame) and thus attack an identity, they become laced with contempt and the relationship is pretty much over. And the identity of the Other has to be attacked to justify the choice to attack it while actively ignoring the benefits that also come from the identity which probably helped get the relationship started in the first place. So to close this tangent, be concerned if your significant other spends more than 16% of your time with them snubbing the way you are — a data set spanning the whole of your relationship, but mysteriously containing only the bad bits, is being compiled to errantly explain why they’ve never loved you.

Nothing so subtly influences a model as the order of operations used to select the data. My favorite example of this should be near and dear to the heart of all Douglas Adams fans: the answer to the ultimate question of life, the universe, and everything, is 42. At the end of The Restaurant at the End of the Universe it is suggested that the ultimate question is “what do you get if you multiply six by nine” which people have claimed was using a different mathematical notation or was clearly the wrong question as part of the world-building, but I pose that it’s a meta joke that works like this: 6 * 9 = (1 + 5) * (8 + 1) = 1 + 5 * 8 + 1 = 42, with the meaning being that the reason life, the universe, and everything doesn’t make any sense and therefore needs a simple answer followed promptly by a question which still doesn’t make sense — “42? But multiplying 6 by 9 is 54!” — is because our order of operations is wrong. And when we change our order of operations to go from crunching comfortingly wrong numbers back to the business of living our lives, then life, the universe, and everything will seem a bit more reasonable.
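The arithmetic in the joke is easy to check. This is only a sketch of my reading of it, not anything Adams is known to have intended: dropping the parentheses changes which operation is applied first.

```python
# Parentheses respected: both sums are evaluated before the product.
with_parentheses = (1 + 5) * (8 + 1)

# Parentheses dropped: multiplication binds tighter, so only 5 * 8 is a product.
without_parentheses = 1 + 5 * 8 + 1

print(with_parentheses, without_parentheses)  # 54 42
```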

This is very important to remember. Alfred Korzybski told us that “the map is not the territory” while René Magritte warned us that “Ceci n’est pas une pipe”* but the monetary value of mining Big Data is enticing people to forget those warnings. What’s at stake here? Well, let’s look at our initial model and ask who we inadvertently disenfranchised from our sample set. Obviously we disenfranchised families where one of the kids is a vampire, a werewolf, a zombie, or a Mi-go from Pluto — and that’s probably no big loss. But we also disenfranchised families with two boys for no given reason. We’re ignoring families with only one child, but also three-or-more children. While the cut of the model was arbitrary, it isn’t clear if the siblings are full siblings or if half-siblings, step-siblings, or adoptive siblings are included in the sample. We’re uncertain of the status of siblings that are deceased or were given up for adoption, and we certainly don’t know how the occasional (post-SRS?) transgender sibling or hermaphroditic sibling would count. And the possibility that we’re at all interested in couples with no children or single people at all is straight out. Remember what O’Neil said: behind every model and every data set is a political process that chose that data. This echoes Foucault in Discipline and Punish: the setting of boundaries of the data set establishes “a normalizing gaze [that] establishes over individuals a visibility through which one differentiates them and judges them,” though in this case they’re being judged to be in or out of a demographic. And this goes back to another particularly brilliant point that Croll makes: the use of Big Data to define our target audience creates a Civil Rights issue. Because there’s a very thin line between offering something to people who are statistically likely to be interested in it and not making that offer to some other people because they don’t quite fit in your model.

But I should mention at this juncture that I fully agree with Dr. Jung when he says that “to be normal is the ideal aim of the unsuccessful,” and so after a post about bad models over-normalizing life — the Discordians would say that it is indicative of the Curse of Greyface — I find myself wanting to re-watch Shawn Achor’s amazingly enjoyable speech on normalization, especially at 3:20 in:

But if that’s too much to think about at the moment, I would like you to do this: have some friends (or even enemies; as Nietzsche says, “Ye must be proud of your enemies”) you trust enough to accept contradiction from and when they do prove you wrong — which they should if you’re having interesting enough conversations with them — be impressed that they are able to do it and thank them for it. From this exercise you should be training your ego to recognize that it isn’t always right, and it’s okay for other people to be right, such that you hopefully won’t always re-imagine yourself to be the hero of your anecdotes to the detriment of the contributions of your peers and lovers. In this way you won’t ever have to wonder what you ever saw in that Other person because you won’t be mis-attributing their virtues to yourself — and the complex answer to the immediate question of why you’re in their company will seem a bit more reasonable.

* Side note: I want t-shirts for “Ceci n’est pas une piñata” and “Ceci n’est pas une drill”