How to spot a data charlatan

Original article was published by Cassie Kozyrkov on Artificial Intelligence on Medium


Tips for identifying fakers and neutralizing their snake oil

You might have heard of analysts, ML/AI engineers, and statisticians, but have you heard of their overpaid cousin? Meet the data charlatan!

Attracted by the lure of lucrative jobs, these hucksters give legitimate data professionals a bad name.

Data charlatans are everywhere

Chances are that your organization has been harboring these fakers for years, but the good news is that they’re easy to identify if you know what to look for.

Data charlatans are so good at hiding in plain sight that you might be one without even realizing it. Uh-oh!

The first warning sign is a failure to understand that analytics and statistics are very different disciplines. I’ll give you a brief overview in the next section, but if you’d like to understand it more deeply, I’ve written a whole article here.

Different disciplines

While statisticians are trained in inferring what’s beyond their data, analysts are trained in exploring the contents of their dataset. In other words, analysts make conclusions about what’s in their data while statisticians make conclusions about what isn’t.

Analysts help you come up with good questions (hypothesis generation) while statisticians help you get good answers (hypothesis testing).

There are also fancy hybrid roles for people who can wear both hats… but they don’t wear both hats at the same moment. Why not? A core principle of data science is that if you’re dealing with uncertainty, it’s not valid to use the same datapoint for both hypothesis generation and testing. When you have limited data, uncertainty forces you to choose between statistics and analytics. (Find my explanation here.)

Without statistics, you’re stuck unable to know whether the opinion you just formed holds water.

Without analytics, you’re flying blind with little opportunity to tame your unknown unknowns.

That’s a tough choice! Do you open your eyes to inspiration (analytics) while vowing to forgo the satisfaction of knowing whether your newfound opinion holds water? Or do you break into a cold sweat praying that the question you’ve chosen to ask — by meditating alone in a broom closet without any data — is worth the rigorous answer (statistics) you’re about to get for it?

Peddlers of hindsight

The charlatan’s way out of this bind is to ignore it, finding Elvis’s face in a potato chip and then pretending to be surprised that the same chip looks Elvis-like. (The logic of statistical hypothesis testing boils down to asking whether our data surprise us enough to change our minds. How could we be surprised by data we’ve already seen?)

Do these look like a rabbit and a portrait of Elvis to you? Or perhaps a presidential portrait? For fun with this topic, see my related article here.

Whenever charlatans find a pattern, get inspired, then test the same data for that same pattern to publish the result with a legitimizing p-value or two next to their theory, they’re effectively lying to you (and maybe to themselves too). That p-value has no meaning unless you committed to your hypothesis BEFORE you looked at your data.
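
To make that concrete, here is a minimal simulation sketch (mine, for illustration only). It assumes 20 hypothetical "chip features" that are pure noise, picks whichever one looks most impressive, and then runs a simple one-sided z-test with known variance. The function names, the sample sizes, and the choice of test are all assumptions made for the example.

```python
import random
from statistics import NormalDist, fmean


def z_test_p_value(sample, sigma=1.0):
    """One-sided p-value for H0: mean <= 0, assuming a known sigma."""
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 1.0 - NormalDist().cdf(z)


def simulate(n_trials=1000, n_features=20, n_obs=30, alpha=0.05, seed=0):
    """Return (hindsight_rate, foresight_rate) of 'significant' results on pure noise."""
    rng = random.Random(seed)
    hindsight_hits = 0
    foresight_hits = 0
    for _ in range(n_trials):
        # Pure noise: no feature has any real effect.
        explore = [[rng.gauss(0, 1) for _ in range(n_obs)]
                   for _ in range(n_features)]
        # Exploration: pick the feature that looks most impressive.
        best = max(range(n_features), key=lambda j: fmean(explore[j]))
        # Hindsight: test the chosen feature on the data that inspired the choice.
        if z_test_p_value(explore[best]) < alpha:
            hindsight_hits += 1
        # Foresight: commit to the hypothesis, then test it on fresh data.
        fresh = [rng.gauss(0, 1) for _ in range(n_obs)]
        if z_test_p_value(fresh) < alpha:
            foresight_hits += 1
    return hindsight_hits / n_trials, foresight_hits / n_trials


hindsight_rate, foresight_rate = simulate()
print(f"false alarms when testing after peeking: {hindsight_rate:.0%}")
print(f"false alarms when testing on fresh data: {foresight_rate:.0%}")
```

Under these assumptions there is nothing real to find, so every "significant" result is a false alarm. With 20 independent noise features, the peeking rate should land near 1 − 0.95^20 ≈ 64%, while the fresh-data rate stays close to the nominal 5%.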

Charlatans mimic the actions of analysts and statisticians without understanding the reasons for them, giving the entire field of data science a bad reputation.

True statisticians always call their shots

Thanks to the statistics profession’s near-mystical reputation for rigorous reasoning, snake oil sales in data science are at an all-time high. It’s easy to cheat this way without getting caught, especially if your unsuspecting victims think that it’s all about the equations and data. A dataset is a dataset, right? Wrong. How you use it matters.

Luckily for their would-be marks, you only need one clue to catch them: charlatans peddle hindsight.

A charlatan peddles hindsight — mathematically rediscovering phenomena that they already know to be in the data — while a statistician offers tests of foresight.

Unlike charlatans, good analysts are paragons of open-mindedness, always pairing inspirational insights with reminders that there could be many different explanations for the observed phenomena, while good statisticians are careful to call their shots before they take them.

Good analysts are paragons of open-mindedness. Unlike charlatans, they don’t make conclusions beyond their data.

Analysts produce inspiration

Analysts are exempt from calling their shots… as long as they aren’t reaching beyond their data. If they’re tempted to make claims about things they haven’t seen, that’s a different job. They should take off their analyst hat and put on their statistician helmet. After all, whatever your official job title, there’s no rule that says you can’t learn both trades if you want to. Just don’t get them confused.

How a charlatan tests hypotheses. Meme: SOURCE.

Being good at statistics does not mean you’re good at analytics and vice versa. If anyone tries to tell you otherwise, check your pockets. If that person tells you that you are allowed to do statistical inference on data you’ve already explored, check your pockets twice.

Hiding behind fancy explanations

If you observe data charlatans in the wild, you’ll notice that they love to spin fancy stories to “explain” observed data. The more academic-sounding, the better. Never mind that these stories only (over)fit the data in hindsight.

When charlatans do that — let me not mince words — they’re bullshitting. No amount of equations or pretty pontification can make up for the fact that they’ve offered exactly zero evidence that they knew what they were talking about beyond their data.

Don’t be impressed by how fancy their explanation is. For it to be statistical inference, they’d have to call their shots before they see the data.

It’s the equivalent of showing off their “psychic” powers by first peeking at the hand you’ve been dealt and then predicting that you’re holding… whatever you’re holding. Brace yourself for their novel on how it was your facial expression that gave it away. That’s hindsight bias and the data science profession is stuffed to the gills with it.

Analysts say, “That’s a queen of diamonds you just played.” Statisticians say, “I wrote my hypotheses down on this scrap of paper before we started. Let’s play, observe some data, and see if I’m right.” Charlatans say, “I knew you were going to play that queen of diamonds all along, because…” (Machine learning says, “I’m going to keep calling it in advance and seeing how I did, over and over, and I may adapt my reaction towards a strategy that works. But I’ll do that with an algorithm because keeping track of everything manually is annoying.”)

Charlatan-proofing your life

When there’s not a lot of data to go around, you’re forced to choose between statistics and analytics.

Data-splitting is the cultural quick fix everyone needs.

Luckily, if you have plenty of data, you have a beautiful opportunity to avail yourself of analytics and statistics without cheating. You also have the perfect protection against charlatans. It’s called data-splitting and in my opinion it’s the most powerful idea in data science.

Never take an untested opinion seriously. Instead, use a stash of test data to find out who knows what they’re talking about.

To protect yourself against charlatans, all you have to do is make sure you keep some test data out of reach of their prying eyes, then treat everything else as analytics (don’t take it seriously). When you’re faced with a theory you’re in danger of buying into, use it to call the shot, and then open your secret test data to see if the theory is nonsense. It’s as easy as that!
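
As a rough sketch of what that can look like in practice, the snippet below shuffles a dataset once and sets a portion aside before anyone explores anything. The function name `split_data`, the 30% test fraction, and the fixed seed are assumptions chosen for illustration, not a prescribed recipe.

```python
import random


def split_data(rows, test_fraction=0.3, seed=42):
    """Shuffle once, then return (exploration_rows, locked_test_rows)."""
    rows = list(rows)           # copy so the caller's data is left untouched
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


# Hypothetical dataset: 100 record IDs standing in for real rows.
exploration, locked_test = split_data(range(100))
print(len(exploration), "rows for analytics;", len(locked_test), "rows locked away")
```

The important part is not the code but the discipline: `exploration` is fair game for pattern-hunting, while `locked_test` stays unopened until someone has written down the theory they want to test.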

Make sure that you don’t allow anyone to look at the test data during the exploration phase. Stick to exploratory data for that. Test data should not be used for analytics. Meme: SOURCE

This is a big cultural shift from what people were used to in the era of “small data”, when you had to explain how you know what you know in order to convince people (flimsily) that you might indeed know something.

The same rule applies to ML/AI

Some charlatans posing as experts in ML/AI are easy to spot. You catch them the same way you’d catch any other bad engineer: the “solutions” they attempt to build repeatedly fail to deliver. (An earlier warning sign is lack of experience with industry-standard programming languages and libraries.)

But what about the folks who produce systems that seem to work? How do you know if there’s something fishy going on? The same rule applies! The charlatan is a sinister character who shows you how well their model performed… on the same data they used to make the model. *facepalm*

If you’ve built a crazy-complicated machine learning system, how do you know if it’s any good? You don’t… until you show that it works on new data it hasn’t seen before.
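
Here is a small illustrative sketch of why performance on the training data proves nothing. It assumes a toy dataset whose labels are coin flips unrelated to the single feature, and a hypothetical nearest-neighbor "model" that simply memorizes what it was shown. All names and sizes are made up for the example.

```python
import random


def nearest_neighbor_predict(train, x):
    """Predict the label of the closest training point (pure memorization)."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def accuracy(train, data):
    """Fraction of points in `data` whose label the memorizer gets right."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in data)
    return correct / len(data)


rng = random.Random(0)
# Labels are coin flips, so there is nothing real to learn here.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]
train, test = data[:200], data[200:]

train_acc = accuracy(train, train)   # scored on the data the model memorized
test_acc = accuracy(train, test)     # scored on data the model has never seen
print(f"accuracy on training data: {train_acc:.0%}")
print(f"accuracy on new data:      {test_acc:.0%}")
```

The memorizer scores perfectly on the data it has already seen, yet on new data it should do no better than guessing, which is around 50% in this setup. Only the second number tells you whether the system is any good.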

It’s hardly a *pre*diction if you’ve seen the data before making it.

When you have enough data to split, you don’t need to hand-wave at the prettiness of your formulas to justify your project (which is still an old-fashioned habit I see everywhere, not just in science). You can say, “The reason I know it works is that I can take a dataset that I haven’t seen before and I can accurately predict what’s going to happen there… and be right. Over and over.”

Testing your model/theory in new data is the best basis for trust.

Call your statistical shots or stay humble

To paraphrase a quip by economist Paul Samuelson:

Charlatans have successfully predicted nine out of the last five recessions.

I have no patience for data charlatans. Think you “know” something involving Elvis-like potato chips? I couldn’t care less how well your opinion fits your old chips. I’m not impressed by how fancy your explanation is. Show me that your theory/model works (and keeps working) in a whole pile of new chips you’ve never seen before. That’s the true test of your opinion’s mettle.

Advice for data science professionals

Data science professionals, if you want to be taken seriously by anyone who understands the humor here, stop hiding behind fancy equations to prop up your human biases. Show us what you’ve got. If you want those who “get it” to treat your theory/model as more than a bit of inspiring poetry, have the guts to do the grand reveal of how well it works on a brand new dataset… in front of witnesses!

Advice for leaders

Leaders, refuse to take any data “insights” seriously until they’ve been tested on new data. Don’t feel like putting in the effort? Stick with analytics, but don’t lean on those insights — they’re flimsy and haven’t been checked for trustworthiness. Additionally, when your organization has data in abundance, there is no downside to making splitting a core part of your data science culture and even enforcing it at the infrastructure level by controlling access to test data earmarked for statistics. It’s a great way to nip snake oil sales attempts in the bud!

More bad tricks

If you’d like to see more examples of charlatans up to no good, this Twitter thread is wonderful.

Summary

When data are too scarce to split, only a data charlatan tries to follow inspiration with rigor, peddling hindsight by mathematically rediscovering phenomena that they already know to be in the data and calling their surprise statistically significant. This distinguishes them from the open-minded analyst who deals in inspiration and the meticulous statistician who offers proof of foresight.

When data are plentiful, get in the habit of data-splitting so you can have the best of both worlds! Be sure to do analytics and statistics separately on separate subsets of your original pile of data.

  • Analysts offer you open-minded inspiration.
  • Statisticians offer you rigorous testing.
  • Charlatans offer you twisted hindsight that pretends to be analytics plus statistics.