John Cook started out in applied math and worked for the University of Texas and Vanderbilt University. He then left academia and worked as a software developer for Western Atlas, an oilfield service company, and for NanoSoft, a small consulting company. Then for over a decade John worked in biostatistics at M. D. Anderson Cancer Center doing Bayesian statistics and software development. He is currently an independent consultant.

Hi John! What is your educational background and how has it prepared you for your role in this field?  What skills did you not develop in school that you find important in your work?

When I was in graduate school, I studied applied math, mostly partial differential equations and numerical analysis. I came to statistics as an outsider, and that gave me a different perspective. I felt lost at first because statisticians have different notation and terminology for a lot of things. But I was able to apply a few tricks I’d learned from PDEs to statistical problems. And my experience from numerical analysis helped me to speed up statistical simulations, sometimes by a couple orders of magnitude.

What are the biggest challenges in data science and/or analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

Modeling is one of the biggest challenges. When I first studied probability, the first line of a problem was the one that bothered me most: “Assume X has a normal distribution” or “Assume Y has an exponential distribution” etc. Later I came to understand that these assumptions were not necessarily as arbitrary as they sounded, that there may be good empirical or theoretical reasons for these choices. But then my suspicion started to return when I realized how often model choices are motivated by convenience or tradition rather than science.

It’s easy to get caught in circular reasoning. For example, how do you decide what data points are outliers? They are points that have low probability under your model. So you throw them out. Then, lo and behold, everything that’s left fits your model!
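A minimal sketch of that circle, assuming synthetic heavy-tailed data and a "two standard deviations" cutoff (both are illustrative choices, not anything from the interview). The data are drawn from a Laplace distribution, points that look unlikely under a normal model are dropped, and the remaining data then look much more normal than the original did, as measured by excess kurtosis (which is 0 for a normal distribution).

```python
import random
import statistics

def trim_outliers(data, k=2.0):
    """Drop points more than k standard deviations from the mean (assumed cutoff)."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) <= k * sigma]

def excess_kurtosis(data):
    """Sample excess kurtosis; near 0 for normal data, positive for heavy tails."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(((x - mu) / sigma) ** 4 for x in data) / len(data) - 3.0

rng = random.Random(0)

# Synthetic Laplace data: an exponential magnitude with a random sign.
# This is NOT normal, but we pretend it is when deciding what to discard.
data = [rng.expovariate(1.0) * rng.choice((-1, 1)) for _ in range(10_000)]

trimmed = trim_outliers(data)

kurtosis_before = excess_kurtosis(data)
kurtosis_after = excess_kurtosis(trimmed)

print(f"points kept: {len(trimmed)} of {len(data)}")
print(f"excess kurtosis before trimming: {kurtosis_before:.2f}")
print(f"excess kurtosis after trimming:  {kurtosis_after:.2f}")
```

The trimmed sample agrees better with the normal model only because the points that disagreed with it were removed, which says nothing about whether the model was right in the first place.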

So how do you break out of the circle? You can start by visualizing your data. And after you select a model, validate it. If you’re fitting a model in order to make predictions, and your model indeed does make good predictions on new data, you can have some confidence that you’re not just playing mental games and that your model may be an approximation of reality.
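One way to picture validating on new data is a simple holdout check. The sketch below is only an illustration under stated assumptions: the data are synthetic, the model is a straight line fit by least squares, and the helper names `fit_line` and `rmse` are hypothetical. The model is fit on one part of the data and judged only on points it has not seen, alongside a naive baseline that always predicts the training mean.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(predicted, actual):
    """Root mean squared prediction error."""
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)) ** 0.5

rng = random.Random(1)

# Synthetic data (an assumption for illustration): a linear trend plus noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 points; the model never sees them during fitting.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

slope, intercept = fit_line(train_x, train_y)

# Judge the model only on the held-out points.
model_rmse = rmse([slope * x + intercept for x in test_x], test_y)

# Naive baseline: always predict the mean of the training responses.
baseline_rmse = rmse([statistics.fmean(train_y)] * len(test_y), test_y)

print(f"held-out RMSE, fitted line: {model_rmse:.2f}")
print(f"held-out RMSE, baseline:    {baseline_rmse:.2f}")
```

A model that beats the baseline on held-out points has earned some confidence, though only for data that resembles what it was tested on.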

Ideally you’d have some theoretical justification too. When you say “Look, it works!” what you really mean is that it has worked *so far*. Theory can tell you something about how your model will behave on data you haven’t seen yet. That may increase or decrease your confidence in the model.

By the way, you can’t avoid making modeling decisions. If you don’t have an explicit model, you still have an implicit model. You can’t avoid the problem of modeling by pretending it’s not there.

What’s your definition of data analytics and data science?

Data analytics seems too broad to have much meaning. It seems like it could describe anything you do to data.

Data science, however, has taken on a more definite meaning. Hilary Mason drew a diagram one time that placed data science in the intersection of engineering, math, computer science, and hacking. That seems like a reasonable definition of data science.

Traditional statistics is sort of in the intersection of math and engineering, but statisticians often lack the computational skill to work on large-scale problems. They may also lack the hacking spirit, placing too much emphasis on elegant theory and not enough on pragmatism.

Others have the computer science and engineering skills to capture large amounts of data and move it around efficiently, but they don’t know how to draw conclusions from it. They lack the mathematical background to know what methods to apply and, more subtly, to interpret the results.

Not many people have PhDs in engineering, math, computer science, and hacking. (Can you even get a PhD in hacking?) But my idea of a data scientist is someone who has *some* skill in all four areas and appreciates the importance of each. Everyone working in data science is going to be stronger in some areas than others, and a good team is going to bring together people with complementary strengths.

What advice can you give someone with little experience in analytics who wants to pursue a career in the field?

Be patient. It takes a long time to develop the various skills you need. You’ll be able to learn some skills on the job, but you’ll have to learn others on your own. On the other hand, you don’t have to wait 20 years before you can do anything. You can start tinkering with data immediately.

I’d also say it’s good to develop an idea of what data science can and cannot do. For example, you might want to read “The Human Face of Big Data” by Rick Smolan and Jennifer Erwitt for an enthusiastic account of what data science can do. But you might also want to read “Antifragile” by Nassim Taleb for a sober look at the limits of prediction.

How do you think the field will be different in 5-10 years?

I expect the infrastructure will mature. Some people will specialize in the infrastructure itself but others won’t need to know quite as much about it.

I also see statistics and machine learning coming closer together. Statistics emphasizes probability models and machine learning emphasizes algorithms, and that’s not going to change. But sometimes the distinction between the two is fuzzy, and I expect it will get fuzzier.
