Antonio began his career in bioinformatics, spending 10 years split between academia and industry. He then worked as a data scientist for a web ratings company, Quantcast, and then for a social network, Hi5. He is currently an independent consultant.

Hi Antonio! What is your educational background and how has it prepared you for your role in this field? What skills did you not develop in school that you find important in your work?

I have undergrad and graduate degrees in CS from the University of Milan, Italy. Everything I learned there turned out to be useful for my career, from algorithms to software engineering, from operating systems to linear algebra. Maybe I could have done without the physics classes, but they were interesting anyway. I also got some machine learning and some probability in school, but no statistics, which I had to pick up en route. Luckily, I worked very closely with excellent statisticians for 6 years at Affymetrix, which helped fill the gap. Being somewhat versed in computer theory also helped. It’s all math after all.

What are the biggest challenges in data science and/or analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

While we are transferring scientific methods into business and government, science itself is going through a period of transition, if not crisis. It made it into the popular press that most scientific results don’t stand the test of time. On the other hand, I meet both experienced scientists and members of the general public who regard any statement supported by some data with a p-value or a confidence interval as ground truth. I think we need to strengthen our methods at a [...] and we also need more education among people who are not specialists themselves but need to understand statistics, or we risk a backlash at some point.

The other challenge I would like to mention is more on the computing side. Every piece of hardware we use, from the phone in my pocket to a data center, needs parallel software to deliver more than a dwindling fraction of its potential. The bad news is that most of the code we have written since von Neumann and Turing is sequential, and most of it will have to be rewritten. We will be working on this new generation of software for years or maybe decades. That’s why I devoted the last year, year and a half to developing tools more than to solving specific data analysis problems, and I think it will keep taking a good share of my time.

As far as getting things right, check your assumptions. You can’t just perform a t-test and hope for the best; that’s using math as a talisman. If the assumptions were wrong, how would you detect that? When I was doing A/B tests I also did A/A tests and repeated the same A/B test to develop confidence in our system, and things checked out all right most of the time, with a few notable exceptions which we could explain and fix. The technology that made the most difference for me was Hadoop. I went from struggling through 1TB of data per year to handling 4TB per day. But it doesn’t solve the problem of applying statistics correctly; it just allows you to spend more time on that.
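Editor’s illustration: the A/A check mentioned above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the system Antonio describes. The function names (`welch_p_value`, `aa_false_positive_rate`), the simulated exponential metric, and the use of a normal approximation for the p-value are all assumptions made for the example. The idea is to split one population at random into two "A" groups many times and count how often the test reports a significant difference; if the test’s assumptions hold, that rate should sit close to the chosen significance level.

```python
import math
import random
import statistics


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means.

    Uses a Welch-style standard error with a normal approximation,
    which is an assumption that is reasonable only for large samples.
    """
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(values, trials=1000, alpha=0.05, seed=0):
    """Fraction of random A/A splits that the test flags as significant."""
    rng = random.Random(seed)
    values = list(values)  # work on a copy so the caller's data is untouched
    half = len(values) // 2
    hits = 0
    for _ in range(trials):
        rng.shuffle(values)
        if welch_p_value(values[:half], values[half:]) < alpha:
            hits += 1
    return hits / trials


# Hypothetical skewed metric (for example, time on page), simulated here.
data_rng = random.Random(42)
metric = [data_rng.expovariate(1.0) for _ in range(2000)]
rate = aa_false_positive_rate(metric)
print(f"A/A false positive rate: {rate:.3f}")
```

A rate far from `alpha` would suggest that the test, the randomization, or the data pipeline is not behaving as assumed, which is the kind of discrepancy an A/A test is meant to surface.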

What’s your definition of data analytics and data science?

I think data science is an umbrella definition for applications of the scientific method outside traditional domains, mostly with a very practical, engineering approach. I mean, physics is a data science, but it’s not done by data scientists; there are specialists for that. Outside the domain of individual sciences there is data science. Maybe one day it will become a set of independent disciplines with their own concepts and communities; maybe some common thread will appear and a new synthesis will emerge. It reminds me of cybernetics for the breadth of aspirations. If I had to guess one unifying thread right now, it is less emphasis on small elegant models, the E = mc^2 type, and more acceptance of the large complex ones that we can only understand indirectly, like a Markov model with one million coefficients. We certainly understand what a Markov model is, but not each coefficient. Predictive power is what matters. At this time I would say data science is firmly focused on applications and not overly concerned with foundations. I think that will come in due course.

What advice can you give someone with little experience in analytics to pursue a career in the field?

One suggestion is to devote a similar amount of work to CS and statistics. The two disciplines are more intertwined than ever, from experimental software engineering to the bootstrap. You’ll be better at each of them for knowing both. People may say teams are a replacement for people with mixed backgrounds, but I and others disagree. The other suggestion is to put yourself on the map by solving some public challenge or making important contributions to open source software. Learn a useful art, be good at it and let it be known.

What do you love about data analytics? What part of your job makes you most excited?

Coming from a science background, it’s quite a lucky strike that everybody now wants to do science in a variety of new domains. Another one is that, having to deal with massive datasets, we are at the forefront of the conversion to parallel computing I was talking about before; we are among the pioneers. But my biggest source of excitement is another one. The founder of a promising wind power company, Samuel Griffith, once said he is a mechanical engineer because the great problems of our time are hardware, not software. I hope we can prove him partially wrong on this one. I hope that data science can be brought to bear on the most pressing problems that humankind is facing, if only we stop and turn to stuff that matters. Our systems are full of inefficiencies that we have the possibility of slashing using an overwhelming amount of information, from watering each stalk of corn based on weather forecasts and sensor networks, to creating efficient markets for CO2 emissions and energy, to creating automated mass transit systems that work, to enrolling every human being in a permanent clinical and safety study of every drug and every molecule on the market. If not now, when?