Hadley Wickham has recently joined RStudio as Chief Scientist. Previously, he spent over four years as a statistics professor at Rice University. Hadley considers himself primarily a tool-builder for data scientists working in R. He is interested in tools that reduce the cognitive burden of solving data science problems. He likes to figure out good ways to think about problems, then match up cognitive tools with computational tools that make it easy to solve real problems. His work in this area includes ggplot2 for visualization, plyr for data transformation, and reshape2 for data tidying. Recently, Hadley has also become interested in reducing the computational burden associated with dealing with large data in R, and is working on new tools for visualizing and transforming large data sets in R.

Hi Hadley! What is your educational background and how has it prepared you for your role in data analytics and science?  What skills did you not develop in school that you find important in your work?

I actually started off in medical school, which I left because I didn’t enjoy the focus on memorization. It wasn’t a total loss though, as one of the most valuable skills I learned was medical history taking. The process of taking a medical history is surprisingly similar to solving a data science problem. Most people go to the doctor with a diagnosis and treatment in mind. The first job of the doctor is to figure out what’s really gone wrong and what the patient needs (which is often not what they want!). This is a similar job to that of a consulting statistician or a data scientist: when people bring you data, they already think they know what needs to be done.

After leaving medical school, I double-majored in computer science and statistics. While I loved programming, I found the academic study of computer science rather dry, and so continued my interest in statistics with a master’s degree. I studied at the University of Auckland, the home of R, and it was a natural progression of my interests in visualization and data analysis that led to a PhD in Statistics with Di Cook and Heike Hofmann at Iowa State University.

During my PhD I did a lot of statistical consulting for other PhD students. This really opened my eyes to the challenges of working with real data: getting it into a suitable form for analysis, creating the plots I could see in my mind, and explaining complex results to statistical novices. These struggles led to my PhD thesis, which explored better tools for data reshaping, and for visualising complex data and models.

What are the biggest challenges in data science and analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

To me the biggest challenge is integrating visualization and modelling into a workflow that plays to each of their strengths. Visualization helps refine questions and reveal the unexpected, but doesn’t scale well because it needs a human viewer. If you have a precise question, modelling scales very well, but will never tell you something you fundamentally didn’t expect, and compact numerical summaries can .

Personally, I think R provides the best environment to integrate visualization and modelling, as well as providing a wide range of data import and manipulation tools. As a tool-builder, it makes sense for me to specialize in one language, but I think most data scientists are best off taking a polyglot approach. You’re better off getting stuff done, and not worrying too much about the lack of elegance of using different tools in different projects.

R has its fair share of haters, but I think it’s in a very similar place to where JavaScript was 5 or 10 years ago: the current implementation is not very fast, a lot of code is crap, and few people appreciate the elegant heart of the language. But these problems are all fixable, and I think the future for R is bright.

What’s your definition of data analytics and data science?

I’ve never been able to come up with a good definition for data science, but I think I have a decent handle on who a data scientist is:
They’re someone who can ask and answer questions about and with data.

The ability to ask a good question is critically important, and is not a skill that’s often taught in technical disciplines. Answering questions means more than just answering for yourself; it also means being able to communicate the results to others who don’t understand the details as well as you. It’s also not about just finding the answer once, but about automating the process so you can continue to find the answer every day.

What advice can you give someone with little experience in analytics to pursue a career in the field?

Play to your strengths. Start small and aim for a few easy wins: you’ve got a long haul ahead of you, and you don’t want to get discouraged too early. Don’t start out with gigabyte-sized data; find some smaller datasets that are meaningful to you and try to understand what’s going on.

If you’re a programmer who wants to learn more data science, learn R! The community needs more people (like you!) who know about good software development. You can use your programming skills to make a difference to the community, while you learn more about data and statistics.

Finally: write! Communication is a vital skill and practice is the only way to get better. I find writing to be a great tool to understand new areas, and teaching others through your writing is a great way to master a field. If you want to be a better writer, my recommendation is to read .  I found this book gave me the tools to understand why my prose didn’t work and how to fix it.


