Scott is an Architect at Salesforce, managing the Chatter Discovery engineering teams, focused on making organizations around the world smarter by allowing their employees and partners to be more efficient and productive in how they collaborate. He is also the founder of Early Ascent, which has been helping hundreds of thousands of children around the world learn to read.

What is your educational background and how has it prepared you for your role in data analytics and science? What skills did you not develop in school that you find important in your work?

My first academic exposure to “data science,” although we didn’t call it that back then, came when I was an undergrad at UCSD and took a course on neural networks, working out of a book called Parallel Distributed Processing, which was the neural networks bible at the time. UCSD was one of the centers of excellence in neural networks, with people like Geoffrey Hinton and Michael Jordan passing through there. I later completed an MS and did PhD work at UCI with Padhraic Smyth, who is one of the top researchers in the field of data mining and machine learning. I was there just as UCI was putting together a cross-disciplinary program in statistics and computer science, so I got a lot of exposure to machine learning and data mining from a CS point of view as well as from a statistics point of view. I also took some classes with some very good social network researchers in the social sciences. Beyond coursework, I published several papers in data mining and read many hundreds of published papers, which gave me a very good understanding of the key problems in the field, how they were being approached, and which avenues of research were truly ground-breaking as opposed to incremental.

The main skills and insights that were not developed enough through a purely academic setting were a) how to build real systems that apply this work, b) practical techniques for things like data cleansing, feature engineering, and model calibration, which are the kinds of details that are often left out of research papers and textbooks, and c) understanding the most important problems that people in the real world actually faced, as opposed to the problems that people in academia were able to analyze with the techniques they had available.

What are the biggest challenges in data science and analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

One of the biggest non-technical challenges I have found is convincing organizations that are either a) sitting on troves of valuable data or b) able to access data but not properly collecting it, to clearly understand the value of the data they have and to think differently about how their organization should be wired. The hype around “big data” has helped many organizations understand this, but there are still so many organizations that don’t have a clue. This is one of the reasons an area like “data science” is so exciting: there are so many opportunities to make a huge, positive impact in helping make the world smarter.

One of the biggest soft technical challenges is setting up the right set of metrics to clearly understand how well the systems/services/products deployed are working. Having the right metrics in place is so key and so commonly overlooked. For example, if you are in the business of fraud detection, you don’t want to use a traditional ROC curve as your measurement of success because that does not take into account how much money was saved.
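To make the fraud-detection point concrete, here is a minimal sketch (with hypothetical data and a made-up `money_saved` function, not anything from an actual fraud system) showing why a dollar-weighted metric can prefer a very different decision threshold than a pure classification view would: a single large fraud with a middling model score dominates the money metric.

```python
import numpy as np

# Hypothetical fraud data: model score, true label (1 = fraud),
# and the dollar amount at stake for each transaction.
scores  = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
labels  = np.array([1,   0,   1,   0,   1,   0  ])
amounts = np.array([500.0, 50.0, 20.0, 300.0, 5000.0, 10.0])

def money_saved(threshold, review_cost=10.0):
    """Dollars recovered by flagging transactions scoring at or above
    `threshold`, minus a fixed review cost per flagged case."""
    flagged = scores >= threshold
    recovered = amounts[flagged & (labels == 1)].sum()
    return recovered - review_cost * flagged.sum()

# A high threshold looks "precise" but misses the $5000 fraud at
# score 0.3; the dollar-weighted metric picks a much lower threshold.
best = max(np.unique(scores), key=money_saved)
```

With these toy numbers, the threshold that maximizes dollars saved is 0.3, because capturing the large fraud is worth flagging several low-score cases along with it, which is exactly the trade-off a standard ROC curve cannot express.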

It’s hard to say what the biggest technical challenges are, since they depend so much on the specific problem. No matter what the problem is, though, having a well-designed methodology for incremental improvement will help ensure mistakes are caught early and you are headed in the right direction. Many data scientists rush immediately to the most complicated algorithm or model, which is generally a mistake.

When it comes to the best technologies, I often take inspiration from some of my physicist friends who taught me the value of being able to do more with less. For example, if you were to be dropped in a jungle for a week, knowing how to operate a food processor would be useless. But having a machete would be priceless. If I had an important business problem which I needed to be answered quickly and I had to choose between a data scientist who could answer questions only by using Java and a Hadoop cluster of ten thousand machines or someone with a good, working knowledge of sampling techniques and access to Matlab or R on a laptop, I would probably choose the latter. That being said, having a Hadoop cluster can be incredibly powerful and there are far too many new exciting technologies emerging to name including advances in database storage like Cassandra, online data stream processing technologies such as Storm, faster ways to do Bayesian posterior sampling, and so forth.
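As an illustration of the "sampling on a laptop" point, here is a small sketch (pure standard-library Python, with a simulated stand-in for a large dataset) showing how a modest simple random sample pins down a population mean with a tight confidence interval, no cluster required:

```python
import math
import random

random.seed(42)

# Stand-in for a "large" dataset: one million simulated
# transaction amounts with a true mean of 50.
population = [random.expovariate(1 / 50.0) for _ in range(1_000_000)]

# A simple random sample of 10,000 rows easily fits in memory.
sample = random.sample(population, 10_000)

n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)
stderr = math.sqrt(var / n)

# A rough 95% confidence interval for the population mean:
# about +/- 1 around the estimate, despite using only 1% of the data.
lo, hi = mean - 1.96 * stderr, mean + 1.96 * stderr
```

The standard error shrinks with the square root of the sample size, so for many summary questions a 1% sample answers the business question in seconds, which is the kind of leverage a good working knowledge of sampling buys you.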

What’s your definition of data analytics and science?

I think of “data science” as the methodological component of any data-driven inquiry. It can be based on pre-existing observations or grounded in an experimental setting where you have control over the data-generating process. It is the superset of fields like data mining, statistics, machine learning, signal processing, and computational social networks.

What advice can you give someone with little experience in analytics to pursue a career in the field?

The biggest mistake I see people making is that they get too caught up with methodology over substance. Data science can be a powerful wedge for solving some of society’s biggest problems, like healthcare and education. I would encourage everyone to spend more time thinking hard about important problems they want to solve rather than focusing on sexy techniques for solving less important problems. Right now many of the brightest data scientists are spending their time in fields like Wall Street or advertising. But ultimately I think it’s unsatisfying, especially given the opportunity cost of knowing that there is still so much low-hanging fruit working on big, important problems that, when tackled, can leave a lasting impact.

