Stories on Data Science and Analytics Dispersing Information Tue, 09 Apr 2013 18:32:53 +0000

Scott Nicholson Tue, 09 Apr 2013 18:32:53 +0000 miketwardos Scott is the Chief Data Scientist at Accretive Health, working on uncovering insights that will help doctors increase the quality of care while decreasing cost. Previously he was a Product Manager of Optimization/Analytics at Adara Media and then more recently a Data Scientist at LinkedIn.

What is your educational background and how has it prepared you for your role in data analytics and science? What skills did you not develop in school that you find important in your work?

I have a PhD in Economics and have spent a lot of time with Stata. The most valuable skills I learned were econometrics/statistics and a deep analytical intuition. But I had to later pick up much more relevant skills such as R, Python, SQL, etc.

What are the biggest challenges in data science and analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

Asking the right questions. Cleaning/extracting/munging/preparing data. You’ve got to know where the value is with your data, and to trust it before you put it in your model. The best tech out there for cleaning data is to hammer through it yourself or push the issues upstream to fix them at the source (i.e., log data appropriately).

What’s your definition of data analytics and science?

Owning the end-to-end process. Start with asking the right questions, and do whatever you need to do to deploy insights, have an impact, and iterate.

What advice can you give someone with little experience in analytics to pursue a career in the field?

Find an area that you are passionate about and figure out how to get some relevant data now. What are some interesting questions to ask? How can you answer them with data? Another more structured method is to get the O’Reilly books Programming Collective Intelligence, Mining the Social Web, and Machine Learning for Hackers. Those give a good overview of techniques.


Antonio Piccolboni Thu, 04 Apr 2013 06:14:18 +0000 miketwardos Antonio began his career in bioinformatics, spending 10 years split between academia and industry. He then worked for a web ratings company, Quantcast, then a social network, Hi5, as a data scientist. Currently he is an independent consultant.

Hi Antonio! What is your educational background and how has it prepared you for your role in this field?  What skills did you not develop in school that you find important in your work?

I have undergrad and graduate degrees in CS from the University of Milan, Italy. Everything I learned there turned out to be useful for my career, from algorithms to software engineering, from operating systems to linear algebra. Maybe I could have done without the physics classes, but they were interesting anyway. I also got some machine learning and some probability in school, but no statistics, which I had to pick up en route. Luckily I worked very closely with excellent statisticians for 6 years at Affymetrix, which helped fill the gap. Being somewhat versed in computer theory also helped. It’s all math after all.

What are the biggest challenges in data science and/or analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

While we are transferring scientific methods into business and government, science itself is going through a period of transition, if not crisis. It made it into the popular press that most scientific results don’t stand the test of time. On the other hand, I meet both experienced scientists and members of the general public who regard any statement supported by some data with a p-value or a confidence interval as ground truth. I think we need to strengthen our methods at a fundamental level, and we also need more education among people who are not specialists themselves but need to understand statistics, or we risk a backlash at some point.

The other challenge I would like to mention is more on the computing side. Every piece of hardware we use, from the phone in my pocket to a data center, needs parallel software to deliver more than a dwindling fraction of its potential. The bad news is that most of the code we have written since von Neumann and Turing is sequential, and most of it will have to be rewritten. We will be working on this new generation of software for years or maybe decades. That’s why I devoted the last year, year and a half to developing tools, more than solving specific data analysis problems, and I think it will keep taking a good share of my time.

As far as getting things right, check your assumptions. You can’t just perform a t-test and hope for the best; that’s using math as a talisman. If the assumptions were wrong, how would you detect that? When I was doing A/B tests I also did A/A tests and repeated the same A/B test to develop confidence in our system, and things checked out most of the time, with a few notable exceptions which we could explain and fix. The technology that made the most difference for me was Hadoop. I went from struggling through 1TB of data per year to handling 4TB per day. But it doesn’t solve the problem of applying statistics correctly; it just allows you to spend more time on that.
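
To make the A/A idea concrete, here is a minimal sketch on simulated data. It is not Antonio's actual setup: the metric, sample sizes, the choice of Welch's t statistic, and the 1.96 cutoff (a normal approximation to the 5% level) are all assumptions for illustration. Both arms are drawn from the same distribution, so any "significant" difference is a false positive, and the observed rate should sit near the nominal 5% if the testing machinery behaves.

```python
import random


def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    mean_a, var_a = mean_and_var(a)
    mean_b, var_b = mean_and_var(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (mean_a - mean_b) / standard_error


def aa_false_positive_rate(n_tests=2000, n_users=100, cutoff=1.96, seed=0):
    """Run many simulated A/A tests and return the share flagged significant.

    Both arms come from the same (hypothetical) distribution, so every
    flagged test is a false positive.
    """
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_tests):
        arm_1 = [rng.gauss(10.0, 2.0) for _ in range(n_users)]
        arm_2 = [rng.gauss(10.0, 2.0) for _ in range(n_users)]
        if abs(welch_t(arm_1, arm_2)) > cutoff:
            flagged += 1
    return flagged / n_tests


rate = aa_false_positive_rate()
print(f"A/A false positive rate: {rate:.3f} (nominal level is about 0.05)")
```

A rate far from the nominal level on real traffic would suggest a problem in assignment, logging, or the test's assumptions rather than a real effect.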

What’s your definition of data analytics and data science? 

I think data science is an umbrella definition for applications of the scientific method outside traditional domains, mostly with a very practical, engineering approach. I mean, physics is a data science but it’s not done by data scientists; there are specialists for that. Outside the domain of individual sciences there is data science. Maybe one day it will become a set of independent disciplines with their own concepts and communities, maybe some common thread will appear and a new synthesis will emerge. It reminds me of cybernetics for the breadth of aspirations. If I had to guess one unifying thread right now, it is less emphasis on small elegant models, the E = mc^2 type, and more acceptance of the large complex ones that we can only understand indirectly, like a Markov model with one million coefficients. We certainly understand what a Markov model is but not each coefficient. Predictive power is what matters. At this time I would say data science is firmly focused on applications and not overly concerned with foundations. I think that will come in due course.

What advice can you give someone with little experience in analytics to pursue a career in the field?

One is devoting a similar amount of work to CS and statistics.  The two disciplines are more intertwined than ever, from experimental software engineering to the bootstrap. You’ll be better at each of them for knowing both. People may say teams are a replacement for people with mixed backgrounds, but I and others disagree. The other suggestion is to put yourself on the map by solving some public challenge or making important contributions to open source software. Learn a useful art, be good at it and let it be known.

What do you love about data analytics? What part of your job makes you most excited? 

Coming from a science background, it’s quite a lucky strike that everybody now wants to do science in a variety of new domains. Another one is that, having to deal with massive datasets, we are at the forefront of the conversion to parallel computing I was talking about before; we are among the pioneers. But my biggest source of excitement is another one. The founder of a promising wind power company, Samuel Griffith, once said he is a mechanical engineer because the great problems of our time are hardware, not software. I hope we can prove him partially wrong on this one. I hope that data science can be brought to bear on the most pressing problems that humankind is facing, if only we stop wasting time and turn to stuff that matters. Our systems are full of inefficiencies that we have the possibility of slashing using an overwhelming amount of information, from watering each stalk of corn based on weather forecasts and sensor networks, to creating efficient markets for CO2 emissions and energy, to creating automated mass transit systems that work, to enrolling every human being in a permanent clinical and safety study of every drug and every molecule on the market. If not now, when?


Have questions?  Continue the conversation in the comments.

Kate Matsudaira Sun, 24 Mar 2013 05:26:37 +0000 romymisra Kate Matsudaira was most recently CTO at Decide where she managed a team of people doing data mining and machine learning.  Before that she was CTO of SEOmoz where she dealt with massive amounts of web crawler data.  She has recently founded a new startup called Pop Forms where Kate says “…we’re not working with a lot of data yet, but I expect to in the future.”

Hi Kate! What is your educational background and how has it prepared you for your role in this field?  What skills did you not develop in school that you find important in your work?

I studied computer science in both undergrad and in my graduate work. To be honest, there’s a lot I had to learn after school to be successful in my field. First, technology has changed so much since I graduated. When I was in school, I learned lots of fundamentals and programming in languages like C and Matlab; now, applications are evolving so quickly that everyone in this field is in a constant state of learning. In addition, I was super focused on the technical in school, but pretty quickly found myself in a leadership role after I graduated. This required a whole different set of soft skills that I didn’t learn in a Computer Science program. And even if you are not interested in management, getting promoted or working on the projects you are more interested in often involves a bit of persuasion and communication. So at some point you have to develop those soft skills.

What are the biggest challenges in data science and/or analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

In my experience, the biggest challenge with data is knowing how to make good use of it. You’ve got to ask the right questions in order to do thoughtful analysis and draw meaningful conclusions. And in many cases there are challenges collecting the right data and getting it into a usable state. When it comes to analysis, a lot of people get confused between correlation and causation. Correlation is not causation (just because things seem correlated does not mean one caused the other), and people are tempted to draw conclusions based on data relationships that aren’t really there.
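
The correlation-versus-causation point can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names (temperature, ice cream sales, sunburns) and all coefficients are hypothetical. Two quantities are both driven by a hidden common cause and never influence each other, yet they come out strongly correlated; once the common cause is accounted for, the correlation largely disappears.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def residuals(ys, xs):
    """What is left of ys after removing a least-squares line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(42)

# Hypothetical hidden common cause: daily temperature.
temperature = [rng.gauss(20.0, 8.0) for _ in range(5000)]
# Both outcomes depend on temperature only, never on each other.
ice_cream_sales = [2.0 * t + rng.gauss(0.0, 5.0) for t in temperature]
sunburns = [0.5 * t + rng.gauss(0.0, 3.0) for t in temperature]

r_raw = pearson(ice_cream_sales, sunburns)
r_partial = pearson(residuals(ice_cream_sales, temperature),
                    residuals(sunburns, temperature))

print(f"raw correlation:                  {r_raw:.2f}")
print(f"after controlling for temperature: {r_partial:.2f}")
```

In real data the common cause is often unmeasured, which is why a strong correlation alone is weak evidence that one thing causes another.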

It’s important to be able to use data to impact your business smartly. Understand and use the tools you’ve got, react to the data, and then make smart decisions based on that data.

What’s your definition of data analytics and data science?

My definition is collecting data (through monitoring & metrics) and analyzing that data to some sort of conclusion or result. It’s really about the act of using data to make decisions or drive products. And it doesn’t have to be a complex system; it can be as simple as just taking in data and using it to make smarter decisions.

What advice can you give someone with little experience in analytics to pursue a career in the field?

The best way to break into the field is to just keep learning, and to get as much real world experience as you can. Start a job, internship, or personal project, because nothing trumps real world experience. If you can, try to figure out ways to work data analysis into your current job. If you can’t, find ways to do it outside of work. Kaggle and other sites allow people anywhere to work on data science experiments, which is a great way to start building up a portfolio or resume you can point to in a job interview.

How do you think the field will be different in 5-10 years?

I think there will be a lot more data in the next 5-10 years. People are going to get better at collecting it and analyzing it, and a lot more tools will be available and will run faster. We’ve seen so many tools that have evolved so much just in the last 5 years, but they can still be really slow. I think in the near future they will advance and speed up, and open up even more new applications for data. I expect we’ll see lots more companies that are built on and driven by data, and using data science to run their business.

Christyn Perras Wed, 20 Mar 2013 15:04:02 +0000 miketwardos Christyn is currently a quantitative analyst at YouTube. Previously she worked at Slide, a social gaming startup where she also performed quantitative analysis on product features. She has also worked as a statistician at Stanford University School of Medicine’s Center for Clinical Research and INC Research (pharmaceutical CRO).

Hi Christyn! What is your educational background and how has it prepared you for your role in data analytics and science? What skills did you not develop in school that you find important in your work?

I received a PhD in Research, Evaluation and Measurement from the University of Pennsylvania as well as an MS in Statistics. My undergraduate degree was in Psychology at Loyola University. At the time, my education felt like a hodgepodge of my interests and I wasn’t sure how to fit them together into a career. After a bit of trial and error, I found that my background in and love for psychology combined with my desire and instinct to quantify things made me well suited for social gaming. My psychology background gave me the skills to understand why people were using our games, what made them come back, what we could do to change their behavior, etc. My background in statistics and experimental design taught me how to study, test, quantify and interpret their behavior. Both disciplines help me to ask the right questions, find the best approaches and understand the answers.

What are the biggest challenges in data science and analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

One of the most important things to get right is what you do with the results of an analysis. Practice explaining statistical topics to non-quantitative people. Choose your words well and err on the side of over-explaining. Misinterpretation can spread like wildfire and it’s best to avoid it at all costs. Also, remember that it’s not enough to just state what the results are. You also need to consider and address what they mean, the implications, and what your audience should *do* as a result. Actionable insights are key.

What’s your definition of data analytics and science?  

It’s the art of drawing insight from numerical chaos.

What advice can you give someone with little experience in analytics to pursue a career in the field?

You need to be on top of your game in three major areas. First, you need to know statistics and research design. Second, you need to know how to work with and communicate to non-quantitative people. Third, you need programming/technical skills. In most places, knowledge of MySQL and a statistical package, like R, is enough. But it’s also imperative that you understand where the data is coming from, where it’s being warehoused, the most efficient way to access it, etc. I never learned any scripting languages and, while it’s not necessary for my job, a basic knowledge of one would make a lot of things easier. Luckily, all three areas can be self-taught with books, websites and free educational programs like Coursera. So if you don’t have any experience in one of the areas or just need a refresher, get to it.

John Cook Wed, 13 Mar 2013 06:56:04 +0000 romymisra  

John Cook started out in applied math and worked for University of Texas and Vanderbilt University. He then left academia and worked as a software developer for Western Atlas, an oilfield service company, and for NanoSoft, a small consulting company. Then for over a decade John worked in biostatistics at M. D. Anderson Cancer Center doing Bayesian statistics and software development. He is currently an independent consultant.

Hi John! What is your educational background and how has it prepared you for your role in this field?  What skills did you not develop in school that you find important in your work?

When I was in graduate school, I studied applied math, mostly partial differential equations and numerical analysis. I came to statistics as an outsider, and that gave me a different perspective. I felt lost at first because statisticians have different notation and terminology for a lot of things. But I was able to apply a few tricks I’d learned from PDEs to statistical problems. And my experience from numerical analysis helped me to speed up statistical simulations, sometimes by a couple orders of magnitude.

What are the biggest challenges in data science and/or analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

Modeling is one of the biggest challenges. When I first studied probability, the first line of a problem was the one that bothered me most: “Assume X has a normal distribution” or “Assume Y has an exponential distribution” etc. Later I came to understand that these assumptions were not necessarily as arbitrary as they sounded, that there may be good empirical or theoretical reasons for these choices. But then my suspicion started to return when I realized how often model choices are motivated by convenience or tradition rather than science.

It’s easy to get caught in circular reasoning. For example, how do you decide what data points are outliers? They are points that have low probability under your model. So you throw them out. Then, lo and behold, everything that’s left fits your model!

So how do you break out of the circle? You can start by visualizing your data. And after you select a model, validate it. If you’re fitting a model in order to make predictions, and your model indeed does make good predictions on new data, you can have some confidence that you’re not just playing mental games and that your model may be an approximation of reality.
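
As a rough sketch of what validating on new data can look like, the example below fits a straight line on one part of a simulated data set and scores it on a held-out part, next to a naive baseline that always predicts the training mean. The data, the 300/100 split, and the helper names are assumptions made for illustration and not a method described in the interview.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def mse(predictions, targets):
    """Mean squared error between predictions and observed targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


rng = random.Random(1)

# Simulated data: a linear trend plus noise (hypothetical ground truth).
xs = [rng.uniform(0.0, 10.0) for _ in range(400)]
ys = [3.0 * x + 2.0 + rng.gauss(0.0, 2.0) for x in xs]

# Hold out the last 100 points; the model never sees them while fitting.
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]

slope, intercept = fit_line(train_x, train_y)
model_mse = mse([slope * x + intercept for x in test_x], test_y)

# Baseline: ignore x and always predict the training mean.
train_mean = sum(train_y) / len(train_y)
baseline_mse = mse([train_mean] * len(test_y), test_y)

print(f"held-out MSE, fitted line: {model_mse:.2f}")
print(f"held-out MSE, baseline:    {baseline_mse:.2f}")
```

A model that only matches the data it was fitted on would show little or no improvement over the baseline on the held-out points.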

Ideally you’d have some theoretical justification too. When you say “Look, it works!” what you really mean is that it has worked *so far*. Theory can tell you something about how your model will behave on data you haven’t seen yet. That may increase or decrease your confidence in the model.

By the way, you can’t avoid making modeling decisions. If you don’t have an explicit model, you still have an implicit model. You can’t avoid the problem of modeling by pretending it’s not there.

What’s your definition of data analytics and data science?

Data analytics seems too broad to have much meaning. Seems like it could describe anything you do to data.

Data science, however, has taken on a more definite meaning. Hilary Mason drew a diagram one time that placed data science in the intersection of engineering, math, computer science, and hacking. That seems like a reasonable definition of data science.

Traditional statistics is sort of in the intersection of math and engineering, but statisticians often lack the computational skill to work on large-scale problems. They may also lack the hacking spirit, placing too much emphasis on elegant theory and not enough on pragmatism.

Others have the computer science and engineering skills to capture large amounts of data and move it around efficiently, but they don’t know how to draw conclusions from it. They lack the mathematical background to know what methods to apply and, more subtly, to interpret the results.

Not many people have PhDs in engineering, math, computer science, and hacking. (Can you even get a PhD in hacking?) But my idea of a data scientist is someone who has *some* skill in all four areas and appreciates the importance of each. Everyone working in data science is going to be stronger in some areas than others, and a good team is going to bring together people with complementary strengths.

What advice can you give someone with little experience in analytics who wants to pursue a career in the field?

Be patient. It takes a long time to develop the various skills you need. You’ll be able to learn some skills on the job, but you’ll have to learn others on your own. On the other hand, you don’t have to wait 20 years before you can do anything. You can start tinkering with data immediately.

I’d also say it’s good to develop an idea of what data science can and cannot do. For example, you might want to read “The Human Face of Big Data” by Rick Smolan and Jennifer Erwitt for an enthusiastic account of what data science can do. But you might also want to read “Antifragile” by Nassim Taleb for a sober look at the limits of prediction.

How do you think the field will be different in 5-10 years?

I expect the infrastructure will mature. Some people will specialize in the infrastructure itself but others won’t need to know quite as much about it.

I also see statistics and machine learning coming closer together. Statistics emphasizes probability models and machine learning emphasizes algorithms, and that’s not going to change. But sometimes the distinction between the two is fuzzy, and I expect it will get fuzzier.

Hadley Wickham Wed, 06 Mar 2013 06:49:58 +0000 romymisra Hadley Wickham has recently joined RStudio as Chief Scientist. Previously, he spent over four years as a statistics professor at Rice University. Hadley considers himself primarily a tool-builder for data scientists working in R. He is interested in tools that reduce the cognitive burden of solving data science problems. He likes to figure out good ways to think about problems, then match up cognitive tools with computational tools that make it easy to solve real problems. His work in this area includes ggplot2 for visualization, plyr for data transformation and reshape2 for data tidying. Recently, Hadley has also become interested in reducing the computational burden associated with dealing with large data in R, and is working on new tools for visualizing and transforming large data sets in R.

Hi Hadley! What is your educational background and how has it prepared you for your role in data analytics and science?  What skills did you not develop in school that you find important in your work?

I actually started off in medical school, which I left because I didn’t enjoy the focus on memorization. It wasn’t a total loss though, as one of the most valuable skills I learned was medical history taking. The process of taking a medical history is surprisingly similar to solving a data science problem. Most people go to the doctor with diagnosis and treatment in mind. The first job of the doctor is to figure out what’s really gone wrong and what the patient needs (which is often not what they want!) This is a similar job to a consulting statistician or a data scientist: when people bring you data, they already think they know what needs to be done.

After leaving medical school, I double-majored in computer science and statistics. While I loved programming, I found the academic study of computer science rather dry, and so continued my interest in statistics with a master’s degree. I studied at the University of Auckland, the home of R, and it was a natural progression of my interests in visualization and data analysis that led to a PhD in Statistics with Di Cook and Heike Hofmann at Iowa State University.

During my PhD I did a lot of statistical consulting for other PhD students. This really opened my eyes to the challenges of working with real data: getting it into a suitable form for analysis, creating the plots I could see in my mind, and explaining complex results to statistical novices. These struggles led to my PhD thesis, which explored better tools for data reshaping, and for visualising complex data and models.

What are the biggest challenges in data science and analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

To me the biggest challenge is integrating visualization and modelling into a workflow that plays to each of their strengths. Visualization helps refine questions and reveal the unexpected, but doesn’t scale well because it needs a human viewer. If you have a precise question, modelling scales very well, but will never tell you something you fundamentally didn’t expect, and compact numerical summaries can hide a lot.

Personally, I think R provides the best environment to integrate visualization and modelling, as well as providing a wide range of data import and manipulation tools. As a tool-builder, it makes sense for me to specialize in one language, but I think most data scientists are best off taking a polyglot approach. You’re better off getting stuff done, and not worrying too much about the lack of elegance of using different tools in different projects.

R has its fair share of haters, but I think it’s in a very similar place to where JavaScript was 5 or 10 years ago: the current implementation is not very fast, a lot of code is crap, and few people appreciate the elegant heart of the language. But these problems are all fixable, and I think the future for R is bright.

What’s your definition of data analytics and data science?

I’ve never been able to come up with a good definition for data science, but I think I have a decent handle on who a data scientist is: they’re someone who can ask and answer questions about and with data.

The ability to ask a good question is critically important, and is not a skill that’s often taught in technical disciplines. Answering questions means more than just answering for yourself; it also means being able to communicate the results to others who don’t understand the details as well as you. It’s also not about just finding the answer once, but about automating the process so you can continue to find the answer every day.

What advice can you give someone with little experience in analytics to pursue a career in the field?

Play to your strengths. Start small and aim for a few easy wins: you’ve got a long haul ahead of you, and you don’t want to get discouraged too early. Don’t start out with gigabyte-sized data; find some smaller datasets that are meaningful to you and try to understand what’s going on.

If you’re a programmer who wants to learn more data science, learn R! The community needs more people (like you!) who know about good software development. You can use your programming skills to make a difference to the community, while you learn more about data and statistics.

Finally: write! Communication is a vital skill and practice is the only way to get better. I find writing to be a great tool to understand new areas, and teaching others through your writing is a great way to master a field. If you want to be a better writer, my recommendation is to read Style: Toward Clarity and Grace.  I found this book gave me the tools to understand why my prose didn’t work and how to fix it.

Andrew Eichenbaum Tue, 26 Feb 2013 16:14:47 +0000 miketwardos Andrew Eichenbaum is the Head of Research and Analytics at Yummly, a semantic web search engine for food, cooking and recipes. Previously he was a Senior Analytics Engineer at Yoono and an Analytics Scientist at MyBuys. He specializes in search, personalization, NLP, reporting, product design/road-map

Hi Andrew! What is your educational background and how has it prepared you for your role in data analytics and science? What skills did you not develop in school that you find important in your work?

My highest degree is a PhD in Physics from the University of Wisconsin, where I specialized in experimental particle physics. This area of research inherently requires work with a very large number of data points, so statistical analysis and programming were basic to daily research. Along with that, higher math and numerical methods were used and developed to allow us to overcome problems that we came across.

The biggest problem with this background was that programming was a tool I learned as I went. Thus the quality and cleanliness (e.g. coding style, commenting, etc.) of my code was deficient when I first entered the work force. The second big problem that was not addressed is communication with experts from other fields. In graduate studies you learn to talk the lingo with your colleagues and at conferences. But people doing similar analytical work in another field can have a whole different way of talking about a similar problem.

Both of the problems were surmountable in time, but I did lose out on some of my first interviews by just not knowing how to communicate or how to write nice/elegant code.

What are the biggest challenges in data science and analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

There are a lot of problems in data science and analytics. They range from better ad targeting, to transportation safety, to defining social interactions. But the biggest challenge for the scientist is finding a problem that they are passionate about, and a place they can work on it. When the data geek just works on another data problem, they will get results that may be useful. But when they are immersed in a problem that they cannot stop thinking about, those challenges will be in the back of their heads at all times. This leads to 2 AM coding sessions to check out a new bit of data, or the suggestion of a new idea that might be even more valuable than the original project.

As for what is the most important thing to ‘get right’, two words: data cleanliness. If you have dirty data, or worse yet, data you do not understand, luck is the only thing that will save your project. The first thing I do whenever starting a new project is look at the data to understand what is going on. There is no reason not to plot out original distributions and compare them to base assumptions. Sometimes the problem you are trying to solve is not a problem with the data, but a problem with the way people understand the data. I find these solutions have the highest ROI for any business since they take so little time, and can have such a big impact on the bottom line. Finally, use whatever technology works for you and the situation. Remember there is no one correct technology, just as there is no one correct solution to any given problem.
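
A first look at the data does not need much tooling. The sketch below is one possible way to do it, not Andrew's workflow: a small hypothetical `profile` helper that counts missing values, compares mean with median, and builds a coarse histogram. The sample column (cooking times in minutes) is invented for illustration and deliberately contains a missing value and a suspicious placeholder code.

```python
import statistics


def profile(values, bins=5):
    """Summarize a numeric column: missing count, range, center, histogram."""
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    width = (high - low) / bins
    if width == 0:
        width = 1.0  # all values identical; put everything in the first bin
    counts = [0] * bins
    for v in present:
        # Clamp so the maximum value lands in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": low,
        "max": high,
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "histogram": counts,
    }


# Hypothetical column: cooking times in minutes. 9999 looks like a placeholder.
cook_minutes = [20, 25, 30, None, 35, 40, 45, 9999, 30, 25]

report = profile(cook_minutes)
for key, value in report.items():
    print(f"{key:>9}: {value}")
```

Here a mean far above the median and a histogram with nearly everything in the first bin both point at the placeholder value, the kind of surprise that is cheaper to find before modeling than after.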

What’s your definition of data analytics and science?

Data Science/Analytics is the process of taking a question, and then answering that question using a data-driven approach. The finer points of data science are formulating the question into something that can be answered with data, and getting the data into a format that you can use to answer the reformulated question.

What advice can you give someone with little experience in analytics to pursue a career in the field?

I believe that data people are born with a certain mindset that cannot be learned by taking classes or trying to solve problems. If you are wondering if you could be a data person, ask yourself this question: Have you ever read an article that states, “There is conclusive evidence,” or “The results were statistically significant,” and wondered how they came to these conclusions? If the answer is yes, then you might have what it takes to be a data scientist.

Kovas Boguta Tue, 19 Feb 2013 21:14:54 +0000 miketwardos Kovas Boguta has previously worked at Wolfram Research and Wolfram Alpha in various research roles. He then went on to found Infoharmoni, a social media analytics company. Currently, he is the Chief Analytics Engineer at Weebly.

Hi Kovas! What is your educational background and how has it prepared you for your role in data analytics and science? What skills did you not develop in school that you find important in your work?

BS in Math, with a minor in CS, from the University of Chicago. One of the great benefits of formal education in math and science is that it teaches you to debug your thought process, and refines your BS detector. Being able to read research papers certainly helps too. Unfortunately, none of my classes forced me to become fluent in the Linux toolchain, so I had to pick up those skills much less efficiently later.

What are the biggest challenges in data science and analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

There are so many!

On the technical side of things, our tools are pretty bad at supporting the scientific method. Source control, IDEs, and most everything else are geared towards engineering rather than doing experiments. IDEs have plenty of support for unit tests, for instance, but almost none for visualization. Environments like Mathematica and IPython support the scientific workflow much better, but are still far from a panacea.

Communicating the results of experiments is also a huge challenge. Pie charts don’t convey a lot of nuance, but that’s what typical analytics consumers expect. A compromise solution is to have more sophisticated presentations, but with the most important features prominently highlighted and labeled, and then re-emphasized in textual form.

Ultimately the only way to gain intuition around data is to work with it yourself. And we need to find better ways to empower the rest of the population to do exactly that.

What’s your definition of data analytics and science?

Data science is really the science of computation. If we understand the computation, then we can predict it, or program it. That is where the value comes from.

What advice can you give someone with little experience in analytics to pursue a career in the field?

The best advice I ever got was from Stephen Wolfram, which was “Ask the simplest, most obvious questions first”. This is a very scalable bit of advice. It applies no matter how sophisticated or experienced you are.

And it blends nicely with Paul Graham’s advice of “Make something people want.”

So I would start by answering simple questions that someone cares about, or should care about. That someone might be you, with questions about your personal finances or health data. Or it could be your organization, a customer, or a community you are in. Answer some simple but useful questions for someone. If it’s useful, they will come back with more questions, and you are rolling.

In terms of employment, research and academia are a great place to start, particularly if you can attach to a lab doing research that has commercial potential, like robotics, computer vision, ecological or social modeling, etc. Another good choice is to go work for people building cutting-edge analytics tools. Or you can start with a growth area, like personal health analytics or political analytics.

Jonathan Hsu Wed, 13 Feb 2013 18:55:53 +0000 miketwardos Jonathan is currently an Analytics and Data Science Manager at Facebook. In this role, he manages a team of analysts and data scientists working on a variety of ways to use data to make the product better. His analytics work began on the Metrics team at Slide, after his company, which had launched a successful Facebook app, was acquired. At Slide, he led a small team of analysts covering all analytics responsibilities for the company.

Hi Jonathan! What is your educational background and how has it prepared you for your role in data analytics and science? What skills did you not develop in school that you find important in your work?

My undergraduate degree was in physics from UC Berkeley. I then went to Stanford where I did my PhD in theoretical physics. My work was on black holes and cosmological inflation in string theory. The biggest things that I missed in my PhD were programming work and thinking about industry. All of my work in my PhD was rather formal pencil and paper math. I didn’t write any meaningful code until I got to Slide and I certainly didn’t spend any time thinking about what technology companies were trying to accomplish.

What are the biggest challenges in data science and analytics? What are the most important things to ‘get right’? What are the best technologies available to solve these problems?

The biggest challenge for any company is to be successful and the judicious use of data to guide strategy can be a major factor contributing to that success. I would say that the most important thing to “get right” is to never lose the forest for the trees. There are many start-ups that have failed because they put all their faith in a single data mining approach or a single quantitative view of the world. The best analysts in the consumer web startup space understand that the success of a company rarely rests solely upon the level of sophistication or the amount of statistical significance with which some quantitative question can be addressed. It’s important to have a flexible approach that can incorporate both formal statistics and back-of-the-envelope estimations to understand a given problem with varying levels of urgency.

Regarding technologies, I am partial towards Python, SQL and R.

What’s your definition of data analytics and science?

In the current parlance of Silicon Valley, the terms “data science” and “analytics” span a wide variety of functions. I think of analytics as an extension of what is traditionally thought of as Business Analytics with significantly stronger technical capabilities that include some level of proficiency in modern tools and technologies of quantitative analysis. This end of the spectrum is often closely aligned with Product Management and/or “inbound” marketing at companies in Silicon Valley. In this case, the role is about using data to help the company make decisions.

The term Data Scientist is also used to refer to software engineers who build features that are based on sophisticated handling of large data. This typically involves applying some machine learning technique to a recommendation, targeting, or signal detection problem. This end is more closely aligned with traditional software engineering in that the goal is to implement some particular feature with the usual software engineering concerns (scalability, performance, etc.).

Finally, Data Science sometimes refers to research-oriented roles that are involved in long-term, academic-style research projects. This description is often useful for attracting PhDs to use their sophisticated tool sets to tackle important business problems.

At Facebook, we tend to use the term “data scientist” for people who are doing a fair amount of programming work and “analyst” for everything else. Across the Valley, I’ve seen all sorts of titles for these roles: data scientist, analyst, data analyst, analytics engineer, analytics scientist, etc. In the real world, roles tend not to be so cleanly differentiated. The majority of data people I have known in the Valley identify with all three profiles above to varying degrees. As with most things in the business world, the key trait is flexibility to do whatever you can to make the company successful given your particular mix of capabilities and interests.

What advice can you give someone with little experience in analytics to pursue a career in the field?

There are two pieces of advice that are probably relevant. The obvious one is that it pays to have proficiency in the tools of the trade. A little bit of working knowledge in programming languages (both statistical and scripting languages) goes a long way. The more overlooked piece of advice regarding this space is that you should really be fundamentally interested in what the company is trying to achieve. While it’s definitely important to have technical depth, you will probably get bored if you are not truly interested in the core mission of a company. On the hiring side of these roles, it’s generally not too hard to find people with either very high levels of technical skill or genuine passion about what we do. However, it turns out to be very hard to find people with both of these qualities in excess. When people get very proficient in the technical aspects there is a tendency to fall in love with the techniques and to lose interest in business objectives. So while it’s definitely a good investment for you to learn how to write code and learn how to use R, it’s also a good idea for you to read the industry press and explore the strategic landscape that you’re considering to be sure that you really care about what the industry is trying to achieve.

Jake Porway Tue, 05 Feb 2013 17:46:55 +0000 romymisra Hi Jake! What is your experience in data analytics and science roles?

  • Computer vision researcher, UCLA, worked to get computers to recognize objects better.
  • Data scientist, Utopia Compression, sold my soul to do R&D for DARPA, erring on the side of the greater good (e.g. automated landmine removal as opposed to automated baby exploding).
  • Data scientist, New York Times R&D Lab, worked to understand how data would transform journalism and the world writ large.  One of the best jobs ever.
  • Founder + Executive Director, DataKind, working to get socially conscious data experts teamed up with visionary social orgs to make the world better through data.  The best job ever.

What is your educational background and how has it prepared you for your role in this field?  What skills did you not develop in school that you find important in your work?

I got extremely lucky in my academic choices.  I started out getting a B.S. in Computer Science with a focus on intelligent systems, which gave me the software skills and critical thinking for building tech solutions and dipping my toe into machine learning.  I got incredibly lucky in choosing my advisor for grad school, who forced me into a Statistics Ph.D. instead of a Computer Science Ph.D., where I learned the mathematical and statistical foundation for drawing conclusions from data and building computer systems to take advantage of that.

I was very fortunate to be in an extremely applied statistics program at UCLA, so thankfully we were taught many of the computing skills that I hear other stats programs lack.  However, like most tech disciplines, there isn’t enough time to learn every individual tool, so I found myself picking up Python and Processing on my own.  More than that though I would have loved to learn communication design, a topic that I think is *sorely* lacking in the scientific community.  90% of our jobs are (or should be) communicating our results to the non-technical as well as the technically oriented, so being able to visually and orally communicate what we’ve done is a hugely important skill.  I recall a vague “learn how to present!” course being offered through a well-intentioned career services group that I just never got around to taking amongst all my other commitments.  I wish that had been mandatory or that the culture of communicating results had been built into the coursework itself.  R’s default graphics are not only unsexy, they can be misleading if you don’t know what you’re doing.

What are the biggest challenges in data science and/or analytics?  What are the most important things to ‘get right’?  What are the best technologies available to solve these problems?

You’re going to hear this so much more in 2013, but the biggest challenge, IMO, is asking the right questions.  People are excited to dive into a new dataset and get “hacking”, or often come to us at DataKind with a big dataset, plop it down, and say “now what?”, but the data isn’t going to ask the questions.  Sure, there are lots of exciting and new things we can learn from data, but without someone with the vision of what needs to get done, be that hitting a performance metric in a company or broadly understanding a trend in your field, you’re just going to be spinning your wheels.  Data should be used in service of solving the bigger problems that an expert can help identify.

One of our major principles at DataKind is that we team data scientists with subject matter experts because, for all of our software writing and data analysis skills, we don’t know what app or analysis is going to be most useful for, say, alleviating poverty.  As the barriers to obtaining and analyzing/visualizing data disappear, you’re going to see a glut of projects that people pull together merely because they can.  The real distinguishing feature between these projects and the ones that really have lasting impact is that the latter solve a problem that was scoped with someone who understands what that analysis/visualization/data tool is ultimately going to be used for and why.

What’s your definition of data analytics and data science?

Woof, I don’t want to wade into a flame war but, simply put, data science (to me) is merely statistics souped up with some programming skills.  Academic statistics really missed out on a marketing opportunity by letting industry define “data science”.  The term data science was being batted around as a term for statistics in academic circles as early as 2001, when Bill Cleveland called for beefing up stats programs with more technical skills.  Cosma Shalizi makes this point better than I ever could, but statistics has always been the discipline dedicated to:

1) collecting data
2) exploring data for hypothesis generation and assumption testing
3) modeling data using mathematical models
4) drawing conclusions from the data about the world in general
5) communicating those results to the public.

The only fundamental shift I see is that increased computing power has touched every one of those steps – the ubiquity of cellphones and computers means we need programming skills to collect and manage data, new software exists for visualization and analysis, more powerful computers have made previously impossible statistical methods like Monte Carlo methods tractable, and we can interactively display information that used to live on the printed page – and the whole process has become democratized with the removal of barriers to cheap and accessible tech and data.  Aside from that, the core process of collecting data and making sense of it still squarely falls in the realm of statistics, and the sooner programmers pick up stats skills and statisticians become facile in programming, the better off the data science community will be.
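
To make the Monte Carlo point concrete, here is a minimal sketch of the idea: instead of deriving a probability by hand, simulate the process many times and count outcomes. The dice example, the `estimate_probability` name, and the fixed seed are assumptions chosen for illustration; they do not come from the interview.

```python
import random


def estimate_probability(trials, seed=0):
    """Estimate P(sum of two fair dice >= 10) by repeated simulation."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        if roll >= 10:
            hits += 1
    return hits / trials


# Six of the 36 equally likely outcomes sum to 10 or more.
exact = 6 / 36
estimate = estimate_probability(100_000)
print(f"simulated: {estimate:.4f}, exact: {exact:.4f}")
```

Here the exact answer is easy to compute, which makes the simulation easy to check. The same loop still works for problems where no closed-form answer is available, which is where cheap computing power matters.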

What advice can you give someone with little experience in analytics to pursue a career in the field?

As Hal Varian put it some years ago, “the sexy job in the next 10 years will be statistics.”  You can’t browse the news without hearing that data scientists are the new “sexy rock stars” (proving that the term ‘sexy’ is very open to interpretation), so the impetus to join is there.  If you’re already sold and just want to know how to get started, I’d recommend taking advantage of this amorphous time to pick up some new skills alongside the growing data science community.  Like I said above, if you know how to program, join a statistical Meetup near you.  If you know statistics, take some classes on programming.  The data community, at least here in New York, is wildly inclusive and open, and I’d encourage anyone interested in this field to dive in by going to a hackathon and introducing yourself or following along on message boards and forums if face-to-face isn’t your bag.  Moreover, there are now more opportunities to learn “data science” than ever before, from new programs at universities like Columbia University and Rice University, to accelerated courses at places like 3rd Ward or the Insight Data Science Fellows program, to online courses from Coursera.  If you’re interested in the field of analytics and data science, I’d say roll up your sleeves and jump in!

How do you think the field will be different in 5-10 years?

Hah, things change so quickly I can’t even dream of a world more than 2 years from now.  Remember what things were like 10 years ago in 2003?  Facebook was a twinkle in Zuckerberg’s eye, the iPhone was a good presidential term away, and even companies like Google weren’t sure how or why to hire statisticians.  It was like the dark ages.

I will say this, regardless of the time frame:  You’re going to see data everywhere, and very soon.  The hype of “big data” is already reaching a fever pitch where most people have heard of it and, if we do our jobs, that hype will settle into a world where data and analytics are first-class citizens in decision making.  The fun thing about that last idea is that the term “decision making” applies to everything.  We’re not just talking about industry decisions like optimizing supply chains or understanding customer sentiment, we’re talking about everything from healthcare delivery decisions to government aid distribution decisions to even just what you eat every day.  We’re going to be living in a world where everything is instrumented, everything recorded, and all of that information is going to be used to adjust our practices for the better in real time.  Lest you see that as a dystopian neo-Tokyo world of endless circuitry and Big Brother style privacy erosion, I believe that we will apply these new technologies and information streams to improving our world.  I know that’s what I’ll be working toward.

Have questions?  Continue the conversation in the comments.
