29 Sep The most expensive data science mistake you can make
Lately, with some of our clients, we have reached the point where we start to tackle more complex use cases beyond simply consolidating data from a dozen-plus sources for reporting purposes. Do not get me wrong, data integration (getting data into one place using obscure APIs and formats, transforming the data in any possible way and delivering it all with stability and reliability, basically ETL) is still my beloved discipline. Most people out there consider this the “boring” part, but I enjoy it. Nowadays our clients usually have all those pipelines set up properly and are thinking about how to “take their data further”. Obviously, the spotlight is on data science and advanced analytics.
The usual preparation steps are:
- good and stable infrastructure with the possibility to scale
- reliable processes in place
- all data nice and clean in one place
You can even find Maslow’s Hierarchy of needs for data analytics describing exactly this: http://data-informed.com/data-as-a-service-and-the-analytics-hierarchy-of-needs/.
But what if the customer is mature enough and wants to explore the interesting world of data science by implementing some useful algorithms?
First we need to identify the business case – some part of the client’s business that can benefit the most from the investment. Given that this is a fairly technical article, let’s suppose we have already identified some key problems and can start to think about the best way to solve them.
When making the plan of attack, one of the customer’s first questions is usually “How much money do I need to put into this project?” And my reaction: “As much as you want to.”
Let me explain.
I am a data engineer with a low tolerance for bullshit and a fairly strong business focus – I am all about the ROI of the project.
In my opinion, data science projects come in different tiers. There is a great difference between implementing an algorithm for anomaly detection using a ready-made library, using a black box in the form of SaaS, and hiring a high-end data boutique to create a custom-tailored algorithm that fits your needs. Let’s dive into those categories.
Quick and dirty Data Science
This is what I call the “do a lot more with a lot less” approach. There is a huge number of open source libraries and nice solutions you can apply to your data with a minimal investment of time and money. It can also be characterized by Facebook’s “fail fast” approach.
Imagine you have data from an advertising system about costs over time. Sometimes the costs go really high or really low for some reason (a new competitor enters the market, there is a problem with your product, …) and you want to get a notification the moment an anomaly happens – something out of the ordinary.
The business case is set and you see the value in it, so let’s implement it. Of course, you can hire a data scientist with a PhD to study complex mathematics and implement a variation on 2x standard deviation in R. This script will run on some computer in your company with no documentation and you will always be reliant on the author of the algorithm. Does not sound nice.
Or, you can just use Twitter’s Anomaly Detection library from here: https://github.com/twitter/AnomalyDetection. It is open source with great documentation and can be set up in literally 4 lines of code. And it can even recognize seasonality in your data.
The amazing thing about this approach is that you can set it up, try it and, if it does not suit you, just throw it away. You spent minutes trying it, so it was not an expensive mistake. And if it works? Awesome, you are now an (almost) data-driven company!
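To show how little code the most naive baseline takes, here is a minimal sketch of the “2x standard deviation” rule mentioned above, in plain Python. This is not Twitter’s library and it knows nothing about seasonality; the function name `flag_anomalies`, the 7-point window and the sample costs are all my own assumptions for illustration.

```python
from statistics import mean, stdev


def flag_anomalies(costs, window=7, threshold=2.0):
    """Return indices of points that lie more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(costs)):
        history = costs[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # A perfectly flat history has sigma == 0; skip it rather than
        # flag every tiny change that follows.
        if sigma == 0:
            continue
        if abs(costs[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical daily ad costs; the spike at index 9 is made up.
daily_costs = [100, 102, 98, 101, 99, 103, 100, 101, 99, 250, 102]
print(flag_anomalies(daily_costs))  # prints [9]
```

A rule like this is exactly the kind of thing you can try in minutes and throw away. Its obvious weakness is that a weekly or monthly pattern in the costs would trip it, which is where a library that handles seasonality earns its keep.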
This does not stop at anomaly detection, though. Care for forecasting time series data? Facebook’s Prophet has you covered: https://facebookincubator.github.io/prophet/.
Want to build a recommendation engine? How about this guide? http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html
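To give a feeling for how small such an engine can be, here is a rough sketch of a “similar products” recommender based on who bought what. It is not taken from the linked guide; the function name `similar_products`, the cosine-similarity scoring and the sample orders are my own assumptions.

```python
from collections import defaultdict
from math import sqrt


def similar_products(orders, product, top_n=3):
    """Rank other products by cosine similarity of their buyer sets."""
    buyers = defaultdict(set)
    for customer, item in orders:
        buyers[item].add(customer)

    target = buyers.get(product, set())
    scores = []
    for other, other_buyers in buyers.items():
        if other == product:
            continue
        shared = len(target & other_buyers)
        if shared == 0:
            continue
        # Cosine similarity between two binary "who bought it" vectors.
        score = shared / sqrt(len(target) * len(other_buyers))
        scores.append((other, score))
    # Highest similarity first; ties broken alphabetically for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


# Hypothetical (customer, product) purchases, invented for illustration.
orders = [
    ("ann", "hamlet"), ("ann", "macbeth"),
    ("bob", "hamlet"), ("bob", "macbeth"), ("bob", "jazz night"),
    ("cat", "hamlet"), ("cat", "jazz night"),
    ("dan", "jazz night"),
]
print([name for name, _ in similar_products(orders, "hamlet")])
```

With real data you would read the pairs from your order table instead of a hard-coded list, and that is about all the integration work there is.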
The way to use this “quick and dirty” approach is very simple – before you actually start doing complex math and hiring a team of data scientists, just google for an existing solution. Do not reinvent the wheel.
Data Science as a Service
Sometimes you do not have someone in-house who could get you started with libraries in R or Python, or you simply like the “it is somebody else’s problem” approach. And that is totally fine!
In recent years we have seen a continuing trend of “Data Science as a Service”, where companies like Recombee or Geneea develop high-performing algorithms and offer them to clients for a monthly fee.
The procedure is simple – you give them the data, they do their magic and you get insights. Recombee can create a recommendation engine for you; Geneea will analyze your unstructured data – e-mails from customer support, content on social networks, the sentiment of tweets.
How is it different from the open source libraries I mentioned before? Simply put: in performance. While you can get only so far with unmodified open source libraries, those DSaaS companies have invested a lot of resources into building their algorithms. Let’s say that with the libraries you get 75% precision and with DSaaS you can get 85%. But what about the remaining 15%?
Top-shelf data boutiques
That is where specialized data boutiques like aLook Analytics come into play. Once you have reached 75% with open source libraries and simple guides from the internet, and 85% with DSaaS black box solutions, you can get to 95% precision with an algorithm custom-tailored to your use case.
Algorithms only get better if you provide them with more insight into your business. It is not only about how much data you feed them, but also about the design and custom rules implemented directly in the algorithm. And that is where the money is spent – customizing the algorithms to your exact needs to reach top precision.
So what is the best way to go? I hate to say it, but “it depends on the use case”. If your business relies solely on data science (like Netflix and Amazon on their recommendation systems), you should probably have a great in-house data science team, because even a 1% move in precision can make you a lot of money. On the other hand, if you sell tickets for a theatre or concert and want your clients to click more in e-mails, a very simple content-based or collaborative filtering recommendation engine can do the job quite well.
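To make the “it depends” a bit more tangible, here is a back-of-the-envelope sketch in Python. Every number in it is invented for illustration, the function name `yearly_gain` is hypothetical, and the assumption that each extra correct recommendation is worth the same fixed amount is mine and a big simplification. Plug in your own figures.

```python
def yearly_gain(recommendations_per_year, value_per_hit,
                precision_before_pct, precision_after_pct):
    """Extra yearly value from a precision improvement.

    Assumes (simplistically) that every additional correct
    recommendation is worth the same fixed amount.
    """
    extra_points = precision_after_pct - precision_before_pct
    return recommendations_per_year * value_per_hit * extra_points / 100


# All figures are made up for illustration; none come from real companies.
CUSTOM_WORK_COST = 200_000  # hypothetical yearly cost of custom-tailored work

# A business living off recommendations: huge volume, 1 point of precision.
streaming_gain = yearly_gain(500_000_000, 1, 85, 86)

# A small ticket seller: modest volume, even with a 10 point jump.
theatre_gain = yearly_gain(50_000, 2, 85, 95)

print(streaming_gain > CUSTOM_WORK_COST)  # is the custom work worth it?
print(theatre_gain > CUSTOM_WORK_COST)
```

With these invented inputs the high-volume business clears the cost of custom work on a single precision point, while the ticket seller does not get there even with ten. That is the whole argument in two lines of arithmetic.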
In our next webinar, we will show how easy it is to use Twitter’s open source library for anomaly detection, to demonstrate how you can start empowering your company without exorbitant costs for data science projects.