Is data analytics collaboration the real challenge you should be addressing?

This is a guest post originally published on Medium by Giuliano Giannetti, Senior Reporting & Business Analyst.

The corporate single version of truth and data-driven collaborative decision making: these are the holy grails of analytics today. In this post, I will try to cover the reasons why I believe we fail so miserably at collaborating on data, and why using Keboola and GoodData made it so easy for me.

My story: after leaving Keboola, a lean startup driven by crazy people, where I had not seen an offline Excel sheet for almost two years, I recently ended up at a big (and actually awesome) company, on a team of seven super smart people who do analysis on the same data for a lot of internal customers, using the usual combo of Alteryx and Tableau.

Seeing seven people working on the same data, you would probably expect the data cleaning and ETL process (which usually takes up to 80% of the time) to be done just once. Then you would see us collaborating on the data, reusing code written previously by others, and living happily ever after.

But in reality, we work more as individuals. Each of us has a crazy 20GB-RAM computer filled with downloaded data; we clean the data locally and feed it into Tableau dashboards online, creating an endless stream of separate and independent ETL processes.

Usually no one knows what has been done to the original data and why.

However, humans are not to blame. The culprit is the design of the tools we currently use for our analysis and, more importantly, for the presentation of data.


Tableau and the Single Table Paradigm

Tableau is actually at the heart of this problem. Please do not get me wrong: I love using Tableau for ad hoc analysis. But it is more of a tool for exploring your data than for maintaining and sustaining a reporting environment. The reason is the way Tableau, and most other tools except Qlik and GoodData, stores data. In Tableau's case, it is a single denormalized table that the workbook uses as the basis of all its reporting.

With a single table as the most efficient* way to store data, the most natural way to use Tableau Server is to create an infinite number of ad hoc reporting tables, each built for a particular purpose. The end of this is an unmaintainable Tableau hell: every analyst knows their own reports well, but when someone else wants to work with the same data, they are usually better off redeveloping their own ETL script and creating a new dashboard.

* I am aware that Tableau supports joins, but that is not the easiest way to use the product, so people end up working with Tableau in the way described above.


The Value Is in Collaboration

This is how reporting worked until recently, and how most analysts probably used to do their job: the boss wants a report, the analyst crunches an SQL query against the DB and pastes the data into Excel or something similar. The same pattern now plays out in exactly the same way in Tableau, just with better graphics and some workflow features. Is it an improvement? Yes. Is it a qualitative revolution? No way.

Think about it: the business person (the one who can actually create value out of the data) still needs to know a lot about how the data was cleaned and prepared before it was put into the dashboard. So, rather than doing even the simplest analysis by themselves, they take up 90% of their analysts' expensive time by tasking them with trivialities such as solving basic math problems and cleaning data.

And since the business person needs to communicate and explain what they want, a bottleneck often forms. Even simple tasks (requiring a couple of minutes of actual work) take days to weeks in a corporate environment and prevent important decisions from being made on time, which is even more expensive.

Let's just picture this: the business people are able to solve 80% of all requests by themselves, right away, instead of asking someone else to do it for them.

Well, this is what a data-driven company really is: a company where every person has the access, the means and a working understanding of the data needed to do their job.

So Let's Build a Super Data (Warehouse, Lake, Whatever)

I guess we have all been through the following scenario. Someone complained: "We need to collaborate on data, we need to make sure that we all look at the same data," and so on. And smart people always came up with the same solution: let's build a single Data Warehouse of Everything and Everyone, containing all the reports and data.

Unfortunately, it always ends like this:

Not so long ago, I attended Adastra's workshop on data warehousing. The consultant spent two hours describing 52 categories of issues associated with implementing a data warehouse. He said that the major issue for him was usually the "absence of a user layer" in data warehouses. In other words: no one actually uses them.

Again, I am not here to say that data warehouses are pure evil. But they should not be misused for purposes other than the one they were designed for, i.e., storing data.

The purpose of a data warehouse is to keep and maintain master data about an organization and make sure that there is at least one reliable copy of the data in the firm. And that's it.

Attempts to use a DWH for reporting will ultimately fail, not by mistake but by design, since a DWH is inherently:

  1. Slow: As the information needs to be carefully checked before it is committed to the data warehouse, it takes weeks to months before new data is added. In the case of new implementations, it can take years before you get it right. By the time you implement the model, the business will have completely changed.
  2. Incomplete: With an increasing amount of data living in cloud services like Google Analytics, you will eventually find that you cannot rely on the DWH alone for reporting. (Well, unless you want to download the whole internet into it.)
  3. Complex: Since it has to store everything, it is not uncommon for it to hold several hundred highly denormalized tables; not a place to go without proper documentation or help, even for an experienced analyst, let alone end users.

  4. Owned by someone else: The owner's interest is to make sure the data is stored. When choosing between making sure the data is written down properly and risking that your super-complicated query crashes the system, the owner will always choose the former over the latter.

I believe that data lakes are even worse in this respect, since they allow storing unstructured data, which makes things easier for the developers but even worse for the analysts.

Move Fast, Break Things and Keep It Simple

Let’s get to the solution. What do you need to immediately start working with data prepared by another person?

It turns out there are just four requirements for collaboration:

  1. Understanding: You need to be sure that you understand the data you are looking at without having to ask anyone for help.
  2. Means: You need the technical knowledge to actually do the analysis.
  3. Your own workspace: A place where you are not afraid of breaking someone else's work or the production environment, and where you are not concerned about available computing power.
  4. Trust: The most important requirement of all. You need to trust the data you are given; otherwise you will always fall back to verifying its source to make sure nothing has happened to it on its way to you.

My answer to these issues is to make it simple. For instance, a salesperson in a big company will always work with just a few of the hundreds of tables the company stores: a sales table, a customer table and maybe a product table. The same logic of course applies to marketing, logistics, etc. So:

  1. Use a simple and contextual data model: By stripping the data in the working environment down to the necessary minimum, you can understand how to use the data without any help. You could argue that you can do this in Tableau as well, but in my opinion this is where tools that support data models, like GoodData or QlikView, really shine, because they enable a variety of views of the underlying data without the need for further transformations. Be careful, though, about the complexity of the model: the more complex the model you build, the fewer people will be able to use it effectively, which will ultimately lead you back to bottlenecks.
  2. Limit the technologies needed to operate on the data: End users would love to have just one reporting tool. Analysts will probably have a set of SQL, R and Python. The more common the technologies used, the better; use the fewest technologies that will empower the largest number of users.
  3. Create a team-based data sandbox: If you want to collaborate, you need a workspace where you can safely share your code and data and work on them without being concerned about operational security.
  4. Let the teams implement their own processes: Trust is hard to achieve and easily lost. In my experience, people tend to trust most the things they do themselves. To achieve this, it is necessary to give analysts actual control over their data cleaning and ETL processes. However, this can't be done unless they share a common goal (1, a contextual data model) and a common working environment (3, the data sandbox).

How Using Keboola and GoodData Can Help

First, let's talk about GoodData: As much as I am sorry that this tool has not moved an inch in terms of UX design and visualizations over the last three years, GoodData has, from my point of view, some major advantages that together solve the "make it simple" issue stated above. It actually enables a data-driven environment:

  1. Completely cloud-based: If you want people to have a single version of truth, you cannot have them download data to their own computers for analysis. A cloud-based platform (or at least a private cloud) is a must, and GoodData delivers that. Say goodbye to patching, version issues and performance issues. That is their job.
  2. It has a data model: GoodData’s data model pre-calculates all possible meaningful combinations of joins between the tables in the model. This basically means that once you upload the model, you do not need to revisit the data for transformation and modification unless you need to add new entities to the model. This saves the analysts a lot of time, plus it also enables feature 3.
  3. Metrics as objects with predefined manipulation methods: If you are versed in object-oriented programming, then this is exactly how the metrics behave: you can create a global metric, share it, base a new one on it, and predefine what the user can do with it without breaking it.
     To make it less abstract, let's say you have a total sales number. GoodData will show you the SUM of all sales and tell you that you can break this number down by all the meaningful attributes included in the model, such as the month of the year or the office that produced the sales.
     What this does is HUGE: end users can examine the metric, decide how to pivot it, and get the right results without knowing anything about the underlying model or where the data came from. This makes self-service BI actually achievable. Look at this comparison of retrieving a number using GoodData and SQL:
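Roughly, the difference looks like this. This is only a sketch: the MAQL line is just a simple sum metric, and the table, column and attribute names in the SQL are hypothetical, not taken from any real project.

```sql
-- GoodData: the metric is defined once as a reusable MAQL object
-- (illustrative fact name):
--
--   SELECT SUM(Amount)
--
-- End users then slice this one metric by month, office or any other
-- attribute in the model directly in the report, with no further query.

-- Plain SQL: every new breakdown is another hand-written query against
-- the warehouse (hypothetical table and column names):
SELECT o.office_name,
       DATE_TRUNC('month', s.sale_date) AS sale_month,
       SUM(s.amount)                    AS total_sales
FROM   sales s
JOIN   offices o ON o.office_id = s.office_id
GROUP BY o.office_name, DATE_TRUNC('month', s.sale_date);
```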

As for Keboola Connection, it solves most of the other issues mentioned above. Keboola Connection is a cloud-based ETL solution that enables teams to extract data from a variety of online sources (databases, FTP, email attachments, REST APIs and various other services such as Google Apps), store it in a cloud-based workspace where it can be enriched with open data, transformed, and sent to a visualization tool (such as GoodData), all without leaving the web browser.

It covers the same points mentioned with GoodData (cloud-based and supporting a data model), but it also brings additional features making collaboration on data possible:

  1. Simplifies the data environment: Keboola basically copies the data from different sources into a single working environment, so your team can extract just the minimum data needed for the analysis, at the minimum needed level of granularity. By doing this (say, by limiting the financial data to sales only, or by limiting the granularity of the data to a user/day level), it is possible to effectively simplify the data to a standard that the whole team understands. Also, since you can create multiple projects, there is no problem creating multiple context-based workspaces that share data with each other, again limiting the number of concurrent ETL processes run by different teams.
  2. Limits the technology needed: The problem with extracting data from, let's say, 50 sources is that someone needs to gain access to each data source, understand how to make an export (and thus understand the source data structure) and then learn the technological process for doing the extraction (anything from a click to an API call or a NoSQL query). This is complex, and you want to save time by doing it only once. Keboola by design always copies data from its sources and converts it to a CSV file. This means that once the extraction is done and automated, a single person can access the data from those 50 sources using just one environment and one technology. Second, Keboola supports a variety of standard technologies for transformations, for example R, Python and SQL, covering the most widely used technologies for data manipulation (there is a small sketch of such a transformation after this list).
  3. Individual sandboxing and operational safety: In Keboola, you always work on copied data, so the risk of interfering with the source system is limited to the first data upload. Once the data is in, there is virtually no way to screw things up. For work on transformations, Keboola provisions a temporary sandbox for each user: it downloads a copy of the copied data into an independent sandbox, where users can manipulate it without fear of interfering with any other running process. The Snowflake-based backend is astonishingly fast; it actually gives users the power to work with big data from the browser.
  4. Trust: This one is the hardest to tackle, since it mostly depends on the people working on the project. I believe that trust can be achieved by the team using the tool correctly and by having a clear vision of the model they want to create. However, Keboola currently does not offer much to support cooperative user behavior by design. This could easily be changed by adding better documentation features for the stored datasets, and by promoting good behavior and best practices through some UX tweaks. I am sure this will come in the not-so-distant future.
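To make points 1 and 2 above less abstract, here is a minimal sketch of what such a simplification step could look like as a plain SQL transformation. The table and column names are hypothetical, not taken from any real Keboola project.

```sql
-- Hypothetical SQL transformation step: reduce the raw export that an
-- extractor landed in Storage to the simple, user/day-level table the
-- whole team has agreed to work with.
CREATE TABLE out_sales_user_day AS
SELECT user_id,
       CAST(created_at AS DATE) AS sale_day,
       SUM(amount)              AS total_amount,
       COUNT(*)                 AS order_count
FROM   in_raw_sales
GROUP BY user_id, CAST(created_at AS DATE);
```

Everyone on the team then reports against this one simplified table instead of re-deriving it from the raw export on their own laptop.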

Some Other Tools: Comparison and Notes

Here is some tech I've had the opportunity to test lately, and some thoughts about it:

Alteryx: Alteryx is a tool designed to help end users do transformations on their own computers with a drag-and-drop interface.

In my opinion, it has two flaws: First, you do not want end users to do the transformations, and second, you do not want them to use their laptops for it.

Based on my experience, this tool has turned out to be a trap: it lures you in for a quick transformation, but it ends in a lot of pain. Automation with it is a nightmare because it works similarly to Tableau; it forces you to create a workflow on your laptop and then upload it to an Alteryx server that replicates the workflow online. In the end, you will find yourself trying to figure out whether the server uses the same drivers as your PC, or whether it runs the same version. Not to mention the fact that it forces you, again, to keep offline data on your PC for testing purposes. Debugging is also a problem.

 

Tableau: Great for ad hoc analysis that can be thrown away after it has been used. If you need to analyze a single dataset and are pretty sure you will not need to revisit the analysis later or collaborate on it, this is the way to go. I believe the best use of Tableau is in combination with Alteryx, for people such as external consultants who receive data exports, work on them and then move on. For instance, it is a great time saver when you are doing data exploration prior to building a statistical model.

 

R: The good thing about R is that everything can be done with it. After all, it is a programming language. On the other hand, it is complicated and time consuming. The advantage of tools like PowerBI, Tableau and GoodData is that once the data is prepared, it can be explored and analysed without further manipulation. That is not, however, the case with R. Creating a subset in R means creating a new variable, and thus writing code. Every package and function has different requirements for its data inputs, which again means preparing the data and, even worse, double-checking for errors to make sure the data is correct and comparable across several functions.

 

PowerBI: Still immature, but there are several things that I love about this tool. The DAX language provides a mature framework for creating metrics, with features such as fuzzy search and several other functions borrowed from Excel. Then the community: PowerBI, in my opinion, skyrocketed thanks to the fact that the community can create and actually share custom visualizations and scripts for the platform. Then the R support: many of the problems with R can be mitigated by its integration with PowerBI. You can do a quick exploration in PBI and then drag and drop relevant attributes from PBI into the R window, which manages the process mentioned above. It feels like a beta, but the idea is very promising. The bad: the data model is supported only in the desktop version, which seems artificially limited to 250,000 rows per dataset, making it unusable for most cases. The online version is highly dependent on Azure, making it more of a good tool for analyzing data on the Microsoft cloud than a truly independent data analysis tool.

 

Looker: I have only had the chance to see the demo, and it does not seem to bring anything new to town. Looker does what the name says: it creates a supportive layer over a database that lets you predefine metrics using SQL and stores these predefined metrics in an environment accessible to the users. First of all, to use Looker you need the data already cleaned and stored in one database, so it has to operate over a data warehouse or with an additional ETL tool with independent storage, such as Keboola Connection. Second, creating the metrics is really complicated, and changes to them mean revisiting the underlying data. To my understanding, this process requires a BI expert even for simple manipulations of dashboards, making self-service BI impossible. Since the computing power is provided by the client's own database underneath Looker, I believe this brings operational concerns for the end user as well.

Bime: Works very similarly to Looker. It reads data from a database and needs the heavy hand of a BI expert for data analysis, so the same applies.

 

RapidMiner: I love it. The data manipulation part is a bit hard, but once the single table is loaded, you can use it like Tableau for statistical exploration of a dataset, as in R but faster. The documentation is well done, so you can fairly safely try previously unknown statistical methods, as it will guide you through best practices. A novelty is also the recommendation function, which suggests tools used by the community while you design a process. I talked with their sales rep lately, and I feel this tool is a bit in trouble, since it provides services every data scientist wants but will not be willing to pay for: if I can use RapidMiner, I can use R as well, which is free, just with somewhat more discomfort. The features businesses are willing to pay for, such as maintaining a reporting/data environment, are completely missing.

 

GoodData: The good: MAQL, a query language like DAX but more abstract, which lets you write a lot of complex metrics in a few lines of code; and the data model, as mentioned above. The bad: I cannot make a histogram without revisiting the underlying data (come on guys, really? Why can't I use a metric as an attribute?), and the lack of development in terms of visualizations makes GoodData look like a poor relative of Tableau.