What is a Data Lake?

A Data Lake is a new working method that simplifies and enhances your ability to store, manage and analyse Big Data from heterogeneous sources, in their native or near-native format.

In short, a Data Lake is:

  • A place where you store structured and unstructured data
  • A tool to analyse Big Data
  • A resource to access, share and compare data for business use

It's a new working method because the systems used so far to store, process and analyse data - following a so-called "Data Warehouse" architecture - are structured around, and therefore limited by, the end use you plan to make of the data.

In a Data Warehouse system, raw data must be structured and processed through a so-called Schema-on-write approach: first you define the database structure, then the data are written into the database, and whenever they need to be used they're returned in the format defined by the database structure, never in their original format. As a result, a database designed this way can accommodate only a limited number of data sources.
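To make Schema-on-write concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, the column names and the sample event are assumptions made up for illustration; they don't come from any specific Data Warehouse product.

```python
import sqlite3

# Schema-on-write: the table structure is fixed before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer_id INTEGER NOT NULL, amount REAL NOT NULL)"
)

# A raw event from a hypothetical web shop; it carries more than the schema allows.
raw_event = {"customer_id": 42, "amount": 19.9, "browser": "Firefox", "coupon": "SPRING"}

# Writing forces the event into the predefined columns; the extra fields are dropped.
conn.execute(
    "INSERT INTO purchases (customer_id, amount) VALUES (?, ?)",
    (raw_event["customer_id"], raw_event["amount"]),
)

# Reading returns the data in the shape of the table, not in its original format.
stored_rows = conn.execute("SELECT customer_id, amount FROM purchases").fetchall()
print(stored_rows)  # [(42, 19.9)]
```

The browser and coupon fields never reach the database: if a later analysis needs them, the table structure has to change first.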

In a Data Lake system, by contrast, you adopt a so-called Schema-on-read approach: data are acquired in their native format, following a set of policies that standardize acquisition methods, timing and rules for each data type. Each data point is tagged and identified through a metadata system that qualifies the data, so that you can query the Data Lake in search of specific information and get back only the relevant data.

It's the query, not the database structure, that determines the output: the environment you can search is the entire universe of available data, regardless of the source the data were acquired from.
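The Schema-on-read idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory list standing in for the lake, the metadata keys ("source", "type") and the query function are hypothetical, not the API of any real Data Lake platform.

```python
import json

# Raw records are kept as received; each one carries metadata tags (hypothetical names).
lake = [
    {"meta": {"source": "web_shop", "type": "purchase"},
     "raw": '{"customer_id": 42, "amount": 19.9, "coupon": "SPRING"}'},
    {"meta": {"source": "crm", "type": "contact"},
     "raw": "42;Jane Doe;jane@example.com"},
    {"meta": {"source": "mobile_app", "type": "purchase"},
     "raw": '{"customer_id": 7, "amount": 5.0}'},
]

def query(records, data_type, fields):
    """Select records by metadata tag and apply the requested structure on read."""
    results = []
    for record in records:
        if record["meta"]["type"] != data_type:
            continue
        parsed = json.loads(record["raw"])
        # The caller decides which fields matter; missing ones come back as None.
        results.append({field: parsed.get(field) for field in fields})
    return results

purchases = query(lake, "purchase", ["customer_id", "coupon"])
print(purchases)
# [{'customer_id': 42, 'coupon': 'SPRING'}, {'customer_id': 7, 'coupon': None}]
```

Nothing was decided about the coupon field at write time. A different query could ask for the amount instead, and the stored records would stay untouched.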

What's the advantage of adopting a Data Lake?

The benefits are numerous:

Reduced storage costs and virtually unlimited storage space

Managing large quantities of data by means of data warehouse systems is costly and inefficient. The same data set may be replicated several times when different analysis applications require different database structures. Different roles in the organization might have unique analysis needs and might be looking for specific insights. A schema-on-write system forces you to foresee how each role will use the data in order to design the database structure, but because business needs and goals now change much more rapidly, analysis requirements, and therefore database structures, have to change accordingly. Increasing the volume of ingested data and updating the structure of a database is not an easy task, and it's expensive. Cloud-based HDFS (Hadoop Distributed File System) storage, typically deployed in a Data Lake system, gives you virtually unlimited storage space.

Reduced data consolidation costs

Merging databases is a complex task, especially if they have different structures, and it requires a significant data modelling effort. Moreover, to reduce the risk of the data model becoming obsolete, you have to forecast the new data sets that you're likely to want to integrate in the future. That's a nearly impossible mission when your data grow continuously.

Reduced Time-to-market

Enlarging and merging databases is time-consuming, while business questions sometimes must be answered swiftly. By the time the data are cleaned, correctly structured and ready to be analysed, it might be too late to draw value out of them. In addition, the quantity of unstructured data that are useful for the analysis might be much larger than the structured data that would ultimately be made available in a data warehouse-type environment. Being able to access the information contained in unstructured data in real time might be critical, for instance, to the success of a marketing goal.

Information sharing is improved and made simpler

The analyses you perform on data can generate results that further qualify your data, thus increasing their intrinsic value. For example, suppose we plan to associate a propensity-to-purchase score with each user profile: in a data warehouse-type structure, the score would be used only within the applications accessing that environment. To make the information available through other applications, we'd have to copy the score into the databases used by those applications, provided that the relevant data models have been updated accordingly. The Data Lake eliminates the need to duplicate information and lets you make the most of your insights, making it quick and easy to share them with anyone who has permission to access them.
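The propensity score example can be sketched as follows. Everything here is an assumption made for illustration: the in-memory lake, the metadata keys, the toy formula (orders divided by visits) and the two consuming applications are hypothetical.

```python
# Hypothetical in-memory lake: every record is tagged with metadata.
lake = [
    {"meta": {"type": "profile", "user_id": 1}, "data": {"visits": 12, "orders": 3}},
    {"meta": {"type": "profile", "user_id": 2}, "data": {"visits": 4, "orders": 0}},
]

def score_profiles(records):
    """Add one propensity record per profile (toy formula: orders / visits)."""
    scores = []
    for record in records:
        if record["meta"]["type"] == "profile":
            data = record["data"]
            scores.append({
                "meta": {"type": "propensity", "user_id": record["meta"]["user_id"]},
                "data": {"score": data["orders"] / data["visits"]},
            })
    # The scores are stored once, next to the data they qualify.
    records.extend(scores)

def read_scores(records):
    """Any application with access reads the same score records; nothing is copied."""
    return {
        record["meta"]["user_id"]: record["data"]["score"]
        for record in records
        if record["meta"]["type"] == "propensity"
    }

score_profiles(lake)
email_tool_view = read_scores(lake)        # e.g. an email campaign tool
web_personalizer_view = read_scores(lake)  # e.g. a content personalization engine
print(email_tool_view)  # {1: 0.25, 2: 0.0}
```

Both applications see the same scores because they query the same store, so there is no second database whose data model has to be updated first.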

Is the Data Lake the ideal solution for all companies?

No. Building a Data Lake is the ideal solution for companies that:

  • need to perform cross-functional analyses of Big Data;
  • have internal processes that are structured to guarantee good data governance;
  • can count on professional staff trained both in the technologies deployed in the platform and in Data Science, or
  • can afford external professional advice in the areas where they feel they lack the necessary skills.

It's true that the main advantage of a Data Lake over a Data Warehouse-type model is that it allows you to store larger quantities of data without structuring them beforehand, and independently of the use you'll make of them. Still, a certain degree of data organization is needed if you want to draw insights out of your data. Since a Data Lake is capable of storing an (almost) unlimited amount of data, access to the data must be regulated carefully, both for obvious privacy reasons and because only experts - typically data scientists and engineers - know how to run queries and extract meaningful information.

Before you can get from the data stored in a Data Lake to a BI report, for instance, or to a content personalization rule, to name just one other possible application, the original data must go through a number of processes that only expert data scientists and programmers can perform while guaranteeing a quality output. In short, precisely because the universe of available data is huge, being able to navigate it is critical to deriving knowledge from it - you can't improvise experience in this field.

In most companies, 80% of data users are "operational": they use reports, they monitor KPIs or they use relatively simple Excel spreadsheets. To address the needs of these users, a Data Warehouse-type system is a more than sufficient resource: it's structured, easy to use and designed to answer specific questions.

Around 10-15% of users perform deeper analyses. They often access source systems in search of raw data that are not available in databases; sometimes they integrate data from additional, external sources. Normally these users generate the reports that are circulated in the organization.

Only a small percentage of users dive deeply into data. They know how to integrate additional data sources and how to normalize and analyse heterogeneous data. In most cases, these users don't even use databases, because they work on data at a much earlier stage, that is, before the data are structured. They ask themselves questions and explore the data to find the answers, discarding hypotheses that the data don't confirm. These users run statistical analyses and are capable of using several analysis techniques, such as predictive modelling.

The Data Lake can be a data source feeding the reports used by the first group of users, or the databases used by the second group, but it can be managed and queried only by expert users, who may not be available - or needed - in every organization.

How do you build a Data Lake?

A Data Lake is a platform combining a number of advanced, complex data storage and data analysis technologies.

To simplify, we might group the components of a Data Lake into four categories, representing the four stages of data management:

  • Data Ingestion and Storage, that is the capability of acquiring data in real time or in batch, and also the capacity to store data and make them accessible. Data can be structured, unstructured or semi-structured, and they're ingested in their native format through a configurable roles system.
  • Data Processing, that is the ability to work with raw data so that they're ready to be analysed through standard processes. It also includes the capability of engineering solutions that extract value from the data, leveraging automated, periodic processes resulting from the analysis operations.
  • Data Analysis, that is the creation of modules that extract insights from data in a systematic manner; this can happen in real time or by means of processes that are run periodically.
  • Data Integration, that is the ability to connect applications to the platform; first and foremost, applications must be able to query the Data Lake and extract the data in the right format, based on the use you want to make of them.
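The four stages above can be pictured as a tiny pipeline. This is a deliberately simplified sketch using only the Python standard library: the function names, the sample payloads and the "average amount" insight are assumptions chosen for illustration, and a real platform would rely on dedicated technologies for each stage.

```python
import json
from statistics import mean

def ingest(lake, raw, source):
    """Ingestion and storage: keep the payload as received, plus a source tag."""
    lake.append({"source": source, "raw": raw})

def process(lake):
    """Processing: parse the raw JSON payloads and skip those that are not valid JSON."""
    parsed = []
    for record in lake:
        try:
            parsed.append(json.loads(record["raw"]))
        except json.JSONDecodeError:
            continue
    return parsed

def analyse(rows):
    """Analysis: compute a simple insight, here the average order amount."""
    return mean(row["amount"] for row in rows)

def integrate(insight):
    """Integration: hand the result to an application in the format it expects."""
    return json.dumps({"average_amount": insight})

lake = []
ingest(lake, '{"amount": 10.0}', source="web_shop")
ingest(lake, '{"amount": 30.0}', source="mobile_app")
ingest(lake, "free-text support ticket", source="helpdesk")

report = integrate(analyse(process(lake)))
print(report)  # {"average_amount": 20.0}
```

Note that the free-text ticket is still stored in the lake even though this particular analysis skips it; another process, such as text mining, could pick it up later.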

There's no universal recipe to build a Data Lake. What matters most is choosing a technology vendor capable of designing the platform architecture on the basis of criteria that are shared and agreed with you after a well-managed analysis phase. The platform should be equipped with the hardware and software components needed to support the four stages illustrated above with the maximum possible efficiency, which means providing the best possible result in the shortest time while saving resources. Reliable automated monitoring processes are a must-have, too.