
Data Lake

“It’s a new working method that saves time and costs and helps to share information quickly.”

You acquire data in its native format and you use it as you like

If you feel you cannot know today which data you’ll need tomorrow, the Data Lake is for you


A Data Lake is a new working method that simplifies and enhances your ability to store, manage and analyse Big Data in its native or near-native format, even when it comes from heterogeneous sources.

It's a new working method because the systems used so far to store, process and analyse data - built around the so-called "Data Warehouse" architecture - are rigidly structured and inherently limited by the end use you plan to make of the data.

By adopting a Data Lake architecture, you dramatically reduce storage costs and at the same time get virtually unlimited storage space, thus reducing the cost of data consolidation and simplifying your information sharing processes.

Contact us

Main features

Data are integrated in any format and processing flows are designed to fit your needs

Compatible with any ingestion/upload system 

Neodata has developed a methodology that standardizes the procedures for integrating new data sources into the Data Lake, reducing both the time needed to onboard new sources and the risk of bugs.

Analysis is integrated into the processing flows

The analysis stage, based on a Hadoop environment, allows you to explore the available data with tools such as HUE, Hive, Redshift and MySQL, and to develop machine learning models, for example by connecting the software applications your Data Scientists already use.

The results of the analysis operations performed at this stage are then fed back into the processing flows, so that the knowledge gathered through analysis is exploited automatically and without interruption.

An architecture designed together with you

We start with the analysis of your strategic objectives, then we define data governance processes and go through the structure of your team, your data sources and your information flows. Only after we've gained a clear picture of your starting point do we design the Data Lake architecture, step by step, side by side with you.

Data Security and Privacy are guaranteed

Access to the Data Lake is regulated by a roles and permissions system; roles and permissions are configured by the Administrator and can be personalized to allow multiple administration levels.



What it does

If you have a good governance process, you can do virtually anything


A Data Lake is the ideal solution for companies that:

  • need to perform cross-functional analyses of Big Data;
  • have internal processes structured to guarantee good data governance;
  • can count on professional staff trained in the technologies deployed in the platform as well as in Data Science, or
  • can afford to get external professional advice in the areas where they feel they lack the necessary skills.

How it works

Ingestion, Processing, Analysis, Integration



Main features

A Data Lake is a platform combining a number of advanced, complex data storage and data analysis technologies.

To simplify, we might group the components of a Data Lake into four categories, representing the four stages of data management:

  • Data Ingestion and Storage, that is, the capability to acquire data in real time or in batches, and to store it and make it accessible. Data can be structured, semi-structured or unstructured, and is ingested in its native format through a configurable roles system.
  • Data Processing, that is, the ability to work on raw data so that it is ready to be analysed through standard processes. It also includes the capability to engineer solutions that extract value from the data, leveraging automated, periodic processes derived from the analysis operations.
  • Data Analysis, that is, the creation of modules that extract insights from data in a systematic manner; this can happen in real time or through processes that run periodically.
  • Data Integration, that is, the ability to connect applications to the platform; first and foremost, applications must be able to query the Data Lake and extract data in the right format, based on the use you want to make of it.

There's no universal recipe for building a Data Lake. What you should care about most is choosing a technology vendor capable of designing the platform architecture on the basis of criteria shared and agreed with you after a well-managed analysis phase. The platform should be equipped with the hardware and software components necessary to support the four stages illustrated above while guaranteeing maximum efficiency, that is, providing the best possible result in the shortest time while saving resources. Reliable automated monitoring processes are a must-have, too.
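
To make these four stages more concrete, here is a minimal, purely illustrative Python sketch of how they might be orchestrated in sequence; the function names, paths and S3 layout are assumptions for illustration, not a description of any specific platform.

```python
# Illustrative only: the four stages of Data Lake data management as function stubs.
# All names and paths are hypothetical.

def ingest(source_uri: str, raw_zone: str) -> str:
    """Acquire data in its native format (batch or streaming) and land it in the raw zone."""
    ...

def process(raw_path: str, curated_zone: str) -> str:
    """Turn raw data into analysis-ready datasets (e.g. Parquet tables)."""
    ...

def analyse(curated_path: str) -> dict:
    """Extract insights systematically: exploration queries, machine learning models, hypercubes."""
    ...

def integrate(insights: dict) -> None:
    """Expose results to external applications: DMPs, BI tools, reports, APIs."""
    ...

if __name__ == "__main__":
    raw = ingest("sftp://partner/export.csv", "s3://datalake/raw/")
    curated = process(raw, "s3://datalake/curated/")
    insights = analyse(curated)
    integrate(insights)
```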


Neodata's Data Lake

Ingestion and Storage

Our Data Lake can ingest and store any type of data: structured, unstructured or semi-structured. It is compatible with any ingestion/upload tool that can store data on S3 or on an SFTP server. In particular, it is compatible with Flume and Sqoop.

Streaming is managed through Kinesis and it is compatible with Kafka.

At this stage, data is stored in its native, raw format.
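
As a minimal sketch of what batch and streaming ingestion can look like on this kind of stack, the snippet below uses boto3 to land a raw file on S3 and to push a record to a Kinesis stream; the bucket, stream and key names are invented for illustration.

```python
import json
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

# Batch ingestion: land a file in the raw zone in its native format.
# Bucket and key names are hypothetical.
s3.upload_file("daily_export.json", "my-datalake-raw", "crm/2024-01-01/daily_export.json")

# Streaming ingestion: push a single event to a Kinesis stream.
event = {"user_id": "123", "action": "page_view", "ts": "2024-01-01T10:00:00Z"}
kinesis.put_record(
    StreamName="my-datalake-events",   # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```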

Neodata has developed a methodology that standardizes the procedures for integrating new data sources into the Data Lake, reducing both the time needed to onboard new sources and the risk of bugs.

In order to guarantee security and privacy standards, access to the Data Lake is managed through a role-based system, where every permission is configurable at Admin level.
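
As a purely illustrative example of per-role access on S3-backed storage, the sketch below creates an IAM-style read-only policy scoped to one prefix; the policy document, bucket and names are assumptions, not the schema of the platform's own role system.

```python
import json
import boto3

# Hypothetical read-only policy for an "analyst" role, limited to one data source prefix.
analyst_read_only = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-datalake-raw",        # hypothetical bucket
            "arn:aws:s3:::my-datalake-raw/crm/*",  # access limited to the CRM prefix
        ],
    }],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="datalake-analyst-read-only",
    PolicyDocument=json.dumps(analyst_read_only),
)
```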

Processing

Once data is uploaded into S3, a number of processes are triggered that aim at organizing data and making it available for subsequent analysis.

Typical processes include:

  • changes of format to support better-performing solutions (such as Parquet or Avro);
  • data parsing to extract specific entities (from a .json file, for example);
  • operations on one or more data fields (such as a change in the date format);
  • data insertion into SQL or NoSQL structures (e.g. Hive, HBase, Redshift, MySQL);
  • data enrichment (e.g. the addition of information about a specific entity by means of inference or matching with other data sources).
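
As a hedged illustration of this kind of processing, the PySpark sketch below parses raw JSON, normalizes a date field, writes the result as Parquet and registers it as a Hive table; the paths, column names and table name are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

# Hive support so the curated table is queryable from Hive/HUE afterwards.
spark = SparkSession.builder.appName("curate-events").enableHiveSupport().getOrCreate()

# Parse raw JSON landed in the raw zone (hypothetical path and schema).
raw = spark.read.json("s3://my-datalake-raw/crm/2024-01-01/")

curated = (
    raw
    # Operation on a field: normalize the timestamp into a date.
    .withColumn("event_date", to_date(col("ts"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
    # Keep only the entities needed downstream.
    .select("user_id", "action", "event_date")
)

# Store in a better-performing columnar format and expose it as a Hive table.
curated.write.mode("overwrite").format("parquet").saveAsTable("curated.events")
```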

The processing operations rely upon Hadoop (HDFS), S3 or Kinesis, and use MapReduce, Tez, Spark and Elasticsearch as appropriate. We use several languages and tools, such as Pig, Hive, Java, Python and Lambda functions. Processes are therefore highly scalable with regard to both the volume and the nature of the data.

The processing operations are run automatically thanks to the use of workflow schedulers (Oozie); the process frequency can be set independently for each data source and can vary from one hour to one month. Streaming processes, of course, happen in real time.

Analysis

Two types of operation take place at this stage: data exploration and knowledge extraction.

Data exploration is supported by tools such as HUE, Hive, Athena, Redshift or even MySQL and Tableau, depending on where the data is stored after the conclusion of the processing phase, and on the size of the available data.

The typical tools used for knowledge extraction are usually familiar to Data Scientists: Weka, Python, R, Spark and, in general, the Hadoop environment. However, all the tools deployed during the data exploration phase remain available. This is the stage where, for instance, machine learning models are built, or where we define the structure of the hypercubes that will be integrated with other applications, such as BI platforms or data visualization software.
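
A minimal sketch of these two activities might look as follows, assuming the curated data is exposed through Hive and that a churn-like label exists; the connection details, table and column names are invented for illustration.

```python
import pandas as pd
from pyhive import hive                       # Hive client used for exploration
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Data exploration: query a curated table produced by the processing flows.
conn = hive.Connection(host="datalake-master", port=10000, username="analyst")
df = pd.read_sql("SELECT user_id, n_sessions, n_purchases, churned FROM curated.users", conn)

# Knowledge extraction: fit a simple churn model on the explored data.
X = df[["n_sessions", "n_purchases"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
```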

The results of the analysis operations performed at this stage are then fed back into the processing flows, so that the knowledge gathered through analysis is exploited automatically and without interruption. Depending on the circumstances, this translates into updated visualization dashboards, newly created reports, an alarm being raised, and so on.

Integration

The knowledge managed through the Data Lake is used by a number of external systems, which feed critical business processes.

The data export flows include connectors towards the DMPs in use (including exaudi, Neodata's DMP) and towards DSPs/SSPs. The insights deriving from the analysis phase are made visible through a reporting system, which is integrated into the platform and manages access through a roles and permissions system designed, of course, according to the Client's policies. At the same time, data can be accessed via the most popular BI platforms (Tableau, Qlik, …) for further analyses and deep dives.

An API-based SDK also allows access to the Data Lake at the application level.
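
As a purely hypothetical sketch of application-level access, assuming the API is exposed over HTTP with token authentication; the endpoint, parameters and response shape are inventions for illustration and may differ from the actual SDK.

```python
import requests

# Hypothetical endpoint and token: the real API exposed by the SDK may differ.
BASE_URL = "https://datalake.example.com/api/v1"
TOKEN = "replace-with-your-api-token"

response = requests.get(
    f"{BASE_URL}/segments/high-value-users",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"format": "json", "limit": 100},
    timeout=30,
)
response.raise_for_status()

for user in response.json().get("items", []):
    print(user["user_id"], user.get("score"))
```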

Monitoring

Keeping track of all the processes managed through the Data Lake is a critical and complex task. For this reason, the Data Lake integrates an automated monitoring system that verifies processes continuously: all the data sources must be acquired as planned and all the processes must run regularly. If anomalies are detected - for example, if a process is taking longer than expected to complete - specific alerts are issued and sent to a predefined distribution list. Thresholds and parameters, as well as the recipients of the alerts, can be customized for each data source and/or process.
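
The alerting logic can be sketched roughly as follows, with per-source thresholds and an e-mail notification; the threshold values, source names and SMTP settings are assumptions, not the platform's actual configuration.

```python
import smtplib
import time
from email.message import EmailMessage

# Hypothetical per-source thresholds (seconds) and alert recipients.
THRESHOLDS = {"crm_daily_load": 3600, "clickstream": 300}
RECIPIENTS = {"crm_daily_load": ["data-team@example.com"], "clickstream": ["ops@example.com"]}

def check_process(source: str, started_at: float) -> None:
    """Send an alert if a process has been running longer than its threshold."""
    elapsed = time.time() - started_at
    if elapsed > THRESHOLDS[source]:
        msg = EmailMessage()
        msg["Subject"] = f"[Data Lake] {source} running for {int(elapsed)}s (threshold {THRESHOLDS[source]}s)"
        msg["From"] = "monitoring@example.com"
        msg["To"] = ", ".join(RECIPIENTS[source])
        msg.set_content(f"The process '{source}' exceeded its expected duration.")
        with smtplib.SMTP("smtp.example.com") as server:   # hypothetical SMTP host
            server.send_message(msg)
```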


Additional services

Leveraging your data to its fullest is not a problem anymore

Analysis / Data Science

Would you like to analyse your available data in more detail, with a view to gathering insights that can better inform your strategic decisions? Would you like to fully leverage your data and adopt ad hoc reports to measure your specific KPIs?

If you don't have a team of Data Scientists in your organization, or if you simply need some support, we're here for you. Data Science is at the core of what Neodata does, and our team of experts is at your disposal, either on demand or for longer-term projects at your premises.

Our objective is to make your life easier and make sure you can manage analysis independently if that's your wish, so we'll provide you with models that you can use and share without our intervention, and we'll arrange training sessions on demand.


Design, system integration and support

Our working method when designing a Data Lake includes a starting phase, where we identify the Client's strategic objectives and define a governance plan following the principles of privacy-by-design.

After that, we perform an as-is analysis aimed at mapping all existing processes, the organizational structure of the teams that will be involved in the project, the communication protocols and flows, and the policies that define the evolution of the company's technology stack.

Then we enter the design phase, where we transform the key guidelines into a clear definition of the system functionalities, the information flows, the dynamic data model and the IT equipment needed. We identify the professional roles in charge of supporting the project and put at your disposal Project Managers, Business Consultants, System Architects, Data Scientists, Data Visualization Specialists and technical support.