A Data Lake is a new working method that simplifies and enhances your ability to store, manage and analyse Big Data in its native or near-native format, even when it comes from heterogeneous sources.
It's a new working method because the systems used so far to store, process and analyse data - following a so-called "Data Warehouse" architecture - are structured, and therefore constrained, by the end use you plan to make of the data.
By adopting a Data Lake architecture, you dramatically reduce storage costs and at the same time get virtually unlimited storage space, thus reducing the cost of data consolidation and simplifying your information sharing processes.
Neodata has developed a methodology that standardizes the procedures for integrating new data sources into the Data Lake, reducing both the time needed to onboard new sources and the risk of bugs.
The analysis stage, based on a Hadoop environment, lets you explore the available data with tools such as HUE, Hive, Redshift and MySQL, and develop machine learning models, for example by connecting the software applications your Data Scientists already use.
We start with the analysis of your strategic objectives, then we define data governance processes and go through the structure of your team, your data sources and your information flows. Only after we've gained a clear picture of your starting point do we design the Data Lake architecture, step by step, side by side with you.
Access to the Data Lake is regulated by a roles and permissions system; roles and permissions are configured by the Administrator and can be personalized to allow multiple administration levels.
A Data Lake is the ideal solution for companies that:
Neodata's Data Lake
A Data Lake is a platform combining a number of advanced, complex data storage and data analysis technologies.
To simplify, we might group the components of a Data Lake into four categories, representing the four stages of data management: ingestion and storage, processing, analysis, and export towards external systems.
There's no universal recipe for building a Data Lake. What you should care about most is choosing a technology vendor capable of designing the platform architecture on the basis of criteria shared and agreed with you after a well-managed analysis phase. The platform should be equipped with the hardware and software components necessary to support the four stages illustrated above while granting the maximum possible efficiency - that is, providing the best possible result, in the shortest time, while saving resources. Reliable automated monitoring processes are a must-have, too.
Our Data Lake can ingest and store any type of data: structured, unstructured or semi-structured. It is compatible with any ingestion/upload tool that can store data on S3 or on an SFTP server; in particular, it is compatible with Flume and Sqoop.
Streaming is managed through Kinesis and it is compatible with Kafka.
At this stage, data is stored in its native, raw format.
In order to guarantee security and privacy standards, access to the Data Lake is managed through a system based on roles, where every permission is configurable at Admin level.
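The role-based access model described above can be sketched in a few lines. This is a minimal illustration, assuming hypothetical role and permission names; in the actual platform, roles and permissions are configured by the Administrator.

```python
# Minimal sketch of role-based access control; the role and permission
# names below are hypothetical examples, not the platform's actual ones.

ROLES = {
    "admin":   {"ingest", "process", "analyse", "export", "configure"},
    "analyst": {"analyse", "export"},
    "viewer":  {"analyse"},
}

def can(role: str, permission: str) -> bool:
    """Return True if the given role grants the requested permission."""
    return permission in ROLES.get(role, set())

print(can("analyst", "export"))    # True
print(can("viewer", "configure"))  # False
```

Because permissions are plain data rather than code, the same check supports any number of administration levels: adding a level is just adding an entry to the configuration.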
Once data is uploaded into S3, a number of processes are triggered that aim at organizing data and making it available for subsequent analysis.
Typical processes are:
- format conversions to support better-performing solutions (such as Parquet or Avro);
- data parsing to extract specific entities (from a .json file, for example);
- operations on one or more data fields (such as a change in the date format);
- data insertion into SQL or NoSQL structures (e.g. Hive, HBase, Redshift, MySQL);
- data enrichment (e.g. adding information about a specific entity by means of inference or matching with other data sources).
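Two of the steps above, JSON parsing and a date-format change, can be illustrated with a short sketch. The record structure and field names are invented for the example; the real flows run at scale on the engines listed below.

```python
# Illustrative sketch of two typical processing steps: parsing a raw
# JSON record and normalising a date field. Field names are hypothetical.
import json
from datetime import datetime

raw = '{"user": "u123", "event": "click", "ts": "03/11/2024"}'

record = json.loads(raw)  # parse the raw JSON into a dict
record["ts"] = datetime.strptime(
    record["ts"], "%d/%m/%Y"          # source format: day/month/year
).strftime("%Y-%m-%d")                # target format: ISO 8601

print(record["ts"])  # 2024-11-03
```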
The processing operations rely on Hadoop (HDFS), S3 or Kinesis, and use MapReduce, Tez, Spark and ElasticSearch, case by case. We use several languages and frameworks, such as Pig, Hive, Java, Python and Lambda functions. Processes are therefore highly scalable with regard to both the volume and the nature of the data.
The processing operations run automatically thanks to the use of workflow schedulers (Oozie); the process frequency can be set independently for each data source and can vary from one hour to one month. Streaming processes, of course, happen in real time.
Two types of operations fall into this stage: data exploration and knowledge extraction.
Data exploration is supported by tools such as HUE, Hive, Athena, Redshift or even MySQL and Tableau, depending on where the data is stored after the conclusion of the processing phase, and on the size of the available data.
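The query pattern behind this kind of exploration is plain SQL, whichever engine runs it. As a self-contained illustration, the sketch below uses Python's built-in SQLite rather than Hive, Athena or Redshift, and an invented `events` table.

```python
# Hedged illustration of SQL-based data exploration. The platform uses
# engines such as Hive, Athena or Redshift; SQLite stands in here so the
# example is self-contained. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "click"), ("u1", "view"), ("u2", "click")],
)

# A typical exploratory aggregation: how many events of each type?
rows = conn.execute(
    "SELECT event, COUNT(*) FROM events GROUP BY event ORDER BY event"
).fetchall()
print(rows)  # [('click', 2), ('view', 1)]
```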
The tools typically used for knowledge extraction are familiar to Data Scientists: Weka, Python, R, Spark and, in general, the Hadoop environment. All the tools deployed during the data exploration phase remain available as well. This is the stage where, for instance, machine learning models are built, or where we define the structure of the hypercubes that will be integrated with other applications, such as BI platforms or data visualization software.
The results of the analysis operations performed at this stage are then fed back into the processing flows, so that the knowledge gathered through the analysis is exploited automatically and without interruptions. Depending on the circumstances, this translates into updated visualization dashboards, new reports, an alarm, and so on.
The knowledge managed through the Data Lake is used by a number of external systems, which feed critical business processes.
The data export flows include connectors towards the DMPs in use (including exaudi, Neodata's DMP) and DSPs/SSPs. The insights deriving from the analysis phase are made visible through a reporting system, integrated into the platform, which manages access through a roles and permissions system designed, naturally, according to the Client's policies. At the same time, data can be accessed via the most popular BI platforms (Tableau, Qlik, …) for further analyses and deep dives.
An API-based SDK also allows access to the Data Lake at the application level.
Keeping track of all the processes managed through the Data Lake is a critical and complex task. For this reason, the Data Lake integrates an automated monitoring system that verifies processes continuously: all the data sources must be acquired as planned and all the processes must run regularly. If anomalies are detected, for example a process taking longer than expected to complete, specific alerts are issued and sent to a predefined distribution list. Thresholds and parameters, as well as the recipients of the alerts, can be personalized for each data source and/or process.
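The threshold check at the heart of such monitoring is simple; the complexity lies in running it continuously across every source. A minimal sketch, with hypothetical process names and thresholds:

```python
# Minimal sketch of threshold-based process monitoring. Process names,
# durations and thresholds are hypothetical placeholders; in the real
# system alerts go to a configurable distribution list.
alerts = []

def check_process(name: str, duration_min: float, threshold_min: float) -> None:
    """Record an alert if a process exceeds its expected duration."""
    if duration_min > threshold_min:
        alerts.append(
            f"ALERT: {name} ran for {duration_min} min "
            f"(threshold {threshold_min} min)"
        )

check_process("ingest_clickstream", duration_min=95, threshold_min=60)
check_process("daily_report", duration_min=20, threshold_min=30)
print(alerts)  # one alert, for ingest_clickstream only
```

Because the threshold is a per-call parameter, each data source or process can carry its own limit, which matches the per-source personalization described above.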
Would you like to analyse your available data in more detail, with a view to gathering insights that can better inform your strategic decisions? Would you like to fully leverage your data and adopt ad hoc reports to measure your specific KPIs?
If you don't have a team of Data Scientists in your organization, or if you simply need some support, we're here for you. Neodata's core nature is Data Science and our team of experts is at your disposal, either on demand or for longer term projects at your premises.
Our objective is to make your life easier and make sure you can manage analysis independently if that's your wish, so we'll provide you with models that you can use and share without our intervention, and we'll arrange training sessions on demand.
Our working method when designing a Data Lake includes a starting phase, where we identify the Client's strategic objectives and define a governance plan following the principles of privacy-by-design.
After that, we perform an as-is analysis aimed at mapping all existing processes, the organizational structure of the teams that will be involved in the project, communication protocols and flows, and the policies that define the evolution of the company's technology stack.
Then we enter the design phase, where we transform key guidelines into a clear definition of the system functionalities, the information flows, the dynamic data model and the IT equipment needed. We identify the professional roles in charge of supporting the project and put at your disposal Project Managers, Business Consultants, System Architects, Data Scientists, Data Visualization Specialists and technical support.