6 Management Recommendations For Data Lake Projects

There are several reasons for creating Data Lake, Big Data, and Data Analytics projects in the industry. Chief among them is that these projects make it possible to put data-driven decision making into practice and to automate intelligent decisions with Artificial Intelligence algorithms.

It is worth remembering, however, that building large databases is in itself a great technical challenge. In addition, people, processes, and the business need to be aligned so that the long-awaited Data Lake does not become a Data Swamp.

In this article, we present some points of attention for managers, directors of information technology, and CIOs involved in this high-risk process, which usually comes with a high price tag.

Points Of Attention In Data Lake Projects

The idea of a data lake is indeed very interesting, and its strategic importance shows in the medium and long term. Below are some managerial (non-technological) tips related to building and structuring a Data Lake.

1. Structure The Data – Meaning And Metadata

After carrying out several types of projects related to Data Lakes, we came to some interesting conclusions that we detail below.

The main factor separating success from failure in data lake initiatives was the design of the analyses, which was often incomplete or even ambiguous. This led us to create, register, and publish the Analytics Business Canvas, which aims to extract the real meaning of each analytical effort.

Although the “Data Lake” concept says that data can be saved as is, starting projects by saving data without a clear business strategy is not a good idea. Having senior team members also goes a long way toward mitigating this type of risk.

The great success of analytics projects usually lies in the strategy for using data in the face of business opportunities, not necessarily in the technology involved. The focus should be on the motivations, the “WHYs”, and only then on the “HOWs”. With good motivations, even the “HOWs” become easier to answer.

Beyond the meaning of business processes, the systematic use of metadata (information about the data) is very important. A useful tip for those who are starting to organize their analytics area and data lakes is to begin by structuring the data dictionaries.
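As an illustration of what a data dictionary can look like in practice, here is a minimal sketch in Python. The field names (`source_system`, `owner`, `nature`), the example columns, and the helper `undocumented_columns` are all hypothetical choices made for this example, not a standard or a tool used by the authors.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ColumnEntry:
    """One data dictionary entry (hypothetical fields, for illustration)."""
    name: str
    dtype: str
    description: str
    source_system: str
    owner: str
    nature: str  # "transactional" or "analytical"


def undocumented_columns(dataset_columns: Iterable[str],
                         dictionary: Iterable[ColumnEntry]) -> List[str]:
    """Return columns that exist in a dataset but have no dictionary entry."""
    documented = {entry.name for entry in dictionary}
    return sorted(set(dataset_columns) - documented)


# Example dictionary with two documented columns (invented for this sketch).
DICTIONARY = [
    ColumnEntry("order_id", "string", "Unique order identifier",
                "erp", "sales-ops", "transactional"),
    ColumnEntry("order_total", "decimal", "Order value, taxes included",
                "erp", "sales-ops", "transactional"),
]

missing = undocumented_columns(
    ["order_id", "order_total", "churn_score"], DICTIONARY
)
print(missing)  # columns that still need a meaning, an owner, and a source
```

A check like this can run whenever a new dataset lands in the lake, so that undocumented columns are flagged before analysts start depending on them.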

It is essential to understand the difference between the nature of transactional and analytical data and their roles/expectations in the project.

2. Choosing The Right Technological Stack

Although technology is the second step in structuring data lakes, it is one of the most important decisions to be made in the project. The keyword for this process is “Systems architecture”.

The choice of the tech stack for the creation of the data lake must be aligned with both the business problem and the technical knowledge of the operations team.

At this point, to design the architecture of the solution(s), we recommend professionals with experience in software engineering, databases, the administration and creation of ETL processes, and the scalability of storage infrastructures.

To keep the analytical technology stack from falling into disuse, we highly recommend ensuring a high level of interoperability between the systems.

3. & 4. Take Care Of The Under/Overestimation Of The Data Volume

As in the planning and construction of a house, data lake projects need a minimum of information to be structured correctly. However, this minimum information is often not clear to either the business team or the system architects.

We have already seen cases in which an immense data set (far beyond what was really needed) was envisioned to investigate patterns of behavior in a specific industry.

Over time, it was found that small adjustments to the performance indicator strategy, combined with sampling techniques, solved more than 80% of the analytical problems with elegance and precision.
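To make the idea of sampling concrete, the sketch below compares an indicator computed on a full dataset with the same indicator computed on a 1% simple random sample. The data is synthetic and the sizes (200,000 records, 2,000 sampled) are assumptions chosen for the example. It does not reproduce the projects described here, and it only illustrates why a well-drawn sample is often enough for an average-type indicator.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic "full" dataset: 200,000 order values (hypothetical, not real data).
population = [rng.gauss(100.0, 20.0) for _ in range(200_000)]

# Simple random sample of 1% of the records, drawn without replacement.
sample = rng.sample(population, k=2_000)

full_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
relative_error = abs(sample_mean - full_mean) / full_mean

print(f"full mean:      {full_mean:.2f}")
print(f"sample mean:    {sample_mean:.2f}")
print(f"relative error: {relative_error:.2%}")
```

Whether a sample is sufficient depends on the question: rare events, long-tailed distributions, and per-customer analyses may still need the full data.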

The tip is to question the different actors involved in the project to understand the nature of the problem and the questions being asked, and only then look at the internal and external data. Just as it is possible to overestimate the need for data, it is also possible to underestimate it.

There are innovations coming from other areas, with special emphasis on IoT projects, which by their nature are based on obtaining as much data as possible from sensors. This has implications for storage strategies, compression, types of analysis, security, and even transmission speed.

On the same subject, we have previously commented on the conceptual differences between sampling and data clipping.

Another form of underestimation is the combinatorial explosion of records, which in some cases makes processing and/or storage computationally unfeasible. Appropriate technical measures are therefore required for each case.
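A quick back-of-the-envelope calculation shows how fast record combinations grow. The sketch below counts the number of distinct record pairs for a few dataset sizes. The 16 bytes per pair is an assumption (two 8-byte identifiers) used only to give a sense of scale.

```python
import math


def pair_count(n_records: int) -> int:
    """Number of distinct unordered pairs among n_records records."""
    return math.comb(n_records, 2)


BYTES_PER_PAIR = 16  # assumption: two 8-byte record identifiers per pair

for n in (1_000, 100_000, 1_000_000):
    pairs = pair_count(n)
    terabytes = pairs * BYTES_PER_PAIR / 1e12
    print(f"{n:>9,} records -> {pairs:>15,} pairs (~{terabytes:.3f} TB)")
```

Under these assumptions, one million records already produce about 5 × 10^11 pairs, or roughly 8 TB just to store the identifiers, which is why comparisons of every record against every other usually call for blocking, filtering, or sampling first.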

5. Analyze The Need To Use Indexes

Indexes in the databases must be created in a well-structured way, not uncontrollably. The point of attention here is the inappropriate and/or excessive use of indexes.

The use of indexes in databases is a good practice that aims to increase the efficiency of very frequent queries. It allows the database management system (DBMS) to perform searches with lower complexity, avoiding costly sequential scans. However, indexes take up space, and an index can very easily reach 25% of the size of a table.
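The effect of an index on how a query is executed can be seen with a small experiment. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a convenient stand-in for a DBMS. The table `events`, its columns, and the index name `idx_events_user` are hypothetical, and the exact wording of the query plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
)
# Load 10,000 synthetic rows spread over 1,000 hypothetical users.
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    ((i % 1000, f"event-{i}") for i in range(10_000)),
)

QUERY = "SELECT * FROM events WHERE user_id = 42"


def query_plan(connection: sqlite3.Connection, sql: str) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


plan_before = query_plan(conn, QUERY)  # expected to be a full table scan
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_after = query_plan(conn, QUERY)   # expected to search via the index

print("without index:", plan_before)
print("with index:   ", plan_after)
```

The index pays off only if this kind of lookup is frequent. Every index also has to be stored and kept up to date on each write, which is the cost to weigh against the faster reads.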

In data lakes, access is usually not repetitive and high-performance queries are not necessary. Therefore, using indexes beyond the primary keys needed to establish the relationships between entities can create unnecessary volume in pursuit of efficiency that is not needed.

Remember that in books the index is smaller than the content itself.

6. Maintain Information Security

It is clear that where there is valuable information there are also security risks. Security requires mature permission structures that, on the one hand, allow quick and easy access for analysts and analytics machines and, on the other, enforce the access rules that protect the confidentiality of certain information.

The most advanced data governance solutions we know handle identity management with mastery, preventing users from using other people's credentials. The software engineering team must be in constant communication with the management and business teams to ensure the correct level of permission for each user and each dataset.
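The idea of one permission level per user and per dataset can be sketched as a simple lookup that denies by default. The user names, dataset names, and the `GRANTS` table below are invented for illustration. Real data governance tools provide far richer models (roles, groups, column-level rules, auditing) than this sketch.

```python
from typing import Dict, Set, Tuple

Grant = Tuple[str, str]  # (user, dataset)

# Hypothetical grants agreed between engineering, management, and business.
GRANTS: Dict[Grant, Set[str]] = {
    ("analyst_ana", "sales_orders"): {"read"},
    ("etl_service", "sales_orders"): {"read", "write"},
    ("etl_service", "customer_pii"): {"read", "write"},
}


def is_allowed(user: str, dataset: str, action: str) -> bool:
    """Deny by default: allow only actions explicitly granted to this user."""
    return action in GRANTS.get((user, dataset), set())


print(is_allowed("analyst_ana", "sales_orders", "read"))   # granted
print(is_allowed("analyst_ana", "customer_pii", "read"))   # no grant exists
```

Because each grant is tied to an individual identity, an analyst who needs confidential data has to request their own access rather than borrow someone else's.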
