Understanding Data Catalogs - What are they and why do we need them?
IT and the business have always had a sort of "no man's land" between them. IT is supposed to understand how to work with the data, and the business is supposed to understand what the data represents. But we frequently find that neither knows enough about the data to use it strategically. So both often keep guarding their own pockets of expertise, and every organization deals with this at some level.
Over the years, however, solutions have emerged that can help bridge this gap. These solutions are called data catalogs.
A data catalog is a collection of metadata, coupled with search and data management tools, that helps analysts and other data users find the data they need. It serves as an inventory of the available data, and it also provides the information needed to judge the fitness of that data for its intended uses.
A data catalog can be thought of as a retailer's catalog - the difference being that instead of giving you information about products, it gives you information about the data elements inside an organization.
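To make the "inventory of metadata plus search" idea concrete, here is a minimal sketch in Python. It is not how any particular catalog product works; the class names, fields, and sample datasets (CatalogEntry, DataCatalog, sales_orders, and so on) are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset; the data itself lives elsewhere."""
    name: str
    description: str
    owner: str
    location: str
    tags: list[str] = field(default_factory=list)


class DataCatalog:
    """A toy in-memory catalog: an inventory of entries plus keyword search."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Add (or overwrite) the metadata record for a dataset.
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogEntry]:
        # Case-insensitive match against name, description, and tags.
        needle = keyword.lower()
        return [
            e for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in t.lower() for t in e.tags)
        ]


# Hypothetical sample entries, invented for this example.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="sales_orders",
    description="One row per customer order, refreshed nightly",
    owner="finance-team",
    location="lake/raw/sales/orders",
    tags=["sales", "orders"],
))
catalog.register(CatalogEntry(
    name="web_clicks",
    description="Raw clickstream events from the website",
    owner="marketing-team",
    location="lake/raw/web/clicks",
    tags=["web", "clickstream"],
))

matches = catalog.search("orders")
print([e.name for e in matches])
```

Just like a retailer's catalog, the entry tells you what the item is, who to ask about it, and where to get it, without containing the item itself.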
Data catalogs have been around for a long time, providing a place for people to find where the data was and how exactly it was used. Back then, however, the process was very manual, so organizations would often give up on the task altogether. Data lakes and more advanced automation from data catalog vendors have now reignited interest in the data catalog.
Now let's understand what a data lake is. A data lake is a storage repository that holds an enormous amount of raw data, kept in its native format until it is needed. A data lake uses a flat architecture to store the data, in contrast to a hierarchical data warehouse, where data is stored in files or folders.
The reason behind this is that the mission of a data lake is to be an open door for all of the organization's data - so you want data onboarding to be as easy as possible. But the downside is that nobody understands everything that's in the data lake.
Organizations that deployed a data lake later found themselves with a huge volume of data and no context for what that data represented, which made data catalogs necessary.
Data catalogs can improve data clarity, speed and accuracy in several ways:
Clarity: Everything that is required to understand the data is stored and managed from the start. Thus, as people use their data catalog, the data’s context deepens and its meaning becomes clearer.
Speed: By organizing data and analysis in a discoverable and business-friendly way, people can find what they need much faster.
Accuracy: By using a premier data catalog solution, a wider array of people can validate, improve, and correct data and analysis.
In the absence of a catalog, analysts look for data by searching through documentation, talking to colleagues, relying on tribal knowledge, or just working with familiar datasets because those are the ones they know about. The process is filled with trial and error, waste and rework, and repeated dataset searching, which more often than not ends in working with “close enough” data because time is running out!
When analysts have a data catalog, they can search and find data much faster, see all of the available datasets, evaluate them, and make informed decisions about which data to use. They can then perform data preparation and analysis much more efficiently and with greater confidence. This has often led to a shift from 80% of time spent finding the data and only 20% on analysis, to 20% of time spent finding and preparing the data and 80% on analysis. The quality of analysis thus improves substantially, and organizational analysis capacity increases without adding more analysts.
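The capacity claim follows from simple arithmetic, sketched below in Python. The team size and weekly hours are hypothetical numbers picked for illustration; only the 20% and 80% shares come from the text above.

```python
def analysis_hours(analysts: int, hours_per_week: int, analysis_pct: int) -> float:
    """Weekly hours a team actually spends on analysis."""
    return analysts * hours_per_week * analysis_pct / 100


# Hypothetical team: 5 analysts working 40 hours a week each.
before = analysis_hours(analysts=5, hours_per_week=40, analysis_pct=20)
after = analysis_hours(analysts=5, hours_per_week=40, analysis_pct=80)

print(before)          # 40.0 hours of analysis per week without a catalog
print(after)           # 160.0 hours of analysis per week with a catalog
print(after / before)  # 4.0, i.e. four times the capacity, same headcount
```

Under these assumptions the same five analysts deliver four times as many hours of analysis, which is what "capacity increases without adding more analysts" means in practice.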
In the age of big data, data lakes, and self-service, managing data well is essential to reducing the time spent finding and preparing data, so that the time can go to analysing it instead. Data catalogs help us achieve this goal!