We live in an era where “Googling” is the everyday way to find almost any information we want. As data grows exponentially, enterprises need an equivalent way of “Googling Data Assets.” A recent generation of data catalog tools has built-in capabilities that may finally fulfill that wish.
What is a Data Catalog?
The data catalog has a long and unsustainable history of implementation in documentation, SharePoint, and other legacy data catalog tools. Two reasons these efforts failed are that data assets were not easy to search and that the catalog quickly fell out of sync with the data, which prevented collaboration. Data cataloging is a key component of modern analytics platforms because it enables self-service discovery of information about the data. It provides the means to explore an enterprise’s data assets in a secure and governed fashion. Data catalogs can also be of enormous assistance in regulatory compliance initiatives like Sarbanes-Oxley, because they help track transactions: for example, who entered a specific transaction or when a transaction was modified in the system.
A data cataloging solution should leverage ratings, tagging, comments, and the ability to collaborate. Furthermore, solutions driven by high-quality metadata can provide contextual recommendations on data set relationships and data lineage, which helps deliver analytics with greater context and higher confidence in the quality of the data behind each metric.
The Key Components of the Data Catalog
Semantic Search
Search is a key feature of a data catalog. It should work like “Googling”: users should be able to find the data they need and ask questions about it, such as “Where is the customer name used?” or “How is our EBITDA calculated?”
Data catalogs utilize semantic search, which goes beyond keyword search to understand the searcher’s intent and contextual meaning. Data catalog tools utilize ontologies, taxonomies, knowledge graphs, and other technology to ensure that all relevant data is returned from a search.
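To make the difference from keyword search concrete, here is a minimal sketch of one ingredient, query expansion through a small taxonomy of related terms. The glossary, the asset names, and the `expand` and `search` functions are all hypothetical, and real catalog tools rely on far richer ontologies and knowledge graphs than this.

```python
# Hypothetical business glossary: each term maps to related terms.
GLOSSARY = {
    "customer": {"client", "account holder"},
    "revenue": {"sales", "turnover"},
}

# Hypothetical catalog entries: asset name -> description.
ASSETS = {
    "crm.client_master": "Master list of client names and addresses",
    "fin.sales_daily": "Daily sales totals by region",
    "hr.employee": "Employee roster",
}


def expand(query: str) -> set[str]:
    """Return the query term plus any related terms from the glossary."""
    term = query.lower().strip()
    return {term} | GLOSSARY.get(term, set())


def search(query: str) -> list[str]:
    """Return assets whose name or description mentions any expanded term."""
    terms = expand(query)
    hits = []
    for name, description in ASSETS.items():
        text = f"{name} {description}".lower()
        if any(term in text for term in terms):
            hits.append(name)
    return sorted(hits)
```

With this sample data, a plain keyword search for "customer" would find nothing, because no asset contains that word. The expanded search still finds `crm.client_master` through the related term "client".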
The key to a successful data catalog implementation is automating data loading and the discovery of data relationships. Many leading data catalog tools do this using statistics and machine learning algorithms.
Automatically identifying and classifying domains and entities, such as customer or product, produces classified data that supports better search, filtering of search results, and business glossary recommendations. Automatically suggesting business terms and associating them with data assets is a critical factor in data governance.
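As a rough illustration of automatic classification, the sketch below labels a column by checking what share of its values match a known pattern. The pattern set, the `classify_column` function, and the 80% threshold are assumptions made for this example; commercial tools combine many more signals, including statistics and machine learning models.

```python
import re

# Hypothetical patterns, chosen only for illustration.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def classify_column(values, threshold=0.8):
    """Return the best-matching label if enough non-empty values match it, else None."""
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return None
    best_label, best_share = None, 0.0
    for label, pattern in PATTERNS.items():
        # Share of sampled values that fit this pattern.
        share = sum(1 for v in sample if pattern.match(v)) / len(sample)
        if share > best_share:
            best_label, best_share = label, share
    return best_label if best_share >= threshold else None
```

A label produced this way could then be attached to the asset as a tag or used to suggest a business glossary term.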
The ability to relate data sets (synonyms) is a key element of a data catalog. Tools do this with unsupervised clustering algorithms based on factors like data overlap, distinct value matching, pattern matching, and name matching.
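The sketch below shows how two of those factors, value overlap and name matching, might be combined into a single relatedness score. It is a simple pairwise score rather than a clustering algorithm, and the function names, the weights, and the 0.5 cutoff are illustrative assumptions, not what any particular tool does.

```python
from difflib import SequenceMatcher
from itertools import combinations


def jaccard(a, b):
    """Overlap of distinct values: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def name_similarity(x, y):
    """Case-insensitive similarity of two column names, between 0 and 1."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def related_columns(columns, min_score=0.5, name_weight=0.3):
    """Score every pair of columns; `columns` maps a column name to its values.

    Returns (name_a, name_b, score) tuples at or above min_score, best first.
    """
    pairs = []
    for (na, va), (nb, vb) in combinations(columns.items(), 2):
        score = (name_weight * name_similarity(na, nb)
                 + (1 - name_weight) * jaccard(va, vb))
        if score >= min_score:
            pairs.append((na, nb, round(score, 3)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

For example, a `cust_id` column and a `customer_id` column that share most of their values would score well above the cutoff, while an unrelated `order_total` column would not be paired with either.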
Data catalog platforms should allow social sharing, comments, ratings, and communication so that business and technical users can describe data assets and reach a common understanding. The catalog is also a platform to define and manage the lifecycle of business terms, definitions, associated reference data, related terms, and links.
Data assets can be further rationalized through crowdsourced (“wisdom of crowds”) tags and annotations, which help enrich and curate the data, making it even more valuable throughout the enterprise.
Data catalog tools provide bi-directional lineage capabilities for a clear understanding of where data is coming from and where it is used. Additionally, they tell you who has interacted with the data. Views can also be designed to show related datasets, tables, transformations, reports, business terms, and applications. This all aids in the progressive discovery of other data sets of interest.
In the view below, you can see the traceability of data from its origin to various data layers through business-friendly lineage views, providing a clear understanding of the data movement.
A drill-down lineage view explores any lineage flow to show additional details, such as column-level details and the transformations applied along the way. These types of data catalog tools allow users to perform deeper impact analysis on data assets.
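Bi-directional lineage and impact analysis can be thought of as walking a directed graph in both directions. The following is a minimal sketch under that assumption. The `LineageGraph` class and the asset names in the usage note are hypothetical, and a real tool would also record column-level mappings and transformations on each edge.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph of data flows, indexed in both directions."""

    def __init__(self):
        self._down = defaultdict(set)  # asset -> assets it feeds
        self._up = defaultdict(set)    # asset -> assets it is fed by

    def add_flow(self, source, target):
        """Record that data moves from source to target."""
        self._down[source].add(target)
        self._up[target].add(source)

    def _walk(self, start, edges):
        # Breadth-first traversal collecting every reachable asset.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def upstream(self, asset):
        """Where does this asset's data come from?"""
        return self._walk(asset, self._up)

    def downstream(self, asset):
        """Impact analysis: what would a change to this asset affect?"""
        return self._walk(asset, self._down)
```

After registering flows such as `crm.customers` to `staging.customers` to `warehouse.dim_customer` to `report.churn`, calling `downstream("crm.customers")` returns every asset a change would touch, and `upstream("report.churn")` traces the report back to its origins.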
Leading Data Catalog Tools
If you would like to implement a data catalog, Intersys would be glad to guide you to a successful implementation.