Following a rigorous methodology is key to delivering customer satisfaction and expanding analytics use cases across the business.
Enterprises are building intelligent data lakes to drive more value from their raw data with next-generation analytics. Data lake analytics are proving to be change agents that transform business by uncovering opportunities, detecting problems, accelerating product innovation, and enhancing customer experiences.
The rewards from implementing a next-generation analytic data lake are many, but it is not an easy task. To remove some of the confusion and complexity, this roadmap explains the seven steps necessary to implement a next-generation analytic data lake within any enterprise. For a more visual representation and timing, take a look here.
Much like developing a business plan, data lake analytics starts with a clear analysis and evaluation of business challenges. During this initial phase, you frame the problem and identify the use cases that will have the biggest positive impact on the business. The best business use cases include input from multiple stakeholders, such as the analytics team, line-of-business leaders, data administrators, and data scientists, so you may want to invite all of these team members to a workshop. Together, the team can develop a business use case with a strong problem statement, a description of the problem behavior, complications, business impact, and problem history with examples. Keep an open mind during this discovery phase: it is likely to generate unexpected results, and those results can lead to real business transformation.
This stage also includes gathering sample data and performing data discovery and analysis. Data scientists and analytic developers will gather and prepare data across multiple segments and from all possible data sources, such as sensors, logs, and enterprise data. The goal is to find trustworthy data that contains the trend or event sequence at the root of the problem pattern or behavior.
During data discovery and analysis, the data will need to be both cleansed and validated to ensure accurate results.
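To make cleansing and validation concrete, here is a minimal sketch using pandas; the column names and the plausibility rule are hypothetical stand-ins for whatever your own data discovery surfaces.

```python
# A minimal cleansing-and-validation sketch. Column names
# (reading_id, sensor_value, event_time) are hypothetical.
import pandas as pd

def cleanse_and_validate(df: pd.DataFrame) -> pd.DataFrame:
    # Drop exact duplicates and rows missing required fields.
    df = df.drop_duplicates().dropna(subset=["reading_id", "sensor_value"])

    # Coerce types; unparseable timestamps become NaT and are removed.
    df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
    df = df.dropna(subset=["event_time"])

    # Validate against a simple domain plausibility rule (hypothetical range).
    df = df[df["sensor_value"].between(-50, 150)]
    return df.reset_index(drop=True)
```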
Informatica’s Axon Data Governance and Enterprise Data Catalog can help gather sample data and perform discovery and analysis. To prepare and organize sample data, consider Informatica Cloud Data Integration and Informatica Data Quality which can perform both data cleansing and validation.
During this phase, business users and data analysts will assemble and consolidate multiple use cases. They will test the hypotheses from the use cases and perform a fit-gap analysis to determine whether the data is appropriate and useful for each use case.
Informatica’s Axon Data Governance and Enterprise Data Catalog can help assemble and consolidate use cases.
Multiple analytical models exist, such as time-series forecasting, classification, and clustering. Linear programming, for example, is the best analytic approach for finding the most appealing price point for a product, while a statistical model is most appropriate for a random process, such as setting up alerts to detect credit card fraud.
You will need to evaluate the use case to determine the right analytic technique based on data patterns and behaviors.
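To make the linear-programming example concrete, here is a minimal sketch using scipy. A full price-demand model is beyond the scope of this roadmap, so it uses a classic product-mix formulation instead; all profits, resource rates, and capacities are hypothetical placeholders.

```python
# A toy linear-programming sketch: choose production quantities of two
# products to maximize profit under capacity constraints. All numbers
# are hypothetical.
from scipy.optimize import linprog

# linprog minimizes, so negate the per-unit profits ($40, $30).
objective = [-40, -30]

# Resource hours consumed per unit, and hours available.
constraints_lhs = [[2, 1],   # machine hours per unit of each product
                   [1, 3]]   # labor hours per unit of each product
constraints_rhs = [100, 90]  # machine and labor hours available

result = linprog(c=objective,
                 A_ub=constraints_lhs,
                 b_ub=constraints_rhs,
                 bounds=[(0, None), (0, None)],  # quantities can't be negative
                 method="highs")
print(result.x, -result.fun)  # optimal quantities and the resulting profit
```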
Once the correct analytic technique is chosen to address the business problem, you will perform a fit-gap analysis that will identify key data and components that fit the model and gaps that must be filled. Based on the findings from the fit-gap analysis, you can identify the appropriate analytics algorithm and modeling technique for the business use case. Data patterns and segmentation will inform these decisions.
Informatica’s Axon Data Governance and Enterprise Data Catalog can help evaluate and curate use cases. To perform the fit-gap analysis, consider Informatica Cloud Data Integration, Informatica Data Quality, and a modeling tool, which can help select the algorithms appropriate for the use case.
The most significant phase for the analytics team is developing data sets that capture both transactional data and the relationships between data. To build the data sets, the data team must acquire data and understand its characteristics, such as volume, frequency, availability, and complexity. The “noise” in the data must be discarded, and the remaining data must be labeled correctly with metadata.
Both data at rest and data in motion must be ingested: build pipelines for batch and streaming data, create sampling techniques, and perform data cleansing and validation.
The collected data patterns should be complete, rich, random, reliable, and consistent. Randomness assures that the samples represent the statistical characteristics of the complete data set. Structured and unstructured data must be integrated into a consumable format.
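As one way to preserve those statistical characteristics, here is a minimal sampling sketch in pandas that draws the same random fraction from every segment; the segment column name and the fraction are hypothetical.

```python
# A minimal stratified-sampling sketch: draw a random sample that
# preserves the proportions of a key segment column.
import pandas as pd

def stratified_sample(df: pd.DataFrame, segment_col: str,
                      frac: float, seed: int = 42) -> pd.DataFrame:
    # Sampling the same fraction within each segment keeps the segment
    # mix of the full data set; a fixed seed makes the draw repeatable.
    return (df.groupby(segment_col, group_keys=False)
              .sample(frac=frac, random_state=seed))

# Hypothetical usage:
# sample = stratified_sample(raw_events, "customer_segment", frac=0.05)
```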
Informatica’s Axon Data Governance, Enterprise Data Catalog, PowerExchange Change Data Capture, Informatica Cloud Data Integration, and Informatica Cloud Data Quality can help build pipelines that ingest batch, near-real-time, streaming, and real-time data. For data cleansing and validation across all of these modes, consider Informatica Cloud Data Quality.
Using the analytic technique chosen earlier, it is time to build the analytical models. For a successful analytical model, the analytics team should select algorithms that use small but meaningful data samples and test them rigorously. Crunching the data sets with rigorous testing will validate outcomes. You will also want to apply data visualization techniques to uncover patterns and trends. Finally, you will need to review results and scores and train the models so they can mature. The data is scored based on validation/comparison, aggregation/summarization, maximization/minimization, rare-event detection, or unusual-pattern detection; the scoring method will depend on the chosen analytic approach.
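As an illustration of this build-test-score loop, here is a minimal sketch using scikit-learn; the generated data set is only a stand-in for the prepared data sets from the previous step, and the model choice is an example, not a recommendation.

```python
# A minimal model-building sketch: train a classifier on a sample,
# validate it with cross-validation, and score it on a held-out split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the prepared data sets from the earlier phase.
features, labels = make_classification(n_samples=2000, n_features=10,
                                       random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Cross-validation on the training sample guards against overfitting
# to a single lucky split.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cross-validation accuracy:", cv_scores.mean())

model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```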
The analytics and architecture teams need to architect and develop an end-state solution that aligns with current business needs and adapts to future ones. The goals are to operationalize the pipelines and optimize the data lake ecosystem for scalability.
Informatica has multiple products to help operationalize the pipelines. They include Informatica Cloud Data Integration, PowerExchange Change Data Capture, Axon Data Governance, Enterprise Data Catalog and Informatica Cloud Data Quality.
The final stage is not final: it is a continuous process of refining and improving the analytic models. It includes measuring effectiveness by correlating business outcomes and insights with model results. Based on those results, you will calibrate the models through continuous discovery and analysis. You can monitor the models with metadata and visualization methods. Finally, establish a feedback loop for continuous improvement.
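One simple way to ground that feedback loop is a drift check that compares recent model scores against the score recorded at deployment and flags the model for recalibration; the threshold and window below are hypothetical tuning choices.

```python
# A minimal monitoring sketch for the feedback loop: flag a model for
# recalibration when its recent average score degrades past a tolerance.
def needs_recalibration(recent_scores: list[float],
                        baseline: float,
                        tolerance: float = 0.05) -> bool:
    if not recent_scores:
        return False  # nothing observed yet, nothing to act on
    recent_avg = sum(recent_scores) / len(recent_scores)
    return (baseline - recent_avg) > tolerance

# Hypothetical usage: deployed at 0.91 accuracy, last week averaged 0.83.
print(needs_recalibration([0.84, 0.82, 0.83], baseline=0.91))  # True
```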
The most effective data models are repeatable. Following these steps ensures that the business is creating data lake analytical models that solve business problems and deliver real value to the organization.