Original article was published on Artificial Intelligence on Medium
Data and Decision Models — How to use influence diagrams in a Data Science Project?
A step-by-step approach towards building an influence diagram
The Value Chain — An Overview
“Value chain” is a widely used term for a process model built around five key areas: requirement analysis, design, implementation, testing, and evolution. Several such process models are relevant to data science; however, there is no single right answer as to which is best. Organizations such as CSIRO and Pivotal (a big data company), and authors such as Miller and Mork, have proposed different models of this value chain for data science projects. The proposed models do not differ significantly, but each is customized to different business needs. The figure below illustrates the data model proposed by Miller and Mork (2013).
The Value Chain — Key Activities
In a nutshell, a value chain is often depicted as the series of activities performed to generate value for a project. These activities include collecting data from multiple sources, cleaning and wrangling it, integrating it with the existing system, analyzing key performance indicators, and finally presenting the results to drive business value that aligns with the organization’s goals. A data scientist plays a key role across all of these segments; in bigger organizations, however, the roles are often predefined and do not overlap. For example, big pharmaceutical and banking companies such as Novartis Health Care Pvt. Ltd. or Commonwealth Bank, Australia, have multiple predefined roles within a data science project, and a data scientist there is expected only to analyze the data and present key narratives from it to different business stakeholders.
The Influence Diagram
When formulating and analyzing a problem statement, a data scientist needs to understand the key levers that impact business decisions. Stakeholders are generally not familiar with data models or statistics, so there are scenarios in which the model results do not align with their business intuition. This is where an influence diagram comes into the picture. An influence diagram is a visual display of how different known and unknown variables can impact business decisions, which in turn shape the outcome. An influence diagram gives you a 360-degree picture of the following.
- The value one can generate from building a model
- The additional information that is critical to the project
- Cost of procuring this additional information
Consider an example. Prudential Financial, Inc. is one of the largest insurance companies in the USA. Recently, they observed a lot of customer churn owing to incorrect insurance quotes. The stakeholders of the Life Insurance and Annuity segment want to identify the drivers impacting risk assessment and how these drivers can cater to different business objectives. The influence diagram is developed to identify the key areas and levers of the problem statement and how they overlap to achieve a common business goal.
Components of an Influence Diagram
The Known Variable
The known variables represent measures that are known to a data scientist at the commencement of a project. These include data sources and attributes that are critical to the problem, existing performance indicators, business intuitions, and more. In the above example, the nodes “Historic Customer Data”, “Applicant Customer Data”, and “Existing Business” are the known variables.
The Chance Variable
The chance variable represents any measure that is unknown at the commencement of the project. It may include variables critical to the model, business understanding of the market, data to be procured from an external source, etc. It is important to understand that the value of a chance variable is always discovered in the future. In the above example, “Predict Risk Assessment”, “Market Share”, and “Customer Segment” are the chance variables. On closer inspection, Market Share can only be estimated once competitor data is available. Competitor data has to be procured from external sources, hence the value of this variable is unknown at the beginning of a project.
The Decision Variable
A decision variable represents a choice made by a decision-maker. It illustrates how different known and unknown variables can impact business choices and what kind of choices are impacted. In the diagram above, “Insurance Quote”, “Insurance Product Development”, and “Marketing Channels” are the key decisions that are influenced by the known and unknown variables studied during the data science project.
The Objective Variable
An objective variable is defined as the outcome the data science project intends to drive. The objective variable should align with business goals and fall under a broader umbrella impacted by multiple decision variables.
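The four component types can also be captured in a small data structure, which may help when checking a diagram for missing nodes. The following is a minimal sketch, not a standard library or the method used in this article. The class names `NodeType` and `InfluenceDiagram`, the objective node “Business Objective”, and every arrow added with `add_edge` are assumptions made for illustration; only the known, chance, and decision node names come from the example above.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    KNOWN = "known"
    CHANCE = "chance"
    DECISION = "decision"
    OBJECTIVE = "objective"


@dataclass
class InfluenceDiagram:
    nodes: dict = field(default_factory=dict)  # node name -> NodeType
    edges: list = field(default_factory=list)  # (source, target) pairs

    def add_node(self, name, node_type):
        self.nodes[name] = node_type

    def add_edge(self, source, target):
        # Both endpoints must already be registered as nodes.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError(f"unknown node in edge: {source} -> {target}")
        self.edges.append((source, target))

    def of_type(self, node_type):
        # Names of all nodes of the given type, in insertion order.
        return [n for n, t in self.nodes.items() if t is node_type]

    def influences_on(self, target):
        # Names of all nodes with an arrow pointing at `target`.
        return [s for s, t in self.edges if t == target]


diagram = InfluenceDiagram()
for name in ("Historic Customer Data", "Applicant Customer Data", "Existing Business"):
    diagram.add_node(name, NodeType.KNOWN)
for name in ("Predict Risk Assessment", "Market Share", "Customer Segment"):
    diagram.add_node(name, NodeType.CHANCE)
for name in ("Insurance Quote", "Insurance Product Development", "Marketing Channels"):
    diagram.add_node(name, NodeType.DECISION)
# Hypothetical objective node: the article does not name one.
diagram.add_node("Business Objective", NodeType.OBJECTIVE)

# Assumed arrows, for illustration only; the real diagram may differ.
diagram.add_edge("Historic Customer Data", "Predict Risk Assessment")
diagram.add_edge("Applicant Customer Data", "Predict Risk Assessment")
diagram.add_edge("Predict Risk Assessment", "Insurance Quote")
diagram.add_edge("Insurance Quote", "Business Objective")

print(diagram.of_type(NodeType.CHANCE))
print(diagram.influences_on("Predict Risk Assessment"))
```

Keeping the node type separate from the arrows makes it easy to ask questions such as “which chance variables feed this decision?” before drawing the final picture.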
How to develop an influence diagram?
- Identify all possible variables required for the project
- Discuss the availability of the different variables with the stakeholders or the Data Engineering and Management team
- Identify the known and chance variables from the discussion
- Understand the impact of chance variables on project or model development. If the impact is significant, discuss the procurement process and the cost involved
- Analyze all business decisions impacted by the project
- Use an online tool to put these variables in place and come up with the final influence diagram
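The first four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Candidate` class, the `available` flag, and the "low"/"medium"/"high" impact ratings are hypothetical, and the values assigned to each variable are made up to show the mechanics rather than taken from the insurance example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    available: bool  # outcome of the availability discussion (assumed values)
    impact: str      # hypothetical rating: "low", "medium", or "high"


def classify(candidates):
    """Split candidates into known and chance variables by availability."""
    known = [c.name for c in candidates if c.available]
    chance = [c.name for c in candidates if not c.available]
    return known, chance


def needs_procurement_discussion(candidates):
    """Chance variables whose impact on the project is rated high."""
    return [c.name for c in candidates if not c.available and c.impact == "high"]


# Step 1: list the candidate variables (ratings below are illustrative).
candidates = [
    Candidate("Historic Customer Data", available=True, impact="high"),
    Candidate("Applicant Customer Data", available=True, impact="high"),
    Candidate("Market Share", available=False, impact="high"),
    Candidate("Customer Segment", available=False, impact="medium"),
]

# Steps 2 and 3: availability decides known versus chance.
known, chance = classify(candidates)

# Step 4: flag high-impact chance variables for a procurement and cost discussion.
to_discuss = needs_procurement_discussion(candidates)

print(known)
print(chance)
print(to_discuss)
```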
The mathematical theory of influence diagrams is designed to facilitate understanding of how various possibilities and costs work together. The influence diagram is often categorized as an art of problem-solving. It teaches us to recognize the key variables in a problem, the objectives, known and unknown factors, and how different decisions are influenced to arrive at a common goal.
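One common way to make “possibilities and costs work together” concrete is to compare the expected payoff of a decision with and without knowing a chance variable in advance, and then weigh the difference against the cost of procuring that information. The sketch below assumes a single decision with two options and a single chance variable with two states. The option names, probabilities, payoffs, and procurement cost are all hypothetical numbers chosen for illustration and do not come from the insurance example.

```python
def expected_value(payoffs, probs):
    """Probability-weighted average payoff of one option."""
    return sum(p * v for p, v in zip(probs, payoffs))


def best_without_info(payoff_table, probs):
    """Pick the single option with the highest expected payoff."""
    return max(expected_value(payoffs, probs) for payoffs in payoff_table.values())


def best_with_perfect_info(payoff_table, probs):
    """Expected payoff if the state is revealed before choosing an option."""
    return sum(
        probs[state] * max(payoffs[state] for payoffs in payoff_table.values())
        for state in range(len(probs))
    )


# Hypothetical inputs: two states of a chance variable, two decision options.
probs = [0.5, 0.5]
payoff_table = {
    "option_a": [100, 0],  # payoff in state 0, payoff in state 1
    "option_b": [40, 40],
}
procurement_cost = 15  # assumed cost of obtaining the information

# Value of knowing the chance variable before deciding.
value_of_information = (
    best_with_perfect_info(payoff_table, probs) - best_without_info(payoff_table, probs)
)
worth_buying = value_of_information > procurement_cost

print(value_of_information, worth_buying)
```

With these assumed numbers, the information is worth more than it costs, so procuring it would be justified. With a higher procurement cost or a less uncertain chance variable, the conclusion could reverse.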