Source: Deep Learning on Medium
Assessing and improving readiness remains a significant priority for the United States Army. With this priority in mind, the Army recently launched a project to enhance its supply chain data environments by leveraging the power of artificial intelligence (AI) and machine learning (ML). The Army’s Logistics Innovation Agency (LIA) partnered with data science leaders in academia and industry to collaboratively build ML-based models that address data anomalies in an existing, active dataset. This proof of principle was accomplished in a condensed period of time and set the stage for an AI-ML driven solution that proactively cleans data, identifies previously unknown flaws, and resolves dataset utility issues while applying pattern recognition tools to positively affect readiness.
Machine Learning in Practice: From Hype to Reality
AI and ML are two areas that show tremendous potential for a wide variety of use cases, helping to augment, inform, and supplement human processes. In practice, most companies have fallen short in actually implementing these strategies. According to Gartner, 46% of Chief Information Officers (CIOs) have developed plans to implement AI and ML, but only 4% have actually deployed these solutions. Of the few organizations that are using ML, even fewer can point to demonstrable benefits.
One major factor contributing to this deployment lag is the gap between the technology’s vision and the reality of most organizations’ data environments. While the marketplace is envisioning automated quantum computing, self-driving cars, and robotic process automation, the reality is that most organizations are still trying to make sense of their intrinsically flawed data.
Poor quality data, however, doesn’t mean ML solutions should be scrapped altogether. Instead, AI and ML can be used to better understand operational data, uncover the root causes of data quality issues, resolve existing data errors, and prevent future errors by addressing the source of anomalies and why they occur in the first place.
Case Study: ML for Anomaly Detection in Army ERP Data
The Army’s Logistics Enterprise systems receive syndicated data from an enterprise authoritative data source (ADS) for material and equipment master records. In addition to syndicated feeds, there are also manual “data-create” processes. As part of an Army readiness initiative, all ERP and non-ERP systems must institute synchronization and standardization to ensure enterprise end-to-end lifecycle management. In practice, the current process for cataloging locally procured materials is bypassed through procedures in the ERPs that are not syndicated to the ADS. Published written standards exist; however, the systems do not lock down all fields, so invalid entries are accepted. A host of problems result, including duplication, incorrect records, lack of data validation, and use of inaccurate data.
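The kind of check that an unlocked field skips can be sketched in a few lines of Python. This is a minimal illustration only: the field names, formats, and unit codes below are hypothetical, since the article does not publish the Army's written standards.

```python
import re
from collections import Counter

# Hypothetical field rules; the actual written standards are not given in the article.
RULES = {
    "material_id": re.compile(r"\d{9}"),          # assumed: exactly nine digits
    "unit_of_issue": re.compile(r"EA|BX|FT|GL"),  # assumed: short list of valid codes
    "description": re.compile(r"\S.{2,39}"),      # assumed: 3 to 40 chars, no leading space
}

def validate(record):
    """Return the names of fields that are missing or break their rule."""
    bad = []
    for field, pattern in RULES.items():
        value = record.get(field)
        if value is None or not pattern.fullmatch(value):
            bad.append(field)
    return bad

def find_duplicates(records, key="material_id"):
    """Return key values that appear on more than one record."""
    counts = Counter(r.get(key) for r in records)
    return sorted(k for k, n in counts.items() if k is not None and n > 1)

records = [
    {"material_id": "123456789", "unit_of_issue": "EA", "description": "HEX BOLT"},
    {"material_id": "12345678", "unit_of_issue": "EA", "description": "WASHER"},
    {"material_id": "123456789", "unit_of_issue": "each", "description": "HEX BOLT"},
    {"material_id": "987654321", "unit_of_issue": "BX"},
]

for rec in records:
    print(rec.get("material_id"), validate(rec))
print("duplicates:", find_duplicates(records))
```

Running rules like these at entry time, rather than after import, is what turns a written standard into an enforced one.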
In addition, there is little visibility into database structures across systems, which is a major limiting factor in implementing an Army Master Data Governance solution. Working in collaboration with leading members of the data science community, the Department of the Army’s LIA set out to evaluate the potential of utilizing AI and ML for improving operational data quality to help improve Army readiness. Accomplished in partnership with the Army’s Logistics Support Activity (LOGSA), the LIA team included representatives from Leidos, a defense contractor, the MIT Center for Transportation and Logistics, Harvard Medical School, and TCB Analytics, a private sector data science firm that specializes in ML and deep learning. This highly collaborative, quick-turn project, completed in just ninety (90) days, allowed the group to benefit greatly from a combination of disparate backgrounds and strengths. According to Dr. Chris Cassa, a data scientist at Harvard Medical School and MIT-CTL, “The issues we faced with this Army dataset and techniques we employed are similar to those in healthcare and genetics and are broadly applicable across disciplines. Data cleansing is a central challenge as companies look to make sense of their data.”
The LIA team launched a pilot analysis and proof of principle to demonstrate how AI and ML algorithms, protocols, and methodologies could address known flaws in existing datasets and capitalize on pattern recognition to produce data cleansing models. By “big data” standards the dataset was relatively small: approximately 7 million records, of which 1.7 million (24.3%) were flawed. Phase I AI-ML applications dealt specifically with a subset of those 1.7 million errors, sorted into eight error categories that together accounted for approximately 650,000 known flaws.
Given the challenges with the data environments, it was insufficient to reactively cleanse the data after import. The team needed to identify, understand, and resolve broken processes that caused anomalies before data was imported. A progression of analysis and ML approaches was used to better understand the Army dataset, identify and classify anomalies, and ultimately provide a path to resolution.
Exploration with Summary Statistics
Datasets were uploaded to an Amazon Web Services (AWS) GovCloud instance to allow for distributed access and analysis across the team in a compliant cloud-based environment. Initial data discovery, exploration, and profiling activities were performed to help direct the focus of subsequent analyses.
Standard data analysis and quality control activities proved important to provide a high-level look at the existing data environment. These basic analyses confirmed data anomalies and, more importantly, offered insights about what to look for and suggested next steps with supervised classification and unsupervised techniques.
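As a rough illustration of this kind of profiling, the sketch below computes basic summary statistics for one field and groups its values by character pattern. The `part_no` field and its sample values are hypothetical; the article does not describe the dataset's schema or the tools the team used.

```python
import re
from collections import Counter

def shape(value):
    """Collapse a string to its pattern: letters become A, digits become 9."""
    out = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"\d", "9", out)

def profile(rows, field):
    """Summarize one field: row count, missing values, distinct values, patterns."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v not in (None, "")]
    shapes = Counter(shape(v) for v in present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "shapes": shapes.most_common(),  # most frequent pattern first
    }

# Hypothetical sample rows.
rows = [
    {"part_no": "AB-1234"},
    {"part_no": "CD-5678"},
    {"part_no": "EF-9012"},
    {"part_no": "1234AB"},
    {"part_no": ""},
]

report = profile(rows, "part_no")
print(report)
```

A dominant pattern with a long tail of rare ones is often the first hint of where to look: here the single `9999AA` value stands out against three `AA-9999` values.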
To understand dataset challenges and uncover sources of data quality issues, the team conducted a supervised ML analysis of anomalies. The team first applied standard data quality protocols, including summary statistics, basic pattern matching, and data profiling. Because standard data exploration may identify only a small subset of data anomalies, ML techniques, both supervised and unsupervised, are important for a complete picture of the data environment. ML enables the identification and resolution of more data anomalies and facilitates more granular, automatic classification of anomaly types.
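To make the idea of supervised anomaly classification concrete, here is a minimal sketch that trains a small categorical naive Bayes classifier on records already labeled with an error category. The article does not say which algorithms the team used, so the model choice, the features, the labels, and the sample records are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def featurize(record):
    """Turn a record into simple categorical features (all hypothetical)."""
    part = record["part_no"]
    return {
        "has_space": " " in part,
        "all_digits": part.isdigit(),
        "length_ok": len(part) == 7,
        "source": record["source"],
    }

class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.label_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)            # feature -> values seen in training
        for row, label in zip(rows, labels):
            for feat, val in row.items():
                self.value_counts[(label, feat)][val] += 1
                self.values[feat].add(val)
        return self

    def predict(self, row):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            for feat, val in row.items():
                seen = self.value_counts[(label, feat)][val]
                k = len(self.values[feat] | {val})  # number of possible values
                score += math.log((seen + 1) / (n + k))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labeled examples: (record, known error category).
train = [
    ({"part_no": "AB-1234", "source": "syndicated"}, "valid"),
    ({"part_no": "CD-5678", "source": "syndicated"}, "valid"),
    ({"part_no": "EF-9012", "source": "manual"}, "valid"),
    ({"part_no": "AB 1234", "source": "manual"}, "bad_separator"),
    ({"part_no": "GH 3456", "source": "manual"}, "bad_separator"),
    ({"part_no": "1234", "source": "manual"}, "truncated"),
    ({"part_no": "56789", "source": "manual"}, "truncated"),
]

model = NaiveBayes().fit([featurize(r) for r, _ in train], [y for _, y in train])
prediction = model.predict(featurize({"part_no": "JK 7890", "source": "manual"}))
print(prediction)
```

The useful property for data cleansing is that each predicted category can be routed to its own resolution step, instead of every flawed record landing in one undifferentiated error queue.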
With the supervised ML analysis complete, the team employed more advanced, unsupervised techniques to better understand the datasets. This approach allowed the team to identify causes of known anomalies and helped validate findings from the prior supervised approaches.
By adding unsupervised ML techniques and continuing to tune the initial data cleansing model, the team dramatically increased the number of identified anomalies. This more granular look at the data environment revealed anomaly types that had not been apparent from summary statistics alone. Ultimately, finer granularity resulted in a more accurate, more automated path to resolve data quality control issues.
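One simple unsupervised technique, shown here only as an illustration, is the modified z-score, which flags values far from the median without needing any labeled examples. The article does not name the unsupervised methods the team applied; the `weights_kg` field, its values, and the 3.5 threshold are assumptions.

```python
from statistics import median

def modified_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return [0.0 for _ in values]  # no spread, so nothing can be scored
    return [0.6745 * (v - med) / mad for v in values]

def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose absolute score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]

# Hypothetical weights for the same part, as recorded on different records.
weights_kg = [1.2, 1.1, 1.3, 1.2, 1.25, 120.0, 1.15, 0.0012]

outliers = flag_outliers(weights_kg)
print(outliers)
```

Because the median and the median absolute deviation are barely moved by extreme values, two badly wrong entries do not mask each other the way they can with a mean and standard deviation. The two flagged values here would be consistent with entries made in the wrong unit, which is the kind of root cause a summary average would hide.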
Findings and Benefits
During this pilot, the team was able to demonstrate the feasibility of ML to improve the quality and integrity of Army ERP data. For all investigated data anomalies, the root cause of the error was identified, data cleansing algorithms were developed to address existing datasets, algorithms were validated, and resolution activities were proposed to help mitigate future errors by adjusting operational procedures.
For one anomaly category, the application of the data resolution activities resulted in approximately 67% of records being protected against future data quality issues while also identifying a previously unknown second order error. In another anomaly category, applied data error resolution protocols potentially reduced human hours spent to address and correct flaws from approximately one year to several weeks.
Beyond the immediately quantifiable returns, there are numerous technical and operational benefits to this AI-ML application. For IT teams, implementing these data quality controls will reduce the volume of errors, improve data accuracy, reduce the time spent investigating individual errors, and remove roadblocks to data syndication across Army systems. For operations teams, improving the quality of ERP data will positively affect the accuracy of part orders and prevent delays from improper and incorrect orders. Simply put, soldiers in the field will not have to wait as long to receive spare parts. More generally, this AI-ML proof of principle showed that it is achievable to create, maintain, and analyze data in an end-to-end process that users can trust, while facilitating predictive insights about the fulfillment process.
Significant benefits for the Army were demonstrable after completion of this brief pilot analysis. Highlighting the power of this approach, in only ninety (90) days the project was negotiated and contracted, data was shared, environments were spun up, anomalies were investigated, and results were uncovered. Insights gleaned from this AI-ML application on a known, active Army dataset were positively received.
Given the success of this pilot and proof of principle, the LIA team of Leidos, Harvard and MIT academics, and TCB Analytics will work with LOGSA to expand the AI-ML analysis of the dataset to include all existing anomaly categories as well as identification of anomalies as yet unrecognized. Phase II of the project will leverage the findings from the initial AI-ML investigation to intelligently pinpoint specific areas most likely to generate an error and mitigate the chance of errors propagating through the resolution processes. Phase II solutions will address data quality issues by cleansing the existing dataset and preventing future errors through an automated and guided entry process. Specifically, this will include a guided data import application, construction of AI-driven anomaly resolution models, and a real-time monitoring dashboard.
As Tanya Cashorali of TCB Analytics states, “Ultimately we’re looking to address data errors at the source. Our ML and AI approach allows a proactive response to quality control with minimum impact to the end user.”
Dr. Donnie Horner leads a collaborative AI-ML team for Leidos, Inc. composed of colleagues from the private sector and academe serving a host of clients. A West Point (B.S.), MIT (M.S.) and Stanford (Ph.D.) graduate, Horner is Provost Emeritus at Jacksonville University and a former engineer at the Lockheed-Martin Skunkworks, where he worked on the Apache helicopter project and several classified interdisciplinary initiatives. As Provost, Dr. Horner worked extensively with local, regional, and national health care companies and banking and finance corporations to develop tailored curricula to produce baccalaureate and master’s graduates with robust data science, AI, and ML knowledge and skills. He has written and consulted extensively in the areas of AI-ML, deep learning, organizational culture, leadership, and high performing teams.
Stephanie Banks co-leads TCB Analytics, a Boston-based data and analytics consultancy. Prior to joining TCB, Stephanie held various marketing roles at small start-ups and larger consulting agencies in the data and analytics space. Within these roles, her experiences span data science, machine learning, advanced analytics, business intelligence, data visualization, data management, and content management. Stephanie holds a B.S. in Business Administration from Northeastern University, with a triple concentration in Management Information Systems, Marketing, and Management.