Toshiba and The Institute of Statistical Mathematics Have Developed Machine Learning Algorithm for Identifying Failure Factors from Data with Many Missing Values

--Contributes to improved reliability and productivity in manufacturing through high-speed, high-accuracy identification of causes of quality and yield deterioration--

August 2019
ISM2019-08

Tokyo—Toshiba Corporation (TOKYO: 6502) and The Institute of Statistical Mathematics (hereinafter “ISM”) today announce development of a machine learning algorithm that can identify factors that lower manufacturing quality and yield, even in datasets with many missing values.

The new algorithm, the “Least absolute shrinkage and selection operator with High Missing rate (HMLasso),” delivers 41% lower estimation error than the cutting-edge CoCoLasso*¹ algorithm. It realizes high-speed, high-accuracy factor analysis, even with previously difficult-to-use data with many missing values, and can improve productivity, yields, and reliability at manufacturing sites.

Toshiba and ISM will present details of the technology at the 28th International Joint Conference on Artificial Intelligence (IJCAI-19)*², which will be held on August 10–16, 2019 in Macao, China. A simple open-source program*³ will be released on August 2, 2019.

Manufacturing facilities generate and collect large volumes of data on manufacturing processes and equipment operations every day, including data covering product quality, processing conditions, equipment temperature, and pressure. Compiled in a regression model*⁴, these data can explain variations in quality and contribute significantly to identifying and remedying the causes of deterioration in quality and yield.

However, the collected data are often incomplete. Measurement failures and transmission errors cause data loss, and inspections are often based on sampling, so in some cases only around ten percent of the potentially available data are collected. In such cases, missing values are commonly filled in by calculation before analysis, but when many values are missing this requires a substantial amount of computation, which makes it difficult to speed up factor analysis and improve its accuracy.
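The scale of the problem can be illustrated with a minimal sketch. It assumes measurements are held in a NumPy array with NaN marking a missing value; the data and the helper name `missing_summary` are hypothetical and chosen only for illustration.

```python
import numpy as np

def missing_summary(X):
    """Return per-item missing rates and the number of fully observed rows."""
    mask = ~np.isnan(X)                      # True where a value was recorded
    missing_rate = 1.0 - mask.mean(axis=0)   # fraction of missing entries per item
    complete_rows = int(mask.all(axis=1).sum())
    return missing_rate, complete_rows

# Hypothetical data: 6 products (rows), 3 measured items (columns).
X = np.array([
    [1.0, np.nan, 0.3],
    [0.8, np.nan, np.nan],
    [1.2, 2.1,    0.1],
    [0.9, np.nan, np.nan],
    [1.1, np.nan, 0.4],
    [1.0, np.nan, np.nan],
])
rates, n_complete = missing_summary(X)
print("missing rate per item:", rates)
print("rows with no missing values:", n_complete)
```

When items are inspected by sampling, very few rows are complete, so an analysis restricted to complete rows discards most of the recorded values.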

To overcome this problem, Toshiba and ISM have co-developed a novel machine-learning algorithm, HMLasso, that constructs accurate regression models even from data with many missing values. The technology has three key characteristics:

Construction of accurate regression models, even with datasets with high missing rates
The widely used CoCoLasso does not consider missing rates, and its overall accuracy falls when datasets include items with high missing rates. In contrast, HMLasso adaptively performs calculations according to the missing rate, and constructs accurate regression models without any loss of calculation precision, even for items with high missing rates.
No separate step for filling in missing values
Regression models can be constructed directly from data with missing values, reducing overall calculation time.
Automatic selection of important items
Applying sparse modeling*⁵ to the analysis automatically selects the items with a large influence on quality or yield, even when there are many items.
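The general idea behind these characteristics can be sketched as follows. This is an illustrative simplification, not the HMLasso algorithm or the code of the hmlasso package: second moments are estimated from whichever pairs of values were observed, and a Lasso is fitted from those moments by coordinate descent. The eigenvalue clipping step is a plain stand-in for the missing-rate-aware calculation described above, and all function names are hypothetical.

```python
import numpy as np

def pairwise_moments(X, y):
    """Second moments from available pairs only (X and y assumed centered)."""
    mask = ~np.isnan(X)
    Xz = np.where(mask, X, 0.0)               # zeros contribute nothing to the sums
    M = mask.astype(float)
    n_pair = M.T @ M                           # rows where items j and k are both observed
    S = (Xz.T @ Xz) / np.maximum(n_pair, 1.0)
    rho = (Xz.T @ y) / np.maximum(M.sum(axis=0), 1.0)
    return S, rho, n_pair

def clip_to_psd(S, eps=1e-8):
    """Clip eigenvalues so the quadratic objective is convex (a simplification)."""
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    return (V * np.maximum(w, eps)) @ V.T

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_from_moments(S, rho, lam, n_iter=500):
    """Coordinate descent for 0.5 * b'Sb - rho'b + lam * ||b||_1."""
    b = np.zeros(len(rho))
    for _ in range(n_iter):
        for j in range(len(rho)):
            # Partial residual that excludes item j's own contribution.
            r = rho[j] - S[j] @ b + S[j, j] * b[j]
            b[j] = soft_threshold(r, lam) / S[j, j]
    return b

# Synthetic example: only items 0 and 1 influence y; half the values are removed.
rng = np.random.default_rng(0)
n, p = 500, 5
X_full = rng.standard_normal((n, p))
y = 2.0 * X_full[:, 0] - 1.0 * X_full[:, 1] + 0.1 * rng.standard_normal(n)
X = X_full.copy()
X[rng.random((n, p)) < 0.5] = np.nan
y = y - y.mean()
X = X - np.nanmean(X, axis=0)

S, rho, n_pair = pairwise_moments(X, y)
beta = lasso_from_moments(clip_to_psd(S), rho, lam=0.1)
print(beta)
```

No missing value is filled in at any point, and the L1 penalty sets the coefficients of unimportant items to exactly zero, which is what makes the item selection automatic.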

Toshiba and ISM have already demonstrated the theoretical and experimental effectiveness of HMLasso. Theoretical analyses verified that utilizing the missing rate realizes optimal error bounds, securing superior results over other algorithms. Numerical experiments on artificial data with a mean missing rate of 50%, and with missing rates of over 90% for some items, found a reduction in estimation error of approximately 41% against the cutting-edge CoCoLasso algorithm.
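One way to write a missing-rate-aware estimation step is sketched below. The notation is an assumption made for illustration and is not quoted from the announcement: S^pair is the covariance matrix estimated from available pairs, ρ is the corresponding covariance between the items and the target, n_jk is the number of samples in which items j and k are both observed, n is the total number of samples, and α ≥ 0 and λ > 0 are tuning parameters.

```latex
\begin{align}
\tilde{\Sigma} &= \operatorname*{arg\,min}_{\Sigma \succeq 0}
  \left\| W \odot \left( \Sigma - S^{\mathrm{pair}} \right) \right\|_F^2,
  \qquad W_{jk} = \left( \frac{n_{jk}}{n} \right)^{\alpha}, \\
\hat{\beta} &= \operatorname*{arg\,min}_{\beta}
  \; \frac{1}{2} \beta^{\top} \tilde{\Sigma} \beta - \rho^{\top} \beta + \lambda \left\| \beta \right\|_1 .
\end{align}
```

Under this sketch, entries of the covariance estimate that rest on few jointly observed samples receive small weights, so unreliable entries influence the fitted model less than well-observed ones.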

Use of HMLasso will ensure high-accuracy factor analysis, even for data with many missing values (Fig. 1). Moving forward, Toshiba and ISM will continue efforts to generalize and speed up the technology, to verify its application to actual tasks at various kinds of manufacturing facilities, and thereby to contribute to improved productivity, yields, and reliability.

Fig. 1: HMLasso utilization

Notes:

1. Convex Conditioned Lasso (CoCoLasso). Reference: Datta, A., & Zou, H. (2017). CoCoLasso for high-dimensional error-in-variables regression. The Annals of Statistics, 45(6), 2400–2426.

2. https://www.ijcai19.org/

3. https://CRAN.R-project.org/package=hmlasso

4. A model that explains values of a specific item using other items.

5. Sparse modeling: A methodology that simultaneously performs variable selection and modeling.

Contact information for press inquiries:

Toshiba Corporation
Corporate Communications Division, Public Relations & Investor Relations Office
03-3457-2100

The Institute of Statistical Mathematics
URA Station, Planning Unit, Administration Planning and Coordination Section
050-5533-8580
