Cracking Tabular Presentation Diversity for Automatic Cross-Checking over Numerical Facts

Published in ACM SIGKDD Conference on Knowledge Discovery and Data Mining in San Diego, California, USA, 2020

Recommended citation: Hongwei Li, Qingping Yang, Yixuan Cao, Jiaquan Yao, and Ping Luo. Cracking Tabular Presentation Diversity for Automatic Cross-Checking over Numerical Facts. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Aug 23-27, 2020, San Diego, California. https://dl.acm.org/doi/10.1145/3394486.3403310

Numerical facts in tabular form are widespread in the disclosure documents of vertical domains, especially in finance. The same fact is often mentioned multiple times in different tables with diverse tabular presentations. A firm’s disclosure documents are the main source of accounting information for individual investors, and their authenticity is crucial to both the firm’s development and investors’ investment decisions. However, due to the large volume of tables, frequent updates during editing, and limited time for manual cross-checking, these facts may be inconsistent with each other even after official publication. Such errors can bring huge reputational risk and even economic losses, even when the mistakes are unintentional rather than deliberate. This creates an opportunity for Automatic Numerical Cross-Checking over Tables. This paper introduces the key module of such a system, which aims to identify whether a pair of table cells is semantically equivalent, that is, whether both cells refer to the same fact. We observed that, because of tabular presentation diversity, facts in tabular form are difficult to parse into relational tuples. We therefore present an end-to-end solution that performs binary classification over each pair of table cells and does not involve explicit semantic parsing of tables. We also discuss how the design of this neural model trades off prediction accuracy against inference time over a large number of table cell pairs, and we propose some practical techniques to address the extreme class imbalance among pairs. Experiments show that our model achieves a macro F1 of 0.8297 in linking semantically equivalent table cells from IPO prospectuses. Finally, an auditing tool is built to support guided cross-checking over financial documents, reducing work hours by 52% to 68%. This system has received wide recognition in the Chinese financial community: nine of the top ten Chinese securities brokers have adopted it to support their investment banking business.
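To make the pairwise framing and the choice of macro F1 concrete, the following is a minimal sketch, not the paper's implementation. The cell tuples, the gold labels, and the helper names `candidate_pairs` and `macro_f1` are hypothetical and chosen only for illustration. It shows that the number of cell pairs grows quadratically, that equivalent pairs are rare among them, and that a trivial all-negative predictor scores well on accuracy but poorly on macro F1.

```python
from itertools import combinations


def candidate_pairs(cells):
    """Enumerate all unordered cell pairs: n * (n - 1) / 2 pairs for n cells."""
    return list(combinations(cells, 2))


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    scores = []
    for label in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy cells: (table id, row header, column header, value).
cells = [
    ("T1", "Revenue", "2019", 120.5),
    ("T2", "Total revenue", "FY2019", 120.5),
    ("T1", "Net profit", "2019", 30.2),
    ("T3", "Total assets", "2019", 980.0),
]

pairs = candidate_pairs(cells)

# Assumed gold labels: only the first pair (cells 0 and 1) states the same fact.
y_true = [1, 0, 0, 0, 0, 0]

# A trivial classifier that predicts "not equivalent" for every pair.
y_all_negative = [0] * len(pairs)

accuracy = sum(t == p for t, p in zip(y_true, y_all_negative)) / len(pairs)
print(f"pairs={len(pairs)} accuracy={accuracy:.3f} "
      f"macro_f1={macro_f1(y_true, y_all_negative):.3f}")
```

Even in this four-cell toy case only one of six pairs is positive. In real documents with many tables the ratio is far more skewed, which is why a class-balanced metric such as macro F1 is more informative than accuracy.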