Towards Automatic Numerical Cross-Checking: Extracting Formulas from Text

Published in The Web Conf (WWW) in Lyon, 2018

Recommended citation: Yixuan Cao, Hongwei Li, Ping Luo, and Jiaquan Yao. Towards Automatic Numerical Cross-Checking: Extracting Formulas from Text. In Proceedings of the 27th International Conference on World Wide Web (WWW-18), April 23–27, 2018,Lyon, France. https://dl.acm.org/doi/10.1145/3178876.3186166

Verbal descriptions over the numerical relationships among some objective measures widely exist in the published documents on Web, especially in the financial fields. However, due to large volumes of documents and limited time for manual cross-check, these claims might be inconsistent with the original structured data of the related indicators even after official publishing. Such errors can seriously affect investors’ assessment of the company and may cause them to undervalue the firm even if the mistakes are made unintentionally instead of deliberately. It creates an opportunity for automated Numerical Cross-Checking (NCC) systems. This paper introduces the key component of such a system, formula extractor, which extracts formulas from verbal descriptions of numerical claims. Specifically, we formulate this task as a DAG-structure prediction problem, and propose an iterative relation extraction model to address it. In our model, we apply a bi-directional LSTM followed by a DAG-structured LSTM to extract formulas layer by layer iteratively. Then, the model is built using a human-labeled dataset of tens of thousands of sentences. The evaluation shows that this model is effective in formula extraction. At the relation level, the model achieves a 97.78% precision and 98.33% recall. At the sentence level, the predictions over 92.02% of sentences are perfect. Overall, the project for NCC has received wide recognition in the Chinese financial community.

Download paper here