Publications

Rich-text Document Styling Restoration via Reinforcement Learning

Published in Frontiers of Computer Science, 2020

Richly formatted documents, such as financial disclosures, scientific articles, and government regulations, are widespread on the Web. However, since most of these documents are published only for reading, the styling information inside them is usually missing, which makes them awkward or even burdensome to display and edit across different formats and platforms. In this study we formulate document styling restoration as an optimization problem: identify the styling settings of the document elements (e.g., lines, table cells, text) so that rendering with those settings yields a document in which each element occupies exactly, or very nearly, the same position as in the original document. Since each styling setting is a decision, the problem can be cast as a multi-step decision-making task over all the document elements and then solved by reinforcement learning. Specifically, Monte-Carlo Tree Search (MCTS) is used to explore the different styling settings, and the policy function is learned under the supervision of delayed rewards. As a case study, we restore the styling information inside tables, where structural and functional data in documents are usually presented. Experiments show that our best reinforcement method successfully restores the styling in 87.65% of the tables, a 25.75% absolute improvement over the greedy method. We also discuss the tradeoff between inference time and restoration success rate, and argue that although the reinforcement methods cannot be used in real-time scenarios, they are suitable for offline tasks with high quality requirements. Finally, this model has been applied in a PDF parser to support cross-format display.

Recommended citation: Hongwei Li, Yingpeng Hu, Yixuan Cao, Ganbin Zhou, and Ping Luo. Rich-text Document Styling Restoration via Reinforcement Learning. Frontiers of Computer Science, 2020. https://doi.org/10.1007/s11704-020-9322-7

Semantic Matching over Matrix-Style Tables in Richly Formatted Documents

Published in International Conference on Database and Expert Systems Applications (DEXA) in Bratislava, Slovakia, 2020

Tables are an efficient way to represent a large number of facts compactly. As practitioners in a vertical domain share a great deal of prior knowledge, they tend to represent facts more concisely using matrix-style tables. However, such tables are intended for human reading and are not machine-readable, because of their complex structures, which include row headers, column headers, metadata, external context, and even hierarchies in headers. To help practitioners mine and use these matrix-style tables more efficiently, in this study we introduce a challenging task: discovering fact-overlapping relations between matrix-style tables. This relation focuses on fine-grained local semantics rather than the overall relatedness considered in conventional tasks. We propose an attention-based model for this task. Experiments reveal that our model is better at discovering local relatedness and outperforms four baseline methods. We also conduct an ablation study and a case study to investigate our model in detail.

Recommended citation: Hongwei Li, Qingping Yang, Yixuan Cao, Ganbin Zhou, and Ping Luo. Semantic Matching over Matrix-Style Tables in Richly Formatted Documents. In Proceedings of the International Conference on Database and Expert Systems Applications, Sep 14-17, 2020, Bratislava, Slovakia.

Cracking Tabular Presentation Diversity for Automatic Cross-Checking over Numerical Facts

Published in ACM SIGKDD Conference on Knowledge Discovery and Data Mining in San Diego, California, USA, 2020

Numerical facts in tabular form are widespread in the disclosure documents of vertical domains, especially in finance. It is also quite common for the same fact to be mentioned multiple times in different tables with diverse tabular presentations. Firms’ disclosure documents are the main source of accounting information for individual investors. Their authenticity is crucial for both firms’ development and investors’ investment decisions. However, due to large volumes of tables, frequent updates during editing, and limited time for manual cross-checking, these facts might be inconsistent with each other even after official publication. Such errors may bring about huge reputational risk and even economic losses, even when the mistakes are unintentional rather than deliberate. This creates an opportunity for Automatic Numerical Cross-Checking over Tables. This paper introduces the key module of such a system, which aims to identify whether a pair of table cells are semantically equivalent, namely whether they refer to the same fact. We observed that, due to tabular presentation diversity, facts in tabular form are difficult to parse into relational tuples. Thus, we present an end-to-end solution based on binary classification over each pair of table cells, which does not involve explicit semantic parsing of tables. We also discuss how the design of this neural model balances prediction accuracy against inference time for a large number of table cell pairs, and propose some practical techniques to address the extreme class imbalance among pairs. Experiments show that our model achieves macro F1=0.8297 in linking semantically equivalent table cells from IPO prospectuses. Finally, an auditing tool is built to support guided cross-checking over financial documents, reducing work hours by 52% to 68%. This system has received wide recognition in the Chinese financial community. Nine of the top ten Chinese securities brokers have adopted this system to support their investment banking business.

Recommended citation: Hongwei Li, Qingping Yang, Yixuan Cao, Jiaquan Yao, and Ping Luo. Cracking Tabular Presentation Diversity for Automatic Cross-Checking over Numerical Facts. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Aug 23-27, 2020, San Diego, California. https://dl.acm.org/doi/10.1145/3394486.3403310

Towards Automatic Numerical Cross-Checking: Extracting Formulas from Text

Published in The Web Conference (WWW) in Lyon, France, 2018

Verbal descriptions of the numerical relationships among objective measures are widespread in documents published on the Web, especially in finance. However, due to large volumes of documents and limited time for manual cross-checking, these claims might be inconsistent with the original structured data of the related indicators even after official publication. Such errors can seriously affect investors’ assessment of the company and may cause them to undervalue the firm, even when the mistakes are unintentional rather than deliberate. This creates an opportunity for automated Numerical Cross-Checking (NCC) systems. This paper introduces the key component of such a system, the formula extractor, which extracts formulas from verbal descriptions of numerical claims. Specifically, we formulate this task as a DAG-structure prediction problem and propose an iterative relation extraction model to address it. In our model, we apply a bi-directional LSTM followed by a DAG-structured LSTM to extract formulas iteratively, layer by layer. The model is built using a human-labeled dataset of tens of thousands of sentences. The evaluation shows that this model is effective in formula extraction. At the relation level, the model achieves 97.78% precision and 98.33% recall. At the sentence level, the predictions for 92.02% of sentences are perfect. Overall, the NCC project has received wide recognition in the Chinese financial community.

Recommended citation: Yixuan Cao, Hongwei Li, Ping Luo, and Jiaquan Yao. Towards Automatic Numerical Cross-Checking: Extracting Formulas from Text. In Proceedings of the 27th International Conference on World Wide Web (WWW-18), April 23–27, 2018, Lyon, France. https://dl.acm.org/doi/10.1145/3178876.3186166