Table extraction is the process of recognizing and separating a
table from a large document, possibly also recognizing individual rows, columns or elements.
It may be regarded as a special form of
information extraction.
Table extraction from
webpages can take advantage of the special
HTML elements that exist for tables, e.g., the "table" tag,
and programming libraries may implement table extraction from webpages.
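To illustrate why dedicated table markup makes extraction from webpages comparatively easy, the following is a minimal sketch that uses only Python's standard-library `html.parser` module. The class name `TableParser` and the sample page are hypothetical and chosen for illustration; the sketch assumes simple, non-nested tables and ignores attributes such as `rowspan` and `colspan`.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <table> as a list of rows (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> encountered
        self._row = None   # cells of the row currently being read, if any
        self._cell = None  # text fragments of the cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside of a table cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Hypothetical sample page: one table surrounded by ordinary text.
page = (
    "<p>Intro</p>"
    "<table>"
    "<tr><th>Name</th><th>Year</th></tr>"
    "<tr><td>Alpha</td><td>2017</td></tr>"
    "</table>"
    "<p>Outro</p>"
)

parser = TableParser()
parser.feed(page)
print(parser.tables)
```

Because the markup delimits tables, rows and cells explicitly, the parser only has to track which element it is currently inside; no layout analysis is needed.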
The
Python pandas software library can extract tables from HTML webpages via its read_html() function.
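As a minimal sketch of how `read_html()` might be used, the example below parses a small in-memory HTML string rather than a live webpage. The sample markup and the variable names are hypothetical, and the sketch assumes that pandas is installed together with an HTML parser backend such as lxml (or BeautifulSoup with html5lib).

```python
from io import StringIO

import pandas as pd

# Hypothetical sample page containing a single table.
HTML = """
<html><body>
<p>Some text before the table.</p>
<table>
  <tr><th>Tool</th><th>Year</th></tr>
  <tr><td>Alpha</td><td>2017</td></tr>
  <tr><td>Beta</td><td>2023</td></tr>
</table>
</body></html>
"""

# read_html() returns a list with one DataFrame per <table> element found.
tables = pd.read_html(StringIO(HTML))
first = tables[0]
print(first)
```

The function also accepts a URL or a file path in place of the `StringIO` object; wrapping literal HTML in `StringIO` avoids a deprecation warning in recent pandas versions.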
More challenging is table extraction from
PDFs or
scanned images, where there is usually no table-specific machine-readable markup.[1]
Systems that extract data from tables in scientific
PDFs have been described.[2][3]
Wikipedia presents some of its information in tables;
for example, 3.5 million tables can be extracted from the
English Wikipedia.[4]
Some of the tables have a specific format, e.g., the so-called
infoboxes.
Large-scale table extraction of Wikipedia infoboxes forms one of the sources for
DBpedia.[5]
Commercial
web services for table extraction exist, e.g.,
Amazon Textract,
Google's Document AI,
IBM Watson Discovery, and
Microsoft Form Recognizer.[1]
Open-source tools also exist, e.g., PDFFigures 2.0, which has been used in
Semantic Scholar.[6]
In a comparison published in 2017, researchers found that the proprietary program
ABBYY FineReader yielded the best PDF table extraction performance among the six tools evaluated.[7] In a 2023 benchmark evaluation,[8] Adobe Extract,[9] a cloud-based
API that employs Adobe's Sensei
AI platform,[10] performed best among the five tools evaluated for table extraction.
[1] Douglas Burdick; Marina Danilevsky; Alexandre V Evfimievski; Yannis Katsis; Nancy Wang (August 2020). "Table extraction and understanding for scientific and enterprise applications". Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases. 13 (12): 3433–3436. doi:10.14778/3415478.3415563. ISSN 2150-8097. Wikidata Q108170445.
[4] Tobias Bleifuß; Leon Bornemann; Dmitri V. Kalashnikov; Felix Naumann; Divesh Srivastava (17 August 2021). "The Secret Life of Wikipedia Tables" (PDF). Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores. CEUR Workshop Proceedings: 20–26. Wikidata Q108215401.