From Wikipedia, the free encyclopedia

In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications." [1]

According to Bird & Simons (2003), [2] this includes

  1. data, i.e., "any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of handwritten index cards. The information could range in content from unanalyzed sound recordings to fully transcribed and annotated texts to a complete descriptive grammar", [2]
  2. tools, i.e., "computational resources that facilitate creating, viewing, querying, or otherwise using language data", [2] and
  3. advice, i.e., "any information about what data sources are reliable, what tools are appropriate in a given situation, what practices to follow when creating new data". This last aspect is usually referred to as "best practices" or "(community) standards". [2]

In a narrower sense, the term language resource is applied specifically to resources that are available in digital form, "encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management". [1]

Typology

As of May 2020, no widely used standard typology of language resources has been established (current proposals include the LRE Map, [3] META-SHARE, [4] and, for data, the LLOD classification). Important classes of language resources include

  1. data
    1. lexical resources, e.g., machine-readable dictionaries,
    2. linguistic corpora, i.e., digital collections of natural language data,
    3. linguistic databases such as the Cross-Linguistic Linked Data collection,
  2. tools
    1. linguistic annotations and tools for creating such annotations in a manual or semiautomated fashion (e.g., tools for annotating interlinear glossed text such as Toolbox and FLEx, or other language documentation tools),
    2. applications for search and retrieval over such data (corpus management systems) and for automated annotation (part-of-speech tagging, syntactic parsing, semantic parsing, etc.),
  3. metadata and vocabularies
    1. vocabularies, repositories of linguistic terminology and language metadata, e.g., META-SHARE (for language resource metadata), [4] the ISO 12620 data category registry (for linguistic features, data structures and annotations within a language resource), [5] or the Glottolog database (identifiers for language varieties and bibliographical database). [6]
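
The three top-level classes listed above can be modelled as a small data structure. The following is a minimal sketch, not an established schema: the names ResourceClass, LanguageResource, CATALOGUE and by_class, as well as the subtype labels, are assumptions made for illustration only. The example resources are the ones named in the list.

```python
from dataclasses import dataclass
from enum import Enum


class ResourceClass(Enum):
    """Top-level classes from the list above (hypothetical encoding)."""
    DATA = "data"
    TOOL = "tools"
    METADATA = "metadata and vocabularies"


@dataclass(frozen=True)
class LanguageResource:
    name: str
    resource_class: ResourceClass
    subtype: str  # free-text label, not drawn from any standard vocabulary


# Resources named in the list above; the subtype strings are illustrative.
CATALOGUE = [
    LanguageResource("Cross-Linguistic Linked Data", ResourceClass.DATA,
                     "linguistic database"),
    LanguageResource("Toolbox", ResourceClass.TOOL, "annotation tool"),
    LanguageResource("FLEx", ResourceClass.TOOL, "annotation tool"),
    LanguageResource("Glottolog", ResourceClass.METADATA, "vocabulary"),
]


def by_class(resources, resource_class):
    """Return the names of all resources in the given top-level class."""
    return [r.name for r in resources if r.resource_class is resource_class]


print(by_class(CATALOGUE, ResourceClass.TOOL))
```

Because no standard typology exists, the subtype is kept as a plain string here rather than a fixed enumeration; a real catalogue would more likely follow one of the proposals cited above.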

Language resource publication, dissemination and creation

A major concern of the language resource community has been to develop infrastructures and platforms to present, discuss and disseminate language resources. Selected contributions in this regard include:

Standards and best practices for language resources are the subject of several community groups and standardization efforts, including

  • ISO Technical Committee 37: Terminology and other language and content resources (ISO/TC 37), developing standards for all aspects of language resources,
  • W3C Community Group Best Practices for Multilingual Linked Open Data (BPMLOD), [8] working on best practice recommendations for publishing language resources as Linked Data or in RDF,
  • W3C Community Group Linked Data for Language Technology (LD4LT), [9] working on linguistic annotations on the web and language resource metadata,
  • W3C Community Group Ontology-Lexica (OntoLex), [10] working on lexical resources,
  • the Open Linguistics working group of the Open Knowledge Foundation, working on conventions for publishing and linking open language resources, developing the Linguistic Linked Open Data cloud, [11]
  • the Text Encoding Initiative (TEI), [12] working on XML-based specifications for language resources and digitally edited text.
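
To give a rough idea of what publishing language resource metadata in RDF can look like, the sketch below writes a few Dublin Core properties in Turtle syntax using plain string formatting. It is an illustration under stated assumptions, not a recommendation of any of the groups above: the helper names turtle_literal and describe, the example.org URI and the property selection are all hypothetical, and actual community recommendations use richer vocabularies.

```python
# Namespace of the Dublin Core terms vocabulary.
DCTERMS = "http://purl.org/dc/terms/"


def turtle_literal(value: str) -> str:
    """Quote a string as a Turtle literal, escaping \\, " and newlines."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def describe(resource_uri: str, title: str, language: str, kind: str) -> str:
    """Return a small Turtle description of one language resource.

    Hypothetical helper: the choice of properties is an assumption.
    """
    lines = [
        f"@prefix dcterms: <{DCTERMS}> .",
        "",
        f"<{resource_uri}>",
        f"    dcterms:title {turtle_literal(title)} ;",
        f"    dcterms:language {turtle_literal(language)} ;",
        f"    dcterms:type {turtle_literal(kind)} .",
    ]
    return "\n".join(lines)


# The URI below is a placeholder, not a real resource.
print(describe("http://example.org/resource/demo-corpus",
               "Demo corpus", "en", "linguistic corpus"))
```

In practice a dedicated RDF library would handle serialization and validation; string formatting is used here only to keep the example self-contained.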


References

  1. LD4LT (2020), The Metashare Ontology as Created by the LD4LT Community Group, W3C Community Group Linked Data for Language Technology (LD4LT), Development branch, version of Mar 10, 2020
  2. Bird, Steven; Simons, Gary (2003-11-01). "Extending Dublin Core Metadata to Support the Description and Discovery of Language Resources". Computers and the Humanities. 37 (4): 375–388. arXiv: cs/0308022. Bibcode: 2003cs........8022B. doi: 10.1023/A:1025720518994. ISSN 1572-8412. S2CID 5969663.
  3. Calzolari, N., Del Gratta, R., Francopoulo, G., Mariani, J., Rubino, F., Russo, I., & Soria, C. (2012, May). The LRE Map. Harmonising Community Descriptions of Resources. In LREC (pp. 1084-1089).
  4. McCrae, John P.; Labropoulou, Penny; Gracia, Jorge; Villegas, Marta; Rodríguez-Doncel, Víctor; Cimiano, Philipp (2015). "One Ontology to Bind Them All: The META-SHARE OWL Ontology for the Interoperability of Linguistic Datasets on the Web". In Gandon, Fabien; Guéret, Christophe; Villata, Serena; Breslin, John; Faron-Zucker, Catherine; Zimmermann, Antoine (eds.). The Semantic Web: ESWC 2015 Satellite Events. Lecture Notes in Computer Science. Vol. 9341. Cham: Springer International Publishing. pp. 271–282. doi: 10.1007/978-3-319-25639-9_42. ISBN 978-3-319-25639-9.
  5. Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2008). ISOcat: Corralling data categories in the wild. In 6th International Conference on Language Resources and Evaluation (LREC 2008).
  6. Nordhoff, Sebastian (2012), Chiarcos, Christian; Nordhoff, Sebastian; Hellmann, Sebastian (eds.), "Linked Data for Linguistic Diversity Research: Glottolog/Langdoc and ASJP Online", Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata, Springer, pp. 191–200, doi: 10.1007/978-3-642-28249-2_18, ISBN 978-3-642-28249-2
  7. "Language Resources and Evaluation". Springer. Retrieved 2020-05-13.
  8. "Best Practices for Multilingual Linked Open Data Community Group". www.w3.org. 2 October 2015. Retrieved 2020-05-13.
  9. "Linked Data for Language Technology Community Group". www.w3.org. 26 June 2015. Retrieved 2020-05-13.
  10. "Ontology-Lexica Community Group". www.w3.org. 10 May 2016. Retrieved 2020-05-13.
  11. "Linguistic Linked Open Data".
  12. "TEI: Text Encoding Initiative". tei-c.org. Retrieved 2020-05-13.