Wikidata is a document-oriented database, focusing on items, which represent any kind of topic, concept, or object. Each item is allocated a unique, persistent identifier, a positive integer prefixed with the upper-case letter Q, known as a "QID". The Q is the initial of Qamarniso Vrandečić (née Ismoilova), an Uzbek Wikimedian married to the Wikidata co-developer Denny Vrandečić.[6] Because the identifier is language-neutral, the basic information required to identify the topic that an item covers can be translated without favouring any language.
Item labels do not need to be unique. For example, there are two items named "Elvis Presley": Elvis Presley (Q303), which represents the American singer and actor, and Elvis Presley (Q610926), which represents his self-titled album. However, the combination of a label and its description must be unique. To avoid ambiguity, an item's unique identifier (QID) is hence linked to this combination.
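The uniqueness rule can be illustrated with a minimal sketch. The registry and the register function are hypothetical names introduced here, not part of Wikidata's software; the two QIDs and labels come from the text, the descriptions are paraphrased from it, and the third QID is a placeholder.

```python
from typing import Dict, Tuple

# Hypothetical registry keyed by (label, description); values are QIDs.
registry: Dict[Tuple[str, str], str] = {}


def register(qid: str, label: str, description: str) -> None:
    """Add an item, rejecting a duplicate (label, description) pair."""
    key = (label, description)
    if key in registry:
        raise ValueError(f"{key} is already used by {registry[key]}")
    registry[key] = qid


# Same label, different descriptions: both are accepted.
register("Q303", "Elvis Presley", "American singer and actor")
register("Q610926", "Elvis Presley", "self-titled album")

# Same label AND same description: rejected. The QID is a placeholder.
try:
    register("Q999999999", "Elvis Presley", "self-titled album")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

The label alone does not identify an item here; only the pair does, which is why the QID is the safe way to refer to an item.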
Main parts
Fundamentally, an item consists of:
An identifier (the QID), related to a label and a description.
Optionally, multiple aliases and some number of statements (and their properties and values).
Statements
Statements are how any information known about an item is recorded in Wikidata. Formally, they consist of key–value pairs, which match a property (such as "author", or "publication date") with one or more entity values (such as "Sir Arthur Conan Doyle" or "1902"). For example, the informal English statement "milk is white" would be encoded by a statement pairing the property color (P462) with the value white (Q23444) under the item milk (Q8495).
Statements may map a property to more than one value. For example, the "occupation" property for Marie Curie could be linked with the values "physicist" and "chemist", to reflect the fact that she engaged in both occupations.[7]
Values may take on many types including other Wikidata items, strings, numbers, or media files. Properties prescribe what types of values they may be paired with. For example, the property official website (P856) may only be paired with values of type "URL".[8]
Optionally, qualifiers can be used to refine the meaning of a statement by providing additional information. For example, a "population" statement could be modified with a qualifier such as "point in time (P585): 2011" (as its own key-value pair). Values in the statements may also be annotated with references, pointing to a source backing up the statement's content.[9] As with statements, all qualifiers and references are property–value pairs.
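As a rough illustration of this structure, the following sketch models a statement as a property–value pair with optional qualifier and reference pairs. The Statement class and values_for helper are hypothetical and do not reflect Wikidata's actual storage format. Where the text gives no identifier, plain labels stand in for them, and the city, population figure, and reference URL are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Pair = Tuple[str, str]  # a (property, value) pair


@dataclass
class Statement:
    subject: str                      # the item the statement belongs to
    prop: str                         # the property (key)
    value: str                        # the value paired with the property
    qualifiers: List[Pair] = field(default_factory=list)
    references: List[Pair] = field(default_factory=list)


statements = [
    # "milk is white": identifiers taken from the text.
    Statement("Q8495", "P462", "Q23444"),
    # One property mapped to two values (labels used instead of identifiers).
    Statement("Marie Curie", "occupation", "physicist"),
    Statement("Marie Curie", "occupation", "chemist"),
    # A qualified, referenced statement; city, figure and URL are invented.
    Statement(
        "Example City", "population", "100000",
        qualifiers=[("P585", "2011")],
        references=[("reference URL", "https://example.org/census")],
    ),
]


def values_for(subject: str, prop: str) -> List[str]:
    """Return every value recorded for a property of a subject, in order."""
    return [s.value for s in statements if s.subject == subject and s.prop == prop]
```

Calling values_for("Marie Curie", "occupation") returns both occupations, since each value is held in its own statement.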
Properties
Each property has a numeric identifier prefixed with a capital P and a page on Wikidata with optional label, description, aliases, and statements. As such, there are properties with the sole purpose of describing other properties, such as subproperty of (P1647).
Properties may also define more complex rules about their intended usage, termed constraints. For example, the capital (P36) property includes a "single value constraint", reflecting the reality that (typically) territories have only one capital city. Constraints are treated as testing alerts and hints, rather than inviolable rules.[10]
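A single value constraint can be sketched as a check that reports hints instead of rejecting data. The function name and the territory and city labels below are hypothetical; only the property identifier P36 comes from the text, and this is not how Wikidata itself implements constraint checking.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)


def single_value_warnings(statements: Iterable[Triple], property_id: str) -> List[str]:
    """Return a hint for every subject with more than one value for property_id."""
    counts = Counter(subject for subject, prop, _ in statements if prop == property_id)
    return [
        f"{subject}: {n} values for {property_id}, expected one"
        for subject, n in sorted(counts.items())
        if n > 1
    ]


# Hypothetical territories; territory-B has two capitals recorded.
data = [
    ("territory-A", "P36", "city-1"),
    ("territory-B", "P36", "city-2"),
    ("territory-B", "P36", "city-3"),
]

warnings = single_value_warnings(data, "P36")
```

The data is left untouched and the result is only a list of messages, which mirrors the idea that constraints are alerts and not inviolable rules.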
Before a new property is created, it needs to undergo a discussion process.[11][12]
The most used property is cites work (P2860), which is used on more than 290,000,000 item pages as of November 2023.[13]
Lexemes
In linguistics, a lexeme is a unit of lexical meaning representing a group of words that share the same core meaning and grammatical characteristics.[14][15] Similarly, Wikidata's lexemes are items with a structure that makes them more suitable to store lexicographical data. Since 2016, Wikidata has supported lexicographical entries in the form of lexemes.[16]
In Wikidata, lexicographical entries have a different identifier from regular item entries. These entries are prefixed with the letter L, such as in the example entries for book and cow. Lexicographical entries in Wikidata can contain statements, senses, and forms.[17] The use of lexicographical entries in Wikidata allows for the documentation of word usage, connections between words and items on Wikidata, and word translations, and makes lexicographical data machine-readable.
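The prefix convention described so far (Q for items, P for properties, L for lexemes) can be sketched as a small classifier. The function name is hypothetical. The text says only that a QID is a positive integer prefixed with Q, so applying the same numeric rule to P and L identifiers is an assumption.

```python
import re
from typing import Optional

# Prefix letters as described in the text.
_ENTITY_KINDS = {"Q": "item", "P": "property", "L": "lexeme"}

# Assumption: every identifier is a prefix letter plus a positive integer
# without leading zeros.
_ID_PATTERN = re.compile(r"([QPL])([1-9][0-9]*)")


def entity_kind(identifier: str) -> Optional[str]:
    """Return 'item', 'property' or 'lexeme', or None if the form is not recognised."""
    match = _ID_PATTERN.fullmatch(identifier)
    return _ENTITY_KINDS[match.group(1)] if match else None
```

For example, entity_kind("Q303") yields "item" and entity_kind("P462") yields "property", while a lower-case or non-numeric string yields None.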
In 2020, lexicographical entries on Wikidata exceeded 250,000. The language with the most lexicographical entries was Russian, with a total of 101,137 lexemes, followed by English with 38,122 lexemes. There are over 668 languages with lexicographical entries on Wikidata.[18]
Entity Schemas
In Wikidata, a schema is a data model that outlines the necessary attributes for a data item.[19][20] For instance, a data item that uses the attribute "instance of" with the value "human" would typically include attributes such as "place of birth", "date of birth", "date of death", and "place of death".[21] Entity schemas use Shape Expressions (ShEx) to describe the data in Wikidata items as expressed in the Resource Description Framework (RDF).[22] The use of entity schemas in Wikidata helps address data inconsistencies and unchecked vandalism.[19]
In January 2019, development began on a new MediaWiki extension to enable storing ShEx in a separate namespace.[23][24] Entity schemas have their own identifiers, distinct from those used for items, properties, and lexemes: they are prefixed with the letter E, such as E10 for the entity schema of human data instances and E270 for the entity schema of building data instances. This extension has since been installed on Wikidata[25] and enables contributors to use ShEx for validating and describing Resource Description Framework data in items and lexemes. Any item or lexeme on Wikidata can be validated against an Entity Schema, which makes it an important tool for quality assurance.
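The idea of checking an item against expected attributes can be sketched in a few lines. This is a deliberately simplified stand-in, not ShEx and not the content of any actual entity schema. The property and item identifiers (P31, Q5, P19, P569, P570, P20) are not given in the text above and should be verified before use; the sample item and its values are invented.

```python
from typing import Dict, List, Set

# Assumed identifiers: P31 "instance of", Q5 "human", P19 "place of birth",
# P569 "date of birth", P570 "date of death", P20 "place of death".
EXPECTED_FOR_CLASS: Dict[str, Set[str]] = {
    "Q5": {"P19", "P569", "P570", "P20"},
}


def missing_properties(item: Dict[str, List[str]]) -> List[str]:
    """List expected properties that the item lacks, based on its P31 values."""
    present = set(item)
    missing: Set[str] = set()
    for cls in item.get("P31", []):
        missing |= EXPECTED_FOR_CLASS.get(cls, set()) - present
    return sorted(missing)


# Invented item: a human with birth data but no death data recorded.
sample = {
    "P31": ["Q5"],
    "P19": ["some-place"],
    "P569": ["1900-01-01"],
}

gaps = missing_properties(sample)
```

A real entity schema expresses such expectations declaratively in ShEx and is evaluated against the item's RDF representation; the sketch only conveys the notion of comparing an item with a shape.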
Content
Wikidata's content collections include data for biographies,[26] medicine,[27] digital humanities,[28] and scholarly metadata through the WikiCite project.[29]
Centralising interlanguage links – links between Wikipedia articles about the same topic in different languages.
Providing a central place for infobox data for all Wikipedias.
Creating and updating list articles based on data in Wikidata and linking to other Wikimedia sister projects, including Meta-Wiki and Wikidata itself (interwikilinks).
Initial rollout
Wikidata was launched on 29 October 2012 and was the first new project of the Wikimedia Foundation since 2006.[3][34][35] At this time, only the centralization of language links was available. This enabled items to be created and filled with basic information: a label – a name or title, aliases – alternative terms for the label, a description, and links to articles about the topic in all the various language editions of Wikipedia (interwikipedia links).
Historically, a Wikipedia article would include a list of interlanguage links (links to articles on the same topic in other editions of Wikipedia, if they existed). Wikidata was originally a self-contained repository of interlanguage links.[36] At first, Wikipedia language editions were not able to access Wikidata, so they needed to continue to maintain their own lists of interlanguage links.
On 14 January 2013, the Hungarian Wikipedia became the first to enable the provision of interlanguage links via Wikidata.[37] This functionality was extended to the Hebrew and Italian Wikipedias on 30 January, to the English Wikipedia on 13 February and to all other Wikipedias on 6 March.[38][39][40][41] After no consensus was reached over a proposal to restrict the removal of language links from the English Wikipedia,[42] they were automatically removed by bots. On 23 September 2013, interlanguage links went live on Wikimedia Commons.[43]
Statements and data access
On 4 February 2013, statements were introduced to Wikidata entries. The possible values for properties were initially limited to two data types (items and images on Wikimedia Commons), with more data types (such as coordinates and dates) to follow later. The first new type, string, was deployed on 6 March.[44]
The ability for the various language editions of Wikipedia to access data from Wikidata was rolled out progressively between 27 March and 25 April 2013.[45][46] On 16 September 2015, Wikidata began allowing so-called arbitrary access, or access from a given article of a Wikipedia to the statements on Wikidata items not directly connected to it. For example, it became possible to read data about Germany from the Berlin article, which was not feasible before.[47] On 27 April 2016, arbitrary access was activated on Wikimedia Commons.[48]
According to a 2020 study, a large proportion of the data on Wikidata consists of entries imported en masse from other databases by Internet bots, which helps to "break down the walls" of data silos.[49]
Query service and other improvements
On 7 September 2015, the Wikimedia Foundation announced the release of the Wikidata Query Service,[50] which lets users run queries on the data contained in Wikidata.[51] The service uses SPARQL as the query language and Blazegraph as its triplestore and graph database.[53][54] As of November 2018, there were at least 26 different tools that allowed querying the data in different ways.[52]
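As an illustration of what a query against the service might look like, the sketch below builds a request URL for a SPARQL query asking for the colour of milk, reusing the identifiers Q8495 and P462 from the statements example. It assumes the public endpoint at https://query.wikidata.org/sparql and the service's predefined wd: and wdt: prefixes. The function name is hypothetical, and no network request is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# "What is the value of color (P462) for milk (Q8495)?"
QUERY = "SELECT ?color WHERE { wd:Q8495 wdt:P462 ?color . }"


def build_query_url(sparql: str, endpoint: str = WDQS_ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint to run the query and reply in JSON."""
    return endpoint + "?" + urlencode({"query": sparql, "format": "json"})


url = build_query_url(QUERY)

# Round-trip check: decoding the URL recovers the original parameters.
params = parse_qs(urlsplit(url).query)
```

The resulting URL could be fetched with any HTTP client; a client that identifies itself with a descriptive User-Agent header is generally advisable when querying shared public services.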
In 2021, Wikimedia Deutschland released the Query Builder,[55] "a form-based query builder to allow people who don't know how to use SPARQL" to write a query.
Logo
The bars on the logo contain the word "WIKI" encoded in Morse code.[56] It was created by Arun Ganesh and selected through community decision.[57]
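The encoding can be reproduced with a short sketch that uses the standard International Morse code for the three letters involved. How the dots and dashes are rendered as bars in the logo is not described in the text, so the sketch stops at the dot-and-dash form; the function name is hypothetical.

```python
# International Morse code for the letters needed to spell "WIKI".
MORSE = {"W": ".--", "I": "..", "K": "-.-"}


def to_morse(word: str) -> str:
    """Encode a word letter by letter, separating letters with a space."""
    return " ".join(MORSE[ch] for ch in word.upper())


encoded = to_morse("WIKI")  # ".-- .. -.- .."
```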
Reception
In November 2014, Wikidata received the Open Data Publisher Award from the Open Data Institute "for sheer scale, and built-in openness".[58]
In December 2014, Google announced that it would shut down Freebase in favor of Wikidata.[59]
As of November 2018, Wikidata information was used in 58.4% of all English Wikipedia articles, mostly for external identifiers or coordinate locations. In aggregate, data from Wikidata was shown in 64% of pages across all Wikipedias, 93% of all Wikivoyage articles, 34% of pages across all Wikiquotes, 32% of pages across all Wikisources, and 27% of Wikimedia Commons pages.[60]
As of December 2020, Wikidata's data was visualized by at least 20 external tools[61] and over 300 papers had been published about Wikidata.[62]
Samuel, John (15 August 2018). "Experimental IR Meets Multilinguality, Multimodality, and Interaction". Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2018. Lecture Notes in Computer Science. Vol. 11018. p. 129. doi:10.1007/978-3-319-98932-7_12. ISBN 978-3-319-98931-0.
Nielsen, Finn (May 2020). Ionov, Maxim; McCrae, John P.; Chiarcos, Christian; Declerck, Thierry; Bosque-Gil, Julia; Gracia, Jorge (eds.). "Lexemes in Wikidata: 2020 status". Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020). Marseille, France: European Language Resources Association: 82–86. ISBN 979-10-95546-36-8.
Chisholm, Andrew; Radford, Will; Hachey, Ben (April 2017). Lapata, Mirella; Blunsom, Phil; Koller, Alexander (eds.). "Learning to generate one-sentence biographies from Wikidata". Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Valencia, Spain: Association for Computational Linguistics: 633–642.
Pellissier Tanon, Thomas; Vrandečić, Denny; Schaffert, Sebastian; Steiner, Thomas; Pintscher, Lydia (11 April 2016). "From Freebase to Wikidata: The Great Migration". Proceedings of the 25th International Conference on World Wide Web. WWW '16. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee: 1419–1428. doi:10.1145/2872427.2874809. ISBN 978-1-4503-4143-1.
Pintscher, Lydia (27 March 2013). "You can have all the data!". Wikimedia Deutschland. Archived from the original on 29 March 2013. Retrieved 28 March 2013.
Mora-Cantallops, Marçal; Sánchez-Alonso, Salvador; García-Barriocanal, Elena (2 September 2019). "A systematic literature review on Wikidata". Data Technologies and Applications. 53 (3): 250–268. doi:10.1108/DTA-12-2018-0110. S2CID 202036639.