Lexvo.org Detailed Description

Term URIs

Given a term t in a language L, the URI is constructed as follows:

1. The term t is encoded using Unicode, and the NFC normalization procedure is applied to ensure a unique representation. Without normalization, Unicode allows a character such as "à" to be encoded either in precomposed form (U+00E0) or in decomposed form (U+0061 followed by U+0300).
2. The resulting Unicode code point string is encoded in UTF-8 to obtain a sequence of octets.
3. These octet values are converted to an ASCII path segment by applying percent-encoding as per RFC 3986. Octets that are not permitted in a path segment, as well as the "%" character itself, are encoded as character triplets of the form %4D, that is, "%" followed by the octet value written as two upper-case hexadecimal digits.
4. The base address http://lexvo.org/id/term/, followed by the ISO 639-3 code for the language L and the "/" character, is prepended to this path segment to obtain the complete URI.

A term URI generated in this way refers to the term t in language L.
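
The construction steps above can be sketched in Python as follows. This is a minimal illustration, not Lexvo.org's own implementation: the function name term_uri is hypothetical, and the sketch assumes that every character outside the RFC 3986 "unreserved" set (letters, digits, "-", ".", "_", "~") is percent-encoded, which is stricter than RFC 3986 requires for a path segment.

```python
import unicodedata
from urllib.parse import quote

TERM_BASE = "http://lexvo.org/id/term/"


def term_uri(term: str, lang_code: str) -> str:
    """Build a term URI from a term and an ISO 639-3 language code (sketch)."""
    # NFC normalization, so composed and decomposed input agree.
    normalized = unicodedata.normalize("NFC", term)
    # UTF-8 encoding yields the octet sequence.
    octets = normalized.encode("utf-8")
    # Percent-encoding with upper-case hex digits. Assumption: safe=""
    # escapes everything outside the unreserved set, including "/" and "%".
    segment = quote(octets, safe="")
    # Prepend the base address, the language code, and "/".
    return TERM_BASE + lang_code + "/" + segment


# Precomposed and decomposed spellings of the same term give the same URI.
print(term_uri("voil\u00e0", "fra"))   # http://lexvo.org/id/term/fra/voil%C3%A0
print(term_uri("voila\u0300", "fra"))  # http://lexvo.org/id/term/fra/voil%C3%A0
```

The sketch does not check that lang_code is a valid ISO 639-3 code; that would require the ISO 639-3 code table.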

Language URIs

Language URIs consist of the base address http://lexvo.org/id/iso639-3/ followed by a valid three-letter ISO 639-3 language code that is not defined as a special code. A language URI conforming to this specification refers to the language denoted by that code according to the ISO 639-3 standard. Because many systems use two-letter ISO 639-1 codes instead of three-letter ISO 639-3 codes, we also provide equivalent URIs consisting of the base address http://lexvo.org/id/iso639-1/ followed by a two-letter ISO 639-1 code.
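
A rough Python sketch of this rule is given below. The function name language_uri is hypothetical. The sketch assumes that "special code" means the four ISO 639-3 special codes (mis, mul, und, zxx) and that codes are written in lower case. It checks only the shape of the code, since checking validity would require the full ISO 639 code tables.

```python
import re

LANG_BASE_639_3 = "http://lexvo.org/id/iso639-3/"
LANG_BASE_639_1 = "http://lexvo.org/id/iso639-1/"

# Assumption: the special codes excluded by the text are these four.
SPECIAL_CODES = {"mis", "mul", "und", "zxx"}


def language_uri(code: str) -> str:
    """Return a language URI for a 3-letter or 2-letter code (shape check only)."""
    if re.fullmatch(r"[a-z]{3}", code):
        if code in SPECIAL_CODES:
            raise ValueError(f"special code has no language URI: {code}")
        return LANG_BASE_639_3 + code
    if re.fullmatch(r"[a-z]{2}", code):
        return LANG_BASE_639_1 + code
    raise ValueError(f"not a two- or three-letter language code: {code!r}")


print(language_uri("eng"))  # http://lexvo.org/id/iso639-3/eng
print(language_uri("en"))   # http://lexvo.org/id/iso639-1/en
```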

Script URIs

Script URIs consist of the base address http://lexvo.org/id/script/ followed by an ISO 15924 script code other than Zxxx, Zyyy, or Zzzz. A script URI conforming to this specification refers to the script denoted by the code according to the ISO 15924 standard.
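
As a brief illustration, the rule might be coded as below. The function name script_uri is hypothetical, and the sketch checks only the four-letter shape of ISO 15924 codes (one upper-case letter followed by three lower-case letters), not membership in the ISO 15924 code list.

```python
import re

SCRIPT_BASE = "http://lexvo.org/id/script/"

# The three codes that the text excludes.
EXCLUDED_SCRIPT_CODES = {"Zxxx", "Zyyy", "Zzzz"}


def script_uri(code: str) -> str:
    """Return a script URI for a four-letter script code (shape check only)."""
    if not re.fullmatch(r"[A-Z][a-z]{3}", code):
        raise ValueError(f"not a four-letter script code: {code!r}")
    if code in EXCLUDED_SCRIPT_CODES:
        raise ValueError(f"excluded script code: {code}")
    return SCRIPT_BASE + code


print(script_uri("Latn"))  # http://lexvo.org/id/script/Latn
```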

Character URIs

Character URIs consist of the base address http://lexvo.org/id/char/ followed by a Unicode code point in upper-case hexadecimal notation. Code points with fewer than 4 hexadecimal digits are zero-padded to 4 digits; longer ones receive no additional padding. A character URI conforming to this specification refers to the character denoted by the code point according to the Unicode 5.0 standard.
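
The padding rule can be illustrated with a short Python sketch. The function name char_uri is hypothetical, and the sketch does not verify that the code point is assigned in Unicode 5.0.

```python
CHAR_BASE = "http://lexvo.org/id/char/"


def char_uri(char: str) -> str:
    """Return the character URI for a single Unicode code point (sketch)."""
    if len(char) != 1:
        raise ValueError("expected exactly one Unicode code point")
    # "04X" pads with zeros to at least 4 upper-case hex digits and leaves
    # longer values untouched.
    return CHAR_BASE + format(ord(char), "04X")


print(char_uri("A"))           # http://lexvo.org/id/char/0041
print(char_uri("\u00e0"))      # http://lexvo.org/id/char/00E0
print(char_uri("\U00020000"))  # http://lexvo.org/id/char/20000
```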

Geographical URIs

Geographical URIs consist either of the base address http://lexvo.org/id/iso3166-1/ followed by an ISO 3166-1 alpha-2 code, for countries, or of the base address http://lexvo.org/id/un_m49/ followed by a UN M.49 code, for regions that are not countries (i.e., continents and other groupings).
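
The two cases might be sketched in Python as follows. The function names country_uri and region_uri are hypothetical. The sketch assumes that alpha-2 codes appear in upper case and that UN M.49 codes appear as three-digit strings with leading zeros preserved. It checks only the shape of each code and cannot tell whether an M.49 code denotes a country or a grouping.

```python
import re

COUNTRY_BASE = "http://lexvo.org/id/iso3166-1/"
REGION_BASE = "http://lexvo.org/id/un_m49/"


def country_uri(alpha2: str) -> str:
    """Return the URI for a country given its ISO 3166-1 alpha-2 code."""
    # Assumption: the code is used in upper case, as in ISO 3166-1.
    if not re.fullmatch(r"[A-Z]{2}", alpha2):
        raise ValueError(f"not an alpha-2 code: {alpha2!r}")
    return COUNTRY_BASE + alpha2


def region_uri(m49: str) -> str:
    """Return the URI for a non-country region given its UN M.49 code."""
    # Assumption: the code is a three-digit string, leading zeros kept.
    if not re.fullmatch(r"[0-9]{3}", m49):
        raise ValueError(f"not a three-digit M.49 code: {m49!r}")
    return REGION_BASE + m49


print(country_uri("DE"))  # http://lexvo.org/id/iso3166-1/DE
print(region_uri("150"))  # http://lexvo.org/id/un_m49/150
```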

WordNet URIs

WordNet URIs consist of the base address http://lexvo.org/id/wordnet/30/, followed by a part-of-speech indicator ("noun/", "verb/", "adj/", or "adv/") and a sense key. The sense keys resemble WordNet's original sense keys but use the following format: lemma + "_" + lex_filenum + "_" + lex_id [+ "_" + head_word + "_" + head_id], where lemma and head_word are encoded using percent-encoding as per RFC 3986. These URIs identify not the synsets themselves but the denotational meanings associated with the synsets, just as Lexvo.org's language URIs identify not the language codes themselves but the actual languages.
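
The sense-key format can be sketched in Python as below. The function name wordnet_uri and its parameters are hypothetical. The sketch assumes that lex_filenum, lex_id, and head_id are passed as strings and copied unchanged, because the text does not say how these fields are padded. It also assumes that percent-encoding escapes everything outside the RFC 3986 unreserved set.

```python
from typing import Optional
from urllib.parse import quote

WORDNET_BASE = "http://lexvo.org/id/wordnet/30/"
POS_INDICATORS = {"noun", "verb", "adj", "adv"}


def wordnet_uri(pos: str, lemma: str, lex_filenum: str, lex_id: str,
                head_word: Optional[str] = None,
                head_id: Optional[str] = None) -> str:
    """Assemble a WordNet URI from sense-key fields (sketch)."""
    if pos not in POS_INDICATORS:
        raise ValueError(f"unknown part-of-speech indicator: {pos!r}")
    if (head_word is None) != (head_id is None):
        raise ValueError("head_word and head_id must be given together")
    # lemma + "_" + lex_filenum + "_" + lex_id
    parts = [quote(lemma, safe=""), lex_filenum, lex_id]
    # Optional tail: "_" + head_word + "_" + head_id
    if head_word is not None:
        parts += [quote(head_word, safe=""), head_id]
    return WORDNET_BASE + pos + "/" + "_".join(parts)


# Field values here are illustrative only, not taken from WordNet data.
print(wordnet_uri("noun", "dog", "05", "00"))
```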

Kangxi Radical URIs

Kangxi radicals are abstract entities associated with specific semantic components of Chinese characters. Lexvo.org's Kangxi Radical URIs consist of the base address http://lexvo.org/id/kangxi-radical/ followed by a number from 1 to 214, the radical number used in the 1716 Kangxi dictionary.
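
A minimal Python sketch of this rule follows. The function name kangxi_radical_uri is hypothetical, and the sketch assumes the number is written in decimal without leading zeros.

```python
KANGXI_BASE = "http://lexvo.org/id/kangxi-radical/"


def kangxi_radical_uri(number: int) -> str:
    """Return the URI for a Kangxi radical given its number (1 to 214)."""
    if not 1 <= number <= 214:
        raise ValueError(f"radical number out of range: {number}")
    # Assumption: plain decimal, no zero-padding.
    return KANGXI_BASE + str(number)


print(kangxi_radical_uri(1))    # http://lexvo.org/id/kangxi-radical/1
print(kangxi_radical_uri(214))  # http://lexvo.org/id/kangxi-radical/214
```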