We retrieved lemma data from the Oxford Dictionaries application programming interface (API) and returned it as Extensible Markup Language (XML) using XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This tutorial illustrates how to retrieve lemma data (information related to items that can be looked up in the dictionary) as XML from the Oxford Dictionaries API with XQuery and BaseX.
The Oxford Dictionaries API returns JavaScript Object Notation (JSON) responses that yield undesired XML structures when converted automatically with BaseX. Fortunately, we’re able to use XQuery to fill in some blanks after converting JSON to XML. My GitHub repository od-api-xquery contains XQuery code for this tutorial.
Retrieve lemma data
The Oxford Dictionaries API has a “Lemmatron” endpoint we’ll use to retrieve lemma data as XML with XQuery, and the code template od-api.xquery will help us get started. First, we import an XQuery lemmatron library module, od-api-basex.xquery, and bind it to the namespace prefix od-api. Second, we replace myId and myKey with our own Oxford Dictionaries API credentials. Third, we set options, such as $source-lang and $lemmatron-filters, following the “Lemmatron” section of the Oxford Dictionaries API documentation, to retrieve lemma data, such as grammatical features (e.g., gender, number, or tense) and uninflected wordforms (e.g., ‘cats’ → ‘cat’). Finally, we request lemma data using $lemmatron(). Sample XQuery code for retrieving lemma data as XML for the word ‘change’ follows:
xquery version "3.1" encoding "UTF-8";
import module namespace od-api="od-api-basex" at "https://raw.githubusercontent.com/AdamSteffanick/od-api-xquery/master/od-api-basex.xquery";
let $id := "myId"
let $key := "myKey"
let $source-lang := "en"
let $lemmatron-filters := ""
let $lemmatron := od-api:lemmatron($source-lang, ?, $lemmatron-filters, $id, $key)
return $lemmatron("change")
In the XQuery code above, we use return $lemmatron("change") to trigger our request to the Oxford Dictionaries API. The library module automatically alters the argument of $lemmatron() by replacing spaces with underscores, forcing lower-case characters, and encoding reserved characters. For example, returning $lemmatron("soufflé") sends the word_id parameter souffl%C3%A9 to the Oxford Dictionaries API. Similarly, returning $lemmatron("United Kingdom") sends the word_id parameter united_kingdom.
The variable $lemmatron holds a partial function application. The ? in od-api:lemmatron($source-lang, ?, $lemmatron-filters, $id, $key) is an argument placeholder, so $lemmatron is a one-argument function with the other four arguments already fixed. Returning $lemmatron("change") supplies "change" for the placeholder and calls the library module’s od-api:lemmatron() function, which sends the actual request. Running our XQuery code above with BaseX returns the following result:
<lemmatron input="change" language="en">
  <metadata>
    <provider>Oxford University Press</provider>
    <date>Fri, 10 Feb 2017 18:00:00 GMT</date>
  </metadata>
  <results>
    <result>
      <id>change</id>
      <language>en</language>
      <word>change</word>
      <lexicalEntries>
        <lexicalEntry>
          <language>en</language>
          <lexicalCategory>Noun</lexicalCategory>
          <text>Change</text>
          <grammaticalFeatures>
            <grammaticalFeature>
              <text>Singular</text>
              <type>Number</type>
            </grammaticalFeature>
          </grammaticalFeatures>
          <inflectionOf>
            <wordform>
              <id>%27change</id>
              <text>'Change</text>
            </wordform>
          </inflectionOf>
        </lexicalEntry>
        <lexicalEntry>
          <language>en</language>
          <lexicalCategory>Noun</lexicalCategory>
          <text>change</text>
          <grammaticalFeatures>
            <grammaticalFeature>
              <text>Singular</text>
              <type>Number</type>
            </grammaticalFeature>
          </grammaticalFeatures>
          <inflectionOf>
            <wordform>
              <id>change</id>
              <text>change</text>
            </wordform>
          </inflectionOf>
        </lexicalEntry>
        <lexicalEntry>
          <language>en</language>
          <lexicalCategory>Verb</lexicalCategory>
          <text>change</text>
          <grammaticalFeatures>
            <grammaticalFeature>
              <text>Present</text>
              <type>Tense</type>
            </grammaticalFeature>
            <grammaticalFeature>
              <text>Transitive</text>
              <type>Subcategorization</type>
            </grammaticalFeature>
            <grammaticalFeature>
              <text>Intransitive</text>
              <type>Subcategorization</type>
            </grammaticalFeature>
          </grammaticalFeatures>
          <inflectionOf>
            <wordform>
              <id>changes</id>
              <text>changes</text>
            </wordform>
          </inflectionOf>
        </lexicalEntry>
      </lexicalEntries>
    </result>
  </results>
</lemmatron>
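The two behaviours described above, word_id normalization and partial function application, can be tried out without contacting the API. The sketch below is only an illustration: local:describe is a hypothetical stand-in with the same five-argument shape as od-api:lemmatron(), the <request> element it returns is invented for this example, and the normalization is an assumption assembled from standard XQuery functions (translate, lower-case, encode-for-uri) rather than the library module's actual code.

```xquery
xquery version "3.1" encoding "UTF-8";

(: Hypothetical stand-in for od-api:lemmatron(): it takes the same five
   arguments but only reports what it would look up; no request is sent. :)
declare function local:describe(
  $source-lang as xs:string,
  $word as xs:string,
  $filters as xs:string,
  $id as xs:string,
  $key as xs:string
) as element(request) {
  (: Assumed normalization: spaces to underscores, lower case, then
     percent-encoding of reserved characters. :)
  let $word-id := encode-for-uri(lower-case(translate($word, " ", "_")))
  return <request language="{$source-lang}" word_id="{$word-id}"/>
};

(: The ? placeholder leaves the second argument open, so $describe is a
   one-argument function with the other four arguments fixed. :)
let $describe := local:describe("en", ?, "", "myId", "myKey")
return ($describe("soufflé"), $describe("United Kingdom"))
```

Under these assumptions, the query should return two <request> elements whose word_id attributes are souffl%C3%A9 and united_kingdom, in line with the examples above.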
Setting a filter with the XQuery code let $lemmatron-filters := "lexicalCategory=noun" returns only noun-related data:
<lemmatron input="change" language="en">
  <metadata>
    <provider>Oxford University Press</provider>
    <date>Fri, 10 Feb 2017 18:00:00 GMT</date>
  </metadata>
  <results>
    <result>
      <id>change</id>
      <language>en</language>
      <word>change</word>
      <lexicalEntries>
        <lexicalEntry>
          <language>en</language>
          <lexicalCategory>Noun</lexicalCategory>
          <text>Change</text>
          <grammaticalFeatures>
            <grammaticalFeature>
              <text>Singular</text>
              <type>Number</type>
            </grammaticalFeature>
          </grammaticalFeatures>
          <inflectionOf>
            <wordform>
              <id>%27change</id>
              <text>'Change</text>
            </wordform>
          </inflectionOf>
        </lexicalEntry>
        <lexicalEntry>
          <language>en</language>
          <lexicalCategory>Noun</lexicalCategory>
          <text>change</text>
          <grammaticalFeatures>
            <grammaticalFeature>
              <text>Singular</text>
              <type>Number</type>
            </grammaticalFeature>
          </grammaticalFeatures>
          <inflectionOf>
            <wordform>
              <id>change</id>
              <text>change</text>
            </wordform>
          </inflectionOf>
        </lexicalEntry>
      </lexicalEntries>
    </result>
  </results>
</lemmatron>
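Once lemma data is available as XML, ordinary path expressions can pull out the pieces of interest. The sketch below works on a trimmed, hand-copied excerpt of the unfiltered result for ‘change’ rather than on a live response, and the <lemma> and <feature> elements it builds are a hypothetical output shape chosen only for illustration.

```xquery
xquery version "3.1";

(: Trimmed excerpt of the unfiltered result shown earlier; not a live response. :)
let $result :=
  <lemmatron input="change" language="en">
    <results>
      <result>
        <lexicalEntries>
          <lexicalEntry>
            <lexicalCategory>Noun</lexicalCategory>
            <grammaticalFeatures>
              <grammaticalFeature>
                <text>Singular</text>
                <type>Number</type>
              </grammaticalFeature>
            </grammaticalFeatures>
            <inflectionOf>
              <wordform>
                <id>change</id>
                <text>change</text>
              </wordform>
            </inflectionOf>
          </lexicalEntry>
          <lexicalEntry>
            <lexicalCategory>Verb</lexicalCategory>
            <grammaticalFeatures>
              <grammaticalFeature>
                <text>Present</text>
                <type>Tense</type>
              </grammaticalFeature>
            </grammaticalFeatures>
            <inflectionOf>
              <wordform>
                <id>changes</id>
                <text>changes</text>
              </wordform>
            </inflectionOf>
          </lexicalEntry>
        </lexicalEntries>
      </result>
    </results>
  </lemmatron>
(: One summary element per lexical entry: category, wordform id, features. :)
for $entry in $result//lexicalEntry
return
  <lemma category="{$entry/lexicalCategory}"
         wordform="{$entry/inflectionOf/wordform/id}">{
    for $feature in $entry/grammaticalFeatures/grammaticalFeature
    return <feature type="{$feature/type}">{ string($feature/text) }</feature>
  }</lemma>
```

Because the library module gives every array item a real element name, paths such as inflectionOf/wordform/id stay readable and need no positional tricks.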
I’ve preserved the original structure of the JSON response as much as possible and have made minimal alterations beyond those described in the section “Lemmatron library module” below. One change is returning <lemmatron> as the root XML element, which has the attributes input and language. The input attribute is assigned a value matching the ‘word’ sent to the Oxford Dictionaries API, and is equivalent to the API’s word_id parameter. Likewise, the language attribute is assigned a value equivalent to the API’s source_lang parameter. As a result, <lemmatron input="change" language="en"> allows us to create efficient XML queries on cached English lemma data related to the string ‘change’. Another addition is a <date> element within the <metadata> element. The text content of <date> comes from the response header received from the Oxford Dictionaries API. This is useful for deciding whether to update cached lemma data by making a new request for a given word_id to the Oxford Dictionaries API.
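To make the caching idea concrete, here is a minimal sketch of how the input, language, and <date> values might be used together. Several parts are assumptions made for illustration: the <cache> wrapper element, the 30-day freshness threshold, and the <status> element are hypothetical, and the cached entry is a shortened copy of the result above. The function parse-ietf-date is a standard XQuery 3.1 function that reads HTTP-style dates.

```xquery
xquery version "3.1";

(: Hypothetical cache holding one shortened <lemmatron> result. :)
let $cache :=
  <cache>
    <lemmatron input="change" language="en">
      <metadata>
        <provider>Oxford University Press</provider>
        <date>Fri, 10 Feb 2017 18:00:00 GMT</date>
      </metadata>
    </lemmatron>
  </cache>
(: Assumed threshold: entries older than 30 days count as stale. :)
let $max-age := xs:dayTimeDuration("P30D")
(: The root attributes let us select a cached entry directly. :)
for $entry in $cache/lemmatron[@input = "change"][@language = "en"]
let $retrieved := parse-ietf-date($entry/metadata/date)
let $age := current-dateTime() - $retrieved
return
  <status input="{$entry/@input}"
          retrieved="{$retrieved}"
          stale="{$age gt $max-age}"/>
```

A stale entry would be a candidate for a fresh $lemmatron() request; what counts as stale is a policy decision rather than anything prescribed by the API.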
Lemmatron library module
The library module od-api-basex.xquery contains XQuery functions related to lemma data, and can be used with the XQuery code below:
import module namespace od-api="od-api-basex" at "https://raw.githubusercontent.com/AdamSteffanick/od-api-xquery/master/od-api-basex.xquery";
Using this library module helps create clean, verbose XML structures that follow the models found in the “Lemmatron” section of the Oxford Dictionaries API documentation. Consider that lexical entry data have the following JSON model schema:
{
  "lexicalEntries": [
    {
      "grammaticalFeatures": [
        {
          "text": "string",
          "type": "string"
        }
      ],
      "inflectionOf": [
        {
          "id": "string",
          "text": "string"
        }
      ],
      "language": "string",
      "lexicalCategory": "string",
      "text": "string"
    }
  ]
}
Requesting lemma data for the word ‘change’ returns a JSON response including:
{
  "lexicalEntries": [
    {
      "grammaticalFeatures": [
        {
          "text": "Present",
          "type": "Tense"
        },
        {
          "text": "Transitive",
          "type": "Subcategorization"
        },
        {
          "text": "Intransitive",
          "type": "Subcategorization"
        }
      ],
      "inflectionOf": [
        {
          "id": "changes",
          "text": "changes"
        }
      ],
      "language": "en",
      "lexicalCategory": "Verb",
      "text": "change"
    }
  ]
}
By default, converting the response above from JSON to XML with BaseX yields an undesired XML structure:
<lexicalEntries type="array">
  <_ type="object">
    <grammaticalFeatures type="array">
      <_ type="object">
        <text>Present</text>
        <type>Tense</type>
      </_>
      <_ type="object">
        <text>Transitive</text>
        <type>Subcategorization</type>
      </_>
      <_ type="object">
        <text>Intransitive</text>
        <type>Subcategorization</type>
      </_>
    </grammaticalFeatures>
    <inflectionOf type="array">
      <_ type="object">
        <id>changes</id>
        <text>changes</text>
      </_>
    </inflectionOf>
    <language>en</language>
    <lexicalCategory>Verb</lexicalCategory>
    <text>change</text>
  </_>
</lexicalEntries>
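The conversion above can be reproduced locally. The sketch below assumes BaseX's JSON Module, whose json:parse function uses the "direct" format by default and wraps the converted data in a <json> root element; the JSON string is a shortened copy of the response above, and this is not necessarily how the library module performs the conversion.

```xquery
xquery version "3.1";

(: Shortened copy of the JSON response shown above. :)
let $json := '{
  "lexicalEntries": [
    {
      "inflectionOf": [ { "id": "changes", "text": "changes" } ],
      "lexicalCategory": "Verb",
      "text": "change"
    }
  ]
}'
(: json:parse() is BaseX-specific; it returns a document node
   whose root element is <json>. :)
return json:parse($json)/json/lexicalEntries
```

Array items arrive as anonymous <_> elements because JSON array members have no names of their own.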
We can use XQuery to assign names to the <_> elements above. In the case of elements with plural names, such as <grammaticalFeatures>, functions within the library module create child elements with singular names, such as <grammaticalFeature>, and I’ve chosen <wordform> as the child element of <inflectionOf>:
<lexicalEntries>
  <lexicalEntry>
    <language>en</language>
    <lexicalCategory>Verb</lexicalCategory>
    <text>change</text>
    <grammaticalFeatures>
      <grammaticalFeature>
        <text>Present</text>
        <type>Tense</type>
      </grammaticalFeature>
      <grammaticalFeature>
        <text>Transitive</text>
        <type>Subcategorization</type>
      </grammaticalFeature>
      <grammaticalFeature>
        <text>Intransitive</text>
        <type>Subcategorization</type>
      </grammaticalFeature>
    </grammaticalFeatures>
    <inflectionOf>
      <wordform>
        <id>changes</id>
        <text>changes</text>
      </wordform>
    </inflectionOf>
  </lexicalEntry>
</lexicalEntries>
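The renaming step can be sketched with a short recursive function. This is an illustration under stated assumptions, not the library module's code: the $child-names map and the function local:name-items are hypothetical names, and the input is a shortened copy of the automatically converted XML. Each <_> element takes its new name from its parent's name, and the type attributes are dropped because only child nodes are copied.

```xquery
xquery version "3.1";

(: Hypothetical lookup from a parent element name to the name its
   anonymous <_> children should receive. :)
declare variable $child-names := map {
  "lexicalEntries": "lexicalEntry",
  "grammaticalFeatures": "grammaticalFeature",
  "inflectionOf": "wordform"
};

(: Rebuild the tree, renaming each <_> after its parent and copying
   child nodes only, so type="array" and type="object" disappear. :)
declare function local:name-items($node as node()) as node() {
  typeswitch ($node)
    case element() return
      let $name :=
        if (local-name($node) eq "_")
        then $child-names(local-name($node/..))
        else local-name($node)
      return element { $name } { $node/node() ! local:name-items(.) }
    default return $node
};

(: Shortened copy of the automatically converted XML. :)
let $converted :=
  <lexicalEntries type="array">
    <_ type="object">
      <inflectionOf type="array">
        <_ type="object">
          <id>changes</id>
          <text>changes</text>
        </_>
      </inflectionOf>
      <lexicalCategory>Verb</lexicalCategory>
      <text>change</text>
    </_>
  </lexicalEntries>
return local:name-items($converted)
```

This sketch keeps child elements in their original order, whereas the library module's output above lists <language>, <lexicalCategory>, and <text> first. A parent name missing from the map would raise an error here, so a fuller version would need a fallback name.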
The majority of the XQuery code in the library module od-api-basex.xquery is not specific to BaseX; however, I did use the BaseX HTTP Module, so changes to the library module may be required when using a different XQuery processor.
What we learned
Thanks to this session of the Vanderbilt University XQuery Working Group, we can now:
- retrieve lemma data as XML with XQuery
- use an XQuery Lemmatron library module
Thank you for reading, and have fun coding.