Retrieve Oxford Dictionaries API Lemma Data as XML with XQuery and BaseX

My notes from the Vanderbilt University XQuery Working Group

In this session we retrieved lemma data from the Oxford Dictionaries application programming interface (API) and returned it as Extensible Markup Language (XML) using XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. Lemma data is information related to items that can be looked up in the dictionary. This tutorial walks through the same steps.

The Oxford Dictionaries API returns JavaScript Object Notation (JSON) responses, and converting them automatically with BaseX yields undesired XML structures in which array members become anonymous <_> elements. Fortunately, we’re able to use XQuery to give those elements meaningful names after converting JSON to XML. My GitHub repository od-api-xquery contains XQuery code for this tutorial.

Retrieve lemma data

The Oxford Dictionaries API has a “Lemmatron” endpoint that we’ll use to retrieve lemma data as XML with XQuery, and the code template od-api.xquery will help us get started. First, we import the XQuery lemmatron library module, od-api-basex.xquery, and bind it to the namespace prefix od-api. Second, we replace myId and myKey with our own Oxford Dictionaries API credentials. Third, we set options, such as $source-lang and $lemmatron-filters, following the “Lemmatron” section of the Oxford Dictionaries API documentation, to retrieve lemma data, such as grammatical features (e.g., gender, number, or tense) and uninflected wordforms (e.g., ‘cats’ → ‘cat’). Finally, we request lemma data by calling $lemmatron(). Sample XQuery code for retrieving lemma data as XML for the word ‘change’ follows:

xquery version "3.1" encoding "UTF-8";

import module namespace od-api="od-api-basex" at "https://raw.githubusercontent.com/AdamSteffanick/od-api-xquery/master/od-api-basex.xquery";

let $id := "myId"
let $key := "myKey"

let $source-lang := "en"
let $lemmatron-filters := ""

let $lemmatron := od-api:lemmatron($source-lang, ?, $lemmatron-filters, $id, $key)

return $lemmatron("change")

In the XQuery code above, return $lemmatron("change") triggers our request to the Oxford Dictionaries API by way of partial function application: $lemmatron is bound to the library module’s od-api:lemmatron() function with every argument fixed except the one marked ?, and calling $lemmatron("change") supplies that argument and sends the request. The library module automatically alters the argument by replacing spaces with underscores, forcing lower-case characters, and encoding reserved characters. For example, $lemmatron("soufflé") sends the word_id parameter souffl%C3%A9 to the Oxford Dictionaries API, and $lemmatron("United Kingdom") sends the word_id parameter united_kingdom. Running our XQuery code above with BaseX returns the following result:

<lemmatron input="change" language="en">
  <metadata>
    <provider>Oxford University Press</provider>
    <date>Fri, 10 Feb 2017 18:00:00 GMT</date>
  </metadata>
  <results>
    <result>
      <id>change</id>
      <language>en</language>
      <word>change</word>
      <lexicalEntries>
        <lexicalEntry>
          <language>en</language>
          <lexicalCategory>Noun</lexicalCategory>
          <text>Change</text>
          <grammaticalFeatures>
            <grammaticalFeature>
              <text>Singular</text>
              <type>Number</type>
            </grammaticalFeature>
          </grammaticalFeatures>
          <inflectionOf>
            <wordform>
              <id>%27change</id>
              <text>'Change</text>
            </wordform>
          </inflectionOf>
        </lexicalEntry>
        <lexicalEntry>
          <language>en</language>
          <lexicalCategory>Noun</lexicalCategory>
          <text>change</text>
          <grammaticalFeatures>
            <grammaticalFeature>
              <text>Singular</text>
              <type>Number</type>
            </grammaticalFeature>
          </grammaticalFeatures>
          <inflectionOf>
            <wordform>
              <id>change</id>
              <text>change</text>
            </wordform>
          </inflectionOf>
        </lexicalEntry>
        <lexicalEntry>
          <language>en</language>
          <lexicalCategory>Verb</lexicalCategory>
          <text>change</text>
          <grammaticalFeatures>
            <grammaticalFeature>
              <text>Present</text>
              <type>Tense</type>
            </grammaticalFeature>
            <grammaticalFeature>
              <text>Transitive</text>
              <type>Subcategorization</type>
            </grammaticalFeature>
            <grammaticalFeature>
              <text>Intransitive</text>
              <type>Subcategorization</type>
            </grammaticalFeature>
          </grammaticalFeatures>
          <inflectionOf>
            <wordform>
              <id>changes</id>
              <text>changes</text>
            </wordform>
          </inflectionOf>
        </lexicalEntry>
      </lexicalEntries>
    </result>
  </results>
</lemmatron>
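
The word normalization and partial function application described before the result above can be sketched as follows. This is only an illustration, not the library module’s actual code: the function names local:word-id and local:describe are hypothetical, the order of the normalization steps is an assumption, and local:describe merely reports what would be requested instead of calling the API.

```xquery
xquery version "3.1" encoding "UTF-8";

(: Hypothetical helper: normalizes a word the way the tutorial describes.
   The real library module may implement these steps differently. :)
declare function local:word-id($word as xs:string) as xs:string {
  encode-for-uri(lower-case(translate($word, " ", "_")))
};

(: Hypothetical stand-in for od-api:lemmatron(): joins the non-empty
   request parts with "/" rather than sending anything. :)
declare function local:describe(
  $source-lang as xs:string,
  $word as xs:string,
  $filters as xs:string
) as xs:string {
  string-join(($source-lang, local:word-id($word), $filters)[. ne ""], "/")
};

(: The ? placeholder leaves the word open, so $lookup is a
   one-argument function with language and filters already fixed. :)
let $lookup := local:describe("en", ?, "")
return (
  $lookup("soufflé"),
  $lookup("United Kingdom")
)
```

Under these assumptions the query returns the two strings en/souffl%C3%A9 and en/united_kingdom, which matches the word_id values described above.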

Setting a filter with the XQuery code let $lemmatron-filters := "lexicalCategory=noun" returns only noun-related data:

<lemmatron input="change" language="en">
  <metadata>
    <provider>Oxford University Press</provider>
    <date>Fri, 10 Feb 2017 18:00:00 GMT</date>
  </metadata>
  <results>
    <result>
      <id>change</id>
      <language>en</language>
      <word>change</word>
      <lexicalEntries>
        <lexicalEntry>
          <language>en</language>
          <lexicalCategory>Noun</lexicalCategory>
          <text>Change</text>
          <grammaticalFeatures>
            <grammaticalFeature>
              <text>Singular</text>
              <type>Number</type>
            </grammaticalFeature>
          </grammaticalFeatures>
          <inflectionOf>
            <wordform>
              <id>%27change</id>
              <text>'Change</text>
            </wordform>
          </inflectionOf>
        </lexicalEntry>
        <lexicalEntry>
          <language>en</language>
          <lexicalCategory>Noun</lexicalCategory>
          <text>change</text>
          <grammaticalFeatures>
            <grammaticalFeature>
              <text>Singular</text>
              <type>Number</type>
            </grammaticalFeature>
          </grammaticalFeatures>
          <inflectionOf>
            <wordform>
              <id>change</id>
              <text>change</text>
            </wordform>
          </inflectionOf>
        </lexicalEntry>
      </lexicalEntries>
    </result>
  </results>
</lemmatron>
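
Once a result like the one above is in hand, ordinary XPath is enough to pull out the uninflected wordforms. The sketch below works on a trimmed copy of the filtered result; the function name local:lemmas and the output element <lemma> are illustrative choices, not part of the library module.

```xquery
xquery version "3.1";

(: Hypothetical helper: one <lemma> per lexical entry and wordform,
   tagged with the entry's lexical category. :)
declare function local:lemmas($result as element(lemmatron)) as element(lemma)* {
  for $entry in $result//lexicalEntry
  for $wordform in $entry/inflectionOf/wordform
  return
    <lemma category="{ $entry/lexicalCategory }">{ $wordform/text/string() }</lemma>
};

(: Trimmed copy of the filtered result shown above. :)
local:lemmas(
  <lemmatron input="change" language="en">
    <results>
      <result>
        <lexicalEntries>
          <lexicalEntry>
            <lexicalCategory>Noun</lexicalCategory>
            <inflectionOf>
              <wordform><id>change</id><text>change</text></wordform>
            </inflectionOf>
          </lexicalEntry>
        </lexicalEntries>
      </result>
    </results>
  </lemmatron>
)
```

Note that $wordform/text selects the child element named text; it is not the text() node test.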

I’ve preserved the original structure of the JSON response as much as possible, and have made minimal alterations beyond those described in the section “Lemmatron library module” below. One change is returning <lemmatron> as the root XML element, which has the attributes input and language. The input attribute is assigned a value matching the ‘word’ sent to the Oxford Dictionaries API, and is equivalent to the API’s word_id parameter. Likewise, the language attribute is assigned a value equivalent to the API’s source_lang parameter. As a result, <lemmatron input="change" language="en"> allows us to create efficient XML queries on cached English lemma data related to the string ‘change’. Another addition is a <date> element within the <metadata> element. Text content within <date> comes from the response header received from the Oxford Dictionaries API. This is useful for deciding whether to update cached lemma data by making a new request for a given word_id to the Oxford Dictionaries API.
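
As a rough sketch of how the root attributes and the <date> element could support a cache, the query below selects a cached result by input and language and checks its age. The function local:is-stale, the in-memory $cache sequence, and the 30-day threshold are all hypothetical; a real cache would more likely live in a BaseX database and compare against current-dateTime().

```xquery
xquery version "3.1";

(: Hypothetical helper: true when a cached <lemmatron> element is
   older than $max-age at the moment $now. :)
declare function local:is-stale(
  $cached as element(lemmatron),
  $max-age as xs:dayTimeDuration,
  $now as xs:dateTime
) as xs:boolean {
  (: <date> holds an HTTP date such as "Fri, 10 Feb 2017 18:00:00 GMT",
     which the standard function parse-ietf-date() can read. :)
  let $fetched := parse-ietf-date($cached/metadata/date)
  return ($now - $fetched) gt $max-age
};

(: Hypothetical cache: a sequence of previously stored results. :)
let $cache := (
  <lemmatron input="change" language="en">
    <metadata>
      <provider>Oxford University Press</provider>
      <date>Fri, 10 Feb 2017 18:00:00 GMT</date>
    </metadata>
  </lemmatron>
)
(: The root attributes select an entry without descending into <results>. :)
let $hit := $cache[@input eq "change"][@language eq "en"]
return
  local:is-stale($hit, xs:dayTimeDuration("P30D"), xs:dateTime("2017-04-01T00:00:00Z"))
```

With the fixed $now used here the cached entry is more than 30 days old, so the query returns true.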

Lemmatron library module

The library module od-api-basex.xquery contains XQuery functions related to lemma data, and can be used with the XQuery code below:

import module namespace od-api="od-api-basex" at "https://raw.githubusercontent.com/AdamSteffanick/od-api-xquery/master/od-api-basex.xquery";

Using this library module helps create clean, verbose XML structures that follow the models found in the “Lemmatron” section of the Oxford Dictionaries API documentation. Consider that lexical entry data have the following JSON model schema:

{
  "lexicalEntries": [
    {
      "grammaticalFeatures": [
        {
          "text": "string",
          "type": "string"
        }
      ],
      "inflectionOf": [
        {
          "id": "string",
          "text": "string"
        }
      ],
      "language": "string",
      "lexicalCategory": "string",
      "text": "string"
    }
  ]
}

Requesting lemma data for the word ‘change’ returns a JSON response including:

{
  "lexicalEntries": [
    {
      "grammaticalFeatures": [
        {
          "text": "Present",
          "type": "Tense"
        },
        {
          "text": "Transitive",
          "type": "Subcategorization"
        },
        {
          "text": "Intransitive",
          "type": "Subcategorization"
        }
      ],
      "inflectionOf": [
        {
          "id": "changes",
          "text": "changes"
        }
      ],
      "language": "en",
      "lexicalCategory": "Verb",
      "text": "change"
    }
  ]
}

By default, converting the response above from JSON to XML with BaseX yields an undesired XML structure:

<lexicalEntries type="array">
  <_ type="object">
    <grammaticalFeatures type="array">
      <_ type="object">
        <text>Present</text>
        <type>Tense</type>
      </_>
      <_ type="object">
        <text>Transitive</text>
        <type>Subcategorization</type>
      </_>
      <_ type="object">
        <text>Intransitive</text>
        <type>Subcategorization</type>
      </_>
    </grammaticalFeatures>
    <inflectionOf type="array">
      <_ type="object">
        <id>changes</id>
        <text>changes</text>
      </_>
    </inflectionOf>
    <language>en</language>
    <lexicalCategory>Verb</lexicalCategory>
    <text>change</text>
  </_>
</lexicalEntries>
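
For reference, a default conversion like the one above can be reproduced with the BaseX JSON Module. The sketch below assumes BaseX, where the json prefix is bound by default and json:parse() uses its “direct” conversion format unless told otherwise. The function name local:inflection-ids is hypothetical, and the JSON string is a shortened version of the response above.

```xquery
xquery version "3.1";

(: Hypothetical helper. BaseX-specific: json:parse() belongs to the
   BaseX JSON Module and is not available in every XQuery processor. :)
declare function local:inflection-ids($json as xs:string) as xs:string* {
  let $xml := json:parse($json)
  (: Arrays become elements with type="array", and each array member
     becomes an anonymous <_> element, so the path has to step through _. :)
  return $xml//lexicalEntries/_/inflectionOf/_/id/string()
};

local:inflection-ids(
  '{"lexicalEntries":[{"inflectionOf":[{"id":"changes","text":"changes"}],"language":"en"}]}'
)
```

Having to write paths such as lexicalEntries/_/inflectionOf/_ is one practical reason the anonymous elements are worth renaming.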

We can use XQuery to assign names to the <_> elements above. In the case of elements with plural names, such as <grammaticalFeatures>, functions within the library module create child elements with singular names, such as <grammaticalFeature>, and I’ve chosen <wordform> as the child element of <inflectionOf>:

<lexicalEntries>
  <lexicalEntry>
    <language>en</language>
    <lexicalCategory>Verb</lexicalCategory>
    <text>change</text>
    <grammaticalFeatures>
      <grammaticalFeature>
        <text>Present</text>
        <type>Tense</type>
      </grammaticalFeature>
      <grammaticalFeature>
        <text>Transitive</text>
        <type>Subcategorization</type>
      </grammaticalFeature>
      <grammaticalFeature>
        <text>Intransitive</text>
        <type>Subcategorization</type>
      </grammaticalFeature>
    </grammaticalFeatures>
    <inflectionOf>
      <wordform>
        <id>changes</id>
        <text>changes</text>
      </wordform>
    </inflectionOf>
  </lexicalEntry>
</lexicalEntries>
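
The renaming step can be approximated with a short recursive function. This is a minimal sketch, not the library module’s implementation: local:singular and local:rename are hypothetical names, and the pluralization rule is a simplification that only covers the element names seen in this tutorial. Unlike the output above, the sketch keeps child elements in their original JSON order.

```xquery
xquery version "3.1";

(: Hypothetical helper: derives a child element name from its parent's name. :)
declare function local:singular($plural as xs:string) as xs:string {
  if ($plural eq "inflectionOf") then "wordform"
  else if (ends-with($plural, "ies")) then replace($plural, "ies$", "y")
  else replace($plural, "s$", "")
};

(: Hypothetical helper: copies a tree, naming each <_> element after its
   parent and dropping attributes such as type="array".
   Assumes every <_> element has a parent element. :)
declare function local:rename($node as node()) as node() {
  typeswitch ($node)
    case element() return
      let $name :=
        if (local-name($node) eq "_")
        then local:singular(local-name($node/..))
        else local-name($node)
      return element { $name } { $node/node() ! local:rename(.) }
    default return $node
};

local:rename(
  <inflectionOf type="array">
    <_ type="object">
      <id>changes</id>
      <text>changes</text>
    </_>
  </inflectionOf>
)
```

Applied to the <inflectionOf> fragment, the function returns an <inflectionOf> element whose only child is a <wordform> element containing the original <id> and <text>.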

Most of the XQuery code in the library module od-api-basex.xquery is not specific to BaseX; however, it does rely on the BaseX HTTP Module, so changes to the library module may be required when using a different XQuery processor.
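
To show where the processor-specific part sits, here is a hedged sketch of building a request for the HTTP Module. It is not taken from the library module: local:request is a hypothetical helper, the base URL is a placeholder rather than the real endpoint, and the header names app_id and app_key are assumptions about how the API expects credentials, so check them against the Oxford Dictionaries API documentation.

```xquery
xquery version "3.1";

declare namespace http = "http://expath.org/ns/http-client";

(: Hypothetical helper: builds an HTTP client request element.
   $base-url is a placeholder, and the app_id / app_key header
   names are assumptions, not confirmed by this tutorial. :)
declare function local:request(
  $base-url as xs:string,
  $source-lang as xs:string,
  $word-id as xs:string,
  $id as xs:string,
  $key as xs:string
) as element(http:request) {
  <http:request method="get"
                href="{ string-join(($base-url, $source-lang, $word-id), '/') }">
    <http:header name="Accept" value="application/json"/>
    <http:header name="app_id" value="{ $id }"/>
    <http:header name="app_key" value="{ $key }"/>
  </http:request>
};

(: In BaseX, passing this element to http:send-request() returns an
   <http:response> element followed by the response body. Here the
   request is only constructed, so no network access is needed. :)
local:request("https://api.example.org/lemmatron", "en", "change", "myId", "myKey")
```

Keeping request construction in one small function like this would confine most processor-specific changes to a single place.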

What we learned

Thanks to this session of the Vanderbilt University XQuery Working Group, we can now retrieve lemma data from the Oxford Dictionaries API as XML with XQuery and BaseX.

Thank you for reading, and have fun coding.