Manipulate XML Text Data Using XQuery String Functions

My notes from the Vanderbilt University XQuery Working Group

We evaluated and manipulated text data (i.e., strings) within Extensible Markup Language (XML) using string functions in XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This tutorial covers the basics of how to use XQuery string functions and manipulate text data with BaseX.

We used a limited dataset of English words as text data to evaluate and manipulate, and I’ve created a GitHub gist of XML input and XQuery code for use with this tutorial.

Part 1: Create and concatenate strings with XQuery

Concatenation is a process that links two or more independent strings of text together in order to create one large string of text. We’ll start by creating strings, and a simple way to create a string is with XQuery’s let clause. For example, the XQuery code let $word := "book" declares the variable $word and assigns it the string value book. From here, we can create a second variable with the string value mark and concatenate our two string variables in order to create a new string with the XQuery string function fn:concat():

xquery version "3.1";

let $word1 := "book"
let $word2 := "mark"

let $compound-word := fn:concat($word1, $word2)

return $compound-word

In our XQuery code above, we use a let clause to declare and assign string values to the variables $word1, $word2, and $compound-word. Unlike the values of $word1 and $word2, the value of $compound-word is the result of a function. In this case, $word1 and $word2 are arguments of fn:concat(), which we bind to the variable $compound-word. Running our XQuery code above in BaseX returns the value of $compound-word:

bookmark
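As a side note, fn:concat() is not limited to two arguments, and XQuery 3.0 and later also provide the || operator for concatenation. The following is a minimal sketch rather than part of the working group's code; the extra "s" argument is only there to illustrate a third argument.

```xquery
xquery version "3.1";

let $word1 := "book"
let $word2 := "mark"

return (
  (: two arguments, as in the example above :)
  fn:concat($word1, $word2),
  (: the || operator joins its left and right operands :)
  $word1 || $word2,
  (: fn:concat() accepts two or more arguments :)
  fn:concat($word1, $word2, "s")
)
```

This should evaluate to a sequence of three strings: bookmark, bookmark, and bookmarks.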

Our XQuery code above returns a desired result for compound words like ‘bookmark’ because fn:concat() places the initial character of the second argument, $word2, immediately after the final character of the first argument, $word1. However, changing the values of $word1 and $word2 could return undesired results, as illustrated below:

xquery version "3.1";

let $word1 := "Governor"
let $word2 := "General"

let $compound-word := fn:concat($word1, $word2)

return $compound-word

In this case, running our XQuery code in BaseX returns an undesired result:

GovernorGeneral

For compound words like ‘Governor General’, we need a space between our two concatenated strings. Although we could accomplish this with fn:concat($word1, " ", $word2), the XQuery string function fn:string-join() is a better fit: it takes a sequence of strings as its first argument and a delimiter as its second, and it inserts the delimiter between each pair of adjacent strings. In this case, we’ll declare our delimiter by assigning a space as the value of a new variable with the code let $delimiter := " ". Our XQuery code for concatenating strings with a delimiter follows:

xquery version "3.1";

let $word1 := "Governor"
let $word2 := "General"

let $delimiter := " "

let $compound-word := fn:string-join(($word1, $word2), $delimiter)

return $compound-word

Running our new XQuery code in BaseX returns the value of $compound-word, which is the concatenated values of $word1 and $word2 separated by the value of $delimiter:

Governor General
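Because fn:string-join() takes a sequence as its first argument, the same call works for any number of strings. The sketch below uses illustrative inputs that are not bound to variables in the code above, with a hyphen as the delimiter.

```xquery
xquery version "3.1";

let $delimiter := "-"

return (
  (: three strings, two delimiters inserted: "mother-in-law" :)
  fn:string-join(("mother", "in", "law"), $delimiter),
  (: a single string is returned unchanged: "in" :)
  fn:string-join("in", $delimiter),
  (: an empty sequence yields the zero-length string :)
  fn:string-join((), $delimiter)
)
```

With fn:concat(), each of these cases would need a different number of arguments, which is one reason to prefer fn:string-join() when the number of strings is not known in advance.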

Part 2: Manipulate XML text data with XQuery

To manipulate strings within XML, we first fetch our XML data file from compound-words.xml:

xquery version "3.1";

let $uri := "https://gist.githubusercontent.com/AdamSteffanick/eed120f4b5915dfb73900ea4bfdf3ace/raw/7bc420051ca0bceaa4252c6bef67157235213767/compound-words.xml"

let $compound-words := fn:doc($uri)

return $compound-words

We start by binding the location of our external XML data file to $uri. We then bind the document located at $uri, compound-words.xml, to $compound-words. Finally, we return $compound-words to view our XML text data by running our XQuery code in BaseX:

<compounds>
  <word>bookcase</word>
  <word>classmate</word>
  <word>bookmark</word>
  <word>newspaper</word>
</compounds>
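If the gist is unreachable, the same data can be built inline for experimentation. This is a sketch under the assumption that an inline copy behaves like the fetched file for our purposes; it uses a document node constructor so that the XPath $compound-words/compounds/word/text() used in this tutorial keeps working.

```xquery
xquery version "3.1";

(: Hypothetical inline copy of compound-words.xml for offline use :)
let $compound-words := document {
  <compounds>
    <word>bookcase</word>
    <word>classmate</word>
    <word>bookmark</word>
    <word>newspaper</word>
  </compounds>
}

(: select the text content of each <word> element as a string :)
for $compound-word in $compound-words/compounds/word/text()
return fn:string($compound-word)
```

This should return the four strings bookcase, classmate, bookmark, and newspaper.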

Now that we’ve successfully fetched our external XML text data, we can manipulate it. Our goal is to create a plural form of each compound word in our dataset, and we do so by looping through $compound-words with a for clause and using fn:concat() to concatenate each compound word with the plural suffix, ‘-s’:

xquery version "3.1";

let $uri := "https://gist.githubusercontent.com/AdamSteffanick/eed120f4b5915dfb73900ea4bfdf3ace/raw/7bc420051ca0bceaa4252c6bef67157235213767/compound-words.xml"

let $compound-words := fn:doc($uri)

let $plural-suffix := "s"

return element compounds {
  for $compound-word in $compound-words/compounds/word/text()
  return element word {
    fn:concat($compound-word, $plural-suffix)
  }
}

The return clause in our XQuery code above is complex, containing a for clause and another return clause. This allows us to return the XML element <compounds>, which contains one <word> element for each <word> element in our input. Our for clause binds the variable $compound-word to each text node selected by the XPath $compound-words/compounds/word/text() (i.e., the text content of each <word> element in our input), and then uses fn:concat() with the arguments $compound-word and $plural-suffix to create the text content of each <word> element in our output. Running our XQuery code with BaseX returns the following result:

<compounds>
  <word>bookcases</word>
  <word>classmates</word>
  <word>bookmarks</word>
  <word>newspapers</word>
</compounds>

Using conditional expressions to evaluate XML text data with XQuery

As before, our XQuery code above returns undesired results for compound words like ‘Governor General’, which are found in our external XML data file more-compound-words.xml. To return our desired result, we use a conditional expression and the XQuery string function fn:contains() to determine if the text content of each <word> element, $compound-word, contains a space. If $compound-word does contain a space, then fn:contains($compound-word, $delimiter) returns true. From here, based on our dataset, we declare the variable $compound-head and use fn:substring-before() to assign it the string value of the leftmost word in each compound (i.e., all characters preceding the first $delimiter in $compound-word). Finally, we replace $compound-head with the result of fn:concat($compound-head, $plural-suffix) using fn:replace(). Note that fn:replace() treats its second argument as a regular expression and replaces every match, which is safe for the words in our dataset because each head occurs only once and contains no special characters.

xquery version "3.1";

let $uri := "https://gist.githubusercontent.com/AdamSteffanick/eed120f4b5915dfb73900ea4bfdf3ace/raw/e88af4f31c0264149b5ab2aae0de5e984b2e4bc4/more-compound-words.xml"
let $compound-words := fn:doc($uri)
let $plural-suffix := "s"
let $delimiter := " "

return element compounds {
  for $compound-word in $compound-words/compounds/word/text()
  return element word {
    if (fn:contains($compound-word, $delimiter)) then
      let $compound-head := fn:substring-before($compound-word, $delimiter)
      return fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix))
    else
      fn:concat($compound-word, $plural-suffix)
  }
}

Running our updated XQuery code with BaseX returns the following result:

<compounds>
  <word>bookcases</word>
  <word>classmates</word>
  <word>bookmarks</word>
  <word>newspapers</word>
  <word>Governors General</word>
  <word>Surgeons General</word>
</compounds>
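Since fn:replace() rewrites every match of its pattern, the approach above depends on the head occurring only once in the word. The following sketch uses a hypothetical input, ‘king of kings’, which is not in our dataset, to show the difference between fn:replace() and an alternative that rebuilds the string with fn:substring-after(). The alternative is an illustration only, not the working group's method.

```xquery
xquery version "3.1";

let $plural-suffix := "s"
let $delimiter := " "
(: hypothetical input, not part of the tutorial's dataset :)
let $compound-word := "king of kings"
let $compound-head := fn:substring-before($compound-word, $delimiter)

return (
  (: every match of "king" is replaced: "kings of kingss" :)
  fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix)),
  (: only the head changes: "kings of kings" :)
  fn:concat(
    $compound-head,
    $plural-suffix,
    $delimiter,
    fn:substring-after($compound-word, $delimiter)
  )
)
```

For the words in more-compound-words.xml both expressions give the same result, so the choice only matters for inputs like the one above.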

While some English compound words contain a space, others such as ‘mother-in-law’ contain hyphens. These additional compound words are contained in our external XML data file even-more-compound-words.xml. To accommodate this dataset containing two delimiters, we bind a space and a hyphen to $delimiter as a sequence of strings: let $delimiter := (" ", "-"). In our new sequence, $delimiter[1] is a space and $delimiter[2] is a hyphen.

Next, we add a second condition to our conditional expression. As before, we use fn:contains() to check whether $compound-word contains a delimiter, but this time we check for $delimiter[1] first and then for $delimiter[2]. When $compound-word contains $delimiter[2], we need a further test because we must distinguish compound words like ‘mother-in-law’ from ‘in-law’ before we affix the plural suffix, ‘-s’. To do so, we use the XQuery string function fn:tokenize() together with fn:count(). The XQuery code fn:tokenize($compound-word, $delimiter[2]) splits our string $compound-word into a sequence of strings at each match of $delimiter[2], which fn:tokenize() treats as a regular expression, so the input ‘mother-in-law’ returns a sequence:

("mother", "in", "law")

Creating a sequence of strings allows us to use fn:count(fn:tokenize($compound-word, $delimiter[2])) and return the number of items in our sequence. If the number of items in our sequence is greater than two, we concatenate $compound-head and $plural-suffix as above. If there are only two strings in our sequence, as with ‘in-law’, $compound-word falls through to our else branch, which appends the plural suffix to the end of the word:

xquery version "3.1";

let $uri := "https://gist.githubusercontent.com/AdamSteffanick/eed120f4b5915dfb73900ea4bfdf3ace/raw/1289cd32578393ad9f4ef278959f59e0aebf6f8a/even-more-compound-words.xml"
let $compound-words := fn:doc($uri)
let $plural-suffix := "s"
let $delimiter := (" ", "-")

return element compounds {
  for $compound-word in $compound-words/compounds/word/text()
  return element word {
    if (fn:contains($compound-word, $delimiter[1])) then
      let $compound-head := fn:substring-before($compound-word, $delimiter[1])
      return fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix))
    else if (fn:contains($compound-word, $delimiter[2]) and fn:count(fn:tokenize($compound-word, $delimiter[2])) > 2) then
      let $compound-head := fn:substring-before($compound-word, $delimiter[2])
      return fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix))
    else
      fn:concat($compound-word, $plural-suffix)
  }
}

We now get desired results for each type of compound word in our dataset when we run our new XQuery code with BaseX:

<compounds>
  <word>bookcases</word>
  <word>classmates</word>
  <word>bookmarks</word>
  <word>newspapers</word>
  <word>Governors General</word>
  <word>Surgeons General</word>
  <word>in-laws</word>
  <word>mothers-in-law</word>
  <word>fathers-in-law</word>
</compounds>
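The token counts that drive the hyphen condition can be checked on their own. This short sketch hard-codes two of the hyphenated words from the results above and reports how many tokens fn:tokenize() produces for each.

```xquery
xquery version "3.1";

for $compound-word in ("in-law", "mother-in-law")
(: split at each hyphen :)
let $tokens := fn:tokenize($compound-word, "-")
return fn:concat($compound-word, ": ", fn:count($tokens), " tokens")
```

This should return the strings "in-law: 2 tokens" and "mother-in-law: 3 tokens", which is why only ‘mother-in-law’ satisfies the greater-than-two test.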

We decided to create our own function and clean up our output, so some final revisions to our XQuery code remain. First, we move our pluralization code into a function named local:form-plural-compound(), which accepts the argument $compound-words as node()* and returns node()*. Second, we order our results alphabetically using an order by clause inside our for clause. We use the XQuery string function fn:lower-case() in that clause, order by fn:lower-case($compound-word); otherwise, strings beginning with capital letters would occur before strings beginning with lower-case letters and our results wouldn’t be alphabetical. Third, we add the attribute singular to each <word> element and assign it the value $compound-word to retain our input data. Finally, we bind the document located at $uri to $dataset and pass it as an argument to our new function: local:form-plural-compound($dataset). Our complete XQuery code, manipulate-external-XML-text-data.xquery, follows:

xquery version "3.1";

declare function local:form-plural-compound($compound-words as node()*) as node()* {
  let $plural-suffix := "s"
  let $delimiter := (" ", "-")
  return element compounds {
    for $compound-word in $compound-words/compounds/word/text()
    order by fn:lower-case($compound-word)
    return element word {
      attribute singular {$compound-word},
      if (fn:contains($compound-word, $delimiter[1])) then
        let $compound-head := fn:substring-before($compound-word, $delimiter[1])
        return fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix))
      else if (fn:contains($compound-word, $delimiter[2]) and fn:count(fn:tokenize($compound-word, $delimiter[2])) > 2) then
        let $compound-head := fn:substring-before($compound-word, $delimiter[2])
        return fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix))
      else
        fn:concat($compound-word, $plural-suffix)
    }
  }
};

let $uri := "https://gist.githubusercontent.com/AdamSteffanick/eed120f4b5915dfb73900ea4bfdf3ace/raw/1289cd32578393ad9f4ef278959f59e0aebf6f8a/even-more-compound-words.xml"
let $dataset := fn:doc($uri)

return local:form-plural-compound($dataset)

Running manipulate-external-XML-text-data.xquery with BaseX returns the following XML:

<compounds>
  <word singular="bookcase">bookcases</word>
  <word singular="bookmark">bookmarks</word>
  <word singular="classmate">classmates</word>
  <word singular="father-in-law">fathers-in-law</word>
  <word singular="Governor General">Governors General</word>
  <word singular="in-law">in-laws</word>
  <word singular="mother-in-law">mothers-in-law</word>
  <word singular="newspaper">newspapers</word>
  <word singular="Surgeon General">Surgeons General</word>
</compounds>
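To see the effect of fn:lower-case() on ordering in isolation, the sketch below sorts a small, illustrative subset of our words twice. It names the Unicode codepoint collation explicitly so that the comparison does not depend on the processor's default collation; that explicit collation is an assumption added for this example and does not appear in the code above.

```xquery
xquery version "3.1";

(: illustrative subset of the words used in this tutorial :)
let $words := ("bookcase", "Surgeon General", "in-law", "Governor General")

return (
  (: raw code points: capital letters sort before lower-case letters :)
  fn:string-join(
    (
      for $word in $words
      order by $word
        collation "http://www.w3.org/2005/xpath-functions/collation/codepoint"
      return $word
    ),
    ", "
  ),
  (: lower-cased sort key: the order no longer depends on capitalization :)
  fn:string-join(
    (
      for $word in $words
      order by fn:lower-case($word)
        collation "http://www.w3.org/2005/xpath-functions/collation/codepoint"
      return $word
    ),
    ", "
  )
)
```

The first string should be "Governor General, Surgeon General, bookcase, in-law" and the second "bookcase, Governor General, in-law, Surgeon General". Only the sort key is lower-cased; the returned words keep their original capitalization.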

What we learned

Thanks to this session of the Vanderbilt University XQuery Working Group, we can now:

  • manipulate XML text data with XQuery
  • use conditional expressions (if then else)
  • order text data alphabetically
  • evaluate and manipulate text data with the following XQuery functions: fn:concat(), fn:contains(), fn:count(), fn:lower-case(), fn:replace(), fn:string-join(), fn:substring-before(), and fn:tokenize()

Thank you for reading, and have fun coding.