My notes from the Vanderbilt University XQuery Working Group
We evaluated and manipulated text data (i.e., strings) within Extensible Markup Language (XML) using string functions in XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This tutorial covers the basics of how to use XQuery string functions and manipulate text data with BaseX.
We used a limited dataset of English words as text data to evaluate and manipulate, and I’ve created a GitHub gist of XML input and XQuery code for use with this tutorial.
Part 1: Create and concatenate strings with XQuery
Concatenation links two or more independent strings of text together to create one larger string. We’ll start by creating strings, and a simple way to create a string is with XQuery’s let clause. For example, the XQuery code let $word := "book" declares the variable $word and assigns it the string value book. From here, we can create a second variable with the string value mark and concatenate our two string variables to create a new string with the XQuery string function fn:concat():
xquery version "3.1";
let $word1 := "book"
let $word2 := "mark"
let $compound-word := fn:concat($word1, $word2)
return $compound-word
In our XQuery code above, we use let clauses to declare the variables $word1, $word2, and $compound-word and to bind a value to each. Unlike the values of $word1 and $word2, the value of $compound-word is the result of a function. In this case, $word1 and $word2 are arguments of fn:concat(), and we bind the result to the variable $compound-word. Running our XQuery code above in BaseX returns the value of $compound-word:
bookmark
Our XQuery code above returns a desired result for compound words like ‘bookmark’ because fn:concat() places the initial character of the second argument, $word2, immediately after the final character of the first argument, $word1. However, changing the values of $word1 and $word2 could return undesired results, as illustrated below:
xquery version "3.1";
let $word1 := "Governor"
let $word2 := "General"
let $compound-word := fn:concat($word1, $word2)
return $compound-word
In this case, running our XQuery code in BaseX returns an undesired result:
GovernorGeneral
For compound words like ‘Governor General’, we need a space between our two concatenated strings. Although we could accomplish this with fn:concat($word1, " ", $word2), the XQuery string function fn:string-join() is a more flexible tool for the job because it joins any number of strings with a single separator. To return our desired result, we pass fn:string-join() a sequence of strings and a delimiter argument that separates them. In this case, we’ll declare our delimiter by assigning a space as the value of a new variable with the code let $delimiter := " ". Our XQuery code for concatenating strings with a delimiter follows:
xquery version "3.1";
let $word1 := "Governor"
let $word2 := "General"
let $delimiter := " "
let $compound-word := fn:string-join(($word1, $word2), $delimiter)
return $compound-word
Running our new XQuery code in BaseX returns the value of $compound-word, which holds the concatenated values of $word1 and $word2, separated by the value of $delimiter:
Governor General
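For comparison, here is a minimal sketch that puts the joining techniques from this part side by side. The variable $words and the third example string are illustrative assumptions, not part of the gist, and the || operator is an addition that the session itself did not cover.

```xquery
xquery version "3.1";
(: Sketch only: $words is an illustrative name, not from the gist. :)
let $words := ("Governor", "General")
return (
  (: fn:concat() with a literal space as the middle argument :)
  fn:concat($words[1], " ", $words[2]),
  (: the || string concatenation operator, available since XQuery 3.0 :)
  $words[1] || " " || $words[2],
  (: fn:string-join() places the delimiter between any number of items :)
  fn:string-join(("mother", "in", "law"), "-")
)
```

The first two expressions should both produce Governor General, and the third should produce mother-in-law. fn:string-join() tends to be the more convenient choice once the number of strings is not known in advance.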
Part 2: Manipulate XML text data with XQuery
To manipulate strings within XML, we first fetch our XML data file from compound-words.xml:
xquery version "3.1";
let $uri := "https://gist.githubusercontent.com/AdamSteffanick/eed120f4b5915dfb73900ea4bfdf3ace/raw/7bc420051ca0bceaa4252c6bef67157235213767/compound-words.xml"
let $compound-words := fn:doc($uri)
return $compound-words
We start by binding the location of our external XML data file to $uri. We then bind the document located at $uri, compound-words.xml, to $compound-words. Finally, we return $compound-words to view our XML text data by running our XQuery code in BaseX:
<compounds>
<word>bookcase</word>
<word>classmate</word>
<word>bookmark</word>
<word>newspaper</word>
</compounds>
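If fetching the gist is not possible (for example, when working offline), a document node with the same shape can be built inline. The sketch below is an assumption for experimentation rather than part of the session: it wraps a copy of the data in a computed document constructor so that the same XPath expressions used with fn:doc() still apply.

```xquery
xquery version "3.1";
(: Sketch only: an inline stand-in for compound-words.xml, for offline use. :)
let $compound-words := document {
  <compounds>
    <word>bookcase</word>
    <word>classmate</word>
    <word>bookmark</word>
    <word>newspaper</word>
  </compounds>
}
(: select the text content of every <word> element :)
return $compound-words/compounds/word/text()
```

Because $compound-words is a document node here, just as it is when returned by fn:doc(), paths such as $compound-words/compounds/word/text() work unchanged.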
Now that we’ve successfully fetched our external XML text data, we can manipulate it. Our goal is to create a plural form of each compound word in our dataset, and we do so by looping through $compound-words with a for clause and using fn:concat() to concatenate each compound word with the plural suffix, ‘-s’:
xquery version "3.1";
let $uri := "https://gist.githubusercontent.com/AdamSteffanick/eed120f4b5915dfb73900ea4bfdf3ace/raw/7bc420051ca0bceaa4252c6bef67157235213767/compound-words.xml"
let $compound-words := fn:doc($uri)
let $plural-suffix := "s"
return element compounds {
  for $compound-word in $compound-words/compounds/word/text()
  return element word {
    fn:concat($compound-word, $plural-suffix)
  }
}
The return clause in our XQuery code above is complex, containing a for clause and another return clause. This allows us to return the XML element <compounds>, which contains a <word> element for each <word> element in our input. Our for clause binds the variable $compound-word, in turn, to each text node selected by the XPath $compound-words/compounds/word/text() (i.e., the text content of each <word> element in our input). The inner return clause then uses fn:concat() with the arguments $compound-word and $plural-suffix to create the text content of each <word> element in our output. Running our XQuery code with BaseX returns the following result:
<compounds>
<word>bookcases</word>
<word>classmates</word>
<word>bookmarks</word>
<word>newspapers</word>
</compounds>
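As a variation on the loop above, the for clause can also iterate over the <word> elements themselves and take their string value with fn:string(). The sketch below assumes a shortened inline copy of the data and a hypothetical variable name, $word; it is an alternative formulation, not the code used in the session.

```xquery
xquery version "3.1";
(: Sketch only: shortened inline data stands in for compound-words.xml. :)
let $compound-words := document {
  <compounds>
    <word>bookcase</word>
    <word>classmate</word>
  </compounds>
}
let $plural-suffix := "s"
return element compounds {
  (: iterate over the <word> elements rather than their text nodes :)
  for $word in $compound-words/compounds/word
  (: fn:string() returns the text content of each element :)
  return element word { fn:concat(fn:string($word), $plural-suffix) }
}
```

This should produce a <compounds> element containing bookcases and classmates. Selecting text() nodes, as the tutorial does, and taking the string value of each element give the same strings for simple elements like these.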
Using conditional expressions to evaluate XML text data with XQuery
As before, our XQuery code above returns undesired results for compound words like ‘Governor General’, which are found in our external XML data file more-compound-words.xml. To return our desired result, we use a conditional expression and the XQuery string function fn:contains() to determine if the text content of each <word> element, $compound-word, contains a space. If $compound-word does contain a space, then fn:contains($compound-word, $delimiter) returns true. From here, based on our dataset, we declare the variable $compound-head and use fn:substring-before() to assign it the string value of the leftmost word in each compound (i.e., all characters preceding $delimiter in $compound-word). Finally, we replace $compound-head with the result of fn:concat($compound-head, $plural-suffix) using fn:replace().
xquery version "3.1";
let $uri := "https://gist.githubusercontent.com/AdamSteffanick/eed120f4b5915dfb73900ea4bfdf3ace/raw/e88af4f31c0264149b5ab2aae0de5e984b2e4bc4/more-compound-words.xml"
let $compound-words := fn:doc($uri)
let $plural-suffix := "s"
let $delimiter := " "
return element compounds {
  for $compound-word in $compound-words/compounds/word/text()
  return element word {
    (: compounds written with a space: pluralize the leftmost word :)
    if (fn:contains($compound-word, $delimiter)) then
      let $compound-head := fn:substring-before($compound-word, $delimiter)
      return fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix))
    (: all other words: append the suffix to the whole word :)
    else
      fn:concat($compound-word, $plural-suffix)
  }
}
Running our updated XQuery code with BaseX returns the following result:
<compounds>
<word>bookcases</word>
<word>classmates</word>
<word>bookmarks</word>
<word>newspapers</word>
<word>Governors General</word>
<word>Surgeons General</word>
</compounds>
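One caveat worth keeping in mind: fn:replace() interprets its second argument as a regular expression and replaces every match, not just the first. That is harmless for this dataset, but the sketch below shows how the two approaches can diverge. The input ‘word for word’ is a hypothetical string chosen only because its leftmost word repeats; it is not part of the dataset.

```xquery
xquery version "3.1";
(: Sketch only: "word for word" is a hypothetical input, not part of the dataset. :)
let $plural-suffix := "s"
let $delimiter := " "
let $compound-word := "word for word"
let $compound-head := fn:substring-before($compound-word, $delimiter)
return (
  (: fn:replace() replaces every match of the pattern, so both occurrences change :)
  fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix)),
  (: rebuilding the string from its parts changes only the leftmost word :)
  fn:concat(
    $compound-head, $plural-suffix, $delimiter,
    fn:substring-after($compound-word, $delimiter)
  )
)
```

The first expression should yield words for words and the second words for word. Rebuilding the string with fn:substring-after() also avoids surprises if the leftmost word ever contains characters that are special in regular expressions.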
While some English compound words contain a space, others such as ‘mother-in-law’ contain hyphens. These additional compound words are contained in our external XML data file even-more-compound-words.xml. To accommodate this dataset containing two delimiters, we bind a space and a hyphen to $delimiter as a sequence of strings: let $delimiter := (" ", "-"). In our new sequence, $delimiter[1] is the space and $delimiter[2] is the hyphen. Next, we add a second branch to our conditional expression. As before, we use fn:contains() to check whether $compound-word contains a delimiter, but this time we look first for $delimiter[1] and then for $delimiter[2]. Additionally, we require a further test when $compound-word contains $delimiter[2] because we need to distinguish compound words like ‘mother-in-law’ from ‘in-law’ before we affix the plural suffix, ‘-s’. To do so, we use the XQuery string function fn:tokenize() together with fn:count(). The XQuery code fn:tokenize($compound-word, $delimiter[2]) splits our string $compound-word into a sequence of strings at $delimiter[2] boundaries, so the input ‘mother-in-law’ returns a sequence:
("mother", "in", "law")
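To see the tokenizing step in isolation, the following minimal sketch splits two of the hyphenated words and reports how many parts each one has. The variable $parts and the report format are illustrative assumptions.

```xquery
xquery version "3.1";
let $delimiter := (" ", "-")
for $compound-word in ("mother-in-law", "in-law")
(: split the word at every hyphen; $parts is an illustrative name :)
let $parts := fn:tokenize($compound-word, $delimiter[2])
return fn:concat(
  $compound-word, ": ",
  fn:count($parts), " parts (",
  fn:string-join($parts, ", "), ")"
)
```

This should report three parts for ‘mother-in-law’ (mother, in, law) and two parts for ‘in-law’ (in, law), which is the difference the conditional expression relies on.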
Creating a sequence of strings allows us to use fn:count(fn:tokenize($compound-word, $delimiter[2])) and return the number of items in our sequence. If the number of items in our sequence is greater than two, we concatenate $compound-head and $plural-suffix as above. If there are only two strings in our sequence, $compound-word falls through to our else branch:
xquery version "3.1";
let $uri := "https://gist.githubusercontent.com/AdamSteffanick/eed120f4b5915dfb73900ea4bfdf3ace/raw/1289cd32578393ad9f4ef278959f59e0aebf6f8a/even-more-compound-words.xml"
let $compound-words := fn:doc($uri)
let $plural-suffix := "s"
let $delimiter := (" ", "-")
return element compounds {
  for $compound-word in $compound-words/compounds/word/text()
  return element word {
    (: compounds written with a space: pluralize the leftmost word :)
    if (fn:contains($compound-word, $delimiter[1])) then
      let $compound-head := fn:substring-before($compound-word, $delimiter[1])
      return fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix))
    (: hyphenated compounds with more than two parts: pluralize the leftmost word :)
    else if (fn:contains($compound-word, $delimiter[2]) and fn:count(fn:tokenize($compound-word, $delimiter[2])) > 2) then
      let $compound-head := fn:substring-before($compound-word, $delimiter[2])
      return fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix))
    (: all other words, including two-part words like in-law: append the suffix :)
    else
      fn:concat($compound-word, $plural-suffix)
  }
}
We now get desired results for each type of compound word in our dataset when we run our new XQuery code with BaseX:
<compounds>
<word>bookcases</word>
<word>classmates</word>
<word>bookmarks</word>
<word>newspapers</word>
<word>Governors General</word>
<word>Surgeons General</word>
<word>in-laws</word>
<word>mothers-in-law</word>
<word>fathers-in-law</word>
</compounds>
We decided to create our own function and clean up our output, so some final revisions to our XQuery code remain. First, we move our XQuery code above into a function named local:form-plural-compound(), which accepts the argument $compound-words as node()* (a sequence of zero or more nodes) and returns node()*. Second, we order our results alphabetically by adding an order by clause after our for clause. We use the XQuery string function fn:lower-case() when ordering our data: order by fn:lower-case($compound-word). Otherwise, strings beginning with capital letters would occur before strings beginning with lower-case letters and our results wouldn’t be alphabetical. Third, we add the attribute singular to each <word> element and assign it the value $compound-word to retain our input data. Finally, we bind the document located at $uri to $dataset and pass it as an argument to our new function: local:form-plural-compound($dataset). Our complete XQuery code, manipulate-external-XML-text-data.xquery, follows:
xquery version "3.1";
declare function local:form-plural-compound($compound-words as node()*) as node()* {
  let $plural-suffix := "s"
  let $delimiter := (" ", "-")
  return element compounds {
    for $compound-word in $compound-words/compounds/word/text()
    (: compare lower-cased strings so capitalized words sort alphabetically :)
    order by fn:lower-case($compound-word)
    return element word {
      (: keep the original input as an attribute :)
      attribute singular {$compound-word},
      (: compounds written with a space: pluralize the leftmost word :)
      if (fn:contains($compound-word, $delimiter[1])) then
        let $compound-head := fn:substring-before($compound-word, $delimiter[1])
        return fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix))
      (: hyphenated compounds with more than two parts: pluralize the leftmost word :)
      else if (fn:contains($compound-word, $delimiter[2]) and fn:count(fn:tokenize($compound-word, $delimiter[2])) > 2) then
        let $compound-head := fn:substring-before($compound-word, $delimiter[2])
        return fn:replace($compound-word, $compound-head, fn:concat($compound-head, $plural-suffix))
      (: all other words: append the suffix to the whole word :)
      else
        fn:concat($compound-word, $plural-suffix)
    }
  }
};
let $uri := "https://gist.githubusercontent.com/AdamSteffanick/eed120f4b5915dfb73900ea4bfdf3ace/raw/1289cd32578393ad9f4ef278959f59e0aebf6f8a/even-more-compound-words.xml"
let $dataset := fn:doc($uri)
return local:form-plural-compound($dataset)
Running manipulate-external-XML-text-data.xquery with BaseX returns the following XML:
<compounds>
<word singular="bookcase">bookcases</word>
<word singular="bookmark">bookmarks</word>
<word singular="classmate">classmates</word>
<word singular="father-in-law">fathers-in-law</word>
<word singular="Governor General">Governors General</word>
<word singular="in-law">in-laws</word>
<word singular="mother-in-law">mothers-in-law</word>
<word singular="newspaper">newspapers</word>
<word singular="Surgeon General">Surgeons General</word>
</compounds>
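The same rules can also be tried on individual strings without fetching any XML. The sketch below is an assumption for experimentation: local:pluralize() is a hypothetical helper that is not part of the gist, and it mirrors the branching logic of local:form-plural-compound() for a single word.

```xquery
xquery version "3.1";
(: Sketch only: local:pluralize() is a hypothetical helper, not part of the gist. :)
declare function local:pluralize($word as xs:string) as xs:string {
  let $plural-suffix := "s"
  let $delimiter := (" ", "-")
  return
    (: compounds written with a space: pluralize the leftmost word :)
    if (fn:contains($word, $delimiter[1])) then
      let $head := fn:substring-before($word, $delimiter[1])
      return fn:replace($word, $head, fn:concat($head, $plural-suffix))
    (: hyphenated compounds with more than two parts: pluralize the leftmost word :)
    else if (fn:count(fn:tokenize($word, $delimiter[2])) > 2) then
      let $head := fn:substring-before($word, $delimiter[2])
      return fn:replace($word, $head, fn:concat($head, $plural-suffix))
    (: all other words: append the suffix to the whole word :)
    else
      fn:concat($word, $plural-suffix)
};
for $word in ("bookmark", "Governor General", "in-law", "mother-in-law")
return local:pluralize($word)
```

For these four inputs the helper should return bookmarks, Governors General, in-laws, and mothers-in-law. Working on plain strings like this makes it easier to check each branch before wiring the logic into element constructors.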
What we learned
Thanks to this session of the Vanderbilt University XQuery Working Group, we can now:
- manipulate XML text data with XQuery
- use conditional expressions (if then else)
- order text data alphabetically
- evaluate and manipulate text data with the following XQuery functions: fn:concat(), fn:contains(), fn:count(), fn:lower-case(), fn:replace(), fn:string-join(), fn:substring-before(), and fn:tokenize()
Thank you for reading, and have fun coding.