Convert TEI XML to HTML with XQuery and BaseX

My notes from the Vanderbilt University XQuery Working Group

We converted a document from the Text Encoding Initiative’s (TEI) Extensible Markup Language (XML) scheme to HTML with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This guide covers the basics of how to convert a document from TEI XML to HTML while retaining element attributes with XQuery and BaseX.

I’ve created a GitHub repository of sample TEI XML files to convert from TEI XML to HTML. This guide references a GitHub gist of XQuery code and HTML output to illustrate each step of the TEI XML to HTML conversion process.

Step 1: Locate a TEI XML document

Our TEI XML document, sample-tei-xml-document-latin.xml, contains semi-arbitrary markup and six elements to be converted from TEI XML to HTML:

  1. <body>
  2. <hi>
  3. <p>
  4. <s>
  5. <q>
  6. <quote>

The element <text> in our TEI XML document contains the elements listed above:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A section of "de Finibus Bonorum et Malorum" marked up semi-arbitrarily in TEI XML.</title>
      </titleStmt>
      <publicationStmt>
        <p>In the public domain</p>
      </publicationStmt>
      <sourceDesc>
        <p>Cicero, 45 BC</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <langUsage>
        <language ident="la">Latin</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <p n="1">
        <s n="1">Lorem ipsum dolor sit amet, consectetur adipiscing <q>elit</q>.</s>
        <s n="2">Vestibulum nec lorem vitae dui varius <hi rend="italic">gravida</hi>.</s>
        <s n="3">Cras eget tristique <hi rend="bold">eros</hi>, id ultrices eros. Mauris nec turpis elit.</s>
        <s n="4">In tincidunt eget ante quis semper.</s>
        <s n="5">Praesent a mi et nisi ullamcorper feugiat nec non ipsum.</s>
        <s n="6">Ut malesuada finibus lorem nec gravida.</s>
        <s n="7">Praesent lobortis magna sed scelerisque molestie.</s>
        <s n="8">Aenean nec sapien quis quam commodo aliquam et et lorem.</s>
        <s n="9">Quisque commodo blandit neque quis scelerisque.</s>
        <s n="10">Fusce nec ultrices enim.</s>
        <s n="11">Cras dignissim convallis mi.</s>
      </p>
      <quote rend="blockquote">
        <p n="2">
          <s n="12">Sed vitae aliquet tellus.</s>
          <s n="13">Maecenas eget orci nec elit efficitur rutrum ac vitae enim.</s>
          <s n="14">Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Quisque ullamcorper porttitor est a malesuada.</s>
          <s n="15">Nullam scelerisque pulvinar risus a elementum.</s>
          <s n="16">Praesent eleifend ligula odio, non ultricies tortor scelerisque id.</s>
          <s n="17">Nulla augue tortor, venenatis pretium elementum vitae, ullamcorper ac sem.</s>
          <s n="18">Duis at nibh nisl.</s>
          <s n="19">Etiam dictum ligula non rhoncus placerat.</s>
          <s n="20">Pellentesque nibh augue, bibendum sit amet leo ut, ornare sollicitudin est.</s>
          <s n="21">Sed tellus orci, posuere et ultrices ut, fermentum non nisi.</s>
          <s n="22">Integer lobortis justo a leo elementum, ut sagittis lorem suscipit.</s>
          <s n="23">Aliquam semper orci mauris, et hendrerit est vulputate eu.</s>
        </p>
      </quote> 
      <p n="3">
        <s n="24">Sed mollis mi nec suscipit hendrerit.</s>
        <s n="25">Praesent velit orci, facilisis ultricies ultricies ac, varius vitae arcu.</s>
        <s n="26">Cras aliquam posuere turpis a aliquam.</s>
        <s n="27">Nunc id suscipit ex.</s>
        <s n="28">Curabitur posuere tincidunt neque in hendrerit.</s>
        <s n="29">Aenean ac malesuada nulla.</s>
        <s n="30">Ut eleifend porttitor accumsan.</s>
        <s n="31">Phasellus eros risus, imperdiet quis elit eget, ultricies rhoncus ex.</s>
        <s n="32">Quisque purus quam, <quote>luctus sit amet imperdiet id,</quote> cursus eget dolor.</s>
        <s n="33">Fusce fermentum convallis gravida.</s>
        <s n="34">Etiam nulla lorem, vehicula ac augue ullamcorper, convallis tincidunt erat.</s>
        <s n="35">Nam maximus et metus placerat interdum.</s>
      </p>
    </body>
  </text>
</TEI>

Step 2: Convert TEI XML to HTML with XQuery

XQuery functions to convert TEI XML to HTML

Converting TEI XML to HTML while retaining element attributes with XQuery requires a number of XQuery functions, which I would normally place in an XQuery module file. This is a modular approach taken from a Wikibooks.org article with modifications I made to retain element attributes.

We begin by declaring a namespace with the XQuery code declare namespace tei = "http://www.tei-c.org/ns/1.0"; which binds the prefix tei to the TEI namespace so that our case statements and path expressions can match the elements in our TEI XML. Our first XQuery function, local:dispatch(), recurses through TEI XML input to return HTML output through the use of typeswitch and case statements. Each case statement calls a specific XQuery function corresponding to a TEI XML node in the input.

For example, local:dispatch() calls the XQuery function local:body() when it encounters the TEI XML node <body>. If local:dispatch() encounters plain text, it simply returns the text. Likewise, if local:dispatch() encounters a TEI XML node not listed in our case statements, it defaults to the function local:passthru(), which copies the element and its attributes without converting it to HTML and then continues recursing through its child nodes. The local:dispatch() and local:passthru() XQuery functions follow below:

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: XQuery for converting TEI XML to HTML :)
declare function local:dispatch($nodes as node()*) as item()* {
  for $node in $nodes
  return
  typeswitch($node)
  case text() return $node
  case element(tei:s) return local:s($node)
  case element(tei:p) return local:p($node)
  case element(tei:hi) return local:hi($node)
  case element(tei:quote) return local:quote($node)
  case element(tei:q) return local:q($node)
  case element(tei:body) return local:body($node)
  default return local:passthru($node)
};

(: Recurse through child nodes :)
declare function local:passthru($node as node()*) as item()* {
  element {name($node)} {($node/@*, local:dispatch($node/node()))}
};
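
To see the default case in action, here is a minimal sketch with a reduced dispatcher that only knows <s>. The input fragment and its <foreign> element are hypothetical and not part of the sample document; they stand in for any element without a case statement of its own.

```xquery
xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Reduced dispatcher for illustration: only <s> has its own case :)
declare function local:dispatch($nodes as node()*) as item()* {
  for $node in $nodes
  return
  typeswitch($node)
  case text() return $node
  case element(tei:s) return <span>{local:dispatch($node/node())}</span>
  default return local:passthru($node)
};

(: Copy an unlisted element and its attributes, then keep recursing :)
declare function local:passthru($node as node()*) as item()* {
  element {name($node)} {($node/@*, local:dispatch($node/node()))}
};

(: Hypothetical input: <foreign> is not listed in the typeswitch :)
let $sample :=
  <s xmlns="http://www.tei-c.org/ns/1.0">Carpe <foreign xml:lang="la">diem</foreign>.</s>
return local:dispatch($sample)
```

The result should be a <span> that wraps the sentence text and a <foreign> element that still carries its xml:lang attribute, which shows that the recursion continues below an unconverted element.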

Functions to convert TEI XML to HTML specific to each TEI XML node follow:

(: <s> to <span> with attributes :)
declare function local:s($node as element(tei:s)) as element() {
  let $sentence := $node/@n
  return
  <span data-sentence="{$sentence}">{local:dispatch($node/node())}</span>
};

(: <p> to <p> with attributes :)
declare function local:p($node as element(tei:p)) as element() {
  let $paragraph := $node/@n
  return
  <p data-paragraph="{$paragraph}">{local:dispatch($node/node())}</p>
};

(: <hi> to <b>, <i>, or <span> :)
declare function local:hi($node as element(tei:hi)) as element() {
  let $rend := $node/@rend
  return
  if ($rend = 'bold') then
    <b>{local:dispatch($node/node())}</b>
  else if ($rend = 'italic') then
    <i>{local:dispatch($node/node())}</i>
  else
    <span>{local:dispatch($node/node())}</span>
};

(: <quote> to <blockquote> or <q> :)
declare function local:quote($node as element(tei:quote)) as element() {
  let $rend := $node/@rend
  return
  if ($rend = 'blockquote') then
    <blockquote>{local:dispatch($node/node())}</blockquote>
  else
    <q>{local:dispatch($node/node())}</q>
};

(: <q> to <span> with quotation marks :)
declare function local:q($node as element(tei:q)) as element() {
  <span class="quotes">‘{local:dispatch($node/node())}’</span>
};

(: <body> to <div> with id attribute :)
declare function local:body($node as element(tei:body)) as element() {
  <div lang="la" id="tei-document">{local:dispatch($node/node())}</div>
};

Both TEI XML <p> and <s> elements in our document, sample-tei-xml-document-latin.xml, have an n attribute to denote paragraph and sentence numbers respectively. Our XQuery functions local:p() and local:s() preserve these attributes in the TEI XML to HTML conversion by using HTML5 data attributes; id attributes could potentially suffice for HTML4 or XHTML. Since the <p> element functions similarly in both TEI XML and HTML, we output it as an HTML <p> element. Conversely, the <s> element functions differently in TEI XML and HTML (in HTML it marks struck-through text rather than a sentence), so we convert it to an HTML <span> element.
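
As a rough illustration of the id alternative mentioned above, the following sketch is a hypothetical variant of local:s(). The function name local:s-id() and the "s" prefix on the id value are my assumptions; the prefix is there because HTML4 id values must begin with a letter. For brevity it copies the child nodes directly instead of calling local:dispatch().

```xquery
xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical variant of local:s(): an id attribute instead of data-sentence :)
declare function local:s-id($node as element(tei:s)) as element() {
  <span id="s{$node/@n}">{$node/node()}</span>
};

(: Inline test input taken from sentence 12 of the sample document :)
local:s-id(<s xmlns="http://www.tei-c.org/ns/1.0" n="12">Sed vitae aliquet tellus.</s>)
```

This should return <span id="s12">Sed vitae aliquet tellus.</span>. Paragraphs would need a different prefix, since id values have to be unique within an HTML document.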

TEI XML uses the <hi> element with a rend attribute to mark text as bold or italic without implying any specific semantics, whereas HTML5 does so with the elements <b> and <i> respectively. Our XQuery function local:hi() converts each TEI XML <hi> element by evaluating its rend attribute and outputting an HTML <b> element if rend="bold", an HTML <i> element if rend="italic", or an HTML <span> element if the rend attribute is neither. Similarly, our XQuery function local:quote() converts quotations from external sources encoded with the TEI XML <quote> element and evaluates the rend attribute to output the HTML element <blockquote> if rend="blockquote" or <q> otherwise. Furthermore, the XQuery function local:q() converts the TEI XML element <q> to an HTML <span> element with the attribute class="quotes" and adds single quotes to the node’s text. Adding quotes could also be achieved using CSS to style the class .quotes, but that is beyond the scope of this guide.
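
One quick way to check each branch of local:hi() is to call it on small inline fragments. The sketch below is a simplified copy of the function that outputs the text directly rather than calling local:dispatch(); the third fragment, which has no rend attribute, is a made-up input to exercise the fallback.

```xquery
xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Simplified copy of local:hi() without recursion, for testing the branches :)
declare function local:hi($node as element(tei:hi)) as element() {
  let $rend := $node/@rend
  return
  if ($rend = 'bold') then
    <b>{$node/text()}</b>
  else if ($rend = 'italic') then
    <i>{$node/text()}</i>
  else
    <span>{$node/text()}</span>
};

(: Two fragments from the sample document and one hypothetical one without rend :)
let $samples := (
  <hi xmlns="http://www.tei-c.org/ns/1.0" rend="bold">eros</hi>,
  <hi xmlns="http://www.tei-c.org/ns/1.0" rend="italic">gravida</hi>,
  <hi xmlns="http://www.tei-c.org/ns/1.0">lorem</hi>
)
for $hi in $samples
return local:hi($hi)
```

The three results should be <b>eros</b>, <i>gravida</i>, and <span>lorem</span>. A missing rend attribute makes both comparisons false, so the function falls through to the <span> branch.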

In the case of a TEI XML <body> element, our function local:body() outputs an HTML <div> element with the attributes lang="la", since we know our TEI XML document is written in Latin, and id="tei-document". This use of the HTML <div> element allows us to output the content of our TEI XML document within an HTML document that already contains a <body> element, and the id attribute allows us to reference the HTML <div> element as needed. Since there should be only one TEI XML <body> element per TEI XML document, our XQuery function local:body() is our final case statement in our XQuery function local:dispatch() for the sake of efficiency. For this same reason, the most frequent TEI XML element, <s>, is our first case statement.
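
If hardcoding lang="la" feels too rigid, one option is to read the language code from the TEI header instead. The sketch below is an assumption on my part and not what the guide's code does: local:body-lang() is a hypothetical name, it takes the first <language> element's ident attribute, and it outputs the body text directly where the real function would call local:dispatch().

```xquery
xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical variant of local:body(): language code taken from the header :)
declare function local:body-lang($node as element(tei:body)) as element() {
  let $lang := ($node/ancestor::tei:TEI/tei:teiHeader//tei:language/@ident)[1]
  return
  (: the real function would call local:dispatch($node/node()) here :)
  <div lang="{$lang}" id="tei-document">{normalize-space($node)}</div>
};

(: Cut-down stand-in for sample-tei-xml-document-latin.xml :)
let $document :=
  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
      <profileDesc>
        <langUsage><language ident="la">Latin</language></langUsage>
      </profileDesc>
    </teiHeader>
    <text><body><p n="1">Lorem ipsum.</p></body></text>
  </TEI>
return local:body-lang($document//tei:body)
```

For this input the function should return <div lang="la" id="tei-document">Lorem ipsum.</div>. If the header declares no language, the lang attribute comes out empty, so a fallback value may be worth adding.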

XQuery code to convert TEI XML to HTML

The additional XQuery code to convert TEI XML to HTML is fairly straightforward, and I’ve made it verbose for illustrative purposes. First, we fetch our TEI XML document, sample-tei-xml-document-latin.xml, and bind it to the variable $document using map { 'chop': false() } to preserve the whitespace in our mixed-content document. Second, we bind the TEI XML <body> element to the variable $tei. Finally, we bind the result of local:dispatch($tei) to the variable $html and return it to get our HTML. The XQuery code to convert our TEI XML to HTML follows:

xquery version "3.1";

let $document := fetch:xml("https://raw.githubusercontent.com/AdamSteffanick/Sample-TEI-XML-Files/49a10fbac2df9c1cef3e5fd57fb484cd1fd49ce4/sample-tei-xml-document-latin.xml", map { 'chop': false() })

let $tei := $document//tei:body

let $html := local:dispatch($tei)

return $html
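
The tei: prefix in $document//tei:body matters. As a small sketch of why, the query below counts <body> elements in a cut-down inline document (a stand-in for the fetched file) with and without the prefix.

```xquery
xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Cut-down stand-in for the fetched TEI XML document :)
let $document :=
  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <text><body><p n="1">Lorem ipsum.</p></body></text>
  </TEI>
return (
  count($document//body),      (: no prefix: looks for <body> in no namespace :)
  count($document//tei:body)   (: prefix bound to the TEI namespace :)
)
```

The first count is 0 and the second is 1, because every element in the document belongs to the TEI namespace declared on <TEI>.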

Complete XQuery code to convert TEI XML to HTML

The complete XQuery code, including functions, to convert TEI XML to HTML follows in convert-tei-xml-to-html.xquery:

xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: XQuery for converting TEI XML to HTML :)
declare function local:dispatch($nodes as node()*) as item()* {
  for $node in $nodes
  return
  typeswitch($node)
  case text() return $node
  case element(tei:s) return local:s($node)
  case element(tei:p) return local:p($node)
  case element(tei:hi) return local:hi($node)
  case element(tei:quote) return local:quote($node)
  case element(tei:q) return local:q($node)
  case element(tei:body) return local:body($node)
  default return local:passthru($node)
};

(: Recurse through child nodes :)
declare function local:passthru($node as node()*) as item()* {
  element {name($node)} {($node/@*, local:dispatch($node/node()))}
};

(: <s> to <span> with attributes :)
declare function local:s($node as element(tei:s)) as element() {
  let $sentence := $node/@n
  return
  <span data-sentence="{$sentence}">{local:dispatch($node/node())}</span>
};

(: <p> to <p> with attributes :)
declare function local:p($node as element(tei:p)) as element() {
  let $paragraph := $node/@n
  return
  <p data-paragraph="{$paragraph}">{local:dispatch($node/node())}</p>
};

(: <hi> to <b>, <i>, or <span> :)
declare function local:hi($node as element(tei:hi)) as element() {
  let $rend := $node/@rend
  return
  if ($rend = 'bold') then
    <b>{local:dispatch($node/node())}</b>
  else if ($rend = 'italic') then
    <i>{local:dispatch($node/node())}</i>
  else
    <span>{local:dispatch($node/node())}</span>
};

(: <quote> to <blockquote> or <q> :)
declare function local:quote($node as element(tei:quote)) as element() {
  let $rend := $node/@rend
  return
  if ($rend = 'blockquote') then
    <blockquote>{local:dispatch($node/node())}</blockquote>
  else
    <q>{local:dispatch($node/node())}</q>
};

(: <q> to <span> with quotation marks :)
declare function local:q($node as element(tei:q)) as element() {
  <span class="quotes">‘{local:dispatch($node/node())}’</span>
};

(: <body> to <div> with id attribute :)
declare function local:body($node as element(tei:body)) as element() {
  <div lang="la" id="tei-document">{local:dispatch($node/node())}</div>
};

let $document := fetch:xml("https://raw.githubusercontent.com/AdamSteffanick/Sample-TEI-XML-Files/49a10fbac2df9c1cef3e5fd57fb484cd1fd49ce4/sample-tei-xml-document-latin.xml", map { 'chop': false() })

let $tei := $document//tei:body

let $html := local:dispatch($tei)

return $html

Running convert-tei-xml-to-html.xquery with BaseX returns the following result, convert-tei-xml-to-html-output.html, in HTML:

<div lang="la" id="tei-document">
  <p data-paragraph="1">
    <span data-sentence="1">Lorem ipsum dolor sit amet, consectetur adipiscing <span class="quotes">‘elit’</span>.</span>
    <span data-sentence="2">Vestibulum nec lorem vitae dui varius <i>gravida</i>.</span>
    <span data-sentence="3">Cras eget tristique <b>eros</b>, id ultrices eros. Mauris nec turpis elit.</span>
    <span data-sentence="4">In tincidunt eget ante quis semper.</span>
    <span data-sentence="5">Praesent a mi et nisi ullamcorper feugiat nec non ipsum.</span>
    <span data-sentence="6">Ut malesuada finibus lorem nec gravida.</span>
    <span data-sentence="7">Praesent lobortis magna sed scelerisque molestie.</span>
    <span data-sentence="8">Aenean nec sapien quis quam commodo aliquam et et lorem.</span>
    <span data-sentence="9">Quisque commodo blandit neque quis scelerisque.</span>
    <span data-sentence="10">Fusce nec ultrices enim.</span>
    <span data-sentence="11">Cras dignissim convallis mi.</span>
  </p>
  <blockquote>
    <p data-paragraph="2">
      <span data-sentence="12">Sed vitae aliquet tellus.</span>
      <span data-sentence="13">Maecenas eget orci nec elit efficitur rutrum ac vitae enim.</span>
      <span data-sentence="14">Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Quisque ullamcorper porttitor est a malesuada.</span>
      <span data-sentence="15">Nullam scelerisque pulvinar risus a elementum.</span>
      <span data-sentence="16">Praesent eleifend ligula odio, non ultricies tortor scelerisque id.</span>
      <span data-sentence="17">Nulla augue tortor, venenatis pretium elementum vitae, ullamcorper ac sem.</span>
      <span data-sentence="18">Duis at nibh nisl.</span>
      <span data-sentence="19">Etiam dictum ligula non rhoncus placerat.</span>
      <span data-sentence="20">Pellentesque nibh augue, bibendum sit amet leo ut, ornare sollicitudin est.</span>
      <span data-sentence="21">Sed tellus orci, posuere et ultrices ut, fermentum non nisi.</span>
      <span data-sentence="22">Integer lobortis justo a leo elementum, ut sagittis lorem suscipit.</span>
      <span data-sentence="23">Aliquam semper orci mauris, et hendrerit est vulputate eu.</span>
    </p>
  </blockquote> 
  <p data-paragraph="3">
    <span data-sentence="24">Sed mollis mi nec suscipit hendrerit.</span>
    <span data-sentence="25">Praesent velit orci, facilisis ultricies ultricies ac, varius vitae arcu.</span>
    <span data-sentence="26">Cras aliquam posuere turpis a aliquam.</span>
    <span data-sentence="27">Nunc id suscipit ex.</span>
    <span data-sentence="28">Curabitur posuere tincidunt neque in hendrerit.</span>
    <span data-sentence="29">Aenean ac malesuada nulla.</span>
    <span data-sentence="30">Ut eleifend porttitor accumsan.</span>
    <span data-sentence="31">Phasellus eros risus, imperdiet quis elit eget, ultricies rhoncus ex.</span>
    <span data-sentence="32">Quisque purus quam, <q>luctus sit amet imperdiet id,</q> cursus eget dolor.</span>
    <span data-sentence="33">Fusce fermentum convallis gravida.</span>
    <span data-sentence="34">Etiam nulla lorem, vehicula ac augue ullamcorper, convallis tincidunt erat.</span>
    <span data-sentence="35">Nam maximus et metus placerat interdum.</span>
  </p>
</div>
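
The <div> above is a fragment meant to sit inside an HTML document that already has a <body> element. As a minimal sketch of that last step, the query below wraps a fragment in a page and serializes it with the standard serialize() function. The short $html value is a hypothetical stand-in for the result of local:dispatch($tei), and the page title is a placeholder.

```xquery
xquery version "3.1";

(: Hypothetical stand-in for the result of local:dispatch($tei) :)
let $html :=
  <div lang="la" id="tei-document"><p data-paragraph="1">Lorem ipsum.</p></div>

(: Wrap the fragment in a minimal page; the title is a placeholder :)
let $page :=
  <html>
    <head><title>TEI XML to HTML</title></head>
    <body>{$html}</body>
  </html>

(: Serialize with the HTML output method and return the page as a string :)
return serialize($page, map { 'method': 'html' })
```

The exact markup of the serialized string (indentation, any doctype or meta element) depends on the processor and its serialization defaults, so it is worth checking the output in BaseX before relying on it.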

What we learned

Thanks to this session of the Vanderbilt University XQuery Working Group, we can now:

  • convert TEI XML to HTML with XQuery while retaining element attributes

Thank you for reading, and have fun coding.