Microsoft Translator Text to Speech with XQuery and BaseX

This tutorial shows how to convert text to speech with the Microsoft Cognitive Services Translator Text application programming interface (API), using XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor.

Given text input, the Microsoft Translator Text API returns a base64Binary audio data stream in the desired language. With an Azure account, a free subscription allows two million characters of text to speech every month. Note that text to speech requires a Microsoft Translator API access token, which we must first retrieve with XQuery. My GitHub repository ms-translator-api-xquery contains all XQuery code for this tutorial.

Text to speech

We can convert text to speech with XQuery via an HTTP request to the Microsoft Translator Text API’s “Speak” endpoint. For this tutorial, we’ll examine the function textToSpeech() within ms-translator-api-basex.xquery. First, we handle input parameters. Second, we build our HTTP request. Third, we send our HTTP request to the Microsoft Translator Text API. Finally, we return a base64Binary audio data stream. Our text-to-speech XQuery code follows:

declare function ms-translator-api:textToSpeech(
  $accessToken as xs:string,
  $text as xs:string,
  $language as xs:string,
  $format as xs:string?,
  $options as xs:string?
) as xs:base64Binary {
  let $mimeType := (
    if (
      $format = "audio/mp3"
    )
    then (
      "audio/mpeg"
    )
    else (
      "audio/x-wav"
    )
  )
  let $requiredParameters := map {
    "text=": $text,
    "language=": $language
  }
  let $optionalParameters := (
    if (
      fn:empty($format)
      or ($format = "")
    )
    then ()
    else (
      map {
        "format=": $format
      }
    ),
    if (
      fn:empty($options)
      or ($options = "")
    )
    then ()
    else (
      map {
        "options=": $options
      }
    )
  )
  let $buildQueryString := (
    function(
      $key,
      $value
    ) {
      fn:concat(
        $key,
        $value
        => fn:encode-for-uri()
      )
    }
  )
  let $queryString := (
    map:merge((
      $requiredParameters,
      $optionalParameters
    ))
    => map:for-each($buildQueryString)
    => fn:string-join("&amp;")
  )
  let $speakRequest := (
    <http:request method="get">
      <http:header name="Accept" value="{$mimeType}" />
      <http:header name="Authorization" value="{$accessToken}" />
    </http:request>
  )
  let $speakResponse := http:send-request(
    $speakRequest,
    fn:concat(
      "https://api.microsofttranslator.com/v2/http.svc/Speak?",
      $queryString
    )
  )
  let $audioData := $speakResponse[2]
  return $audioData
};

textToSpeech() explained

We begin textToSpeech() by handling API call parameters and building a query string for our HTTP request. Both $format and $options are optional, as denoted by the ‘?’ in xs:string?. A conditional expression checks whether $format has the value audio/mp3. If so, we assign audio/mpeg to $mimeType; if not, we assign the default: audio/x-wav. Next, we declare two collections of key/value pairs (i.e., maps): $requiredParameters and $optionalParameters. Again, we use conditional expressions so that $optionalParameters only includes the parameters that were actually supplied, skipping any that are empty or "". We then declare an inline function, $buildQueryString. This function pipes each $value to fn:encode-for-uri() and uses fn:concat() to join each key with its encoded value, producing one string per key/value pair. Ultimately, we create $queryString by combining our maps with the function map:merge(), piping that combined map to map:for-each() to iterate through it with $buildQueryString, and finally piping the resulting sequence of strings to fn:string-join(), which joins them with ampersands.
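
To see the query-string step in isolation, here is a minimal standalone sketch. The parameter values are hypothetical samples rather than values from the library module, and the map namespace is declared explicitly in case a processor does not predeclare it:

```xquery
xquery version "3.1";

declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Hypothetical sample parameters, for illustration only :)
let $requiredParameters := map {
  "text=": "Hello, world!",
  "language=": "en-us"
}
let $optionalParameters := map {
  "format=": "audio/mp3"
}
(: Join one key with its URI-encoded value :)
let $buildQueryString := function($key, $value) {
  fn:concat($key, fn:encode-for-uri($value))
}
return (
  map:merge(($requiredParameters, $optionalParameters))
  => map:for-each($buildQueryString)
  => fn:string-join("&amp;")
)
```

Because map:for-each() visits entries in an implementation-dependent order, the parameters may appear in any order in the result, which does not matter for a query string. Note that an ampersand must be written as &amp; inside an XQuery string literal.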

The variable $speakRequest is our HTTP request. We use the GET request method, <http:request method="get">, required by the Microsoft Translator Text API’s “Speak” endpoint. The HTTP header Accept has the variable value $mimeType and represents the MIME type of the audio format we’ll return. The Authorization HTTP header represents our access token, which is required to use the Microsoft Translator Text API.

We send our HTTP request using the BaseX function http:send-request() and bind the response to $speakResponse. The first argument of http:send-request() is our HTTP request, $speakRequest, and the second argument is the URI of the “Speak” endpoint concatenated with $queryString. A successful request yields an <http:response> element and a base64Binary audio data stream. We declare the variable $audioData and assign it the value $speakResponse[2] since we don’t need the <http:response> element within $speakResponse. Finally, we return our base64Binary audio data stream: $audioData.
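
The function above takes the second item of the response without inspecting the first. As a hedged sketch of one possible refinement, we could check the status attribute of the <http:response> element before returning the audio. The helper local:extractAudio() and the error name local:speak-failed are hypothetical and not part of ms-translator-api-basex.xquery, and the sketch assumes a successful call reports HTTP status 200:

```xquery
xquery version "3.1";

declare namespace http = "http://expath.org/ns/http-client";
declare namespace err = "http://www.w3.org/2005/xqt-errors";

(: Hypothetical helper, not part of ms-translator-api-basex.xquery :)
declare function local:extractAudio(
  $response as item()*
) as xs:base64Binary {
  (: The first item is the <http:response> element with a status attribute :)
  let $status := xs:integer($response[1]/@status)
  return (
    if ($status = 200)
    then $response[2]
    else fn:error(
      xs:QName("local:speak-failed"),
      fn:concat("Speak request failed with HTTP status ", $status)
    )
  )
};

(: Simulated failed response, for illustration only :)
let $simulatedResponse := <http:response status="401" message="Unauthorized"/>
return
  try {
    local:extractAudio($simulatedResponse)
  } catch * {
    $err:description
  }
```

Here the simulated 401 response triggers the error branch, so the query returns the error description instead of audio data. In real use, we would pass the result of http:send-request() to the helper.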

XQuery Microsoft Translator library module

Import ms-translator-api-basex.xquery

We can call the function ms-translator-api:textToSpeech() by importing the Microsoft Translator library module ms-translator-api-basex.xquery and assigning it the namespace ms-translator-api:

import module namespace ms-translator-api="ms-translator-api-basex" at "https://raw.githubusercontent.com/AdamSteffanick/ms-translator-api-xquery/master/ms-translator-api-basex.xquery";

Call textToSpeech()

Using the library module, we can convert text to speech by calling the function ms-translator-api:textToSpeech(). This function returns a base64Binary audio data stream that we’ll save as a file. In the following XQuery code, we replace myKey with an appropriate Azure subscription key before setting our Microsoft Translator Text API “Speak” endpoint parameters and calling textToSpeech(). The “Speak” endpoint requires an access token, input text, and an input language code. Optional parameters are audio format, audio quality, and gender of the voice.

The $text input parameter is limited to 2,000 characters per API request. Microsoft maintains a list of languages that must be entered as codes for our $language parameter, and language codes are also available from the API’s “GetLanguagesForSpeak” endpoint. For example, we can use the language code en for General American English or we can use a more specific language variant: en-au for Australian English, en-ca for Canadian English, en-gb for British English, en-in for Indian English, or en-us for U.S. English. The default audio format is audio/wav, but we’ll assign $format the value audio/mp3. The API also defaults to low audio quality and a female voice, and we’ll override these by assigning MaxQuality|Male to the variable $options. We pass these parameters to textToSpeech() as arguments and bind its result to the variable $audioData. We then save $audioData to a file using the BaseX function file:write-binary(). In this sample XQuery code, the resulting file, text-to-speech-output.mp3, is written to the current working directory because we didn’t include a path, so you may need to search to find your file.
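
If we would rather not search for the output file, one option is to resolve the file name to an absolute path first and return that path after writing. The following is a minimal sketch, assuming the File module functions file:resolve-path() and file:write-binary() as provided by BaseX. The helper local:saveAudio() and the placeholder file name are hypothetical, and a few placeholder bytes stand in for real audio data:

```xquery
xquery version "3.1";

declare namespace file = "http://expath.org/ns/file";

(: Hypothetical helper: writes the data and returns the absolute path used :)
declare function local:saveAudio(
  $audioData as xs:base64Binary,
  $fileName as xs:string
) as xs:string {
  let $absolutePath := file:resolve-path($fileName)
  return (
    (: file:write-binary() returns an empty sequence, so only the path remains :)
    file:write-binary($absolutePath, $audioData),
    $absolutePath
  )
};

(: Placeholder bytes stand in for real audio data from textToSpeech() :)
local:saveAudio(
  xs:base64Binary("U3VjY2Vzcy4="),
  "text-to-speech-placeholder.bin"
)
```

With real audio, we would pass the $audioData returned by ms-translator-api:textToSpeech() and a file name such as text-to-speech-output.mp3.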

Get the code

English text-to-speech example

xquery version "3.1" encoding "UTF-8";

import module namespace ms-translator-api="ms-translator-api-basex" at "https://raw.githubusercontent.com/AdamSteffanick/ms-translator-api-xquery/master/ms-translator-api-basex.xquery";

(: # API credentials :)
let $azureKey := "myKey"

(: # Speak parameters :)
let $accessToken := ms-translator-api:retrieveAccessToken($azureKey)
let $text := "Success." (: 2,000 character maximum :)
let $language := "en-ca" (: https://www.microsoft.com/en-us/translator/languages.aspx :)
let $format := "audio/mp3" (: "" = audio/wav :)
let $options := "MaxQuality|Male" (: "" = MinSize|Female :)

let $audioData := ms-translator-api:textToSpeech(
  $accessToken,
  $text,
  $language,
  $format,
  $options
)
let $mp3 := file:write-binary(
  "text-to-speech-output.mp3",
  $audioData
)
return $mp3

Japanese text-to-speech example

xquery version "3.1" encoding "UTF-8";

import module namespace ms-translator-api="ms-translator-api-basex" at "https://raw.githubusercontent.com/AdamSteffanick/ms-translator-api-xquery/master/ms-translator-api-basex.xquery";

(: # API credentials :)
let $azureKey := "myKey"

(: # Speak parameters :)
let $accessToken := ms-translator-api:retrieveAccessToken($azureKey)
let $text := "成功でした。" (: 2,000 character maximum :)
let $language := "ja" (: https://www.microsoft.com/en-us/translator/languages.aspx :)
let $format := "audio/mp3" (: "" = audio/wav :)
let $options := "MaxQuality|Male" (: "" = MinSize|Female :)

let $audioData := ms-translator-api:textToSpeech(
  $accessToken,
  $text,
  $language,
  $format,
  $options
)
let $mp3 := file:write-binary(
  "text-to-speech-output.mp3",
  $audioData
)
return $mp3

Notes

The optional parameters, $format and $options, can be assigned the value "" or omitted entirely to use the API’s default values. If omitted, be sure to include empty arguments when calling the function: ms-translator-api:textToSpeech($accessToken, $text, $language, (), ()). The majority of the XQuery code in the library module ms-translator-api-basex.xquery is not specific to BaseX; however, I did use the BaseX Conversion, File, HTTP, and Map modules. Changes to the library module and sample code may be required when using a different XQuery processor.
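
As a short sketch of the defaults in action, the following call passes empty sequences for both optional parameters. The .wav file name is an assumption based on the default audio/wav format described earlier, and myKey must still be replaced with an appropriate Azure subscription key:

```xquery
xquery version "3.1" encoding "UTF-8";

import module namespace ms-translator-api="ms-translator-api-basex" at "https://raw.githubusercontent.com/AdamSteffanick/ms-translator-api-xquery/master/ms-translator-api-basex.xquery";

(: Replace myKey with an appropriate Azure subscription key :)
let $accessToken := ms-translator-api:retrieveAccessToken("myKey")

(: Empty sequences for $format and $options select the API defaults :)
let $audioData := ms-translator-api:textToSpeech(
  $accessToken,
  "Success.",
  "en-us",
  (),
  ()
)
(: The .wav extension assumes the default audio/wav format :)
return file:write-binary(
  "text-to-speech-output.wav",
  $audioData
)
```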

What we learned

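Thanks to this XQuery tutorial, we can now convert text to speech using the Microsoft Cognitive Services Translator Text API with XQuery and BaseX.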

Thank you for reading, and have fun coding.