Common Text-to-Speech API (draft, 17.5.2006)


Common TTS Application Interface

Copyright © 2006 Brailcom, o.p.s. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code of this document must retain the above copyright notice and this list of conditions.
2. Redistributions in binary form and in printed form must reproduce the above copyright notice and this list of conditions in the documentation and/or other materials provided with the distribution.
3. The names of the authors may not be used to endorse or promote products derived from this document without specific prior written permission.



1 Introduction

The purpose of this document is to define a common low-level interface for access to the various speech synthesizers on Free Software and Open Source platforms. It is designed to be used by applications that do not need advanced functionality like message management (such as txt2wave) and by applications providing high-level interfaces (such as SpeechDispatcher, GnomeSpeech, KTTS, etc.). The purpose of this document is not to define and force an API on the speech synthesizers. The synthesizers might use different interfaces that will be handled by their drivers.

This interface will be implemented by a simple layer integrating available speech synthesis drivers and in some cases emulating some of the functionality missing in the synthesizers themselves.

Advanced capabilities not directly related to speech, like message management, prioritization, synchronization etc. are left out of scope for this low-level interface. They will be dealt with by higher-level interfaces. Such a high-level interface (not necessarily limited to speech) will make good use of the already existing low-level interface.

It is desirable that simple applications can use this API in a simple way. However, the API must also be rich enough that it does not limit more advanced applications in their use of the synthesizers.

Requirements on this interface have been gathered from various accessibility projects, most notably KDE, GNOME, Emacspeak, Speakup and Free-b-Soft. They are summarized in Appendix A and Appendix B of this document. Appendix A deals with general requirements and required functionality, while Appendix B describes the extended SSML subset in use and thus also defines required parameter settings. The interface definition contained in chapter 2 was composed based on these requirements.

Temporary Note: A goal is a real implementation of this interface in the near future. The next step will be merging the available engine drivers in the various accessibility projects under this interface and using this interface. For this reason, we need all accessibility projects who want to participate in this common effort to make sure all their requirements on a low-level speech output interface are met and that such an interface is defined so that it is suitable for their needs.

Temporary Note: Any comments about this draft are welcome and useful. But since the goal of these requirements is a real implementation, we need to avoid endless discussions and keep the comments focused and to the point.



2 Interface Description

This section defines the low-level TTS interface for use by all assistive technologies on free software platforms.



2.1 General Points



2.2 Speech Synthesis Driver Discovery

This section deals with the discovery of the synthesis drivers available behind this interface. It also covers discovery of the capabilities and voices provided by the drivers.

— Variable Type: driver_capabilities_t

driver_capabilities_t is a structure data type intended for carrying information about driver capabilities.

     
     typedef struct {
         /* Voice discovery */ 
         bool_t can_list_voices;
         bool_t can_set_voice_by_properties;
         bool_t can_get_current_voice;
         
         /* Prosody parameters */
         bool_t can_set_rate_relative;
         bool_t can_set_rate_absolute;
         bool_t can_get_rate_default;
     
         bool_t can_set_pitch_relative;
         bool_t can_set_pitch_absolute;
         bool_t can_get_pitch_default;
     
         bool_t can_set_pitch_range_relative;
         bool_t can_set_pitch_range_absolute;
         bool_t can_get_pitch_range_default;
     
         bool_t can_set_volume_relative;
         bool_t can_set_volume_absolute;
         bool_t can_get_volume_default;
     
         /* Style parameters */
         bool_t can_set_punctuation_mode_all;
         bool_t can_set_punctuation_mode_none;
         bool_t can_set_punctuation_mode_some;
         bool_t can_set_punctuation_detail;
     
         bool_t can_set_capital_letters_mode_spelling;
         bool_t can_set_capital_letters_mode_icon;
         bool_t can_set_capital_letters_mode_pitch;
     
         bool_t can_set_number_grouping;
     
         /* Synthesis */
         bool_t can_say_text_from_position;
         bool_t can_say_char;
         bool_t can_say_key;
         bool_t can_say_icon;
     
         /* Dictionaries */
         bool_t can_set_dictionary;
     
         /* Audio playback/retrieval */
         bool_t can_retrieve_audio;
         bool_t can_play_audio;
         
         /* Events and index marking */
         bool_t can_report_events_by_sentences;
         bool_t can_report_events_by_words;
         bool_t can_report_custom_index_marks;
     
         /* Performance guidelines */
         int honors_performance_guidelines;
     
         /* Deferring messages */
         bool_t can_defer_message;
     
         /* SSML Support */
         bool_t can_parse_ssml;
     
         /* Multilingual utterances */
         bool_t supports_multilingual_utterances;
     } driver_capabilities_t;

can_set_rate_*, can_set_pitch_*, can_set_pitch_range_* and can_set_volume_* variables indicate whether the corresponding prosody parameter setting commands are supported. See (Prosody Parameters).

can_set_punctuation_mode_* variables indicate which parameters are supported for set_punctuation_mode(). See (set_punctuation_mode()).

can_set_punctuation_detail indicates whether the function set_punctuation_detail() is supported. See (set_punctuation_detail()).

can_set_capital_letters_mode_* variables indicate which parameters are supported for set_capital_letters_mode(). See (set_capital_letters_mode()).

can_set_number_grouping indicates whether the function set_number_grouping is supported. See (set_number_grouping()).

can_say_text_from_position indicates whether the capability to start synthesis at a given position in the text is supported, as described in (say_text()).

Other can_say_* variables indicate whether the corresponding say_ synthesis command is supported. See (Speech Synthesis Commands).

can_set_dictionary indicates whether the function set_dictionary() is supported.

can_play_audio and can_retrieve_audio variables indicate whether the corresponding audio output methods are allowed for set_audio_output. See (Audio Retrieval).

can_report_* variables indicate which kind of audio events and index marks are supported. See (Event Callbacks).

The honors_performance_guidelines variable is 0 if the performance guidelines are not honored, 1 if they are honored on the (SHOULD HAVE) level and 2 if they are honored on the (NICE TO HAVE) level.

can_defer_message indicates whether the defer capability is supported. If this variable is true, defer() and say_deferred() must be supported. It is expected the synthesizer will be able to defer multiple messages at the same time. See (defer()), (say_deferred()).

can_parse_ssml indicates whether the synthesizer is able to parse SSML. It doesn't indicate which SSML elements and attributes are supported.

supports_multilingual_utterances indicates whether the synthesizer supports multilingual utterances (utterances containing multiple languages).

— Variable Type: driver_description_t

driver_description_t is a structure containing information about a single driver.

     
     typedef struct {
         char*           driver_id;
         char*           driver_version;
         char*           synthesizer_name;
         char*           synthesizer_version;
     } driver_description_t;

driver_id is the identification string of the driver.

driver_version carries information about the driver version in use for the given synthesizer. It has the form "major.minor", where major is the major version number of the driver and minor is its minor version number.

synthesizer_name is the full name of the synthesizer engine.

synthesizer_version carries information about the synthesizer version in use in a human readable form. There is no strict rule for formatting the version information inside the string, as the versioning schemes of the various synthesizers differ significantly. If it is not possible to determine the synthesizer version, this string should be NULL.

Information about the driver's support for the functions and features defined in this interface is obtained separately through the driver_capabilities() function. See (driver_capabilities_t) for a list of the available information.

Example:

          driver_id = "festival"
          synthesizer_name = "Festival Speech Synthesis System"
          synthesizer_version = "1.94beta"
          driver_version = "1.2"
          
— Function: driver_description_t** list_drivers (void)

list_drivers() returns a newly allocated null-terminated array of available synthesizer drivers. Each item in the array is of the type driver_description_t* (driver_description_t) and must carry a properly filled in driver_id variable.

In case of an error, the value NULL is returned.

— Function: driver_capabilities_t* driver_capabilities (char* driver_id)

driver_capabilities returns information about the capabilities of the driver in a driver_capabilities_t structure.

Under this API, each driver is not guaranteed to support all of the functionality as defined in this document. It must however provide the full set of functions. Whether the functions will have the described effect can be discovered by examining the entries of the driver_capabilities_t structure and comparing them with the documentation for the given functions.

driver_id is the unique identification string for the synthesizer driver whose capabilities should be reported. See (list_drivers()).

This function returns a properly filled driver_capabilities_t structure on success. In case of an error, the value NULL is returned.
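
The following sketch illustrates the intended discovery sequence: enumerate the drivers with list_drivers(), then inspect each driver's capabilities. This is only an illustration of the draft API; the header name "ctts.h" is hypothetical, as this draft does not define a header file, and memory management conventions are assumed.

     #include <stdio.h>
     #include "ctts.h"   /* hypothetical header exposing this draft API */
     
     int main(void)
     {
         driver_description_t **drivers = list_drivers();
         if (drivers == NULL) {
             fprintf(stderr, "driver discovery failed\n");
             return 1;
         }
         for (int i = 0; drivers[i] != NULL; i++) {
             driver_capabilities_t *caps =
                 driver_capabilities(drivers[i]->driver_id);
             printf("%s (driver %s): SSML %s\n",
                    drivers[i]->synthesizer_name,
                    drivers[i]->driver_id,
                    (caps != NULL && caps->can_parse_ssml) ? "yes" : "no");
         }
         return 0;
     }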



2.3 Voice Discovery

— Variable Type: voice_description_t

voice_description_t is a structure containing the description of a voice.

     
     typedef struct {
         wchar_t *name;
         char *language;
         wchar_t *dialect;
         voice_gender gender;
         unsigned int age;
     } voice_description_t;

name is the name of the voice as recognized by the synthesizer.

language is an ISO 639 language code represented as a character string. Examples are en, fr, cs.

dialect is a string describing the language dialect, or NULL if unknown or not applicable. Examples are american or british for English, or moravian for Czech.

Open Issue: Is there a standard way of describing dialects?

gender indicates the gender of the voice. The values MALE, FEMALE and UNKNOWN are permitted.

age gives the approximate age of the voice in years. A value of 0 means the age is unknown.

— Function: voice_description_t** list_voices (char* driver_id)

For a given driver specified as driver_id, list_voices() returns a newly allocated null-terminated array describing the available voices, with one voice_description_t* item per voice.

driver_id is the identification string of the driver as returned by list_drivers() (list_drivers()).

In case of an error, the value NULL is returned.
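
As an illustration, a voice listing for one driver might look as follows. This is a sketch only: "ctts.h" is a hypothetical header and the %ls conversion assumes a suitable locale has been set.

     #include <stdio.h>
     #include "ctts.h"   /* hypothetical header */
     
     void print_voices(char *driver_id)
     {
         voice_description_t **voices = list_voices(driver_id);
         if (voices == NULL)
             return;  /* error, or no voice information available */
         for (int i = 0; voices[i] != NULL; i++)
             printf("voice %ls, language %s, age %u\n",
                    voices[i]->name, voices[i]->language, voices[i]->age);
     }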



2.4 Speech Synthesis Commands

Functions defined in this section generally accept a message to synthesize, with driver, voice and other parameters according to the current settings at the time when the function is called. Several types of messages are handled by this API. It can be either a text message, containing plain text or SSML, or it can be a 'key' or 'character' event or any general event.

The functions defined in this section can only block the calling process for as long as is necessary to fully receive and/or transfer the message, which should generally be a very short time. These functions will not block the calling process for the time of synthesis of the message and audio output.

The result of these commands will either be that the resulting audio stream is played on the audio device or that the audio stream is returned via the registered communication channel. Please see Audio Settings.

— Variable Type: message_format_t

message_format_t is an enumeration type to indicate the type of the content of a message.

     
     typedef enum {
         MESSAGE_TYPE_SSML,
         MESSAGE_TYPE_PLAIN
     } message_format_t;

MESSAGE_TYPE_SSML means the content of the message is text formatted according to the Speech Synthesis Markup Language. See (SSML).

MESSAGE_TYPE_PLAIN means the content of the message is plain text.

— Variable Type: message_id_t
     
     typedef signed int message_id_t;

A positive value represents the identification number of the message. The value of 0 means 'no message' and -1 means an error occurred.

— Variable Type: event_type_t

event_type_t is used to describe the type of an event both in the original text and in the synthesized audio data.

     
     typedef enum {
         EVENT_MESSAGE_BEGIN,
         EVENT_MESSAGE_END,    
         EVENT_SENTENCE_BEGIN,
         EVENT_SENTENCE_END,
         EVENT_WORD_BEGIN,
         EVENT_WORD_END,
         EVENT_NONE
     } event_type_t;

EVENT_MESSAGE_BEGIN and EVENT_MESSAGE_END are events corresponding to the begin and end of the message.

EVENT_SENTENCE_BEGIN and EVENT_SENTENCE_END are events corresponding to the begin and end of a sentence.

EVENT_WORD_BEGIN and EVENT_WORD_END are events corresponding to the begin and end of a word.

— Function: message_id_t say_text (message_format_t format, wchar_t* text)
— Function: message_id_t say_text_from_event (message_format_t format, wchar_t* text, unsigned int position, event_type_t position_type)
— Function: message_id_t say_text_from_index_mark (message_format_t format, wchar_t* text, char* index_mark)
— Function: message_id_t say_text_from_character (message_format_t format, wchar_t* text, size_t character_position)

say_text accepts a text message and synthesizes it from the beginning. The say_text_from_* variants accept the same message but start synthesis at the given position.

position and position_type describe the position in the message where synthesis should be started. position_type can be either a word or sentence event. position is a positive counter of events of type position_type from the beginning of the message, so for example position 2 with EVENT_WORD_BEGIN describes the start of the second word. In a similar way, index_mark specifies the name of the index mark where synthesis should start and character_position gives a position in the text as a positive number of characters.

There is no explicit upper limit on the size of the text, but the server administrator may set one in the configuration or the limit can be enforced by available system resources. If the limit is exceeded, the whole text is accepted, but the excess is ignored and an error is returned.

When a markup language, such as SSML, is used as the format of the text, this markup may or may not be checked for validity, according to user settings. If a validity check is performed and the text is found to be invalid, an error code is returned and the text is not processed further.

Errors found during processing the document, as for example a markup request to set a language which is not available for the synthesizer, are not reported.

If the position requested through say_text_from_character falls in the middle of a markup tag, the synthesis should begin with the text following the tag. If the position is in the middle of a word, the synthesizer can either synthesize from the exact position or it can start from the beginning of the word. Neither of these is considered an error.

format is a format of the message according to (message_format_t).

text is the text to be synthesized in the form according to the value of the format argument.

position is a positive number counting the events of the given type. If position_type is set to EVENT_MESSAGE_BEGIN, the value of this argument is irrelevant and is conventionally set to 0.

position_type is one of EVENT_MESSAGE_BEGIN, EVENT_SENTENCE_BEGIN, EVENT_SENTENCE_END, EVENT_WORD_BEGIN and EVENT_WORD_END.

index_mark is the name of the index mark where synthesis should begin.

character_position is a positive number of the character where synthesis should begin.

On success, a positive value – a unique message identifier – is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

For example, calling say_text_from_event() with the following arguments

          say_text_from_event(MESSAGE_TYPE_PLAIN, L"This is an example.", 3, EVENT_WORD_BEGIN)

should result in audio which starts with the word 'an' and continues to the end of the sentence.

Note: For longer and more complicated texts, it will not be possible to say in advance where the audio will start, given just the original text of the message and the position description. The placement of events across the original text may be ambiguous and depends on the synthesizer. However, this capability is designed for purposes like rewinding (moving 5 sentences backward or forward) or context pause (resuming speech from a place for which event information was already received before the pause). The application must not try to guess where exactly the events are and rely on that guess if it did not receive the information from the synthesizer earlier.

— Function: message_id_t say_deferred (message_id_t message_id)
— Function: message_id_t say_deferred_from_event (message_id_t message_id, unsigned int position, event_type_t position_type)
— Function: message_id_t say_deferred_from_index_mark (message_id_t message_id, char* index_mark)
— Function: message_id_t say_deferred_from_character (message_id_t message_id, size_t character_position)

The say_deferred family of functions works just like say_text, except that it operates on messages which were previously deferred. In addition, if position is set to 0, this has the special meaning of "start where speech was interrupted last time". Please see (defer()).

message_id is the id of the message to synthesize, as obtained by defer().

On success, a positive value – a unique message identifier – is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: message_id_t say_key (wchar_t* key_name)

say_key accepts a key name to synthesize. The command is intended to be used for speaking keys pressed by the user.

key_name is a valid key name as defined in appendix-C.

On success, a positive value – a unique message identifier – is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: message_id_t say_char (wchar_t character_name)

say_char accepts a letter (or syllable if the language doesn't have individual letters) to synthesize. The command is intended to be used for speaking single character messages, produced when the user is moving the cursor over a word.

character_name is the character to synthesize.

On success, a positive value – a unique message identifier – is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: message_id_t say_icon (char* icon_name)

say_icon accepts a general sound icon to synthesize. The command is intended to be used for general events like `new-line', `message-arrived', `question' or `new-email'. The exact sound produced or text synthesized depends on the user's configuration.

The name of the icon can be one of the names given in (recommended-sound-icons) or any other name. If the icon name is not recognized by the synthesizer, the synthesizer tries to synthesize the name of the icon itself.

icon_name is the name of the icon to synthesize. It must not contain any whitespace characters.

On success, a positive value – a unique message identifier – is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.
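
A short sketch of the event-oriented say_* calls and their shared return-code convention follows. The key and icon names are examples only, and "ctts.h" is a hypothetical header.

     #include <stdio.h>
     #include "ctts.h"   /* hypothetical header */
     
     void announce_key_press(void)
     {
         message_id_t id = say_key(L"shift_a");   /* key name as in appendix C */
         if (id == -2)
             id = say_char(L'A');                 /* fall back when say_key is unsupported */
         if (id == -1)
             fprintf(stderr, "synthesis request failed\n");
     }
     
     void announce_new_email(void)
     {
         say_icon("new-email");   /* a name from (recommended-sound-icons) */
     }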



2.5 Speech Control Commands

— Function: int cancel (void)

cancel immediately stops synthesis and audio output of the current message. When this function returns, the audio output is fully stopped and the synthesizer is ready to synthesize a new message.

If this function is called during the transfer of audio data to the application, the data block currently being transferred is completed and no further data block is sent.

Calling this command when no message is being processed is not considered an error. On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: message_id_t defer (void)

defer is similar to cancel except after stopping the synthesis process and audio playback the message is not thrown away in the synthesizer, but data that might be useful for future working with the message (such as rewinding, repeating or resuming the synthesis process) are preserved. This might or might not include the original text of the message. In any case, enough information must be preserved so that the synthesizer is able to fully reproduce the audio data for the message.

If this function is called during the transfer of audio data to the application, the data block currently being transferred is completed and no further data block is sent.

This function can also be called after all the audio has already been transferred to the application, but before another synthesis request is issued. As long as no cancel() request comes in between, the data for the previous message remain stored.

There is no explicit upper limit on the number of messages that can be simultaneously postponed by defer(). There might however be a limit imposed by the administrator or forced by available system resources. In case such a limit is exceeded, defer() will return with an error.

After the message is no longer needed, the application must make sure to discard it through discard(), otherwise system resources will be wasted.

On success, the identifier of the deferred message (a positive value) is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: int discard (message_id_t message)

Discards a previously deferred message. The driver/engine will drop all information about this message and the message will be removed from the list of paused messages.

See (defer()).

message is the message ID of the message to discard. Passing an ID of a message that is not paused is considered an error.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.
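
Taken together, defer(), say_deferred_from_event() and discard() support a simple pause and resume cycle, sketched below under the assumption that the driver reports can_defer_message ("ctts.h" is a hypothetical header).

     #include "ctts.h"   /* hypothetical header */
     
     void pause_and_resume(void)
     {
         message_id_t msg = say_text(MESSAGE_TYPE_PLAIN, L"A long document...");
         if (msg < 0)
             return;
         /* the user pauses: stop the audio but keep the message in the synthesizer */
         message_id_t deferred = defer();
         if (deferred < 0)
             return;
         /* the user resumes: position 0 means "where speech was interrupted" */
         say_deferred_from_event(deferred, 0, EVENT_WORD_BEGIN);
         /* once the message is no longer needed, release its resources */
         discard(deferred);
     }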



2.6 Parameter Settings



2.6.1 Driver Selection and Parameters

— Function: int set_driver (char* driver_id)

Set the synthesis driver. See list_drivers().

driver_id is the unique ID of the driver as returned by list_drivers().

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.



2.6.2 Voice Selection

Setting parameters in this section only has effect until the synthesizer driver in use is changed by the application.

— Function: int set_voice_by_name (wchar_t* voice_name)

set_voice_by_name selects the voice with the given name.

voice_name is the name of the desired voice. It must be one of the names returned by list_voices(). See (list_voices()).

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: int set_voice_by_properties (voice_description_t *voice_description, unsigned int variant)

set_voice_by_properties selects a voice under the current driver most closely matching the given description. The exact voice selected might be subject to user preference settings for voice selection inside the synthesizer.

There is no guarantee that any of the given parameters will be respected, although language generally is supposed to be respected, unless impossible or unless the user wishes otherwise.

In case no voice matches the given language, the synthesizer should pick the general default voice (if applicable) or choose an arbitrary voice. This alone is not considered an error and must not be a reason for the synthesizer to refuse further synthesis requests unless for some other related reason (as for example the voice being unable to handle the given Unicode character range).

The application can check which voice was selected and how closely (if at all) it matches the given description.

voice_description is a description of the desired voice. Any of its entries except language can be filled in or left blank (NULL for strings, 0 for integer values, UNKNOWN for the gender). Please see voice_description_t for more information about the format and allowed values.

variant is a positive (1,2,3...) number specifying which of the voices matching the description and assigned equal priority inside the synthesizer should be selected. Please see (SSML) for more details.

Note: This function is different from calling list_voices() and following it with set_voice_by_name(), because user settings about voice selection inside the synthesizer are respected.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.
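
A sketch of a property-based selection follows. The MALE, FEMALE and UNKNOWN constants are those permitted for the gender entry; "ctts.h" is a hypothetical header.

     #include "ctts.h"   /* hypothetical header */
     
     int select_czech_female_voice(void)
     {
         voice_description_t desc = {
             .name = NULL,        /* no preference */
             .language = "cs",    /* ISO 639 code; language should be respected */
             .dialect = NULL,
             .gender = FEMALE,
             .age = 0             /* unknown, no preference */
         };
         return set_voice_by_properties(&desc, 1);   /* first equally ranked match */
     }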

— Function: voice_description_t* get_current_voice (void)

get_current_voice returns a voice_description_t structure filled in with all known information about the voice currently in use.

In case of an error, the value NULL is returned..



2.6.3 Prosody parameters

Setting parameters in this section only has effect until the synthesizer driver in use is changed.

— Function: int set_rate_relative (signed int rate_relative)
— Function: int set_rate_absolute (unsigned int rate_absolute)
— Function: unsigned int get_rate_absolute_default (void)

Set/get the rate of speech.

rate_relative represents the relative change with respect to the default value for the given voice. For example 0 means the default value for the given voice while -50 means a fifty percent lower rate with respect to the default.

rate_absolute is the desired rate in words per minute.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: int set_pitch_relative (signed int pitch_relative)
— Function: int set_pitch_absolute (unsigned int pitch_absolute)
— Function: unsigned int get_pitch_absolute_default (void)

Set/get the voice base pitch.

pitch_relative represents the relative change with respect to the default value for the given voice. For example 0 means the default value for the given voice while -50 means a fifty percent lower pitch with respect to the default.

pitch_absolute is the desired pitch in Hz.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: int set_pitch_range_relative (signed int range)

Set voice pitch range in relative units. Pitch range is how much pitch changes in intonation with respect to the base pitch.

range represents the relative change with respect to the default value for the given voice. For example 0 means the default value for the given voice while -50 means a fifty percent lower pitch range with respect to the default.

— Function: int set_pitch_range_absolute (unsigned int range)

Open Issue: How should this work? It is not clear from the SSML specs.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: int set_volume_relative (signed int volume_relative)
— Function: int set_volume_absolute (unsigned int volume_absolute)
— Function: unsigned int get_volume_absolute_default (void)

Set/get the volume of speech.

volume_relative represents the relative volume change with respect to the default value for the given voice. For example 0 means the default value for the given voice while -50 means a fifty percent lower volume with respect to the default.

volume_absolute is a number from the range 0 to 100 where the value of 0 means silence and 100 means maximum volume.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.
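
The relative and absolute variants combine naturally with capability discovery. The sketch below prefers the relative call and falls back to an absolute value derived from the default; caps is assumed to come from driver_capabilities() and "ctts.h" is a hypothetical header.

     #include "ctts.h"   /* hypothetical header */
     
     int halve_speech_rate(driver_capabilities_t *caps)
     {
         if (caps->can_set_rate_relative)
             return set_rate_relative(-50);   /* 50% below the voice default */
         if (caps->can_set_rate_absolute && caps->can_get_rate_default)
             /* the absolute rate is given in words per minute */
             return set_rate_absolute(get_rate_absolute_default() / 2);
         return -2;   /* neither method is supported */
     }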



2.6.4 Style parameters

— Variable Type: punctuation_mode_t

punctuation_mode_t is an enumeration type containing information about punctuation signalling mode.

     
     typedef enum {
         PUNCTUATION_NONE,
         PUNCTUATION_ALL,
         PUNCTUATION_SOME
     } punctuation_mode_t;

PUNCTUATION_NONE means no punctuation is signalled.

PUNCTUATION_ALL means all punctuation characters are signalled.

PUNCTUATION_SOME means only selected punctuation characters are signalled. (See set_punctuation_detail()).

— Function: int set_punctuation_mode (punctuation_mode_t mode)

Set punctuation reading mode. In other words, this influences which punctuation characters will be signalled while reading the text. Signalling means either synthesizing their name (e.g. `question mark') or playing the appropriate sound icon, according to user settings inside the synthesizer.

For example the `.' (dot) and `?' (question mark) are not normally pronounced and their presence only influences the intonation of the sentence. However, in some cases such as copyediting text or editing program source code, it is desirable to have them spoken or otherwise indicated.

mode is one of PUNCTUATION_NONE, PUNCTUATION_ALL and PUNCTUATION_SOME (See set_punctuation_detail()) as defined in punctuation_mode_t.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: int set_punctuation_detail (wchar_t *detail)

set_punctuation_detail influences which punctuation characters should be signalled when the punctuation mode is set to PUNCTUATION_SOME. (See set_punctuation_mode().)

detail is a string enumerating, without any separating spaces, the punctuation characters that should be signalled.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

Example:

          set_punctuation_detail("?!.#");
          
— Variable Type: capital_letters_mode_t

capital_letters_mode_t is an enumeration type containing information about selected mode for signalling capital letters.

     
     typedef enum {
         CAPITAL_LETTERS_NO,
         CAPITAL_LETTERS_SPELLING,
         CAPITAL_LETTERS_ICON,
         CAPITAL_LETTERS_PITCH
     } capital_letters_mode_t;

CAPITAL_LETTERS_NO means no signalling of capital letters.

CAPITAL_LETTERS_SPELLING means that each capital letter is prepended with the word “capital” or similar appropriate for the given language. Alternatively, the whole word containing the capital letter may be spelled. These two approaches may be combined.

For example the text “My name is John” would be read as “Capital my name is capital John.” or, when spelled, “Capital em wye name is capital jay ou aitch en.”

CAPITAL_LETTERS_ICON means that each capital letter is prepended with a sound icon.

The above example text “My name is John” would be read as “*ding* My name is *ding* John” where *ding* is the appropriate sound for capital letter signalling as provided by the synthesizer or configured by the user.

CAPITAL_LETTERS_PITCH is a method where capital letters are indicated by raising the pitch of the voice when reading them.

Open Issue: How exactly does CAPITAL_LETTERS_PITCH work?

— Function: int set_capital_letters_mode (capital_letters_mode_t mode)

set_capital_letters_mode sets the capital letters speaking mode as requested.

When the engine is not able to set the requested mode but is able to set some other mode, it should set that other mode instead.

mode is one of CAPITAL_LETTERS_NO, CAPITAL_LETTERS_SPELLING, CAPITAL_LETTERS_ICON and CAPITAL_LETTERS_PITCH as defined in capital_letters_mode_t.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: int set_number_grouping (unsigned int grouping)

Sets how many digits should be grouped together when reading a number. See tts:digits for a detailed description of the functionality.

grouping is a positive number indicating how many digits should be grouped together, or 0 for reading numbers as a whole.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.
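
For illustration, a style setup suitable for reading program source code might combine the calls of this section as follows. This is a sketch only; "ctts.h" is a hypothetical header and the chosen values are examples.

     #include "ctts.h"   /* hypothetical header */
     
     void configure_for_source_code(void)
     {
         set_punctuation_mode(PUNCTUATION_SOME);
         set_punctuation_detail(L"?!.#;{}");        /* signal only these characters */
         set_capital_letters_mode(CAPITAL_LETTERS_ICON);
         set_number_grouping(2);                    /* read numbers two digits at a time */
     }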



2.6.5 Dictionaries

— Function: int set_dictionary (Dictionary dictionary)

Open Issue: How should this work? What is the Dictionary type?

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.



2.6.6 Audio Settings

Generally, there are two ways of dealing with audio. Either the application can ask this API to send the synthesized audio samples back as data, or it can ask for them to be played (e.g. on the computer audio device or the internal speakers of hardware devices). Not every synthesizer offers both options.

In the case where the application asks for the audio to be played on the audio device, the means of handling audio events and index marking will be callbacks (handled either as function callbacks or asynchronous socket notifications). This way, event signalling and/or index marking callbacks can be provided by every synthesizer which supports synchronization and/or index marking, regardless of whether it plays audio itself or it gives data to its driver.

If the application asks for audio data to be returned to the application, then events and custom index marks are embedded as additional information in the retrieved audio data blocks. This is more accurate and is very useful when the application doesn't want to play the audio immediately, but wants to store it either as a file or in memory. However, this is only possible with synthesizers that can give audio data to their driver.

Of course it is possible to discover the capabilities of each driver in advance. See (Speech Synthesis Driver Discovery).

— Data Type: output_method_t

output_method_t is an enumeration type for selecting the audio output method for the synthesizer.

     
     typedef enum {
         AUDIO_OUTPUT_PLAYBACK,
         AUDIO_OUTPUT_RETRIEVAL
     } output_method_t;

AUDIO_OUTPUT_PLAYBACK means the audio should be played by the synthesizer or automatically sent to playback.

AUDIO_OUTPUT_RETRIEVAL means the audio should be returned from the synthesizer to the application.

— Function: int set_audio_output (output_method_t method)

This option deals with the output of the synthesizer. The two possibilities are to have the audio played (which is the only possibility for some synthesizers) or have audio retrieved over a socket as a series of data blocks (either synchronously or asynchronously).

method is either AUDIO_OUTPUT_PLAYBACK or AUDIO_OUTPUT_RETRIEVAL. See (output_method_t).

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.

— Function: int set_audio_retrieval_destination (char *host, unsigned int port)

Sets the TCP socket where audio data should be sent. See (Audio Retrieval) for more details.

host is the IP address of the machine where audio data should be delivered.

port is the port on the machine where audio data should be delivered.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.
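
A sketch of selecting the output method based on the driver's capabilities follows. The loopback address and port number are examples only, and "ctts.h" is a hypothetical header.

     #include "ctts.h"   /* hypothetical header */
     
     int setup_audio_output(driver_capabilities_t *caps)
     {
         if (!caps->can_retrieve_audio)
             /* the driver can only play the audio itself */
             return set_audio_output(AUDIO_OUTPUT_PLAYBACK);
         if (set_audio_output(AUDIO_OUTPUT_RETRIEVAL) != 0)
             return -1;
         /* example destination: a local listening socket */
         return set_audio_retrieval_destination("127.0.0.1", 6560);
     }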



2.7 Audio Retrieval

This section deals with the situation when the application wants to retrieve audio data from the synthesizer as described in (Audio Settings).

The audio data are delivered to a TCP socket on the address specified by the application using set_audio_retrieval_destination().

For each message sent to the synthesizer, one or more data blocks are delivered asynchronously over the socket. Each data block contains the identification of the original message and the serial number of the block, information about the audio format in use, events and custom index marks in the given block and the audio data itself.

Each data block is composed of four sections: BLOCK acting as a header specifying which message this data belongs to, PARAMETERS carrying information about the parameters of the audio data, EVENTS as a list of events and custom index marks reached in this audio data, and DATA containing the data itself.

The following syntax is used for each block. Where arguments are provided, they are separated by one or more spaces (except in the PARAMETERS section, where spaces are not allowed).

     BLOCK msg_id block_number
     PARAMETERS
     data_format=data_format
     data_length=data_length
     audio_length=audio_length
     sample_rate=sample_rate
     channels=channels
     encoding=encoding_string
     END OF PARAMETERS
     EVENTS
     type    n/name    text_position       time_position
     END OF EVENTS
     DATA
     audio_data
     END OF DATA

BLOCK

msg_id is the unique identification number of the message this audio data belongs to.

block_number is a positive string-represented number indicating the position of this audio chunk in the resulting audio for the message. A block_number of 1 means the first part of the data.

PARAMETERS

The parameters section contains the following parameters (not all of them are always used).

data_format is a string identification for the format of the audio data. Recognized names are: “raw”, “wav” and “ogg”. This parameter is required.

Open Issue: Is there any specification that we could refer here so that we do not need to enumerate the possible values? The goal of this specification is not to dictate which data format should be used.

data_length is an unsigned string-represented number indicating the length of the data contained in the DATA section in bytes. This parameter is required.

audio_length is an unsigned string-represented number indicating the length of the audio data contained in the DATA section in milliseconds. This parameter is required.

sample_rate and channels are only used for the “raw” data format and are string-represented numbers describing the common audio parameters: sample_rate is the sampling frequency in Hz and channels is the number of channels in the audio data.

encoding is a string describing the encoding details of the audio data. This parameter is only used for the “raw” audio output. It has the following form:

     signed/unsigned_bits-per-word_endian

where signed/unsigned is either S or U for signed or unsigned data type, bits-per-word is a two digit number representing word data width and endian is either LE for little endian or BE for big endian.
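
A minimal sketch of decoding such an encoding string (for example "S16_LE") into its three components:

     #include <stdbool.h>
     #include <stdio.h>
     #include <string.h>
     
     /* Parse e.g. "S16_LE": signedness, bits per word, endianness. */
     bool parse_encoding(const char *enc, bool *is_signed,
                         int *bits, bool *little_endian)
     {
         char sign;
         int width;
         char endian[3];
         /* pattern: one letter, a two digit width, '_', two letters */
         if (sscanf(enc, "%c%2d_%2s", &sign, &width, endian) != 3)
             return false;
         *is_signed = (sign == 'S');
         *bits = width;
         *little_endian = (strcmp(endian, "LE") == 0);
         return true;
     }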

EVENTS

The events section contains zero or more lines, each of them representing an event or a custom index mark which is reached in the sent audio data chunk. Each gives its order and position in both the original message text and the synthesized audio.

The synthesizer or synthesizer driver will only report as much information as possible. The message_start and message_end events must always be signalled, though.

DATA

The section DATA contains audio data in exactly the length as is specified by the data_length parameter in the PARAMETERS section for the given block.

Example

Below is an example of audio data for a message being sent in a single block:

     BLOCK 142 1
     PARAMETERS
     data_format=raw
     data_length=109368
     audio_length=1240
     sample_rate=44100
     channels=1
     encoding=S16_LE
     END OF PARAMETERS
     EVENTS
     message_start
     word_start       1           0       12
     sentence_start   1           0       12
     index_mark       "my-1"      14      123
     word_start       2           19      442
     index_mark       "my-2"      31      821
     word_start       3           31      821
     message_end
     END OF EVENTS
     DATA
     here are the audio data
     END OF DATA



2.8 Event Callbacks

If the output method is set for audio playback, meaning the audio is being played on the audio device behind this API, events and custom index marks are reported through callbacks.

— Variable Type: callback_function_t

Type for a function to be used as a callback for reporting events and custom index markers.

     
     typedef int callback_function_t(event_type_t event,
         signed int n, size_t text_pos, char *name);

event is the type of the event reported. (event_type_t)

n and text_pos are defined in (EVENTS). Where not applicable (n for index marks, and both n and text_pos for message events), these variables are set to -1.

name is only used when the event is of type custom index mark and contains the name of the index mark. Otherwise its value is set to NULL.

— Function: int register_callback (callback_function_t* callback_function)

This function registers a function to be called whenever an event or a custom index mark is reached during playing the audio for the synthesized message on the audio device.

callback_function is the function to be used as a callback. Please see (callback_function_t) for details about the exact form.

On success, the value 0 is returned. In case of an error, the value -1 is returned. When this function is not supported by the driver, -2 is returned.
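
A sketch of a conforming callback which tracks word positions and custom index marks follows; "ctts.h" is a hypothetical header.

     #include <stdio.h>
     #include "ctts.h"   /* hypothetical header */
     
     static int on_event(event_type_t event, signed int n,
                         size_t text_pos, char *name)
     {
         if (event == EVENT_WORD_BEGIN)
             printf("word %d begins at character %zu\n", n, text_pos);
         else if (name != NULL)
             printf("index mark \"%s\" reached\n", name);
         return 0;
     }
     
     int init_event_reporting(void)
     {
         return register_callback(on_event);
     }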



3 Notes About the Interface

Intended use

The primary use of this interface is to give applications access to a low-level layer, provided either by a process or a library, that manages the synthesizer drivers.

A subset of this interface can however be used to interface this low-level layer with the synthesizer drivers themselves. The capabilities provided by a driver itself and those provided by the interface using this same driver can even differ, as some functionality can be emulated by this low-level library or process: notably SSML conversion or stripping, interfacing with audio output, and callbacks.

The audio retrieval method is designed in such a way that it will bypass this middle layer when the application wants to receive audio, and it can also possibly bypass the driver if the synthesizer supports it, resulting in better performance.

Repeat, rewind, context pause

The rewind and context pause functionality can be implemented in applications for every synthesizer that supports some kind of event notification for plain text messages. For SSML messages, support for starting at a variable position inside the message, as described in (say_text()), is needed. The minimum granularity of rewind and context pause is determined by the granularity with which the synthesizer reports events.

Rewind can work as follows: The event notification mechanism is used to determine the current position in the spoken text. The message is first canceled or deferred. The synthesis process is started again from a position n words or sentences forward or backward. If the synthesizer does not support this functionality, this can be emulated for plain text by simply sending only the desired part of the text. The application can possibly take advantage of the audio data already received.

The working of context pause is similar.

If supported by the synthesizer, the higher level can also make use of the defer() (defer()) functionality for better performance. This way, the text of the message does not need to be transferred again after each pause or resume and the synthesizer can make use of the already computed results, particularly SSML parsing and syntax analysis.
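
A sketch of the rewind scheme just described follows. The current_word counter is assumed to be maintained by the application's event callback; "ctts.h" is a hypothetical header.

     #include "ctts.h"   /* hypothetical header */
     
     extern volatile int current_word;   /* updated by the event callback */
     
     void rewind_by_words(wchar_t *text, int words_back)
     {
         int target = current_word - words_back;
         if (target < 1)
             target = 1;          /* never rewind past the first word */
         cancel();                /* or defer(), to reuse the synthesizer's results */
         say_text_from_event(MESSAGE_TYPE_PLAIN, text,
                             (unsigned int)target, EVENT_WORD_BEGIN);
     }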

Audio formats in use

This interface does not enforce any particular audio format to be used by the synthesizer. The API used to interface synthesizers should not limit the synthesizers or the applications in the formats used to transfer audio.

Limits will however be given by the implementation of the audio output mechanism in use. Any audio output format fits these requirements, but synthesizer and synthesizer driver authors must be aware that output in a format not supported by the audio technology in use will be useless for the user.



Appendix A Requirements on the API

This section defines a set of requirements on the interface and on speech synthesizer drivers that need to support assistive technologies on free software platforms.



A.1 Design Criteria

The Common TTS Driver Interface requirements will be developed within the following broad design criteria:



A.2 Synthesizer Discovery Requirements



A.3 Synthesizer Configuration Requirements



A.4 Synthesis Process Requirements



A.5 Performance Guidelines

In order to make a speech synthesizer driver actually usable with assistive technologies, it must satisfy certain performance expectations. The following text gives driver implementors a rough idea of what is needed in practice.

Typical scenarios when working with a speech enabled text editor:



Appendix B Extended SSML Markup

This section defines the set of extended SSML markup and special attribute values for use in input texts for the drivers. The markup consists of two namespaces: the default SSML namespace (SSML) and 'tts', where 'tts' introduces several new attributes to be used with the 'say-as' element and a new element 'style'.

If an SSML element is supported, all attributes that the SSML 1.0 definition (SSML) makes mandatory for it must be supported as well, even if they are not explicitly mentioned in this document.

This section also defines which functions the API needs to provide for default prosody, voice and style settings, according to A.3.2.

Note: According to available information, SSML is not known to suffer from any IP issues.



Appendix C Key Names



C.1 General Rules

A key name may contain any character excluding control characters (characters in the range 0 to 31 of the ASCII table and other “invisible” characters), spaces, dashes and underscores.

The recognized key names are:

Examples of valid key names:

     A
     shift_a
     shift_A
     $
     enter
     shift_kp-enter
     control
     control_alt_delete



C.2 List of symbolic key names



Appendix D Requirements on the synthesizers

This section gives guidelines to synthesizer authors and driver implementors about what capabilities should be supported by the synthesizers accessible under this API.

The requirements are sorted into three categories: (MUST HAVE), (SHOULD HAVE) and (NICE TO HAVE), with meanings analogous to those specified in (appendix-A). A synthesizer which does not meet all of the (MUST HAVE) requirements cannot be accessed under this interface.

  1. General points
    1. (MUST HAVE): Interaction with the synthesizer must not cause inappropriate blocking of system resources or affect other operating system components in an unexpected way. In particular, the synthesizer must not block audio output for other applications.
  2. Discovery of available voices
    1. (NICE TO HAVE): It would be nice if it was possible to discover all available voices.
    2. (NICE TO HAVE): It would be nice to have the possibility of discovering languages and possibly also countries or dialects supported by each voice.
  3. Synthesizer configuration requirements
    1. The synthesizer should support the configuration options defined in the interface description under (Parameter Settings). The relevant priorities for these capabilities are specified as points A.3.1-A.3.3 and A.3.5 of the requirements on the API (appendix-A) and in the extended SSML subset specifications (appendix-B).
  4. Synthesis process requirements
    1. (MUST HAVE): The synthesizer must be able to process plain text as input.
    2. (NICE TO HAVE): If the synthesizer can't process Unicode encoding for the text, it would be nice if it were possible to determine the encoding used for a given voice and language.
    3. (SHOULD HAVE): The synthesizer should be able to process text formatted using extended SSML markup defined in (see appendix-B) of this document and encoded in Unicode. The synthesizer may choose to ignore markup it cannot handle or even to ignore all markup as long as it is able to process the text inside the markup.
    4. (SHOULD HAVE): The speech synthesizer should be able to properly process the extended SSML markup that appendix-B of this document classifies as SHOULD HAVE; analogously for markup classified as NICE TO HAVE.
    5. (NICE TO HAVE): It would be nice if the synthesizer was able to start the synthesis process from a position in the text where an event (word or sentence boundary) occurs, as described in (say_text()).
    6. (NICE TO HAVE): It would be nice if the synthesizer supported the defer capability (defer()) or a similar compatible mechanism to achieve good performance when rewinding and pausing/resuming inside long texts.
    7. (MUST HAVE): An application must be able to cancel a synthesis operation in progress. In case of hardware synthesizers, or synthesizers that produce their own audio, this means cancelling the audio output as well.
    8. (MUST HAVE): If the synthesized audio is being played, it must be possible to discover when the playback started and when it terminated.
  5. Audio retrieval
    1. (NICE TO HAVE): It would be nice if the synthesizer could return audio data rather than playing them itself, preferably through the mechanism described in the interface definition (Audio Retrieval).
  6. Performance guidelines
    1. (SHOULD HAVE): The speech synthesizer driver should honor the Performance Guidelines described in (appendix-A).
    2. (NICE TO HAVE): It would be nice if the synthesizer was able to process long input texts in such a way that the audio output starts to be available for playing as soon as possible. The driver is not required to split long texts into smaller pieces.
  7. Other requirements
    1. (NICE TO HAVE): It would be nice if a synthesizer were able to support multilingual utterances.
    2. (NICE TO HAVE): It would be nice if the synthesizer supported notification of events and custom index marks as defined in (event_type_t) and if the application was able to align these events with the synthesized audio as in (Audio Retrieval).

      Rationale: This is useful to update cursor position as a displayed text is spoken. It is also essential for rewinding and context pause capabilities.



Appendix E Recommended Sound Icons

This appendix specifies a set of recommended names of sound icons to be used by the application and recognized by the synthesizer. The set is divided into groups according to their purpose.

[...]



Appendix F Related Specifications

  1. [SSML], Speech Synthesis Markup Language, W3C,
    http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
  2. [SSML-req], SSML Requirements, W3C,
    http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/#ref-reqs
  3. [SSML-say-as], SSML 'say-as' Element Attribute Values, W3C,
    http://www.w3.org/TR/2005/NOTE-ssml-sayas-20050526/
  4. [MRCP], Media Resource Control Protocol, IETF SpeechSC Working Group,
    http://www.ietf.org/html.charters/speechsc-charter.html



Index of Functions