SAPI 5.4/11 Japanese TTS in Delphi GUI vs console


Trying to use MS Speech API v11 with a Japanese engine (MS Haruka) in Delphi 10.3.

I have a sample app with a form and a button. The click handler goes:

uses SpeechLib11_TLB; // Imported from "Microsoft Speech Object Library" v.B.0

procedure TForm1.Button1Click(Sender: TObject);
var
    v: ISpeechVoice;
begin
    v := CoSpVoice.Create();
    // Needed for 5.4 only, but won't hurt
    v.Voice := v.GetVoices('language=411', '').Item(0);

    v.Speak('時間', SVSFDefault);
end;

That causes an error: "Catastrophic failure" (HRESULT 0x8000FFFF, E_UNEXPECTED). Code that I believe is equivalent works in a C++ project:

#include <windows.h>
#import "libid:d3c4a7f2-7d27-4332-b41f-593d71e16db1" rename_namespace("SAPI") //v11
//#import "libid:C866CA3A-32F7-11D2-9602-00C04F8EE628" rename_namespace("SAPI") //v5.4

int wmain()
{
    CoInitialize(nullptr); // STA, same as on the Delphi side
    {
        SAPI::ISpeechVoicePtr v(__uuidof(SAPI::SpVoice));

        //Needed for 5.4 only, but won't hurt
        SAPI::ISpeechObjectTokensPtr voices(v->GetVoices(L"language=411", L""));
        v->Voice = voices->Item(0);

        v->Speak(L"時間", SAPI::SVSFDefault);
    }
    CoUninitialize();
    return 0;
}

That works and speaks, so SAPI per se is not broken on the machine. Both projects target Win32, not Win64. The Japanese voice is the default one (no need to set it explicitly).

Same result with SAPI 5.4 proper (not OneCore), although there the Japanese voice is not the default one, and I had to add a couple of lines to set it as such.

Further debugging reveals that on the Delphi side, merely calling the v.Voice property getter immediately after the first line causes the same E_UNEXPECTED error. Meanwhile, the Voice setter works if you pass it a valid voice token object from GetVoices(). It looks as if the voice object initializes itself to its defaults correctly in C++, but somehow skips that step in the Delphi project.

Requesting v.Voice right after construction does work in Delphi with SAPI 5.4, though. Calling Speak() still throws E_UNEXPECTED.

What could be the difference in process-wide or thread-wide execution context between C++ and Delphi? It's not the thread locale. The COM threading model is apartment in both.
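For reference, here's how the apartment claim can be verified on the C++ side — a minimal sketch of my own (Windows-only, not part of either original project):

    #include <windows.h>
    #include <stdio.h>

    int wmain()
    {
        // Explicit STA init on the calling thread
        CoInitializeEx(nullptr, COINIT_APARTMENTTHREADED);

        // Ask COM what apartment this thread actually ended up in
        APTTYPE type;
        APTTYPEQUALIFIER qualifier;
        if (SUCCEEDED(CoGetApartmentType(&type, &qualifier)))
            printf("apartment type: %d (0=STA, 1=MTA, 2=NA, 3=main STA)\n", (int)type);

        CoUninitialize();
        return 0;
    }

The Delphi side was checked the equivalent way; both sides sit in a single-threaded apartment at the point of the Speak() call.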

The same Delphi code works with an English phrase and an English voice (MS Helen). So whatever init failure there might be, it's probably specific to Haruka.

The SAPI 11 runtime is available here. The language data for TTS are here.

Another data point. I've rewritten the SAPI logic in Delphi to use SAPI 5.4 OneCore instead (not SAPI 5.4 proper). Unlike 5.4 and 11, it doesn't expose an IDispatch-based interface, and it's somewhat clumsier specifically in Delphi, but Japanese TTS works. The question, as initially posed, is still unanswered, but at least there's a workaround. I'll write up an answer, but I won't accept it.

However, it's not the custom vs. dual interface distinction that's to blame. I changed the logic to use custom interfaces instead of automation ones with SAPI 5.4 proper (the type library defines both) and still got E_UNEXPECTED from Speak(). There's no error info.
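For completeness, this is the standard channel I checked for rich error details after the failed call — GetErrorInfo, which a dual/automation server populates via SetErrorInfo. A sketch; the function and variable names here are mine:

    #include <windows.h>
    #include <stdio.h>

    // Print the queued automation error description, if the failed call left one
    void DumpComError()
    {
        IErrorInfo *info = nullptr;
        if (GetErrorInfo(0, &info) == S_OK && info)
        {
            BSTR desc = nullptr;
            if (SUCCEEDED(info->GetDescription(&desc)) && desc)
            {
                wprintf(L"COM error: %s\n", desc);
                SysFreeString(desc);
            }
            info->Release();
        }
        else
        {
            printf("no IErrorInfo available\n");
        }
    }

    int wmain()
    {
        // With no failed call beforehand, nothing is queued
        DumpComError();
        return 0;
    }

In this case the check comes back empty — SAPI raises E_UNEXPECTED without setting any error info.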

Here's another beautiful data point: SAPI 5.4 TTS with the automation-based API works and talks as expected in a Delphi console app. So it's not even Delphi-specific, it's somehow VCL-specific. What is it with Delphi GUI? Needless to say, I immediately retested the C++ snippet in a C++ GUI application with a form and a button. The C++ one talks.

asked on Stack Overflow Mar 9, 2020 by Seva Alekseyev • edited Mar 12, 2020 by Seva Alekseyev

1 Answer


Not an answer, but a workaround.

Windows 10 comes with two flavors of 32-bit SAPI: SAPI 5.4 proper (in system32\speech) and SAPI 5.4 OneCore (in system32\speech_onecore). The latter, even though it's superficially the same, exposes a different type library - there's no automation support; all interfaces are custom instead of dual. More importantly, when you download the Japanese TTS voice in the Windows 10 Settings app, you end up with three voices under OneCore (Sayaka is somehow missing) and only one, Haruka, under 5.4 proper.

Delphi can consume custom interfaces in a type library, but the method calls look somewhat clumsier. Also, the enumeration of voices is cleaner in the automation API. Anyway, here goes.

uses SpeechLib54Core_TLB; // This time it's OneCore

procedure TForm1.Button1Click(Sender: TObject);
var
    v: ISpVoice;
    cat: ISpObjectTokenCategory;
    toks: IEnumSpObjectTokens;
    tok: ISpObjectToken;
    sno: LongWord;
    n: Cardinal;
begin
    v := CoSpVoice.Create();
    cat := CoSpObjectTokenCategory.Create();
    // The OneCore voice category, as registered on Windows 10
    cat.SetId('HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech_OneCore\Voices', 0);
    cat.EnumTokens('language=411', '', toks); //411 means Japanese
    toks.GetCount(n);
    if n = 0 then
        exit; // No Japanese voices installed
    toks.Item(0, tok); //Take the first one - typically Ayumi
    v.SetVoice(tok);

    v.Speak('時間', 0, sno);
end;

Note that passing a Japanese string literal to a COM method works without an explicit cast to a wide string.

answered on Stack Overflow Mar 10, 2020 by Seva Alekseyev • edited Mar 11, 2020 by Seva Alekseyev

User contributions licensed under CC BY-SA 3.0