How can I send audio to Nexmo Voice through a WebSocket?

I am trying to use Nexmo's Voice API, with WebSockets, in a .NET Core 2 Web API.

This API needs to:
  • receive audio from a phone call, through Nexmo
  • use the Microsoft Cognitive Services speech-to-text API
  • send the text to a bot
  • use Microsoft Cognitive Services text-to-speech on the bot's reply
  • send the speech back to Nexmo, through their Voice API WebSocket

For now, I'm bypassing the bot steps, as I am first trying to get the WebSocket connection working. When I try an echo method (sending the received audio back to the WebSocket), it works without any issue. But when I try to send the speech from Microsoft text-to-speech, the phone call ends.

I can't find any documentation that implements anything other than a simple echo.

The TextToSpeech and SpeechToText methods work as expected when used outside of the websocket.

Here's the WebSocket handler with speech-to-text:

public static async Task Echo(HttpContext context, WebSocket webSocket)
{
    var buffer = new byte[1024 * 4];
    WebSocketReceiveResult result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    while (!result.CloseStatus.HasValue)
    {
        while (!result.EndOfMessage)
        {
            result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        }
        var text = SpeechToText.RecognizeSpeechFromBytesAsync(buffer).Result;
        Console.WriteLine(text);
    }
    await webSocket.CloseAsync(result.CloseStatus.Value, result.CloseStatusDescription, CancellationToken.None);
}

And here's the WebSocket handler with text-to-speech:

public static async Task Echo(HttpContext context, WebSocket webSocket)
{
    var buffer = new byte[1024 * 4];
    WebSocketReceiveResult result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    while (!result.CloseStatus.HasValue)
    {
        var ttsAudio = await TextToSpeech.TransformTextToSpeechAsync("Hello, this is a test", "en-US");
        await webSocket.SendAsync(new ArraySegment<byte>(ttsAudio, 0, ttsAudio.Length), WebSocketMessageType.Binary, true, CancellationToken.None);

        result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    }
    await webSocket.CloseAsync(result.CloseStatus.Value, result.CloseStatusDescription, CancellationToken.None);
}

Update March 1st 2019

In reply to Sam Machin's comment, I tried splitting the array into chunks of 640 bytes each (I'm using a 16 kHz sample rate), but Nexmo still hangs up the call, and I still don't hear anything.

public static async Task NexmoTextToSpeech(HttpContext context, WebSocket webSocket)
{
    var ttsAudio = await TextToSpeech.TransformTextToSpeechAsync("This is a test", "en-US");
    var buffer = new byte[1024 * 4];
    WebSocketReceiveResult result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);

    while (!result.CloseStatus.HasValue)
    {
        await SendSpeech(context, webSocket, ttsAudio);
        result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    }
    await webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing Socket", CancellationToken.None);
}

private static async Task SendSpeech(HttpContext context, WebSocket webSocket, byte[] ttsAudio)
{
    const int chunkSize = 640;
    var chunkCount = 1;
    var offset = 0;

    var lastFullChunck = ttsAudio.Length < (offset + chunkSize);
    try
    {
        while (!lastFullChunck)
        {
            await webSocket.SendAsync(new ArraySegment<byte>(ttsAudio, offset, chunkSize), WebSocketMessageType.Binary, false, CancellationToken.None);
            offset = chunkSize * chunkCount;
            lastFullChunck = ttsAudio.Length < (offset + chunkSize);
            chunkCount++;
        }

        var lastMessageSize = ttsAudio.Length - offset;
        await webSocket.SendAsync(new ArraySegment<byte>(ttsAudio, offset, lastMessageSize), WebSocketMessageType.Binary, true, CancellationToken.None);
    }
    catch (Exception ex)
    {
    }
}

Here's the exception that sometimes appears in the logs:

System.Net.WebSockets.WebSocketException (0x80004005): The remote party closed the WebSocket connection without completing the close handshake.

c#
websocket
speech-recognition
text-to-speech
nexmo
asked on Stack Overflow Feb 27, 2019 by xvercruysse • edited Jun 20, 2020 by Community

1 Answer

It looks like you're writing the whole audio clip to the WebSocket. The Nexmo interface requires the audio to be sent in 20 ms frames, one per message. This means you need to break your clip into chunks of 320 or 640 bytes (depending on whether you're using 8 kHz or 16 kHz) and write each one to the socket. If you try to write too large a file to the socket, it will close, as you are seeing.

See https://developer.nexmo.com/voice/voice-api/guides/websockets#writing-audio-to-the-websocket for the details.
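
As a rough illustration of "one 20 ms frame per message", here is a minimal sketch. It assumes the clip is already raw 16-bit mono linear PCM at the sample rate negotiated for the call (no WAV header), which you would need to confirm for your text-to-speech output. The names NexmoAudioFrames, FrameSizeBytes, Split and SendAsync are hypothetical helpers, not part of any Nexmo SDK. Padding a short final frame with zeros (silence) is one possible choice, not a documented requirement.

```csharp
using System;
using System.Collections.Generic;
using System.Net.WebSockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of any Nexmo SDK.
public static class NexmoAudioFrames
{
    // 2 bytes per 16-bit sample, 20 ms per frame:
    // 320 bytes at 8 kHz, 640 bytes at 16 kHz.
    public static int FrameSizeBytes(int sampleRateHz) => sampleRateHz * 2 * 20 / 1000;

    // Cuts the clip into fixed-size frames. A new byte[] is zero-filled,
    // so a short final frame ends up padded with silence (an assumption).
    public static List<byte[]> Split(byte[] pcm, int frameSize)
    {
        var frames = new List<byte[]>();
        for (var offset = 0; offset < pcm.Length; offset += frameSize)
        {
            var frame = new byte[frameSize];
            Array.Copy(pcm, offset, frame, 0, Math.Min(frameSize, pcm.Length - offset));
            frames.Add(frame);
        }
        return frames;
    }

    // Sends every frame as its own complete binary message
    // (endOfMessage: true), rather than as fragments of one large message.
    public static async Task SendAsync(WebSocket socket, byte[] pcm, int sampleRateHz, CancellationToken ct)
    {
        foreach (var frame in Split(pcm, FrameSizeBytes(sampleRateHz)))
        {
            await socket.SendAsync(new ArraySegment<byte>(frame), WebSocketMessageType.Binary, true, ct);
        }
    }
}
```

With a 16 kHz call this would be used as `await NexmoAudioFrames.SendAsync(webSocket, ttsAudio, 16000, CancellationToken.None);`. Unlike a loop that passes endOfMessage: false for every chunk but the last, each 640-byte frame here is a separate WebSocket message. The sketch does not pace the frames in real time; whether that is needed is something to check against the guide linked above.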

answered on Stack Overflow Feb 28, 2019 by Sam Machin

User contributions licensed under CC BY-SA 3.0