Media Foundation - Problems using aggregate media source with audio/video capture


I have been tasked with creating an application that takes an audio/video capture input (using an Elgato Cam Link) and outputs it back to the user. Microsoft's Audio/Video Capture in Media Foundation documentation suggests: "If you want to combine audio capture with video capture, use the aggregate media source."

I lifted a majority of the code that I needed from topoedit's source code, but topoedit does not make use of aggregate sources anywhere, so I'm flying somewhat blind when it comes to the proper usage of them.

I am encountering two different problems when using an aggregate media source.

  1. Frame rate
    I have one capture device (Elgato Cam Link) that successfully plays back audio and video when I use an aggregate media source, but for some reason the frame rate is worse than if I run two completely separate topologies (one for video, one for audio).

  2. Error MF_E_TOPO_CODEC_NOT_FOUND (0xC00D5212) when calling IMFTopoLoader::Load()
    When I attempt to use the aggregate media source on two different laptops (using whatever built-in webcam/microphone combo they have), I encounter this error. This confuses me because everything runs fine if I use two completely separate topologies (one for video, one for audio).
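One way to put numbers on problem 1 is to compare the negotiated frame rate with the rate at which samples actually arrive. Media Foundation stores MF_MT_FRAME_RATE as a UINT64 with the numerator in the upper 32 bits and the denominator in the lower 32 bits (MFGetAttributeRatio unpacks it), and sample times are in 100-nanosecond units. The sketch below is portable C++ with hypothetical helper names (UnpackFrameRate, RatioToFps, MeasuredFps); it assumes you have already copied the packed attribute value and a run of sample timestamps out of your pipeline, however you choose to collect them.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper type: a frame rate as numerator/denominator.
struct Ratio {
    std::uint32_t numerator;
    std::uint32_t denominator;
};

// Splits a packed MF_MT_FRAME_RATE value
// (numerator in the high 32 bits, denominator in the low 32 bits).
Ratio UnpackFrameRate(std::uint64_t packed) {
    return { static_cast<std::uint32_t>(packed >> 32),
             static_cast<std::uint32_t>(packed & 0xFFFFFFFFu) };
}

// Converts the ratio to frames per second; returns 0 for a zero denominator.
double RatioToFps(Ratio r) {
    return r.denominator == 0
        ? 0.0
        : static_cast<double>(r.numerator) / r.denominator;
}

// Average frame rate implied by consecutive sample times expressed in
// 100-ns units (10,000,000 units per second).
double MeasuredFps(const std::vector<std::int64_t>& sampleTimes) {
    if (sampleTimes.size() < 2) return 0.0;
    const double span =
        static_cast<double>(sampleTimes.back() - sampleTimes.front());
    if (span <= 0.0) return 0.0;
    const double frames = static_cast<double>(sampleTimes.size() - 1);
    return frames * 10000000.0 / span;
}
```

If the measured rate with the aggregate source is clearly below the negotiated rate, while the two-topology version matches it, that would suggest the slowdown comes from the pipeline rather than from the device or the chosen media type.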

Primary setup:

  • OS: Windows 10
  • Capture device: Elgato Cam Link (a USB stick with an HDMI input)
  • Language: C++

Just to rule out some suggestions ahead of time...

  • Just use DirectShow!
    I wish I could. The only reason I've been tasked with this is because our DirectShow solution is having severe frame rate issues for the Elgato Cam Link device. We were unable to figure out why, so Media Foundation seemed like our best bet.

  • Use two separate topologies!
    This is how I currently have it working. It just feels wrong, you know? The documentation says to use an aggregate source, so I wanted to do my due diligence. If I don't get an answer to this question, I can at least rest easy knowing that I have something functional, but I'm mainly asking in case I could be doing something better. Skype has to be doing something like this, right? (On second thought, Skype doesn't play your own audio back out of your speakers, so perhaps not.) Surely someone out there knows how to use an aggregate source.

Given that my program runs successfully with the Elgato device, I suspect I'm not doing anything horribly wrong. If there's a problem, it's probably in code that I don't have, rather than anything I could show.

With that said, here's where I'm creating the aggregate source (excluding error-checking for brevity).

HRESULT GenerateAggregateSource(IMFMediaSource*& pAggSource, IMFMediaSource* pSource1, IMFMediaSource* pSource2)
{
    pAggSource = NULL;

    HRESULT hr = S_OK;

    IMFCollection* pSourceCollection = NULL;
    hr = ::MFCreateCollection(&pSourceCollection);

    hr = pSourceCollection->AddElement(pSource1);

    hr = pSourceCollection->AddElement(pSource2);

    hr = ::MFCreateAggregateSource(pSourceCollection, &pAggSource);
    pSourceCollection->Release();   // Done with this

    return hr;
}

Later, I connect the EVR to stream descriptor 0 and the SAR to stream descriptor 1.
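Hard-coding stream 0 to the EVR and stream 1 to the SAR assumes a particular stream order and exactly one stream per device. A more defensive approach is to inspect each stream's major type (in real code: IMFPresentationDescriptor::GetStreamDescriptorByIndex, then IMFStreamDescriptor::GetMediaTypeHandler and IMFMediaTypeHandler::GetMajorType, comparing against MFMediaType_Video and MFMediaType_Audio) and choose the sink from that. The sketch below models only that decision in portable C++; MajorType, SinkKind and AssignSinks are hypothetical stand-ins, not Media Foundation types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the GUID returned by GetMajorType()
// and for the two renderers.
enum class MajorType { Video, Audio, Other };
enum class SinkKind { None, Evr, Sar };

// For each stream (in presentation-descriptor order), decide which sink
// it should be connected to. Only the first video stream and the first
// audio stream get a sink; any other streams are left unconnected.
std::vector<SinkKind> AssignSinks(const std::vector<MajorType>& streams) {
    std::vector<SinkKind> result(streams.size(), SinkKind::None);
    bool haveVideo = false;
    bool haveAudio = false;
    for (std::size_t i = 0; i < streams.size(); ++i) {
        if (streams[i] == MajorType::Video && !haveVideo) {
            result[i] = SinkKind::Evr;
            haveVideo = true;
        } else if (streams[i] == MajorType::Audio && !haveAudio) {
            result[i] = SinkKind::Sar;
            haveAudio = true;
        }
    }
    return result;
}
```

Streams that end up with SinkKind::None would, in the real topology, also be deselected with IMFPresentationDescriptor::DeselectStream so the session does not expect them to be rendered.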

And here's where I call IMFTopoLoader::Load(), if you're more keen on solving problem number 2 (again, excluding error-checking for brevity).

HRESULT ResolveTopology(IMFTopology*& pTopology, IMFMediaSession* pMediaSession)
{
    HRESULT hr = S_OK;

    IMFTopoLoader* pTopoLoader = NULL;
    hr = ::MFCreateTopoLoader(&pTopoLoader);

    IMFTopology* pFullTopology = NULL;
    hr = pTopoLoader->Load(pTopology, &pFullTopology, NULL);
    // Laptop webcams seem to encounter this error here
    //MF_E_TOPO_CODEC_NOT_FOUND

    hr = pMediaSession->SetTopology(MFSESSION_SETTOPOLOGY_IMMEDIATE, pFullTopology);

    // Swap the topology we're holding
    pTopology->Release();
    pTopology = pFullTopology;

    // Done with this
    pTopoLoader->Release();

    return hr;
}
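Because the snippet omits error checking, a failing Load() is easy to miss: pFullTopology is presumably still NULL at that point and is then passed straight to SetTopology(). When you do check hr, splitting the code into its fields confirms what kind of failure it is. The portable sketch below mirrors the bit layout used by the HRESULT_SEVERITY, HRESULT_FACILITY and HRESULT_CODE macros from winerror.h; HResultParts and DecodeHResult are hypothetical names. MF_E_TOPO_CODEC_NOT_FOUND (0xC00D5212) decodes to a failure in facility 0xD, the facility that Media Foundation's MF_E_* codes share.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded view of a 32-bit HRESULT.
struct HResultParts {
    bool failed;              // bit 31: 1 means failure
    std::uint32_t facility;   // bits 16-28
    std::uint32_t code;       // bits 0-15
};

// Same bit extraction as HRESULT_SEVERITY / HRESULT_FACILITY / HRESULT_CODE.
HResultParts DecodeHResult(std::uint32_t hr) {
    return { ((hr >> 31) & 0x1u) != 0,
             (hr >> 16) & 0x1FFFu,
             hr & 0xFFFFu };
}
```

In the real function, testing `FAILED(hr)` right after Load() and returning before SetTopology() avoids handing a NULL topology to the session and keeps the original topology pointer intact.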

Update with requested topology:

Here is a crude example of my partial topology. It's probably what you'd expect, since I'm not trying to do anything fancy.

                 ⇗ EVR
Agg. Source (A/V)
                 ⇘ SAR

Once I resolve the topology, what extra stuff gets injected? I'm not sure how to walk the chain and query what all is in the topology at that point. If you trust what topoedit does with the audio and video sources separately, then it looks like this.

                 ⇗ {CF862982-23B0-4E3D-8C76-D03FEF084AF8} ⇒ EVR
Agg. Source (A/V)
                 ⇘ SAR

What is {CF862982-23B0-4E3D-8C76-D03FEF084AF8}? I'm not sure, and I'm not sure how to find out.
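To see what the topology loader inserted, you can walk the resolved topology: IMFTopology::GetNodeCount and IMFTopology::GetNode enumerate the nodes, IMFTopologyNode::GetNodeType distinguishes source-stream, transform, tee and output nodes, and IMFTopologyNode::GetOutputCount/GetOutput follow the connections downstream. Transform nodes may carry the CLSID of their MFT in the MF_TOPONODE_TRANSFORM_OBJECTID attribute, and passing a registered MFT's CLSID to MFTGetInfo returns its friendly name, which is one way to find out what {CF862982-23B0-4E3D-8C76-D03FEF084AF8} is. The sketch below models only the traversal in portable C++, with a hypothetical Node struct standing in for IMFTopologyNode; the labels in the test are illustrative, not captured from a real session.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for IMFTopologyNode: a label (for a transform this
// would be the CLSID read from MF_TOPONODE_TRANSFORM_OBJECTID) and the
// indices of the downstream nodes connected to its outputs.
struct Node {
    std::string label;
    std::vector<int> outputs;
};

// Follows output 0 from `start` until reaching a node with no outputs and
// returns the chain as "A -> B -> C". Other outputs (e.g. of a tee node)
// are ignored in this simplified sketch. The step limit guards against a
// malformed, cyclic input; real topologies are acyclic.
std::string DescribeChain(const std::vector<Node>& nodes, int start) {
    std::string chain = nodes[start].label;
    int current = start;
    for (std::size_t steps = 0;
         steps < nodes.size() && !nodes[current].outputs.empty(); ++steps) {
        current = nodes[current].outputs[0];
        chain += " -> " + nodes[current].label;
    }
    return chain;
}
```

In the real walk you would start from every node whose type is MF_TOPOLOGY_SOURCESTREAM_NODE, which gives one chain for the video stream and one for the audio stream.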

I will update the above topology when I have access to the Elgato again. I am not currently in possession of it.

Update 2: The client this was written for doesn't seem to care enough to provide any feedback (good or bad), so I don't care enough to follow up on this weird issue.

video-capture
ms-media-foundation
audio-capture
asked on Stack Overflow Apr 8, 2019 by Brandon • edited May 28, 2019 by Brandon

0 Answers



User contributions licensed under CC BY-SA 3.0