Efficient byte pattern search in matlab memory map

Question

Efficient byte pattern search in matlab memory map

I have large binary files (2+ GB) that are arranged with a sync pattern (0xDEADBEEF) followed by a data block of a fixed size. Example:

0xDE AD BE EF ... 96 bytes of data
0xDE AD BE EF ... 96 bytes of data
... repeat ...

I need to locate the offsets to the start of each packet. Ideally this would just be [1:packetSize:fileSize] However, there is other data that can be interspersed, headers etc. so I need to search the file for the sync pattern.

I am using the following code which is based on Loren from Mathworks findPattern2 but modified a little to use a memory map.

function pattLoc = findPattern(fileName, bytePattern)
%Mem Map file
m = memmapfile(fileName);
% Find candidate locations match the first element in the pattern.
pattLoc = find(m.data==bytePattern(1));
%Remove start values that are too close to the end to possibly match
len = numel(bytePattern);
endVals = pattLoc+len-1;
pattLoc(endVals>length(m.data)) = [];
% loop over elements of Sync Pattern to check possible location validity.
for pattval = 2:len
    % check viable locations in array
    locs = bytePattern(pattval) == m.data(pattLoc+pattval-1);    
    pattLoc(~locs) = []; % delete false ones from indices
end

This works pretty well. However, I think there might be room for improvement. First I know my patterns can't be closer than packetSize (100 in this example) but may be farther apart. Seems like I should be able to use this information somehow to speed up the search. Second the initial search on line 5 uses a find for numerical indexing instead of logical. This line takes almost twice as long as leaving it as a logical. However, I tried to rework this function using logical indexing only and failed miserably. The problem arises inside the loop and keeping track of the nested indexing with logicals without using more finds ... or checking more data than needs to be.

So any help speeding this up would be appreciated. Below is some code which will create a simple sample binary file to work with if necessary.

function genSampleFile(numPackets)
pattern = hex2dec({'DE','AD','BE','EF'});
fileName = 'testFile.bin';
fid = fopen(fileName,'w+');
for f = 1:numPackets
    fwrite(fid,[pattern; uint8(rand(96,1)*255)],'uint8');
end
fclose(fid);

Searching a file with 10000000 packets took the following:

>> genSampleFile(10000000); %Warning Makes 950+MB file
>> tic;pattLoc = findPattern(fileName, pattern);toc
Elapsed time is 4.608321 seconds.

matlab

asked on Stack Overflow Apr 18, 2016 by

Aero Engy • edited Sep 20, 2018 by

Aero Engy

1 Answer

You can get an immediate boost by using findstr, or better yet strfind instead of find:

pattLoc = strfind(m.data, bytePattern)

This removes the need for any further looping. You just need to clean up a couple of things in the returned array of indices.

You want to remove things that are closer than 100 bytes to the end, not closer than 4 bytes to the end, so set len = 100, instead of len = length(bytePattern).

To filter out elements that are closer than 100 bytes from each other, use diff on the list of indices:

pattLoc[diff(pattLoc) < 100] = []

This should speed up your code by relying more on builtins, which are generally much more efficient than loops.

answered on Stack Overflow Jul 12, 2016 by

Mad Physicist

User contributions licensed under CC BY-SA 3.0