REPNZ SCAS Assembly Instruction Specifics

19

I am trying to reverse engineer a binary and the following instruction is confusing me, can anyone clarify what exactly this does?

=>0x804854e:    repnz scas al,BYTE PTR es:[edi]
  0x8048550:    not    ecx

Where:

EAX: 0x0
ECX: 0xffffffff
EDI: 0xbffff3dc ("aaaaaa\n")
ZF:  1

I see that it is somehow decrementing ECX by 1 each iteration, and that EDI is incrementing along the length of the string. I know it calculates the length of the string, but as far as exactly HOW it's happening, and why "al" is involved I'm not quite sure.

assembly
x86
reverse-engineering
asked on Stack Overflow Nov 6, 2014 by Michael Scott • edited Nov 6, 2014 by Drew McGowen

3 Answers

33

I'll try to explain it by reversing the code back into C.

Intel's Instruction Set Reference (Volume 2 of Software Developer's Manual) is invaluable for this kind of reverse engineering.

REPNE SCASB

The logic for REPNE and SCASB combined:

while (ecx != 0) {
    temp = al - *(BYTE *)edi;
    SetStatusFlags(temp);
    if (DF == 0)   // DF = Direction Flag
        edi = edi + 1;
    else
        edi = edi - 1;
    ecx = ecx - 1;
    if (ZF == 1) break;
}

Or more simply:

while (ecx != 0) {
    ZF = (al == *(BYTE *)edi);
    if (DF == 0)
        edi++;
    else
        edi--;
    ecx--;
    if (ZF) break;
}
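The loop above can also be turned into something compilable. What follows is only a minimal sketch of that logic in plain C, not how the processor implements it: the `Regs` struct and the `repne_scasb` function are hypothetical names made up for illustration, and only the registers and flags that the instruction touches are modelled.

```c
#include <stdint.h>

/* Hypothetical register file for illustration; field names mirror x86. */
typedef struct {
    uint8_t        al;   /* byte being searched for */
    uint32_t       ecx;  /* remaining iteration count */
    const uint8_t *edi;  /* current scan position */
    int            zf;   /* zero flag: result of the last comparison */
    int            df;   /* direction flag: 0 = upward, 1 = downward */
} Regs;

/* Software model of REPNE SCASB, following the loop in the text. */
static void repne_scasb(Regs *r)
{
    while (r->ecx != 0) {
        r->zf = (r->al == *r->edi);   /* compare al with the byte at edi */
        r->edi += r->df ? -1 : 1;     /* step edi according to DF */
        r->ecx--;                     /* count this iteration */
        if (r->zf)
            break;                    /* stop on the first match */
    }
}
```

Starting from the register state in the question (al = 0, ecx = 0xffffffff, edi pointing at "aaaaaa\n", df = 0), this model stops with ecx equal to 0xfffffff7 and edi eight bytes further on: seven characters plus the terminator.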

String Length

However, the above is insufficient to explain how it computes the length of a string. Based on the presence of the not ecx in your question, I'm assuming the snippet belongs to this idiom (or similar) for computing string length using REPNE SCASB:

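sub ecx, ecx    ; ecx = 0
sub al, al      ; al = 0, the byte to search for
not ecx         ; ecx = 0xffffffff, the maximum count
cld             ; DF = 0, so edi increments
repne scasb     ; scan until [edi] == al or ecx == 0
not ecx         ; ecx = length + 1
dec ecx         ; ecx = length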

Translating to C and using our logic from the previous section, we get:

ecx = (unsigned)-1;
al = 0;
DF = 0;
while (ecx != 0) {
    ZF = (al == *(BYTE *)edi);
    if (DF == 0)
        edi++;
    else
        edi--;
    ecx--;
    if (ZF) break;
}
ecx = ~ecx;
ecx--;

Simplifying using al = 0 and DF = 0:

ecx = (unsigned)-1;
while (ecx != 0) {
    ZF = (0 == *(BYTE *)edi);
    edi++;
    ecx--;
    if (ZF) break;
}
ecx = ~ecx;
ecx--;

Things to note:

  • in two's complement notation, flipping the bits of ecx is equivalent to -1 - ecx.
  • in the loop, ecx is decremented before the loop breaks, so it decrements by length(edi) + 1 in total.
  • in practice ecx never reaches zero inside the loop, since that would require a string with no terminator anywhere in the 4 GiB address space.

So after the loop above, ecx contains -1 - (length(edi) + 1), which is the same as -(length(edi) + 2). Flipping the bits gives length(edi) + 1, and the final decrement gives length(edi).
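To make the arithmetic concrete, here is a small sketch that records ecx at each of those three points. The `Trace` struct and the `trace_idiom` function are hypothetical names used only for this illustration, and 32-bit registers are assumed.

```c
#include <stdint.h>

/* Hypothetical record of ecx at three points in the idiom. */
typedef struct {
    uint32_t after_scan;  /* ecx right after repne scasb */
    uint32_t after_not;   /* ecx after the second not ecx */
    uint32_t after_dec;   /* ecx after dec ecx */
} Trace;

static Trace trace_idiom(const char *s)
{
    Trace t;
    uint32_t ecx = 0xffffffffu;        /* sub ecx, ecx / not ecx */

    while (ecx != 0) {                 /* repne scasb with al = 0, DF = 0 */
        ecx--;                         /* one decrement per byte compared */
        if (*s++ == '\0')
            break;                     /* the terminator was just compared */
    }

    t.after_scan = ecx;                /* -(length + 2) as a 32-bit value */
    t.after_not  = ~ecx;               /* length + 1 */
    t.after_dec  = t.after_not - 1;    /* length */
    return t;
}
```

For the string from the question, "aaaaaa\n" (seven characters), the three recorded values are 0xfffffff7, 8 and 7.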

Or rearranging the loop and simplifying:

const char *s = edi;
size_t c = (size_t)-1;      // c == -1
while (*s++ != '\0') c--;   // c == -1 - length(s)
c = ~c;                     // c == length(s)

And inverting the count:

size_t c = 0;
while (*s++ != '\0') c++;

which is the strlen function from C:

size_t strlen(const char *s) {
    size_t c = 0;
    while (*s++ != '\0') c++;
    return c;
}

answered on Stack Overflow Nov 8, 2014 by QuasarDonkey • edited Nov 9, 2014 by QuasarDonkey

18

AL is involved because scas scans memory for the value in AL. AL has been zeroed so that the instruction finds the terminating zero at the end of the string. scas itself increments (or decrements, depending on the direction flag) EDI automatically. The REPNZ prefix (which is more readable in its REPNE form) repeats the scas as long as the comparison is false (REPeat while Not Equal) and ECX is nonzero. It also decrements ECX automatically on every iteration. ECX has been initialized to the longest possible string length so that the count does not terminate the loop early.

Since ECX counts down from 0xffffffff (also known as -1), the number of bytes scanned, including the terminating zero, is -1-ECX, which thanks to two's complement arithmetic can be calculated with a single NOT instruction. Subtracting one more then gives the length of the string.

answered on Stack Overflow Nov 6, 2014 by Jester

3

It compares the byte at es:[edi] to whatever is in al and repeats this step until either ecx becomes zero or the value at es:[edi] matches the value in al. After each step, edi is incremented (assuming the direction flag is clear) so that it points to the next byte in memory. The instruction that follows then applies not to the counter (ecx).

repnz means "repeat while the zero flag is not set and ecx is not zero". Each iteration decrements ecx. scas, or more precisely scasb, compares the value in al to the memory operand (always es:[edi] or es:[di], depending on the address size), then sets the flags accordingly (the zero flag is set if the two values are equal) and increments (or decrements, based on the direction flag) edi.
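Because the loop has two exit conditions, code that uses a finite count has to look at the zero flag afterwards to tell "found" from "ran out of bytes". The sketch below models that for DF = 0. `scan_bounded` is a hypothetical helper name, and treating a starting count of zero as "not found" is a simplification of this sketch: the real instruction leaves the flags untouched in that case.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper modelling REPNE SCASB with DF = 0 and a finite count.
 * Returns a pointer to the first byte equal to al, or NULL if ecx bytes
 * were compared without a match (the exit where ZF is still clear). */
static const uint8_t *scan_bounded(const uint8_t *edi, uint32_t ecx, uint8_t al)
{
    int zf = 0;  /* assumption: a zero count is treated as "not found" */

    while (ecx != 0) {
        zf = (al == *edi);  /* scasb: compare al with the byte at edi */
        edi++;              /* DF = 0, so edi moves upward */
        ecx--;              /* repnz: one decrement per iteration */
        if (zf)
            break;          /* exit 1: match found */
    }                       /* exit 2: ecx reached zero */

    /* edi now points one past the last byte compared. */
    return zf ? edi - 1 : NULL;
}
```

With the buffer "aaaaaa\n", searching for '\n' with a count of 7 returns a pointer to index 6, while the same search with a count of 6 returns NULL because the count runs out first.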

answered on Stack Overflow Nov 6, 2014 by Powerslave • edited Apr 29, 2020 by evandrix

User contributions licensed under CC BY-SA 3.0