How do I get rid of '\n' in RISC-V?

Question

How do I get rid of '\n' in RISC-V?

I've been given a task to create a program that would read the name of a file and then copy its insides to other file, name of which is also read from the input. I wrote the program itself, but it appeared to do nothing.

Experimenting further, I've discovered that, whilst reading first string, the program also saves a '\n' character in it, which, obviously, resulted in some issues with searching the target file. I figured out one solution, that I'm not completely fond of, that's why I am here to ask for opinions on the code and overall further improvements, maybe?

I've pinned only the part that is responsible for writing file name to buffer until '\n' appears.

    .text
main:
#first block
    sbrk(128)
    mv s3, a0
    li a7, 8
    li a1, 127
    ecall
    
for:
    lw t0, 0(a0)
    li s1, 0x000000ff
    li s2, 0x0000000a
    
ff_and:
    and t1, t0, s1
    addi s4, s4, 1
    beq t1, s2, kill
    slli s1, s1, 8
    slli s2, s2, 8
    bnez s1, ff_and
    
    addi a0, a0, 4
    b for
    
kill:
    neg s1, s1
    addi s1, s1, -1
    and t0, t0, s1
    sw t0, 0(a0)

string

assembly

io

riscv

rars-simulator

asked on Stack Overflow Mar 31, 2021 by

Glass Puppet • edited Mar 31, 2021 by

Peter Cordes

2 Answers

This is pretty good in that it works! Comments:

As it deals with the string in word sizes — chunks of 4 characters at a time — it is arguably more complicated than the other approach of doing it 1 byte at a time.
Working with 4 bytes at a time, it ignores whether the string in question starts on a word aligned boundary — while that it does in this program's case, there is no such guarantee for arbitrary strings in general. Unaligned word-sized loads & stores are subject to certain hazards, and somewhat depends on the underlying processor, ranging from performance issues to access faults. Fixing the algorithm to accommodating arbitrary alignment, while still working 4 bytes at a time, will add considerable complexity.
Working with 4 bytes at a time, it potentially reads past the end of the string, into another data structure. While this is unlikely to cause any problems when a word aligned string is given, it is generally frowned upon reading past the end of a data structure, and if the processor supports unaligned loads, this have the undesired effect of reading into the next cache line or page not part of the string itself.

answered on Stack Overflow Mar 31, 2021 by

Erik Eidt

It's normal for a line of terminal input to include the terminating newline. If RARS doesn't allow the user to "submit" input without a newline, you could just zero the last byte. But the RARS read-string ecall very inconveniently doesn't return a length, so searching for a \0 is no better than just searching for a \n.

(A Unix read system call will return a length: RARS has that as ecall #63 read which returns a length in a0, so you could use that to read input if it allows fd=0 for stdin.)

Loop efficiency

You're only doing one byte per loop iteration; the only thing you're saving is a byte-load every iteration (lb), at the expense of a lot more ALU work.

The simple way looks like this, and is probably faster on most real-world RISC-V machines. (Especially if they have any cache, which makes it cheap to do multiple nearby loads instead of one wider load.) Unrolling some to hide load latency might be a good idea for high-performance in-order machines, if you really care about optimizing this loop for potentially large inputs. (Which you shouldn't for this use-case since it only runs once per user-input, so just keep it compact for code-size.)

    li  t1, '\n'
 .loop:                     # do{
    lbu   t0, (a0)
    addi  a0, a0, 1
    bne   t0, t1, loop      # }while(*p != '\n')
                 # assume the string will *always* contain a newline,
                 #  otherwise check for 0 as well
    sb    zero, -1(a0)

    # a0 points to one-past-the-end of the terminating 0
    # so if you want the string length, you can get it by subtracting

But there's more to say about the design choices of your word-at-a-time loop:

Since RISC-V has a byte-store instruction, you don't need to mask the word where you found a newline and store the whole word, just sb x0, (position) at the position where you found the newline, even if you find that position by incrementing a counter for every inner-loop shift count (which should also simplify that loop).

Also, storing a whole word is especially bad if your buffer isn't a whole number of aligned words: you don't want to do a non-atomic RMW of bytes past the end of your buffer. That's a very bad habit for thread-safety. (See also Erik's answer re: possible downsides of word-at-a-time in general, and Is it safe to read past the end of a buffer within the same page on x86 and x64?)

(If you were going to mask a word and store it, use not instead of neg / addi -1 to invert the bits in your mask. not is a pseudo-instruction for xori with -1. In general, you can ask a compiler for stuff like that, e.g. https://godbolt.org/z/EPGYGosKd shows how clang implements x & ~mask for RISC-V.)

Fast word-at-a-time

To actually quickly check a whole word at a time for a newline byte, do word ^ 0x0a0a0a0a to map that byte value to 0, and other values to non-zero. Then use the bithack for finding if a word has a zero byte https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. (Like what glibc's portable-C fallback strlen does: Why does glibc's strlen need to be so complicated to run quickly?). IIRC, it's not an exact test (false positive matches are possible), so you'd want to quickly check a whole word, then loop over the bytes checking one at a time to make sure. If none, go back into the word loop.

Of course even better would be if you had some SIMD support for doing 4 or 8 (or 16) byte-compares in parallel, with RV32 P (packed-SIMD) or RV32 V (vector) extensions.

If you're doing this on a buffer you didn't allocate, you'd probably want to do one unaligned load (after checking it's not going to cross a page or maybe cache-line boundary), then get to an alignment boundary for aligned word loads. Or loop byte-at-a-time until a word boundary. (Or double-word on RV64).

answered on Stack Overflow Mar 31, 2021 by

Peter Cordes • edited Mar 31, 2021 by

Peter Cordes

User contributions licensed under CC BY-SA 3.0