PHP - Check character for EOL


Combining answers from here and here, I created a function that checks whether the character I'm looking at is an EOL. I need it for strings with mixed line endings and possibly mixed encodings. I might even use it to sanitize the input by replacing all line endings with \n.

// Check if a (possibly multibyte) character is EOL.
// Entries are ASCII/UTF-8 byte sequences written with PHP's "\xHH" escapes;
// UTF-16 and UTF-32 input would need different byte sequences.
protected function _is_eol($char) {
    static $eols = array(
            "\x0D\x0A",     // CR+LF (U+000D U+000A): Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
            "\x0A\x0D",     // LF+CR: BBC Acorn, RISC OS spooled text output
            "\x0A",         // LF: Line Feed, U+000A: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
            "\x0B",         // VT: Vertical Tab, U+000B
            "\x0C",         // FF: Form Feed, U+000C
            "\x0D",         // CR: Carriage Return, U+000D: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
            "\xC2\x85",     // NEL: Next Line, U+0085 (UTF-8 encoded)
            "\xE2\x80\xA8", // LS: Line Separator, U+2028 (UTF-8 encoded)
            "\xE2\x80\xA9", // PS: Paragraph Separator, U+2029 (UTF-8 encoded)
            "\x1E",         // [ASCII] RS: QNX (pre-POSIX)
            "\x15"          // [EBCDIC] NEL: OS/390, OS/400 (this byte is NAK in ASCII)
    );
    // Strict comparison, so only exact byte-for-byte matches count as EOL.
    return in_array($char, $eols, true);
}

I might need to peek at the next character when the current character is CR or LF, so I don't mistake CRLF or LFCR for two line endings, but otherwise this looks good to me. The problem is that I know little about encodings and have no data to test with yet.
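For the sanitizing idea, one possible shortcut is to let PCRE do the peeking. This is only a sketch: `normalize_eol` is a hypothetical helper name, and it assumes the input is already valid UTF-8. The `\R` escape treats CR+LF as a single line ending and also matches a lone LF or CR; by default it additionally covers VT, FF, NEL and, in UTF-8 mode, LS and PS. It does not treat LF+CR as one unit, so that case would still need separate handling.

```php
<?php
// Hypothetical helper, not part of the original class.
// Assumption: $text is valid UTF-8. With the /u flag, preg_replace()
// returns null when the subject is not valid UTF-8.
function normalize_eol($text) {
    // \R matches CR+LF as one unit before trying the single-character endings.
    return preg_replace('/\R/u', "\n", $text);
}

$mixed = "one\r\ntwo\rthree\nfour";
$clean = normalize_eol($mixed);
echo bin2hex($clean), "\n";
```

After normalizing, every line ending is a single "\n", so a per-character EOL check no longer has to look ahead.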

Are there any fatal mistakes in my approach?
Am I missing line separators from other popular encodings?
The code says [UNICODE] but isn't there a difference between utf8/16/32?
I found this snippet on GitHub:

// Compare with ===; a single = would assign and make the first branch always run.
if ($this->file_encoding === 'UTF-16LE') {
    $this->line_separator = "\x0A\x00";
}
elseif ($this->file_encoding === 'UTF-16BE') {
    $this->line_separator = "\x00\x0A";
}
elseif ($this->file_encoding === 'UTF-32LE') {
    $this->line_separator = "\x0A\x00\x00\x00";
}
elseif ($this->file_encoding === 'UTF-32BE') {
    $this->line_separator = "\x00\x00\x00\x0A";
}

That made me think I might be missing some. If I'm not mistaken, the last one, "\x00\x00\x00\x0A", would be "0x0000000A"?
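One way to check byte sequences like these without test data is to convert a known line feed into each encoding and dump the bytes. This is a minimal sketch that assumes the mbstring extension is available; the variable names are only illustrative.

```php
<?php
// Assumption: the mbstring extension is loaded.
// Convert a single LF (U+000A) from UTF-8 into several encodings
// and show the resulting bytes as hex.
$encodings = array('UTF-8', 'UTF-16LE', 'UTF-16BE', 'UTF-32LE', 'UTF-32BE');
$hex = array();
foreach ($encodings as $enc) {
    $hex[$enc] = bin2hex(mb_convert_encoding("\n", $enc, 'UTF-8'));
    echo $enc, ': ', $hex[$enc], "\n";
}
// Prints:
// UTF-8: 0a
// UTF-16LE: 0a00
// UTF-16BE: 000a
// UTF-32LE: 0a000000
// UTF-32BE: 0000000a
```

So the same code point U+000A produces a different byte sequence in each encoding, and the UTF-32BE form is the four bytes 00 00 00 0A. The same loop could be run for U+0085, U+2028 and U+2029 to build a per-encoding table.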

php
unicode
encoding
utf-8
eol
asked on Stack Overflow Jun 29, 2014 by FrankM • edited May 23, 2017 by Community

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0