Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rtext] Fix GetCodepointNext() to return default value on invalid input with size=0 #2997

Merged
merged 4 commits into from
Apr 6, 2023

Conversation

chocolate42
Copy link
Contributor

Fixes #2996

  • Fixed GetCodepointNext in the minimal way possible, when the first byte of an encoding is invalid '?' is now returned with size=0. Not all invalid input is replaced with '?' but the vast majority is
  • Checked all usage of GetCodepointNext in rtext.c and rtextures.c, there are two places size==0 were not properly handled, LoadCodepoints() and GetCodepointPrev()
  • LoadCodepoints() now handles size==0 the same as everywhere else in the code, without the fix there would have been a buffer overflow on invalid input
  • Left GetCodepointPrev as-is
  • Internally raylib should be fixed, but external code may break when encountering invalid input if they never properly handled size==0, for either GetCodepointNext or GetCodepointPrev

There's a few remaining issues to become bulletproof that haven't been fixed, probably won't be implemented but mentioned for completeness:

  • Checking that the last codepoint is not truncated (arbitrary data can be read up to 3 bytes past the buffer which might but unlikely lead to a crash, very unlikely to trigger accidentally and it can be hotfixed externally if desired by overallocating the input buffer by 4)
  • Checking 10xxxxxx of bytes 1..3 for multibyte encodings (currently a valid encoding is returned even if it's invalid, not a big deal if good input is expected)
  • Overlong encodings are technically invalid UTF-8 but there's nothing wrong with allowing them

Tested on Linux with this.

#include "raylib.h"
#include <stdio.h>
int main(int argc, char *argv) {
	char t[]={0,0,0,0}, t2[]={0xFF,1,1,0xFE,1,0};
	int len, codepoint, count;
	//GetCodepointNext
	for(int i=0;i<256;++i){
		t[0]=i;
		codepoint=GetCodepointNext(t, &len);
		printf("%d : codepoint %d length %d\n", i, codepoint, len);
	}
	//LoadCodepoints
	LoadCodepoints(t2, &count);
	printf("count %d\n", count);
}

…nput. Modify LoadCodepoints to work when GetCodepointNext returns a size of 0. All internal use of GetCodepointNext and GetCodepointPrev checked. This fix may break external code dealing with invalid input as the old code erroneously never returned a size of 0, external code that doesn't properly check for size=0 may endlessly loop or overflow a buffer on invalid input.
@orcmid
Copy link
Contributor

orcmid commented Apr 1, 2023

  1. I'm a big fan of over-allocating and having pad bytes (usually \0) where they help keep things sane. It makes scanning simpler.
  2. I assume you mean overlong in the sense that the UTF-8 for the particular code point could have been in a shorter sequence. If that's the case, I agree the result should be accepted, even though there is a lazy producer somewhere.
  3. I don't understand the 10xxxxxx bytes comment. In what way is a valid encoding returned for an invalid sequence and what does it do for position in the UTF-8 stream? I suspect this is a good place for defensive behavior of some sort, whether returning SUB or `?'.
  4. So, technically, this is a bug fix but the fix might trigger bugs in code that failed to defend against size==0 as an edge case.? I'm unclear for the rationale around returning ? but also specifying size == 0. Under other conditions I would expect size to indicate how many bytes were consumed if one were stepping through a series of encoded characters, but not including anything beyond the invalid sequence. So in the examples, I assume all t[0] (unsigned) values 128 to 255 would fail, but still with size == 1. Here I'm confessing I am not clear on the precise context for the use of these raylib API functions.

@chocolate42
Copy link
Contributor Author

I assume you mean overlong in the sense that the UTF-8 for the particular code point could have been in a shorter sequence.

Exactly that. Not something anyone is likely to actually encounter but I mention it for completeness.

I don't understand the 10xxxxxx bytes comment. In what way is a valid encoding returned for an invalid sequence and what does it do for position in the UTF-8 stream?

All subsequent bytes in a multibyte UTF-8 encoding have the pattern 10xxxxxx, a 2 bit multibyte identifier followed by a 6 bit payload. This char t[]={0xF1, 0x80, 0x80, 0x80} is a valid 4 byte UTF-8 encoding of codepoint=1<<18. This char t[]={0xF1, 0xC0, 0x80, 0x80} is invalid but results in a valid codepoint with the same value (as does any permutation where one or more of the last 3 bytes start with something other than 0b10). Not checking for 10xxxxxx lets some erroneous values slip through instead of being replaced with one or more ? characters. With or without checking 10xxxxxx the input should quickly sync to decoding properly, but a fully checked version would emit a few more total codepoints.

So, technically, this is a bug fix but the fix might trigger bugs in code that failed to defend against size==0 as an edge case.?

Yes. I don't know how long it's been like this, but bad input should be rare enough that even if there are bugs in code it's unlikely they'll be triggered.

I'm unclear for the rationale around returning ? but also specifying size == 0

The size==0 is so that validation can happen elsewhere. It is a little odd to me too that a default value and an error state is returned, but I just restored the intent. It has led to some places internally checking the size for an error, and others checking for ?.

Under other conditions I would expect size to indicate how many bytes were consumed if one were stepping through a series of encoded characters

That is the intent, with a weird default edge-case of size==0 that has to be checked. Maybe when the default ? is triggered it should return size=1. This would simplify things including the fix no longer potentially breaking flakey external code, but also makes the function unsuitable for validation as we're papering over the cracks. As it stands it's in a bit of a hybrid state where we expect good input but can't assume good output.

, but not including anything beyond the invalid sequence.

All internal use of the function checks for size=0 and replaces it with size=1 anyway, meaning it keeps reading beyond an invalid code. Ignoring the error is another indicator that size=0 should probably be eliminated entirely.

So in the examples, I assume all t[0] (unsigned) values 128 to 255 would fail, but still with size == 1

Valid first byte values are 0..127 for size=1, 192..223 for size=2, 224..239 for size=3, 240..247 for size=4. Everything else is invalid and returns size=0 with this PR, before this PR invalid first-byte values would erroneously be returned as valid with codepoint=t[0], size=1

Here I'm confessing I am not clear on the precise context for the use of these raylib API functions.

raylib uses it internally to read UTF-8 strings where allowing bad input is desirable so it doesn't fail, and it's available for external use. As the error state of size=0 is not really used internally you have to assume it's for external use only, in which case a separate function would probably be better that also does full validation.

Personally I would have the default case return size=1 and also check 10xxxxxx so that all invalid input is replaced with ?, and let the programmer overallocate externally if they want to be cautious instead of adding expensive bounds checking. This would have the benefit of not breaking dodgy external code and also simplify internal code that uses GetCodepointNext, GetCodepointNext tends to be used like this

        int letter = GetCodepointNext(ptr, &next);

        if (letter == 0x3f) ptr += 1;
        else ptr += next;

, eliminating size=0 would eliminate the branch

        int letter = GetCodepointNext(ptr, &next);
        ptr += next;

@orcmid
Copy link
Contributor

orcmid commented Apr 2, 2023

@chocolate42

Personally I would have the default case return size=1 and also check 10xxxxxx so that all invalid input is replaced with ?, and let the programmer overallocate externally if they want to be cautious instead of adding expensive bounds checking. This would have the benefit of not breaking dodgy external code and also simplify internal code that uses GetCodepointNext, GetCodepointNext tends to be used like this

        int letter = GetCodepointNext(ptr, &next);

        if (letter == 0x3f) ptr += 1;
        else ptr += next;

, eliminating size=0 would eliminate the branch

        int letter = GetCodepointNext(ptr, &next);
        ptr += next;

I'm included to lean toward that also. There are two possible code points to return other than "?" (in case that's the actual codepoint).

There's the ASCII SUB code, U+001A, which is intended for precisely this purpose and also U+FFFD SUBSTITUTE which has this nice "?"-in-diamond glyph and which I see occasionally in Twitter subject-lines in my reader.

Either way, I think we are talking about breaking changes and I'm unclear what provides the softest landing.

Also, I can see using U+001A as pretty universally sensible and a client could substitute U+FFFD as part of any presentation. This would have users see that something was amiss without stopping the train.

@chocolate42
Copy link
Contributor Author

There's the ASCII SUB code, U+001A, which is intended for precisely this purpose and also U+FFFD SUBSTITUTE which has this nice "?"-in-diamond glyph and which I see occasionally in Twitter subject-lines in my reader.

We probably need to be mindful of basic fonts that might only define ascii or even just the obviously visible part of ascii. If U+001A is pretty universally defined then it's suitable but I don't think U+FFFD can be suitable (not a font guy). I've assumed 0x3f was used specifically because it is universal. This function is used by rtextures to convert a string to an image, I could be wrong but it looks part of the main way text is displayed by raylib, guaranteeing a visual representation of bad input seems important so the dev can notice and fix it.

Either way, I think we are talking about breaking changes and I'm unclear what provides the softest landing.

The current PR is a breaking change, not from the intended functionality but from the functionality that's been in place for who knows how long. IMO the softest landing is definitely to remove size=0 as a possibility (after all it has been accidentally removed already and no one has seemed to notice), and fixing it so that erroneous codepoints cannot be returned at all just makes sense (if we want to replace bad input with ? why are we only partially doing that?). I'll modify the PR to that effect later today.

@raysan5
Copy link
Owner

raysan5 commented Apr 2, 2023

Excuse the late response, I'm on vacation and a bit out of the loop but I think avoiding the size=0 is the best option, just return the processed bytes, independently if they could processed as a recognizable codepoint.

anon added 3 commits April 2, 2023 11:00
…stead of 0. This matches existing prod behaviour and guarantees size 1..4 is returned. Simplify internal code that uses GetCodepointNext that previously had to account for size=0.
…ureTextEx. This change matches existing precedent in DrawTextEx
@orcmid
Copy link
Contributor

orcmid commented Apr 2, 2023

@chocolate42

We probably need to be mindful of basic fonts that might only define ascii or even just the obviously visible part of ascii. If U+001A is pretty universally defined then it's suitable but I don't think U+FFFD can be suitable (not a font guy).

The advantage of U+001A is that it does not confuse with an actual "?". There is the problem of having no visible graphic usually, and "?" is a common practice. I agree that it is rare to find programs that deal with U+001A and practice trumps purism here. :(

We are assuming that the app does not check on these and the "?" is rather automatically incorporated in whatever the text is being processed on behalf of. It would be nice to have a way to be more defensive but I concede that's too much breaking change for edge cases, despite my general preference.

PS: I assume there are cases of these errors where size ends up being greater than 1, so that advancement past whatever was bogus/incomplete happens properly, but not going beyond what was consumed as valid, when more than 1.

@chocolate42
Copy link
Contributor Author

The advantage of U+001A is that it does not confuse with an actual "?". There is the problem of having no visible graphic usually, and "?" is a common practice. I agree that it is rare to find programs that deal with U+001A and practice trumps purism here. :(

If U+001A is visible with raylib's default font then there's an argument that U+001A is a good default choice. I have no preference.

We are assuming that the app does not check on these and the "?" is rather automatically incorporated in whatever the text is being processed on behalf of. It would be nice to have a way to be more defensive but I concede that's too much breaking change for edge cases, despite my general preference.

What we are assuming is that the input is good and if not that it's better to push through than error. This function could be used more defensively even if the return value remains 0x3f, all it would take is comparing IO on 0x3f return to notice the error.

PS: I assume there are cases of these errors where size ends up being greater than 1, so that advancement past whatever was bogus/incomplete happens properly, but not going beyond what was consumed as valid, when more than 1.

Not in the PR implementation, on error size is always 1 regardless of the size the first byte of input indicates the codepoint is meant to be. Normally this means for example an invalid 3 byte UTF-8 is decoded to three 0x3f codepoints, but returning size=1 on error always means it is still possible for invalid codepoints to be returned as valid (ie an invalid codepoint followed by a "misaligned" valid codepoint). This is unlikely unless the input is crafted or garbage, a single bit error in the control bits of a UTF-8 character is not enough to generate a misaligned valid codepoint. If the input is crafted or garbage all bets are off anyway.

We could instead on error return the size of the invalid UTF8 character. The problem with that is that good characters could be skipped by a bad character basically hiding them (three ascii characters can be skipped by prepending 11110xxx). Not to mention that sloppy looped use of this function to read a UTF-8 c-string could skip over the \0 and merrily overread into oblivion. This is much worse IMO.

@orcmid
Copy link
Contributor

orcmid commented Apr 4, 2023

@chocolate42

We could instead on error return the size of the invalid UTF8 character. The problem with that is that good characters could be skipped by a bad character basically hiding them (three ascii characters can be skipped by prepending 11110xxx). Not to mention that sloppy looped use of this function to read a UTF-8 c-string could skip over the \0 and merrily overread into oblivion. This is much worse IMO.

I agree and I did not mean that it be done that easily. At least one octet has to be consumed and valid successive characters (in the case of a good multibyte first-octet) can be consumed, stopping prematurely before the first improper octet. That makes for one "?" and whatever number of octets were consumed before there is an unexpected one (including the end of the string).

It is true that if someone treats an ISO 8-bit code as UTF8, there can be many "?" either way.

I'm planting this here in case I am ever foolish enough to dig into the proper treatment of Unicode in all its glory, especially composed characters, reverse-writing, and all of that. I do think some sensible (and TTF quality) handling of CJK would be valuable along with Arabic and Hebrew, and (shudder) some highly-composed languages from Southeast Asia. Maybe not in my life-time :)

@raysan5 raysan5 merged commit 8367aba into raysan5:master Apr 6, 2023
@raysan5
Copy link
Owner

raysan5 commented Apr 6, 2023

@chocolate42 Thanks for the review! Current implementation looks ok to me, actually I like that it simplifies code in other functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[rtext] GetCodepointNext() does not return ? as intended on invalid input
3 participants