
fix wrong buffer size in path string conversion functions #746

Merged

Conversation

@striezel (Contributor) commented May 5, 2024

Description

Fixes #733.

Issue:
The length calculation of the destination buffers for conversions from std::wstring to std::string or wchar_t* to char* is incorrect, causing a buffer overflow in some cases.

Explanation:
The current code uses wcslen() to "calculate" the length of the destination buffer for the converted string. However, wcslen() is just strlen() for wchar_t* strings: it returns the number of wide characters in a wchar_t* string, which is not necessarily the same as the number of narrow characters (that is, char) in the converted string (char* or std::string). The two only happen to match when the wide string uses nothing but characters of the English alphabet, where each wide character converts to exactly one narrow character.

So the length has to be calculated properly. There are (at least) two possibilities:

  • Use wcstombs()'s POSIX extension to calculate the length before doing the conversion.

    POSIX specifies a common extension: if dst is a null pointer, the function returns the number of bytes that would be written to dst if the conversion were performed.

    But that is an extension and cannot be relied upon to be present.

  • Use wcsrtombs() instead: there the null-destination behaviour is not an extension but part of the function's specification, so we do not need to rely on the presence of an extension. This is what this pull request does (a sketch of the approach follows below).
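
A minimal sketch of that approach, using a hypothetical helper convert_to_narrow() (this is not the exact code added by the pull request):

```cpp
#include <cstdlib>  // std::wcstombs
#include <cwchar>   // std::wcsrtombs, std::mbstate_t
#include <string>

// Convert a wide string to a narrow (multibyte) string in the current
// locale, sizing the destination with std::wcsrtombs() instead of wcslen().
inline std::string convert_to_narrow(wchar_t const* str)
{
    std::mbstate_t state = std::mbstate_t();
    wchar_t const* src = str;

    // With a null destination, wcsrtombs() returns the number of bytes the
    // converted string would occupy, excluding the terminating null byte.
    std::size_t const bytes = std::wcsrtombs(nullptr, &src, 0, &state);
    if (bytes == static_cast<std::size_t>(-1))
        return std::string();  // conversion error; real code should report it

    std::string result(bytes, '\0');
    std::wcstombs(&result[0], str, bytes);  // actual conversion, buffer is large enough
    return result;
}
```

With a correctly sized buffer, the conversion can no longer write past the end of the destination, which is the overflow reported in #733.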

References

Tasklist

  • Ensure all CI builds pass
  • Review and approve

@simmplecoder (Contributor)

I don't know much about multibyte string conversions, but I will review it to the best of my ability tomorrow.

@mloskot (Member) left a comment


LGTM. Big thanks @striezel

(Review comment on include/boost/gil/io/path_spec.hpp, outdated and resolved)
@mloskot added the cat/bug (Bug reports and bug fixes) label May 10, 2024
@mloskot added this to the Boost 1.83+ milestone May 10, 2024
@simmplecoder (Contributor) left a comment


Thanks for the commits; I learned a lot from reading the description. From the include chain I couldn't find #include <cwchar>; could you please add it? std::wcsrtombs seems to be declared there.
I looked at std::codecvt as a possible C++ refactoring, but unfortunately it was disappointing.

(Review comment on include/boost/gil/io/path_spec.hpp, outdated and resolved)
@striezel (Contributor, Author) commented May 10, 2024

I don't know much about multibyte string conversions, but I will review it to the best of my ability tomorrow.

Just to explain the basics (may not be completely accurate, but it should be enough to get a basic understanding):
In this code, there are two different kinds of strings: wide strings using wchar_t for character representation (used by std::wstring or C-style wchar_t* strings) and narrow strings using char for character representation (used by std::string or C-style char* strings). On Windows systems wchar_t is 16 bits and contains UTF-16 code units. On other systems it may be 32 bits and contain UTF-32, but in practice it is always more than 8 bits. In contrast, char only has 8 bits and therefore usually contains UTF-8 code units. (A narrow string could also contain ASCII, ISO-8859-1, etc., but listing and explaining all the possibilities would go too far here. Therefore, let's assume that it is UTF-8.)

So when a string is converted from a wide string (wchar_t / UTF-16) to a narrow string (char / UTF-8), the two strings do not necessarily have the same number of code units. If a string only uses letters from the English alphabet, then the wide and narrow strings can have the same length, because one wchar_t is converted to exactly one char. However, when the wide string uses characters from a more "exotic" alphabet, say Cyrillic or Chinese characters, the situation is different. A Cyrillic or Chinese character will only use one code unit in UTF-16, that is, it can be represented by a single wchar_t. But in UTF-8 that same character needs multiple code units (two or three), meaning it needs more than one char to be represented properly. That is why one wchar_t may be converted to more than one char, and the lengths of the strings can differ after conversion. Therefore, the length of the buffer for the converted string has to be calculated before the conversion. Otherwise it may be too short to hold the entirety of the converted string.
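
For example, the following hypothetical snippet (assuming a UTF-8 locale such as "en_US.UTF-8" is installed, which is not guaranteed on every system) shows the mismatch:

```cpp
#include <clocale>
#include <cstdio>
#include <cwchar>

int main()
{
    // Assumption: a UTF-8 locale is available; otherwise the byte count differs.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    wchar_t const* wide = L"Привет";  // six Cyrillic letters
    std::mbstate_t state = std::mbstate_t();
    wchar_t const* src = wide;

    // wcslen() counts wide characters; wcsrtombs() with a null destination
    // counts the bytes the converted UTF-8 string actually needs.
    std::printf("wide characters: %zu\n", std::wcslen(wide));                        // 6
    std::printf("narrow bytes:    %zu\n", std::wcsrtombs(nullptr, &src, 0, &state)); // 12
}
```

A buffer sized by wcslen() would hold 6 narrow characters (plus a terminator), while the converted UTF-8 string needs 12 bytes, so writing the full result into it overflows.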

Here, the length calculation is done by wcsrtombs(), and then the actual conversion is done by wcstombs().

@simmplecoder (Contributor)

Thanks for the explanation. I suppose the easiest solution is to throw boost.locale at it to completely nuke it, but I am not sure if it is worth it. I suppose people can open their streams and pass those if they want to do anything exotic.

@striezel (Contributor, Author)

Thanks for the commits; I learned a lot from reading the description. From the include chain I couldn't find #include <cwchar>; could you please add it? std::wcsrtombs seems to be declared there.

You are right. <cwchar> is where the function is declared, so I added it.
(It's a bit surprising that the CI passed on all those GCC, Clang and MSVC variants without it.)

I looked at std::codecvt as a possible C++ refactoring, but unfortunately it was disappointing.

Yes. As far as I remember, parts of that are deprecated in newer C++ standards.

@striezel (Contributor, Author)

I suppose the easiest solution is to throw boost.locale at it to completely nuke it, but I am not sure if it is worth it. I suppose people can open their streams and pass those if they want to do anything exotic.

As far as I understand it, some parts of Boost.Locale also need the ICU library (International Components for Unicode) and that library is ca. one order of magnitude larger than Boost.Locale itself (counting file sizes for installed libraries here). Furthermore, GIL does not have any dependency on Boost.Locale as of now, so this would potentially add two new dependencies to GIL. That kind of overhead may not be acceptable to some users of Boost.GIL, especially when we can have a smaller solution with the help of the standard library.

@mloskot (Member) commented May 13, 2024

@striezel

Furthermore, GIL does not have any dependency on Boost.Locale as of now, so this would potentially add two new dependencies to GIL. That kind of overhead may not be acceptable to some users of Boost.GIL, especially when we can have a smaller solution with the help of the standard library.

I agree; we should try to keep the list of required dependencies as short as possible.
But I don't see any issue with introducing optional dependencies for features that can be controlled via a CMake option or conditional compilation. The only issue I see is the potential maintenance overhead.

@mloskot merged commit abb561a into boostorg:develop May 13, 2024
18 checks passed
@striezel deleted the fix-wide-string-path-conversions branch May 13, 2024 21:57
@striezel mentioned this pull request Jun 30, 2024
Successfully merging this pull request may close: heap-buffer-overflow when using io-related functions