
fix wrong buffer size in path string conversion functions #746

Merged

Conversation

@striezel (Contributor) commented May 5, 2024

Description

Fixes #733.

Issue:
The length calculation of the destination buffers for conversions from std::wstring to std::string or wchar_t* to char* is incorrect, causing a buffer overflow in some cases.

Explanation:
The current code uses wcslen() to "calculate" the length of the destination buffer for the converted string. However, wcslen() is just strlen() for wchar_t* strings: it returns the number of wide characters in a wchar_t* string, which is not necessarily the same as the number of narrow characters (that is, char) in the converted string (char* or std::string). The two only happen to match when the wide string uses nothing but characters of the English alphabet, where each wide character converts to exactly one narrow character.

So the length has to be calculated properly. There are (at least) two possibilities:

  • Use wcstombs()'s POSIX extension to calculate the length before doing the conversion.

    POSIX specifies a common extension: if dst is a null pointer, the function returns the number of bytes that would be written to dst if the conversion were performed.

    But that is an extension and cannot be relied upon to be present.

  • Use wcsrtombs() instead: there the null-destination behaviour is not an extension but part of the function's specification, so we do not need to rely on the presence of an extension. This is what this pull request does (a sketch of the approach follows below).
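
A minimal sketch of that approach, using a hypothetical helper convert_to_narrow() (this is not the exact code added by the pull request):

```cpp
#include <cstdlib>  // std::wcstombs
#include <cwchar>   // std::wcsrtombs, std::mbstate_t
#include <string>

// Convert a wide string to a narrow (multibyte) string in the current
// locale, sizing the destination with std::wcsrtombs() instead of wcslen().
inline std::string convert_to_narrow(wchar_t const* str)
{
    std::mbstate_t state = std::mbstate_t();
    wchar_t const* src = str;

    // With a null destination, wcsrtombs() returns the number of bytes the
    // converted string would occupy, excluding the terminating null byte.
    std::size_t const bytes = std::wcsrtombs(nullptr, &src, 0, &state);
    if (bytes == static_cast<std::size_t>(-1))
        return std::string();  // conversion error; real code should report it

    std::string result(bytes, '\0');
    std::wcstombs(&result[0], str, bytes);  // actual conversion, buffer is large enough
    return result;
}
```

With a correctly sized buffer, the conversion can no longer write past the end of the destination, which is the overflow reported in #733.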

References

Tasklist

  • Ensure all CI builds pass
  • Review and approve

@simmplecoder (Contributor)

I don't know much about multibyte string conversions, but I will review it to the best of my ability tomorrow.

@mloskot (Member) left a comment


LGTM. Big thanks @striezel

(Review comment on include/boost/gil/io/path_spec.hpp, outdated and resolved)
@mloskot added the cat/bug (Bug reports and bug fixes) label May 10, 2024
@mloskot added this to the Boost 1.83+ milestone May 10, 2024
@simmplecoder (Contributor) left a comment


Thanks for the commits; I learned a lot from reading the description. From the include chain I couldn't find #include <cwchar>; could you please add it? std::wcsrtombs seems to be declared there.
I looked at std::codecvt as a possible C++ refactoring, but unfortunately it was disappointing.

(Review comment on include/boost/gil/io/path_spec.hpp, outdated and resolved)
@striezel (Contributor, Author) commented May 10, 2024

I don't know much about multibyte string conversions, but I will review it to the best of my ability tomorrow.

Just to explain the basics (may not be completely accurate, but it should be enough to get a basic understanding):
In this code, there are two different kinds of strings: wide strings using wchar_t for character representation (used by std::wstring or C-style wchar_t* strings) and narrow strings using char for character representation (used by std::string or C-style char* strings). On Windows systems wchar_t is 16 bits and contains UTF-16 code units. On other systems it may be 32 bits and contain UTF-32, but in practice it is always more than 8 bits. In contrast, char only has 8 bits and therefore usually contains UTF-8 code units. (A narrow string could also contain ASCII, ISO-8859-1, etc., but listing and explaining all the possibilities would go too far here. Therefore, let's assume that it is UTF-8.)

So when a string is converted from a wide string (wchar_t / UTF-16) to a narrow string (char / UTF-8), the two strings do not necessarily have the same number of code units. If a string only uses letters from the English alphabet, then the wide and narrow strings can have the same length, because one wchar_t is converted to exactly one char. However, when the wide string uses characters from a more "exotic" alphabet, say Cyrillic or Chinese characters, the situation is different. A Cyrillic or Chinese character will only use one code unit in UTF-16, that is, it can be represented by a single wchar_t. But in UTF-8 that same character needs multiple code units (two or three), meaning it needs more than one char to be represented properly. That is why one wchar_t may be converted to more than one char, and the lengths of the strings can differ after conversion. Therefore, the length of the buffer for the converted string has to be calculated before the conversion. Otherwise it may be too short to hold the entirety of the converted string.
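
For example, the following hypothetical snippet (assuming a UTF-8 locale such as "en_US.UTF-8" is installed, which is not guaranteed on every system) shows the mismatch:

```cpp
#include <clocale>
#include <cstdio>
#include <cwchar>

int main()
{
    // Assumption: a UTF-8 locale is available; otherwise the byte count differs.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    wchar_t const* wide = L"Привет";  // six Cyrillic letters
    std::mbstate_t state = std::mbstate_t();
    wchar_t const* src = wide;

    // wcslen() counts wide characters; wcsrtombs() with a null destination
    // counts the bytes the converted UTF-8 string actually needs.
    std::printf("wide characters: %zu\n", std::wcslen(wide));                        // 6
    std::printf("narrow bytes:    %zu\n", std::wcsrtombs(nullptr, &src, 0, &state)); // 12
}
```

A buffer sized by wcslen() would hold 6 narrow characters (plus a terminator), while the converted UTF-8 string needs 12 bytes, so writing the full result into it overflows.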

Here, the length calculation is done by wcsrtombs(), and then the actual conversion is done by wcstombs().

@simmplecoder (Contributor)

Thanks for the explanation. I suppose the easiest solution is to throw boost.locale at it to completely nuke it, but I am not sure if it is worth it. I suppose people can open their streams and pass those if they want to do anything exotic.

@striezel (Contributor, Author)

Thanks for the commits; I learned a lot from reading the description. From the include chain I couldn't find #include <cwchar>; could you please add it? std::wcsrtombs seems to be declared there.

You are right. <cwchar> is where the function is declared, so I added it.
(It's a bit surprising that the CI passed on all those GCC, Clang and MSVC variants without it.)

I looked at std::codecvt as a possible C++ refactoring, but unfortunately it was disappointing.

Yes. As far as I remember, parts of that are deprecated in newer C++ standards.

@striezel (Contributor, Author)

I suppose the easiest solution is to throw boost.locale at it to completely nuke it, but I am not sure if it is worth it. I suppose people can open their streams and pass those if they want to do anything exotic.

As far as I understand it, some parts of Boost.Locale also need the ICU library (International Components for Unicode) and that library is ca. one order of magnitude larger than Boost.Locale itself (counting file sizes for installed libraries here). Furthermore, GIL does not have any dependency on Boost.Locale as of now, so this would potentially add two new dependencies to GIL. That kind of overhead may not be acceptable to some users of Boost.GIL, especially when we can have a smaller solution with the help of the standard library.

@mloskot (Member) commented May 13, 2024

@striezel

Furthermore, GIL does not have any dependency on Boost.Locale as of now, so this would potentially add two new dependencies to GIL. That kind of overhead may not be acceptable to some users of Boost.GIL, especially when we can have a smaller solution with the help of the standard library.

I agree; we should try to keep the list of required dependencies as short as possible.
But I don't see any issue with introducing optional dependencies for features that can be controlled via a CMake option or conditional compilation. The only issue I see is the potential maintenance overhead.

@mloskot merged commit abb561a into boostorg:develop May 13, 2024
18 checks passed
@striezel deleted the fix-wide-string-path-conversions branch May 13, 2024 21:57
@striezel mentioned this pull request Jun 30, 2024
Successfully merging this pull request may close: heap-buffer-overflow when using io-related functions