AVX512 vbool16f/vbool16d are incorrect #5

Closed
jeffamstutz opened this issue Dec 12, 2017 · 5 comments

Comments

@jeffamstutz
Owner

jeffamstutz commented Dec 12, 2017

Currently all 16-wide masks (i.e. pack<bool32_t> and pack<bool64_t>), when compiling with AVX512, assume that all elements in the mask are 32/64-bit integers. However, AVX512 represents these as bit masks instead of integer masks, which changes the underlying representation.

Ideas (thus far) to make this work with minimal duplication:

  • Parameterize the type of the arr data member (in the union) based on a trait that would map to either std::array<T, N> or std::array<bool32_t, 2> for the special case of AVX512 vbool.
  • Either of the following:
    • Use SFINAE to choose between operator[]() returning a reference or returning by value
    • Have operator[]() always return by value
  • Add insert()/extract() methods to get/set values in individual lanes
  • Specialize begin()/end() to return conforming iterators if they can't be just pointers (as they are now)

Also note that technically vboold16 doesn't exist yet...but this will be directly relevant to that implementation when the time comes.
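For concreteness, here is a minimal sketch of how the storage and insert()/extract() ideas above could look for a 16-wide AVX-512 mask (all names are illustrative, not a committed tsimd API):

```cpp
#include <immintrin.h>
#include <cstdint>

#if defined(__AVX512F__)
// Sketch only: the underlying representation is a 16-bit __mmask16
// rather than 16 integer lanes.
struct mask16
{
  __mmask16 m;

  // Lanes are individual bits, so there is nothing to hand out a
  // reference to; operator[] must return by value.
  bool operator[](int i) const { return extract(i); }

  bool extract(int i) const { return (m >> i) & 1; }

  void insert(int i, bool b)
  {
    m = __mmask16((m & ~(1u << i)) | (std::uint32_t(b) << i));
  }
};
#endif
```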

@jeffamstutz
Owner Author

@johguenther feel free to add any discussion here if the above is incomplete or incorrect for solving the AVX512 mask problem.

I also just realized we could always use the 8-wide fallback for vbool16 until we have the above implemented...that would at least give us code which behaves correctly.

@johguenther
Collaborator

Although I already went in the proposed direction, I think we should step back for a second and think about what the main goal of tsimd should be:

  1. a very thin abstraction layer over SIMD ISAs, basically "just" wrapping single intrinsics, or
  2. a higher-level abstraction for vector-parallel programming in the direction of SPMD (within what is achievable in C++).

The main difference is how to handle mixing types of different sizes (short, int, float, double, and pointers).

With option 1 there will be a different TSIMD_DEFAULT_WIDTH for each type, e.g. for AVX2 it's 8 for floats and 4 for doubles. When mixing float and double the user has to manually convert masks (e.g. when applying a mask resulting from a float comparison to a vector of doubles) and/or manually double-pump the doubles (no pun intended). Additionally, for packs of short and byte, a bool16_t and a bool8_t are needed.
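As an illustration of that manual conversion (raw AVX2 intrinsics, not tsimd code), widening an 8-lane float mask so it can be applied to two 4-lane double vectors might look like:

```cpp
#include <immintrin.h>

// A float comparison yields an 8-lane 32-bit mask; to drive operations on
// doubles it must be widened by hand into two 4-lane 64-bit masks.
inline void add_one_where_small(__m256 f, __m256d &dlo, __m256d &dhi)
{
  __m256 m32 = _mm256_cmp_ps(f, _mm256_set1_ps(1.f), _CMP_LT_OQ);

  // Sign-extend each 32-bit mask lane to 64 bits, half a vector at a time.
  __m256i i32 = _mm256_castps_si256(m32);
  __m256d mlo = _mm256_castsi256_pd(
      _mm256_cvtepi32_epi64(_mm256_castsi256_si128(i32)));
  __m256d mhi = _mm256_castsi256_pd(
      _mm256_cvtepi32_epi64(_mm256_extracti128_si256(i32, 1)));

  dlo = _mm256_blendv_pd(dlo, _mm256_add_pd(dlo, _mm256_set1_pd(1.0)), mlo);
  dhi = _mm256_blendv_pd(dhi, _mm256_add_pd(dhi, _mm256_set1_pd(1.0)), mhi);
}
```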

With option 2 there will be one dominant TSIMD_WIDTH for all types (set by the user, probably defaulting to the native width of 32-bit elements for the ISA in use), and tsimd takes care of mask conversions and double-pumping. There won't be a bool32_t and a bool64_t, just one bool_t, and a single mask type for all arithmetic types. A pack size of 64 is sensible for user code mainly operating on bytes (also, with AVX-512, masks can be as large as 64 bits). AVX-512 has nice instructions to support double-pumping by manipulating masks (KSHIFTR to split masks, KUNPCK to combine masks). But this also means that, e.g., pack<16, double> will consist of more than one intrinsic_t v.
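For illustration, a minimal sketch (hypothetical type and function names, not tsimd's actual implementation) of what double-pumping a comparison on a 16-wide double pack could look like under AVX-512, with KUNPCKBW merging the two half masks:

```cpp
#include <immintrin.h>

// A 16-wide pack of doubles backed by two __m512d halves; the two 8-bit
// comparison masks are combined into a single __mmask16 that lines up with
// a 16-wide float mask.
struct vdouble16
{
  __m512d lo, hi;
};

inline __mmask16 cmp_lt(const vdouble16 &a, const vdouble16 &b)
{
  __mmask8 mlo = _mm512_cmp_pd_mask(a.lo, b.lo, _CMP_LT_OQ);
  __mmask8 mhi = _mm512_cmp_pd_mask(a.hi, b.hi, _CMP_LT_OQ);
  // KUNPCKBW merges the halves in one instruction; since the mask types are
  // integers, plain shifts/ors would also work.
  return _mm512_kunpackb(__mmask16(mhi), __mmask16(mlo));
}
```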

@Twinklebear
Contributor

For AVX512VL this issue of "vector of bools" vs. "bitmask" also comes up, since it backports a lot of the mask intrinsics added in AVX512 to the 128/256-bit vectors. These can use __mmask8 for the results of comparisons on 128- and 256-bit vectors and pass that around to be used as a mask in gathers, etc.
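For example (illustrative only, not code from the branch), under AVX-512VL a comparison on a 256-bit vector already yields a __mmask8 that can feed a masked gather directly:

```cpp
#include <immintrin.h>

// Compare 8 floats against zero, then gather only where the lane is active;
// inactive lanes keep their value from 'v' (the src operand).
inline __m256 gather_where_positive(const float *base, __m256i idx, __m256 v)
{
  __mmask8 m = _mm256_cmp_ps_mask(v, _mm256_setzero_ps(), _CMP_GT_OQ);
  return _mm256_mmask_i32gather_ps(v, m, idx, base, 4);
}
```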

I started looking at doing some further template specialization along this route in a branch (https://github.com/Twinklebear/tsimd/tree/avx512-vl), but I think (and it sounds like some discussion was had along these lines too?) that instead of going the nasty std::vector<T>/std::vector<bool>-style specialization route, we would want to split the types and have separate vbool and mask types. The vbool would be an SSE/AVX-style vector of bools, where each bool is still given 32 bits, while the mask would be a bitmask type produced by comparison operations on AVX512 and AVX512VL (on SSE and AVX it would actually be just like vbool, but this would be hidden).
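A very rough sketch of what that type split could look like, with all names hypothetical (not taken from the branch):

```cpp
#include <array>
#include <cstdint>
#include <immintrin.h>

// vbool stays an SSE/AVX-style vector of 32-bit bools; mask is whatever
// comparisons naturally produce on the target ISA.
template <int W>
struct vboolf
{
  std::array<std::uint32_t, W> v;   // one 32-bit bool per lane
};

#if defined(__AVX512VL__)
template <int W>
struct maskf
{
  __mmask8 m;   // e.g. the result of _mm256_cmp_ps_mask (for W <= 8)
};
#else
template <int W>
using maskf = vboolf<W>;            // on SSE/AVX the two types coincide
#endif
```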

Regarding @johguenther's comment, my impression was that tsimd was aiming for option 2, to give us something more in the style of ISPC (though with explicit control-flow masking).

@johguenther
Collaborator

We discussed this last year and the outcome was that we actually want both: a thin, low-level intrinsic abstraction and, built upon that, a high-level ISPC-style abstraction (where a fixed number of parallel "tasks" is computed in parallel with SIMD instructions, for all basic types).

@jeffamstutz
Owner Author

This should be fixed now; please re-open if any issues come up. There are some corner cases still being ironed out (e.g. what appear to be gcc bugs), but the fundamental changes for getting AVX512 masking right are now complete.
