AVX512 vbool16f/vbool16d are incorrect #5

Closed
jeffamstutz opened this issue Dec 12, 2017 · 5 comments

Comments

@jeffamstutz
Owner

jeffamstutz commented Dec 12, 2017

Currently all 16-wide masks (i.e. pack<bool32_t> and pack<bool64_t>), when compiling with AVX512, assume that all elements in the mask are 32/64-bit integers. However, AVX512 represents these as bit masks instead of integer masks, which changes the underlying representation.

Ideas (thus far) to make this work with minimal duplication:

  • Parameterize the type of the arr data member (in the union) based on a trait that would map to either std::array<T, N> or std::array<bool32_t, 2> for the special case of AVX512 vbool.
  • Either of the following:
    • Use SFINAE to choose between operator[]() returning a reference or returning by value
    • Have operator[]() always return by value
  • Add insert()/extract() methods to get/set values in individual lanes
  • Specialize begin()/end() to return conforming iterators if they can't be just pointers (as they are now)

Also note that technically vboold16 doesn't exist yet...but this will be directly relevant to that implementation when the time comes.
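For concreteness, here is a minimal sketch of how the storage and insert()/extract() ideas above could look for a 16-wide AVX-512 mask (all names are illustrative, not a committed tsimd API):

```cpp
#include <immintrin.h>
#include <cstdint>

#if defined(__AVX512F__)
// Sketch only: the underlying representation is a 16-bit __mmask16
// rather than 16 integer lanes.
struct mask16
{
  __mmask16 m;

  // Lanes are individual bits, so there is nothing to hand out a
  // reference to; operator[] must return by value.
  bool operator[](int i) const { return extract(i); }

  bool extract(int i) const { return (m >> i) & 1; }

  void insert(int i, bool b)
  {
    m = __mmask16((m & ~(1u << i)) | (std::uint32_t(b) << i));
  }
};
#endif
```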

@jeffamstutz
Owner Author

@johguenther feel free to add any discussion here if the above is incomplete or incorrect for solving the AVX512 mask problem.

I also just realized we could always use the 8-wide fallback for vbool16 until we have the above implemented...that would at least give us code which behaves correctly.

@johguenther
Collaborator

Although I already went in the proposed direction, I think we should step back for a second and think about what the main goal of tsimd should be:

  1. a very thin abstraction layer over SIMD ISAs, basically "just" wrapping single intrinsics, or
  2. a higher-level abstraction for vector-parallel programming in the direction of SPMD (within what is achievable in C++).

The main difference is how to handle mixing types of different sizes (short, int, float, double, and pointers).

With option 1 there will be a different TSIMD_DEFAULT_WIDTH for each type, e.g. for AVX2 it's 8 for floats and 4 for doubles. When mixing float and double the user has to manually convert masks (e.g. when applying a mask resulting from a float comparison to a vector of doubles) and/or manually double-pump the doubles (no pun intended). Additionally, for packs of short and byte, a bool16_t and a bool8_t are needed.
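As an illustration of that manual conversion (raw AVX2 intrinsics, not tsimd code), widening an 8-lane float mask so it can be applied to two 4-lane double vectors might look like:

```cpp
#include <immintrin.h>

// A float comparison yields an 8-lane 32-bit mask; to drive operations on
// doubles it must be widened by hand into two 4-lane 64-bit masks.
inline void add_one_where_small(__m256 f, __m256d &dlo, __m256d &dhi)
{
  __m256 m32 = _mm256_cmp_ps(f, _mm256_set1_ps(1.f), _CMP_LT_OQ);

  // Sign-extend each 32-bit mask lane to 64 bits, half a vector at a time.
  __m256i i32 = _mm256_castps_si256(m32);
  __m256d mlo = _mm256_castsi256_pd(
      _mm256_cvtepi32_epi64(_mm256_castsi256_si128(i32)));
  __m256d mhi = _mm256_castsi256_pd(
      _mm256_cvtepi32_epi64(_mm256_extracti128_si256(i32, 1)));

  dlo = _mm256_blendv_pd(dlo, _mm256_add_pd(dlo, _mm256_set1_pd(1.0)), mlo);
  dhi = _mm256_blendv_pd(dhi, _mm256_add_pd(dhi, _mm256_set1_pd(1.0)), mhi);
}
```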

With option 2 there will be one dominant TSIMD_WIDTH for all types (set by the user, probably defaulting to the native width of 32-bit elements for the ISA in use), and tsimd takes care of mask conversions and double-pumping. There won't be a bool32_t and a bool64_t, just one bool_t, and a single mask type for all arithmetic types. A pack size of 64 is sensible for user code mainly operating on bytes (also, with AVX-512, masks can be as large as 64 bits). AVX-512 has nice instructions to support double-pumping by manipulating masks (KSHIFTR to split masks, KUNPCK to combine masks). But this also means that, e.g., pack<16, double> will consist of more than one intrinsic_t v.
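For illustration, a minimal sketch (hypothetical type and function names, not tsimd's actual implementation) of what double-pumping a comparison on a 16-wide double pack could look like under AVX-512, with KUNPCKBW merging the two half masks:

```cpp
#include <immintrin.h>

// A 16-wide pack of doubles backed by two __m512d halves; the two 8-bit
// comparison masks are combined into a single __mmask16 that lines up with
// a 16-wide float mask.
struct vdouble16
{
  __m512d lo, hi;
};

inline __mmask16 cmp_lt(const vdouble16 &a, const vdouble16 &b)
{
  __mmask8 mlo = _mm512_cmp_pd_mask(a.lo, b.lo, _CMP_LT_OQ);
  __mmask8 mhi = _mm512_cmp_pd_mask(a.hi, b.hi, _CMP_LT_OQ);
  // KUNPCKBW merges the halves in one instruction; since the mask types are
  // integers, plain shifts/ors would also work.
  return _mm512_kunpackb(__mmask16(mhi), __mmask16(mlo));
}
```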

@Twinklebear
Contributor

For AVX512VL this issue of "vector of bools" vs. "bitmask" also comes up, since it backports a lot of the mask intrinsics added in AVX512 to the 128/256-bit vectors. These can use __mmask8 for the results of comparisons on 128- and 256-bit vectors and pass that around to be used as a mask in gathers, etc.
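For example (illustrative only, not code from the branch), under AVX-512VL a comparison on a 256-bit vector already yields a __mmask8 that can feed a masked gather directly:

```cpp
#include <immintrin.h>

// Compare 8 floats against zero, then gather only where the lane is active;
// inactive lanes keep their value from 'v' (the src operand).
inline __m256 gather_where_positive(const float *base, __m256i idx, __m256 v)
{
  __mmask8 m = _mm256_cmp_ps_mask(v, _mm256_setzero_ps(), _CMP_GT_OQ);
  return _mm256_mmask_i32gather_ps(v, m, idx, base, 4);
}
```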

I started looking at doing some further template specialization along this route in a branch (https://github.com/Twinklebear/tsimd/tree/avx512-vl), but I think (and it sounds like some discussion was had along these lines too?) that instead of going the nasty std::vector<T>/std::vector<bool>-style specialization route, we would want to split the types and have separate vbool and mask types. The vbool would be an SSE/AVX-style vector of bools, where each bool is still given 32 bits, while the mask would be a bitmask type produced by comparison operations on AVX512 and AVX512VL (on SSE and AVX it would actually be just like vbool, but this would be hidden).
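A very rough sketch of what that type split could look like, with all names hypothetical (not taken from the branch):

```cpp
#include <array>
#include <cstdint>
#include <immintrin.h>

// vbool stays an SSE/AVX-style vector of 32-bit bools; mask is whatever
// comparisons naturally produce on the target ISA.
template <int W>
struct vboolf
{
  std::array<std::uint32_t, W> v;   // one 32-bit bool per lane
};

#if defined(__AVX512VL__)
template <int W>
struct maskf
{
  __mmask8 m;   // e.g. the result of _mm256_cmp_ps_mask (for W <= 8)
};
#else
template <int W>
using maskf = vboolf<W>;            // on SSE/AVX the two types coincide
#endif
```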

Regarding @johguenther's comment, my impression was that tsimd was aiming for option 2, to give us something more in the style of ISPC (though with explicit control-flow masking).

@johguenther
Collaborator

We discussed this last year and the outcome was that we actually want both: a thin, low-level intrinsic abstraction and, built upon that, a high-level ISPC-style abstraction (where a fixed number of parallel "tasks" is computed in parallel with SIMD instructions, for all basic types).

@jeffamstutz
Owner Author

This should be fixed now; please re-open if any issues come up. There are some corner cases still being ironed out (e.g. what appear to be gcc bugs), but the fundamental changes for getting AVX512 masking right are now complete.
