x86 Intrinsics Cheat Sheet v1.0
Jan Finis
finis@in.tum.de
Introduction
This cheat sheet displays most x86 intrinsics supported by Intel processors. The following intrinsics were omitted:
· obsolete or discontinued instruction sets like MMX and 3DNow!
· AVX-512, as it will not be available for some time and would blow up the space needed due to the vast amount of new instructions
· intrinsics not supported by Intel but only by AMD, like parts of the XOP instruction set (maybe they will be added in the future)
· intrinsics that are only useful for operating systems, like the _xsave intrinsic to save the CPU state
· the RdRand intrinsics, as it is unclear whether they provide real random numbers without enabling kleptographic loopholes
Each family of intrinsics is depicted by a box as described below. The intrinsics are grouped as meaningfully as possible. Most information is taken from the Intel Intrinsics Guide (http://software.intel.com/en-us/articles/intel-intrinsics-guide). Let me know (finis@in.tum.de) if you find any wrong or unclear content.
When not stated otherwise, each vector intrinsic performs its operation vertically on all elements which are packed into the input SSE registers. For example, the add intrinsic has no description; thus, it performs a vertical add of the elements of the two input registers: the first element from register a is added to the first element in register b, the second to the second, and so on. In contrast, a horizontal add would add the first element of a to the second element of a, the third to the fourth, and so on.
To use the intrinsics, include the <x86intrin.h> header and make sure that you set the target architecture to one that supports the intrinsics you want to use (using the -march=X compiler flag).
Legend

Available Bitwidth: If no bitwidth is specified, the operation is available for 128bit and 256bit SSE registers. Use the _mm_ prefix for the 128bit and the _mm256_ prefix for the 256bit flavors. Otherwise, the stated restrictions apply (e.g., 128 means 128bit only; ≤64 means the operation works on normal registers of up to 64 bits).

Sequence Indicator (SEQ!): If this indicator is present, the intrinsic will generate a sequence of assembly instructions instead of only one. Thus, it may not be as efficient as anticipated.

Name(s) of the intrinsic: The names of the various flavors of the intrinsic. To assemble the final name of an intrinsic, one must add a prefix and a suffix. The suffix determines the data type (see next field). Concerning the prefix, all intrinsics, except the ones which are prefixed with a green underscore _, need to be prefixed with _mm_ for 128bit versions or _mm256_ for 256bit versions. Blue letters in brackets like [n] indicate that adding this letter leads to another flavor of the function. Blue letters separated with a slash like l/r indicate that either letter can be used and leads to a different flavor. A red letter like Z indicates that various different strings can be inserted here; these are stated in the notes section.

Instruction Set: Specifies the instruction set which is necessary for this operation. If more than one instruction set is given here, then different flavors of this operation require different instruction sets.

Data Type Suffixes: List of available data type suffixes; consult the suffix table for further information about the various possible suffixes. The suffix chosen determines the data type on which the intrinsic operates. It must be added to the intrinsic name separated by an underscore, so a possible name for the data type pd in the example below would be _mm_cmpeq_pd. If a suffix is followed by an instruction set in brackets, then this instruction set is required for the suffix. All suffixes without an explicit instruction set are available in the instruction set specified at the left. If no type suffixes are shown, then the intrinsic is type independent and must be used without a suffix.

Notes: Explains the semantics of the intrinsic, the various flavors, and other important information.

Signature and Pseudocode: Shows one possible signature of the intrinsic in order to depict the parameter types and the return type. Note that only one flavor and data type suffix is shown; the signature has to be adjusted for other suffixes and flavors. The data types are displayed in shortened form; consult the data types table for more information. In addition to the signature, the pseudocode for some intrinsics is shown. Note that the special variable dst depicts the destination register of the method which will be returned by the intrinsic.

Example box (Float Compare):
SSE2 cmp[n]Z (ps/d, ss/d, si128[SSE4.1]) 128
NOTE: Z can be one of: ge/le/lt/gt/eq. Returns a single int that is either 1 or 0. The n version is a not version, e.g., neq computes not equal.
md cmpeq_pd(md a, md b)

Comparisons
Test (SSE4.1 testz/testc/test_mix_ones_zeros, si128) 128
NOTE: testz returns ZF; for 128 bit, there is also the operation test_all_zeros, which does the same. testc returns CF; test_mix_ones_zeros returns 1 iff both flags are clear.
i testz_si128(mi a, mi b)
  ZF := ((a & b) == 0); RETURN ZF
i testc_si128(mi a, mi b)
  CF := ((a & !b) == 0); RETURN CF
i test_mix_ones_zeros(mi a, mi b)
  ZF := ((a & b) == 0)
  CF := ((a & !b) == 0)
  RETURN !CF && !ZF

String compare specifiers (passed as part of imm to the String Compare intrinsics below):
_SIDD_CMP_EQUAL_EACH: Check for string equality of a and b.
_SIDD_CMP_RANGES: For each character c in a, determine whether b0 <= c <= b1 or b2 <= c <= b3 ...
_SIDD_CMP_EQUAL_ORDERED: Search substring b in a. Each byte where b begins in a is treated as match.
Polarity specifiers:
_SIDD_POSITIVE_POLARITY: Match is indicated by a 1-bit.
_SIDD_NEGATIVE_POLARITY: Negation of resulting bitmask.
_SIDD_MASKED_NEGATIVE_POLARITY: Negation of resulting bitmask, except for bits that have an index larger than the size of a or b.

AES 128

AES aeskeygenassist (si128)
NOTE: Assist in expanding the AES cipher key by computing steps towards generating a round key for the encryption cipher using data from a and an 8-bit round constant specified in imm.
mi aeskeygenassist_si128(mi a, ii imm)

AES aesimc (si128)
NOTE: Performs the inverse mix columns transformation on the input.
mi aesimc_si128(mi a)

AES aesenc[last] (si128)
NOTE: Perform one round of an AES encryption flow on data (state) in a using the round key. The last version performs the last round.
mi aesenc_si128(mi a, mi key)

AES aesdec[last] (si128)
NOTE: Perform one round of an AES decryption flow on data (state) in a using the round key. The last version performs the last round.
mi aesdec_si128(mi a, mi key)

Cyclic Redundancy Check (CRC32) ≤64

SSE4.2 crc32 (u8-u64)
NOTE: Starting with the initial value in crc, accumulates a CRC32 value for the unsigned X-bit integer v and stores the result in dst.
u crc32_u32(u crc, u v)

Transactions (TSX)

TSX _xabort
NOTE: Force an RTM abort. The EAX register is updated to reflect that an XABORT instruction caused the abort, and the imm parameter will be provided in bits [31:24] of EAX. Following an RTM abort, the logical processor resumes execution at the fallback address computed through the outermost XBEGIN instruction.
v _xabort(ii imm)
Packed Conversions: Convert all elements in a packed SSE register.

Convert 16bit Float ↔ 32bit Float: CVT16 cvtX_Y (ph ↔ ps)
NOTE: Converts between 4x 16 bit floats and 4x 32 bit floats (or 8 for 256 bit mode). For the 32 to 16 bit conversion, a rounding mode must be specified.
m cvtph_ps(mi a)
mi cvtps_ph(m a, i rounding)

Pack With Saturation: SSE2 pack[u]s (epi16, epi32)
NOTE: Packs ints from two input registers into one register, halving the bitwidth. Overflows are handled using saturation. The u version saturates to the unsigned integer type.
mi packs_epi32(mi a, mi b)

Sign Extend: SSE4.1 cvtX_Y (epi8-32 → epi16-64)
NOTE: Sign extends each element from X to Y. Y must be longer than X.
mi cvtepi8_epi32(mi a)

Zero Extend: SSE4.1 cvtX_Y (epu8-32 → epi16-64)
NOTE: Zero extends each element from X to Y. Y must be longer than X.
mi cvtepu8_epi32(mi a)

S/D/I32 Conversion: SSE2 cvt[t]X_Y (epi32, ps/d)
NOTE: Converts packed elements from X to Y. If pd is used as source or target type, then 2 elements are converted, otherwise 4. The t version is only available when casting to int and performs a truncation instead of rounding.
md cvtepi32_pd(mi a)

128bit Cast: SSE2 castX_Y (si128, ps/d)
NOTE: Reinterpret casts from X to Y. No operation is generated.
md castsi128_pd(mi a)
String Compare 128
All string compare intrinsics take a comparison specifier as part of imm (see the specifiers above). The i versions operate on implicit-length (null-terminated) strings; the e versions take explicit lengths la and lb.

SSE4.2 cmpi/estrm
NOTE: Compares strings in a and b and returns the comparison mask. Depending on imm, the result is a bit mask or a byte mask (a byte mask has ones in all bits of the selected bytes).
mi cmpistrm(mi a, mi b, ii imm)
mi cmpestrm(mi a, i la, mi b, i lb, ii imm)

SSE4.2 cmpi/estri
NOTE: Compares strings in a and b and returns the index of the first match.
i cmpistri(mi a, mi b, ii imm)
i cmpestri(mi a, i la, mi b, i lb, ii imm)

SSE4.2 cmpi/estrc
NOTE: Compares strings in a and b and returns true iff the resulting mask is not zero.
i cmpestrc(mi a, i la, mi b, i lb, ii imm)

SSE4.2 cmpi/estra
NOTE: Compares strings in a and b and returns true iff the resulting mask is zero and b does not contain a null character.
i cmpestra(mi a, i la, mi b, i lb, ii imm)

SSE4.2 cmpi/estrs/z
NOTE: Compares two strings a and b and returns whether a (s version) or b (z version) contains a null character.
i cmpestrs(mi a, i la, mi b, i lb, ii imm)

Rounding

SSE4.1 ceil (ps/d, ss/d)
NOTE: Round up (ceiling).
md ceil_pd(md a)

SSE4.1 floor (ps/d, ss/d)
NOTE: Round down (floor).
md floor_pd(md a)

SSE4.1 round (ps/d, ss/d)
NOTE: Rounds according to a rounding parameter.
md round_pd(md a, i rounding)

Reinterpret Casts

256bit Cast: AVX castX_Y (ps/d, si256)
NOTE: Reinterpret casts from X to Y. No operation is generated.
m256 castpd_ps(m256d a)

128 ↔ 256bit Cast: AVX castX_Y (ps, pd, si128/si256)
NOTE: Reinterpret casts from X to Y. If the cast is from 128 to 256 bit, then the upper bits are undefined! No operation is generated.
m128d castpd256_pd128(m256d a)

Miscellaneous

Pause: SSE pause
NOTE: Provide a hint to the processor that the code sequence is a spin-wait loop. This can help improve the performance and power consumption of spin-wait loops.
v pause()

Monitor: monitor
NOTE: Arm address monitoring hardware using the address specified in ptr. A store to an address within the specified address range triggers the monitoring hardware. Specify optional extensions in ext and optional hints in hints.
v monitor(v* ptr, u ext, u hints)

Wait: mwait
NOTE: Hint to the processor that it can enter an implementation-dependent-optimized state while waiting for an event or a store operation to the address range specified by MONITOR.
v mwait(u ext, u hints)

Get MXCSR: SSE getcsr
NOTE: Get the content of the MXCSR control and status register.
u getcsr()

Set MXCSR: SSE setcsr
NOTE: Set the MXCSR control and status register with a 32bit int.
v setcsr(u a)

Commit Transaction: TSX _xend
NOTE: Specify the end of an RTM code region. If this corresponds to the outermost scope, the logical processor will attempt to commit the logical processor state atomically. If the commit fails, the logical processor will perform an RTM abort.
v _xend()

Begin Transaction: TSX _xbegin
NOTE: Specify the start of an RTM code region. If the logical processor was not already in transactional execution, then it transitions into transactional execution. On an RTM abort, the logical processor discards all architectural register and memory updates performed during the RTM execution, restores architectural state, and starts execution beginning at the fallback address computed from the outermost XBEGIN instruction.
u _xbegin()

Single Element Conversion: Convert a single element in the lower bytes of an SSE register.

Single Conversion with Fill: SSE2 cvtX_Y (si32-64 → ss/d)
NOTE: Converts a single element in b from X to Y. The remaining bits of the result are copied from a.
md cvtsi64_sd(md a, i64 b)
  dst := double(b[63:0]) | (a[127:64] << 64)

Single Float to Int: SSE cvt[t]X_Y (ss/d → si32-64)
NOTE: Converts a single element from X (SSE float type) to Y. The result is a normal int, not an SSE register! The t version performs a truncation instead of rounding.
i cvtss_si32(m a)

Single 128-bit Int Conversion: SSE2 cvtX_Y (si32-128)
NOTE: Converts a single integer from X to Y. Either of the types must be si128. If the new type is longer, the integer is zero extended.
mi cvtsi32_si128(i a)

Single SSE Float to Normal Float: SSE2 cvtX_Y (ss → f32, sd → f64)
NOTE: Converts a single SSE float from X to Y (normal float type).
f cvtss_f32(m a)

Old Float/Int Conversion: SSE cvt[t]_X2Y (pi ↔ ps, si ↔ ss) SEQ!
NOTE: Converts X to Y. p_ versions convert 2-packs from b and fill the result with the upper bits of operand a. s_ versions do the same for one element in b. The t version truncates instead of rounding and is available only when converting from float to int.
m cvt_si2ss(m a, i b)
Arithmetics

Basic Arithmetics

Subtract: SSE2 sub / subs
NOTE: subs subtracts with saturation.
mi sub_epi16(mi a, mi b)
mi subs_epi16(mi a, mi b)

Horizontal Subtract: SSE3/SSSE3 hsub / hsubs
NOTE: Subtracts adjacent pairs horizontally; hsubs saturates.
mi hsub_epi16(mi a, mi b)
mi hsubs_epi16(mi a, mi b)

Multiplication

Carryless Mul: CLMUL clmulepi64 (si128)
NOTE: Perform a carry-less multiplication of two 64-bit integers, selected from a and b according to imm, and store the result in dst (the result is a 127 bit int).
mi clmulepi64_si128(mi a, mi b, ii imm)

Mul: SSE2 mul (epi32[SSE4.1], epu32, ps/d, ss/d)
NOTE: Multiplies vertically. The epi32 and epu32 versions multiply only 2 ints instead of 4!
mi mul_epi32(mi a, mi b)

Mul Low: SSE2 mullo (epi16, epi32[SSE4.1])
NOTE: Multiplies vertically and writes the low 16/32 bits into the result.
mi mullo_epi16(mi a, mi b)

Mul High: SSE2 mulhi (epi16, epu16)
NOTE: Multiplies vertically and writes the high 16 bits into the result.
mi mulhi_epi16(mi a, mi b)

Mul High with Round & Scale: SSSE3 mulhrs (epi16)
NOTE: Treat the sixteen-bit words in registers a and b as signed 15-bit fixed-point numbers between −1 and 1 (e.g. 0x4000 is treated as 0.5 and 0xa000 as −0.75), and multiply them together with correct rounding.
mi mulhrs_epi16(mi a, mi b)

Register I/O: Loading data into an SSE register or storing data from an SSE register.

Load: Loading data into an SSE register.
Set Register: Inserting data into a register without loading from memory (box titles: Set, Set Reversed, Insert, Replicate).

Fences

Store Fence: SSE sfence
NOTE: Serializes all store operations issued before the fence.

Memory Fence: SSE2 mfence
NOTE: Serializes all load and store operations issued before the fence.

Prefetch: SSE prefetch
NOTE: Prefetches the cache line containing the given address.

Undefined Register:
NOTE: Returns a register with undefined contents.
mi undefined_si128()

Bit Scan: SEQ!
NOTE: Returns the index of the lowest/highest bit set in the 32bit int a. Undefined if a is 0.
i _bit_scan_forward(i a)

Bit Operations (box titles): Boolean Logic (Bool XOR, Bool AND, Bool NOT AND, Bool OR), Bit Shifting & Rotation (Arithmetic Shift, Logic Shift, Rotate Left/Right), Bit Masking (set or reset a range of bits), Reset Lowest, Mask Up To, Find Lowest Bit, Counting.
Aligned Load: SSE2 load (pd, ps, si128)
NOTE: Loads 128 bit from memory into register. The memory address must be aligned!
md load_pd(d* ptr)

Reversed Load: SSE2 loadr (pd, ps)
NOTE: Loads 128 bit from memory into register. The elements are loaded in reversed order. ptr must be aligned!
md loadr_pd(d* ptr)

Masked Load: AVX maskload (ps/d)
NOTE: Loads from memory if the highest bit for each element is set in the mask. Otherwise 0 is used.
md maskload_pd(d* ptr, mi mask)

Unaligned Streaming Load: SSE3 lddqu (si128)
NOTE: Loads 128bit integer data into an SSE register. Is faster than loadu if the value crosses a cache line boundary and should be preferred over loadu; memory is fetched at once even from an USWC device. Memory does not need to be aligned.
mi lddqu_si128(mi* ptr)

Load High/Low: SSE2 loadh/l (pd, pi)
NOTE: Loads a value from memory into the high/low half of the register and fills the other half from the input register. Memory does not need to be aligned.
md loadl_pd(md a, d* ptr)
  dst[63:0] := *(ptr)[63:0]
  dst[127:64] := a[127:64]

Replicating Load: SSE2 load1 (pd, ps)
NOTE: Loads a float from memory into all slots of the 128-bit register. For pd, there is also the operation loaddup in SSE3, which may perform faster than load1.
md load1_pd(d* ptr)

SSE3 loaddup (pd)
NOTE: Loads a 64-bit float from memory into all slots of the 128-bit register. Memory does not need to be aligned.
md loaddup_pd(d* ptr)

Gather: AVX2 i32/i64gather (epi32-64, ps/d) 256
NOTE: Gathers elements from memory using 32-bit/64-bit indices. The elements are loaded from addresses starting at ptr and offset by each 32-bit/64-bit element in a (each index is scaled by the factor in s). s should be 1, 2, 4 or 8. Gathered elements are merged into dst. The 128bit versions zero the upper 128bit of 256bit registers. The number of gathered elements is limited by the minimum of the type to load and the type used for the offset. Memory does not need to be aligned.
mi i64gather_epi32(i* ptr, mi a, i s)
  FOR j := 0 to 1; i := j*32; m := j*64
    dst[i+31:i] := *(ptr + a[m+63:m]*s)
  dst[MAX:64] := 0

Masked Gather: AVX2 mask_i32/i64gather (epi32-64, ps/d) 256
NOTE: Same as gather, but takes an additional mask and src register. Each element is only gathered if the highest bit of the corresponding element in the mask is set. Otherwise it is copied from src.
mi mask_i64gather_epi32(mi src, i* ptr, mi a, mi mask, i32 s)
  FOR j := 0 to 1; i := j*32; m := j*64
    IF mask[i+31]
      dst[i+31:i] := *(ptr + a[m+63:m]*s)
      mask[i+31] := 0
    ELSE
      dst[i+31:i] := src[i+31:i]
  mask[MAX:64] := 0
  dst[MAX:64] := 0

Negate (sign): SSSE3 sign (epi8-32)
NOTE: For each element in a and b, set the result element to a if b is positive, to -a if b is negative, or to 0 if b is 0.
mi sign_epi16(mi a, mi b)

Average: SSE2 avg (epu8-16)
NOTE: Computes the rounded-up average of the unsigned elements.
mi avg_epu16(mi a, mi b)
Load Unaligned: SSE2 loadu (pd, ps, si16-si128)
NOTE: Loads 128 bit or less (for the si16-64 versions) from memory into the low bits of a register. Memory does not have to be aligned.
md loadu_pd(d* ptr)

Load Single: SSE2 load (ss, sd, epi64)
NOTE: Loads a single element from memory and zeros the remaining bytes. For epi64, the command is loadl_epi64! Memory does not need to be aligned.
md load_sd(d* ptr)

Broadcast: AVX broadcast (ss/d, ps/d)
NOTE: Broadcasts one element (ss/d) or 128 bits (ps/d) from memory into all slots of an SSE register. All suffixes but ss are only available in 256 bit mode!
m broadcast_ss(f* a)

Load Two 128bit: AVX loadu2 (m128, m128d, m128i) 256 SEQ!
NOTE: Loads two 128bit elements from two memory locations which do not need to be aligned.
m loadu2_m128(f* p1, f* p2)
  dst[127:0] := *p2
  dst[255:128] := *p1

Mix Registers: Mix the contents of two registers.
Byte Shuffling: Change the byte order using a control mask.

Move Element with Fill: SSE2 move (ss[SSE], sd[SSE2]) 128
NOTE: Moves the lowest element of b into the result and fills the remaining elements from a.
m move_ss(m a, m b)
  dst[31:0] := b[31:0]
  dst[127:32] := a[127:32]

Move High ↔ Low: SSE movelh/hl (ps) 128
NOTE: The lh version moves the lower half of b into the upper half of the result; the rest is filled from a. The hl version moves the upper half of b into the lower half.
m movehl_ps(m a, m b)
  dst[63:0] := b[127:64]
  dst[127:64] := a[127:64]

128bit Insert: AVX insertf128 (si256, ps/d) 256
NOTE: Inserts 128 bit at a position specified by imm. For AVX2, there is only si256 and the operation is named inserti128.
m insertf128_ps(m a, m b, ii i)
  dst[255:0] := a[255:0]
  sel := i*128
  dst[sel+127:sel] := b[127:0]

Concatenate and Byteshift (Align): SSSE3 alignr (epi8)
NOTE: Concatenates a and b and shifts the result right by c bytes.
mi alignr_epi8(mi a, mi b, i c)
  ((a << 128) | b) >> c*8

32-bit Int Shuffle: SSE2 shuffle (epi32)
NOTE: Shuffles the 32-bit ints of a using an immediate control mask.
mi shuffle_epi32(mi a, ii imm)
  S(s, mask){
    CASE(mask[1:0])
      0: r := s[31:0]
      1: r := s[63:32]
      2: r := s[95:64]
      3: r := s[127:96]
    RETURN r[31:0]
  }
  dst[31:0]   := S(a, imm[1:0])
  dst[63:32]  := S(a, imm[3:2])
  dst[95:64]  := S(a, imm[5:4])
  dst[127:96] := S(a, imm[7:6])

Shuffle High/Low 16bit: SSE2 shufflehi/lo (epi16)
NOTE: Shuffles the high/low half of the register using an immediate control mask. The rest of the register is copied from the input.
mi shufflehi_epi16(mi a, ii imm)
  dst[63:0] := a[63:0]
  dst[79:64]   := (a >> (imm[1:0]*16))[79:64]
  dst[95:80]   := (a >> (imm[3:2]*16))[79:64]
  dst[111:96]  := (a >> (imm[5:4]*16))[79:64]
  dst[127:112] := (a >> (imm[7:6]*16))[79:64]

Byte Shuffle: SSSE3 shuffle (epi8)
NOTE: Shuffles packed 8-bit integers in a according to the shuffle control mask in the corresponding 8-bit element of b, and stores the results in dst.
mi shuffle_epi8(mi a, mi b)
  FOR j := 0 to 15
    i := j*8
    IF b[i+7] == 1
      dst[i+7:i] := 0
    ELSE
      k := b[i+3:i]*8
      dst[i+7:i] := a[k+7:k]

Min: min (SSE: ps; SSE2: epu8, epi16, pd; SSE4.1: epi8-32, epu8-32)
mi min_epu16(mi a, mi b)

Max: max (SSE: ps; SSE2: epu8, epi16, pd; SSE4.1: epi8-32, epu8-32)
mi max_epu16(mi a, mi b)

Horizontal Min with Index: SSE4.1 minpos (epu16) 128
NOTE: Computes the horizontal min of one input vector of 16bit uints. The min is stored in the lower 16 bits, and its index is stored in the following 3 bits.
mi minpos_epu16(mi a)

128bit Dual Register Shuffle: AVX permute2f128 (ps/d, si256) 256
NOTE: Takes 2 registers and shuffles 128 bit chunks to a target register. Each chunk can also be cleared by a bit in the mask. For AVX2, there is only the si256 version, which is renamed to permute2x128.
md permute2f128_pd(md a, md b, ii imm)
  S(s1, s2, control){
    CASE(control[1:0])
      0: tmp[127:0] := s1[127:0]
      1: tmp[127:0] := s1[255:128]
      2: tmp[127:0] := s2[127:0]
      3: tmp[127:0] := s2[255:128]
    IF control[3]
      tmp[127:0] := 0
    RETURN tmp[127:0]
  }
  dst[127:0] := S(a, b, imm[3:0])
  dst[255:128] := S(a, b, imm[7:4])
  dst[MAX:256] := 0
Float Shuffle: shuffle (ps[SSE], pd[SSE2])
NOTE: Shuffles floats from two registers. The lower part of the result receives values from the first register, the upper part from the second register. The shuffle mask is an immediate!
md shuffle_pd(md a, md b, ii imm)
  dst[63:0] := (imm[0] == 0) ? a[63:0] : a[127:64]
  dst[127:64] := (imm[1] == 0) ? b[63:0] : b[127:64]

Float Shuffle (single register): AVX permute[var] (ps/d)
NOTE: The 128-bit version is the same as the int shuffle for floats. The 256-bit version performs the same operation on two 128-bit lanes. The normal version uses an immediate, the var version a register b (the lowest bits in each element in b are used for the mask of that element in a).
m permutevar_ps(m a, mi b)

4x64bit Shuffle: AVX2 permute4x64 (epi64, pd) 256
NOTE: Same as the float shuffle, but no [var] version is available. In addition, the shuffle is performed over the full 256 bit instead of two lanes of 128 bit.
md permute4x64_pd(md a, ii imm)

8x32bit Shuffle: AVX2 permutevar8x32 (epi32, ps) 256
NOTE: Same as the 4x64 bit shuffle but with 8x32 bit; only a [var] version taking a register instead of an immediate is available.
m permutevar8x32_ps(m a, mi b)

Interleave (Unpack): SSE2 unpackhi/lo (epi8-64, ps[SSE], pd)
NOTE: Interleaves elements from the high/low half of two input registers into a target register.
mi unpackhi_epi16(mi a, mi b)
Example for unpackhi_epi32(a, b):
  dst[31:0] := a[95:64]
  dst[63:32] := b[95:64]
  dst[95:64] := a[127:96]
  dst[127:96] := b[127:96]

Blend: SSE4.1 blend[v] (epi16, epi32[AVX2], ps/d; v version: epi8)
NOTE: blendv uses a 128bit mask, blend uses an immediate.
mi blend_epi16(mi a, mi b, ii imm)
  FOR j := 0 to 7
    i := j*16
    IF imm[j]
      dst[i+15:i] := b[i+15:i]
    ELSE
      dst[i+15:i] := a[i+15:i]

Composite Arithmetics: Perform more than one operation at once.

Dot Product: SSE4.1 dp (ps/d)
NOTE: Computes the dot product of two vectors. A mask is used to state which components are to be multiplied and stored.
m dp_ps(m a, m b, ii imm)
  FOR j := 0 to 3; i := j*32
    IF imm[4+j]
      tmp[i+31:i] := a[i+31:i] * b[i+31:i]
    ELSE
      tmp[i+31:i] := 0
  sum[31:0] := tmp[127:96] + tmp[95:64] + tmp[63:32] + tmp[31:0]
  FOR j := 0 to 3; i := j*32
    dst[i+31:i] := imm[j] ? sum[31:0] : 0

Byte Multiply and Saturated Horizontal Add: SSSE3 maddubs (epi16)
NOTE: Multiplies vertical 8 bit ints, producing a 16 bit int, and adds horizontal pairs with saturation, producing 8x16 bit ints. The first input is treated as unsigned and the second as signed. Results and intermediaries are signed.
mi maddubs_epi16(mi a, mi b)

Multiply and Horizontal Add: SSE2 madd (epi16)
NOTE: Multiplies vertical 16 bit ints, producing a 32 bit int, and adds horizontal pairs, producing 4x32 bit ints.
mi madd_epi16(mi a, mi b)

Sum of Absolute Differences: SSE2 sad (epu8)
NOTE: Computes the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sums each consecutive 8 differences to produce two unsigned 16-bit integers, and packs these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
mi sad_epu8(mi a, mi b)

Sum of Absolute Differences 2: SSE4.1 mpsadbw (epu8)
NOTE: Computes the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in a compared to those in b, and stores the 16-bit results in dst. Eight SADs are performed using one quadruplet from b and eight quadruplets from a. One quadruplet is selected from b starting at the offset specified in imm. Eight quadruplets are formed from sequential 8-bit integers selected from a starting at the offset specified in imm.
mi mpsadbw_epu8(mi a, mi b, ii imm)

Store: Storing data from an SSE register into memory or registers.
Aligned Store: Storing data to a memory address which must be 16-byte aligned (or 32-byte for 256bit instructions). Box titles: Aligned Stream Store (SSE stream), Reverse Store (SSE storer), Aligned Store (SSE store), Broadcast Store (SSE store1), Masked Store (AVX2 maskstore), Extract (SSE4.1 extract).
Version 1.0