/*****************************************************

Cross-platform SIMD intrinsics header file

VERSION: 2004.10.26 (alpha)

Created by Patrick Roberts

This is an ongoing project. Please add functions and
typedefs as needed, but try to follow the guideline

The goal of this file is to stay cross-platform.
Only intrinsics or #defines that mimic another system's
SIMD instruction should be included, with the only exception
being instructions that, if nonexistent, are not

Currently, the goal is to base support around 128-bit SIMD.
(Only the Gekko and x86-MMX are 64-bit)
2004.05.09 [Patrick Roberts]
*) Created file with some i386, GCC dialect
2004.10.22 [Patrick Roberts]
*) Created emulated SIMD
2004.10.25 [Patrick Roberts]
*) Created arm-iwmmx GCC dialect
*) Fixed sqrt bug in emu dialect
*) Organized directories
*) Makefile for test app
*( Add new intrinsics to test app
*( MinGW x86 dialect (same as GCC on Linux?)
*( Does 3DNow! buy us anything?
*( Intel ICC x86 dialect
*( MSVC .NET x86 dialect
*( Support for ARM ARMv6, VFP and NEON SIMD? What compilers use these?
*( PowerPC AltiVec/Velocity/VMX components
*( MIPS-MMI / PS2-VU components
*( See if SSE2 buys us anything beyond what the compiler does already
*( Compaq Alpha components
/***************************************************

NOTE: SIMD data must be 16-byte aligned. Align to 16 bytes when allocating memory.
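
One way to get 16-byte-aligned memory from plain malloc is sketched below.
This is only an illustration: simd_malloc16 and simd_free16 are hypothetical
names, not functions provided by this header. The sketch over-allocates,
rounds the address up to a multiple of 16, and stores the raw pointer just
in front of the aligned block so it can be freed later.

```c
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper (not part of this header): returns a block whose
// address is a multiple of 16, or NULL if malloc fails.
static void *simd_malloc16(size_t size)
{
    // Room for worst-case padding (15 bytes) plus the saved raw pointer.
    void *raw = malloc(size + 15 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    // Skip past the pointer slot, then round up to the next multiple of 16.
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + 15) & ~(uintptr_t)15;

    // Remember the address that must eventually be passed to free().
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

// Hypothetical counterpart: frees a block obtained from simd_malloc16.
static void simd_free16(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

Blocks from simd_malloc16 must be released with simd_free16, never with
free() directly, because the pointer handed to the caller is not the one
malloc returned.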
X86/XSCALE (Intel) vs. PowerPC/MIPS

While the PowerPC and MIPS SIMD instructions take 2 source vectors
and a destination vector, the Intel platforms only take a source and
Code written either way will work on the X86, and still be faster than
387 math, but preserving the registers takes significant overhead.
(Disassemble the test program for an example. The print calls preserve
registers; the 'disassembly test' does not.) For the fastest code between
systems, write your SIMD math as the X86 expects, manually preserving SIMD
variables. At least GCC for PPC doesn't seem to have any issues figuring out
how to deal with a source and destination memory address being the same.
You must compile with -msse and -mmmx. I try to avoid MMX as MMX is slower on
the P4 than on the P3 and Athlon XP, but SSE doesn't have integer math.

You may want to set -msse2 if you have a P4 CPU (-msse2 is set by default
for x86-64 CPUs), as some of the SIMD functions not supported on x86
can be sped up by GCC using SSE2 instructions rather than standard pipeline
You must compile with the switch -maltivec
GCC ARM (XScale only)

GCC ARM only seems to support Intel Wireless MMX (XScale), not ARMv6,
NEON, or VFP? (Are these all the same beast?)

You must compile with +iwmmxt
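
Whether the switches above actually took effect can be checked at compile
time, since GCC predefines a macro for each enabled instruction set. The
sketch below is illustrative: simd_dialect_name is a hypothetical helper,
not part of this header, and falling back to "emulated" is an assumption
about how a build without any of these switches would be handled.

```c
// Hypothetical helper: reports which SIMD instruction set the compiler
// switches enabled, based on GCC's predefined macros.
static const char *simd_dialect_name(void)
{
#if defined(__SSE2__)
    return "x86 SSE2";          // -msse2 (default on x86-64)
#elif defined(__SSE__)
    return "x86 SSE";           // -msse
#elif defined(__MMX__)
    return "x86 MMX";           // -mmmx
#elif defined(__ALTIVEC__)
    return "PowerPC AltiVec";   // -maltivec
#elif defined(__IWMMXT__)
    return "ARM iwMMXt";        // XScale Wireless MMX
#else
    return "emulated";          // assumed fallback: no SIMD switch active
#endif
}
```

Printing this string from the test app is a quick way to confirm that a
build picked up the intended dialect.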