Introduction to the
Streaming SIMD Extensions in the Pentium III: Part III
By Bipin Patwardhan
1. Data Swizzling
The speedup that the Pentium III SSE achieves on
floating-point operations comes at a price. The data operated on
by SSE instructions has to be stored in the new data type defined
by SSE. If the application stores the data in its own format, the
data has to be converted into the new data type before the SSE
instructions can operate on it, and has to be converted back
afterward.
This conversion of data from one format into another is termed
"data swizzling."
The conversion itself costs machine cycles. If an application
converts data from one format to another too often, the cycles
saved by executing SSE instructions may well be lost. Conversions
should therefore be kept to a minimum.
1.1 Data Organization
Usually, 3D applications store the coordinates of a point in
one structure. When handling multiple points, applications use an
array of such structures, an arrangement called AoS (array of
structures). Typical geometric operations treat the x-, y- and
z-coordinates of a point differently. The code given below shows
a typical declaration used by applications processing 3D data,
and figure 1 illustrates the resulting layout.
struct point {
float x, y, z;
};
...
point dataset[...];

Figure 1: Array of structures.
To exploit the advantages of SSE, it is better to operate on
several points simultaneously, applying the same operation to the
same coordinate of each point. This is possible if we collect
together the x-, the y- and the z-coordinates of the points, so
that the application can process groups of x-, y- and
z-coordinates separately. For this, the application must
rearrange the data into either three separate arrays, or a
structure of arrays with one array for each coordinate of the
point. This arrangement is called the SoA (structure of arrays)
arrangement.
The code given below lists the declaration of the structure of
arrays, while figure 2 is its diagrammatic representation.
struct point {
float *x, *y, *z;
};

Figure 2: Structure of arrays.
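As a rough illustration of the rearrangement described above, the
sketch below converts an array of structures into a structure of
arrays. The names PointAoS, PointsSoA and toSoA are assumptions
made for this example and do not come from the SSE SDK, and
std::vector is used instead of raw pointers only to keep the
sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One point per structure: the AoS element type.
struct PointAoS {
    float x, y, z;
};

// One array per coordinate: the SoA arrangement.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Copy every coordinate of the AoS data into its own contiguous array.
PointsSoA toSoA(const PointAoS* pts, std::size_t n) {
    assert(n == 0 || pts != nullptr);
    PointsSoA out;
    out.x.resize(n);
    out.y.resize(n);
    out.z.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.x[i] = pts[i].x;
        out.y[i] = pts[i].y;
        out.z[i] = pts[i].z;
    }
    return out;
}
```

After the conversion, four consecutive x-coordinates sit next to
each other in memory and can be fetched with a single SSE load.
The copy loop is exactly the swizzling cost discussed in section
1, so it pays off only when the data is processed several times
in its SoA form.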
2. Memory Issues
2.1 Alignment
Handling and manipulating simple variables of the new data
type does not create problems. However, it is recommended that
data operated on by SSE instructions be aligned to 16-byte
boundaries, and the aligned load and store instructions require
it. Variables declared with the new data type need no extra
work, as the compiler aligns them when it comes across their
declarations. Other data that is to be loaded into the SSE
registers, such as an array of floats, has to be aligned
explicitly, either by setting the appropriate compiler flags or
by using an align directive in the variable declaration.
A variable can be aligned to a 16-byte boundary using the
__declspec compiler directive, as illustrated in the following
example. The array myVar will be aligned to a 16-byte boundary
due to the align directive:
__declspec(align(16)) float myVar[4];
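The effect of an alignment directive can be checked at run time
by testing the low four bits of the address. The sketch below
uses the standard C++11 alignas specifier as a portable stand-in
for __declspec(align(16)), which is specific to
Microsoft-compatible compilers. The names isAligned16,
alignedVar and requireAligned16 are assumptions made for this
example.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the address is a multiple of 16.
bool isAligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// A float array forced onto a 16-byte boundary.
alignas(16) float alignedVar[4] = {1.0f, 2.0f, 3.0f, 4.0f};

// Debug-build guard to place in front of an aligned load or store.
void requireAligned16(const void* p) {
    assert(isAligned16(p));
}
```

A check of this kind is cheap enough to leave in debug builds,
where it turns a hard-to-trace fault in an aligned load into an
assertion at the call site.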
2.2 Dynamic Memory
The requirement that pointers used to access data of the new
types be aligned to 16-byte boundaries creates problems when
allocating memory dynamically, or when accessing allocated
arrays through a pointer. In both cases we have to ensure that
the pointer itself is aligned to a 16-byte boundary.
To allocate memory at run time we use either the malloc
function or the new operator. Neither is guaranteed to return an
address aligned to a 16-byte boundary. Hence, we have to either
allocate extra memory and then adjust the pointer to a 16-byte
boundary, or allocate the memory using the _mm_malloc function.
The _mm_malloc function takes the required alignment as its
second argument and returns a memory block aligned to that
boundary.
Just as malloc has free, _mm_malloc has _mm_free. Memory blocks
allocated using _mm_malloc have to be freed using _mm_free.
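A minimal sketch of aligned dynamic allocation follows. It
assumes a compiler whose xmmintrin.h makes _mm_malloc and
_mm_free available; the wrapper names allocFloats16,
freeFloats16 and fill16 are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics, _mm_malloc and _mm_free

// Allocate n floats on a 16-byte boundary; returns a null pointer on failure.
float* allocFloats16(std::size_t n) {
    return static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
}

// Release a block obtained from allocFloats16.
void freeFloats16(float* p) {
    _mm_free(p);
}

// Fill an aligned block with a constant, four floats per store.
// n is assumed to be a multiple of 4.
void fill16(float* p, std::size_t n, float value) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    const __m128 v = _mm_set1_ps(value);
    for (std::size_t i = 0; i < n; i += 4) {
        _mm_store_ps(p + i, v);  // aligned store, valid only on a 16-byte boundary
    }
}
```

Pairing the allocation and release in two small wrappers makes it
harder to free an _mm_malloc block with free by mistake.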
2.3 Custom Datatype
The restriction that pointers be aligned to 16-byte boundaries
can be troublesome. It would be much better to be able to ignore
the alignment of pointers.
When operating on 128-bit data types, it may also be necessary
to access the individual floats stored in the data type. In
assembly language there is little choice but to use assembly
language constructs. When using C or C++ and the intrinsics
library, however, the data is stored in the data type __m128.
Once a value of this type is set, it is not possible to access
the individual floating-point numbers directly. One way to
access them is to store all four floating-point numbers into an
array of floats, change the values and load the array back into
the data type. The second method is to cast a pointer to the
data into a pointer to float and then access the required
element. The first method is time consuming and the second may
cause problems if not used carefully.
Defining a custom data type can overcome these problems. The
custom data type is defined as a union of the data type (__m128)
and an array of four floats. The declaration of the new data
type, called sse4 for now, is given below.
union sse4 {
__m128 m;
float f[4];
};
Using this data type, it is no longer necessary to align
variables explicitly to 16-byte boundaries. Because the union
contains a member of type __m128, the compiler aligns the whole
union to a 16-byte boundary. An added advantage of this data
type is that the individual floating-point numbers stored in the
128-bit data can be accessed directly.
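The following sketch shows the two views of the union in use. The
function name addAndPatch is an assumption made for this example.
Reading a union member other than the one last written is not
sanctioned by the C++ standard, so the sketch relies on the
behaviour that compilers such as Microsoft's and GCC document for
this idiom.

```cpp
#include <cassert>
#include <xmmintrin.h>

union sse4 {
    __m128 m;
    float  f[4];
};

// Add two packed values, then overwrite one element through the float view.
sse4 addAndPatch(const sse4& a, const sse4& b, int lane, float value) {
    assert(lane >= 0 && lane < 4);
    sse4 r;
    r.m = _mm_add_ps(a.m, b.m);  // SIMD view: four additions at once
    r.f[lane] = value;           // scalar view: touch a single float
    return r;
}
```

Note that _mm_set_ps lists its arguments from the highest element
to the lowest, so after _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f) the
element f[0] holds 1.0f.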
2.4 Detecting the CPU
As the usage of SSE depends on the presence of a Pentium III,
it is important that applications be able to detect the Pentium
III chip. This is done using the cpuid instruction.
For the cpuid instruction to work as desired, the eax register
has to be set to the appropriate value. As we are interested in
the processor's feature flags, we set the eax register to 1
before invoking the cpuid instruction. The flags are returned in
the edx register, where bit 25 indicates SSE support.
The source code to detect the presence of the Pentium III CPU
is given below. To be able to compile the code, the file fvec.h
has to be included.
BOOL CheckP3HW()
{
    BOOL SSEHW = FALSE;
    _asm {
        // Move the number 1 into eax so that CPUID returns the
        // processor feature bits in EDX
        mov eax, 1
        // Perform CPUID (puts processor feature info in EDX)
        cpuid
        // Does this processor have SSE support?
        // Shift the bits in edx to the right by 26, so that bit 25
        // (the SSE bit) ends up in the CF bit of the EFLAGS register
        shr edx, 0x1A
        // If CF is not set, jump over the next instruction
        jnc nocarryflag
        // Set the return value to 1 if the CF flag is set
        mov [SSEHW], 1
    nocarryflag:
    }
    return SSEHW;
}
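Compilers that do not accept the _asm block syntax usually offer
a helper for the cpuid instruction instead. The sketch below
performs the same bit-25 test through such helpers. It assumes
either a Microsoft compiler, which provides __cpuid in intrin.h,
or GCC or Clang, which provide __get_cpuid in cpuid.h; the
function names hasSSE and assertSSE are assumptions made for this
example.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC and Clang)
#endif
#include <cassert>

// Returns true when CPUID function 1 reports SSE (bit 25 of EDX).
bool hasSSE() {
    const unsigned int kSseBit = 1u << 25;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};  // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (static_cast<unsigned int>(regs[3]) & kSseBit) != 0;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;            // CPUID function 1 is not available
    }
    return (edx & kSseBit) != 0;
#endif
}

// Debug-build guard for code paths compiled to use SSE unconditionally.
void assertSSE() {
    assert(hasSSE() && "SSE is required by this build");
}
```

An application would typically call hasSSE once at start-up and
select between an SSE code path and a scalar fallback based on
the result.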
The SSE SDK also has an SSE emulation mode that emulates the
Pentium III and the SSE registers. The code given below can be
used to detect this emulation. To be able to compile the code,
the file fvec.h has to be included.
// Checking for SSE emulation support
BOOL CheckP3Emu()
{
    BOOL SSEEmu = TRUE;
    F32vec4 pNormal(1.0f, 2.0f, 3.0f, 4.0f);
    F32vec4 pZero(0.0f);
    // Checking for SSE HW emulation
    __try {
        _asm {
            // Issue a move instruction that will cause an exception
            // without HW support or emulation
            movups xmm1, [pNormal]
            // Issue a computational instruction that will cause an
            // exception without HW support or emulation
            divps xmm1, [pZero]
        }
    }
    // If there's an exception, set emulation variable to false
    __except(EXCEPTION_EXECUTE_HANDLER) {
        SSEEmu = FALSE;
    }
    return SSEEmu;
}
3. Additional References
For more details about the architecture of the Pentium III,
refer to [11], [12], [13] and [10].
For more information about processor identification and CPUID,
refer to [15] and [7].
For more information about the Streaming SIMD Extensions,
refer to [19]. For more information about programming issues,
software conventions and software development strategies,
refer to [9], [16] and [17] respectively.
For more about application tuning for SSE and the VTune
performance enhancement application, refer to [1] and [8]
respectively. For more details about VTune and the Intel C/C++
Compiler, refer to [3] and [2] respectively.
For additional information about the Pentium III processor and
its capabilities, refer to [14], [18], [20], [6], [5] and [4].
4. Additional Examples
In this section, we present additional examples to illustrate
the usage of the Streaming SIMD Extensions.
4.1 Array Manipulation
In this example, we take two arrays, each with 400 floats, and
multiply them element by element. The two arrays used as
operands are named a and b, and the result of the multiplication
is stored in a third array, c. Since each SSE instruction
handles four floats, the loops below run 100 times. In all the
sources given below, the following declaration is assumed:
#include <fvec.h>
#define ARRSIZE 400
__declspec(align(16)) float a[ARRSIZE], b[ARRSIZE], c[ARRSIZE];
4.1.1 Assembly Language
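_asm {
    push esi
    push edi
    lea edi, a          // address of a
    lea esi, b          // address of b
    lea edx, c          // address of c
    mov ecx, 100        // ARRSIZE / 4 iterations
mulloop:
    movaps xmm0, [edi]  // load four floats from a
    movaps xmm1, [esi]  // load four floats from b
    mulps xmm0, xmm1    // four products at once
    movaps [edx], xmm0  // store four floats to c
    add edi, 16
    add esi, 16
    add edx, 16
    dec ecx
    jnz mulloop
    pop edi
    pop esi
}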
4.1.2 Intrinsics
__m128 m1, m2, m3;
for ( int i = 0; i < ARRSIZE; i += 4 ) {
    m1 = _mm_loadu_ps(a+i);
    m2 = _mm_loadu_ps(b+i);
    m3 = _mm_mul_ps(m1, m2);
    _mm_storeu_ps(c+i, m3);
}
4.1.3 C++
F32vec4 f1, f2, f3;
for ( int i = 0; i < ARRSIZE; i += 4 ) {
    loadu(f1, a+i);
    loadu(f2, b+i);
    f3 = f1 * f2;
    storeu(c+i, f3);
}
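The three versions above rely on the array length being a
multiple of four. For arbitrary lengths, one common approach is
to process as many groups of four as possible with SSE and to
finish the remaining elements one at a time. The sketch below
illustrates this with the intrinsics; the function name
mulArrays is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// c[i] = a[i] * b[i] for i in [0, n). No alignment is assumed, so the
// unaligned load and store intrinsics are used throughout.
void mulArrays(const float* a, const float* b, float* c, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && c != nullptr));
    std::size_t i = 0;
    // Main loop: four products per iteration.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));
    }
    // Tail loop: the remaining zero to three elements, one at a time.
    for (; i < n; ++i) {
        c[i] = a[i] * b[i];
    }
}
```

When the arrays are known to be aligned to 16-byte boundaries, as
in the declaration above, _mm_load_ps and _mm_store_ps can
replace the unaligned variants in the main loop.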
4.2 Vector for 3D
This example presents a vector in 3D. The vector is
implemented as a class. The functionality of the class is
implemented using the intrinsics library.
The class declaration is given below.
union sse4 {
__m128 m;
float f[4];
};
class sVector3 {
protected:
sse4 val;
public:
sVector3(float, float, float);
float& operator [](int);
sVector3& operator +=(const sVector3&);
float length() const;
friend float dot(const sVector3&, const sVector3&);
};
The class implementation is given below.
sVector3::sVector3(float x, float y, float z) {
    val.m = _mm_set_ps(0, z, y, x);
}
float& sVector3::operator [](int i) {
    return val.f[i];
}
sVector3& sVector3::operator +=(const sVector3& v) {
    val.m = _mm_add_ps(val.m, v.val.m);
    return *this;
}
float sVector3::length() const {
    sse4 m1;
    // Square the components, sum them, then take one square root
    m1.m = _mm_mul_ps(val.m, val.m);
    m1.m = _mm_sqrt_ss(_mm_set_ss(m1.f[0] + m1.f[1] + m1.f[2]));
    return m1.f[0];
}
float dot(const sVector3& v1, const sVector3& v2) {
    sVector3 v(v1);
    v.val.m = _mm_mul_ps(v.val.m, v2.val.m);
    return v.val.f[0] + v.val.f[1] + v.val.f[2];
}
4.3 4x4 Matrix
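This example presents a 4x4 matrix. The matrix is implemented
as a class. The functionality of the class is implemented using
the intrinsics library. The class reuses the sse4 union from the
previous example and refers to a four-component vector class,
sVector4, which is not listed in this article.
The class declaration is given below.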
float const sEPSILON = 1.0e-10f;
union sse16 {
__m128 m[4];
float f[4][4];
};
class sMatrix4 {
protected:
sse16 val;
sse4 sFuzzy;
public:
sMatrix4(float*);
float& operator()(int, int);
sMatrix4& operator +=(const sMatrix4&);
bool operator ==(const sMatrix4&) const;
sVector4 operator *(const sVector4&) const;
private:
float RCD(const sMatrix4& B, int i, int j) const;
};
The class implementation is given below.
sMatrix4::sMatrix4(float* fv) {
val.m[0] = _mm_set_ps(fv[3], fv[2], fv[1], fv[0]);
val.m[1] = _mm_set_ps(fv[7], fv[6], fv[5], fv[4]);
val.m[2] = _mm_set_ps(fv[11], fv[10], fv[9], fv[8]);
val.m[3] = _mm_set_ps(fv[15], fv[14], fv[13], fv[12]);
float f = sEPSILON;
sFuzzy.m = _mm_set_ps(f, f, f, f);
}
float& sMatrix4::operator()(int i, int j) {
return val.f[i][j];
}
sMatrix4& sMatrix4::operator +=(const sMatrix4& M) {
val.m[0] = _mm_add_ps(val.m[0], M.val.m[0]);
val.m[1] = _mm_add_ps(val.m[1], M.val.m[1]);
val.m[2] = _mm_add_ps(val.m[2], M.val.m[2]);
val.m[3] = _mm_add_ps(val.m[3], M.val.m[3]);
return *this;
}
bool sMatrix4::operator ==(const sMatrix4& M) const {
int res[4];
res[0] = res[1] = res[2] = res[3] = 0;
res[0] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[0], M.val.m[0]),
_mm_min_ps(val.m[0], M.val.m[0])), sFuzzy.m));
res[1] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[1], M.val.m[1]),
_mm_min_ps(val.m[1], M.val.m[1])), sFuzzy.m));
res[2] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[2], M.val.m[2]),
_mm_min_ps(val.m[2], M.val.m[2])), sFuzzy.m));
res[3] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[3], M.val.m[3]),
_mm_min_ps(val.m[3], M.val.m[3])), sFuzzy.m));
if ( (15 == res[0]) && (15 == res[1])
&& (15 == res[2]) && (15 == res[3]) )
return 1;
return 0;
}
sVector4 sMatrix4::operator *(const sVector4& v) const {
return sVector4(
val.f[0][0] * v[0] + val.f[0][1] * v[1]
+ val.f[0][2] * v[2] + val.f[0][3] * v[3],
val.f[1][0] * v[0] + val.f[1][1] * v[1]
+ val.f[1][2] * v[2] + val.f[1][3] * v[3],
val.f[2][0] * v[0] + val.f[2][1] * v[1]
+ val.f[2][2] * v[2] + val.f[2][3] * v[3],
val.f[3][0] * v[0] + val.f[3][1] * v[1]
+ val.f[3][2] * v[2] + val.f[3][3] * v[3]);
}
float sMatrix4::RCD(const sMatrix4& B, int i, int j) const {
return val.f[i][0] * B.val.f[0][j] + val.f[i][1] * B.val.f[1][j]
+ val.f[i][2] * B.val.f[2][j] + val.f[i][3] * B.val.f[3][j];
}
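The operator * listed above multiplies the matrix and the vector
with scalar arithmetic. As a sketch of how the same product could
be formed with the intrinsics, the function below multiplies each
row by the vector, transposes the four partial results with the
_MM_TRANSPOSE4_PS macro and adds them. It works on plain float
arrays rather than on the sMatrix4 class, and the name
mulMat4Vec4 is an assumption made for this example.

```cpp
#include <cassert>
#include <xmmintrin.h>

// out = M * v for a row-major 4x4 matrix stored as 16 consecutive floats.
// No alignment is assumed for any of the pointers.
void mulMat4Vec4(const float* m, const float* v, float* out) {
    assert(m != nullptr && v != nullptr && out != nullptr);
    const __m128 vec = _mm_loadu_ps(v);
    // p0..p3 hold the element-wise products of rows 0..3 with v.
    __m128 p0 = _mm_mul_ps(_mm_loadu_ps(m + 0), vec);
    __m128 p1 = _mm_mul_ps(_mm_loadu_ps(m + 4), vec);
    __m128 p2 = _mm_mul_ps(_mm_loadu_ps(m + 8), vec);
    __m128 p3 = _mm_mul_ps(_mm_loadu_ps(m + 12), vec);
    // After the transpose, register j holds product j of every row, so
    // adding the four registers leaves dot(row i, v) in element i.
    _MM_TRANSPOSE4_PS(p0, p1, p2, p3);
    _mm_storeu_ps(out, _mm_add_ps(_mm_add_ps(p0, p1), _mm_add_ps(p2, p3)));
}
```

Whether this is faster than the scalar version depends on the
processor and on how often the matrix is reused, so it is worth
measuring both before choosing one.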
References
[1] James Abel, Kumar Balasubramanian, Mike Bargeron, Tom
Craver, and Mike Phlipot. Applications Tuning for Streaming SIMD
Extensions. Technical report, Intel Corporation, 1999.
[2] Intel Corporation. Intel C/C++ Compiler Web Site.
http://developer.intel.com/vtune/icl.
[3] Intel Corporation. VTune Performance Analyzer Web Site.
http://developer.intel.com/vtune/performance.
[4] Intel Corporation. Developer Relations Group Web Site.
http://developer.intel.com/drg.
[5] Intel Corporation. Intel Developer Web Site.
http://developer.intel.com.
[6] Intel Corporation. Web site. http://www.intel.com.
[7] Stephan Fischer, James Mi, and Albert Tang. Pentium III
Processor Serial Number Feature and Applications. Technical
report, Intel Corporation, 1999.
[8] Joe Wolf III. Programming Methods for the Pentium III
Processor Streaming SIMD Extensions Using the VTune Performance
Enhancement Environment. Technical report, Intel Corporation,
1999.
[9] Intel Corporation. Data Alignment and Programming Issues
for the Streaming SIMD Extensions with the Intel C/C++ Compiler,
1999. App Note AP833.
[10] Intel Corporation. Intel Architecture Optimization
Reference Manual, 1999.
[11] Intel Corporation. Intel Architecture Software
Development Manual. Volume 1: Basic Architecture, 1999.
[12] Intel Corporation. Intel Architecture Software
Development Manual. Volume 2: Instruction Set Reference, 1999.
[13] Intel Corporation. Intel Architecture Software
Development Manual. Volume 3: Systems Programming Guide, 1999.
[14] Intel Corporation. Intel Pentium III Processor
Performance Brief, 1999.
[15] Intel Corporation. Intel Processor Identification and
CPUID Instruction, 1999. App Note AP-485.
[16] Intel Corporation. Software Conventions for Streaming
SIMD Extensions, 1999. App Note AP589.
[17] Intel Corporation. Software Development Strategies for
Streaming SIMD Extensions, 1999. App Note AP814.
[18] Jagannath Keshavan and Vladimir Pentkovski. Pentium III
Processor Implementation Trade-offs. Technical report, Intel
Corporation, 1999.
[19] Shreekant Thakkar and Tom Huff. Internet Streaming SIMD
Extensions. Technical report, Intel Corporation, 1999.
[20] Paul Zagacki, Deep Duch, Emil Hsiech, Daniel Melaku, and
Vladimir Pentkovski. Architecture of 3D Software Stack for Peak
Pentium III Processor Performance. Technical report, Intel
Corporation, 1999.
Bipin Patwardhan
National Centre for Software Technology, Mumbai.
email: bipin@ncst.ernet.in