# Thread: For all you speed freaks

1. ## For all you speed freaks

I think we should have an optimization thread. I don't want a debate over asm optimization vs. algorithmic optimization; they both have their place (and yes, algorithmic should _always_ come first, but asm has its spot too).

So yeah, I'm gonna start one. I'll post an SSE cross-product routine below, and I'm gonna work on a matrix mult (4x4 * 4x4) routine using SSE next (god, a decent dot product would be nice... fsck Intel and their dumb choices).

If anyone else has _any_ cool optimization tricks, put 'em in here and show us what you got (benchmark data would be cool too).

EDIT:

Oh yeah, feedback is welcome too! (I know my sh*t could be a lot better, I'm just incompetent and still learning ;p)

2. I'm posting both my C code and the ASM code; both assume these definitions:

Code:
```
typedef float vector4[4] __attribute__ ((aligned (16)));
#define vector3 vector4
```
Benching with the rdtsc instruction gives me 92 clock cycles for the C version and 52 clock cycles for the asm version, both inlined (the C one by hand).
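For anyone who wants to reproduce numbers like these, here is a minimal sketch of an rdtsc timing harness. It is not the harness behind the figures above. The names `min_cycles`, `cross_args` and `cross_work` are made up for illustration, and it assumes GCC or Clang on x86, where `<x86intrin.h>` provides `__rdtsc()`.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

/* Hypothetical helper: call fn(arg) `trials` times and return the
 * smallest TSC delta seen for a single call. Taking the minimum
 * filters out runs disturbed by interrupts or cache misses. */
static uint64_t min_cycles(void (*fn)(void *), void *arg, int trials)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t t0 = __rdtsc();
        fn(arg);
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Hypothetical workload: a plain scalar cross product, out = a x b. */
struct cross_args { float a[4], b[4], out[4]; };

static void cross_work(void *p)
{
    struct cross_args *c = (struct cross_args *)p;
    c->out[0] = c->a[1] * c->b[2] - c->a[2] * c->b[1];
    c->out[1] = c->a[2] * c->b[0] - c->a[0] * c->b[2];
    c->out[2] = c->a[0] * c->b[1] - c->a[1] * c->b[0];
    c->out[3] = 0.0f;
}
```

Calling `min_cycles(cross_work, &args, 1000)` returns a tick count that includes the indirect call and the cost of rdtsc itself, so treat it as an upper bound rather than an exact figure. rdtsc is also not a serializing instruction, so very short sequences can be reordered around it.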

Code:
```
void    vector3_cross ( vector3 src, vector3 dest )
{
    /* dest = src x dest; build the result in a temporary
     * because dest is also an input */
    vector3 buf = {
        src[1]*dest[2] - src[2]*dest[1],
        src[2]*dest[0] - src[0]*dest[2],
        src[0]*dest[1] - src[1]*dest[0]
    };

    vector3_copy(buf, dest);
}
```
Code:
```
static float neg_one = -1.0f;

...

QMATH void vector3_cross ( vector3 src, vector3 dest )
{
    /*
     * the shufps imm value is a bit confusing,
     * to get it:
     *
     * write out the reverse of what you want to
     * rearrange the vector as
     *
     * (src) y x x w = w x x y
     *
     * x = 0 ... w = 3
     *
     * so:
     * w x x y = 3 0 0 1
     *
     * shift needed:
     * 3 0 0 1 *
     * 4 1 4 1
     * --------
     * c 0 0 1
     *
     * combine
     * c + 0 = c
     * 0 + 1 = 1
     *
     * which gives 0xc1
     */
    __asm__ volatile (
        "movaps %[src], %%xmm0\n\t"
        "movaps %[dest], %%xmm2\n\t"
        "movaps %[src], %%xmm1\n\t"
        "movaps %[dest], %%xmm3\n\t"
        "shufps $0xc1, %%xmm0, %%xmm0\n\t"   /* xmm0 = (sy, sx, sx, sw) */
        "shufps $0xda, %%xmm2, %%xmm2\n\t"   /* xmm2 = (dz, dz, dy, dw) */
        "shufps $0xda, %%xmm1, %%xmm1\n\t"   /* xmm1 = (sz, sz, sy, sw) */
        "shufps $0xc1, %%xmm3, %%xmm3\n\t"   /* xmm3 = (dy, dx, dx, dw) */
        "mulps  %%xmm0, %%xmm2\n\t"
        "mulps  %%xmm1, %%xmm3\n\t"
        "subps  %%xmm3, %%xmm2\n\t"          /* y lane has the wrong sign here */
        "shufps $0xe1, %%xmm2, %%xmm2\n\t"   /* swap x and y: y is now the low lane */
        "mulss  %[neg_one], %%xmm2\n\t"      /* negate the low lane only */
        "shufps $0xe1, %%xmm2, %%xmm2\n\t"   /* swap x and y back */
        "movaps %%xmm2, %[dest]\n\t"

        : [dest] "+m" (*dest)
        : [src] "m" (*src), [neg_one] "m" (neg_one)
        /* "memory" because the operands only describe the first float */
        : "xmm0", "xmm1", "xmm2", "xmm3", "memory"

    );
}
```
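The immediate recipe in the comment above boils down to packing four 2-bit lane indices into one byte, lowest result lane first. Below is a rough sketch in portable C that builds the immediates and emulates the shuffle on a plain float array, so the data flow of the routine can be checked without SSE hardware. The names `shuf_imm`, `shuf` and `cross_emulated` are made up for illustration, and the emulation only covers the case where both shufps register operands are the same.

```c
#include <stdint.h>

/* Build a shufps immediate: result lane i takes source lane l_i
 * (x = 0, y = 1, z = 2, w = 3). */
static uint8_t shuf_imm(int l0, int l1, int l2, int l3)
{
    return (uint8_t)(l0 | (l1 << 2) | (l2 << 4) | (l3 << 6));
}

/* Scalar emulation of shufps with the same register as both operands. */
static void shuf(float v[4], uint8_t imm)
{
    float t[4] = { v[0], v[1], v[2], v[3] };
    for (int i = 0; i < 4; i++)
        v[i] = t[(imm >> (2 * i)) & 3];
}

/* Same sequence of steps as the SSE routine: dest = src x dest. */
static void cross_emulated(const float src[4], float dest[4])
{
    float a[4], b[4], c[4], d[4];
    for (int i = 0; i < 4; i++) {
        a[i] = c[i] = src[i];
        b[i] = d[i] = dest[i];
    }
    shuf(a, shuf_imm(1, 0, 0, 3));   /* (sy, sx, sx, sw) */
    shuf(b, shuf_imm(2, 2, 1, 3));   /* (dz, dz, dy, dw) */
    shuf(c, shuf_imm(2, 2, 1, 3));   /* (sz, sz, sy, sw) */
    shuf(d, shuf_imm(1, 0, 0, 3));   /* (dy, dx, dx, dw) */
    for (int i = 0; i < 4; i++)
        b[i] = a[i] * b[i] - c[i] * d[i];   /* mulps, mulps, subps */
    shuf(b, shuf_imm(1, 0, 2, 3));   /* swap x and y */
    b[0] *= -1.0f;                   /* mulss: negate the low lane */
    shuf(b, shuf_imm(1, 0, 2, 3));   /* swap back */
    for (int i = 0; i < 4; i++)
        dest[i] = b[i];
}
```

With `src = (1, 2, 3)` and `dest = (4, 5, 6)`, `cross_emulated` leaves `(-3, 6, -3)` in `dest`, which matches the scalar cross product, and `shuf_imm(1, 0, 0, 3)` comes out as `0xc1`, the same value the comment derives by hand.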
