Current version

v1.10.4 (stable)

Navigation

Main page
Archived news
Downloads
Documentation
   Capture
   Compiling
   Processing
   Crashes
Features
Filters
Plugin SDK
Knowledge base
Contact info
Forum
 
Other projects
   Altirra

Search

Archives

01 Dec - 31 Dec 2013
01 Oct - 31 Oct 2013
01 Aug - 31 Aug 2013
01 May - 31 May 2013
01 Mar - 31 Mar 2013
01 Feb - 29 Feb 2013
01 Dec - 31 Dec 2012
01 Nov - 30 Nov 2012
01 Oct - 31 Oct 2012
01 Sep - 30 Sep 2012
01 Aug - 31 Aug 2012
01 June - 30 June 2012
01 May - 31 May 2012
01 Apr - 30 Apr 2012
01 Dec - 31 Dec 2011
01 Nov - 30 Nov 2011
01 Oct - 31 Oct 2011
01 Sep - 30 Sep 2011
01 Aug - 31 Aug 2011
01 Jul - 31 Jul 2011
01 June - 30 June 2011
01 May - 31 May 2011
01 Apr - 30 Apr 2011
01 Mar - 31 Mar 2011
01 Feb - 29 Feb 2011
01 Jan - 31 Jan 2011
01 Dec - 31 Dec 2010
01 Nov - 30 Nov 2010
01 Oct - 31 Oct 2010
01 Sep - 30 Sep 2010
01 Aug - 31 Aug 2010
01 Jul - 31 Jul 2010
01 June - 30 June 2010
01 May - 31 May 2010
01 Apr - 30 Apr 2010
01 Mar - 31 Mar 2010
01 Feb - 29 Feb 2010
01 Jan - 31 Jan 2010
01 Dec - 31 Dec 2009
01 Nov - 30 Nov 2009
01 Oct - 31 Oct 2009
01 Sep - 30 Sep 2009
01 Aug - 31 Aug 2009
01 Jul - 31 Jul 2009
01 June - 30 June 2009
01 May - 31 May 2009
01 Apr - 30 Apr 2009
01 Mar - 31 Mar 2009
01 Feb - 29 Feb 2009
01 Jan - 31 Jan 2009
01 Dec - 31 Dec 2008
01 Nov - 30 Nov 2008
01 Oct - 31 Oct 2008
01 Sep - 30 Sep 2008
01 Aug - 31 Aug 2008
01 Jul - 31 Jul 2008
01 June - 30 June 2008
01 May - 31 May 2008
01 Apr - 30 Apr 2008
01 Mar - 31 Mar 2008
01 Feb - 29 Feb 2008
01 Jan - 31 Jan 2008
01 Dec - 31 Dec 2007
01 Nov - 30 Nov 2007
01 Oct - 31 Oct 2007
01 Sep - 30 Sep 2007
01 Aug - 31 Aug 2007
01 Jul - 31 Jul 2007
01 June - 30 June 2007
01 May - 31 May 2007
01 Apr - 30 Apr 2007
01 Mar - 31 Mar 2007
01 Feb - 29 Feb 2007
01 Jan - 31 Jan 2007
01 Dec - 31 Dec 2006
01 Nov - 30 Nov 2006
01 Oct - 31 Oct 2006
01 Sep - 30 Sep 2006
01 Aug - 31 Aug 2006
01 Jul - 31 Jul 2006
01 June - 30 June 2006
01 May - 31 May 2006
01 Apr - 30 Apr 2006
01 Mar - 31 Mar 2006
01 Feb - 29 Feb 2006
01 Jan - 31 Jan 2006
01 Dec - 31 Dec 2005
01 Nov - 30 Nov 2005
01 Oct - 31 Oct 2005
01 Sep - 30 Sep 2005
01 Aug - 31 Aug 2005
01 Jul - 31 Jul 2005
01 June - 30 June 2005
01 May - 31 May 2005
01 Apr - 30 Apr 2005
01 Mar - 31 Mar 2005
01 Feb - 29 Feb 2005
01 Jan - 31 Jan 2005
01 Dec - 31 Dec 2004
01 Nov - 30 Nov 2004
01 Oct - 31 Oct 2004
01 Sep - 30 Sep 2004
01 Aug - 31 Aug 2004

Stuff

Powered by Pivot  
XML: RSS feed 
XML: Atom feed 

§ Floating point specials and shader emulation

I decided to work a bit more on my shader compiler and emulator last night, and ran into some unexpectedly ugly problems with floating point specials.

The first problem had to do with the following innocent looking expression:

length(vec)

This returns the magnitude of a vector. The conventional way to compute this is to compute the squared magnitude by taking the dot product of the vector with itself -- the vector analog of squaring a scalar -- and then taking the square root of the result. It turns out that shader hardware doesn't actually support a direct square root. The reciprocal square root, 1/sqrt(x) is easier to compute, and it's also useful in some other cases, most notably normalizing a vector. In this particular case, though, we need the normal square root, and therefore we need to invert it in the assembly:

dp3 r0.x, r_vec, r_vec    ;compute dot(vec, vec)
rsq r0.x, r0.x ;compute reciprocal square root of squared magnitude
rcp r0.x, r0.x ;invert reciprocal square root

This little code fragment has a major surprise lurking in it, which may not be apparent until you try optimizing it. Both rsq and rcp are scalar instructions, so in cases where multiple magnitudes are being computed, the temptation is high to replace 1/rsqrt(x) with x*rsqrt(x) instead to take advantage of the much more available mul instruction:

dp3 r0.x, r_vec, r_vec
rsq r0.y, r0.x
mul r0.x, r0.x, r0.y

This works just fine... until you have the misfortune of trying to compute the length of a zero vector. In that case, the reciprocal square root operation returns +Infinity, and then the next thing that happens is the computation 0*+Infinity, which then returns a NaN (invalid result). Suck. Therefore, for the general case, that rcp has to stay there.

The real gotcha comes when you try implementing the rsq and rcp instructions in software. Reciprocal is a very slow instruction on most FPUs, being done with the divide unit and usually taking dozens of nonpipelineable clocks. Reciprocal square root may not even exist in full precision form, and 1/sqrt(x) is a horribly painful way to implement it. If you want to implement this fragment quickly in SSE, you need to take advantage of the rcpps and rsqrtps instructions, which are very fast and work in parallel on four values. They only provide limited precision up to about 2^-12, though. The WPF shader engine just goes ahead and uses the approximation result directly, which works and is accurate enough for half precision (10 bit mantissa), but technically it's not Direct3D compliant as 22 bits of precision are needed.

The usual way to improve the accuracy of the reciprocal and reciprocal square root operations is by iteration through Newton's Method. For the reciprocal, it looks like this:

x = reciprocal_approx(c);
x' = x * (2 - x * c);

...and for the reciprocal square root, it looks like this:

x = rsqrt_approx(c);
x' = 0.5 * x * (3 - x*x*c);

Assuming that you have a good estimate, these will tend to double the number of significant digits per iteration, which means that just one iteration will give us pretty good precision, and quickly, too. And unfortunately wrong, as I discovered when I implemented it. The problem is once again zero. In order for the zero case to work, we need rsq to transform 0 -> Inf and rcp to transform Inf -> 0, but thanks to the x*c expression in both of these iterations, you once again get 0 * Inf = NaN. The way I fixed it was to insert a couple of carefully placed min/max operations in the iteration, although I'm not quite sure they're 100% correct.

The specials struck again when I was trying to optimize the code generated for a gamma correction shader. Gamma correction is primarily a power operation, which expands as follows:

pow(x, y) = exp(y * log(x))

If you actually try gamma correcting an image in this manner, you'll be waiting a long time for the result. For limited precision (8-bit), you can get away with a lookup table, but that doesn't scale well to higher precision or vector computations and definitely not to a shader where floating point is involved. Therefore, in order to get a faster version working, I had to implement the log and exp instructions, which compute the base 2 logarithm and exponential functions. SSE doesn't provide you any help to do this, so you're stuck implementing this from the ground floor. It's a bit like an old BASIC interpreter, except at least you start with add and multiply. This triggered the following conversation with a friend:

"What are you doing?"
"I'm implementing the log() and exp() functions."
"Doesn't the runtime provide those?"
"This is the runtime."

Anyway, I ended up implementing exp2(x) = c*floor(x) + f(x - floor(x)) and log2(x) = c*exponent(x) + g(mantissa(x)). For the most part they're not too hard, as long as you find good polynomial expansions and make sure it's exact at the right values, i.e. you don't want exp(0) = 1.004. In the end, I used a fifth-order polynomial for exp2() and a first-order Padé approximation with a change of variables for log2()... but I digress.

As you have probably already guessed, zero rears its ugly head again here, because 0^y becomes exp2(y * log2(0)), and for this to work you need log2(0) = -Infinity and exp2(-Infinity) = 0. A couple of well-placed min/max operations in the expansions once again fixed the SSE version, but I unexpectedly ran into problems with the x87 version. I hadn't bothered to optimize the x87 version, so it simply called into expf() and scaled the result. Since I compile with /fp:fast /Oi, expf() ended up getting expanded in intrinsic form like this:

fldl2e
fmul
fld     st(0)
frndint
fxch    st(1)
fsub st, st(1)
f2xm1
fld1
fadd
fscale

If you look closely, you'll see that this computes (x - round(x)), which in this case is -Inf - (-Inf) = NaN. There are two basic ways to fix this. One is to force the runtime library version of the function, which is faster for SSE2 but unfortunately quite slow for x87. The other way, which is what I did, is to just check for infinity and special-case the result.

Ordinarily I don't think much about floating point specials, and the last place I would have expected to find them is in a pixel shader. I have to say it's been a bit of a humbling (and frustrating) experience.

Comments

Comments posted:


There's a fast Inverse Square Root C function on the web somewhere, that made its way into the Quake III Arena source code. It starts off with an extremely accurate first iteration, and could possibly suit your code better than re-inventing the wheel, if you can re-implement it in assembly.

Jethro - 27 10 08 - 05:08


There's an article on the fast square root at:
http://betterexplained.com/articles/unde..

Michael - 27 10 08 - 17:35


I've used the fast square root routine before. It's basically the same thing, a fast initial guess based on exponent conversion and mantissa lookup followed by one or more Newton-Raphson iterations. It's not so useful for SSE because the SSE rsqrtss/rsqrtps instructions get you more accurate estimates in a lot fewer cycles and also gets the special cases correct, like one and zero.

Phaeron - 27 10 08 - 23:25


Fast square root is just one example of this technology. The core idea is using bit whacking of IEEE floating point to do fast log2(x) and 2**x. From there just apply logarithm identities to generate general functions like sqrt() and pow().

Google for a copy of "IEEE 00595279.pdf" by James F. Blinn, Microsoft Research, to get the gory theory details.

IanB - 01 11 08 - 18:23


If that's the same article that was reprinted in his books, I've read it (I have his first tree). The functions in the article are fast, but fairly low precision.

The NIST book by Abramowitz & Stegun, Handbook of Mathematical Functions with Formulas, Graphics and Mathematical tables, provides some nice approximations for various trancendental functions. You'll generally find about three approximations for each, so you can choose between around 10^-3, 10^-5, and 10^-8 accuracy. I'd also recommend reading the presentation "Faster Math Functions" by Robin Green at SCEA, which gives a lot of background and practical advice.

Phaeron - 02 11 08 - 15:09

Comment form


Please keep comments on-topic for this entry. If you have unrelated comments about VirtualDub, the forum is a better place to post them.
Name:  
Remember personal info?

Email (Optional):
Your email address is only revealed to the blog owner and is not shown to the public.
URL (Optional):
Comment: /

An authentication dialog may appear when you click Post Comment. Simply type in "post" as the user and "now" as the password. I have had to do this to stop automated comment spam.



Small print: All html tags except <b> and <i> will be removed from your comment. You can make links by just typing the url or mail-address.