Some improvements around `Q_rsqrt()` #1458

illwieckz · 2024-12-09T19:55:49Z

Add Q_sqrt_fast() that skips the second iteration, it is believed to be useless:
Do we need this level of precision in Q_rsqrt? #1457
Use better magic constants for the Q_rsqrt trick,
the values come from: http://rrrola.wz.cz/inv_sqrt.html
Use Q_sqrt_fast() in R_TBNtoQtangents() and VectorNormalizeFast().

illwieckz · 2024-12-09T20:13:03Z

Hmm, I removed the engine: use VectorNormalizeFast when the return value isn't used commit, maybe it breaks the tests.

illwieckz · 2024-12-09T20:16:49Z

The commit removing the second iteration makes this gives 20% performance bump on such scene with 170+ models rendered using the CPU (r_vboModel 0):

Before, 8fps:

After, 10fps:

illwieckz · 2024-12-09T20:28:19Z

Hmm, I removed the engine: use VectorNormalizeFast when the return value isn't used commit, maybe it breaks the tests.

I readded that commit, what breaks the test is the removal of the second iteration.

How do we check test to be true, by comparing against previous values we computed with Dæmon?

slipher · 2024-12-09T20:36:09Z

How do we check test to be true, by comparing against previous values we computed with Dæmon?

Yeah that's how I made the patch trace tests. Since there is a long chain of calculations for converting a patch to collision geometry, I wasn't able to find the platonically ideal value (unlike with brushes where it easy).

Making traces imprecise is probably much more likely to cause bugs than graphics stuff. Let's not mess with traces.

illwieckz · 2024-12-09T20:43:17Z

But is it correct to do y *= ( 1.5f - ( x * y * y ) );
after _mm_store_ss( &y, _mm_rsqrt_ss( _mm_load_ss( &number ) ) );?

slipher · 2024-12-09T23:10:00Z

src/engine/qcommon/q_math.cpp

@@ -1082,7 +1082,7 @@ void PerpendicularVector( vec3_t dst, const vec3_t src )
 	/*
 	 * * normalize the result
 	 */
-	VectorNormalize( dst );
+	VectorNormalizeFast( dst );


I definitely don't want to see the inaccurate version used for basic math routines. It's OK to use the sloppy version for graphics-specific stuff

Why is it inaccurate? They both use Q_sqrt() (and then they both use _mm_rsqrt_ps()).

// returns vector length inline vec_t VectorNormalize( vec3_t v ) { vec_t length = DotProduct( v, v ); if ( length != 0.0f ) { vec_t ilength = Q_rsqrt( length ); /* sqrtf(length) = length * (1 / sqrtf(length)) */ length *= ilength; VectorScale( v, ilength, v ); } return length; } // does NOT return vector length, uses rsqrt approximation // fast vector normalize routine that does not check to make sure // that length != 0, nor does it return length inline void VectorNormalizeFast( vec3_t v ) { vec_t ilength = Q_rsqrt( DotProduct( v, v ) ); VectorScale( v, ilength, v ); }

The comment before VectorNormalizeFast() saying uses rsqrt approximation is misleading, they both uses rsqrt. Maybe long time ago in the past the other one didn't, but they both do.

The difference is that VectorNormalize() tests for the length not being zero, skips an useless scale if it is, and returns the length.

In the past I did the contrary, replacing VectorNormalizeFast() with VectorNormalize() in hope the extra test would actually make it faster by skipping the scale, it was slower because of the test breaking the vectorization.

The two functions should produce the same results as they use the same operations, one of them is just slower.

OK I commented on an irrelevant diff here, but I stand by the words themselves. The old version of Q_rsqrt was pretty accurate, having only the least significant 1-2 binary places wrong, so it was good enough to use more or less anywhere. Now if we are going to make it a rough approximation, we shouldn't be using it all the time.

illwieckz · 2024-12-10T12:21:02Z

Actually 3 constants are needed to be modified to get better precision. I reimplemented the Q_rsqrt() function with them. I moved the code to a different place of the header because I want it to be after the VectorSubtract and similar macros for a branch I have that provide vectorized Q_rsqrt() functions for vectorized VectorNormalizeFast that computes multiple reverse square root at once for multiple vectors to be normalized.

illwieckz · 2024-12-10T12:22:29Z

The old version of Q_rsqrt was pretty accurate, having only the least significant 1-2 binary places wrong, so it was good enough to use more or less anywhere.

Was it correct to always do the second iteration of the magic trick even when using SSE's _mm_rsqrt_ss() instead of the magic trick?

illwieckz · 2024-12-10T13:02:51Z

The sseQuatNormalize() function implements the second iteration right after calling _mm_rsqrt_ps, why?

	inline __m128 sseQuatNormalize( __m128 q ) {
		__m128 p = _mm_mul_ps( q, q );
		__m128 t, h;
		p = _mm_add_ps( sseSwizzle( p, XXZZ ),
				sseSwizzle( p, YYWW ) );
		p = _mm_add_ps( sseSwizzle( p, XXXX ),
				sseSwizzle( p, ZZZZ ) );
		t = _mm_rsqrt_ps( p );
		h = _mm_mul_ps( _mm_set1_ps( 0.5f ), t );
		t = _mm_mul_ps( _mm_mul_ps( t, t ), p );
		t = _mm_sub_ps( _mm_set1_ps( 3.0f ), t );
		t = _mm_mul_ps( h, t );
		return _mm_mul_ps( q, t );
	}

illwieckz · 2024-12-10T17:35:53Z

Using the values from http://rrrola.wz.cz/inv_sqrt.html the compute is wrong if we do two iterations, but it is meant to give good results with a single iteration anyway.

illwieckz · 2024-12-10T18:08:40Z

ioquake3 has a second function named SquareRootFloat() that do two iterations, this function lives in code/qcommon/cm_trace.c and is used in functions like CM_TraceThroughSphere().

On our side in src/common/cm/cm_trace.cpp, the CM_TraceThroughSphere() doesn't use any Q_sqrt() or SquareRootFloat(), it uses fsqrt(). The code from cm_trace.cpp doesn't use Q_sqrt() at all.

Making traces imprecise is probably much more likely to cause bugs than graphics stuff. Let's not mess with traces.

Do you know some tracing code that uses Q_rsqrt()?

illwieckz · 2024-12-10T18:33:40Z

I've checked many idtech-derivated sources like ioquake3, vkQuake, Qfusion, ET:Xreal, ET:Legacy, RBQUAKE3, JediOutcast, OpenJK, KingpinQ3, dhewm3, UFO:AI, Smokin' Guns, World of Padman, Urban Terror, ETe, Quake3e, Qio, iodfe, FTE:QW and many others…

Some use two iterations in a second function used exclusively in tracing code.

The only ones using two iterations in generic rsqrt are XreaL, Dæmon, KingpinQ3 and UFO:AI.

KingPingQ3 is a fork of XreaL like us, so it's just inherited.

UFO:AI has very different code so maybe the function is also used for tracing (unlike what we do).

illwieckz · 2024-12-10T18:40:11Z

I'm just losing my time, the question should not be:

Can we make Q_sqrt() faster with a single iteration?

It doing two iterations is actually useless, we're the only ones doing that.

The question should be:

Can we use a more precise variant of Q_sqrt() for tracing instead of fsqrt()?

Many others do that.

illwieckz · 2024-12-10T18:43:14Z

Here is something we may use for tracing (we currently use fsqrt(), but others do this with a less good constant):

inline float Q_rsqrt_precise( const float n )
{
	static const float a = 0.5f;
	static const float b = 3.0f;
#if defined(DAEMON_USE_ARCH_INTRINSICS_i686_sse)
	float o;
	// SSE rsqrt relative error bound: 3.7 * 10^-4
	_mm_store_ss( &o, _mm_rsqrt_ss( _mm_load_ss( &n ) ) );
#else
	/* Magic constant from Quake 3.
	See https://github.com/id-Software/Quake-III-Arena/blob/dbe4ddb/code/game/q_math.c#L561
	const uint32_t magic = 0x5f3759dful; */

	/* Magic constant computed by Chris Lomont.
	See: https://www.lomont.org/papers/2003/InvSqrt.pdf */
	static const uint32_t magic = 0x5f375a86ul;
	float o = RsqrtMagic( n, magic );
	o *= a * ( b - n * o * o );
#endif
	// Two iterations for higher precision.
	o *= a * ( b - n * o * o );
	return o;
}

Here is what we may use for everything else:

inline float Q_rsqrt( const float n )
{
#if defined(DAEMON_USE_ARCH_INTRINSICS_i686_sse)
	float o;
	_mm_store_ss( &o, _mm_rsqrt_ss( _mm_load_ss( &n ) ) );
#else
	static const float a = 0.703952253f;
	static const float b = 2.38924456f;
	static const uint32_t magic = 0x5f1ffff9ul;
	float o = RsqrtMagic( n, magic );
	o *= a * ( b - n * o * o );
#endif
	return o;
}

Both codes assume this exist:

#if defined(DAEMON_USE_ARCH_INTRINSICS_i686_sse)
#else
inline float RsqrtMagic( const float n, const uint32_t magic )
{
	return Util::bit_cast<float>( magic - ( Util::bit_cast<uint32_t>( n ) >> 1 ) );
}
#endif

slipher · 2024-12-10T20:03:07Z

Using the values from http://rrrola.wz.cz/inv_sqrt.html the compute is wrong if we do two iterations, but it is meant to give good results with a single iteration anyway.

If you get a worse result after a second iteration, you are doing something wrong. The Newton's method iteration should always get closer to the answer.

Was it correct to always do the second iteration [...] even when using SSE's _mm_rsqrt_ss() instead of the magic trick?

Yes that improves the precision a lot

The sseQuatNormalize() function implements the second iteration right after calling _mm_rsqrt_ps, why?

To get good precision

Can we use a more precise variant of Q_sqrt() for tracing instead of fsqrt()?

I would rather phrase it as a fast one for graphics code, and a correct one for everything else.

illwieckz · 2024-12-10T20:24:33Z

I would rather phrase it as a fast one for graphics code, and a correct one for everything else.

On the other hand Q_rsqrt() always meant fast since 25 years.

Also I looked at the game code, there is no usage of it. It looks like only some of the renderer is using Q_rsqrt(), we even have a lot of 1.0f / fsqrt(), like in QuatFromMatrix().

illwieckz · 2024-12-10T20:50:05Z

If you get a worse result after a second iteration, you are doing something wrong.

Not sure it applies here, rrrola not only provides a different magic value, but also provides other values than 3.0 and 0.5 to be used in the iteration.

slipher · 2024-12-10T21:03:10Z

On the other hand Q_rsqrt() always meant fast since 25 years.

Not in our code base. Apparently Q_rsqrt has used a precise version since 2013.

illwieckz · 2024-12-10T21:03:16Z

If you get a worse result after a second iteration, you are doing something wrong.

Not sure it applies here, rrrola not only provides a different magic value, but also provides other values than 3.0 and 0.5 to be used in the iteration.

Well, one can just use the 3.0 and 0.5 values in the next iterations.

illwieckz · 2024-12-10T21:35:32Z

So now is the status of the PR:

Add Q_sqrt_fast() that skips the second iteration,
Use better magic constants for the Q_rsqrt trick,
Use Q_sqrt_fast() in R_TBNtoQtangents() and VectorNormalizeFast().

I removed other things.

Anyway:

I see no current usage for Q_sqrt() that can't use a single-iteration Q_sqrt_fast() instead,
But I see usages of 1.0f / fsqrt() that can use a double-iteration Q_sqrt().

In that regard, not naming Q_sqrt() what is Q_sqrt() since 25 years in all other forks is weird, and naming Q_sqrt() something that is not named Q_sqrt() since 25 years in other forks is weird as well.

The purpose of this PR is to speedup R_TBNtoQtangents() so it does the job, we can discuss the names and the reuse of Q_sqrt_fast() and VectorNormalizeFast() later.

illwieckz · 2024-12-10T22:11:27Z

Here is the common alien scene I use for testing model processing on CPU (plat23 1920 1920 0 0 0):

Before, 89fps:

After, 104fps:

So that's about a performance increase of +17%.

illwieckz · 2024-12-10T22:38:28Z

For information, the only functions remaining to use Q_rsqrt() are:

VectorNormalize()
VectorNormalize2()
PlaneNormalize()
QuatNormalize()

The functions using Q_rsqrt_fast() are:

VectorNormalizeFast()
R_TBNtoQtangents()
that also relies heavily on VectorNormalizeFast() hence using non-fast Q_rsqrt() would not makes sense.

slipher · 2024-12-12T01:30:45Z

src/engine/qcommon/q_shared.h

-		// relative error bound after the initial iteration: 1.8 * 10^-3
+	/* Magic constants by Jan Kadlec.
+	See: http://rrrola.wz.cz/inv_sqrt.html */
+	static const float a = 0.703952253f;


Use constexpr not static const

slipher · 2024-12-12T01:32:43Z

src/engine/qcommon/q_shared.h

+	static const float a = 0.5f;
+	static const float b = 3.0f;
+	float o = Q_rsqrt_fast( n );
+	// Second iteration for higher precision/


This comment no longer makes sense since the preceding calculations are different. You could describe this as "Do an iteration of Newton's method for finding the zero of f(x) = 1/x^2 - n"

slipher · 2024-12-12T01:34:48Z

src/engine/qcommon/q_shared.h

+The relative error bound is: 2.00010826×10⁻⁷ */
+
+// Compute approximate inverse square root.
+inline float Q_rsqrt_fast( const float n )


So what is the error bound for Q_rsqrt_fast?

The page http://rrrola.wz.cz/inv_sqrt.html says:

max rel_error avg rel_error²

6.50196699×10⁻⁴ 2.00010826×10⁻⁷

I don't know what means the 2 in avg rel_error²

Ah it may simple be the squared relative error, annoying.

Then I guess the relative error is √2.00010826×10⁻⁷ so 4.4722569917212941×10⁻⁴.

max rel error is what you want for a unit test

Hmm, when compiled with -DUSE_ARCH_INTRINSICS=OFF, with a relative tolerance of 4.4722569917212941e-4, the tests for Q_rsqrt_fast() fail this way:

src/engine/qcommon/q_math_test.cpp:228: Failure Value of: Q_rsqrt_fast(1e-6) Expected: is approximately 1000 (absolute error <= 0.44722569) Actual: 1000.5 (of type float), which is 0.502686 from 1000 src/engine/qcommon/q_math_test.cpp:229: Failure Value of: Q_rsqrt_fast(0.036) Expected: is approximately 5.270463 (absolute error <= 0.0023570864) Actual: 5.27387 (of type float), which is 0.00340557 from 5.27046 src/engine/qcommon/q_math_test.cpp:232: Failure Value of: Q_rsqrt_fast(3) Expected: is approximately 0.57735032 (absolute error <= 0.0002582059) Actual: 0.576975 (of type float), which is -0.00037539 from 0.57735 [ FAILED ] QSharedMathTest.FastInverseSquareRoot (0 ms)

max rel error is what you want for a unit test

Ah OK.

Now the tests pass (both with SSE and without).

…ts()

slipher · 2024-12-17T04:32:42Z

LGTM but I want the error bounds to be more clearly documented. The error bound number for Q_rsqrt_fast is in there somewhere, but it is unclear whether it applies to that function or something else from the giant wall of text. And Q_rsqrt now says nothing at all. Maybe we don't know what it is exactly but we can say it's as least as good as the old one. Like the relative error is at most 5e-6, probably less.

illwieckz · 2024-12-17T18:16:17Z

LGTM but I want the error bounds to be more clearly documented. The error bound number for Q_rsqrt_fast is in there somewhere, but it is unclear whether it applies to that function or something else from the giant wall of text. And Q_rsqrt now says nothing at all. Maybe we don't know what it is exactly but we can say it's as least as good as the old one. Like the relative error is at most 5e-6, probably less.

Done.

q_shared: better magic constants for Q_rsqrt(), add Q_rsqrt_fast()

f687302

illwieckz force-pushed the illwieckz/fast-rsqrt branch from 1146cce to f6d637e Compare December 9, 2024 20:12

illwieckz force-pushed the illwieckz/fast-rsqrt branch from f6d637e to d535089 Compare December 9, 2024 20:18

slipher reviewed Dec 9, 2024

View reviewed changes

illwieckz force-pushed the illwieckz/fast-rsqrt branch 2 times, most recently from 95fa8f8 to 4ba3860 Compare December 10, 2024 12:17

illwieckz force-pushed the illwieckz/fast-rsqrt branch 4 times, most recently from 92c11f2 to a2e2874 Compare December 10, 2024 14:14

illwieckz force-pushed the illwieckz/fast-rsqrt branch 4 times, most recently from b0069d3 to 1fc022f Compare December 10, 2024 19:04

illwieckz force-pushed the illwieckz/fast-rsqrt branch from 1fc022f to 4abf136 Compare December 10, 2024 21:28

illwieckz force-pushed the illwieckz/fast-rsqrt branch from 4abf136 to 4cb2b6f Compare December 10, 2024 21:37

slipher mentioned this pull request Dec 12, 2024

Unit test for Q_rsqrt #1460

Merged

slipher reviewed Dec 12, 2024

View reviewed changes

illwieckz added 2 commits December 15, 2024 03:14

q_math_test: also test Q_rsqrt_fast

020a04d

engine: use Q_sqrt_fast() in VectorNormalizeFast() and R_TBNtoQtangen…

7d522ec

…ts()

illwieckz force-pushed the illwieckz/fast-rsqrt branch 3 times, most recently from 565861e to 6879ccb Compare December 17, 2024 02:10

illwieckz force-pushed the illwieckz/fast-rsqrt branch from 6879ccb to 7d522ec Compare December 17, 2024 18:10

illwieckz merged commit 80821bf into master Dec 17, 2024
9 checks passed

illwieckz deleted the illwieckz/fast-rsqrt branch December 17, 2024 19:16

illwieckz mentioned this pull request Jan 28, 2025

q_math: do tracing in id Software/J.M.P. van Waveren's QuatFromMatrix() implementation like Martin John Baker's one did, fix model distortion with GCC 14 #1532

Merged

Some improvements around Q_rsqrt() #1458

Some improvements around Q_rsqrt() #1458

Conversation

illwieckz commented Dec 9, 2024 • edited Loading

illwieckz commented Dec 9, 2024

illwieckz commented Dec 9, 2024

illwieckz commented Dec 9, 2024

slipher commented Dec 9, 2024

illwieckz commented Dec 9, 2024 • edited Loading

Choose a reason for hiding this comment

illwieckz Dec 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

illwieckz commented Dec 10, 2024

illwieckz commented Dec 10, 2024 • edited Loading

illwieckz commented Dec 10, 2024

illwieckz commented Dec 10, 2024

illwieckz commented Dec 10, 2024 • edited Loading

illwieckz commented Dec 10, 2024 • edited Loading

illwieckz commented Dec 10, 2024 • edited Loading

illwieckz commented Dec 10, 2024

slipher commented Dec 10, 2024

illwieckz commented Dec 10, 2024

illwieckz commented Dec 10, 2024 • edited Loading

slipher commented Dec 10, 2024

illwieckz commented Dec 10, 2024

illwieckz commented Dec 10, 2024

illwieckz commented Dec 10, 2024

illwieckz commented Dec 10, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

illwieckz Dec 17, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

illwieckz Dec 17, 2024 • edited Loading

Choose a reason for hiding this comment

slipher commented Dec 17, 2024

illwieckz commented Dec 17, 2024

Some improvements around `Q_rsqrt()` #1458

Some improvements around `Q_rsqrt()` #1458

illwieckz commented Dec 9, 2024 •

edited

Loading

illwieckz commented Dec 9, 2024 •

edited

Loading

illwieckz Dec 9, 2024 •

edited

Loading

illwieckz commented Dec 10, 2024 •

edited

Loading

illwieckz commented Dec 10, 2024 •

edited

Loading

illwieckz commented Dec 10, 2024 •

edited

Loading

illwieckz commented Dec 10, 2024 •

edited

Loading

illwieckz commented Dec 10, 2024 •

edited

Loading

illwieckz Dec 17, 2024 •

edited

Loading

illwieckz Dec 17, 2024 •

edited

Loading