Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some improvements around Q_rsqrt() #1458

Merged
merged 3 commits into from
Dec 17, 2024
Merged

Some improvements around Q_rsqrt() #1458

merged 3 commits into from
Dec 17, 2024

Conversation

illwieckz
Copy link
Member

@illwieckz illwieckz commented Dec 9, 2024

@illwieckz illwieckz force-pushed the illwieckz/fast-rsqrt branch from 1146cce to f6d637e Compare December 9, 2024 20:12
@illwieckz
Copy link
Member Author

Hmm, I removed the engine: use VectorNormalizeFast when the return value isn't used commit, maybe it breaks the tests.

@illwieckz
Copy link
Member Author

The commit removing the second iteration makes this gives 20% performance bump on such scene with 170+ models rendered using the CPU (r_vboModel 0):

Before, 8fps:

unvanquished_2024-12-09_210743_000

After, 10fps:

unvanquished_2024-12-09_210927_000

@illwieckz illwieckz force-pushed the illwieckz/fast-rsqrt branch from f6d637e to d535089 Compare December 9, 2024 20:18
@illwieckz
Copy link
Member Author

Hmm, I removed the engine: use VectorNormalizeFast when the return value isn't used commit, maybe it breaks the tests.

I readded that commit, what breaks the test is the removal of the second iteration.

How do we check test to be true, by comparing against previous values we computed with Dæmon?

@slipher
Copy link
Member

slipher commented Dec 9, 2024

How do we check test to be true, by comparing against previous values we computed with Dæmon?

Yeah that's how I made the patch trace tests. Since there is a long chain of calculations for converting a patch to collision geometry, I wasn't able to find the platonically ideal value (unlike with brushes where it easy).

Making traces imprecise is probably much more likely to cause bugs than graphics stuff. Let's not mess with traces.

@illwieckz
Copy link
Member Author

illwieckz commented Dec 9, 2024

But is it correct to do y *= ( 1.5f - ( x * y * y ) );
after _mm_store_ss( &y, _mm_rsqrt_ss( _mm_load_ss( &number ) ) );?

@@ -1082,7 +1082,7 @@ void PerpendicularVector( vec3_t dst, const vec3_t src )
/*
* * normalize the result
*/
VectorNormalize( dst );
VectorNormalizeFast( dst );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely don't want to see the inaccurate version used for basic math routines. It's OK to use the sloppy version for graphics-specific stuff

Copy link
Member Author

@illwieckz illwieckz Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it inaccurate? They both use Q_sqrt() (and then they both use _mm_rsqrt_ps()).

// returns vector length
inline vec_t VectorNormalize( vec3_t v )
{
	vec_t length = DotProduct( v, v );

	if ( length != 0.0f )
	{
		vec_t ilength = Q_rsqrt( length );
		/* sqrtf(length) = length * (1 / sqrtf(length)) */
		length *= ilength;
		VectorScale( v, ilength, v );
	}

	return length;
}

// does NOT return vector length, uses rsqrt approximation
// fast vector normalize routine that does not check to make sure
// that length != 0, nor does it return length
inline void VectorNormalizeFast( vec3_t v )
{
	vec_t ilength = Q_rsqrt( DotProduct( v, v ) );

	VectorScale( v, ilength, v );
}

The comment before VectorNormalizeFast() saying uses rsqrt approximation is misleading, they both uses rsqrt. Maybe long time ago in the past the other one didn't, but they both do.

The difference is that VectorNormalize() tests for the length not being zero, skips an useless scale if it is, and returns the length.

In the past I did the contrary, replacing VectorNormalizeFast() with VectorNormalize() in hope the extra test would actually make it faster by skipping the scale, it was slower because of the test breaking the vectorization.

The two functions should produce the same results as they use the same operations, one of them is just slower.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I commented on an irrelevant diff here, but I stand by the words themselves. The old version of Q_rsqrt was pretty accurate, having only the least significant 1-2 binary places wrong, so it was good enough to use more or less anywhere. Now if we are going to make it a rough approximation, we shouldn't be using it all the time.

@illwieckz illwieckz force-pushed the illwieckz/fast-rsqrt branch 2 times, most recently from 95fa8f8 to 4ba3860 Compare December 10, 2024 12:17
@illwieckz
Copy link
Member Author

Actually 3 constants are needed to be modified to get better precision. I reimplemented the Q_rsqrt() function with them. I moved the code to a different place of the header because I want it to be after the VectorSubtract and similar macros for a branch I have that provide vectorized Q_rsqrt() functions for vectorized VectorNormalizeFast that computes multiple reverse square root at once for multiple vectors to be normalized.

@illwieckz
Copy link
Member Author

illwieckz commented Dec 10, 2024

The old version of Q_rsqrt was pretty accurate, having only the least significant 1-2 binary places wrong, so it was good enough to use more or less anywhere.

Was it correct to always do the second iteration of the magic trick even when using SSE's _mm_rsqrt_ss() instead of the magic trick?

@illwieckz
Copy link
Member Author

The sseQuatNormalize() function implements the second iteration right after calling _mm_rsqrt_ps, why?

	inline __m128 sseQuatNormalize( __m128 q ) {
		__m128 p = _mm_mul_ps( q, q );
		__m128 t, h;
		p = _mm_add_ps( sseSwizzle( p, XXZZ ),
				sseSwizzle( p, YYWW ) );
		p = _mm_add_ps( sseSwizzle( p, XXXX ),
				sseSwizzle( p, ZZZZ ) );
		t = _mm_rsqrt_ps( p );
		h = _mm_mul_ps( _mm_set1_ps( 0.5f ), t );
		t = _mm_mul_ps( _mm_mul_ps( t, t ), p );
		t = _mm_sub_ps( _mm_set1_ps( 3.0f ), t );
		t = _mm_mul_ps( h, t );
		return _mm_mul_ps( q, t );
	}

@illwieckz illwieckz force-pushed the illwieckz/fast-rsqrt branch 4 times, most recently from 92c11f2 to a2e2874 Compare December 10, 2024 14:14
@illwieckz
Copy link
Member Author

Using the values from http://rrrola.wz.cz/inv_sqrt.html the compute is wrong if we do two iterations, but it is meant to give good results with a single iteration anyway.

@illwieckz
Copy link
Member Author

illwieckz commented Dec 10, 2024

ioquake3 has a second function named SquareRootFloat() that do two iterations, this function lives in code/qcommon/cm_trace.c and is used in functions like CM_TraceThroughSphere().

On our side in src/common/cm/cm_trace.cpp, the CM_TraceThroughSphere() doesn't use any Q_sqrt() or SquareRootFloat(), it uses fsqrt(). The code from cm_trace.cpp doesn't use Q_sqrt() at all.

Making traces imprecise is probably much more likely to cause bugs than graphics stuff. Let's not mess with traces.

Do you know some tracing code that uses Q_rsqrt()?

@illwieckz
Copy link
Member Author

illwieckz commented Dec 10, 2024

I've checked many idtech-derivated sources like ioquake3, vkQuake, Qfusion, ET:Xreal, ET:Legacy, RBQUAKE3, JediOutcast, OpenJK, KingpinQ3, dhewm3, UFO:AI, Smokin' Guns, World of Padman, Urban Terror, ETe, Quake3e, Qio, iodfe, FTE:QW and many others…

Some use two iterations in a second function used exclusively in tracing code.

The only ones using two iterations in generic rsqrt are XreaL, Dæmon, KingpinQ3 and UFO:AI.

KingPingQ3 is a fork of XreaL like us, so it's just inherited.

UFO:AI has very different code so maybe the function is also used for tracing (unlike what we do).

@illwieckz
Copy link
Member Author

illwieckz commented Dec 10, 2024

I'm just losing my time, the question should not be:

  • Can we make Q_sqrt() faster with a single iteration?

It doing two iterations is actually useless, we're the only ones doing that.

The question should be:

  • Can we use a more precise variant of Q_sqrt() for tracing instead of fsqrt()?

Many others do that.

@illwieckz
Copy link
Member Author

Here is something we may use for tracing (we currently use fsqrt(), but others do this with a less good constant):

inline float Q_rsqrt_precise( const float n )
{
	static const float a = 0.5f;
	static const float b = 3.0f;
#if defined(DAEMON_USE_ARCH_INTRINSICS_i686_sse)
	float o;
	// SSE rsqrt relative error bound: 3.7 * 10^-4
	_mm_store_ss( &o, _mm_rsqrt_ss( _mm_load_ss( &n ) ) );
#else
	/* Magic constant from Quake 3.
	See https://github.com/id-Software/Quake-III-Arena/blob/dbe4ddb/code/game/q_math.c#L561
	const uint32_t magic = 0x5f3759dful; */

	/* Magic constant computed by Chris Lomont.
	See: https://www.lomont.org/papers/2003/InvSqrt.pdf */
	static const uint32_t magic = 0x5f375a86ul;
	float o = RsqrtMagic( n, magic );
	o *= a * ( b - n * o * o );
#endif
	// Two iterations for higher precision.
	o *= a * ( b - n * o * o );
	return o;
}

Here is what we may use for everything else:

inline float Q_rsqrt( const float n )
{
#if defined(DAEMON_USE_ARCH_INTRINSICS_i686_sse)
	float o;
	_mm_store_ss( &o, _mm_rsqrt_ss( _mm_load_ss( &n ) ) );
#else
	static const float a = 0.703952253f;
	static const float b = 2.38924456f;
	static const uint32_t magic = 0x5f1ffff9ul;
	float o = RsqrtMagic( n, magic );
	o *= a * ( b - n * o * o );
#endif
	return o;
}

Both codes assume this exist:

#if defined(DAEMON_USE_ARCH_INTRINSICS_i686_sse)
#else
inline float RsqrtMagic( const float n, const uint32_t magic )
{
	return Util::bit_cast<float>( magic - ( Util::bit_cast<uint32_t>( n ) >> 1 ) );
}
#endif

@illwieckz illwieckz force-pushed the illwieckz/fast-rsqrt branch 4 times, most recently from b0069d3 to 1fc022f Compare December 10, 2024 19:04
@slipher
Copy link
Member

slipher commented Dec 10, 2024

Using the values from http://rrrola.wz.cz/inv_sqrt.html the compute is wrong if we do two iterations, but it is meant to give good results with a single iteration anyway.

If you get a worse result after a second iteration, you are doing something wrong. The Newton's method iteration should always get closer to the answer.

Was it correct to always do the second iteration [...] even when using SSE's _mm_rsqrt_ss() instead of the magic trick?

Yes that improves the precision a lot

The sseQuatNormalize() function implements the second iteration right after calling _mm_rsqrt_ps, why?

To get good precision

  • Can we use a more precise variant of Q_sqrt() for tracing instead of fsqrt()?

I would rather phrase it as a fast one for graphics code, and a correct one for everything else.

@illwieckz
Copy link
Member Author

I would rather phrase it as a fast one for graphics code, and a correct one for everything else.

On the other hand Q_rsqrt() always meant fast since 25 years.

Also I looked at the game code, there is no usage of it. It looks like only some of the renderer is using Q_rsqrt(), we even have a lot of 1.0f / fsqrt(), like in QuatFromMatrix().

@illwieckz
Copy link
Member Author

illwieckz commented Dec 10, 2024

If you get a worse result after a second iteration, you are doing something wrong.

Not sure it applies here, rrrola not only provides a different magic value, but also provides other values than 3.0 and 0.5 to be used in the iteration.

@slipher
Copy link
Member

slipher commented Dec 10, 2024

On the other hand Q_rsqrt() always meant fast since 25 years.

Not in our code base. Apparently Q_rsqrt has used a precise version since 2013.

@illwieckz
Copy link
Member Author

If you get a worse result after a second iteration, you are doing something wrong.

Not sure it applies here, rrrola not only provides a different magic value, but also provides other values than 3.0 and 0.5 to be used in the iteration.

Well, one can just use the 3.0 and 0.5 values in the next iterations.

@illwieckz illwieckz force-pushed the illwieckz/fast-rsqrt branch from 1fc022f to 4abf136 Compare December 10, 2024 21:28
@illwieckz
Copy link
Member Author

So now is the status of the PR:

  • Add Q_sqrt_fast() that skips the second iteration,
  • Use better magic constants for the Q_rsqrt trick,
  • Use Q_sqrt_fast() in R_TBNtoQtangents() and VectorNormalizeFast().

I removed other things.

Anyway:

  • I see no current usage for Q_sqrt() that can't use a single-iteration Q_sqrt_fast() instead,
  • But I see usages of 1.0f / fsqrt() that can use a double-iteration Q_sqrt().

In that regard, not naming Q_sqrt() what is Q_sqrt() since 25 years in all other forks is weird, and naming Q_sqrt() something that is not named Q_sqrt() since 25 years in other forks is weird as well.

The purpose of this PR is to speedup R_TBNtoQtangents() so it does the job, we can discuss the names and the reuse of Q_sqrt_fast() and VectorNormalizeFast() later.

@illwieckz illwieckz force-pushed the illwieckz/fast-rsqrt branch from 4abf136 to 4cb2b6f Compare December 10, 2024 21:37
@illwieckz
Copy link
Member Author

Here is the common alien scene I use for testing model processing on CPU (plat23 1920 1920 0 0 0):

Before, 89fps:

unvanquished_2024-12-10_230539_000

After, 104fps:

unvanquished_2024-12-10_230757_000

So that's about a performance increase of +17%.

@illwieckz
Copy link
Member Author

For information, the only functions remaining to use Q_rsqrt() are:

  • VectorNormalize()
  • VectorNormalize2()
  • PlaneNormalize()
  • QuatNormalize()

The functions using Q_rsqrt_fast() are:

  • VectorNormalizeFast()
  • R_TBNtoQtangents()
    that also relies heavily on VectorNormalizeFast() hence using non-fast Q_rsqrt() would not makes sense.

@slipher slipher mentioned this pull request Dec 12, 2024
// relative error bound after the initial iteration: 1.8 * 10^-3
/* Magic constants by Jan Kadlec.
See: http://rrrola.wz.cz/inv_sqrt.html */
static const float a = 0.703952253f;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use constexpr not static const

static const float a = 0.5f;
static const float b = 3.0f;
float o = Q_rsqrt_fast( n );
// Second iteration for higher precision/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment no longer makes sense since the preceding calculations are different. You could describe this as "Do an iteration of Newton's method for finding the zero of f(x) = 1/x^2 - n"

The relative error bound is: 2.00010826×10⁻⁷ */

// Compute approximate inverse square root.
inline float Q_rsqrt_fast( const float n )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what is the error bound for Q_rsqrt_fast?

Copy link
Member Author

@illwieckz illwieckz Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The page http://rrrola.wz.cz/inv_sqrt.html says:

max rel_error avg rel_error²
6.50196699×10⁻⁴ 2.00010826×10⁻⁷

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what means the 2 in avg rel_error²

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it may simple be the squared relative error, annoying.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I guess the relative error is √2.00010826×10⁻⁷ so 4.4722569917212941×10⁻⁴.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max rel error is what you want for a unit test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, when compiled with -DUSE_ARCH_INTRINSICS=OFF, with a relative tolerance of 4.4722569917212941e-4, the tests for Q_rsqrt_fast() fail this way:

src/engine/qcommon/q_math_test.cpp:228: Failure
Value of: Q_rsqrt_fast(1e-6)
Expected: is approximately 1000 (absolute error <= 0.44722569)
  Actual: 1000.5 (of type float), which is 0.502686 from 1000
src/engine/qcommon/q_math_test.cpp:229: Failure
Value of: Q_rsqrt_fast(0.036)
Expected: is approximately 5.270463 (absolute error <= 0.0023570864)
  Actual: 5.27387 (of type float), which is 0.00340557 from 5.27046
src/engine/qcommon/q_math_test.cpp:232: Failure
Value of: Q_rsqrt_fast(3)
Expected: is approximately 0.57735032 (absolute error <= 0.0002582059)
  Actual: 0.576975 (of type float), which is -0.00037539 from 0.57735
[  FAILED  ] QSharedMathTest.FastInverseSquareRoot (0 ms)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max rel error is what you want for a unit test

Ah OK.

Copy link
Member Author

@illwieckz illwieckz Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the tests pass (both with SSE and without).

@illwieckz illwieckz force-pushed the illwieckz/fast-rsqrt branch 3 times, most recently from 565861e to 6879ccb Compare December 17, 2024 02:10
@slipher
Copy link
Member

slipher commented Dec 17, 2024

LGTM but I want the error bounds to be more clearly documented. The error bound number for Q_rsqrt_fast is in there somewhere, but it is unclear whether it applies to that function or something else from the giant wall of text. And Q_rsqrt now says nothing at all. Maybe we don't know what it is exactly but we can say it's as least as good as the old one. Like the relative error is at most 5e-6, probably less.

@illwieckz illwieckz force-pushed the illwieckz/fast-rsqrt branch from 6879ccb to 7d522ec Compare December 17, 2024 18:10
@illwieckz
Copy link
Member Author

LGTM but I want the error bounds to be more clearly documented. The error bound number for Q_rsqrt_fast is in there somewhere, but it is unclear whether it applies to that function or something else from the giant wall of text. And Q_rsqrt now says nothing at all. Maybe we don't know what it is exactly but we can say it's as least as good as the old one. Like the relative error is at most 5e-6, probably less.

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants