Performance Optimization

Topics: Developer Forum
Mar 5, 2009 at 7:01 AM
Edited Mar 5, 2009 at 7:22 AM
I was doing some performance tests of AtomPhysics, and I thought you might want to know about the costs of the three primary ways of doing a vector op: Direct Execution (calling the normal version of the relevant Vector2 function), Ref/Out (the same, except calling the overload that takes ref and out parameters), and Inline Code (performing the necessary steps directly in your code instead of calling a function). I used the ANTS Performance Profiler 4.3 trial and sampled a 10-second block in each version while the engine was under full load (I could not sample the exact same ten seconds, as the app took different times to load in some tests). Here are my results:
Direct Execution: 1023 ms
Ref/Out: 595/591 ms apiece for extracting the two atom positions, plus another 612 ms for the actual calculation (1798 ms total; the extractions are needed because you can't pass a property as a ref argument)
Inline Execution: 200 ms for vector creation, 1305 ms for the X calculation, and 1394 ms for the Y calculation.
Additional notes: I did these three tests on a Vector2.Subtract in my broad-phase code, which is called once for every possible pair of atoms currently being simulated. Note also that these values are from a 10-second period during which the engine was in full simulation with about 450 atoms in it. The sample project and the physics engine were both compiled in release mode. Stay tuned for results on various parts of the Farseer physics engine!

Note that the Ref/Out method was so slow because I had to perform a separate extraction of the two Vector2s used in the calculation. The actual calculation itself was about 40 percent faster than Direct Execution (which was the fastest method overall).
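For reference, here is a minimal, self-contained sketch of the three call styles I compared. Vec2 and Atom are illustrative stand-ins for XNA's Vector2 and my atom type, not the actual AtomPhysics code:

```csharp
// Minimal stand-in for XNA's Vector2 (names illustrative).
struct Vec2
{
    public float X, Y;
    public Vec2(float x, float y) { X = x; Y = y; }

    // Direct Execution: parameters and return value are struct copies.
    public static Vec2 Subtract(Vec2 v1, Vec2 v2)
    {
        return new Vec2(v1.X - v2.X, v1.Y - v2.Y);
    }

    // Ref/Out: no struct copies in or out.
    public static void Subtract(ref Vec2 v1, ref Vec2 v2, out Vec2 result)
    {
        result.X = v1.X - v2.X;
        result.Y = v1.Y - v2.Y;
    }
}

class Atom
{
    public Vec2 Position { get; set; }
}

class Demo
{
    static void Main()
    {
        var atomA = new Atom { Position = new Vec2(3f, 5f) };
        var atomB = new Atom { Position = new Vec2(1f, 2f) };

        // 1. Direct Execution: one call; properties pass fine by value.
        Vec2 direct = Vec2.Subtract(atomA.Position, atomB.Position);

        // 2. Ref/Out: a property cannot be passed as a ref argument, so
        //    the positions must first be extracted to locals (the slow part).
        Vec2 a = atomA.Position;
        Vec2 b = atomB.Position;
        Vec2 refOut;
        Vec2.Subtract(ref a, ref b, out refOut);

        // 3. Inline Code: component math directly, no call at all.
        Vec2 inline;
        inline.X = a.X - b.X;
        inline.Y = a.Y - b.Y;

        System.Console.WriteLine(direct.X + "," + inline.Y); // 2,3
    }
}
```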
EDIT: I poked around in Farseer and couldn't find any code that was really ripe for this kind of testing, so...
Coordinator
Mar 5, 2009 at 1:32 PM
Hehe. We inlined almost all of our code to make sure the compiler produces high-performance code. All the inline optimizations are commented with a region tag.

From what I have gathered, these should be the expected results:

Normal:
result = v1 + v2;

Reason: Vectors are structs and thus copied several times.

Faster:
Vector2.Add(ref v1, ref v2, out result);

Reason: The structs are passed by reference and the result is not copied on return.

Fastest:
result.X = v1.X + v2.X;
result.Y = v1.Y + v2.Y;

Reason: No copying at all. Well, no copying of vector structs, that is.

That's without going into details about how structs work and how the compiler inlines code.
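The copies described above can be made concrete with a small, self-contained example (Vec2 is an illustrative stand-in for XNA's Vector2, not actual engine code):

```csharp
struct Vec2
{
    public float X, Y;
    public Vec2(float x, float y) { X = x; Y = y; }

    // Called for "v1 + v2": v1 and v2 arrive as copies, and the new
    // Vec2 is copied again into the caller's variable on return.
    public static Vec2 operator +(Vec2 v1, Vec2 v2)
    {
        return new Vec2(v1.X + v2.X, v1.Y + v2.Y);
    }

    // Ref/out overload: reads and writes the caller's structs in place.
    public static void Add(ref Vec2 v1, ref Vec2 v2, out Vec2 result)
    {
        result.X = v1.X + v2.X;
        result.Y = v1.Y + v2.Y;
    }
}

class Demo
{
    static void Main()
    {
        Vec2 v1 = new Vec2(1f, 2f);
        Vec2 v2 = new Vec2(3f, 4f);
        Vec2 v;

        v = v1 + v2;                     // Normal: several struct copies
        Vec2.Add(ref v1, ref v2, out v); // Faster: no struct copies
        v.X = v1.X + v2.X;               // Fastest: plain float math,
        v.Y = v1.Y + v2.Y;               // no call at all

        System.Console.WriteLine(v.X + "," + v.Y); // 4,6
    }
}
```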
We chose the fastest method in FP to make sure we gained all the performance we possibly could, at the expense of code structure and maintainability.
Mar 7, 2009 at 8:20 AM
Well, unless you gain significant performance by going with inline over ref/out operations, I don't see how this would be beneficial. Is there really a large benefit to using inline code? I can see wanting to squeeze every bit of power out of, say, my vector subtraction in the main execution path, where the code might be called as much as a billion times during a long session (a bit of a stretch, admittedly, and partly because I have not adopted a fixed timestep for my tests, but that's another story) and where you might gain anywhere from a couple to ten frames per second. But in an engine like yours, where you would be lucky to hit 50,000 calls on any one vector operation (and maybe a few million total) over a long session? That seems a bit like ignoring the dollars to save the pennies (or however that phrase goes). No offense intended. Also, my tests (one test, admittedly) seem to indicate that the plain-vanilla method of vector ops is the best way to go for my code. I'll post more when I do Xbox tests.
Coordinator
Mar 7, 2009 at 2:29 PM
There is no significant performance gain by doing this. You do not gain 50% extra performance by inlining your code.

First off, Jeff created a simple physics engine based on Box2D Lite (have a look at Box2D Lite; it's as simple as it gets). Performance was not the biggest issue, so there was no reason to implement advanced algorithms to speed up the engine. It started with a brute-force broad-phase collider, and the code was really neat and easy to read.
Jeff released the engine to the public to make sure others had an easy-to-use physics engine in C#, and he continued to implement feature requests from users. He got some help from other physics engine creators (such as Bioslayer from Physics2D, and Andy, who implemented sweep and prune) to make the engine perform better in some cases.

It's impossible to create an engine that is fast in every case. No algorithm performs optimally in all cases, but we try to give our users some different options they can try out to make sure the engine performs as expected. It is often better to implement a homemade broad phase (or narrow phase) that fits the physics logic of your game; there is a good chance it will perform better than the general algorithms we supply.

After we had implemented some good all-purpose algorithms and optimized them as much as we could, there was nothing more we could do to speed up the engine unless we looked at lower-level optimization. And that's just what we did. The .NET compiler is quite good, but it uses JIT compilation and thus can't spend a lot of time optimizing the code. It does a great job, but this being a game engine, we want the best performance we can get. So we rewrote most of the engine to be inlined by hand (a job the compiler usually does), simplified our data structures, and updated our design patterns to make sure we produce fast code.
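To illustrate the kind of hand-inlining described, here is a self-contained sketch of rewriting a vector-op call into plain field arithmetic, with a region tag marking the optimization. Vec2, Body, and the region text are illustrative, not actual Farseer code:

```csharp
struct Vec2
{
    public float X, Y;

    public static void Add(ref Vec2 v1, ref Vec2 v2, out Vec2 result)
    {
        result.X = v1.X + v2.X;
        result.Y = v1.Y + v2.Y;
    }
}

class Body
{
    public Vec2 Position;
    public Vec2 Velocity;

    // Before: relies on the JIT to inline the call (it has limited
    // time to optimize, so it may not).
    public void IntegrateWithCall()
    {
        Vec2.Add(ref Position, ref Velocity, out Position);
    }

    // After: hand-inlined, so there is no call overhead regardless
    // of what the JIT decides to do.
    public void IntegrateInlined()
    {
        #region INLINE: Vec2.Add(ref Position, ref Velocity, out Position)
        Position.X = Position.X + Velocity.X;
        Position.Y = Position.Y + Velocity.Y;
        #endregion
    }
}

class Demo
{
    static void Main()
    {
        var body = new Body
        {
            Position = new Vec2 { X = 1f, Y = 1f },
            Velocity = new Vec2 { X = 2f, Y = 3f }
        };
        body.IntegrateInlined();
        System.Console.WriteLine(body.Position.X + "," + body.Position.Y); // 3,4
    }
}
```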
I don't have any benchmark results from before and after, but I can assure you we gained some good performance on both Windows and Xbox platforms by doing low level optimizations.
Mar 8, 2009 at 5:31 AM
Well, apparently the JIT compiler can look ahead and pre-compile code with all of the optimizations it needs (although I'm not sure it always optimizes the code). If I have the time (maybe next week or something), I'll go through, replace the inline code with regular ops, and do some profiling to test.