Gavin Pugh - A Videogame Programming Blog

XNA/C# – Thread-local storage on Xbox 360

29 November, 2010 at 9:48am | XNA / C#


I initially began writing this article back in April. Back then, with XNA 3.1 to play with, there was actually one working method of ‘thread-local storage’ on Xbox. With the release of XNA 4.0, this last remaining native method was removed. The [ThreadStatic] attribute was added to the Compact Framework, but it seems to be a stubbed, non-functional attribute on Xbox 360.

So, where does that leave us? There are various ways you can manually code up a thread-local storage solution. The crux of it is storing and retrieving data on a per-thread basis. The most immediately obvious way to go is some sort of associative container: a Dictionary<> mapping some sort of thread ‘id’ to the data you wish to store.

Just something like:

Dictionary< System.Threading.Thread, MyClass > my_tls_data;

There’s a lot of room for freedom in implementation though. Is Dictionary<> really the most suitable container? Should items be auto-created on the first access, or explicitly created by the user? Most of all I want to keep things performing well. Given that some functionality performs extremely poorly on the Xbox 360, I want to be careful about what I throw into my implementation.
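For reference, the Dictionary<> idea could be fleshed out roughly as below. This is only a baseline sketch, not the code I ended up with: the class name ‘DictionaryTlsSketch’ and its ‘GetData’ method are made up for illustration, and it assumes auto-creation on first access with a lock around every lookup.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical baseline: per-thread data kept in a Dictionary keyed by Thread.
public class DictionaryTlsSketch<T> where T : class, new()
{
    private readonly Dictionary<Thread, T> m_data = new Dictionary<Thread, T>();

    // The caller passes in its own Thread object.
    public T GetData(Thread thread)
    {
        // Dictionary<> is not safe for concurrent writers, so every access locks.
        lock (m_data)
        {
            T value;
            if (!m_data.TryGetValue(thread, out value))
            {
                value = new T();            // auto-create on first access
                m_data.Add(thread, value);
            }
            return value;
        }
    }
}
```

The lock on every access is the simplest way to keep this version correct, and it is one of the costs the rest of the article tries to avoid.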

The profiling approach

So, here’s how I set things up to profile changes I made throughout development:

  • My test case sets up ten threads.
  • Each thread has TLS data which consists of one simple integer.
  • I kick off all ten threads at once, and each runs a loop 100,000 times, incrementing its integer by one each time. (I verified the loop is not optimized out in the release-JIT’d builds.)
  • Once I see all ten threads have finished their loops I conclude the test and take the timing.
  • I run the tests over and over and calculate a running average time. I display the results on screen, as opposed to debug output, so I can see the results on a release build which isn’t connected to the debugger. Pretty simple stuff, it just looks like this:
    [Screenshot: TLS Profiling]
  • Xbox times settle to a steady average very quickly, after just a dozen or so tests. Windows times were more erratic and I generally left them for a couple of minutes to settle down.
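The setup described in the list above can be sketched like this. It is a rough stand-in for my test harness rather than the real thing: ‘TlsProfileSketch’, ‘RunOnce’ and ‘UpdateAverage’ are hypothetical names, and the per-iteration TLS access is passed in as a delegate so any implementation can be plugged in.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical harness: N threads each perform 'iterations' accesses.
public static class TlsProfileSketch
{
    // Runs one timed pass and returns the elapsed time in milliseconds.
    public static double RunOnce(int threadCount, int iterations, Action<Thread> access)
    {
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; ++i)
        {
            threads[i] = new Thread(delegate()
            {
                // Read the thread object once, outside the hot loop.
                Thread self = Thread.CurrentThread;
                for (int n = 0; n < iterations; ++n)
                {
                    access(self);
                }
            });
        }

        Stopwatch timer = Stopwatch.StartNew();
        foreach (Thread t in threads) t.Start();   // kick them all off
        foreach (Thread t in threads) t.Join();    // wait until all are done
        timer.Stop();
        return timer.Elapsed.TotalMilliseconds;
    }

    // Folds one more sample into a running average of 'samplesSoFar' samples.
    public static double UpdateAverage(double average, int samplesSoFar, double sample)
    {
        return average + (sample - average) / (samplesSoFar + 1);
    }
}
```

A call along the lines of RunOnce(10, 100000, ...) with a TLS lookup and increment in the delegate would correspond to the test case described above.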

Implementation choices

So, given the myriad of different ways this thing could be done, I broke down the choices I’d need to make. Here’s a piece-by-piece breakdown of the decisions I made along the way. Forgive the language; these are lifted verbatim from notes I made:

  • Array over Dictionary<>
    • For both Windows and Xbox with ten test threads, using an array instead of a Dictionary<> was a little over twice as fast. 74ms versus 151ms on Xbox.
    • From additional testing on Xbox, the array only slowed down to a speed comparable with Dictionary<> at around 32 threads. I threw in a couple of different access patterns to test with too. Only if you’re going crazy with threads and data access will Dictionary<> be a win.
    • See the source code for the array implementation. It’s just essentially a simple associative container of ‘Thread’->’Data’. Potentially the whole array needs to be searched on each access. But keeping the array size small makes this a win over Dictionary<>.
  • For vs foreach on array access
    • A for loop incrementing an integer was faster.
    • Doing the search with foreach() instead roughly doubled the times. 74ms versus 141ms.
  • Thread object vs integer thread ID (ManagedThreadId)
    • These two approaches were within a millisecond of one another. Something like 73.5ms versus 74ms. The integer version was the slight winner.
    • I kind of like using the object reference, it makes debugging easier. So I went with that instead of using the integer ID.
  • Call to Thread.CurrentThread on data access.
    • Cost is prohibitive on Xbox, see a blog post I wrote a while back:
      http://www.gavpugh.com/2010/04/30/xnac-thread-currentthread-is-slow-on-xbox-360/
    • This is still the case under XNA 4.0; the figures haven’t changed. In the testing I did for these classes, calling Thread.CurrentThread on every access on Xbox increased the time taken in one example from 74ms up to 7248ms. The equivalent test on Windows went from 26ms up to 32ms.
    • My class has to provide a data accessor which takes a thread object param.
  • Explicit cleanup, versus WeakReference to determine threads are done with.
    • Unfortunately, WeakReference is slow on Xbox too. It was even worse than Thread.CurrentThread in my tests. The example case I used shot up from 74ms to over 40,000ms. Windows jumped from 26ms to 76ms.
    • I chose to use explicit cleanup.
    • Side note: To make cleanup easier on thread destruction, consider a TLS manager class. The TLS objects can register with the manager upon creation, and the manager can walk them all and destroy any data associated with a specific thread.
  • Explicit construction, versus ‘auto’ construction.
    • Auto construction isn’t too much of a performance drop. It just adds an extra branch to the code for when the data access fails to find its data, which you probably would have in there anyway to throw an Exception or fire an assert.
    • In my tests, removing auto construction from my code dropped the reference 74ms case down to 73ms. Good tradeoff in my opinion, so I kept auto-construction.
  • Avoid thread locks for data access
    • Locks are only required for adding new array entries, or removing them. The data access function should be able to execute in the general case without a lock.
    • With auto-construction though, a new entry could be added within the accessing function. There’s a risk of a race condition if two concurrent threads try to auto-construct the data for one specific thread.
    • As long as we can be sure that threads only access their own data, everything is fine. So the GetData(Thread) version should verify that the thread matches Thread.CurrentThread.
    • The check only needs to be made after the search fails, just as it’s about to do the auto-construction. I ended up just putting this code within #if DEBUG, so I didn’t have to worry about any performance hit.
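To make the notes above more concrete, here is a minimal sketch of an array-backed container with a for-loop search, auto-construction, a debug-only ownership check and locks only on add and remove. It is not the code from my download: ‘ArrayTlsSketch’ is a hypothetical name, and the copy-on-write array is one assumed way of keeping the lookup lock-free.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical sketch: small array of (Thread, Data) pairs, searched linearly.
public class ArrayTlsSketch<T> where T : class, new()
{
    private sealed class Entry
    {
        public readonly Thread Owner;
        public readonly T Data;
        public Entry(Thread owner, T data) { Owner = owner; Data = data; }
    }

    private readonly object m_lock = new object();
    // Published arrays are never modified; add/remove swap in a new array.
    private volatile Entry[] m_entries = new Entry[0];

    // Fast path: no lock, plain for loop over a snapshot of the array.
    public T GetData(Thread thread)
    {
        Entry[] entries = m_entries;
        for (int i = 0; i < entries.Length; ++i)
        {
            if (entries[i].Owner == thread) return entries[i].Data;
        }
        // Only checked on a miss; Debug.Assert compiles away in release builds.
        Debug.Assert(thread == Thread.CurrentThread, "Threads may only create their own data");
        return AddEntry(thread);
    }

    // Convenience accessor; pays for Thread.CurrentThread on every call.
    public T Data { get { return GetData(Thread.CurrentThread); } }

    private T AddEntry(Thread thread)
    {
        lock (m_lock)
        {
            Entry[] old = m_entries;
            Entry[] grown = new Entry[old.Length + 1];
            Array.Copy(old, grown, old.Length);
            T data = new T();                      // auto-construction
            grown[old.Length] = new Entry(thread, data);
            m_entries = grown;
            return data;
        }
    }

    // Explicit cleanup for one thread.
    public void FreeData(Thread thread)
    {
        lock (m_lock)
        {
            Entry[] old = m_entries;
            int index = -1;
            for (int i = 0; i < old.Length; ++i)
            {
                if (old[i].Owner == thread) { index = i; break; }
            }
            if (index < 0) return;
            Entry[] shrunk = new Entry[old.Length - 1];
            Array.Copy(old, 0, shrunk, 0, index);
            Array.Copy(old, index + 1, shrunk, index, old.Length - index - 1);
            m_entries = shrunk;
        }
    }
}
```

The copy-on-write swap allocates a new array on every add and remove, which is acceptable here only because those happen once per thread rather than once per access.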

Explicit cleanup

I wasn’t too happy with requiring explicit cleanup. It means the cleanest solution would be some kind of global static function which ensures all the various TLS variables are freed. Something like:

public static void CleanupTLS()
{
    Profiler.GetCurrentNodeStack().FreeData();
    Debug.Output.GetStringBuilder().FreeData();
    Renderer.DebugLines.FreeData();

    // etc...
}

Consider too that some of these variables may be initially at private scope, so they would need to be specifically exposed somehow so the freeing can occur. At best you’d need to add explicit public freeing functions to various affected modules.

As hinted in my notes, a global manager class which registers all TLS variables could make this a lot cleaner. As each TLS variable is instantiated, it registers with the manager. Where this occurs is not performance-critical. The freeing isn’t performance-critical either, and it is pretty simple: the manager just walks all the TLS references it’s been given, and attempts to purge any data they hold for a specific thread.

So instead of calling the above mess, you can just call:

Core.ThreadLocalManager.FreeData();

Bit happier with it now. 🙂
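A manager along those lines might look roughly like the sketch below. The names ‘IThreadDataOwnerSketch’, ‘ThreadLocalManagerSketch’ and ‘RegisteredTlsSketch’ are hypothetical and this is not the ThreadLocalManager from my download; the container is deliberately the simplest thing that can register itself.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical interface implemented by anything holding per-thread data.
public interface IThreadDataOwnerSketch
{
    void FreeData(Thread thread);
}

// Hypothetical manager: TLS containers register once, cleanup is one call.
public static class ThreadLocalManagerSketch
{
    private static readonly List<IThreadDataOwnerSketch> s_owners =
        new List<IThreadDataOwnerSketch>();

    public static void Register(IThreadDataOwnerSketch owner)
    {
        lock (s_owners) { s_owners.Add(owner); }
    }

    // Frees everything the calling thread owns, across all registered containers.
    public static void FreeData()
    {
        Thread thread = Thread.CurrentThread;   // not performance-critical here
        lock (s_owners)
        {
            for (int i = 0; i < s_owners.Count; ++i)
            {
                s_owners[i].FreeData(thread);
            }
        }
    }
}

// Minimal container used only to demonstrate registration.
public class RegisteredTlsSketch<T> : IThreadDataOwnerSketch where T : class, new()
{
    private readonly Dictionary<Thread, T> m_data = new Dictionary<Thread, T>();

    public RegisteredTlsSketch()
    {
        ThreadLocalManagerSketch.Register(this);
    }

    public T GetData(Thread thread)
    {
        lock (m_data)
        {
            T value;
            if (!m_data.TryGetValue(thread, out value))
            {
                value = new T();
                m_data.Add(thread, value);
            }
            return value;
        }
    }

    public void FreeData(Thread thread)
    {
        lock (m_data) { m_data.Remove(thread); }
    }
}
```

Because registration happens in the constructor, private TLS variables get cleaned up without their modules having to expose any freeing function.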

Profiling figures

I explained my profiling setup earlier in this article. Here are the figures with my final implementation, compared against the existing methods of TLS. The ‘No param’ version of my code makes a call to ‘Thread.CurrentThread’ each time to determine what data to pass back. The ‘Thread param’ version uses a passed in thread reference to retrieve the data instead.

The figures are for ten concurrent threads making 100,000 accesses each:

                        Windows PC                              Xbox 360
                        Release – Debugger  Release – JIT’d     Release – Debugger  Release – JIT’d
My code – No param      152.3ms             31.2ms              * 7805.9ms          * 7248.8ms
My code – Thread param  139.4ms             27.9ms              456.8ms             74.2ms
[ThreadStatic]          172.7ms             45.6ms              N/A                 N/A
GetData/SetData()       164.3ms             51.1ms              ** 7854.1ms         ** 7565.3ms

 
* Using ‘Thread.CurrentThread’ on Xbox 360 is very slow.
** The GetData/SetData() (LocalDataStoreSlot) method doesn’t work under XNA 4.0 on Xbox 360. These figures are from when I ran the same tests months ago under XNA 3.1 on 360.

Given these figures I’m happy to use my new implementation for both Windows and Xbox 360. There are currently no alternatives for Xbox 360, so given I’ve explored many optimization possibilities, it’s pretty much as good as it gets. On Windows it was actually faster than even [ThreadStatic] in these tests, so using the same code across both totally makes sense.

Show me the code

Here’s the code for download:

ThreadLocal.cs (C# file)

It also includes the class for ‘ThreadLocalManager’, which as detailed earlier assists with easier cleanup of data from destroyed threads.

Here’s a quick contrived bit of code to show how these classes should be used:

private static Core.ThreadLocal<StringBuilder> ms_tls_stringbuilder =
    new Core.ThreadLocal<StringBuilder>();

//! Entry function for new worker threads
public static void ThreadRun()
{
    // Grab the TLS stringbuilder
    StringBuilder sb = ms_tls_stringbuilder.Data;

    // Do bunch of work
    // ...

    // Free all TLS data associated with this thread
    Core.ThreadLocalManager.FreeData();
}

The TLS data accessor also has a version which takes a thread parameter. This should be used in performance-critical code, given my findings detailed in this article. Just read ‘Thread.CurrentThread’ once in your new thread, and pass it around to any performance-critical functions that need to work with TLS.
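As a rough illustration of that pattern, the sketch below reads the thread object once and hands it down to a helper. ‘WorkerSketch’ and its methods are made-up names, and the delegate parameter is only a stand-in for whatever thread-parameter accessor your TLS class provides.

```csharp
using System;
using System.Text;
using System.Threading;

// Hypothetical worker showing the "read once, pass around" pattern.
public static class WorkerSketch
{
    // 'getBuilder' stands in for a TLS accessor that takes a thread parameter.
    public static void ThreadRun(Func<Thread, StringBuilder> getBuilder)
    {
        // The only call to Thread.CurrentThread in this thread's hot path.
        Thread self = Thread.CurrentThread;

        for (int i = 0; i < 3; ++i)
        {
            AppendValue(getBuilder, self, i);
        }
    }

    // Helpers take the thread as a parameter instead of looking it up again.
    private static void AppendValue(Func<Thread, StringBuilder> getBuilder, Thread self, int value)
    {
        StringBuilder sb = getBuilder(self);
        sb.Append(value).Append(';');
    }
}
```

The extra parameter is a bit of a nuisance to plumb through, but on Xbox it keeps the slow Thread.CurrentThread call out of every access.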

Value Types

My implementation is currently tied to reference types; value types will not work with it. An alternative is to simply wrap value types in a class, which is what I’ve been happy to do with my engine.

If you absolutely need a thread-local implementation which works natively with value types, it’s not too hard to do. You’d want to set up an alternative class, with pretty much the same code in it. Change the ‘where’ constraint to simply: ‘struct’. There are a couple of different ways I could see the implementation going:

  • Implement separate Get and Set data methods. Maybe just a single property?
  • Have a get method not return a value, but instead pass it back through a reference parameter.

There’s no reason why you couldn’t have both these types of interface for it.
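Here is a minimal sketch offering both interfaces on one class. It is hypothetical (‘ThreadLocalValueSketch’ is not part of my download), it uses an ‘out’ parameter for the pass-back version, and the storage is a plain locked Dictionary<> purely to keep the example short.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical value-type variant; the interface is the point, not the storage.
public class ThreadLocalValueSketch<T> where T : struct
{
    private readonly Dictionary<Thread, T> m_values = new Dictionary<Thread, T>();

    // Passes the value back through a parameter rather than a return value.
    public void GetData(Thread thread, out T value)
    {
        lock (m_values)
        {
            // TryGetValue leaves 'value' as default(T) if nothing was stored yet.
            m_values.TryGetValue(thread, out value);
        }
    }

    public void SetData(Thread thread, T value)
    {
        lock (m_values) { m_values[thread] = value; }
    }

    // Single-property alternative; pays for Thread.CurrentThread on each use.
    public T Data
    {
        get
        {
            T value;
            GetData(Thread.CurrentThread, out value);
            return value;
        }
        set { SetData(Thread.CurrentThread, value); }
    }
}
```

Since structs are copied on the way out, a caller has to write the value back with SetData (or the property setter) after changing it.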
