Currently, accessing a single threading.local() from multiple threads doesn't scale well in the free-threaded build, because the threads contend on the reference count of the shared _thread._local object. We should enable deferred reference counting on _thread._local objects to avoid this bottleneck.
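
A rough sketch of the access pattern in question, in case it helps with reproducing: several threads repeatedly read and write an attribute on one module-level threading.local(). The helper names (`worker`, `run`) and the iteration counts are made up for illustration and are not taken from any existing benchmark. The contention is only expected to show up on a free-threaded build, where the threads actually run in parallel.

```python
import threading
import time

# One threading.local() instance shared by every thread. Each thread sees its
# own attribute values, but all of them go through the same underlying
# _thread._local object.
local = threading.local()


def worker(iterations):
    """Repeatedly access an attribute on the shared thread-local object."""
    local.x = 0
    for _ in range(iterations):
        local.x += 1
    return local.x


def run(num_threads, iterations=100_000):
    """Run `worker` in `num_threads` threads; return (elapsed_seconds, results)."""
    results = [None] * num_threads

    def target(index):
        results[index] = worker(iterations)

    threads = [
        threading.Thread(target=target, args=(i,)) for i in range(num_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return elapsed, results


if __name__ == "__main__":
    # Each thread does the same amount of work, so with perfect scaling the
    # wall-clock time would stay roughly flat as threads are added (up to the
    # number of available cores).
    for n in (1, 2, 4, 8):
        elapsed, _ = run(n)
        print(f"{n} thread(s): {elapsed:.3f}s")
```

Comparing the timings across thread counts should give a rough idea of how well the accesses scale before and after a change.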
Linked PRs