`threading.local()` access should scale well from multiple threads

Currently, accessing the same `threading.local()` object from multiple threads doesn't scale well because of reference count contention on the shared `_thread._local` object. We should use deferred reference counting on `_thread._local` to avoid this bottleneck.
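A rough way to observe this is a micro-benchmark along the following lines. It is only a sketch: the names `worker` and `run` and the iteration counts are illustrative, not taken from the linked PRs, and the slowdown is presumably only visible on a build where threads actually run in parallel (for example, a free-threaded build).

```python
import threading
import time

# One thread-local object shared by all threads; each thread sees its own
# attributes, but every access goes through this same `_thread._local` instance.
local = threading.local()


def worker(iterations, results, index):
    """Hypothetical workload: repeatedly read and write a thread-local attribute."""
    local.x = 0  # per-thread value
    for _ in range(iterations):
        local.x += 1  # each access touches the shared `local` object
    results[index] = local.x


def run(num_threads, iterations=100_000):
    """Run `worker` in `num_threads` threads; return (elapsed seconds, per-thread counts)."""
    results = [None] * num_threads
    threads = [
        threading.Thread(target=worker, args=(iterations, results, i))
        for i in range(num_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return elapsed, results


if __name__ == "__main__":
    # Each thread does the same amount of work, so with ideal scaling the
    # elapsed time would stay roughly flat as the thread count grows.
    for n in (1, 2, 4, 8):
        elapsed, _ = run(n)
        print(f"{n} threads: {elapsed:.3f}s")
```

If the elapsed time grows noticeably with the thread count even though the per-thread work is constant, that is consistent with contention on the shared object.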


### Linked PRs
* gh-128693
* gh-128753