<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
  <channel>
    <title>0+0 &gt; 0: C++ thread-local storage performance - comments</title>
    <link>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comments</link>
    <description>Comments on "0+0 &gt; 0: C++ thread-local storage performance" by Yossi Kreinin</description>
    <docs>http://www.rssboard.org/rss-specification</docs>
    <generator>Yossi Kreinin's ugly publishing software</generator>
    <image>
      <url>https://yosefk.com/blog/self.jpg</url>
      <title>0+0 &gt; 0: C++ thread-local storage performance - comments</title>
      <link>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comments</link>
      <width>144</width>
      <height>144</height>
    </image>
    <language>en</language>
    <lastBuildDate>Sat, 27 Jun 2026 12:00:04 +0000</lastBuildDate>
    <item>
      <title>F16</title>
      <link>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comment-cce80ac1-cd20-4327-856f-514b4bc8506c</link>
      <description><![CDATA[<html><head> </head><body>Two workarounds:<p></p>
<ul>
<li><p>On x64 Linux, the <code>%gs</code> register is free. Put your own
TCB there. Initialize it with <code>arch_prctl(ARCH_SET_GS)</code> on
thread start.</p></li>
<li><p>On ARM64 Linux you don't have userspace access to a secondary
thread-local register. In this case, put an <code>initial-exec</code>
thread local in the <strong>top-level executable file</strong>. Compute
the offset of this variable from the TCB. Pass that offset into your
library's init function. This offset is negative on x64 and positive on
ARM.</p></li>
<li><p>Compiler Explorer: https://godbolt.org/z/1vfzro3oP</p></li>
</ul>
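<p>A minimal sketch of the second workaround, assuming GCC or Clang on x86-64 or AArch64 Linux. The names <code>g_slot</code>, <code>thread_pointer</code>, <code>slot_offset</code> and <code>slot_via_offset</code> are hypothetical and do not come from the linked Compiler Explorer example. The executable would compute the offset once and pass it to the library, which then adds it to the thread pointer on whatever thread it runs on:</p>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical name: a thread-local that lives in the top-level executable
// and is forced to the initial-exec TLS model.
static thread_local int g_slot __attribute__((tls_model("initial-exec"))) = 0;

// Read the thread pointer, i.e. the address of the TCB.
static inline char* thread_pointer() {
    char* tp;
#if defined(__x86_64__)
    // On x86-64 Linux the TCB stores a pointer to itself at %fs:0.
    asm("mov %%fs:0, %0" : "=r"(tp));
#elif defined(__aarch64__)
    asm("mrs %0, tpidr_el0" : "=r"(tp));
#else
#error "this sketch covers x86-64 and AArch64 Linux only"
#endif
    return tp;
}

// Offset of g_slot from the TCB. The executable would compute this once
// and hand it to the library's init function.
std::ptrdiff_t slot_offset() {
    return reinterpret_cast<char*>(&g_slot) - thread_pointer();
}

// What the library would do on any thread: thread pointer plus stored offset.
int* slot_via_offset(std::ptrdiff_t offset) {
    return reinterpret_cast<int*>(thread_pointer() + offset);
}

// Round-trip check on the calling thread.
void check_on_this_thread() {
    assert(slot_via_offset(slot_offset()) == &g_slot);
}
```

<p>Because the variable sits in the executable's static TLS block, the same offset should be valid on every thread, which is what lets the library skip <code>__tls_get_addr</code>.</p>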
<p></p>]]></description>
      <pubDate>Thu, 22 Jan 2026 15:44:08 +0000</pubDate>
      <dc:creator>F16</dc:creator>
    </item>
    <item>
      <title>slope 2</title>
      <link>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comment-95e5b39f-fe1d-464d-b4e7-d43e022e41cc</link>
      <description><![CDATA[<html><head> </head><body>they could just as well LD_PRELOAD the shared
library… though I suppose there are scenarios where using the tunable
ends up being the cleaner approach.<p></p>
<p></p>]]></description>
      <pubDate>Sun, 07 Dec 2025 23:38:45 +0000</pubDate>
      <dc:creator>slope 2</dc:creator>
    </item>
    <item>
      <title>Joker_vD</title>
      <link>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comment-deb0feaf-d88b-46e5-be47-8195838109af</link>
      <description><![CDATA[<html><head> </head><body>Well, that orange-coloured code is not really
about spilling and restoring registers: nothing touches RBP in between,
and it's a call-preserved register anyway. It's about maintaining the
proper stack alignment. You see, Linux's ABI on x64 mandates that the
stack be aligned to 16 bytes before a CALL, but since the CALL itself
pushes an 8-byte return address and so ruins this property, every
non-leaf function has to push and pop an odd number of registers in its
prologue/epilogue, RBP being particularly popular for this purpose. The
alternative would be to use "SUB/ADD RSP, 8", but those take 4 bytes
each while PUSH/POP take only 1 byte and execute just as fast.<p></p>
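<p>The alignment arithmetic can be checked with a small compile-time model. This is only an illustration: RSP is modelled as a plain integer, the starting address is arbitrary, and the helper names are made up.</p>

```cpp
#include <cstdint>

// Model RSP as a plain integer; the starting value is arbitrary but 16-byte aligned.
constexpr std::uint64_t kAlignedRsp = 0x7ffd00001000ULL;

constexpr bool aligned16(std::uint64_t rsp) { return rsp % 16 == 0; }

// CALL pushes an 8-byte return address.
constexpr std::uint64_t call(std::uint64_t rsp) { return rsp - 8; }

// PUSH of one 64-bit register moves RSP down by 8.
constexpr std::uint64_t push(std::uint64_t rsp) { return rsp - 8; }

// The caller aligned the stack before CALL, so on entry RSP is off by 8.
constexpr std::uint64_t kOnEntry = call(kAlignedRsp);
static_assert(!aligned16(kOnEntry), "misaligned right after CALL");

// One push (RBP, say) restores alignment before the next CALL.
static_assert(aligned16(push(kOnEntry)), "one push realigns");

// Two pushes break it again, hence the odd number of pushes.
static_assert(!aligned16(push(push(kOnEntry))), "two pushes misalign");
```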
<p></p>]]></description>
      <pubDate>Tue, 11 Mar 2025 00:32:58 +0000</pubDate>
      <dc:creator>Joker_vD</dc:creator>
    </item>
    <item>
      <title>Yossi Kreinin</title>
      <link>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comment-5b44e386-49d0-487c-ad1a-16f19db471b6</link>
      <description><![CDATA[<html><head> </head><body>You're right, though I guess if you can have
the user set this "tunable" via an env var, you can usually also have
them <code>LD_PRELOAD</code> the shared library? But maybe in some
cases the tunable is the better way.<p></p>
<p></p>]]></description>
      <pubDate>Wed, 19 Feb 2025 11:03:19 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
    </item>
    <item>
      <title>Aaron Puchert</title>
      <link>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comment-96be8b17-ba62-487e-8175-4477887c61c0</link>
      <description><![CDATA[<html><head> </head><body>You have already mentioned the initial-exec
model and how it is constrained by the static TLS reserve in case of
dlopen. This reserve seems to be configurable in glibc via the tunable
<a href="https://www.gnu.org/software/libc/manual/html_node/Dynamic-Linking-Tunables.html">glibc.rtld.optional_static_tls</a>.<p></p>
<p></p>]]></description>
      <pubDate>Tue, 18 Feb 2025 13:01:39 +0000</pubDate>
      <dc:creator>Aaron Puchert</dc:creator>
    </item>
    <item>
      <title>Yossi Kreinin</title>
      <link>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comment-670511b4-ce43-43ad-9d45-5dbcd304d299</link>
      <description><![CDATA[<html><head> </head><body><span class="citation" data-cites="billythefisherman">@billythefisherman</span>: I endorse this
scheme and it's consistent with the guidelines above; though making the
structure thread_local is probably faster most of the time (with the
colorful exception of it being more likely to fail under
<code>-ftls-model=initial-exec</code> if your shared library is neither
<code>LD_PRELOAD</code>ed nor <code>DT_NEEDED</code>)<p></p>
<p><span class="citation" data-cites="Daniel">@Daniel</span> Cranford:
even without symbol interposition, you will get calls to
<code>__tls_get_addr</code>, though you will get fewer such calls,
hopefully one per function with a recent clang on x86 (I updated the
post with some details of what hidden visibility (which precludes
interposition) does to code generation based on input from the same
MaskRay whose writing you're linking to; not sure if you commented
before this update or not, quite possibly it was before.) If this was
just about interposition, then with hidden visibility you wouldn't need
<code>__tls_get_addr</code> (and a lot of work was put in by (some of
the) compiler writers to use fewer <code>__tls_get_addr</code> calls
given hidden visibility; had they been able to avoid calling it
completely, I'm sure they would.)</p>
<p>So I maintain that this is first and foremost the result of TLS being
allocated non-contiguously by libc and the resulting need to find out
the module base pointer before you can add the offset to it.</p>
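<p>As a rough illustration of that last point, here is a simplified model of the lookup. The types and the function <code>model_tls_get_addr</code> are hypothetical; the real glibc structures differ in detail and the real function also handles lazily allocated blocks.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified, hypothetical model; not the actual glibc data structures.
struct TlsIndex {
    std::size_t module;  // which loaded module (executable or .so) owns the variable
    std::size_t offset;  // offset of the variable within that module's TLS block
};

// Per-thread table of TLS block base pointers, one entry per module.
// The blocks are allocated separately, so they are not contiguous.
struct ThreadDtv {
    std::vector<char*> block_base;
};

// Roughly what a dynamic TLS access has to do: find this thread's base
// pointer for the module, then add the variable's offset to it.
void* model_tls_get_addr(const ThreadDtv& dtv, const TlsIndex& idx) {
    return dtv.block_base.at(idx.module) + idx.offset;
}
```

<p>If all blocks were contiguous at a known distance from the thread pointer, the table lookup would collapse into a single constant offset, which is what the static TLS models get.</p>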
<p></p>]]></description>
      <pubDate>Tue, 18 Feb 2025 07:17:19 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
    </item>
    <item>
      <title>billythefisherman</title>
      <link>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comment-12d5de42-bcc5-4e07-92f2-58950e829d79</link>
      <description><![CDATA[<html><head> </head><body>Just put a single u64 into thread local
storage and have it point to a structure that contains all your thread
local stuff. No need for multiple thread local objects etc. This scheme
has worked for me over the years at least but maybe I'm missing the
requirements above?<p></p>
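<p>A minimal sketch of this scheme, with a made-up <code>ThreadState</code> structure and lazy allocation added as an assumption (the comment does not say how the structure is created or freed):</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-thread state; the fields are placeholders.
struct ThreadState {
    std::uint64_t call_count = 0;
    int last_error = 0;
};

// The only thread-local object: one pointer with constant initialization,
// so no constructor or destructor has to run for it.
static thread_local ThreadState* t_state = nullptr;

// Allocate on first use in each thread. This sketch never frees the state;
// real code would need some cleanup at thread exit.
inline ThreadState& state() {
    if (t_state == nullptr) {
        t_state = new ThreadState();
    }
    return *t_state;
}

// Example of code that touches the per-thread state.
void record_call() { ++state().call_count; }
```

<p>Every access pays for one TLS lookup plus one pointer dereference, regardless of how many fields the structure holds.</p>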
<p></p>]]></description>
      <pubDate>Mon, 17 Feb 2025 23:17:52 +0000</pubDate>
      <dc:creator>billythefisherman</dc:creator>
    </item>
    <item>
      <title>Daniel Cranford</title>
      <link>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comment-ed5230cf-b365-4505-836d-612039e9010a</link>
      <description><![CDATA[<p>The reason the dynamic linker does not (and cannot) compute a simple
offset for the TLS storage in the shared object case is something called
"ELF symbol interposition". The default linker flags make every symbol
in an .so potentially replaceable by another .so loaded into the
process, so every function call and (static) variable access goes
indirectly through the global offset table.</p>
<p>See
https://www.facebook.com/story.php?story_fbid=10107358290728348&amp;id=15706763
and
https://maskray.me/blog/2021-05-16-elf-interposition-and-bsymbolic</p>
<p></p>]]></description>
      <pubDate>Mon, 17 Feb 2025 16:07:52 +0000</pubDate>
      <dc:creator>Daniel Cranford</dc:creator>
    </item>
  </channel>
</rss>
