• Eric Myhre's avatar
    Correct a comment about performance in justString. · d0ce3ded
    Eric Myhre authored
    There's actually a doozy of performance considerations in this commit.
    Verifying (and then correcting) that comment kicked off a surprisingly
    deep and demanding binge of research.
    
    (Some of the considerations are still only "considerations",
    unfortunately -- one key discovery is that (surprise!) conclusive
    choices require more info than microbenchmarks alone can yield.)
    
    First, the big picture:
    
    One of the things we need to be really careful about throughout a
    system like go-ipld-prime (where we're dealing with large amounts of
    serialization) is the cost of garbage collection.  Since we're often
    inside of an application's "tight loop" or "hot loop" or whatever you
    prefer to call it, if we lean on the garbage collector too heavily...
    it's very, very likely to show up a system-wide impact.
    
    So, in essence, we want to call "malloc" less.  This isn't always easy.
    Sometimes it's downright impossible: we're building large tree
    structures; it's flatly impossible to do this without allocating
    some memory.
    In other cases, there are avoidable things: and in particular, one
    common undesirable source of allocations comes from "autoboxing" around
    interfaces.  (More specifically, the name of the enemy here will often
    show up on profiling reports as "runtime.convT2I".)  Sometimes this one
    can be avoided; other times, not.
    
    Now, a more detailed picture:
    
    There are actually several functions in the runtime that relate to
    memory allocation and garbage collection, and the less we use any of
    them, the better; but also, they are not all created equal.
    
    These are the functions that are of interest:
    
    - runtime.convT2I / runtime.convT2E
    - runtime.newObject
    - runtime.writeBarrier / runtime.gcWriteBarrier
    - runtime.convTstring / etc
    
    Most of these functions call `runtime.mallocgc` internally, which is
    why they're worth of note.  (writeBarrier/gcWriteBarrier are also
    noteworthy, but are different beasts.)
    
    Different kinds of implementations of something like `justString` will
    cause the generated assembly to contain calls to various combinations
    of these runtime functions when they're converted into a `Node`.
    
    These are the variations considered:
    
    - Variation 1: `type justString string`:
      results in `runtime.convTstring`.
    
    - Variation 2: `type justString struct { x string }`:
      results in `runtime.convT2I`.
    
    - Variation 3: as above, but returning `&justString{val}`:
      results in `runtime.newobject` *and* its friends
      `runtime.writeBarrier` and `runtime.gcWriteBarrier`.
    
    The actual performance of these... it depends.
    
    In microbenchmarks, I've found the above examples are roughly:
    
    - Variation 1:    23.9 ns/op     16 B/op     1 allocs/op
    - Variation 2:    31.1 ns/op     16 B/op     1 allocs/op
    - Variation 3:    23.0 ns/op     16 B/op     1 allocs/op
    
    So, a couple things about this surprised me; and a couple things I'm
    still pretty sure are just microbenchmarks being misleading.
    
    First of all: *all* of these call out to `mallocgc` internally.  And so
    we see an alloc-per-op reported in all three of them.  (This actually
    kinda bugs me, because I feel like we should be able to fit all the
    requisite information in the first case directly into the interface,
    which is, if I understand correctly, already always two words; and
    arguably, the compiler might be smart enough to do this in the second
    case as well.  But I'm wrong about this, for whatever reason, so let's
    just accept this one and move along.)  But they vary in time.  Why?
    
    Variation 2 seems to stand out as slower.  Interestingly, it turns out
    `convT2E` and `convT2I` are extra problematic because they involve a
    call of `typedmemmove` internally -- as a comment in the source says,
    there's both an allocation, a zeroing, and then a copy here (at least,
    as of go1.12); this is a big bummer.  In addition, even before getting
    that deep, if you check out the disassembly of just our functions:
    for our second variation, as inlined into our microbenchmark, there are
    9 instructions, plus 1 'CALL'; vs only 3+1 for the first variation.
    This memmove and extra instructions seems to be the explainer for why
    our second variation (`struct{string}`) is significantly (~8ns) slower.
    
    (And here I thought variation two would do well!  A struct with one
    field is the same size as the field itself; a string is one word
    of pointer; and an interface has another word for type; and that's
    our two words, so it should all pack, and on the stack!  Alas: no.)
    
    Now back up to Variation 1 (just a typedef of a string): this one
    invokes `runtime.convTstring`, and while that does invoke `mallocgc`,
    there's a detail about how that's interesting: it does it with an ask
    for a small number of bytes.  Specifically, it asks for... well,
    `unsafe.Sizeof(string)`, so that varies by platform, but it's "tiny".
    What's "tiny" mean?  `mallocgc` has a specific definition of this, and
    you can see it by grepping the runtime package source for
    "maxTinySize": it's 16 bytes.  Things under this size get special
    treatment from a "tiny allocator"; this seems to be why
    `runtime.convTstring` is relatively advantaged.
    
    (You can see benchmarks relating to this in the runtime package itself:
    try `go test -run=x -bench=Malloc runtime`.  There's a *huge* cliff
    between MallocLargeStruct versus the rest of its fellows.)
    
    Variation 3 also appears competitive.  This one surprises me, and this
    is where I still feel like microbenchmarks must be hoodwinking.
    The use of `runtime.newobject` seems to hit the same corners as
    `runtime.convTstring` at runtime in our situation here: it's "tiny";
    that's neat.  More confusingly, though, `runtime.writeBarrier` and
    `runtime.gcWriteBarrier` *should* be (potentially) very high cost
    calls.  And for some reason, they're not.  This particular workload in
    the microbenchmark must just-so-happen to tickle things in such a way
    that these calls are free (literally; within noise levels), and I
    suspect that's a happy coincidence in the benchmark that won't at all
    hold in real usage -- as any amount of real memory contention appears,
    the costs of these gc-related calls can be expected to rise.
    
    I did a few more permutations upon Variations 2 and 3, just out of
    curiosity and for future reference, adding extra fields to see if any
    interesting step functions revel themselves.  Here's what I found:
    
    - {str,int,int,int,int} is 48 bytes; &that allocs the same amount; in speed, & is faster; 33ns vs 42ns.
    - {str,int,int,int} is 48 bytes; &that allocs the same amount; in speed, & is faster; 32ns vs 42ns.
    - {str,int,int} is 32 bytes; &that allocs the same amount; in speed, & is faster; 32ns vs 39ns.
    - {str,int} is 32 bytes; &that allocs the same amount; in speed, & is faster; 31ns vs 38ns.
    - {str} is 16 bytes; &that allocs the same amount; in speed, & is faster; 24ns vs vs 32ns.
    
    Both rise in time cost as the struct grows, but the non-pointer
    variant grows faster, and it experiences a larger step of increase
    each time the size changes (which in turn steps because of alignment).
    The &{str} case is noticeably faster than the apparently linear
    progression that starts upon adding a second field; since we see the
    number 16 involved, it seems likely that this is the influence of the
    "tiny allocator" in action, and the rest of the values are linear
    relative to each other because they're all over the hump where the
    tiny allocator special path disengages.
    
    (One last note: there's also a condition about "noscan" which toggles
    the "tiny allocator", and I don't fully understand this detail.  I'd
    have thought strings might count as a pointer, which would cause our
    Variation 3 to not pass the `t.kind&kindNoPointers` check; but the
    performance cliff observation described in the previous paragraph
    seems to empirically say we're not getting kicked out by "noscan".
    (Either that or there's some yet-other phenomenon I haven't sussed.))
    
    (Okay, one even-laster note: in the course of diving around in the
    runtime malloc code, I found an interesting comment about using memory
    "from the P's arena" -- "P" being one of the letters used when talking
    about goroutine internals -- and I wonder if that contributes to our
    little mystery about how the `gcWriteBarrier` method seems so oddly
    low-cost in these microbenchmarks: perhaps per-thread arenas combined
    with lack of concurrency in the benchmark combined with quickly- and
    sequentially-freed allocations means any gcWriteBarrier is essentially
    reduced to nil.  However, this is just a theory, and I won't claim to
    totally understand the implications of this; commenting on it here
    mostly to serve as a pointer to future reading.)
    
    ---
    
    Okay.  So what comes of all this?
    
    - I have two choices: attempt to proceed further down a rabbithole of
    microbenchmarking and assembly-splunking (and next, I think, patching
    debug printfs into the compiler and runtime)... or, I can see that last
    one as a step too far for today, pull up, commit this, and return to
    this subject when there's better, less-toy usecases to test with.
    I think the latter is going to be more productive.
    
    - I'm going to use the castable variation here (Variation 1).  This
    won't always be the correct choice: it only flies here because strings
    are immutable anyway, and because it's a generic storage implementation
    rather than having any possibility of additional constraints, adjuncts
    from the schema system, validators, etc; and therefore, I don't
    actually care if it's possible to cast things directly in and out of
    this type (since doing so can't break tree immutability, and it can't
    break any of those other contracts because there aren't any).
    
    - A dev readme file appears.  It discusses what choices we might make
    for other cases in the future.  It varies by go native kind; and may
    be different for codegen'd types vs general storage implementations.
    
    - Someday, I'd like to look at this even further.  I have a persistent,
    nagging suspicion that it should be possible to make more steps in the
    direction of "zero cost abstractions" in this vicinity.  However, such
    improvements would seem to be pretty deep in the compiler and runtime.
    Someday, perhaps; but today... I started this commit in search of a
    simple diff to a comment!  Time to reel it in.
    
    Whew.
    Signed-off-by: default avatarEric Myhre <hash@exultant.us>
    d0ce3ded