Commit d0ce3ded authored by Eric Myhre

Correct a comment about performance in justString.

There's actually a doozy of performance considerations in this commit.
Verifying (and then correcting) that comment kicked off a surprisingly
deep and demanding binge of research.

(Some of the considerations are still only "considerations",
unfortunately -- one key discovery is that (surprise!) conclusive
choices require more info than microbenchmarks alone can yield.)

First, the big picture:

One of the things we need to be really careful about throughout a
system like go-ipld-prime (where we're dealing with large amounts of
serialization) is the cost of garbage collection.  Since we're often
inside of an application's "tight loop" or "hot loop" or whatever you
prefer to call it, if we lean on the garbage collector too heavily...
it's very, very likely to show up as a system-wide impact.

So, in essence, we want to call "malloc" less.  This isn't always easy.
Sometimes it's downright impossible: we're building large tree
structures; it's flatly impossible to do this without allocating
some memory.
In other cases, there are avoidable things: and in particular, one
common undesirable source of allocations comes from "autoboxing" around
interfaces.  (More specifically, the name of the enemy here will often
show up on profiling reports as "runtime.convT2I".)  Sometimes this one
can be avoided; other times, not.

Now, a more detailed picture:

There are actually several functions in the runtime that relate to
memory allocation and garbage collection, and the less we use any of
them, the better; but also, they are not all created equal.

These are the functions that are of interest:

- runtime.convT2I / runtime.convT2E
- runtime.newObject
- runtime.writeBarrier / runtime.gcWriteBarrier
- runtime.convTstring / etc

Most of these functions call `runtime.mallocgc` internally, which is
why they're worthy of note.  (writeBarrier/gcWriteBarrier are also
noteworthy, but are different beasts.)

Different kinds of implementations of something like `justString` will
cause the generated assembly to contain calls to various combinations
of these runtime functions when they're converted into a `Node`.

These are the variations considered:

- Variation 1: `type justString string`:
  results in `runtime.convTstring`.

- Variation 2: `type justString struct { x string }`:
  results in `runtime.convT2I`.

- Variation 3: as above, but returning `&justString{val}`:
  results in `runtime.newobject` *and* its friends
  `runtime.writeBarrier` and `runtime.gcWriteBarrier`.
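
In code form, the shapes compared look roughly like this (an illustrative
sketch, using a stand-in one-method interface so it compiles on its own;
the real ipld.Node interface is of course much larger):

    package sketch

    // node stands in for ipld.Node, trimmed to a single method.
    type node interface {
        AsString() (string, error)
    }

    // Variation 1: a bare typedef of a string.
    // Converting this into an interface compiles down to runtime.convTstring.
    type justString1 string

    func (x justString1) AsString() (string, error) { return string(x), nil }

    // Variation 2: a single-field struct, boxed by value.
    // Converting this into an interface compiles down to runtime.convT2I (as of go1.12).
    type justString2 struct{ x string }

    func (x justString2) AsString() (string, error) { return x.x, nil }

    // Variation 3: the same struct shape, but boxed as a pointer.
    // Constructing it compiles down to runtime.newobject (plus write barriers).
    type justString3 struct{ x string }

    func (x *justString3) AsString() (string, error) { return x.x, nil }

    func box1(s string) node { return justString1(s) }
    func box2(s string) node { return justString2{s} }
    func box3(s string) node { return &justString3{s} }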

The actual performance of these... it depends.

In microbenchmarks, I've found the above examples are roughly:

- Variation 1:    23.9 ns/op     16 B/op     1 allocs/op
- Variation 2:    31.1 ns/op     16 B/op     1 allocs/op
- Variation 3:    23.0 ns/op     16 B/op     1 allocs/op
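
(For the record, numbers of that shape come from benchmarks run with
allocation reporting turned on -- `-benchmem` is what produces the B/op and
allocs/op columns -- roughly as in:

    go test -run=x -bench=JustString -benchmem .

The Variation 1 benchmark is part of this commit; the benchmarks for the
other two variations aren't included here.)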

So, a couple things about this surprised me; and a couple things I'm
still pretty sure are just microbenchmarks being misleading.

First of all: *all* of these call out to `mallocgc` internally.  And so
we see an alloc-per-op reported in all three of them.  (This actually
kinda bugs me, because I feel like we should be able to fit all the
requisite information in the first case directly into the interface,
which is, if I understand correctly, already always two words; and
arguably, the compiler might be smart enough to do this in the second
case as well.  But I'm wrong about this, for whatever reason, so let's
just accept this one and move along.)  But they vary in time.  Why?

Variation 2 seems to stand out as slower.  Interestingly, it turns out
`convT2E` and `convT2I` are extra problematic because they involve a
call of `typedmemmove` internally -- as a comment in the source says,
there's an allocation, a zeroing, and then a copy here (at least,
as of go1.12); this is a big bummer.  In addition, even before getting
that deep, if you check out the disassembly of just our functions:
for our second variation, as inlined into our microbenchmark, there are
9 instructions, plus 1 'CALL'; vs only 3+1 for the first variation.
This memmove and the extra instructions seem to explain why
our second variation (`struct{string}`) is significantly (~8ns) slower.
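
(If you want to reproduce that kind of inspection, one convenient route --
an assumption about workflow, not the only way -- is to ask the compiler for
its assembly listing while compiling the benchmarks, and grep it:

    go test -run=x -bench=JustString -gcflags='-S' . 2> asm.txt
    grep -n 'runtime\.convT' asm.txt

The `-S` output lands on stderr during compilation.)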

(And here I thought variation two would do well!  A struct with one
field is the same size as the field itself; a string is one word
of pointer; and an interface has another word for type; and that's
our two words, so it should all pack, and on the stack!  Alas: no.)

Now back up to Variation 1 (just a typedef of a string): this one
invokes `runtime.convTstring`, and while that does invoke `mallocgc`,
there's a detail about how it does so that's interesting: it asks for only
a small number of bytes.  Specifically, it asks for the size of a string
header (`unsafe.Sizeof` of a string value), so that varies by platform,
but it's "tiny".
What's "tiny" mean?  `mallocgc` has a specific definition of this, and
you can see it by grepping the runtime package source for
"maxTinySize": it's 16 bytes.  Things under this size get special
treatment from a "tiny allocator"; this seems to be why
`runtime.convTstring` is relatively advantaged.
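
A toy snippet, nothing repo-specific, if you want to see the size in
question on your platform:

    package main

    import (
        "fmt"
        "unsafe"
    )

    func main() {
        var s string
        // A string header is two words: a data pointer and a length.
        // On 64-bit platforms this prints 16 -- right at that boundary.
        fmt.Println(unsafe.Sizeof(s))
    }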

(You can see benchmarks relating to this in the runtime package itself:
try `go test -run=x -bench=Malloc runtime`.  There's a *huge* cliff
between MallocLargeStruct versus the rest of its fellows.)

Variation 3 also appears competitive.  This one surprises me, and this
is where I still feel like microbenchmarks must be hoodwinking.
The use of `runtime.newobject` seems to hit the same corners as
`runtime.convTstring` at runtime in our situation here: it's "tiny";
that's neat.  More confusingly, though, `runtime.writeBarrier` and
`runtime.gcWriteBarrier` *should* be (potentially) very high cost
calls.  And for some reason, they're not.  This particular workload in
the microbenchmark must just-so-happen to tickle things in such a way
that these calls are free (literally; within noise levels), and I
suspect that's a happy coincidence in the benchmark that won't at all
hold in real usage -- as any amount of real memory contention appears,
the costs of these gc-related calls can be expected to rise.

I did a few more permutations upon Variations 2 and 3, just out of
curiosity and for future reference, adding extra fields to see if any
interesting step functions reveal themselves.  Here's what I found
(the struct shapes are sketched in code just after the list):

- {str,int,int,int,int} is 48 bytes; &that allocs the same amount; in speed, & is faster; 33ns vs 42ns.
- {str,int,int,int} is 48 bytes; &that allocs the same amount; in speed, & is faster; 32ns vs 42ns.
- {str,int,int} is 32 bytes; &that allocs the same amount; in speed, & is faster; 32ns vs 39ns.
- {str,int} is 32 bytes; &that allocs the same amount; in speed, & is faster; 31ns vs 38ns.
- {str} is 16 bytes; &that allocs the same amount; in speed, & is faster; 24ns vs 32ns.
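
(For reference, the struct shapes behind that list were along these lines --
a reconstruction, since the throwaway code for those measurements isn't part
of this commit, and the field names are placeholders:

    package sketch

    type s1 struct{ str string }
    type s2 struct {
        str string
        a   int
    }
    type s3 struct {
        str  string
        a, b int
    }
    type s4 struct {
        str     string
        a, b, c int
    }
    type s5 struct {
        str        string
        a, b, c, d int
    }

Each shape was boxed into an interface both by value and via `&`, as with
Variations 2 and 3 above.)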

Both rise in time cost as the struct grows, but the non-pointer
variant grows faster, and it experiences a larger step of increase
each time the size changes (which in turn steps because of alignment).
The &{str} case is noticeably faster than the apparently linear
progression that starts upon adding a second field; since we see the
number 16 involved, it seems likely that this is the influence of the
"tiny allocator" in action, and the rest of the values are linear
relative to each other because they're all over the hump where the
tiny allocator special path disengages.

(One last note: there's also a condition about "noscan" which toggles
the "tiny allocator", and I don't fully understand this detail.  I'd
have thought strings might count as a pointer, which would cause our
Variation 3 to not pass the `t.kind&kindNoPointers` check; but the
performance cliff observation described in the previous paragraph
seems to empirically say we're not getting kicked out by "noscan".
(Either that or there's some yet-other phenomenon I haven't sussed.))

(Okay, one even-laster note: in the course of diving around in the
runtime malloc code, I found an interesting comment about using memory
"from the P's arena" -- "P" being one of the letters used when talking
about goroutine internals -- and I wonder if that contributes to our
little mystery about how the `gcWriteBarrier` method seems so oddly
low-cost in these microbenchmarks: perhaps per-thread arenas combined
with lack of concurrency in the benchmark combined with quickly- and
sequentially-freed allocations means any gcWriteBarrier is essentially
reduced to nil.  However, this is just a theory, and I won't claim to
totally understand the implications of this; commenting on it here
mostly to serve as a pointer to future reading.)

---

Okay.  So what comes of all this?

- I have two choices: attempt to proceed further down a rabbithole of
microbenchmarking and assembly-spelunking (and next, I think, patching
debug printfs into the compiler and runtime)... or, I can see that last
one as a step too far for today, pull up, commit this, and return to
this subject when there's better, less-toy usecases to test with.
I think the latter is going to be more productive.

- I'm going to use the castable variation here (Variation 1).  This
won't always be the correct choice: it only flies here because strings
are immutable anyway, and because it's a generic storage implementation
rather than having any possibility of additional constraints, adjuncts
from the schema system, validators, etc; and therefore, I don't
actually care if it's possible to cast things directly in and out of
this type (since doing so can't break tree immutability, and it can't
break any of those other contracts because there aren't any).

- A dev readme file appears.  It discusses what choices we might make
for other cases in the future.  It varies by go native kind; and may
be different for codegen'd types vs general storage implementations.

- Someday, I'd like to look at this even further.  I have a persistent,
nagging suspicion that it should be possible to make more steps in the
direction of "zero cost abstractions" in this vicinity.  However, such
improvements would seem to be pretty deep in the compiler and runtime.
Someday, perhaps; but today... I started this commit in search of a
simple diff to a comment!  Time to reel it in.

Whew.
Signed-off-by: Eric Myhre <hash@exultant.us>
parent 040f47d2

@@ -23,3 +23,7 @@
 - [Implementation](./schema.md#implementation)
 - [Migration Techniques](./schema.md#schemas-and-migration)
 - [Advanced Data Layouts](./advLayout.md)
+
+---
+
+- [Development notes: on Node implementations](./dev/node-implementations.md)

Dev Notes: on Node implementations
==================================

> (This document is aimed at developers and contributors to the library;
> if you're only using the library, it should not be necessary to read this.)

The concept of "Node" in IPLD is defined by the
[IPLD Data Model](https://github.com/ipld/specs/tree/master/data-model-layer),
and the interface of `ipld.Node` in this library is designed to make this
model manipulable in Golang.

`Node` is an interface rather than a concrete implementation because
there are many different performance tradeoffs which can be made,
and we have multiple implementations that make them!
Codegenerated types also necessitate having a `Node` interface they can conform to.

Designing a Node Implementation
-------------------------------

Concerns:

- 0. Correctness
- 1. Immutability
- 2. Performance

A `Node` implementation must of course conform with the Data Model.
Some nodes (especially, those that also implement `typed.Node`) may also
have additional constraints.

A `Node` implementation must maintain immutability, or it shatters abstractions
and breaks features that build upon the assumption of immutable nodes (e.g. caches).

A `Node` implementation should be as fast as possible.
(How fast this is, of course, varies -- and different implementations may make
different tradeoffs, e.g. often at the loss of generality.)

Note that "generality" is not on the list. Some `Node` implementations are
concerned with being usable to store any shape of data; some are not.
(A `Node` implementation will usually document which camp it falls into.)

### Castability

Castability refers to whether the `Node` abstraction can be added or removed
(also referred to as "boxing" and "unboxing")
by use of a cast by user code outside the library.

Castability relates to all three of Correctness, Immutability, and Performance:

- if something can be unboxed via cast, and thence become mutable, we have an Immutability problem.
- if something mutable can be boxed via cast, staying mutable, we have an Immutability problem.
- if something can be boxed via cast, and thence skip a validator, we have a Correctness problem.

(The relationship to performance is less black-and-white: though performance
considerations should always be backed up by benchmarks, casting can do well.)

If castability would run into one of these problems,
then a `Node` implementation must avoid it.
(A typical way to do this is to make a single-field struct,
as illustrated in the sketch below.)

Whether a `Node` implementation will encounter these problems depends primarily on
the kind (literally, per `reflect.Kind`) of golang type used in the implementation,
and on whether the `Node` is "general" or can have additional validators and constraints.
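
For illustration, a minimal sketch of the difference (hypothetical type names,
not actual library code):

```go
package example

// castableString is a bare typedef: any package can convert a string to it
// and back with a plain cast, so boxing and unboxing are open to user code.
type castableString string

// wrappedString is a single-field struct with an unexported field: code outside
// this package can neither construct it nor reach the field, so casting in and
// out is blocked, and the package keeps control over validation and mutation.
type wrappedString struct{ x string }

func example() {
	a := castableString("hello") // boxed by cast -- possible anywhere.
	_ = string(a)                // unboxed by cast -- also possible anywhere.

	b := wrappedString{x: "hello"} // only possible inside the defining package.
	_ = b.x
}
```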

#### Castability cases by example

Castability for strings is safe when the `Node` is "general" (i.e. has no constraints).
With no constraints, there's no Correctness concern;
and since strings are immutable, there's no Immutability concern.

Castability for strings is often *unsafe* when the `Node` is a `typed.Node`.
Typed nodes may have additional constraints, so we would have a Correctness problem.
(Note that the way we handle constraints in codegeneration means users can add
them *after* the code is generated, so the generation system can't presume
the absence of constraints.)

Castability for other scalar types (int, float, etc) is safe when the `Node` is "general",
for the same reasons it's safe for strings: all these things are pass-by-value
in golang, so they're effectively immutable, and thus we have no concerns.

Castability for bytes is a trickier topic.
See also [#when-performance-wins-over-immutability].
(TODO: the recommended course of action here is not yet clear.
I'd default to suggesting it should use a `struct{[]byte}` wrapping,
but if the performance cost of that is high, the value may be dubious.)

#### Zero values

If a `Node` is implemented as a golang struct, zero values may be a concern.

If the struct type is unexported, the concern is absolved:
the zero value can't be initialized outside the package.

If the `Node` implementation has no other constraints
(e.g., it's not also a `typed.Node` in addition to just an `ipld.Node`),
the concern is (almost certainly) absolved:
the zero value is simply a valid value.

For the remaining cases: it's possible that the zero value is not valid.
This is problematic, because in the system as a whole, we use the existence
of a value that's boxed into a `Node` as the statement that the value is valid,
rather than suffering the need for "defensive" checks cropping up everywhere.
(TODO: the recommended course of action here is not yet clear.
Making the type unexported and instead making an exported interface with a
single implementation may be one option, and it's possible it won't even be
noticeably expensive if we already have to fit `Node`, but I'm not sure I've
reconnoitered all the other costs of that (e.g., godoc effects?).
It's possible this will be such a corner case in practice that we might
relegate the less-ergonomic mode to being an adjunct option for codegen.)
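
One possible shape of that "exported interface, unexported implementation"
option, sketched with hypothetical names:

```go
package example

// Stringish is the exported interface; the concrete type stays unexported,
// so its (possibly-invalid) zero value can't be constructed outside this package.
type Stringish interface {
	AsString() (string, error)
}

type constrainedString struct{ x string }

func (v constrainedString) AsString() (string, error) { return v.x, nil }

// NewStringish is the only way to obtain a value, so any validation that
// belongs here can't be skipped by zero-value construction.
func NewStringish(s string) (Stringish, error) {
	// (hypothetical constraint checks would run here)
	return constrainedString{x: s}, nil
}
```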

When Performance wins over Immutability
---------------------------------------

Ideally? Never. In practice? Unfortunately, sometimes.

### bytes

There is no way to have immutable byte slices in Go.
Defensive copying is also ridiculously costly.
Methods that return byte slices typically do so without defensive copy.
Methods doing this should consistently document that
"It is not lawful to mutate the slice returned by this function".

@@ -6,23 +6,16 @@ import (
 )
 
 func String(value string) ipld.Node {
-	return justString{value}
+	return justString(value)
 }
 
 // justString is a simple boxed string that complies with ipld.Node.
-// It doesn't actually contain type info or comply with typed.Node
-// (which makes it cheaper: this struct doesn't trigger 'convt2e').
-// justString is particularly useful for boxing things like struct keys.
-type justString struct {
-	x string
-}
-
-// FUTURE: we'll also want a typed string, of course.
-// Looking forward to benchmarking how that shakes out: it will almost
-// certainly add cost in the form of 'convt2e', but we'll see how much.
-// It'll also be particularly interesting to find out if common patterns of
-// usage around map iterators will get the compiler to skip that cost if
-// the key is unused by the caller.
+// It's useful for many things, such as boxing map keys.
+//
+// The implementation is a simple typedef of a string;
+// handling it as a Node incurs 'runtime.convTstring',
+// which is about the best we can do.
+type justString string
 
 func (justString) ReprKind() ipld.ReprKind {
 	return ipld.ReprKind_String
@@ -58,7 +51,7 @@ func (justString) AsFloat() (float64, error) {
 	return 0, ipld.ErrWrongKind{MethodName: "AsFloat", AppropriateKind: ipld.ReprKindSet_JustFloat, ActualKind: ipld.ReprKind_String}
 }
 func (x justString) AsString() (string, error) {
-	return x.x, nil
+	return string(x), nil
 }
 func (justString) AsBytes() ([]byte, error) {
 	return nil, ipld.ErrWrongKind{MethodName: "AsBytes", AppropriateKind: ipld.ReprKindSet_JustBytes, ActualKind: ipld.ReprKind_String}
@@ -97,7 +90,7 @@ func (nb justStringNodeBuilder) CreateFloat(v float64) (ipld.Node, error) {
 	return nil, ipld.ErrWrongKind{MethodName: "CreateFloat", AppropriateKind: ipld.ReprKindSet_JustFloat, ActualKind: ipld.ReprKind_String}
 }
 func (nb justStringNodeBuilder) CreateString(v string) (ipld.Node, error) {
-	return justString{v}, nil
+	return justString(v), nil
 }
 func (nb justStringNodeBuilder) CreateBytes(v []byte) (ipld.Node, error) {
 	return nil, ipld.ErrWrongKind{MethodName: "CreateBytes", AppropriateKind: ipld.ReprKindSet_JustBytes, ActualKind: ipld.ReprKind_String}

package ipldfree

import (
	"fmt"
	"runtime"
	"testing"

	ipld "github.com/ipld/go-ipld-prime"
)

func BenchmarkJustString(b *testing.B) {
	var node ipld.Node
	for i := 0; i < b.N; i++ {
		node = String("boxme")
	}
	_ = node
}

func BenchmarkJustStringUse(b *testing.B) {
	var node ipld.Node
	for i := 0; i < b.N; i++ {
		node = String("boxme")
		s, err := node.AsString()
		_ = s
		_ = err
	}
}

func BenchmarkJustStringLogAllocs(b *testing.B) {
	memUsage := func(m1, m2 *runtime.MemStats) {
		fmt.Println(
			"Alloc:", m2.Alloc-m1.Alloc,
			"TotalAlloc:", m2.TotalAlloc-m1.TotalAlloc,
			"HeapAlloc:", m2.HeapAlloc-m1.HeapAlloc,
		)
	}
	var m1, m2 runtime.MemStats
	runtime.ReadMemStats(&m1)
	var node ipld.Node = String("boxme")
	runtime.ReadMemStats(&m2)
	memUsage(&m1, &m2)
	sinkNode = node // necessary to avoid clever elision.
}

var sinkNode ipld.Node