Commit 38aa8b7c authored by Eric Myhre

In which I reverse engineered "inline pointers".

As you can see from the directory name ("multihoisting"), when I began
exploring this topic, I didn't know they were called that.  :/
This thus turned out to be very much one of those occasions where
knowing the *right two words* to put into a search engine would've
saved a fairly enormous number of hours.  Once I finally found the
right term, some perfectly lovely documentation appeared.  But alas.

(Like the previous commits, this stuff is coming from a while ago;
roughly Oct 25.  It uncovers a lot of stuff that gets us much closer
to being able to make correct and performant designs which minimize
and amortize the number of allocations required to make our node
trees work in codegen (though with that said, there will still be
plenty of details in need of refinement after this, too).)

Still working on a "HACKME_memorylayout.md" doc, which will appear
in the codegen directories in the near future and contain a summary
of these learnings and how they balance against other design concerns.

Meanwhile, a couple of other notes in their rough form:

- basically, don't use non-pointer methods.
	- turns out value receivers tend to mean "copy it", and pointer receivers *don't* mean "heap alloc it" (they just mean "consider escape, maybe").
		- so overall you're tying the compiler's hands when you use a value receiver, and you're *not* when you use a pointer receiver.  (see the little sketch just after this list.)
- taking pointers to things that are already being passed around in pointer form, or that have already heap-escaped, seems to be free.
	- it might demand something become heap alloc if it wasn't already...
		- but this turns out to be fairly amortizable.
		- because if you write nearly everything to be pointers, then, there you go.
	- and if you're lucky and something is short-lived (provably doesn't escape), then even whole stacks of ptr-receiver methods will still probably inline and collapse to no heap action.
		- the analysis on this seems to reach much further than I thought it would.
		- `-gcflags '-m -m'` was extremely revealing on this point.
- tl;dr:
	- more pointers, not fewer, in all functions and passing;
	- but do have one struct that's the Place of Residence of the data without pointers.
	- this pair of choices probably leads to the best outcomes.
- hokay so.  applied to topic: two sets of embeds.
	- builders might as well have their own big slab of embed too.
		- it's logically nonsense to embed them, but it's incredibly rare that you're not gonna use the memory, so!
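
For example, a toy sketch (hypothetical types, not code from this repo) of
the receiver difference in a nutshell:

	type big struct{ a, b, c, d [4]uint64 }

	// value receiver: the call's semantics are "copy the whole struct in";
	// the compiler can sometimes elide that, but you've tied its hands.
	func (x big) firstVal() uint64 { return x.a[0] }

	// pointer receiver: no copy, and whether x lands on the heap at all is
	// left up to escape analysis ("consider escape, maybe").
	func (x *big) firstPtr() uint64 { return x.a[0] }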

And a couple of important incantations, which can help with understanding
What's Really Going On Here in a bunch of ways:

- `-gcflags '-S'` -- gives you assembler dump
- `-gcflags '-m'` -- gives you escape analysis (minimally informative and hard to read)
- `-gcflags '-m -m'` -- gives you radically more escape analysis, in stack-form (actually useful!!)
- `-gcflags '-l'` -- disables inlining!
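
(These are all flags to the compiler, passed through the build; an invocation
here looks something like `go test -gcflags '-m -m' -bench=. .`, and the flags
can be combined in one string, e.g. `-gcflags '-m -m -l'`, to see what escape
analysis reports once inlining is out of the picture.)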

I learned about the '-m -m' thing by grepping the Go compiler source,
incidentally.  It's a wildly under-documented feature.
No joke: I encountered it via doing a `grep "Debug['m']"` in go/src;
there is currently no mention of it in `go tool compile -d help`.
Once I found the magic string, and could submit it to search engines,
I started to find a few other blogs which mention it... but I'd
seen none of them (and had not found queries that turned them up)
until having this critical knowledge already in-hand.  >:I
So chalking up another score for "the right words would've been nice".

Performance work is fun!
parent 11d5d532
package multihoisting

import (
	"testing"
)

func BenchmarkCmpPtrs(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := hoisty{}
		e := returnE(&v.b)
		if e == &v.b.e {
			sink = e
		} else {
			b.Fail()
		}
		sink = e
	}
}

func BenchmarkCmpVal(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := hoisty{}
		e := returnE(&v.b)
		if *e == v.b.e {
			sink = e
		} else {
			b.Fail()
		}
		sink = e
	}
}

// not sure these are nailing it because no interfaces were harmed in the making
// also the sink is... gonna cost different things for these; need something else for the inside of the 'if'.
// ^ jk that last, actually. it does one alloc either way... just shuffles which line technically triggers it.
// the value compare *is* slower, but on the order of two nanoseconds.
package multihoisting

import (
	"testing"
)

var sink interface{}

type nest struct {
	a, b, c nest2
}

type nest2 struct {
	d nest3
}

type nest3 string

type fizz interface {
	buzz(string) fizz
}

func (x nest) buzz(k string) fizz {
	switch k {
	case "a":
		return x.a
	case "b":
		return x.b
	case "c":
		return x.c
	default:
		return nil
	}
}

func (x nest2) buzz(k string) fizz {
	switch k {
	case "d":
		return x.d
	default:
		return nil
	}
}

func (x nest3) buzz(k string) fizz {
	return x
}

// b.Log(unsafe.Sizeof(nest{}))
// These all three score:
// BenchmarkWot1-8 50000000 31.6 ns/op 16 B/op 1 allocs/op
// BenchmarkWot2-8 30000000 35.3 ns/op 16 B/op 1 allocs/op
// BenchmarkWot3-8 50000000 37.1 ns/op 16 B/op 1 allocs/op
//
// Don't entirely get it (namely the 16).
// Something fancy that's enabled by inlining, for sure.
// The one alloc makes enough sense: it's just what's going into 'sink'.
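// (Best guess at the 16: it's one string header's worth of payload getting
// boxed into the interface that ends up in sink -- nest2 and nest3 are each
// 16 bytes on amd64.)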
func BenchmarkWot1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := nest{}
		sink = v.buzz("a")
	}
}

func BenchmarkWot2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := nest{}
		fizzer := v.buzz("a")
		sink = fizzer.buzz("d")
	}
}

func BenchmarkWot3(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := nest{}
		fizzer := v.buzz("a")
		fizzer = fizzer.buzz("d")
		sink = fizzer.buzz(".")
	}
}

// This comes out pretty much like you'd expect:
// BenchmarkZot1-8 20000000 80.6 ns/op 64 B/op 2 allocs/op
// BenchmarkZot2-8 20000000 88.1 ns/op 64 B/op 2 allocs/op
// BenchmarkZot3-8 20000000 84.9 ns/op 64 B/op 2 allocs/op
//
// The `nest` type gets moved to the heap -- 1 alloc, 48 bytes.
// Then the `nest2` type also gets moved to the heap when returned -- 1 alloc, 16 bytes.
// Wait, where's the third one?  One of either 2 or 3 is getting magicked away, but which, and why?
// Is it because the addr of 3 is literally the same as the addr of 2?
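// (A guess at the missing one: converting a zero-value string into an interface
// doesn't have to allocate -- the runtime can hand back its shared zero value --
// so the empty nest3 conversion may be coming out free.)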
func BenchmarkZot1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := nest{}
		sink = buzzit(v, "a")
	}
}

func BenchmarkZot2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := nest{}
		fizzer := buzzit(v, "a")
		sink = buzzit(fizzer, "d")
	}
}

func BenchmarkZot3(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := nest{}
		fizzer := buzzit(v, "a")
		fizzer = buzzit(fizzer, "d")
		sink = buzzit(fizzer, ".")
	}
}

// This is a function that bamboozles inlining.
// (Note you need to take a ptr to it for that to work.)
// (FIXME DOC wait what??? the above is not true, why)
func buzzit(fizzer fizz, k string) fizz {
	return fizzer.buzz(k)
}

type wider struct {
	z, y, x nest
}

func (x wider) buzz(k string) fizz {
	switch k {
	case "z":
		return x.z
	case "y":
		return x.y
	case "x":
		return x.x
	default:
		return nil
	}
}

func BenchmarkZot4(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := wider{}
		fizzer := buzzit(v, "z")
		fizzer = buzzit(fizzer, "a")
		fizzer = buzzit(fizzer, "d")
		sink = buzzit(fizzer, ".")
	}
}

// So the big question is, can we get multiple heap pointers out of a single move,
// and is that something we can do with a choice in one place in advance
// (rather than requiring different code in a fractal of use sites to agree)?
func BenchmarkMultiStartingStack(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := wider{}
		fizzer1 := buzzit(v, "z")
		fizzer2 := buzzit(fizzer1, "a")
		fizzer2 = buzzit(fizzer2, "d")
		sink = buzzit(fizzer2, ".")
		fizzer2 = buzzit(fizzer1, "b")
		fizzer2 = buzzit(fizzer2, "d")
		sink = buzzit(fizzer2, ".")
	}
}

func BenchmarkMultiStartingHeap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := &wider{}
		fizzer1 := buzzit(v, "z")
		fizzer2 := buzzit(fizzer1, "a")
		fizzer2 = buzzit(fizzer2, "d")
		sink = buzzit(fizzer2, ".")
		fizzer2 = buzzit(fizzer1, "b")
		fizzer2 = buzzit(fizzer2, "d")
		sink = buzzit(fizzer2, ".")
	}
}
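
// escape hands back one of several interior pointers into root, chosen by
// branching, boxed in an interface -- presumably a test case for how far the
// escape analyzer can see.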
func escape(root *nest) fizz {
	confound := 8
	var fizzer fizz
	fizzer = root
	if confound%2 == 0 {
		fizzer = &root.a
	}
	if confound%4 == 0 {
		fizzer = &root.b
	}
	if confound%8 == 0 {
		fizzer = &root.c
	}
	return fizzer
}

package multihoisting

import (
	"testing"
)

type hoisty struct {
	a, b, c hoisty2
}

type hoisty2 struct {
	d, e hoisty3
}

type hoisty3 struct {
	f hoisty4
}

type hoisty4 string

func BenchmarkHoist(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := hoisty{}
		sink = returnE(&v.a)
		sink = returnE(&v.b)
		sink = returnE(&v.c)
	}
}

func BenchmarkHoist2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := &hoisty{}
		sink = returnE(&v.a)
		sink = returnE(&v.b)
		sink = returnE(&v.c)
	}
}

func returnE(x *hoisty2) *hoisty3 {
	return &x.e
}

// okay, am now confident the above two are the same. escape analysis rocks em.
func BenchmarkHoistBranched(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := &hoisty{}
		sink = returnE(&v.a)
		oi := returnE(&v.b)
		sink = returnE(&v.c)
		sink = returnF(oi)
	}
}

func returnF(x *hoisty3) *hoisty4 {
	return &x.f
}

// now let's see if interfaces make this worse, somehow.
// (i'm *hoping* all the buzz methods being pointery will fail escape clearly,
// and then returning more pointers from mid already-deffo-heap structures will be cheap.)
func BenchmarkHoistBranchedInterfacey(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := &hoisty{}
		sink = v.buzz("a")
		oi := v.buzz("b")
		sink = v.buzz("c")
		sink = oi.buzz("f")
	}
}

func (x *hoisty) buzz(k string) fizz {
	switch k {
	case "a":
		return &x.a
	case "b":
		return &x.b
	case "c":
		return &x.c
	default:
		return nil
	}
}

func (x *hoisty2) buzz(k string) fizz {
	switch k {
	case "d":
		return &x.d
	case "e":
		return &x.e
	default:
		return nil
	}
}

func (x *hoisty3) buzz(k string) fizz {
	switch k {
	case "f":
		return &x.f
	default:
		return nil
	}
}

func (x *hoisty4) buzz(k string) fizz {
	return x
}

// also important: can I assign into this?
// ja. it's fine. no additional allocs, with or without inlining.
func BenchmarkHoistAssign(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := &hoisty{}
		oi := returnE(&v.b)
		oi.f = "yoi"
		sink = returnF(oi)
	}
}

package multihoisting

import (
	"fmt"
	"runtime"
	"testing"
)

// not sure how to test this
// benchmem is going to report allocation costs... which we know.
// the question is if the memory usage goes *down* after full gc.
// okay here we go: `runtime.ReadMemStats` has consistency forcers.
func init() {
	runtime.GOMAXPROCS(1)
}

func BenchmarkReextentingGC(b *testing.B) {
	memUsage := func(m1, m2 *runtime.MemStats) {
		fmt.Println(
			"Alloc:", m2.Alloc-m1.Alloc,
			"TotalAlloc:", m2.TotalAlloc-m1.TotalAlloc,
			"HeapAlloc:", m2.HeapAlloc-m1.HeapAlloc,
			"Mallocs:", m2.Mallocs-m1.Mallocs,
			"Frees:", m2.Frees-m1.Frees,
		)
	}
	var m1, m2, m3, m4, m5 runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&m1)
	sink1 = &hoisty{}
	runtime.GC()
	runtime.ReadMemStats(&m2)
	sink2 = &sink1.b
	runtime.GC()
	runtime.ReadMemStats(&m3)
	sink1 = nil
	runtime.GC()
	runtime.ReadMemStats(&m4)
	sink2 = nil
	runtime.GC()
	runtime.ReadMemStats(&m5)
	fmt.Println("first extent size, size to get ref inside, size after dropping enclosing, size after dropping both")
	memUsage(&m1, &m2)
	memUsage(&m1, &m3)
	memUsage(&m1, &m4)
	memUsage(&m1, &m5)
}

var sink1 *hoisty
var sink2 *hoisty2
// results:
// yeah, one giant alloc occurs in the first move.
// subsequent pointer-getting causes no new memory usage.
// nilling the top level pointer does *not* let *anything* be collected.
// nilling both allows collection of the whole thing.
// so in practice:
// we can use internal pointers heavily without consequence in alloc count...
// but it's desirable not to combine items with different lifetimes into
// a single extent in memory, because the longest living thing will extend
// the life of everything it was allocated with.
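
// To make that takeaway concrete, a toy illustration (hypothetical names, not
// used by the benchmarks above): holding only an interior pointer to bundle.meta
// still keeps the whole payload array live, because collection happens per
// allocation, not per field.
type bundle struct {
	meta    hoisty4
	payload [1 << 20]byte
}

var keepMeta *hoisty4

func retainMetaOnly() {
	b := &bundle{}     // one allocation covering both meta and payload
	keepMeta = &b.meta // interior pointer: no new alloc...
	// ...but as long as keepMeta is set, the full extent stays reachable.
}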