Attempt to amortize the allocation of iterators.
It's possible, but this is a bit less trivial than it sounds... so I think this is one of those diffs I'm going to *save*, but via a "merge-ignore", and come back to the topic again later.

Here's where we were at before this diff:

```
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=0-8    3882726      308 ns/op     448 B/op     5 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=1-8    1273840      961 ns/op     464 B/op     6 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=2-8     690142     1785 ns/op     624 B/op     8 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=4-8     376281     3197 ns/op     944 B/op    11 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=8-8     197530     6084 ns/op    1584 B/op    16 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=16-8    102693    11534 ns/op    2864 B/op    25 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=32-8     54110    22601 ns/op    5424 B/op    42 allocs/op
```

(This is still a very sizable improvement over the node implementations on master; about double the speed and a quarter of the allocs. But still, could be improved?)

With this diff adding iterator amortization (and a bench loop with timer pauses in it):

```
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=0-8    1580808      778 ns/op     432 B/op     4 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=1-8     744278     1574 ns/op     432 B/op     4 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=2-8     468837     2529 ns/op     576 B/op     5 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=4-8     277303     4171 ns/op     864 B/op     6 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=8-8     172590     7187 ns/op    1440 B/op     7 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=16-8     93064    12494 ns/op    2592 B/op     8 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=32-8     52755    22671 ns/op    4896 B/op     9 allocs/op
```

So... this is kind of an unexpected result. We got the allocs to go down, as intended. But speed didn't go up. What gives?

A priori, I've been running with the ballpark number of "88ns per alloc" (derived from some other observations on these packages today), which
would lead me to compute an expected speed gain of about `88*(42-9)`, roughly 2900 ns, given how many fewer allocations we see here; on the n=32 case, that works out to an expectation of about a 12.8% speed improvement. Which we don't see at all. The speed is more or less unchanged. Hrm.

I suspect the reason is that having StartTimer/StopTimer calls in the middle of your `b.N` loop simply cannot be done without seriously skewing the results, and that if we could clear away the skew, we'd probably see the expected improvement. But in light of this unclarity, I'm going to merge-ignore this. We can come back to this optimization later. On reflection, ~12% isn't worth chasing today, especially if it's a complicated 12%, and especially since we already have other, much larger improvements that are still in the process of being properly landed and put into use on the master branch.

Another issue that deserves due consideration is how this change would interact with concurrent use of data structures. Since nodes are immutable, we *should* usually be able to expect that they can be read across threads without issue. This patch would break that: getting two different iterators could be a race condition. Fixing this is easy (I'd use https://golang.org/pkg/sync/atomic/#CompareAndSwapUint32), but it raises the scope just a bit, which files this further under "not today".

Iterators might also deserve a "Close" method, to make this kind of optimization more accessible to the user; there's no reason it has to be once-only. (That would also, conveniently, remove the need for StartTimer/StopTimer in the `b.N` loop, making the improvement easier to measure.)

Lots of things to consider when tackling this properly in the future!
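For the record, the shape of the idea might look something like the following sketch. All names here (`mapNode`, `mapIter`, `iterClaimed`, and so on) are invented for illustration; this is not the actual patch. It combines the embedded-iterator amortization with the CompareAndSwapUint32 guard and the "Close" method mused about above:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// mapNode is a hypothetical immutable map node.  To amortize iterator
// allocation, it embeds one iterator inline; the first caller claims it
// via compare-and-swap, and any later (possibly concurrent) callers
// fall back to a fresh allocation, so cross-thread reads stay race-free.
type mapNode struct {
	keys []string
	vals []int

	iterClaimed uint32  // 0 = embedded iterator free, 1 = claimed
	iter        mapIter // embedded iterator, reused when unclaimed
}

type mapIter struct {
	n   *mapNode
	idx int
}

// Iterator hands out the embedded iterator if it is still unclaimed,
// using CompareAndSwapUint32 so two goroutines cannot both take it;
// otherwise it allocates a new one.
func (n *mapNode) Iterator() *mapIter {
	if atomic.CompareAndSwapUint32(&n.iterClaimed, 0, 1) {
		n.iter = mapIter{n: n}
		return &n.iter
	}
	return &mapIter{n: n}
}

// Next yields the next entry, or ok=false when exhausted.
func (it *mapIter) Next() (k string, v int, ok bool) {
	if it.idx >= len(it.n.keys) {
		return "", 0, false
	}
	k, v = it.n.keys[it.idx], it.n.vals[it.idx]
	it.idx++
	return k, v, true
}

// Close releases the embedded iterator for reuse, making the
// amortization repeatable rather than once-only.
func (it *mapIter) Close() {
	if it == &it.n.iter {
		it.idx = 0
		atomic.StoreUint32(&it.n.iterClaimed, 0)
	}
}

func main() {
	n := &mapNode{keys: []string{"a", "b"}, vals: []int{1, 2}}
	it := n.Iterator()
	for k, v, ok := it.Next(); ok; k, v, ok = it.Next() {
		fmt.Println(k, v)
	}
	it.Close()
}
```

The CAS makes the fast path lock-free while keeping the immutable node safe to share: the worst that happens under contention is that the loser of the race pays the ordinary allocation cost it would have paid anyway.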
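The suspect bench-loop pattern is roughly the following (a minimal sketch with invented names, not the actual benchmark): StopTimer/StartTimer inside the `b.N` loop keeps per-iteration setup out of the timing, but the pauses themselves are what I suspect of skewing the reported ns/op.

```go
package main

import (
	"fmt"
	"testing"
)

// BenchmarkWithTimerPauses demonstrates the pattern: each iteration
// needs fresh setup (here, a throwaway slice standing in for a rebuilt
// node), which we exclude from timing by pausing the benchmark clock.
func BenchmarkWithTimerPauses(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		data := make([]int, 1024) // setup that must not be timed
		b.StartTimer()
		sum := 0
		for _, v := range data {
			sum += v
		}
		_ = sum
	}
}

func main() {
	// testing.Benchmark lets us run the function outside `go test`.
	r := testing.Benchmark(BenchmarkWithTimerPauses)
	fmt.Println("iterations:", r.N)
}
```

The timer calls have real overhead and can interact with the benchmark framework's iteration-count estimation, which is exactly why a Close method (letting the iterator be reset without rebuilding the node each pass) would make the measurement cleaner.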