Commits · eb71617f4aeb7e73f36276f14e76e47fd90241ef · ld / go-ld-prime

29 Feb, 2020 5 commits

Merge-ignore branch 'amortizing-iterators'. · eb71617f

Eric Myhre authored Feb 29, 2020

See comments in that commit.  It's a good idea, but needs some further
refinement in order to really shine.

eb71617f

Attempt to amortize getting of an iterator. · 7c595d69

Eric Myhre authored Feb 29, 2020

It's possible, but this is a bit less trivial than it sounds...
so I think this is one of those diffs I'm going to *save*, but
via a "merge-ignore", and come back to the topic again later.

Here's where we were at before this diff:

```
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=0-8    3882726               308 ns/op             448 B/op          5 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=1-8    1273840               961 ns/op             464 B/op          6 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=2-8     690142              1785 ns/op             624 B/op          8 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=4-8     376281              3197 ns/op             944 B/op         11 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=8-8     197530              6084 ns/op            1584 B/op         16 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=16-8    102693             11534 ns/op            2864 B/op         25 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=32-8     54110             22601 ns/op            5424 B/op         42 allocs/op
```

(This is still a very sizable improvement over the node implementations on master;
about double the speed and a quarter of the allocs.  But still, could be improved?)

With this diff adding iterator amortization (and a bench loop with timer pauses in it):

```
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=0-8            1580808               778 ns/op             432 B/op          4 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=1-8             744278              1574 ns/op             432 B/op          4 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=2-8             468837              2529 ns/op             576 B/op          5 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=4-8             277303              4171 ns/op             864 B/op          6 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=8-8             172590              7187 ns/op            1440 B/op          7 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=16-8             93064             12494 ns/op            2592 B/op          8 allocs/op
BenchmarkSpec_Marshal_MapNStrMap3StrInt/n=32-8             52755             22671 ns/op            4896 B/op          9 allocs/op
```

So... this is kind of an unexpected result.  We got the allocs to go down,
as intended.  But speed didn't go up.  What gives?

A priori, I've been running with the ballpark number of "88ns per alloc"
(derived from some other observations on these packages today), which...
would lead me to compute an expected gain in speed of about `88*(42-9)`
given how many fewer allocations we see here... which in turn would lead
to an expectation of about 12.8% speed improvement.
Which... we don't see at all.  The speed is more or less unchanged.
Hrm.

I suspect that the reason for this is that having StartTimer/StopTimer
calls in the middle of your `b.N` loop just is not possible to do without
having serious skewing effects on results, and if we could clear away
the skew, we'd probably see the expected improvement.

But still, in light of this unclarity, I'm going to merge-ignore this.
We can come back to this optimization later.
After this reflection, I feel ~12% isn't worth chasing today,
especially if it's a complicated 12%, and especially since we already
have other much larger effects that are still in progress of getting
properly landed and into use on the master branch.

Another issue that deserves due consideration is how this change would
interact with concurrent use of datastructures.  Since nodes are
immutable, we *should* usually be able to expect that they can be read
across threads without issue.  This patch would break that: getting
two different iterators could be a race condition.  Fixing this is easy
(I'd use https://golang.org/pkg/sync/atomic/#CompareAndSwapUint32),
but raises the scope just a bit, and files this further as "not today".

Iterators might also deserve a "Close" method in order to make this
kind of optimization more accessible to the user; there's no reason it
has to be once-only.  (This would also just so happen to fix the reason
we needed StartTimer/StopTimer in the b.N loop, making measurement of
the improvement easier.)

Lots of things to consider when tackling this properly in the future!

7c595d69

Merge branch 'traversal-benchmarks' · 259c1f45
Eric Myhre authored Feb 29, 2020

259c1f45

Fix marshal to not have excessive token escape. · 55eebc40

Eric Myhre authored Feb 29, 2020

Sizable performance impact.

Recently added scalable benchmarks helped point this out.

(Arguably, certainly, I could've/should've seen this one earlier; it's
honestly pretty "duh".  But somehow it never got priority attention.
A benchmark is still awfully useful for directing attention!)

With this, the new node marshal is looking *really* snazzy.
Most of the allocations remaining there are just iterator creation...
and we have a plan to amortize those out of existence, too.
But I'm going to save that one for later.

55eebc40

Port scaled marshal tests to new node interfaces. · 33e2b330

Eric Myhre authored Feb 29, 2020

Marshal is 25% faster. Not bad. (There's also a perf bug in the codec
code which I'm about to fix which will change the speedup to *double*.)

Unmarshal is a bit faster, but not as much improved as marshal.
This is unsurprisingly because we chose not to amortize allocations
as much as we could in map and list creation because the size penalty
seemed too steep to make it a worthy trade. (This gives us *much*
better numbers to fuel further consideration of that trade, though.)

33e2b330

27 Feb, 2020 17 commits

Scalable marshal and unmarshal benchmarks. · 5acd08dd