It’s been a long slog getting PR 2698 ready to merge, and it’s almost ready to go. There’s a lot to write about the whole thing, but I want to focus on one aspect right now: time.
No, not that I originally opened the PR five and a half months ago. Who’s counting, anyway? It’s that time is the enemy of certainty.
Works on My Machine
“Shibes should be able to participate with a wide range of hardware, software, and contexts.” This is an explicit Dogecoin design and implementation goal. It shouldn’t matter if you can afford a shipping container full of GPU-enhanced server blades, a rack of ASICs in your basement, or a little NUC plugged into a corner of your bedroom.
You should be able to participate in the network and the community, and to control your coins and your transactions, with as small or as large an investment of time, resources, hardware, and money as you decide.
This is a great goal, but it’s not free; it puts the onus on developers to ensure that what we create and test and document and design and release meets the goals for this wide range of people and circumstances.
I originally thought this feature would be great for smaller nodes running with bandwidth constraints, maybe an old Windows laptop tucked under your desk or a Raspberry Pi hooked up to a solar panel and a 3G network. It still might be great for those devices, but how do we know this feature will work?
What is this feature supposed to do? On its face, something simple: limit the number of connections to other nodes to a number you choose. This was always possible, but the nuance is that this setting could only be configured at startup. In that case, the node would never accept new connections if all of the available slots were full.
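If that sounds abstract, here’s a sketch of the difference in the test suite’s language, Python. The -maxconnections startup option is real; setmaxconnections is just a placeholder name I’m using for the new RPC (check the PR for what it’s actually called), and node stands for an RPC handle to a running node, the kind the test framework hands out.

```python
# A sketch of the difference, not the PR's actual code. "node" stands for
# an RPC handle to a running Dogecoin Core node; "setmaxconnections" is a
# placeholder name for the new RPC, not necessarily what the PR calls it.

def lower_connection_limit(node, new_limit):
    # Before this PR, the ceiling came only from the -maxconnections
    # startup option, so changing it meant restarting the node.
    # With this PR, the ceiling can be adjusted while the node runs:
    node.setmaxconnections(new_limit)       # hypothetical RPC
    # getconnectioncount is an existing RPC; what it reports right after
    # the call is the interesting question.
    return node.getconnectioncount()
```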
Here’s where it gets tricky. What if you lower the number of allowed connections as your node is running?
How long should you wait to see the number of connections hit the limit? How can we prove it?
The answer is “we need to test it”, which seems obvious, but the details are anything but.
You’re One in a Thousand
As Patrick wrote in the May 7, 2022 Development Roundup, he’s running a test case in a loop a thousand times to see what happens.
The reason is interesting, if you’ve never had to debug something like this, and all too frustrating if you have.
It’s infeasible to set up an entire cluster of nodes, completely under our control, and even that might not give us the confidence we need that this feature works as intended.
Fortunately, the mininode.py test framework automates launching, connecting to, and controlling a bunch of core nodes from a test process. That makes it easy to launch one prime node, add a bunch of connections, and then send this new RPC command to raise or lower the maximum connection count.
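Here’s roughly what that looks like. This is a sketch from memory of the 0.14-era test framework, not the actual test in the PR; to keep it short I wire real regtest nodes to the prime node with the framework’s helpers rather than using mininode’s fake peers, and setmaxconnections is still my placeholder name for the new RPC. start_nodes, connect_nodes_bi, and getconnectioncount are real.

```python
#!/usr/bin/env python3
# Illustrative sketch only; the real test in the PR is the authority.
from test_framework.test_framework import BitcoinTestFramework
from test_framework.util import start_nodes, connect_nodes_bi

class MaxConnectionsTest(BitcoinTestFramework):
    def __init__(self):
        super().__init__()
        self.setup_clean_chain = True
        self.num_nodes = 5              # one prime node plus a few peers

    def setup_network(self):
        self.nodes = start_nodes(self.num_nodes, self.options.tmpdir)
        # Wire every helper node to node 0, the prime node under test.
        for i in range(1, self.num_nodes):
            connect_nodes_bi(self.nodes, 0, i)
        self.is_network_split = False
        self.sync_all()

    def run_test(self):
        prime = self.nodes[0]
        assert prime.getconnectioncount() > 2   # we start over the new limit
        # Lower the ceiling below the current connection count...
        prime.setmaxconnections(2)              # hypothetical RPC name
        # ...and now the hard question this post is about: how long do we
        # wait before checking getconnectioncount() again, and what counts
        # as success?

if __name__ == '__main__':
    MaxConnectionsTest().main()
```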
Easy, right?
On my desktop machine (it’s a couple of years old, but pretty good in terms of CPU and RAM) or my laptop (also not the shiniest, newest thing, but plenty good), these tests almost always run as expected. Sure, it takes 35 to 45 seconds to run through a couple of scenarios, but it’s obvious when things work and when they don’t.
I’m also checking my email and reading a web page or two while these tests are running. That’s good. It simulates real-world conditions, right?
Yes and no.
An error that happens one in a thousand times on one of my machines is an error that could happen one in a hundred times on a smaller machine or one in ten times on a tiny machine.
Why?
Everything Isn’t Everywhere All At Once
The real world is messy. Nodes can come and go. The air conditioner and dishwasher and blender can all be running at once and flip a circuit breaker, and that Windows laptop in the corner of the desk could go offline. There goes one node.
A backhoe could sever a fiber line. There goes a neighborhood of nodes.
Godzilla could step on a data center. Who knows what that would even do?
Even in that hypothetical cluster of a thousand machines or this test case simulating several dozen nodes, we can’t predict what will happen where and when.
My desktop computer is juggling several processes, including all of the nodes I’ve launched. If Firefox suddenly needs to run garbage collection (or I accidentally click on a video on YouTube and it starts playing), that’s going to change what happens during the test. Maybe that’ll steal resources from one node and I’ll get different test results.
If you look at a failing continuous integration run for this PR, you’ll see that one test failed. That test launches a prime node, connects a few other nodes to it, manipulates the maximum connection count, then waits to see if enough nodes get disconnected that the current connection count is under the limit.
(If I haven’t explained it before: continuous integration is a system that rebuilds the software and runs the entire automated test suite on multiple platforms every time someone adds a new change to a PR.)
This test passed for me, eventually. By that I mean I had to write and revise and rewrite and rethink it several times before I could figure out the right balance between what I wanted it to do and what it actually did.
I could write a test that waits 10 minutes or 30 or 90 for nodes to disconnect, but could I be certain that shows what I want it to show? Do I want to increase the testing time for everyone by that much, just to prove the efficacy of a small feature that not everyone will use?
What do I really want to prove anyway?
The Real World is Indeterministic
Some people believe that if we could predict with full accuracy the path and velocity of every particle in the universe, there would be no secrets; we’d know everything that was ever going to happen. (Me, I think Spinoza’s classmates bullied him.)
There are two problems with this. First, we can’t. Second, even if we could, it’d be prohibitively expensive.
Even with Microsoft’s gigabucks backing GitHub and its continuous integration machines, I can’t predict where these tests will run when I push a change. I don’t know how many other tests are running on the same machine. I don’t know if there’s a brownout in the data center, or someone’s about to trip over a power cable, or if someone managed to figure out how to mine Bitcoin in a test process and is eating up all the CPU my tests wanted, too.
This is a good thing, even though it’s a frustrating thing.
As Patrick wrote in the development roundup, the tests were passing in CI until they weren’t.
A problem that happens once in a thousand times is still a problem.
If we’d merged and deployed the code as it is, when would that test fail again? Who would it fail for? Where would it fail, and how would we debug it? Would we remember? (I am looking forward to forgetting!)
The balance we’re looking for is confidence that the code is behaving as we expect: it will make its best effort to reduce the number of connections to the requested threshold, but it will not make any guarantees about when that happens.
As far as I can tell, the best we can do is say “We’re going to wait n seconds in the test and see if everything disconnected”. Maybe it shouldn’t be everything. Maybe it should be enough things. Maybe we don’t know enough.
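In code, the shape of that check is a polling loop with a deadline. This is a sketch of the pattern, not the PR’s actual test code: getconnectioncount is an existing RPC, while the helper name, the timeout, and the polling interval are all mine.

```python
import time

def wait_for_connection_ceiling(node, limit, timeout=60):
    """Poll until the node's connection count drops to the limit, or give up.

    Illustrative only: getconnectioncount is a real RPC, but this helper and
    its numbers are not the PR's actual test code.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if node.getconnectioncount() <= limit:
            return True
        time.sleep(0.5)
    return False
```

Every number in there is a judgment call: the timeout, the polling interval, even the comparison itself. Is “at or under the limit” the right definition of success, or is “fewer than before” enough?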
That’s the interesting part of all of this work. What kind of questions can we ask that give us enough confidence that the system is working as intended in as many places as possible?
If we can answer that question as often as possible and as cheaply as necessary, well, that is a sign of a mature and disciplined engineering process.
After all, it’s not enough that this software works on as many machines for as many shibes in as many situations as possible. It also has to work well. Getting there isn’t always fast, but getting there is worth it.