What are You Waiting For? (PR 2986).

Imagine with me.

You have a cool idea you want to program. You’re writing code to generate a really interesting piece of art, and it’ll take a year to complete. You can’t speed it up by throwing more computing power at it. There’s no way to deploy it across lots of laptops or rented servers or virtual machines or containers in the cloud.

You just have to wait 365.25 days, 8,766 hours, five hundred… (ah, never mind, I don’t like that song).

What if there’s a bug?

Before you get all upset like Frankie and Benjy, you should ask yourself “How can I prove to my own satisfaction that this thing can run successfully for a year and give me the results I want?”

Turns out this is difficult, for several reasons. One of them is time.

What is Time?

Time is nature’s way of ensuring that everything doesn’t happen all at once.

Some things are predictable, like the resonance frequency of Cesium atoms. Other things aren’t.

When I wrote about the unpredictability of testing node eviction, I suggested a rule of thumb: pick a number of seconds n such that we’re likely to have seen everything we want by then and unlikely to see spurious failures in that timespan. We might still get unlucky now and then, and we might have to use 35 seconds instead of 30, but a rule of thumb gets us most of the way there.

That’s one reason to control time: to estimate how long you have to wait before you can trust that what you expected to happen (or not happen) actually did (or didn’t).

Another reason to control time is to force the issue.

Spring Forward, Fall Back, Test the Eternal Now

What if you could control time?

I’m not suggesting time travel exists, other than the one we’re all doing.

I’m suggesting you lie to the computer.

This happens all the time in tests, where we usually call it Mocking. While testing without mocks is almost always the right way to go (hat tip to my colleague James Shore for the insight), occasionally proving that the right thing happens or doesn’t requires very, very precise control.

What if you’re writing code that tries to give multiple users a fair shot at a resource, like a roulette wheel where they win goofy little tokens they can redeem for snacks at the company canteen? You want everyone to have a fair chance, so you don’t want someone with programming skills hitting your machine a hundred times a second while everyone else can only click a couple of times a second.

As a junior developer, I click on the roulette wheel about 1.8 times per second. Patrick throws together a quick Python script that uses asynchronous IO and multiprocessing to click that wheel 36 times per second, and Michi runs curl in a shell script and shows us all up at 100 requests per second.

You might want to put some sort of rate limiting in place. Everyone gets two shots every second. That’s it. Anything else gets discarded. How do you test that?

LiE tO tHe CoMpUtEr

What if you could mock time? Tell the computer that right now it’s 6:06 pm on a lovely Saturday evening, and it’s going to stay 6:06 until you tell it otherwise.
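In code, the pattern looks something like this. (This is a loose Python sketch of the idea, not Dogecoin Core’s actual implementation, which is C++ and has its own names for everything.)

```python
import time

class MockableClock:
    """A sketch of the mock-time pattern, not Dogecoin Core's implementation.

    With no mock value set, report the real time. Once a mock value is set,
    report exactly that value until it changes or is cleared.
    """

    def __init__(self):
        self._mock_seconds = None  # None means "use the real clock"

    def set_mock_time(self, seconds):
        self._mock_seconds = seconds

    def clear_mock_time(self):
        self._mock_seconds = None

    def now(self):
        if self._mock_seconds is not None:
            return self._mock_seconds
        return int(time.time())


clock = MockableClock()
clock.set_mock_time(1_700_000_000)  # it is now, and will stay, exactly this moment
assert clock.now() == 1_700_000_000  # ...until we say otherwise
```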

That’s what Patrick’s PR does. Well, sort of. Dogecoin Core already had this behavior for seconds, but Patrick made it work for milliseconds. This is something you don’t normally want to do when running real code (and it only works on the regtest network chain anyhow), but it’s incredibly useful to test things quickly in very controlled circumstances.
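Which brings us back to the roulette wheel. With a clock we control, the rate-limit test never has to sleep or hammer a real server; it just declares what time it is. Everything in this sketch is invented for this post (Dogecoin Core contains no snack roulette, sadly), but the shape is the point:

```python
class TwoPerSecondLimiter:
    """Hypothetical limiter: each user gets two spins per second, no more."""

    def __init__(self, clock, limit=2):
        self.clock = clock     # injected: any callable that returns seconds
        self.limit = limit
        self.windows = {}      # user -> (current second, spins used)

    def allow(self, user):
        now = int(self.clock())
        start, count = self.windows.get(user, (now, 0))
        if start != now:       # a new second means a fresh allowance
            start, count = now, 0
        if count >= self.limit:
            return False
        self.windows[user] = (start, count + 1)
        return True


def test_limit_without_waiting():
    fake_now = [1_000_000.0]   # the "current time", entirely under our control
    limiter = TwoPerSecondLimiter(clock=lambda: fake_now[0])

    # Michi's 100 requests all land in the same mocked second: two get through.
    assert sum(limiter.allow("michi") for _ in range(100)) == 2

    # Advance the clock one second without sleeping; two more are allowed.
    fake_now[0] += 1.0
    assert limiter.allow("michi") and limiter.allow("michi")
    assert not limiter.allow("michi")

test_limit_without_waiting()
```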

For a more realistic example (no one’s going to make me spin a roulette wheel to get a little packet of jellybeans) that’s still completely hypothetical (I’m pretty sure this code doesn’t exist), let’s say we want to test that a Dogecoin node goes out and looks for other nodes about once a minute and never more often than that, so, on average, ten times in ten minutes.

If this feature existed, there’d be code somewhere in the network stack that said “Has it been more than 60 seconds since the last attempt to look for other nodes? If so, go searching!”

Testing that in the simplest, most straightforward way means waiting 10 minutes or so and checking that we saw about 10 attempts.

Testing that when we can mock time means we can bump the current timestamp in a very tight loop in the code and test 10 minutes in a few seconds, or 100 minutes in a few seconds, or 1000 minutes in not many seconds. We wait less time to get our results, and we get more chances to see randomness converge to an expected result because we’re trying more possibilities.
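Something like this, say. The names and numbers are made up to match the hypothetical above; this is a Python sketch, not real Dogecoin Core networking code:

```python
class PeerDiscovery:
    """Hypothetical feature: search for new peers at most once every 60 seconds."""

    def __init__(self, clock_ms):
        self.clock_ms = clock_ms
        self.last_attempt_ms = None
        self.attempts = 0

    def maybe_search(self):
        now = self.clock_ms()
        if self.last_attempt_ms is None or now - self.last_attempt_ms >= 60_000:
            self.last_attempt_ms = now
            self.attempts += 1   # a real node would go looking for peers here


def attempts_in(minutes):
    mock_ms = [0]
    discovery = PeerDiscovery(clock_ms=lambda: mock_ms[0])
    # Poke the node once per mocked second, bumping the timestamp in a tight
    # loop. No sleeping: a thousand simulated minutes finishes in a blink.
    for _ in range(minutes * 60):
        discovery.maybe_search()
        mock_ms[0] += 1_000
    return discovery.attempts

assert attempts_in(10) == 10
assert attempts_in(1000) == 1000
```

In this deterministic sketch the counts come out exact. Real code has jitter and randomness in it, which is where the next point comes in.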

With Great Precision Comes Great Possibility

That latter point is subtle and important. If we can only watch ten expected attempts, any little perturbation has a real chance of failing the test now and then: seeing 9 attempts instead of 10 is a 10% miss, and a test that has to tolerate that much slop isn’t telling us much. If we can watch 100 expected attempts, we can accept maybe 96 to 104 as the right range; a 4% tolerance is pretty good. If we can watch 1000, then maybe 980 to 1020 is a fine range, and a 2% tolerance is highly acceptable.
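To put a rough number on that intuition, here’s a toy simulation. The jitter model is invented; the only thing to take from it is how the relative spread shrinks as the mocked run gets longer:

```python
import random

def simulated_attempts(minutes, seed):
    """Toy model: the node aims for one search per minute, but each gap
    between searches jitters by up to +/- 5 seconds."""
    rng = random.Random(seed)
    elapsed, attempts = 0.0, 0
    while elapsed < minutes * 60:
        attempts += 1
        elapsed += 60 + rng.uniform(-5, 5)
    return attempts

for minutes in (10, 100, 1000):
    counts = [simulated_attempts(minutes, seed) for seed in range(500)]
    worst = max(abs(count - minutes) for count in counts)
    print(f"{minutes:4d} mocked minutes: {min(counts)}..{max(counts)} attempts "
          f"(worst miss {100 * worst / minutes:.1f}% of {minutes})")
```

The absolute wiggle grows a little as the run gets longer, but the relative spread collapses, and that’s the room we need to write assertions that are both strict and unlikely to flake.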

We have options. Finer control gives us more options.

I know the PR itself isn’t complex and doesn’t touch much code. That’s one of the things I like most about it. It doesn’t take much work to give us the power to do something very important that we need only in very rare circumstances.

That rare and small thing is very valuable though. It gives us higher confidence cheaply, when we use it well.