Processor Intrinsics.

Do you want your software to run faster? Of course you do! Faster software gets more things done and takes less time. (If you’re on a low-power device, it probably uses less power too.)

Any cryptocurrency implementation has to perform some complicated calculations; that’s what the “crypto” in “cryptocurrency” means. I won’t get into all of the math, but these important calculations include things like performing elliptic curve cryptography (to generate wallet addresses from private keys), deriving cryptographically-secure hashes from other hashes (to calculate Merkle trees), and validating asymmetric cryptography to verify that data signed with a private key checks out against the corresponding public key (and that data locked with a public key can only be unlocked by the holder of the private key).

You can calculate all of this by hand, but it’ll take a while. It can take your computer a while too, even though your computer is specialized to do some of these calculations.

Imagine if your general-purpose computer had more special features to perform these calculations directly, instead of developers having to write similar code again and again.

Hold that thought.

What’s a Processor Intrinsic?

As usual, I’m going to simplify a lot for the sake of explanation. If you want to know more, see something like the Intel Intrinsics Guide.

Way back at the dawn of time, when computers were new and simple, they only knew how to do a few things. They could load data into memory, perform calculations on it, and store it back in memory.

You can do almost anything on a computer that has only a few functions like that, but it’s a lot of work.

Computers became more powerful, and they gained more features. Sure, you can simulate multiplication by adding repeatedly, and you can simulate exponentiation by multiplying repeatedly, but sometimes it’s faster and more powerful for the computer to do those operations for you.
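To make that concrete, here’s a minimal sketch in C++ (the function names are mine, purely for illustration) of the difference between simulating an operation and letting the hardware do it:

```cpp
#include <cstdint>

// Multiplication simulated by repeated addition: every loop iteration is a
// separate add, compare, and branch for the processor to execute.
uint64_t multiply_by_adding(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (uint64_t i = 0; i < b; ++i) {
        result += a;
    }
    return result;
}

// The same thing when the hardware multiplies for you: a single instruction.
uint64_t multiply_in_hardware(uint64_t a, uint64_t b) {
    return a * b;
}
```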

As more and more of these features get pushed down to the silicon, things get faster. What would take the computer a dozen operations to do if it were written in source code like C or C++ or Rust could be done in a single instruction if the processor supported the right operation.
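Here’s a rough sketch of what that looks like for SHA-256 on x86 with Intel’s SHA extensions. This isn’t anyone’s production code, just an illustration: the helper names are mine, real hashing code wraps the intrinsic in a loop over the full message schedule, and it only compiles with the right flags (for example -msha on GCC or Clang).

```cpp
#include <immintrin.h>  // x86 SHA extension intrinsics
#include <cstdint>

// In plain C++, even one small piece of SHA-256 (the "sigma0" function)
// takes several shifts, rotates, and xors, each a separate operation.
static inline uint32_t sigma0(uint32_t x) {
    return ((x >> 7) | (x << 25)) ^ ((x >> 18) | (x << 14)) ^ (x >> 3);
}

// With the SHA extensions, a single intrinsic advances the SHA-256 state by
// two whole rounds. state0 and state1 hold packed state words; msg_plus_k
// holds message words already added to the round constants.
static inline __m128i two_sha256_rounds(__m128i state0, __m128i state1,
                                        __m128i msg_plus_k) {
    return _mm_sha256rnds2_epu32(state1, state0, msg_plus_k);
}
```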

These features are called “intrinsics”, because they’re intrinsic to the CPU now.

Where’s a Processor Intrinsic?

As you might guess, the early Intel CPUs such as the 4004 or 8008 or the 486DX didn’t support intrinsics for things like SHA-2 hashing, partly because SHA-2 (published in 2001) didn’t exist then and partly because there wasn’t room or desire to make the processors complicated enough to support these features.

It’s interesting to compare the 486SX to the 486DX, because to my memory, the 486SX lacked the floating-point math hardware that the 486DX included, so specific math operations were a lot faster on the DX…

… in some cases.

When’s a Processor Intrinsic?

The code you compile has to know that it’s running on a processor that supports the intrinsic you want to use, and it has to be able to use that intrinsic when it’s compiled. In other words, if you’re writing C++ code that performs SHA-2 functions on some data, you won’t benefit from a processor that supports SHA-2 intrinsics unless you write code that uses those intrinsics, or the compiler detects that your code could benefit from those intrinsics and uses them for you.
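As a small sketch of what that compile-time requirement looks like with GCC or Clang on x86 (other compilers differ, and this isn’t the Core’s actual build setup):

```cpp
#include <cstdio>

// Build with:   g++ -O2 -msha example.cpp
// GCC and Clang define __SHA__ only when they're allowed to emit SHA
// instructions, so the intrinsic-using path simply doesn't exist in a
// binary built without -msha (or an equivalent -march= setting).
int main() {
#if defined(__SHA__)
    std::puts("built with SHA intrinsics enabled");
#else
    std::puts("built with the portable C++ fallback");
#endif
    return 0;
}
```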

This presents a problem for software compiled on one specific machine but intended to run in many places, like the Dogecoin Core.

When 1.14.6 comes out, if the code for Linux machines running on 64-bit hardware is compiled without SHA-2 intrinsic support, the Core will have to perform those hashing functions in C++ code, rather than in silicon. It’ll be slower. Even if you’re running the Core on a CPU that supports those intrinsics, it’ll be slower.

If the Core were compiled with support for those intrinsics, it wouldn’t run on CPUs without those intrinsics…

… unless we do something special.

How’s a Processor Intrinsic?

Look at PR 2104, a proposal by edtubbs to implement SHA and AES intrinsics. These algorithms are central to cryptographic implementations, and if this work is done in silicon instead of in C++ code, those operations could run at least five or six times faster.
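To give a flavor of what “in silicon” means here, this is a hedged illustration of the AES side, not code from the PR: one intrinsic performs an entire AES round that would otherwise be loops over the 16-byte state in C++.

```cpp
#include <wmmintrin.h>  // AES-NI intrinsics (compile with -maes)

// One full round of AES encryption in a single instruction: ShiftRows,
// SubBytes, MixColumns, and the round-key XOR all at once. Real code
// still has to manage the key schedule and loop over all the rounds.
static inline __m128i aes_encrypt_round(__m128i state, __m128i round_key) {
    return _mm_aesenc_si128(state, round_key);
}
```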

Look also at PR 2773, where Patrick makes a comment about “generic runtime capability detection” that’s worth a much deeper look.

I’ve already explained the problem with assuming, when we compile the code, that these intrinsics are or aren’t available on any given CPU where people might run it.

You can always download and compile the code yourself to optimize it for the specific platform where you want to run it, of course, but that’s not a practical option for everyone.

The better option is to detect whether the machine running the code supports these operations as it’s running, and to use the silicon-optimized paths if so and the normal C++ paths otherwise.
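On x86 with GCC or Clang, that detection can look something like the sketch below. The function names are mine, and the Core’s actual approach is exactly what’s being worked out in those PRs.

```cpp
#include <cpuid.h>  // __get_cpuid, __get_cpuid_count (GCC/Clang, x86 only)

// Ask the CPU at runtime whether it implements the SHA extensions
// (CPUID leaf 7, sub-leaf 0, EBX bit 29).
bool cpu_supports_sha() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx & (1u << 29)) != 0;
}

// The same question for AES-NI (CPUID leaf 1, ECX bit 25).
bool cpu_supports_aesni() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & (1u << 25)) != 0;
}
```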

This is a little complicated and will take some time to get right. There are risks: even though these algorithms are well studied and understood, there may be flaws in the implementation of either path. This will need a lot of testing. Runtime capability detection needs a lot of care and testing too. The last thing anyone wants is to crash a machine that can’t handle a feature that’s an optimization, not a requirement.
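One common way to keep that “optimization, not a requirement” guarantee is to pick the implementation once, at startup, and only ever call through that choice. Here’s a minimal skeleton, reusing the hypothetical cpu_supports_sha() check from above (function bodies elided):

```cpp
#include <cstddef>
#include <cstdint>

bool cpu_supports_sha();  // the runtime CPUID check sketched above

// Both paths share one signature. In practice the intrinsic version lives
// in a file compiled with the SHA flags while the rest of the program stays
// baseline, so no SHA instruction can leak into common code.
void sha256_transform_portable(uint32_t state[8], const uint8_t* data, size_t blocks);
void sha256_transform_shani(uint32_t state[8], const uint8_t* data, size_t blocks);

using Sha256Transform = void (*)(uint32_t[8], const uint8_t*, size_t);

// Decide once and call through the pointer ever after: a CPU that lacks
// the feature never executes a single SHA instruction.
Sha256Transform choose_sha256_transform() {
    return cpu_supports_sha() ? &sha256_transform_shani
                              : &sha256_transform_portable;
}

static const Sha256Transform g_sha256_transform = choose_sha256_transform();
```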

Yet this is the right thing to do.

When’s a Processor Intrinsic?

Hopefully in 1.14.6, maybe in 1.14.7. Almost definitely in 1.21. Keep your eyes open.

The good news is that this is already pretty well researched; Bitcoin and Litecoin do something similar. We have some benchmarks, and compilers support these optimizations. This work will pay off in reducing CPU usage and making transaction and block validation faster, but it’ll take some time.

In the meantime, watch those two PRs I linked to for discussion. They’ll need testing on various CPUs and architectures of various vintages, from the newest machines to that old laptop you have sitting in the corner.

The more testing we get, the better we can support these features for all types of hardware.