A couple of years ago, I began my journey to dump a secure device's firmware. I spent an ungodly amount of time attempting to find a vulnerability, to no avail. With newfound time from not working skydiving events this season, I decided it was time to take off the gloves, learn a new skill, and use a Side Channel Attack (SCA) on the target I've been wanting to exploit for years.
My purpose here is to shed some light on the practical implementation of the attack, including where I went wrong, and the steps that ultimately allowed me to determine the AES 128-bit key. I'm still learning, but I feel documenting my journey will be invaluable for those on the same quest.
Since cryptography is extremely math intensive, it always appears more cryptic than necessary to outsiders. Although I have my BSEE, I find myself needing 100% focus when reading encryption literature, and frankly, most of the time, I don't have the mindset. In this article, I'll focus on the important concepts one needs to know about AES to extract the encryption key via a Side Channel Attack.
This article assumes you have some knowledge about Side Channel Attacks and are planning on becoming familiar with the Chipwhisperer Hardware or similar platforms.
Quick overview of AES necessary for implementing a Side Channel Attack
There are some really good videos and write-ups that break down the various steps involved in AES. However, from a side channel attack standpoint, we really don’t care too much about the fine details of these steps.
In the following attack I’m going to focus on AES-CBC with a 128-bit (16-byte) key and an initialization vector (IV) of zeros. Don't worry if you don't know what all this means yet, as it will come as you learn AES implementations. Nonetheless, this configuration is fairly common with bootloaders, so it's a good place to start. The Chipwhisperer suite has details on extending the attack if a non-zero IV exists or the keys are longer.
For a Correlation Power Analysis (CPA) attack, we are going to focus on the s-box substitution, or in our decryption case, the inverse s-box substitution and XOR step. Guess what, this is just a fancy name for a specific lookup table. Since we all code, this step simply looks like (byte sized operations):
It's that simple. The next step just XORs this value with the key.
Now we get the result that we're planning on attacking with our CPA.
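As a minimal, runnable Python sketch of the two steps just described (a table lookup followed by an XOR): the names `build_inv_sbox` and `first_round_byte` are made up for illustration, and the table is generated from the AES field arithmetic rather than pasted in. It follows the order of operations given in this article. The FIPS-197 inverse cipher XORs the round key into the input before the inverse S-box lookup, so which order your target uses is an assumption worth checking.

```python
def build_inv_sbox():
    """Generate the AES inverse S-box from GF(2^8) arithmetic."""
    exp, log = [0] * 256, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        doubled = x << 1
        if doubled & 0x100:
            doubled ^= 0x11B              # reduce by the AES polynomial
        x ^= doubled                      # x = x * 3 in GF(2^8)
    inv_sbox = [0] * 256
    for a in range(256):
        inv = 0 if a == 0 else exp[(255 - log[a]) % 255]
        s = inv
        for shift in range(1, 5):         # affine step: XOR of four rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        inv_sbox[s ^ 0x63] = a            # forward S-box maps a -> s ^ 0x63
    return inv_sbox


INV_SBOX = build_inv_sbox()


def first_round_byte(input_byte, round10_key_byte):
    """Lookup, then XOR, in the order described in this article."""
    tmp = INV_SBOX[input_byte]            # the "fancy lookup table"
    return tmp ^ round10_key_byte         # XOR with the round key byte


print(hex(first_round_byte(0x00, 0xB4)))
```

The whole operation works on one byte at a time, which is why each key byte can be attacked independently with only 256 guesses.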
When you dive into AES, the key scheduling step is often not explained in detail, as it's handled automatically in most code bases. Before the first byte is ever decrypted, the encryption key is expanded into a series of round keys. For a 128-bit key, the key gets manipulated 10 times; you'll see this written as 10 rounds of the key schedule in the literature. Luckily, this key schedule is reversible. This means that if we determine the key for a particular round, we can easily compute the original key that round key was derived from.
For instance, if our key were all zeros, here is how the key schedule would transpire for each round.
During decryption, the last round of the key schedule is the first round used. With a 128-bit key, the decryption algorithm will first use the round 10 key. For the first byte it would look like:
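The schedule for the all-zero key can be regenerated with a short Python sketch of the standard AES-128 key expansion. The function names (`build_sbox`, `next_round_key`, `expand_key`) are illustrative, not taken from any particular library, and the S-box is computed from the field arithmetic instead of being hard-coded.

```python
def build_sbox():
    """Generate the AES forward S-box from GF(2^8) arithmetic."""
    exp, log = [0] * 256, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        doubled = x << 1
        if doubled & 0x100:
            doubled ^= 0x11B              # reduce by the AES polynomial
        x ^= doubled                      # x = x * 3 in GF(2^8)
    sbox = [0] * 256
    for a in range(256):
        inv = 0 if a == 0 else exp[(255 - log[a]) % 255]
        s = inv
        for shift in range(1, 5):         # affine step: XOR of four rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox[a] = s ^ 0x63
    return sbox


SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def next_round_key(prev, rcon):
    """Derive one 16-byte round key from the previous one."""
    t = prev[12:16]
    t = [SBOX[b] for b in t[1:] + t[:1]]  # RotWord, then SubWord
    t[0] ^= rcon
    out = []
    for i in range(4):                    # each word chains off the last
        t = [prev[4 * i + j] ^ t[j] for j in range(4)]
        out += t
    return out


def expand_key(key):
    """Return the 11 round keys (round 0 is the key itself)."""
    keys = [list(key)]
    for rcon in RCON:
        keys.append(next_round_key(keys[-1], rcon))
    return keys


for rnd, round_key in enumerate(expand_key(bytes(16))):
    print(f"round {rnd:2d}: {bytes(round_key).hex()}")
```

Swapping `bytes(16)` for any other 16-byte key prints that key's schedule instead.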
tmp = inverse_sbox_table[ input_buffer[ 0 ] ]
result = tmp ^ key_schedule_round_10[ 0 ]
In the above all-zero key example, byte 0 of round 10 would be 0xB4, i.e. key_schedule_round_10[ 0 ] = 0xB4
Attacking the Victim
The attack is broken down into two stages.
- The capture stage
- The analysis stage
Since the bootloader needs to communicate with an outside interface to get its input buffer, we are in control of that part of the equation. So yes, we are going to send random data into the victim using its update protocol and trace how it reacts. At first you may be worried about overwriting the target. In practice, nearly all bootloaders perform some sort of CRC check after decryption, and it is mathematically improbable that the check would pass with decrypted random data.
Some bootloaders front-load the CRC in cleartext in order to accept the data packet. This style of bootloader will, at minimum, always set a flag once the update is complete, so as long as we program only the first block, we will simply remain in the bootloader on each iteration.
Our particular target encrypted the header containing the CRC and destination row with the row's data. It always sent 0x90 bytes with a program row size of 0x80 so we knew 0x10 bytes consisted of a header.
You may ask why they added 16 bytes when 4 CRC bytes and 2 offset sector bytes would have worked. Remember that AES works on 16 bytes at a time, so the payload will almost always be 16-byte aligned.
As we learned from many other SCA resources, we can place an inline “shunt” resistor (R9) to capture the current waveform the microcontroller draws during decryption. For the ATSAMD21, we measured the voltage across VDDCORE_SENSE to determine the current draw at that instant.
We then log this waveform along with the input data to create a single power trace data set. After doing this thousands of times, we have enough data to run our analysis and hopefully compromise the victim's keys.
To simplify the attack theory (and leaving out the mechanics):
For each point in our power trace we ask a question. If the byte in key_schedule_round_10 was this value, would the power usage at this particular point correlate with this assumption? By aligning our traces, and performing this analysis on all our captured traces, for each key location, we find a certain value and time where the correlation is extremely high (a value whose absolute value is approaching 1).
An example of the correlation of x and y for various distributions of (x,y) pairs (Denis Boigelot)
Surprisingly quickly, the point in time and value where the key schedule value is being written or read from the microcontroller's bus becomes apparent.
Colin O'Flynn talks about correlation in his Introduction to Side-Channel Power Analysis video at 45:50, though the whole video is well worth watching.
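The correlation question above can be tried out on simulated data with a small, pure-Python sketch. Several things here are assumptions rather than details of my target: the leakage is modeled as the Hamming weight of `INV_SBOX[input ^ guess]` (the first intermediate of the FIPS-197 inverse cipher, chosen because it is non-linear in the key guess), the traces are synthetic with Gaussian noise, and every function name is hypothetical. On a real capture you would replace `simulate_traces` with your logged inputs and waveforms.

```python
import math
import random


def build_inv_sbox():
    """Generate the AES inverse S-box from GF(2^8) arithmetic."""
    exp, log = [0] * 256, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        doubled = x << 1
        if doubled & 0x100:
            doubled ^= 0x11B              # reduce by the AES polynomial
        x ^= doubled                      # x = x * 3 in GF(2^8)
    inv_sbox = [0] * 256
    for a in range(256):
        inv = 0 if a == 0 else exp[(255 - log[a]) % 255]
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        inv_sbox[s ^ 0x63] = a
    return inv_sbox


INV_SBOX = build_inv_sbox()
HW = [bin(v).count("1") for v in range(256)]   # Hamming weight table


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def simulate_traces(key_byte, n_traces, n_samples, leak_at, rng):
    """Synthetic traces: noise everywhere, plus leakage at one sample."""
    inputs, traces = [], []
    for _ in range(n_traces):
        data = rng.randrange(256)
        trace = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        trace[leak_at] += HW[INV_SBOX[data ^ key_byte]]
        inputs.append(data)
        traces.append(trace)
    return inputs, traces


def cpa_attack(inputs, traces):
    """Return (best_guess, best_sample, |correlation|) for one key byte."""
    columns = [[t[s] for t in traces] for s in range(len(traces[0]))]
    best = (0.0, None, None)
    for guess in range(256):
        hypothesis = [HW[INV_SBOX[d ^ guess]] for d in inputs]
        for sample, column in enumerate(columns):
            r = abs(pearson(hypothesis, column))
            if r > best[0]:
                best = (r, guess, sample)
    return best[1], best[2], best[0]


if __name__ == "__main__":
    rng = random.Random(1)
    inputs, traces = simulate_traces(0xB4, 200, 12, 7, rng)
    guess, sample, corr = cpa_attack(inputs, traces)
    print(f"guess=0x{guess:02X} sample={sample} |r|={corr:.2f}")
```

The attack returns both the winning guess and the sample index where it peaked, which mirrors the "value and time" pair described above.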
The only curve ball is that the order of operations is not byte 0, 1, ... 15. Because of the inverse shift rows step, the bytes are not processed in order. However, because CPA is correlation based, the order does not matter, and each key byte and its location in the key schedule are easily identified. Since it helps when debugging our attack, the byte order will be:
Byte number 0, 13, 10, 7 | 4, 1, 14, 11 | 8, 5, 2, 15 | 12, 9, 6, 3
If you look at the mix columns step, it processes four bytes at a time, hence the '|' pipe notation.
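The byte order above falls straight out of the inverse shift rows step, as this short Python sketch shows. It assumes the usual AES state layout (byte index = row + 4 × column); the function name is made up for illustration.

```python
def inv_shift_rows_order():
    """Input byte index that lands in each output slot after InvShiftRows."""
    order = []
    for col in range(4):                  # mix columns consumes one column at a time
        for row in range(4):
            src_col = (col - row) % 4     # row r was rotated right by r columns
            order.append(row + 4 * src_col)
    return order


order = inv_shift_rows_order()
groups = [order[i:i + 4] for i in range(0, 16, 4)]
print(" | ".join(", ".join(str(b) for b in g) for g in groups))
```

Each group of four is the set of bytes feeding one mix columns operation.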
I should note, as I got held back with this fundamental for a couple days, we are not necessarily focusing on the exact time the "result" is written into the register. Our attack should have a broad view on every time this byte is being used for the first round of decryption. Remember it is not only written into the register when it is first calculated, but also read from the register when it is used for the following decryption step. In my particular attack, when the algorithm used these values down the decryption pipeline, the power correlation became much more apparent than when the values were first calculated.
You may be wondering why we don't just set our trigger to when the round 0 key is being used and do the CPA there. The reason is that we no longer know the input_buffer at that point, as it has already been manipulated 9 times.
Lastly, remember that we have found the round 10 key. You can then use the Chipwhisperer software, or this C code, to determine the key for round 0.
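For reference, walking a round 10 key back to round 0 can be sketched in Python as follows. This is a generic inversion of the AES-128 key schedule with hypothetical function names, not the Chipwhisperer routine or the C code mentioned above.

```python
def build_sbox():
    """Generate the AES forward S-box from GF(2^8) arithmetic."""
    exp, log = [0] * 256, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        doubled = x << 1
        if doubled & 0x100:
            doubled ^= 0x11B              # reduce by the AES polynomial
        x ^= doubled                      # x = x * 3 in GF(2^8)
    sbox = [0] * 256
    for a in range(256):
        inv = 0 if a == 0 else exp[(255 - log[a]) % 255]
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox[a] = s ^ 0x63
    return sbox


SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def previous_round_key(cur, rcon):
    """Undo one key schedule round: words 3..1 first, then word 0."""
    w = [cur[4 * i:4 * i + 4] for i in range(4)]
    p = [None] * 4
    for i in (3, 2, 1):                   # w[i] = p[i] ^ w[i-1]
        p[i] = [a ^ b for a, b in zip(w[i], w[i - 1])]
    g = [SBOX[b] for b in p[3][1:] + p[3][:1]]   # RotWord, then SubWord
    g[0] ^= rcon
    p[0] = [a ^ b for a, b in zip(w[0], g)]      # w[0] = p[0] ^ g(p[3])
    return p[0] + p[1] + p[2] + p[3]


def recover_original_key(round10_key):
    """Reverse all 10 rounds to get the round 0 (original) key."""
    key = list(round10_key)
    for rcon in reversed(RCON):
        key = previous_round_key(key, rcon)
    return bytes(key)


round10 = bytes.fromhex("b4ef5bcb3e92e21123e951cf6f8f188e")
print(recover_original_key(round10).hex())
```

The round 10 key in the usage line is the one from the all-zero key example, so the recovered key comes back as sixteen zero bytes.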
The below power trace is that of the first round of AES128 on our victim.
Pink: i2c clock from the host used to communicate with bootloader
Yellow: Trigger from host
Green: Power trace
Where the 'yellow' trigger asserts is where the result from the above code is calculated. This is followed by the 4 mix columns calculations used for the first round. Originally, I attacked the location where the result was calculated, but found the mix columns usage of the result to produce a much higher correlation.
Having an internal VddCore regulator without an external shutdown pin makes the attack a bit more difficult. The easiest way to work around this is to overdrive the supply on the ‘supply’ side of the resistor, not the side closest to the victim. For our device we found around 1.52 V to be sufficient. Start at the existing VddCore voltage and increase it one millivolt at a time until the signal becomes apparent.
VddCore internally supplied
Above you can see the Vddcore regulator switching frequency mixed with the core's current consumption.
VddCore externally over-driven
When we overdrive the VddCore, we now only see the current consumption of the core. Of course, this is not optimal for longevity, but at room temperature, the silicon should hold up for our analysis.
Note you can see the 16 operations of the Inverse Sub Bytes and XOR step. However, notice how little current the operation draws on the ATSAMD21 in comparison to the following mix columns. The above is a Bokeh graph of 50 traces aligned. You'll become very familiar with this graphing module while learning SCA.
If you capture the mix columns portion of the attack and then properly align the start of each of the four 4-byte operations, the 10th round key is easily recoverable. You can probably make out the alignment of the fourth operation here, which would allow us to recover bytes 12, 9, 6 and 3.
Overview of what you'll need to perform your first real attack
First off, I highly suggest working through all the Chipwhisperer DPA tutorials. Be sure to spend time really understanding the examples, not just running through the paces. I found this video by Computerphile to be really good at explaining AES conceptually where most literature is a bit too cryptic.
Secondly, take your time. The fundamentals really do matter. Here is a list of things you’ll need to do and consider when setting up the attack on your target.
- Buy into the fact that at least one of your target devices will never be the same.
- This attack is not going to happen overnight. Take your time.
- Reverse engineer the protocol necessary for the update. We need this to hijack the input buffer. For my attack, I created a simple serial host that took the 16 bytes from the Chipwhisperer software and correctly talked to the victim via I2C. Luckily, in our case, the first stage of the victim took a different command than the remaining stages. This is how our victim knew to reset the IV and CBC.
- Optimize the above step. The quicker you can get trace data out, the faster your attack iteration cycle will become.
For ultra complex or small victims, you are best off fabricating a separate attack board where you can easily access all power rails and input signals. This allows you to hot air rework the device off of the target and onto your victim attack board when ready. When designing an attack board, be sure to add a possible shunt resistor to all power domains as well as going overboard on filtering caps to the supply side of the shunt resistor. You don’t need to populate them all, but it’s easier than bodging ones you missed. As always with test boards, add a metric ton of ground test points. My other suggestion is, even if you are used to using 0402, might as well design using something bigger as you’ll be swapping parts quite a bit.
Learn from my mistakes as I did everything I’m telling you not to do.
- I suggest running through a simulated attack with a known key on a target you own. Using the simple serial AES sample or a “secure bootloader sample” from the hardware vendor is an easy way to tweak your setup to best extract keys. Refine your filtering caps, number of traces, bandwidth, mV/div and even your trigger before attacking the real victim.
- Use the Chipwhisperer differential probe. Spend time getting the supply side as clean as possible. Remember, long ground leads on your oscilloscope probes do not accurately show what the signal looks like, so be sure to use probe-tip grounds when reducing the noise. Also, if you use an external oscilloscope or PicoScope, you’ll want to set it up to use AC coupling and most likely a bandwidth limit filter.
- One concept that took me a while to understand is that for CPA attacks, the target’s clock rate does not necessarily need to be known. Yes, your scope has to have enough bandwidth to capture the power consumption, but even a somewhat filtered power signal still correlates properly. However, the number of traces needed can be reduced considerably if you synchronize sampling with the target clock.
It is extremely important that the traces be aligned for a proper CPA attack. Get to know the Sum of Absolute Difference (SAD) pre-processing built into the Chipwhisperer library. For my particular attack, I used a coarse SAD filter first to get the signals fairly aligned, as we were not synchronized with the target. Then I used a secondary, extremely fine SAD to align the signals on an abrupt current swing like the one seen below.
The particular algorithm I attacked did not have a time deterministic mix columns function. Therefore, I found it easiest to align and attack four bytes at a time as they got processed by the mix columns function. To write this another way: I first attacked key bytes
0, 13, 10, 7 then realigned to the next mix word
4, 1, 14, 11 then realigned to the next mix word
8, 5, 2, 15 then realigned to the next mix word
12, 9, 6, 3
It is possible this non-deterministic aspect of the algorithm can be compromised by another side channel attack!
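The SAD alignment used for each of these realignment passes can be sketched in plain Python. This is a stand-alone illustration of the idea, not the Chipwhisperer pre-processing module: `align_trace` and its parameters are hypothetical names, and samples shifted in from outside the capture are simply padded with zeros.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length windows."""
    return sum(abs(x - y) for x, y in zip(a, b))


def align_trace(trace, reference, ref_start, ref_len, max_shift):
    """Shift `trace` so its best SAD match lines up with the reference window.

    Returns (aligned_trace, shift). A positive shift means the feature
    appeared later in `trace` than in `reference`.
    """
    window = reference[ref_start:ref_start + ref_len]
    best_shift, best_score = 0, None
    for shift in range(-max_shift, max_shift + 1):
        start = ref_start + shift
        if start < 0 or start + ref_len > len(trace):
            continue                      # window would fall off the trace
        score = sad(trace[start:start + ref_len], window)
        if best_score is None or score < best_score:
            best_shift, best_score = shift, score
    aligned = []
    for i in range(len(trace)):
        src = i + best_shift
        aligned.append(trace[src] if 0 <= src < len(trace) else 0.0)
    return aligned, best_shift


# Toy example: the same current spike, captured 3 samples late.
reference = [0.0] * 20 + [5.0, 9.0, 5.0] + [0.0] * 20
late = [0.0] * 23 + [5.0, 9.0, 5.0] + [0.0] * 17
aligned, shift = align_trace(late, reference, ref_start=18, ref_len=7, max_shift=5)
print(shift, aligned == reference)
```

A coarse pass would use a long window and a large `max_shift`; a fine pass would then use a short window centered on an abrupt current swing with a small `max_shift`, repeated for each mix word.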
- The correlation, or Partial Guessing Entropy (PGE), indicator on the Chipwhisperer setup really helps you determine these slight variations in the algorithm. This number is your gauge of whether your change helped or hurt your attack. It’s your score card. The PGE does go down as trace data increases. However, it should be obvious which value is the winner. If it is not, then you're doing something wrong.
- There are many ways to "check your work" once you obtain the key. If you have a couple bytes with a low PGE, you can run the entropy on the output to automate brute forcing the final keys. Remember, most companies are not reinventing the wheel. Check with the vendor bootloader sample to give you some insight into the process.
- Jupyter is a really nice way of logging your changes and attack results. Be sure to embrace its power during your quest.
- Finally, you really just need to do an SCA attack. Reading about it and trying to wrap your brain around the complexities does not work. You need to work through a victim yourself to grasp these complex concepts.
I owe a ton of gratitude to Colin O'Flynn and his NewAE team. It amazes me how powerful their Chipwhisperer tool set is for the price, and how constantly they refine an already awesome product. I'm not affiliated with them; I'm just an extremely impressed customer.
I have just scratched the surface of SCA and am looking forward to continuing my research. If you have noticed anything conceptually wrong in this article, feel free to drop me a line on Twitter at @gethypoxic. I'm not one to answer DMs for support, but suggest reading NewAE's forums instead.