Part 2, Topic 1: CPA Attack on 32bit AES (MAIN)¶
NOTE: This lab references some (commercial) training material on ChipWhisperer.io. You can freely execute and use the lab per the open-source license (including using it in your own courses if you distribute similarly), but you must maintain notice about this source location. Consider joining our training course to enjoy the full experience.
SUMMARY: So far, we've been focusing on a single implementation of AES, TINYAES128C (or AVRCRYPTOLIB, if you're on XMEGA). TINYAES128C, which is designed to run on a variety of microcontrollers, doesn't make any implementation specific optimizations. In this lab, we'll look at how we can break a 32-bit optimized version of AES using a CPA attack.
LEARNING OUTCOMES:
- Understanding how AES can be optimized on 32-bit platforms.
- Attacking an optimized version of AES using CPA
Optimizing AES¶
A 32-bit machine can operate on 32-bit words, so it seems wasteful to use the same 8-bit operations. For example, if we look at the SBox operation:
$ b = sbox(state) = sbox(\left[ \begin{array} & S0 & S4 & S8 & S12 \\ S1 & S5 & S9 & S13 \\ S2 & S6 & S10 & S14 \\ S3 & S7 & S11 & S15 \end{array} \right]) = \left[ \begin{array} & S0 & S4 & S8 & S12 \\ S5 & S9 & S13 & S1 \\ S10 & S14 & S2 & S6 \\ S15 & S3 & S7 & S11 \end{array} \right] $
we could consider each row as a 32-bit number and do three bitwise rotates instead of moving a bunch of stuff around in memory. Even better, we can speed up AES considerably by generating 32-bit lookup tables, called T-Tables, as was described in the book The Design of Rijndael which was published by the authors of AES.
In order to take full advantage of our 32 bit machine, we can examine a typical round of AES. With the exception of the final round, each round looks like:
$\text{a = Round Input}$
$\text{b = SubBytes(a)}$
$\text{c = ShiftRows(b)}$
$\text{d = MixColumns(c)}$
$\text{a' = AddRoundKey(d) = Round Output}$
We'll leave AddRoundKey the way it is. The other operations are:
$b_{i,j} = \text{sbox}[a_{i,j}]$
$\left[ \begin{array} { c } { c _ { 0 , j } } \\ { c _ { 1 , j } } \\ { c _ { 2 , j } } \\ { c _ { 3 , j } } \end{array} \right] = \left[ \begin{array} { l } { b _ { 0 , j + 0 } } \\ { b _ { 1 , j + 1 } } \\ { b _ { 2 , j + 2 } } \\ { b _ { 3 , j + 3 } } \end{array} \right]$
$\left[ \begin{array} { l } { d _ { 0 , j } } \\ { d _ { 1 , j } } \\ { d _ { 2 , j } } \\ { d _ { 3 , j } } \end{array} \right] = \left[ \begin{array} { l l l l } { 02 } & { 03 } & { 01 } & { 01 } \\ { 01 } & { 02 } & { 03 } & { 01 } \\ { 01 } & { 01 } & { 02 } & { 03 } \\ { 03 } & { 01 } & { 01 } & { 02 } \end{array} \right] \times \left[ \begin{array} { c } { c _ { 0 , j } } \\ { c _ { 1 , j } } \\ { c _ { 2 , j } } \\ { c _ { 3 , j } } \end{array} \right]$
Note that the ShiftRows operation $b_{i, j+c}$ is a cyclic shift and the matrix multiplcation in MixColumns denotes the xtime operation in GF($2^8$).
It's possible to combine all three of these operations into a single line. We can write 4 bytes of $d$ as the linear combination of four different 4 byte vectors:
$\left[ \begin{array} { l } { d _ { 0 , j } } \\ { d _ { 1 , j } } \\ { d _ { 2 , j } } \\ { d _ { 3 , j } } \end{array} \right] = \left[ \begin{array} { l } { 02 } \\ { 01 } \\ { 01 } \\ { 03 } \end{array} \right] \operatorname { sbox } \left[ a _ { 0 , j + 0 } \right] \oplus \left[ \begin{array} { l } { 03 } \\ { 02 } \\ { 01 } \\ { 01 } \end{array} \right] \operatorname { sbox } \left[ a _ { 1 , j + 1 } \right] \oplus \left[ \begin{array} { c } { 01 } \\ { 03 } \\ { 02 } \\ { 01 } \end{array} \right] \operatorname { sbox } \left[ a _ { 2 , j + 2 } \right] \oplus \left[ \begin{array} { c } { 01 } \\ { 01 } \\ { 03 } \\ { 02 } \end{array} \right] \operatorname { sbox } \left[ a _ { 3 , j + 3 } \right]$
Now, for each of these four components, we can tabulate the outputs for every possible 8-bit input:
$T _ { 0 } [ a ] = \left[ \begin{array} { l l } { 02 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \\ { 03 \times \operatorname { sbox } [ a ] } \end{array} \right]$
$T _ { 1 } [ a ] = \left[ \begin{array} { l } { 03 \times \operatorname { sbox } [ a ] } \\ { 02 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \end{array} \right]$
$T _ { 2 } [ a ] = \left[ \begin{array} { l l } { 01 \times \operatorname { sbox } [ a ] } \\ { 03 \times \operatorname { sbox } [ a ] } \\ { 02 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \end{array} \right]$
$T _ { 3 } [ a ] = \left[ \begin{array} { l l } { 01 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \\ { 03 \times \operatorname { sbox } [ a ] } \\ { 02 \times \operatorname { sbox } [ a ] } \end{array} \right]$
These tables have 2^8 different 32-bit entries, so together the tables take up 4 kB. Finally, we can quickly compute one round of AES by calculating
$\left[ \begin{array} { l } { d _ { 0 , j } } \\ { d _ { 1 , j } } \\ { d _ { 2 , j } } \\ { d _ { 3 , j } } \end{array} \right] = T _ { 0 } \left[ a _ { 0 } , j + 0 \right] \oplus T _ { 1 } \left[ a _ { 1 } , j + 1 \right] \oplus T _ { 2 } \left[ a _ { 2 } , j + 2 \right] \oplus T _ { 3 } \left[ a _ { 3 } , j + 3 \right]$
All together, with AddRoundKey at the end, a single round now takes 16 table lookups and 16 32-bit XOR operations. This arrangement is much more efficient than the traditional 8-bit implementation. There are a few more tradeoffs that can be made: for instance, the tables only differ by 8-bit shifts, so it's also possible to store only 1 kB of lookup tables at the expense of a few rotate operations.
While the TINYAES128C library we've been using doesn't make this optimization, another library included with ChipWhisperer called MBEDTLS does.
SCOPETYPE = 'OPENADC'
PLATFORM = 'CW308_STM32F4'
VERSION = 'HARDWARE'
SS_VER = 'SS_VER_2_1'
allowable_exceptions = None
CRYPTO_TARGET = 'TINYAES128C'
CRYPTO_TARGET = 'MBEDTLS' # overwrite auto inserted CRYPTO_TARGET
if VERSION == 'HARDWARE':
#!/usr/bin/env python
# coding: utf-8
# # Part 2, Topic 1: CPA Attack on 32bit AES (HARDWARE)
# ---
# NOTE: This lab references some (commercial) training material on [ChipWhisperer.io](https://www.ChipWhisperer.io). You can freely execute and use the lab per the open-source license (including using it in your own courses if you distribute similarly), but you must maintain notice about this source location. Consider joining our training course to enjoy the full experience.
#
# ---
# Usual capture, just using MBEDTLS instead of TINYAES128
# In[ ]:
# In[ ]:
#!/usr/bin/env python
# coding: utf-8
# In[ ]:
import chipwhisperer as cw
try:
if not scope.connectStatus:
scope.con()
except NameError:
scope = cw.scope(hw_location=(5, 7))
try:
if SS_VER == "SS_VER_2_1":
target_type = cw.targets.SimpleSerial2
elif SS_VER == "SS_VER_2_0":
raise OSError("SS_VER_2_0 is deprecated. Use SS_VER_2_1")
else:
target_type = cw.targets.SimpleSerial
except:
SS_VER="SS_VER_1_1"
target_type = cw.targets.SimpleSerial
try:
target = cw.target(scope, target_type)
except:
print("INFO: Caught exception on reconnecting to target - attempting to reconnect to scope first.")
print("INFO: This is a work-around when USB has died without Python knowing. Ignore errors above this line.")
scope = cw.scope(hw_location=(5, 7))
target = cw.target(scope, target_type)
print("INFO: Found ChipWhisperer😍")
# In[ ]:
if "STM" in PLATFORM or PLATFORM == "CWLITEARM" or PLATFORM == "CWNANO":
prog = cw.programmers.STM32FProgrammer
elif PLATFORM == "CW303" or PLATFORM == "CWLITEXMEGA":
prog = cw.programmers.XMEGAProgrammer
elif "neorv32" in PLATFORM.lower():
prog = cw.programmers.NEORV32Programmer
elif PLATFORM == "CW308_SAM4S" or PLATFORM == "CWHUSKY":
prog = cw.programmers.SAM4SProgrammer
else:
prog = None
# In[ ]:
import time
time.sleep(0.05)
scope.default_setup()
def reset_target(scope):
if PLATFORM == "CW303" or PLATFORM == "CWLITEXMEGA":
scope.io.pdic = 'low'
time.sleep(0.1)
scope.io.pdic = 'high_z' #XMEGA doesn't like pdic driven high
time.sleep(0.1) #xmega needs more startup time
elif "neorv32" in PLATFORM.lower():
raise IOError("Default iCE40 neorv32 build does not have external reset - reprogram device to reset")
elif PLATFORM == "CW308_SAM4S" or PLATFORM == "CWHUSKY":
scope.io.nrst = 'low'
time.sleep(0.25)
scope.io.nrst = 'high_z'
time.sleep(0.25)
else:
scope.io.nrst = 'low'
time.sleep(0.05)
scope.io.nrst = 'high_z'
time.sleep(0.05)
# In[ ]:
try:
get_ipython().run_cell_magic('bash', '-s "$PLATFORM" "$CRYPTO_TARGET" "$SS_VER"', 'cd ../../../firmware/mcu/simpleserial-aes\nmake PLATFORM=$1 CRYPTO_TARGET=$2 SS_VER=$3\n &> /tmp/tmp.txt')
except:
x=open("/tmp/tmp.txt").read(); print(x); raise OSError(x)
# In[ ]:
fw_path = '../../../firmware/mcu/simpleserial-aes/simpleserial-aes-{}.hex'.format(PLATFORM)
cw.program_target(scope, prog, fw_path)
# In[ ]:
#Capture Traces
from tqdm.notebook import trange, trange
import numpy as np
import time
ktp = cw.ktp.Basic()
traces = []
N = 100 # Number of traces
project = cw.create_project("traces/32bit_AES.cwp", overwrite=True)
for i in trange(N, desc='Capturing traces'):
key, text = ktp.next() # manual creation of a key, text pair can be substituted here
trace = cw.capture_trace(scope, target, text, key)
if trace is None:
continue
project.traces.append(trace)
try:
print(scope.adc.trig_count) # print if this exists
except:
pass
project.save()
# In[ ]:
scope.dis()
target.dis()
elif VERSION == 'SIMULATED':
#!/usr/bin/env python
# coding: utf-8
# # Part 2, Topic 1: CPA Attack on 32bit AES (SIMULATED)
# ---
# NOTE: This lab references some (commercial) training material on [ChipWhisperer.io](https://www.ChipWhisperer.io). You can freely execute and use the lab per the open-source license (including using it in your own courses if you distribute similarly), but you must maintain notice about this source location. Consider joining our training course to enjoy the full experience.
#
# ---
# In[ ]:
import chipwhisperer as cw
project = cw.open_project("traces/32bit_AES.cwp")
INFO: Found ChipWhisperer😍
scope.gain.mode changed from low to high scope.gain.gain changed from 0 to 30 scope.gain.db changed from 5.5 to 24.8359375 scope.adc.basic\_mode changed from low to rising\_edge scope.adc.samples changed from 98134 to 5000 scope.adc.trig\_count changed from 10963366 to 22369307 scope.clock.adc\_src changed from clkgen\_x1 to clkgen\_x4 scope.clock.adc\_freq changed from 9513439 to 29538459 scope.clock.adc\_rate changed from 9513439.0 to 29538459.0 scope.clock.clkgen\_div changed from 1 to 26 scope.clock.clkgen\_freq changed from 192000000.0 to 7384615.384615385 scope.io.tio1 changed from serial\_tx to serial\_rx scope.io.tio2 changed from serial\_rx to serial\_tx scope.io.hs2 changed from None to clkgen Building for platform CW308\_STM32F4 with CRYPTO\_TARGET=MBEDTLS
SS\_VER set to SS\_VER\_2\_1
SS\_VER set to SS\_VER\_2\_1
Blank crypto options, building for AES128
.
Welcome to another exciting ChipWhisperer target build!!
arm-none-eabi-gcc (15:9-2019-q4-0ubuntu1) 9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599]
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Size after:
text data bss dec hex filename
4936 1612 1544 8092 1f9c simpleserial-aes-CW308\_STM32F4.elf
+--------------------------------------------------------
+ Built for platform CW308T: STM32F4 Target with:
+ CRYPTO\_TARGET = MBEDTLS
+ CRYPTO\_OPTIONS = AES128C
+--------------------------------------------------------
Detected known STMF32: STM32F40xxx/41xxx Extended erase (0x44), this can take ten seconds or more
Attempting to program 6547 bytes at 0x8000000 STM32F Programming flash...
STM32F Reading flash...
Verified flash OK, 6547 bytes
31440
If we plot the AES power trace:
cw.plot(project.waves[0])
You probably can't even pick out the different AES rounds anymore (whereas it was pretty obvious on TINYAES128C). MBED is also way faster - we only got part way into round 2 with 5000 samples of TINYAES, but with MBED we can finish the entire encryption in less than 5000 samples! Two questions we need to answer now are:
- Is it possible for us to break this AES implementation?
- If so, what sort of leakage model do we need?
As it turns out, the answers are:
- Yes!
- We can continue to use the same leakage model - the SBox output
This might come as a surprise, but it's true! Two of the t_table lookups are just the sbox[key^plaintext] that we used before. Try the analysis for yourself now and verify that this is correct:
import chipwhisperer.analyzer as cwa
#pick right leakage model for your attack
leak_model = cwa.leakage_models.sbox_output
attack = cwa.cpa(project, leak_model)
results = attack.run(cwa.get_jupyter_callback(attack))
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PGE= | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 2B 0.982 |
7E 0.994 |
15 0.992 |
16 0.990 |
28 0.995 |
AE 0.997 |
D2 0.988 |
A6 0.993 |
AB 0.989 |
F7 0.997 |
15 0.993 |
88 0.991 |
09 0.993 |
CF 0.992 |
4F 0.990 |
3C 0.989 |
1 | D7 0.468 |
5A 0.454 |
6A 0.474 |
10 0.481 |
8C 0.460 |
B8 0.447 |
5F 0.447 |
CF 0.475 |
93 0.463 |
5A 0.453 |
E4 0.466 |
92 0.465 |
76 0.460 |
E0 0.490 |
70 0.508 |
A8 0.457 |
2 | 79 0.453 |
15 0.443 |
34 0.460 |
60 0.459 |
BB 0.450 |
01 0.447 |
D9 0.442 |
DF 0.460 |
6E 0.454 |
E2 0.445 |
2B 0.443 |
B4 0.455 |
80 0.437 |
22 0.487 |
42 0.488 |
12 0.448 |
3 | 57 0.446 |
F9 0.437 |
13 0.450 |
C7 0.455 |
5D 0.442 |
FA 0.440 |
83 0.441 |
62 0.458 |
0F 0.452 |
0F 0.431 |
25 0.438 |
7D 0.452 |
70 0.432 |
93 0.481 |
CD 0.455 |
0E 0.439 |
4 | 43 0.437 |
9E 0.436 |
63 0.449 |
B3 0.446 |
66 0.435 |
91 0.440 |
56 0.427 |
DD 0.453 |
88 0.452 |
AB 0.431 |
A0 0.436 |
A7 0.441 |
4D 0.432 |
59 0.468 |
76 0.436 |
EC 0.438 |
Improving the Model¶
While this model works alright for mbedtls, you probably wouldn't be surprised if it wasn't the best model to attack with. Instead, we can attack the full T-Tables. Returning again to the T-Tables:
$T _ { 0 } [ a ] = \left[ \begin{array} { l l } { 02 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \\ { 03 \times \operatorname { sbox } [ a ] } \end{array} \right]$
$T _ { 1 } [ a ] = \left[ \begin{array} { l } { 03 \times \operatorname { sbox } [ a ] } \\ { 02 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \end{array} \right]$
$T _ { 2 } [ a ] = \left[ \begin{array} { l l } { 01 \times \operatorname { sbox } [ a ] } \\ { 03 \times \operatorname { sbox } [ a ] } \\ { 02 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \end{array} \right]$
$T _ { 3 } [ a ] = \left[ \begin{array} { l l } { 01 \times \operatorname { sbox } [ a ] } \\ { 01 \times \operatorname { sbox } [ a ] } \\ { 03 \times \operatorname { sbox } [ a ] } \\ { 02 \times \operatorname { sbox } [ a ] } \end{array} \right]$
we can see that for each T-Table lookup, the following is accessed:
$\operatorname {sbox}[a]$, $\operatorname {sbox}[a]$, $2 \times \operatorname {sbox}[a]$, $3 \times \operatorname {sbox}[a]$
so instead of just taking the Hamming weight of the SBox, we can instead take the Hamming weight of this whole access:
$h = \operatorname {hw}[\operatorname {sbox}[a]] + \operatorname {hw}[\operatorname {sbox}[a]] + \operatorname {hw}[2 \times \operatorname {sbox}[a]] + \operatorname {hw}[3 \times \operatorname {sbox}[a]]$
Again, ChipWhisperer already has this model built in, which you can access with cwa.leakage_models.t_table
. Retry your CPA attack with this new leakage model:
import chipwhisperer.analyzer as cwa
#pick right leakage model for your attack
leak_model = cwa.leakage_models.t_table
attack = cwa.cpa(project, leak_model)
results = attack.run(cwa.get_jupyter_callback(attack))
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PGE= | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 2B 0.892 |
7E 0.906 |
15 0.881 |
16 0.875 |
28 0.893 |
AE 0.904 |
D2 0.891 |
A6 0.911 |
AB 0.803 |
F7 0.882 |
15 0.913 |
88 0.862 |
09 0.897 |
CF 0.868 |
4F 0.864 |
3C 0.894 |
1 | 57 0.483 |
FC 0.517 |
9B 0.493 |
4B 0.496 |
46 0.494 |
01 0.516 |
AA 0.502 |
B1 0.506 |
ED 0.501 |
E2 0.489 |
0B 0.495 |
E4 0.469 |
80 0.520 |
0F 0.511 |
20 0.500 |
A4 0.500 |
2 | D7 0.471 |
F9 0.508 |
68 0.482 |
0B 0.485 |
93 0.492 |
0F 0.492 |
B2 0.485 |
CF 0.500 |
1A 0.487 |
42 0.489 |
4D 0.478 |
0D 0.466 |
0A 0.491 |
6E 0.495 |
70 0.486 |
99 0.480 |
3 | A1 0.467 |
0B 0.483 |
30 0.478 |
24 0.473 |
80 0.477 |
B3 0.483 |
E7 0.470 |
DB 0.473 |
2F 0.482 |
E5 0.476 |
25 0.472 |
16 0.460 |
59 0.485 |
D0 0.486 |
FB 0.481 |
12 0.480 |
4 | 18 0.453 |
9A 0.481 |
84 0.474 |
9E 0.468 |
D7 0.473 |
D7 0.482 |
6A 0.468 |
4E 0.467 |
B5 0.468 |
39 0.468 |
E9 0.467 |
45 0.454 |
04 0.480 |
8E 0.475 |
BC 0.480 |
35 0.477 |
Did this attack work better than the previous one?
T-Tables for Decryption:¶
Recall that the last round of AES is different than the rest of the rounds. Instead of it applying subbytes
, shiftrows
, mixcolumns
, and addroundkey
, it leaves out mixcolumns
. You might expect that this means that decryption doesn't use a reverse T-Table in the first decryption round, but this isn't necessarily the case! Since mixcolumns
is a linear operation, $\operatorname{mixcolumns}( \operatorname{key} + \operatorname{state})$ is equal to $\operatorname{mixcolumns}(\operatorname{key}) + \operatorname{mixcolumns}(\operatorname{state})$. Again, this is the approach that MBEDTLS takes, so we would be able to use the reverse T-Table to attack decryption.