Two Attacks on Naive Tree Hashes

2025 March 30th

If you're building a tree hash,also known as a Merkle tree and you want it to have the same security properties as a cryptographic hash like SHA-3 or BLAKE3, there are a couple of "attacks" you need to know about. Here's a naive recursive tree hash based on SHA-3 in Python, which turns out to be vulnerable to these attacks:Dividing the length by 2 at each level of the tree like this is a shortcut, but the resulting tree structure isn't great, because we have to know the final input length to figure out where any of the leaf or subtree boundaries are. That's no problem in these examples, but in practice most hash functions need to be able to "stream" long inputs in a loop without knowing the final length until the end. A better splitting rule for real tree hashes is to make the left side as large as possible as long as it's a power-of-2. See the BLAKE3 paper, particularly sections 2.1 and 5.1.2, and also Binary Numeral Trees.

from hashlib import sha3_256

LEAF_SIZE = 1000

def tree_hash(input_bytes):
if len(input_bytes) <= LEAF_SIZE:
# Inputs below a certain size are "leaf nodes", and we feed them
# directly into SHA-3.
return sha3_256(input_bytes).digest()

# When the input is more than one leaf we split it in half, recursively
# tree_hash() each half, join the two subtree hashes into a "parent
# node", and feed that into SHA-3.
half = len(input_bytes) // 2
left_subtree_hash = tree_hash(input_bytes[:half])
right_subtree_hash = tree_hash(input_bytes[half:])
parent_node = left_subtree_hash + right_subtree_hash
return sha3_256(parent_node).digest()

Let's start with arguably the most important security property for cryptographic hashes, collision resistance. A "collision" is when two different inputs have the same hash. There are no known collisions on SHA-3.As of 2025, there are no known collisions on SHA-2 either. The first collision on SHA-1 was published in 2017. But even though our tree_hash uses SHA-3 on the inside, it has collisions:I'm going to use secrets.token_bytes instead of random.randbytes in all these examples. randbytes is more familiar, but it uses a non-cryptographic RNG called Mersenne Twister that has predictable output. Even when it doesn't really matter (like here), we might as well teach the good one.

# Here's an example input, two randomly generated leaves.
import secrets
input1 = secrets.token_bytes(2 * LEAF_SIZE)
hash1 = tree_hash(input1)

# And here's another input. We'll choose this one to be the concatenated
# hashes of the leaves of input1, i.e. exactly the same "parent node" that
# tree_hash(input1) would produce.
leaf_hash1 = sha3_256(input1[:LEAF_SIZE]).digest()
leaf_hash2 = sha3_256(input1[LEAF_SIZE:]).digest()
input2 = leaf_hash1 + leaf_hash2
hash2 = tree_hash(input2)

# These two inputs are a "collision" on tree_hash.
assert input1 != input2 and hash1 == hash2

SHA-3 also prevents length extension.Surprisingly, MD5, SHA-1, and SHA-2 all allow length extension. That's why historically we've needed HMAC for keyed hashing. In other words, seeing the hash of some secret input doesn't let us compute any related hashes, like say the hash of "that secret plus some more bytes". But our tree_hash does allow length extension:In fact, tree_hash allows length extension in both directions.

# Here's another input, which adds a couple more random leaves to input1.
# Of course we can't construct input3 like this if we don't know input1.
more_bytes = secrets.token_bytes(2 * LEAF_SIZE)
input3 = input1 + more_bytes
hash3 = tree_hash(input3)

# But we can compute hash3 without knowing input1, by "extending" hash1.
# This is a "length extension attack".
assert hash3 == sha3_256(hash1 + tree_hash(more_bytes)).digest()

In both cases, the problem is that we're giving the same inputs to SHA-3 in different places. We have collisions because the bytes of a (random) parent node can be the same as the bytes of a (deliberately chosen) leaf node. And we have extensions because there's no difference between the (maybe secret) hashes in the interior of the treeIn the literature, interior/non-root hashes are often called "chaining values". and the (maybe public) hash at the root. This leads us to two rules for secure tree hashes:

  1. Leaf hashes and parent hashes must never be the same.
  2. Root hashes and non-root hashes must never be the same.

One way to satisfy these rules is to prefix or suffix all our SHA-3 inputs. Here's a modified tree_hash that has collision resistanceas long as SHA-3 remains collision resistant and doesn't allow length extension:

def node_hash(node_bytes, is_root, is_parent):
# [0, 0]: non-root leaf
# [0, 1]: non-root parent
# [1, 0]: root leaf
# [1, 1]: root parent
suffix = bytes([is_root, is_parent])
return sha3_256(node_bytes + suffix).digest()

def tree_hash(input_bytes, is_root=True):
if len(input_bytes) <= LEAF_SIZE:
return node_hash(input_bytes, is_root, is_parent=False)
half = len(input_bytes) // 2
left_subtree_hash = tree_hash(input_bytes[:half], is_root=False)
right_subtree_hash = tree_hash(input_bytes[half:], is_root=False)
parent_node = left_subtree_hash + right_subtree_hash
return node_hash(parent_node, is_root, is_parent=True)

For the formal definitions and proofs of these rules, see Sufficient conditions for sound tree and sequential hashing modes by the Keccak/SHA-3 team. Section 7.5 ("Checking of some real-world tree hash modes") is especially interesting even if you aren't into the formal stuff.