Commit ce08e8ae authored by Poul-Henning Kamp's avatar Poul-Henning Kamp Committed by Tollef Fog Heen

Add explanation for Varnish "hashing" in light of advisories.

parent b72c1e60
......@@ -8,6 +8,7 @@ You may or may not want to know what Poul-Henning think.
.. toctree::
varnish_does_not_hash.rst
thetoolsweworkwith.rst
three-zero.rst
ssl.rst
......
.. _phk_varnish_does_not_hash:
=====================
Varnish Does Not Hash
=====================
A spate of security advisories related to hash-collisions have made
a lot of people stare at Varnish and wonder if it is affected.
The answer is no, but the explanation is probably not what most of
you expected:
Varnish does not hash, at least not by default, and
even if it does, it's still as immune to the attacks as can be.
To understand what is going on, I have to introduce a concept from
Shannons information theory: "entropy."
Entropy is hard to explain, and according to legend, that is exactly
why Shannon recycled that term from thermodynamics.
In this context, we can get away with thinking about entropy as how
much our "keys" differ::
Low entropy (1 bit):
/foo/bar/barf/some/cms/content/article?article=2
/foo/bar/barf/some/cms/content/article?article=3
High entropy (65 bits):
/i?ee30d0770eb460634e9d5dcfb562a2c5.html
/i?bca3633d52607f38a107cb5297fd66e5.html
Hashing consists of calculating a hash-index from the key and
storing the objects in an array indexed by that key.
Typically, but not always, the key is a string and the index is a
(smallish) integer, and the job of the hash-function is to squeeze
the key into the integer, without loosing any of the entropy.
Needless to say, the more entropy you have to begin with, the more
of it you can afford to loose, and loose some you almost invariably
will.
There are two families of hash-functions, the fast ones, and the good
ones, and the security advisories are about the fast ones.
The good ones are slower, but probably not so much slower that you
care, and therefore, if you want to fix your web-app:
Change::
foo=somedict[$somekey]
To::
foo=somedict[md5($somekey)]
and forget about the advisories.
Yes, that's right: Cryptographic hash algorithms are the good ones,
they are built to not throw any entropy away, and they are built to
have very hard to predict collisions, which is exactly the problem
with the fast hash-functions in the advisories.
-----------------
What Varnish Does
-----------------
The way to avoid having hash-collisions is to not use a hash: Use a
tree instead, there every object has its own place and there are no
collisions.
Varnish does that, but with a twist.
The "keys" in varnish can be very long, by default they consist of::
sub vcl_hash {
hash_data(req.url);
if (req.http.host) {
hash_data(req.http.host);
} else {
hash_data(server.ip);
}
return (hash);
}
But some users will add cookies, user identification and many other
bits and pieces of string in there, and in the end the keys can be
kilobytes in length, and quite often, as in the first example above,
the first difference may not come until pretty far into the keys.
Trees generally need to have a copy of the key around to be able
to tell if they have a match, and more importantly to compare
tree-leaves in order to "re-balance" the tree and other such arcanae
of data structures.
This would add another per-object memory load to Varnish, and it
would feel particularly silly to store 48 identical characters for
each object in the far too common case seen above.
But furthermore, we want the tree to be very fast to do lookups in,
preferably it should be lockless for lookups, and that means that
we cannot (realistically) use any of the "smart" trees which
automatically balance themselves etc.
You (generally) don't need a "smart" tree if your keys look
like random data in the order they arrive, but we can pretty
much expect the opposite as article number 4, 5, 6 etc are added
to the CMS in the first example.
But we can make the keys look random, and make them small and fixed
size at the same time, and the perfect functions designed for just
that task are the "good" hash-functions, the cryptographic ones.
So what Varnish does is "key-compression": All the strings hash_data()
are fed, are pushed through a cryptographic hash algorithm called
SHA256, which, as the name says, always spits out 256 bits (= 32
bytes), no matter how many bits you feed it.
This does not eliminate the key-storage requirement, but now all
the keys are 32 bytes and can be put directly into the data structure::
struct objhead {
[...]
unsigned char digest[DIGEST_LEN];
};
In the example above, the output of SHA256 for the 1 bit difference
in entropy becomes::
/foo/bar/barf/some/cms/content/article?article=2
-> 14f0553caa5c796650ec82256e3f111ae2f20020a4b9029f135a01610932054e
/foo/bar/barf/some/cms/content/article?article=3
-> 4d45b9544077921575c3c5a2a14c779bff6c4830d1fbafe4bd7e03e5dd93ca05
That should be random enough.
But the key-compression does introduce a risk of collisions, since
not even SHA256 can guarantee different outputs for all possible
inputs: Try pushing all the possible 33 bytes long files through
SHA256 and sooner or later you will get collisions.
The risk of collision is very small however, and I can all but
promise you, that you will be fully offset in fame and money for
any inconvenience a collision might cause, because you will
be the first person to find a SHA256 collision.
Poul-Henning, 2012-01-03
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment