Add explanation for Varnish "hashing" in light of advisories.

ce08e8ae · Poul-Henning Kamp · Tollef Fog Heen · b72c1e60 · ce08e8ae · ce08e8ae
Commit ce08e8ae authored Jan 03, 2012 by Poul-Henning Kamp Committed by Tollef Fog Heen Mar 09, 2012
Hide whitespace changes
Inline Side-by-side

Showing with 146 additions and 0 deletions

index.rst doc/sphinx/phk/index.rst +1 -0

varnish_does_not_hash.rst doc/sphinx/phk/varnish_does_not_hash.rst +145 -0

No files found.
--- a/doc/sphinx/phk/index.rst
+++ b/doc/sphinx/phk/index.rst
@@ -8,6 +8,7 @@ You may or may not want to know what Poul-Henning think.

 .. toctree::

+	varnish_does_not_hash.rst
 	thetoolsweworkwith.rst
 	three-zero.rst
 	ssl.rst

--- a/doc/sphinx/phk/varnish_does_not_hash.rst
+++ b/doc/sphinx/phk/varnish_does_not_hash.rst
+.. _phk_varnish_does_not_hash:
+
+=====================
+Varnish Does Not Hash
+=====================
+
+A spate of security advisories related to hash-collisions have made
+a lot of people stare at Varnish and wonder if it is affected.
+
+The answer is no, but the explanation is probably not what most of
+you expected:
+
+Varnish does not hash, at least not by default, and
+even if it does, it's still as immune to the attacks as can be.
+
+To understand what is going on, I have to introduce a concept from
+Shannons information theory: "entropy."
+
+Entropy is hard to explain, and according to legend, that is exactly
+why Shannon recycled that term from thermodynamics.
+
+In this context, we can get away with thinking about entropy as how
+much our "keys" differ::
+
+	Low entropy (1 bit):
+		/foo/bar/barf/some/cms/content/article?article=2
+		/foo/bar/barf/some/cms/content/article?article=3
+
+	High entropy (65 bits):
+		/i?ee30d0770eb460634e9d5dcfb562a2c5.html
+		/i?bca3633d52607f38a107cb5297fd66e5.html
+
+Hashing consists of calculating a hash-index from the key and
+storing the objects in an array indexed by that key.
+
+Typically, but not always, the key is a string and the index is a
+(smallish) integer, and the job of the hash-function is to squeeze
+the key into the integer, without loosing any of the entropy.
+
+Needless to say, the more entropy you have to begin with, the more
+of it you can afford to loose, and loose some you almost invariably
+will.
+
+There are two families of hash-functions, the fast ones, and the good
+ones, and the security advisories are about the fast ones.
+
+The good ones are slower, but probably not so much slower that you
+care, and therefore, if you want to fix your web-app:
+
+Change::
+	foo=somedict[$somekey]
+To::
+	foo=somedict[md5($somekey)]
+
+and forget about the advisories.
+
+Yes, that's right: Cryptographic hash algorithms are the good ones,
+they are built to not throw any entropy away, and they are built to
+have very hard to predict collisions, which is exactly the problem
+with the fast hash-functions in the advisories.
+
+-----------------
+What Varnish Does
+-----------------
+
+The way to avoid having hash-collisions is to not use a hash:  Use a
+tree instead, there every object has its own place and there are no
+collisions.
+
+Varnish does that, but with a twist.
+
+The "keys" in varnish can be very long, by default they consist of::
+
+	sub vcl_hash {
+	    hash_data(req.url);
+	    if (req.http.host) {
+		hash_data(req.http.host);
+	    } else {
+		hash_data(server.ip);
+	    }
+	    return (hash);
+	}
+
+But some users will add cookies, user identification and many other
+bits and pieces of string in there, and in the end the keys can be
+kilobytes in length, and quite often, as in the first example above,
+the first difference may not come until pretty far into the keys.
+
+Trees generally need to have a copy of the key around to be able
+to tell if they have a match, and more importantly to compare
+tree-leaves in order to "re-balance" the tree and other such arcanae
+of data structures.
+
+This would add another per-object memory load to Varnish, and it
+would feel particularly silly to store 48 identical characters for
+each object in the far too common case seen above.
+
+But furthermore, we want the tree to be very fast to do lookups in,
+preferably it should be lockless for lookups, and that means that
+we cannot (realistically) use any of the "smart" trees which
+automatically balance themselves etc.
+
+You (generally) don't need a "smart" tree if your keys look
+like random data in the order they arrive, but we can pretty
+much expect the opposite as article number 4, 5, 6 etc are added
+to the CMS in the first example.
+
+But we can make the keys look random, and make them small and fixed
+size at the same time, and the perfect functions designed for just
+that task are the "good" hash-functions, the cryptographic ones.
+
+So what Varnish does is "key-compression":  All the strings hash_data()
+are fed, are pushed through a cryptographic hash algorithm called
+SHA256, which, as the name says, always spits out 256 bits (= 32
+bytes), no matter how many bits you feed it.
+
+This does not eliminate the key-storage requirement, but now all
+the keys are 32 bytes and can be put directly into the data structure::
+
+	struct objhead {
+		[...]
+		unsigned char           digest[DIGEST_LEN];
+	};
+
+In the example above, the output of SHA256 for the 1 bit difference
+in entropy becomes::
+
+	/foo/bar/barf/some/cms/content/article?article=2
+	-> 14f0553caa5c796650ec82256e3f111ae2f20020a4b9029f135a01610932054e
+	/foo/bar/barf/some/cms/content/article?article=3
+	-> 4d45b9544077921575c3c5a2a14c779bff6c4830d1fbafe4bd7e03e5dd93ca05
+
+That should be random enough.
+
+But the key-compression does introduce a risk of collisions, since
+not even SHA256 can guarantee different outputs for all possible
+inputs:  Try pushing all the possible 33 bytes long files through
+SHA256 and sooner or later you will get collisions.
+
+The risk of collision is very small however, and I can all but
+promise you, that you will be fully offset in fame and money for
+any inconvenience a collision might cause, because you will
+be the first person to find a SHA256 collision.
+
+Poul-Henning, 2012-01-03