Recover the Architect notes from the old wiki

Commit c8d062ab authored Apr 22, 2016 by Poul-Henning Kamp

Showing with 255 additions and 0 deletions

index.rst doc/sphinx/phk/index.rst +1 -0

notes.rst doc/sphinx/phk/notes.rst +254 -0

--- a/doc/sphinx/phk/index.rst
+++ b/doc/sphinx/phk/index.rst
@@ -32,4 +32,5 @@ You may or may not want to know what Poul-Henning thinks.
 	thoughts.rst
 	autocrap.rst
 	sphinx.rst
+	notes.rst

--- a/doc/sphinx/phk/notes.rst
+++ b/doc/sphinx/phk/notes.rst
+.. _phk_notes:
+
+========================
+Notes from the Architect
+========================
+
+Once you start working with the Varnish source code, you will notice
+that Varnish is not your average run of the mill application.
+
+That is not a coincidence.
+
+I have spent many years working on the FreeBSD kernel, and only
+rarely did I venture into userland programming, but when I had
+occasion to do so, I invariably found that people programmed like
+it was still 1975.
+
+So when I was approached about the Varnish project I wasn't really
+interested until I realized that this would be a good opportunity
+to try to put some of all my knowledge of how hardware and kernels
+work to good use, and now that we have reached alpha stage, I can
+say I have really enjoyed it.
+
+So what's wrong with 1975 programming ?
+---------------------------------------
+
+The really short answer is that computers do not have two kinds of
+storage any more.
+
+It used to be that you had the primary store, and it was anything
+from acoustic delaylines filled with mercury via small magnetic
+doughnuts via transistor flip-flops to dynamic RAM.
+
+And then there was the secondary store: paper tape, magnetic tape,
+disk drives the size of houses, then the size of washing machines
+and these days so small that girls get disappointed if they think
+they got hold of something other than the MP3 player you had in your
+pocket.
+
+And people program this way.
+
+They have variables in "memory" and move data to and from "disk".
+
+Take Squid for instance, a 1975 program if I ever saw one: You tell
+it how much RAM it can use and how much disk it can use. It will
+then spend inordinate amounts of time keeping track of what HTTP
+objects are in RAM and which are on disk and it will move them forth
+and back depending on traffic patterns.
+
+Well, today computers really only have one kind of storage, and it
+is usually some sort of disk, the operating system and the virtual
+memory management hardware has converted the RAM to a cache for the
+disk storage.
+
+So what happens with Squid's elaborate memory management is that it
+gets into fights with the kernel's elaborate memory management, and
+like any civil war, that never gets anything done.
+
+What happens is this: Squid creates an HTTP object in "RAM" and it
+gets used a few times shortly after creation. Then after some time
+it gets no more hits and the kernel notices this. Then somebody tries
+to get memory from the kernel for something and the kernel decides
+to push those unused pages of memory out to swap space and use the
+(cache-RAM) more sensibly for some data which is actually used by
+a program. This, however, is done without Squid knowing about it.
+Squid still thinks that these HTTP objects are in RAM, and they
+will be, the very second it tries to access them, but until then,
+the RAM is used for something productive.
+
+This is what Virtual Memory is all about.
+
+If squid did nothing else, things would be fine, but this is where
+the 1975 programming kicks in.
+
+After some time, squid will also notice that these objects are
+unused, and it decides to move them to disk so the RAM can be used
+for more busy data. So squid goes out, creates a file and then it
+writes the http objects to the file.
+
+Here we switch to the high-speed camera: Squid calls write(2), the
+address it gives is a "virtual address" and the kernel has it marked
+as "not at home".
+
+So the CPU hardware's paging unit will raise a trap, a sort of
+interrupt to the operating system telling it "fix the memory please".
+
+The kernel tries to find a free page. If there are none, it will
+take a little-used page from somewhere, likely another little-used
+Squid object, and write it to the paging pool space on the disk (the
+"swap area"). When that write completes, it will read from another
+place in the paging pool the data it "paged out" into the now unused
+RAM page, fix up the paging tables, and retry the instruction which
+failed.
+
+Squid knows nothing about this, for Squid it was just a single
+normal memory access.
+
+So now Squid has the object in a page in RAM and written to the
+disk in two places: one copy in the operating system's paging space
+and one copy in the filesystem.
+
+Squid now uses this RAM for something else but after some time, the
+HTTP object gets a hit, so squid needs it back.
+
+First Squid needs some RAM, so it may decide to push another HTTP
+object out to disk (repeat above), then it reads the filesystem
+file back into RAM, and then it sends the data on the network
+connection's socket.
+
+Did any of that sound like wasted work to you ?
+
+Here is how Varnish does it:
+
+Varnish allocates some virtual memory and tells the operating system
+to back this memory with space from a disk file. When it needs to
+send the object to a client, it simply refers to that piece of
+virtual memory and leaves the rest to the kernel.
+
+If/when the kernel decides it needs to use RAM for something else,
+the page will get written to the backing file and the RAM page
+reused elsewhere.
+
+When Varnish next time refers to the virtual memory, the operating
+system will find a RAM page, possibly freeing one, and read the
+contents in from the backing file.
+
+And that's it. Varnish doesn't really try to control what is cached
+in RAM and what is not, the kernel has code and hardware support
+to do a good job at that, and it does a good job.
+
+Varnish also only has a single file on the disk whereas squid puts
+one object in its own separate file. The HTTP objects are not needed
+as filesystem objects, so there is no point in wasting time in the
+filesystem name space (directories, filenames and all that) for
+each object, all we need to have in Varnish is a pointer into virtual
+memory and a length, the kernel does the rest.
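
A minimal sketch of the approach described above, assuming a POSIX system with mmap(2). The names `store`, `obj`, `store_open`, `store_put` and `store_send` are hypothetical and are not taken from the Varnish source; the sketch only shows the idea of one backing file, with each object reduced to a pointer and a length.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names throughout; this is not Varnish's storage code. */
struct store {
    unsigned char *base;    /* start of the file-backed mapping */
    size_t         size;    /* total bytes in the mapping */
    size_t         used;    /* bytes handed out so far */
};

struct obj {
    const void *ptr;        /* pointer into virtual memory ... */
    size_t      len;        /* ... and a length: all we keep per object */
};

/* Map `size` bytes of the file at `path`. Returns 0 on success, -1 on error. */
int
store_open(struct store *st, const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return (-1);
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return (-1);
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);              /* the mapping stays valid after close(2) */
    if (p == MAP_FAILED)
        return (-1);
    st->base = p;
    st->size = size;
    st->used = 0;
    return (0);
}

/* Copy an object into the mapping and return its (pointer, length) pair. */
struct obj
store_put(struct store *st, const void *data, size_t len)
{
    struct obj o = { NULL, 0 };
    if (len > st->size - st->used)
        return (o);         /* out of space */
    memcpy(st->base + st->used, data, len);
    o.ptr = st->base + st->used;
    o.len = len;
    st->used += len;
    return (o);
}

/* Send an object: hand the pointer to write(2); the kernel pages it in if needed. */
ssize_t
store_send(int fd, struct obj o)
{
    return (write(fd, o.ptr, o.len));
}

void
store_close(struct store *st)
{
    munmap(st->base, st->size);
}
```

Note that nothing in the sketch tracks whether an object currently sits in RAM or on disk; that decision is left entirely to the kernel.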
+
+Virtual memory was meant to make it easier to program when data was
+larger than the physical memory, but people have still not caught
+on.
+
+More caches
+-----------
+
+But there are more caches around, the silicon mafia has more or
+less stalled at 4GHz CPU clock and to get even that far they have
+had to put level 1, 2 and sometimes 3 caches between the CPU and
+the RAM (which is the level 4 cache), there are also things like
+write buffers, pipeline and page-mode fetches involved, all to make
+it a tad less slow to pick up something from memory.
+
+And since they have hit the 4GHz limit, but decreasing silicon
+feature sizes give them more and more transistors to work with,
+multi-cpu designs have become the fancy of the world, despite the
+fact that they suck as a programming model.
+
+Multi-CPU systems are nothing new, but writing programs that use
+more than one CPU at a time has always been tricky and it still is.
+
+Writing programs that perform well on multi-CPU systems is even trickier.
+
+Imagine I have two statistics counters::
+
+        unsigned    n_foo;
+        unsigned    n_bar;
+
+So one CPU is chugging along and has to execute n_foo++.
+
+To do that, it reads n_foo and then writes n_foo back. It may or may
+not involve a load into a CPU register, but that is not important.
+
+To read a memory location means to check if we have it in the CPU's
+level 1 cache. It is unlikely to be unless it is very frequently
+used. Next check the level two cache, and let us assume that is a
+miss as well.
+
+If this is a single CPU system, the game ends here, we pick it out
+of RAM and move on.
+
+On a multi-CPU system, and it doesn't matter if the CPUs share a
+socket or have their own, we first have to check if any of the other
+CPUs have a modified copy of n_foo stored in their caches, so a
+special bus transaction goes out to find this out. If some CPU
+comes back and says "yeah, I have it", that CPU gets to write it to
+RAM. On good hardware designs, our CPU will listen in on the bus
+during that write operation; on bad designs it will have to do a
+memory read afterwards.
+
+Now the CPU can increment the value of n_foo, and write it back.
+But it is unlikely to go directly back to memory, we might need it
+again quickly, so the modified value gets stored in our own L1 cache
+and then at some point, it will end up in RAM.
+
+Now imagine that another CPU wants to execute n_bar++ at the same
+time, can it do that ? No. Caches operate not on bytes but on some
+"linesize" of bytes, typically from 8 to 128 bytes in each line.
+So since the first CPU was busy dealing with n_foo, the second CPU
+will be trying to grab the same cache line, so it will have to wait,
+even though it is a different variable.
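
One common way to keep two such counters off the same cache line is to align each of them to the line size. The sketch below assumes C11 (`alignas` from `<stdalign.h>`) and a 64-byte line; `CACHE_LINE`, the struct names and `same_line` are hypothetical, and the right line size depends on the CPU.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: a typical line size, check your CPU */

/* Adjacent counters: both almost certainly end up in one cache line. */
struct stats_shared {
    unsigned n_foo;
    unsigned n_bar;
};

/* Each counter is aligned to, and therefore owns, a cache line. */
struct stats_padded {
    alignas(CACHE_LINE) unsigned n_foo;
    alignas(CACHE_LINE) unsigned n_bar;
};

/*
 * Do two member offsets fall in the same line, given a struct that
 * itself starts on a line boundary?
 */
int
same_line(size_t off_a, size_t off_b)
{
    return (off_a / CACHE_LINE == off_b / CACHE_LINE);
}
```

The price is memory: the padded struct is two full lines long instead of eight bytes, so this only makes sense for data that different CPUs really do write concurrently.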
+
+Starting to get the idea ?
+
+Yes, it's ugly.
+
+How do we cope ?
+----------------
+
+Avoid memory operations if at all possible.
+
+Here are some ways Varnish tries to do that:
+
+When we need to handle a HTTP request or response, we have an array
+of pointers and a workspace. We do not call malloc(3) for each
+header. We call it once for the entire workspace and then we pick
+space for the headers from there. The nice thing about this is that
+we usually free the entire header in one go and we can do that
+simply by resetting a pointer to the start of the workspace.
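
The workspace idea can be sketched as a simple bump allocator. This is an illustration only: `struct ws`, `ws_init`, `ws_alloc`, `ws_reset` and `ws_fini` are hypothetical names and do not claim to match the Varnish workspace API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct ws {
    char *start;    /* start of the workspace, from one malloc(3) */
    char *free;     /* next unused byte */
    char *end;      /* one past the last byte */
};

/* One malloc(3) for the whole workspace. Returns 0 on success. */
int
ws_init(struct ws *ws, size_t size)
{
    ws->start = malloc(size);
    if (ws->start == NULL)
        return (-1);
    ws->free = ws->start;
    ws->end = ws->start + size;
    return (0);
}

/* Carve `len` bytes off the workspace; NULL if it does not fit. */
void *
ws_alloc(struct ws *ws, size_t len)
{
    const size_t align = sizeof(void *);
    size_t avail = (size_t)(ws->end - ws->free);

    if (len > avail)
        return (NULL);
    /* Round up so the next allocation stays pointer-aligned. */
    len = (len + align - 1) & ~(align - 1);
    if (len > avail)
        return (NULL);
    void *p = ws->free;
    ws->free += len;
    return (p);
}

/* "Free" everything at once by resetting a single pointer. */
void
ws_reset(struct ws *ws)
{
    ws->free = ws->start;
}

void
ws_fini(struct ws *ws)
{
    free(ws->start);
    ws->start = ws->free = ws->end = NULL;
}
```

A request handler would call `ws_alloc` once per header and `ws_reset` once when the request is done, so the allocator itself is never on the per-header path.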
+
+When we need to copy a HTTP header from one request to another (or
+from a response to another) we don't copy the string, we just copy
+the pointer to it. Provided we do not change or free the source
+headers, this is perfectly safe, a good example is copying from the
+client request to the request we will send to the backend.
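
As a rough illustration of copying the pointer instead of the string, a header can be represented as a begin/end pointer pair into memory owned by someone else. The types and functions below (`struct hdr`, `struct http`, `http_add_hdr`, `http_copy_hdr`, `MAX_HDR`) are hypothetical and only sketch the idea.

```c
#include <assert.h>
#include <string.h>

#define MAX_HDR 32      /* assumption: fixed-size pointer array */

/* A header is just the byte range [b, e) in somebody else's memory. */
struct hdr {
    const char *b;
    const char *e;
};

struct http {
    struct hdr hd[MAX_HDR];
    unsigned   nhd;     /* number of headers in use */
};

/* Record a header line that lives in the caller's workspace. */
int
http_add_hdr(struct http *hp, const char *line)
{
    if (hp->nhd >= MAX_HDR)
        return (-1);
    hp->hd[hp->nhd].b = line;
    hp->hd[hp->nhd].e = line + strlen(line);
    hp->nhd++;
    return (0);
}

/*
 * "Copy" header `idx` from one message to another. Only the two
 * pointers are copied; the bytes stay where they are, so `from` and
 * its storage must outlive `to`.
 */
int
http_copy_hdr(struct http *to, const struct http *from, unsigned idx)
{
    assert(idx < from->nhd);
    if (to->nhd >= MAX_HDR)
        return (-1);
    to->hd[to->nhd++] = from->hd[idx];
    return (0);
}
```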
+
+When the new header has a longer lifetime than the source, then we
+have to copy it. For instance when we store headers in a cached
+object. But in that case we build the new header in a workspace,
+and once we know how big it will be, we do a single malloc(3) to
+get the space and then we put the entire header in that space.
+
+We also try to reuse memory which is likely to be in the caches.
+
+The worker threads are used in "most recently busy" fashion: when
+a worker thread becomes free it goes to the front of the queue where
+it is most likely to get the next request, so that all the memory
+it already has cached, stack space, variables etc, can be reused
+while in the cache, instead of having the expensive fetches from
+RAM.
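
The "most recently busy" policy amounts to keeping idle workers on a stack (last in, first out). The sketch below shows only the ordering; `struct worker`, `struct pool`, `pool_release` and `pool_get` are hypothetical names, and the locking a real thread pool needs around these operations is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

struct worker {
    int            id;
    struct worker *next;    /* next idle worker, or NULL */
};

struct pool {
    struct worker *idle;    /* front of the idle list */
};

/* A worker that just finished goes to the FRONT of the idle list. */
void
pool_release(struct pool *p, struct worker *w)
{
    w->next = p->idle;
    p->idle = w;
}

/* The next request goes to the worker at the front, i.e. the one
 * that was busy most recently and whose data is likeliest cached. */
struct worker *
pool_get(struct pool *p)
{
    struct worker *w = p->idle;

    if (w != NULL) {
        p->idle = w->next;
        w->next = NULL;
    }
    return (w);
}
```

With a FIFO queue instead, every worker would take its turn and each request would tend to land on the thread whose stack and variables have been out of the CPU caches the longest.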
+
+We also give each worker thread a private set of variables it is
+likely to need, all allocated on the stack of the thread. That way
+we are certain that they occupy a page in RAM which none of the
+other CPUs will ever think about touching as long as this thread
+runs on its own CPU. That way they will not fight about the cachelines.
+
+If all this sounds foreign to you, let me just assure you that it
+works: we spend less than 18 system calls on serving a cache hit,
+and even many of those are calls to get timestamps for statistics.
+
+These techniques are also nothing new, we have used them in the
+kernel for more than a decade, now it's your turn to learn them :-)
+
+So Welcome to Varnish, a 2006 architecture program.
+
+*phk*