Commit 6af04dd6 authored by Geoff Simmons's avatar Geoff Simmons

Add docs: Logging, Events and the Varnish Service monitor

parent 82c2430e
......@@ -7,6 +7,8 @@ The docs in this folder cover these topics:
* [``VarnishConfig`` Custom Resource](ref-varnish-cfg.md)
* [controller command-line options](ref-cli-options.md)
* [Logging, Events and the Varnish Service monitor](monitor.md)
* [Varnish as a Kubernetes Ingress](varnish-as-ingress.md)
* [Self-sharding Varnish cluster](self-sharding.md): High-level
......
# Logging, Events and the Varnish Service monitor
The controller provides a number of means to report its actions,
report errors, check the state of Varnish instances implementing
Ingress, and ensure that they realize the desired state of the
cluster.
* Ingress controller log
* Kubernetes Events generated by the controller
* Varnish Service monitor
## Ingress controller log
The controller logs to the standard output stream; the log can be viewed
in the usual ways for Kubernetes:
```
# Find the name of the Pod executing varnish-ingress-controller
$ kubectl get pods -n kube-system
[...]
varnish-ingress-controller-64fd4944bd-gph24 1/1 Running 0 1h
[...]
# Tail the log
$ kubectl logs -f -n kube-system varnish-ingress-controller-64fd4944bd-gph24
time="2019-01-03T14:29:12Z" level=info msg="Starting Varnish Ingress controller version:96ccb7f"
time="2019-01-03T14:29:12Z" level=info msg="Starting Varnish controller"
Varnish controller logging at level: info
[...]
```
The log level can be set with the ``-log-level``
[command-line option](/docs/ref-cli-options.md), default ``INFO``.
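For example, to raise the level to ``debug``, the option can be added to
the args of the controller container in its Deployment (a sketch; the
Deployment and container names here are assumptions based on the Pod name
shown above):
```
# Sketch of the controller container spec with a raised log level.
# The container name is an assumption; only the args entry matters.
      containers:
        - name: varnish-ingress-controller
          args:
            - -log-level=debug
```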
The log reports information obtained from watching the cluster API,
actions taken to sync the cluster with the desired state, errors
encountered, and information gathered by the Varnish Service monitor
(see below). Currently the content and format of log output are
intentionally not documented, since they are subject to change (so an
external parser cannot depend on the format staying the same).
## Events
The controller generates Kubernetes Events for the cluster components
that it acts upon, to report successful operations or failures. These
include Ingress, Services (running Varnish to implement Ingress, or
running as Ingress backends), the
[VarnishConfig Custom Resource](/docs/ref-varnish-cfg.md), Secrets and
Endpoints.
Events can be viewed in the usual ways, for example with
``kubectl get events``, ``kubectl describe`` or the Kubernetes dashboard:
```
$ kubectl describe ing cafe-ingress-varnish
[...]
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
2m 2m 1 varnish-ingress-controller Warning SyncFailure Error, requeueing: [{172.17.0.11:6081: No known admin secret}{172.17.0.12:6081: No known admin secret}]
2m 2m 6 varnish-ingress-controller Normal SyncSuccess Successfully synced
```
The EventSource (in the ``From`` column) is always
``varnish-ingress-controller``, and the Reason for actions taken after
a notification from the cluster API is ``SyncSuccess`` or
``SyncFailure``.
The Message in the case of ``SyncFailure`` reports the errors. The
addresses shown in the error message (such as ``172.17.0.11:6081`` in
the example) are Endpoints for a Service running Varnish, showing the
admin port (to which the controller connects to issue commands). The
controller attempts to sync all Endpoints belonging to a Varnish
Service, even if errors are encountered along the way, and collects
error reports from each Endpoint into a single message, in the "array"
format shown above.
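To see which Pods are behind the addresses in such a message, inspect the
Endpoints of the Varnish Service (the Service name ``varnish-ingress`` is
taken from the monitor example below, and may differ in your deployment):
```
# The targetRef fields in the output name the Pod for each address
$ kubectl get endpoints varnish-ingress -o yaml
```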
``requeueing`` indicates that the work item has been placed back onto
the controller's queues, and will be re-attempted (with backoff delays
for repeated errors). The controller always repeats attempts to sync
the state of the cluster until the sync is successful. If there is an
unrecoverable error, such as an error in your configuration that must
be corrected, then the ``SyncFailure`` event will be repeatedly
generated (with backoff delays, and with the aggregation and de-dup of
Events by Kubernetes), and ``SyncSuccess`` will never appear, until
the problem is fixed. It is a good idea to monitor your Event logs for
this situation.
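One way to do so is to filter Events by type and Reason, for example (a
sketch; field selector support for Events may vary with the Kubernetes
version):
```
# List Warning Events with Reason SyncFailure in all namespaces
$ kubectl get events --all-namespaces --field-selector type=Warning,reason=SyncFailure
```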
It is not unusual for ``SyncFailure`` to be closely followed by
``SyncSuccess``, as in the example above, if the controller has not
received complete information when it first attempts the sync. In the
example, the controller had received the Ingress configuration before
the Secret needed to connect with the Varnish admin ports, so it was
unable to load configurations to realize the Ingress. The Secret was
received shortly afterward, and then the Ingress could be successfully
synced. If the most recent event is ``SyncSuccess`` shortly after
``SyncFailure``, then probably all is well.
The controller is notified about all Services, Ingresses and so on in
the cluster, by default in every namespace, including components that
have nothing to do with Ingress or Varnish. These are ignored -- for
example, Ingresses without the ``ingress.class`` annotation set to
``varnish``, or Secrets that do not have the label
``app: varnish-ingress``. The controller may generate ``SyncSuccess``
Events for such objects, but in fact it has done nothing for them.
The controller log usually contains a message at the ``INFO`` level
that it has ignored information about a component.
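For reference, these are sketches of the markers that the controller looks
for (assuming that ``ingress.class`` refers to the standard
``kubernetes.io/ingress.class`` annotation key):
```
# An Ingress handled by the controller
metadata:
  annotations:
    kubernetes.io/ingress.class: "varnish"

# A Secret considered by the controller
metadata:
  labels:
    app: varnish-ingress
```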
## Varnish Service monitor
When the controller starts, it launches a monitor (a goroutine) that
at regular intervals checks the state of each Endpoint of a Varnish
Service implementing Ingress, and loads the current configuration if
necessary (if for some reason the regular sync operations have not
brought the Varnish instance up to date with the desired state).
By default, the monitor runs every 30 seconds. The interval can be
set with the
[``-monitorintvl`` command-line option](/docs/ref-cli-options.md).
For every Endpoint of a Varnish Service, the monitor does the
following:
* Connect to the admin port
* Issue the
[``ping`` command](https://varnish-cache.org/docs/6.1/reference/varnish-cli.html#ping-timestamp)
* Issue the
[``status`` command](https://varnish-cache.org/docs/6.1/reference/varnish-cli.html#status)
to read the state of the Varnish cache process.
* Check if a
[panic](https://varnish-cache.org/docs/6.1/reference/varnish-cli.html#panic-show)
has occurred
* [Discard](https://varnish-cache.org/docs/6.1/reference/varnish-cli.html#vcl-discard-configname-label)
configurations that were loaded by the controller and have gone cold.
* Update the instance with a configuration for the current desired state,
  if necessary.
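The same checks can be run by hand against a single Endpoint, for example
by exec'ing into the corresponding Varnish Pod (a sketch; the Pod name is
only an example, and it assumes that ``varnishadm`` in the image can locate
the running instance):
```
# Run the monitor's checks by hand (the Pod name is an example)
$ kubectl exec -it varnish-ingress-7f9b8c6d5-x2zqp -- varnishadm ping
$ kubectl exec -it varnish-ingress-7f9b8c6d5-x2zqp -- varnishadm status
$ kubectl exec -it varnish-ingress-7f9b8c6d5-x2zqp -- varnishadm panic.show
$ kubectl exec -it varnish-ingress-7f9b8c6d5-x2zqp -- varnishadm vcl.list
```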
When all of the operations succeed for every Endpoint in the Service,
the controller generates an Event with the Reason ``MonitorGood`` for
the Service:
```
$ kubectl describe svc varnish-ingress
Name: varnish-ingress
Namespace: default
[...]
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1h 4m 151 varnish-ingress-controller Normal MonitorGood Monitor check good
```
When all is well, many of these accumulate over time. But Kubernetes'
aggregation and de-dup of Events will show the Event series only once,
and the ``count`` and ``lastTimestamp`` fields (reflected in the
"Count" and "Last Seen" columns) update less frequently.
The controller log contains more detailed information about the work
of the monitor, such as the response received for ping, and the
"banner" returned when admin connections are established, which
contains information about the platform, the Varnish version, and a
bit about the Varnish configuration.
If any of the operations listed above fail, then the controller
generates an Event at the ``Warning`` level for the Service, with a
Reason string that may be one of:
* ``NoAdminSecret``: no admin Secret currently known
* ``ConnectFailure``: could not connect to the admin port
* ``PingFailure``: error issuing the ``ping`` command (after the admin
connection succeeded)
* ``StatusFailure``: error issuing the ``status`` command
* ``StatusNotRunning``: the state of the child process was not
``running``. The current state is shown in the Message.
* ``PanicFailure``: error attempting to read the most recent panic
message
* ``Panic``: a panic message was found. The panic is included as an
  unstructured string in the Message.
* ``VCLListFailure``: error issuing the ``vcl.list`` command (to list
configurations, and find any that may be in the cold state)
* ``VCLDiscardFailure``: error attempting to discard a cold
configuration
* ``UpdateFailure``: error attempting to update to the current desired
configuration
For each such Event, the Message contains the address of the Endpoint
at which the error occurred, with the admin port (as in the example
further up), as well as a description of the error.
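Such Warnings can be watched for a specific Varnish Service, for example
(again using the Service name from the example above, and assuming these
field selectors are supported for Events in your cluster):
```
# Watch Warning Events for the Varnish Service
$ kubectl get events -w --field-selector involvedObject.kind=Service,involvedObject.name=varnish-ingress,type=Warning
```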
If the state of the child process is not ``running``, the controller
reports this fact, but does not attempt to restart the process, since
an administrator may have changed the state intentionally. Note that
the Pod running Varnish goes to the not ready state when this happens,
since it no longer answers the HTTP readiness check.
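The not-ready state can be seen with ``kubectl get pods``, for example
(assuming that the Varnish Pods carry the ``app: varnish-ingress`` label):
```
# Pods running Varnish that are not ready show 0/1 in the READY column
$ kubectl get pods -l app=varnish-ingress
```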
When there has been a panic, the Varnish management process attempts
to restart the child process. If successful, Varnish may be running
normally again by the time the panic is detected. So it is a good idea
to monitor your Event logs for this case, and save the panic message
if found. Otherwise the Varnish cluster may appear superficially to be
running normally, when in fact there has been a severe error.
......@@ -23,10 +23,14 @@ Usage of ./k8s-ingress:
log to standard error instead of files
-masterurl string
cluster master URL, for out-of-cluster runs
-monitorintvl duration
interval at which the monitor thread checks and updates
instances of Varnish deployed to implement Ingress (default 30s)
-namespace string
namespace in which to listen for resources (default all)
-readyfile string
path of a file to touch when the controller is ready, for readiness probes
path of a file to touch when the controller is ready,
for readiness probes
-stderrthreshold value
logs at or above this threshold go to stderr
-templatedir string
......@@ -76,6 +80,11 @@ the file at that path immediately at startup, if any exists, and
touches it when it is ready. Readiness probes can then test the file
for existence. By default, no readiness file is created.
``-monitorintvl`` sets the interval for the
[monitor](/docs/monitor.md), default 30 seconds. The monitor sleeps for
this interval between its runs over the Varnish Services. See the
documentation at the link for more details.
``-log-level`` sets the log level for the main controller code,
``INFO`` by default.
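A sketch of how these options might be set together in the controller
Deployment, with a readiness probe that tests the ``-readyfile`` path (the
container name, file path and probe are examples, not taken from this
repository's manifests):
```
      containers:
        - name: varnish-ingress-controller
          args:
            - -log-level=debug
            - -monitorintvl=2m
            - -readyfile=/run/controller-ready
          readinessProbe:
            exec:
              # assumes a shell is available in the controller image
              command: ["/bin/sh", "-c", "test -e /run/controller-ready"]
```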
......