Changes between NaviServer 4.99.4 and 4.99.5

Abstract

This document describes the differences between NaviServer 4.99.4 and the experimental version named naviserver-connthreadqueue (turned into 4.99.5). The new version shows significantly improved response times of a monitored sample website (http://next-scripting.org). The queuing times are substantially reduced and more stable through various measures like an asynchronous log writer or by converting the synchronous SSL driver to an asynchronous and also by delivering content outside the connection threads.

1. New Request Dispatch Architecture

The request dispatch on NaviServer 4.99.4 (and the AOLserver family) works via (a) adding an incoming request into a queue and (b) send send a signal to an unknown number of connection threads of a pool which are assumed to wait for this condition. If there are multiple connection threads waiting for the conditions, the thread which is waked up is determined by the Operating System. Usually, this leads to a round-robin behavior. As a consequence, the thread idle timeout is pretty useless on somewhat busy sites (when request come in at a rate where each thread is waked up before threadtimeout). Furthermore threads terminating based on the connection count die more or less at the same time leading to a "mass extinction". To address this problem, the stochastic spread factor was introduced.

The new dispatch mechanism is maintaining the list of idle connection threads per pool. and can therefor control the dispatch explicitly. The implemented policy assigns the jobs in a LIFO manner to the connection threads. When there are several threads available, the requests are not evenly distributed, the least used thread (at the end of the LIFO queue) can properly time-out. The likelihood of the "mass extinction" is greatly reduced. The spread factor is not needed and has been removed.

In the new architecture, jobs can be directly assigned to threads and the single thread is activated with a signal. The job queue is just used for cases where no idle connection thread is available. The new implementation uses separate locks for waiting queue (wqueue) and thread queue (tqueue) per pool. The changes lead to significant finer lock granularity and improve scalability.

I have tested this new version with a synthetic load of 120 requests per seconds, some slower requests and some faster ones. This load keeps about 20 connection threads quite busy on my home machine. The contention on the new locks is very little: on this test we saw 12 busy locks on 217.000 locks on the waiting queue, and 9 busy locks out of 83.000 locks on the thread queue. These measures are much better than in NaviServer 4.99.4, which has on the same test on the queue 248.000 locks with 190 busy ones. The total waiting time for locks is reduced by a factor of 10. One has to add, that it was not so bad before either. The benefit will be larger when more threads and maybe multiple pools are used.

The new version implements connection thread scaling via the number of entries in the connection queue. The term connection thread scaling refers to automatically create additional connection threads between the minimal and the maximal number of threads (minthreads and maxthreads). Connection thread scaling is useful when a site master does not know how many connection threads should be created to keep up with the expected load. This is increasingly important when the load varies over time. In general, the server is allowed to grow up to maxthreads connection threads and can shrink again the number of threads vis the idle timeout or the number of jobs per connection thread. in practice, newer web applications tend to have today higher number of embedded resources than a few years ago. When a page contains e.g. 50 embedded resources to be requested from the same web server (e.g. css, js, images), current browser open several pipelines in parallel to the server (the default for firefox are 6 persistent connections). The arriving requests are parsed and dispatched to the available connection threads. For high number of embedded resources it is quite likely that some requests are queued. When requests are queued this is a good sign that maybe more connection threads should be created. The server uses to parameters to control this behavior which are defined as percentage of the queue size.

  • lowwatermark: if the actual queue size is below this value, don’t try to create threads; default 10%

  • highwatermark: if the actual queue size is above this value, allow parallel thread creates; default 80%

When lowwatermark is set to 0, this means that every queue operation might create a new threads. The variable highwatermark controls concurrent thread requests, which can cause problems when certain c-level extensions are used in the C code. Setting these value 100 turns this feature off. Both parameters control the aggressiveness (or willingness) of the server to create additional connection threads.

2. Request Queueing and Latencies

Although not directly related with the new request dispatch mechanism, the new implementation was a good opportunity to look closer on request latencies. Frist a small example from a real world website to clarify the problem (most synthetic benchmarks would not show this).

Consider a server with two running connection threads receives 6 requests, where requests 2-5 are received in a very short time. The first three requests are directly assign to connection threads, have fromQueue == 0. These have queuing times between 88 and 110 micro seconds, which includes signal sending/receiving, thread change, and initial setup in the connection thread. The runtimes for these requests are pretty bad, in the range of 0.24 to 3.8 seconds elapsed time.

[1] waiting 0 current 2 idle 1 ncons 999 fromQueue 0 queue 0.000110 run 0.637781 total 0.637891
[2] waiting 3 current 2 idle 0 ncons 998 fromQueue 0 queue 0.000090 run 0.245030 total 0.245120
[3] waiting 2 current 2 idle 0 ncons 987 fromQueue 0 queue 0.000088 run 0.432421 total 0.432509
[4] waiting 1 current 2 idle 0 ncons 997 fromQueue 1 queue 0.244246 run 0.249208 total 0.493454
[5] waiting 0 current 2 idle 0 ncons 986 fromQueue 1 queue 0.431545 run 3.713331 total 4.144876
[6] waiting 0 current 2 idle 0 ncons 996 fromQueue 1 queue 0.480382 run 3.799818 total 4.280200

Requests [4, 5, 6] are queued, and have queuing times between 0.2 and 0.5 seconds. The queuing times are pretty much the runtimes of [2, 3, 4], therefore the runtime determines the queuing time. For example, the totalTime of request [4] was 0.493454 secs, half of the time it was waiting in the queue. Request [4] can consider itself happy that it was not scheduled after [5] or [6], where its totalTime would have been likely in the range of 4 secs (10 times slower). Low waiting times are essential for good performance.

This example shows pretty well the importance of async delivery mechanisms like the writer thread or bgdelivery in OpenACS. A file being delivered by the connection thread over a slow internet connection might block later requests substantially (as in the cases above).

At the same time the example shows the importance of async receive operations, since when these packages are handled in a connection thread, the blocking time of the connection thread will be larger as well. NaviServer distinguishes between asynchronous and synchronous drivers, where e.g. nssock is defined synchronous and nsssl is defined asynchronous. For asynchronous drivers, the receiving of the request and the parsing of the request happens in the driver, where for synchronous drivers (which might block on read or write operations), the driver just accepts the request and delegates all further work into the connection threads, leading to higher latencies.

The analysis of the latencies lead to several extensions in the new version of NaviServer, which are described in the sequel.

2.1. Fine Grained Time Introspection

NaviServer 4.99.4 as very little introspection facilities for timing measurements, which consists essentially of the start time of a request, which is actually the time a request is put into a queue.

The new version tracks internally the following time stamps:

  • accept time stamp

  • queueing time stamp

  • dequeuing time stamp

  • filter done time stamp (after preauth and/or postauth filter)

Based on these time stamps, one can determine e.g. the queuing duration (which was shown already above), or one can measure the duration of NaviServer filters or the pure runtime. From these times, the accept time is the least interesting for performance tuning, since the accept might have happened already at some earlier time in an persistent connection. The queuing time is very important for analyzing latencies and can be influenced by the server settings. Often, just the pure runtime is what might me optimized by a user of a framework like OpenACS.

Furthermore, most web-servers have the problem that analyzing the access log is often problematic since the access log is (a) sorted by the request end-times and (b) the time granularity of one second is rather crude.

The new version of NaviServer allows to track optionally the partial times together with a high resolution time stamp (micro second resolution) in the access log. This can be achieved by setting logpartialtimes in the nslog parameter section to true. In addition, the new NaviServer version tracks the aggregated times and makes these available via ns_server stats. This command returns an attribute value list containing the aggregated durations, but as well the aggregated number of spool operations, the number of requests, the number of queued requests, and the number of connection threads created. All these values are measured since the start of the server.

% ns_server stats
requests 51 spools 4 queued 0 connthreads 5 accepttime 4.270051 queuetime 0.006188 filtertime 2.259168 runtime 1.191496

The command ns_server stats can be used to monitor the system response behavior. The graphic below shows the development of the website next-scripting.org which uses the new NaviServer version in various development states.

partial times
Figure 1. Monitor Partial Times

The blue line is showing the queuing time which was at the beginning often in the 0.1 seconds range. The reported values are the averages taken from a 5 minutes interval. By improving various factors (which will be addressed below), the current values are in the range of hundred microseconds. One can see as well the tendency, how the total response time decreased (although this is shown on an exponential y scale).

Another new feature is the ability to link NaviServer threads to the OS-specific IDs. This (currently Linux only feature) allows to query the operating system for several informations about a thread. One can for example query the user and system time used per thread. The OS-specific IDs are returned as last entry in ns_info threads. If there is no OS support available, the value 0 is used as a place holder.

The chart below shows what amount of CPU time is used by a web site for what kind of NaviServer thread. We can see here, how e.g. the driver thread nssock looped due to a bug during development (while the server was fully functioning).

CPU times
Figure 2. Monitor CPU Times

The next graph shows the total response times of the server on a metric scale. Notice, how the perceived response time was reduced in week 49, where the asynchronous SSL driver was introduced. The traffic patterns were very similar in weeks 45 to 49, around 4,000 request per day from 40 to 60 visitors.

Response times
Figure 3. Total Response Time

The last chart displays the memory consumption over the test period. At the begin of week 47, we switched from Tcl’s default malloc (zippy malloc) to TCMalloc (the change requires a modification of the Tcl sources). At the end of week 46, we restricted the number of connection threads to 2 (parameter minthreads and maxthreads to 2). In the begin of week 47 we set minthreads to 5 and maxthreads to 10 and switched to TCMalloc. The peaks in week 48 are short runs of the server without TCMalloc.

Memory sizes
Figure 4. Memory Footprint: RSS and VM size

For more information about different mallocs in multithreaded Tcl applications, see https://next-scripting.org/xowiki/docs/misc/thread-mallocs

2.2. Reduce Latencies

When we started to analyze a day’s traffic on next-scripting.org, we used the previous (synchronous) version of nsssl. For synchronous drivers, the accept time is always zero. By analyzing the traffic of the day, the queue time was on average 54 ms (milliseconds), which is huge, while the median is 0.1 ms. The reason is that we can see a maximum queue time of 21 secs! As a consequence, the standard deviation is as well huge. When one looks more closely to the log file, one can see that at times, when the queuing time is huge, as well the runtime is often huge, leading to cascading effects as described above.

        accept  queue   run     total
avg     0.0000  0.0547  0.2656  0.3203
stdev   0.0000  0.7481  0.9740  1,4681
min     0.0000  0.0000  0.0134  0.0135
max     0.0000  21.9594 16.6905 29.1668
median  0.0000  0.0001  0.0329  0.0330

The causes for the "random" slow cases are at least from two sources: slow network connections (some clients are slow in retrieving data, especially some bots seems to be bad) and system specific latencies (e.g. backup running, other servers becoming active, etc.). To address the problem with the slow network connection, i moved first more file deliveries to the writer thread (reducing writersize) and activated TCP_DEFER_ACCEPT. By taking these two measures, one can observer nice improvements: the average queuing time for a day is now 4.6 ms, but still the standard deviation is huge, caused by some "erratic" large values.

        accept  queue   run     total
avg     0.0000  0.0046  0.1538  0.1583
stdev   0.0000  0.0934  0.6868  0.6967
min     0.0000  0.0000  0.0137  0.0138
max     0.0000  4.6041  22.4101 22.4102
median  0.0000  0.0001  0.0216  0.0217

A typical source for system specific latencies is the file system. When the latency for writing an entry to a log files (e.g. access log or error log) is high, the thread performing is operation will be blocked for this time. At earlier times we observed on a different website write latencies in the seconds range caused by ext3. This site is much better in this regard, but we see there as well maximum latencies in the range of 200 ms (which is actually an acceptable value).

To address the problem, i have implemented asynchronous writer thread for log files. This causes logging operations from connection threads to be faster and independent from the write latencies of the operating system. The table below shows again a full day with the Async-Writer thread on the same site. The avg queueing time dropped from 0.0547 to 0.0009, the standard deviation from 0.7481 to 0.0211. This is a noticeable improvement.

        accept  queue   run     total
avg     0.0000  0.0009  0.1304  0.1313
stdev   0.0000  0.0211  0.9022  0.9027
min     0.0000  0.0000  0.0032  0.0033
max     0.0000  0.9772  28,1699 28,1700
median  0.0000  0.0001  0.0208  0.0209

But still, the sometime large values are worth to investigate. It turned out that the large queueing above time came from requests from a site with a probably bad internet connection (Taipeh), where for an error message the writer thread was not used. Funny enough, the log file analysis showed that the delivery of the error message blocked the connection thread longer than the delivery of an larger image. The reason for this is that the image had a size abover the writersize and was therefore using the background writer.. It turns out that allowing deliveries of error messages over the writer thread requires some refacturing of the infrastructure, which i have not done.

The problem in such cases is not only delivery of the files, but as well latencies caused for incoming packages by the sync driver. When a synchronous driver is used, all receive operations for the incoming request block the connection thread.

Therefore the next logic step to address the problem is to make nsssl asynchronous. Below is again the analysis of a full day of next-scripting.org with the asynchronous implementation of nsssl. The average queue time is lower by again a factor of 4, the standard deviation is as well much better.

        accept  queue   run     total
avg     0.0599  0.0002  0.0183  0.0196
stdev   0.3382  0.0039  0.0413  0.0530
min     0.0000  0.0001  0.0011  0.0012
max     12,1002 0.2324  0.8958  2,0112
median  0.0088  0.0001  0.0118  0.0120

After running the server a few days, we observed still some unexplained slow calls. While in the day below, the figures for queue are even better as for the day above (see e.g. stdev), a few requests took quite long.

        accept  queue   run     total
avg     0,1463  0,0001  0,0355  0,0429
stdev   1,0841  0,0003  0,6566  0,6567
min     0,0000  0,0000  0,0010  0,0012
max     31,6194 0,0066  30,2704 30,2798
median  0,0082  0,0001  0,0026  0,0117

Some digging revealed, that this were requests for a mobile user with a bad bandwidth. The delivery time for an adp page counts as "runtime". Therefore a slow delivery could still block a connection thread. The writer threads where implemented in NaviServer 4.99.4 just for file deliveries (and the manual invocations via ns_writer).

A dependency of the servers availability on on the client connection speed is a serious problem it this opens a vulerability for DOS attackers minicking slow connections. This was a further motivationto implement as well asynchronous delivery for adp pages (and all other dynamic content) and as well for streaming HTML output (sent e.g. via multiple ns_write operations) when enabled.

2.3. Asynchronous Architecture

Below are three Figures to show the differences between older versions of NaviServer (and AOLServer) and NaviServer 4.99.5.

synchronous driver
Figure 5. Synchronous Driver

The simplest case it the architecture with a synchronous driver (Figure 5), which was used e.g. in NaviServer 4.99.4 and before for connections over nsssl. In this case, the driver thread is just responsible for handling the incoming request (accept) and for handing it over to the connection thread. The connection thread is responsible for receiving the request, processing the request (computing the results) and delivering the content to the client. Once the connection is established the connection thread is handling all IO operations, which can be blocking IO, since the blocking affects only the single thread. The yellow background denotes spooling activities, which are activities involving receiving and sending TCP packages from and to the client. While the CPU requirements for spooling is very little, the elapsed time of these operations are hard to predict and depend on external factors.

in this case a connection thread is occupied in this case for a long time which is hard to predict. Since the connection threads are a scarce ressource, the scalability of the server is limited. The connection threads require much more resources since they are are responsible for the heavy-computations (in OpenACS often several thousands of procs/objects/classes); therefore they require much resources and are available only in small quantities.

asynchronous driver
Figure 6. Asynchronous Driver

When an asynchronous driver (Figure 6) is used (e.g. nssock under NaviServer 4.99.4, or nssock and nsssl under 4.99.5) the driver is responsible for the full receipt of the HTTP request. Fot this kind of architecture, the driver has to support asynchronous (non-blocking) IO for receiving the request. Since the CPU effort for handling every incoming packet is little, a single driver is able to handle multiple receive operatons at the same time. The big advantage of the server with the asynchronous driver is that the time spent in the connection thread is significantly reduced. By reducing the elapsed time span in the connection threads, the server is able to improve throughput and to reduce latencies (see above).

asynchronous driver with spooler and writer
Figure 7. Asynchronous Driver with Spooler and Writer Threads

NaviServer 4.99.5 favors the model of Figure 7, where the elapsed time-span in the connection thread is reduced further by delegating the output spooling the one or more writer threads. One writer thread is able to handle a high number of simultaneous output spooling tasks. 4.99.4 supported just a limited forms of writer deliveries (file-deliveries with open fd handles and string deliveries, no partial write support in writer thread, every block-write operation blocked a writer thread) where 4.99.5 supports as well ADP-deliveries, HTML streaming deliveries (via ns_write), and mmapped deliveries with full asynchronous output handling in the writer threads.

Optionally a spooler thread can be used to take some load from the driver when many uploads are performed, but this has typically little influence on the latencies. The spooler threads are identical to 4.99.4.

By delegating the various processing steps to multiple threads, NaviServer 4.99.5 is able to benefit much better from multi-core architectures and improves throughput via pipelining and using asynchronous IO-Operations for spooling activities. Depending on the needs, the server can be configured to use multiple (lightweight) spooler and writer threads.

3. Summary of Improvements

  • New Dispatch Architecture

    • lower latency, better scaling

  • Latency Related Improvements:

    • asynchronous nsssl driver

    • asynchronous log writer

    • asynchronous delivery of dynamic content (e.g. adp pages)

  • Improved writer thread support (managing async content delivery):

    • New command ns_conn clientdata to be handed to ns_writer list to establish connection between requests and listed active writer sessions

    • More detailed output in ns_writer list, e.g. containing starttime, threadname and client data

    • Handling of HTML streaming output via writer threads (activated via configuration paramater writerstreaming)

    • New commands ns_writer size /driver/ ?value? and ns_writer streaming /driver/ ?value? to query/change writer configuration values at runtime.

    • various fixes (added explicit eof-handling to avoid busy event loops, thread-safety of introspection, memory management)

    • when ns_writer submit or ns_writer submitfile is used, send content via writer regardless of the content size

  • Improved timing introspection

    • Compute partial times (accept time, queue time, filter time, run time)

    • Track partial times in accesslog together with high-resolution timestamps for determining the request order (the entries in the access log of the server are traditionally in the order of termination on a second granularity, which is often too coarse).

    • Determine OS level thread-ids if possible (Linux only) and report it via ns_info threads. This can be used e.g. by a monitor (like nsstats.tcl) to list user-time and system-time per thead.

    • New subcommand ns_server stats to return aggregated statistics from the server.

  • Added handling for deprecated tcl commands:

    • ns_deprecated can be used for procs which are deprecated.

    • Ns_LogDeprecated() can be used in C-implemented commands

  • Reform of ns_info / ns_server: Previously, ns_info returned various information from the "main" server of the nsd process, but there was no interface to obtain this information from an second (e.g. virtual) server or from a connection pool (every server might have multiple pools). These commands were moved to ns_server and the previous ns_info commands were marked deprecated. Also, passing optionally the (sometimes ignored) pool as second argument was deprecated in favor of an explicit -pool option.

    • Server-specific commands (accept option -server)

      ns_server ?-server server? filters
      ns_server ?-server server? pagedir
      ns_server ?-server server? pools
      ns_server ?-server server? requestprocs
      ns_server ?-server server? tcllib
      ns_server ?-server server? traces
      ns_server ?-server server? url2file
    • Pool-specific commands (accept -server and -pool)

      ns_server ?-server server? ?-pool pool? active
      ns_server ?-server server? ?-pool pool? all
      ns_server ?-server server? ?-pool pool? queued
      ns_server ?-server server? ?-pool pool? connections
      ns_server ?-server server? ?-pool pool? stats
      ns_server ?-server server? ?-pool pool? threads
      ns_server ?-server server? ?-pool pool? waiting
    • New command ns_server ?-server s? serverdir to return the base directory of the virtual server

    • New commands to change some configuration parameters at runtime

      • New command ns_server ?-server s? ?-pool p? minthreads to query or set minthreads from the specified server and pool.

      • New command ns_server ?-server s? ?-pool p? maxthreads to query or set maxthreads from the specified server and pool.

    • Marked several ns_info / ns_server commands as deprecated:

      ns_info filters
      ns_info pagedir
      ns_info pageroot
      ns_info platform  (use $::tcl_platform(platform))
      ns_info requestprocs
      ns_info tcllib
      ns_info traces
      ns_info url2file
      ns_info winnt  (use $::tcl_platform(platform))
      ns_server ?-server server? pageroot
      ns_server keepalive ?pool? (use "ns_conn keepalive" instead)
      ns_register_adptag (use instead ns_adp_registerscript)
      ns_adp_registertag (use instead ns_adp_registeradp)
  • TCP improvements:

    • support for deferaccept in nsssl

    • support for SO_ACCEPTFILTER on FreeBSD (nssock and nsssl)

    • support for TCP_NODELAY

    • support for TCP_CORK

    • support for TCP_FASTOPEN (requires Linux 3.7)

  • Security improvements for the SSL Driver:

    • Previous versions of nsssl could be attacked via several measures. In particular the server could be attacked via the BEAST and CRIME attacks, it was as well endangered via client-initiated renegotiation which is an attack point for DoS. Part of the problem could be fixed via different configuration values. With the new code and the provided recommendation, nsssl returns better results than e.g. current nginx.

  • Misc Improvements:

    • New commands:

      • New command ns_conn zipaccepted to query, whether the client accepts gzip-compressed content. This eases Tcl-level handling of such cases.

      • New command ns_fastpath_cache_stats obtain usage info from the fastpath cache.

    • Improved performance:

      • Sigificanly reduced number of lock operations: the previously most called mutex nsd:conf is practically eliminated

      • Optimization of branches of most busy code pieces based on coverage analysis data

    • Easier debugging and development:

      • Allowing to call "nsd -c" without config file.

    • Bug-fixes and documentation cleanups:

      • Bug-fix for rolling access log when rollfmt is specified. Previously the log file was closed but not reopened, log data was lost.

      • Fixed all known range request bugs from regression test.

      • Fixed potential crashes in news c-libraries, which return frequently NULL

      • Improved default configuration (using e.g. connsperthread 10000)

      • Various documentation improvements and code cleanups

4. Configuration of the New Features

4.1. Logging of Partial Times

To add logging of partial times in the access log, set logpartialtimes in the nslog section to true. Per default this feature is turned off.

ns_section ns/server/${server}/module/nslog
        ....
        ns_param    logpartialtimes     true    ;# false, include high resolution start time
                                                ;# and partial request durations (accept, queue, filter, run)
        ....

4.2. Asynchronous Logfile Writer

To activate the asynchronous logfile writer, set asynclogwriter to true. Per default this feature is turned off.

ns_section ns/parameters
        ....
        ns_param        asynclogwriter          true            ;# default: false
        ....

4.3. Non-blocking SSL Driver

The new version of the SSL driver nsssl is implemented asynchronously. The driver has as an additional new feature the ability to use the TCP performance option TCP_DEFER_ACCEPT, which is per default turned off. When defer accept is activated, the handshake is simplified and the accept function is only called at the time when data arrives.

ns_section ns/server/${server}/module/nsssl
       ....
       ns_param      deferaccept                true            ;# default false
       ....

Note that TCP_DEFER_ACCEPT is available in linux, bot not on all platforms. FreeBSD has a similar mechanism called an acceptfilter. If the the server is compiled on a platform where SO_ACCEPTFILTER is available, this will be used when deferaccept is activated.

The SO_ACCEPTFILTER change was also implemented for nssock.

4.4. Streaming HTML output via writer

The server can now stream HTML output optionally via the writer thread. When sending server responses over slow connections, this option reduces the blocking time of the connection threads and can therfore reduce the damage of slow-read attacks.

ns_section ns/server/${server}/module/nssock
           ...
           ns_param   writerstreaming   true    ;# default: alse
           ...