The Webalizer includes facilities for reporting IP-to-name DNS translations, but it's a terribly slow and inefficient process. Even the inclusion of webazolver, a preprocessor that does nothing but populate a name cache via multiple child processes, is badly impacted by unreachable nameservers, which are very common.
Our own website sees relatively little traffic, but log processing still took much too long, and it was clear that DNS inverse lookups were the hangup. So we studied how The Webalizer used the cache and wrote our own program using asynchronous DNS. It's dramatically faster and more efficient.
On a dual-processor 500MHz Linux machine, fastzolver was able to achieve more than 100 resolutions per second when run in unlimited-queries mode (though this ought not be typical, as it's very hard on a nameserver), all in a single thread.
Most DNS lookups are synchronous, which means that the requesting process blocks until a response is received or a timeout occurs. These timeouts are typically fairly long - 30 seconds or one minute - and absolutely nothing is done by the requesting process during this time.
When inverse DNS servers respond immediately, a series of lookups can proceed rapidly, but it doesn't take many unreachable servers before runtime shoots up dramatically. This can be ameliorated somewhat with multiple child processes, but that requires a lot of coordination and still suffers from the waiting-for-reply hangs.
Our approach uses asynchronous DNS lookups, which send a request and immediately continue with other work (such as performing other lookups). A note about the outstanding request is kept on a list, and when a reply arrives, it's mated with its associated request and the name is resolved. Timeouts form a pseudo-reply that likewise completes the process.
By taking this approach, nothing is blocked while waiting for a reply that comes late (or never at all), and we can run as many outstanding requests as we have memory, bandwidth, and an available nameserver.
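The bookkeeping described above can be sketched in a few dozen lines. This is a minimal simulation — not fastzolver's actual code — using hypothetical names (Resolver, submit, on_reply, expire) and an abstract clock rather than real network I/O:

```cpp
#include <cassert>
#include <map>
#include <string>

// Minimal sketch of the outstanding-request list: each submitted lookup
// gets an entry; a matching reply or a timeout completes it.
struct Pending {
    std::string ip;   // address being resolved
    long deadline;    // when to give up (abstract clock ticks)
};

struct Resolver {
    std::map<int, Pending> pending;           // outstanding requests by id
    std::map<std::string, std::string> done;  // ip -> resolved name (or ip itself)
    int next_id = 0;

    // Send a request and return immediately: nothing blocks here.
    int submit(const std::string& ip, long now, long timeout) {
        int id = next_id++;
        pending[id] = Pending{ip, now + timeout};
        return id;
    }

    // A reply arrives: mate it with its associated request.
    void on_reply(int id, const std::string& name) {
        auto it = pending.find(id);
        if (it == pending.end()) return;      // late reply for a timed-out query
        done[it->second.ip] = name;
        pending.erase(it);
    }

    // Timeouts form a pseudo-reply: the IP string stands in for the name.
    void expire(long now) {
        for (auto it = pending.begin(); it != pending.end(); ) {
            if (now >= it->second.deadline) {
                done[it->second.ip] = it->second.ip;  // negative result
                it = pending.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```

Because submit returns immediately, any number of lookups can be in flight at once; a slow or dead nameserver costs only one pending entry, not a stalled process.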
Rolling our own asynchronous DNS with the standard resolver libraries is a daunting task, so we instead use the wonderful adns library written by Ian Jackson. It provides exactly the support needed for a program of this kind, and its integration has been very smooth in this and other projects.
Like webalizer, this program reads webserver logfiles, but fastzolver has virtually no real knowledge of their format. It knows that the first word on each line is an IP address, and the rest of the line is not considered in any way. This makes fastzolver suitable for general "resolve this list of addresses" purposes.
The whole process revolves around a database cache file maintained in Berkeley DB format: fastzolver populates it with IP-to-name translations, and webalizer reads it while processing the logs in detail. Each cache entry contains four items:
While processing the log, fastzolver extracts an IP address from the start of each line and, as an optimization, always skips any line with the same IP as the previous line: runs of the same IP address are very common.
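The extraction-and-skip step amounts to pulling the first whitespace-delimited word and comparing it to the previous one. A small sketch (the function name is ours, not fastzolver's):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Pull the first word (the IP) from each log line and drop consecutive
// repeats; only runs are collapsed, so a revisiting IP is kept again later.
std::vector<std::string> unique_run_ips(const std::vector<std::string>& lines) {
    std::vector<std::string> out;
    std::string prev;
    for (const auto& line : lines) {
        std::istringstream ss(line);
        std::string ip;
        if (!(ss >> ip)) continue;   // skip blank lines
        if (ip == prev) continue;    // same IP as previous line: skip
        out.push_back(ip);
        prev = ip;
    }
    return out;
}
```

Note that this is cheaper than full deduplication: no hash table is needed for the skip itself, since the cache lookup that follows handles non-adjacent repeats.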
The cache is consulted using the IP address, and one of three results is obtained:
In case #1, this IP address is considered "translated" and no more work need be done: fastzolver moves on to the next entry in the file.
Otherwise, any existing stale entries are deleted outright, and a new entry saved that uses the IP address string itself as the hostname. This is really just a placeholder, and it's replaced later with the actual hostname if it's located. Otherwise the IP address string remains as a kind of negative cache entry so it's not repeatedly looked up without success.
The cache algorithm treats a valid hostname differently from a not-found IP address string on the assumption that the former is likely to be valid longer, and that the latter could be due to a connectivity issue that may be resolved soon. Default cache expiry times are 5 days for a hostname and one day for an IP address, though both can be modified on the command line.
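The staleness test reduces to one comparison: a placeholder is detected by its name being the IP string itself, and it gets the shorter lifetime. A sketch with hypothetical names (fastzolver's actual field layout and defaults may be set differently on the command line):

```cpp
#include <cassert>
#include <string>

// Default expiry times (overridable on the command line).
const long HOSTNAME_TTL = 5 * 24 * 3600;  // 5 days for a real hostname
const long IP_TTL       = 1 * 24 * 3600;  // 1 day for a placeholder IP string

// An entry whose "name" is just the IP string is a negative cache entry:
// the failure may have been transient, so it expires sooner.
bool is_stale(const std::string& ip, const std::string& name,
              long stored_at, long now) {
    long ttl = (name == ip) ? IP_TTL : HOSTNAME_TTL;
    return now - stored_at > ttl;
}
```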
Curiously, there is no explicit process that does large-scale expiry of cache entries. Though it would certainly be possible to create one that made a standalone expiration pass, it wouldn't produce any meaningful difference from the existing behavior that expires upon new lookup.
Though deleting records in Berkeley DB files makes room available for other records in the future, it won't ever return the space to the filesystem: this means that files grow, but they don't shrink. We recommend removing the DNS cache file periodically (perhaps once a month) to clean this out.
This program can be run standalone from the command line or included in the same daily log-rotation scripts that drive The Webalizer. These are the options supported on the command line:
...
STATS: 359949 lines, 85 pending, 32425 results (21524 success)
STATS: 362172 lines, 99 pending, 32621 results (21659 success)
STATS: 364409 lines, 105 pending, 32835 results (21794 success)
STATS: 365875 lines, 83 pending, 32981 results (21878 success)
STATS: 367334 lines, 80 pending, 33120 results (21954 success)
STATS: 368603 lines, 87 pending, 33245 results (22026 success)
STATS: 370203 lines, 83 pending, 33377 results (22097 success)
...
Processed 35668 queries in 307 seconds (116.18 q/sec)
Generally speaking, fastzolver is nearly a drop-in replacement for webazolver and is easy to integrate into an existing log-processing scheme. This is not a tutorial.
Most users perform their log processing at midnight from cron with a common set of steps:
A prototype of one manifestation of this scheme - without quite a bit of error checking - could be written as a shell script and saved in /usr/local/bin/cron-run-webalizer. Then cron would launch the script nightly just after midnight:
#
# Just a prototype!
#
OLDLOG=`date +access_log.%Y-%m-%d`

cd /home/apache/logs
mv access_log $OLDLOG

apachectl graceful
sleep 60

fastzolver -L100 -D/tmp/dns_cache.db $OLDLOG
webalizer -p -N0 -D/tmp/dns_cache.db $OLDLOG

gzip $OLDLOG
There are, of course, other scenarios, such as those built on top of the logrotate utility or which support more than one website (and therefore more than one logfile). In the latter case, there is no reason that all log processing can't share a common DNS cache file so that all benefit from prior lookups.
Source code to fastzolver is available from this website:
It requires a Linux/Unix system, a C++ compiler, GNU make, the ADNS library, and the Berkeley DB library. It may optionally use ZLIB to process compressed logfiles.
ZLIB support is enabled by default, but may be disabled by commenting out the CFLAGS += -DSUPPORT_ZLIB line in the makefile.
The same version of the Berkeley DB library as used with webalizer is required here - the shared DNS cache database is in this format. We're using the version 1.85 compatibility mode: most default installations contain this, but those building db4 from source must include the --enable-compat185 option when configuring the tree.
First published: 2005/08/06