I suppose I will be flamed for saying what comes next, but I have a very
good DISCARD key. The situation at CEARN is getting out of control.
LISTSERV has 900 files in its reader right now. A few hours ago I sent it
a command via CP MSG straight from CEARN and it took 25 minutes to
execute it. When it did execute it, I checked the interactive message
queue (LSVIUCV DEPTH) and it said 416. It was getting about as much CPU
time as it needed to discard SENT FILE messages and issue an occasional
RSCS START command, but no time to actually look at its reader.
This is an old problem, and I suspect there are a few other sites with
similar problems, although probably not as critical. The problem of
course is that it pisses off people who are downstream of the server and
have the CPU capacity to handle the files destined for them, yet since
they are farther away topologically speaking they have to go through the
bottleneck node. Removing the bottleneck server from the backbone might
mean exploding on the order of 50 files per job on the other side of a
perhaps not saturated but not precisely idle link and is therefore not an
option. With the stupid topology we have in Europe there are not many
alternatives either, so I looked at the code to see what I could do with
simple changes. The result is development fix 16E-009D; this is a new
type of fix, which you need LFIX release 1.1 to install and which you
should NOT install as preventive service. As soon as you order a
development fix, you are automatically AFD'ed to it and are expected to
re-install the updates you get as you receive them. Unlike other fixes
they can be re-installed over and over as many times as necessary, but to
save disk space only one version of the updated files (the one before you
installed the development fix the first time) is kept; if you want to
keep each and every update when you refresh the fix, you must do so
manually. In other words, the "back off" procedure for a development fix
is to remove the fix completely, rather than fall back to its previous
incarnation (which I will not support because I will not keep a copy).
Anyway the DISTRIBUTE change in question implements a new ':backbone'
option in PEERS NAMES, 'DISTRIBUTE(YES,LOCAL)' (treated as a normal
':backbone.YES' tag by servers not running 16E-009D). This defines the
entry in question as a "non-routing" DISTRIBUTE server, i.e. one that
agrees to receive distributions for recipients in its "service area"
like a normal server but that does not want to be used as a switchyard
for other recipients, even though doing so might save bandwidth. In other
words, it is a statement that bandwidth in this "area" is good enough (or
CPU time is scarce enough) that you will actually waste wall-clock time
in your attempt to save bandwidth, and you should therefore not do so. I
have no idea whether or not this will be sufficient in practice, but
that's the only "simple" change I could come up with.
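
To make the "non-routing" idea concrete, here is a toy sketch in Python.
It is NOT the actual DISTRIBUTE code: the Server class, the pick_server
function, the node names NODEA and NODEB, and the rule "use the eligible
server closest to the recipient" are all assumptions made up for
illustration. The only thing taken from the description above is that a
'DISTRIBUTE(YES,LOCAL)' server takes jobs for its own service area but
is not used as a switchyard for anybody else.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Server:
    """Hypothetical model of one DISTRIBUTE server entry."""
    node: str
    service_area: frozenset      # nodes this server considers "local"
    local_only: bool = False     # True models 'DISTRIBUTE(YES,LOCAL)'


def pick_server(recipient_node: str, path: List[Server]) -> Optional[str]:
    """Pick a server for one recipient.

    `path` lists the servers between the origin and the recipient, in
    order. The (assumed) rule is to take the eligible server closest to
    the recipient. A local-only server is eligible only for recipients
    in its own service area. None means no server on the path will take
    the job, so the origin has to deliver the file itself.
    """
    for server in reversed(path):
        if not server.local_only or recipient_node in server.service_area:
            return server.node
    return None


# The same node declared both ways, with a hypothetical service area.
area = frozenset({"CEARN", "NODEA"})
cearn_routing = Server("CEARN", area)                  # like ':backbone.YES'
cearn_local = Server("CEARN", area, local_only=True)   # non-routing

in_area = pick_server("NODEA", [cearn_local])        # still served by CEARN
bypassed = pick_server("NODEB", [cearn_local])       # CEARN is skipped
switchyard = pick_server("NODEB", [cearn_routing])   # old behaviour
```

With the non-routing entry, a recipient at NODEB no longer waits in the
CEARN reader; the price is that the origin sends that file across the
link itself, which is the bandwidth-for-wall-clock trade described above.
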
I will be distributing a modified PEERS NAMES with modified ':backbone'
tags tomorrow (I have to run home now), to allow sites that want to try
the new algorithm to see whether or not it improves things. The modified
PEERS NAMES will of course work "normally" with unmodified servers.
Eric