It seems that the corrupted X-DEL and X-LUPD jobs have stopped flowing,
but for all we know there might be another outburst tomorrow. The cause
of the problem on the FRMOP22-FRORS13-FRORS12 (and back) path has not yet
been identified. Having still heard absolutely nothing regarding the
service level of RSCS on the FRORSxx machines, despite the fact that the
EARN Office is located in the same building, and judging from the number
of postings on that topic I have seen recently on the EARN-NOG list, I am
afraid I do NOT get the impression that this problem is being
investigated with the priority that it deserves.

Meanwhile, all the X-DEL problems are being blamed on LISTSERV (i.e. me),
and if I hadn't lost my spool this morning I would have spent the whole
day answering complaints.

Since I have come to the unfortunate conclusion that we simply cannot
trust the network to deliver files uncorrupted (nor can we rely on the
willingness of staff at key sites to investigate such problems), I spent
the day writing and testing code to make LISTSERV checksum the DISTRIBUTE
jobs it is sending. This is admittedly preposterous - a high-level
application written in an interpreted language checksumming its data
because it cannot rely on a transport protocol as old and well understood
as NJE. But I feel there is simply no other choice, as a number of sites
have already "expressed their concern" over the network and CPU resources
eaten by the last X-DEL storm. Next time it happens, they will start to
be "worried about the continued existence of LISTSERV at UOFXYZ", and
after a couple more times they will be "sad to report that management has
decided the cost in manpower, system resources and membership dues of
BITNET far outweighs the benefits of a direct connection".

Having LISTSERV checksum DISTRIBUTE jobs will:
1. Pinpoint the link(s) causing the corruption, as each server on the
   DISTRIBUTE path will verify that the checksum is correct.
2. Ensure that problems such as the X-DEL storm cannot happen any longer,
   as jobs failing the CRC check will not be distributed. Even though
   there will always be servers that do not run the CRC code, having the
   "core" of the backbone check CRCs and discard corrupted jobs should
   considerably decrease the duplication factor and thus the overall
   impact of such jobs on the network.
3. Have a positive impact on sites which are starting to have second
   thoughts about the use of LISTSERV and BITNET in general.
4. Considerably reduce the number of corrupted mail files shown to end
   users (but increase the number of "lost" postings if postmasters do
   not edit and resubmit the jobs as they ought to). Corrupted mail files
   are really BAD press; lost files are common on the Internet, and not
   quite as visible. This may sound ridiculous, since I'd rather have my
   mail with some trash appended to it than no mail at all, but that's
   not the way the users react to corrupted vs lost mail.
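
To illustrate points 1 and 2, here is a minimal sketch in Python (not the
actual LISTSERV code, which is not shown in this note). It assumes a
CRC-32 computed over the job's records; the real checksum algorithm, the
node names HUB1..HUB3 and the function names are all hypothetical. Each
server on the path recomputes the checksum, and the first one to see a
mismatch both refuses to forward the job and identifies the link the job
just crossed.

```python
import zlib


def job_checksum(records):
    """CRC-32 over all records (assumed algorithm, for illustration only).

    A newline is appended to each record so that record boundaries
    contribute to the checksum.
    """
    crc = 0
    for record in records:
        crc = zlib.crc32(record.encode("ascii") + b"\n", crc)
    return crc


def find_first_failing_link(path, records, expected_crc, corrupt_on=None):
    """Simulate a job travelling along `path`, one link at a time.

    `corrupt_on` optionally names one (upstream, downstream) link that
    appends trash to the job. Returns the link on which the receiving
    server first sees a checksum mismatch (it would discard the job
    there instead of distributing it), or None if every server on the
    path verified the job successfully.
    """
    current = list(records)
    for upstream, downstream in zip(path, path[1:]):
        if corrupt_on == (upstream, downstream):
            current = current + ["GARBAGE"]  # trash appended in transit
        if job_checksum(current) != expected_crc:
            return (upstream, downstream)
    return None


# Hypothetical three-node DISTRIBUTE path.
path = ["HUB1", "HUB2", "HUB3"]
records = ["first record of the job", "second record of the job"]
crc = job_checksum(records)

clean = find_first_failing_link(path, records, crc)
bad = find_first_failing_link(path, records, crc,
                              corrupt_on=("HUB2", "HUB3"))
```

In this sketch `clean` is None and `bad` names the HUB2-HUB3 link, which
is the kind of information that would let us point at a specific link
rather than at LISTSERV.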

The cost is about 200 System/370 instructions per record being
checksummed, i.e. some 50 ms of CPU time on a 9370-60 for your average
job, one tenth of that on a 3090 - plus 2 extra disk I/Os. That is much
less than what the 1.7 performance improvements will save you on
DISTRIBUTE jobs, but it's still a pity, especially if you are I/O
constrained.

This is a significant change, since it can potentially reject large
numbers of jobs. It must be beta-tested carefully, and such beta-testing
will require more time than usual. No data will be lost - if the job is
rejected, it ends up in your reader and you can resubmit it after
removing the '//CRC DD' card. But you may well end up having to do that
for hundreds of jobs (I hope it won't happen, I've done all I could to
test it locally, but we won't know until we try on a larger scale). It is
equally important to check that the checksums are propagated (but
ignored) by servers not running with the CRC code - that means not only
1.6e, but also 1.5o and LISTEARN, which each have a different DISTRIBUTE
processor. Finally, you must be ready to revert to the old code in case
of major job rejection, especially as I will be in Copenhagen from
Thursday (910613) to the following Friday (910620) and available only
during the evening (there is a terminal room on the conference site).

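Since resubmitting a rejected job means removing the '//CRC DD' card,
possibly for hundreds of jobs, a small helper could save some manual
editing. The sketch below is in Python and rests on an assumption not
confirmed in this note: that the card occupies exactly one line beginning
with the literal text '//CRC DD'. The sample job lines and the function
name are hypothetical placeholders, not real DISTRIBUTE syntax.

```python
def strip_crc_card(job_lines):
    """Return the job with every line starting with '//CRC DD' removed.

    Assumes the CRC card is a single line; any other layout would need
    a different filter.
    """
    return [line for line in job_lines if not line.startswith("//CRC DD")]


# Placeholder content standing in for a rejected job found in the reader.
rejected_job = [
    "(first card of the job, placeholder)",
    "//CRC DD (checksum data, layout unknown)",
    "(first data record, placeholder)",
    "(second data record, placeholder)",
]

resubmittable = strip_crc_card(rejected_job)
```

All other lines are passed through unchanged and in their original order,
so the result can be resubmitted as an ordinary job without a checksum.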
Still, the problem is pressing and I would like to begin testing as soon
as possible; ideally, I ought to be able to take care of any "obvious"
problems this evening or tomorrow, and one can hope that everything else
will work fine so that I can safely proceed to release 1.7 when I'm back
from Copenhagen. Once the code is running on the major hub sites, X-DEL
filters can be pulled out and the LISTSERV network (not to mention my
mailbox) can resume its quiet everyday mode of operation.

Thanks in advance for your cooperation.

Eric