LISTSERV - LSTSRV-L Archives - COMMUNITY.EMAILOGY.COM

It seems that  the corrupted X-DEL and X-LUPD jobs  have stopped flowing,
but for all  we know there might be another  outburst tomorrow. The cause
of the problem on the FRMOP22-FRORS13-FRORS12 (and back) path has not yet
been  identified.  Having  still  heard strictly  nothing  regarding  the
service level of RSCS on the  FRORSxx machines, despite the fact that the
EARN Office is located in the  same building, and judging from the amount
of postings on that topic I have seen recently on the EARN-NOG list, I am
afraid  I  am  NOT  given  the impression  that  this  problem  is  being
investigated with the priority that it deserves. Meanwhile, all the X-DEL
problems are being blamed on LISTSERV (i.e.  me), and if I  hadn't lost my
spool this morning I would have spent the whole day answering complaints.
 
Since I  have come to  the unfortunate  conclusion that we  simply cannot
trust the  network to deliver files  uncorrupted (nor can we  rely on the
willingness of staff at key sites  to investigate such problems), I spent
the day writing and testing code to make LISTSERV checksum the DISTRIBUTE
jobs  it is  sending.  This  is admittedly  preposterous  - a  high-level
application  written in  an  interpreted language  checksumming its  data
because it cannot rely on a transport protocol as old and well understood
as NJE. But I feel there is simply  no other choice, as a number of sites
have already "expressed their concern" over the network and CPU resources
eaten by the last  X-DEL storm. Next time it happens,  they will start to
be "worried  about the  continued existence of  LISTSERV at  UOFXYZ", and
after a couple more times they will be "sad to report that management has
decided the  cost in  manpower, system resources  and membership  dues of
BITNET far outweigh the benefits of a direct connection". Having LISTSERV
checksum DISTRIBUTE jobs will:
 
1. Pinpoint  the link(s) causing  the corruption,  as each server  on the
   DISTRIBUTE path will verify that the checksum is correct.
 
2. Ensure that problems such as the X-DEL storm cannot happen any longer,
   as jobs  failing the CRC  check will  not be distributed.  Even though
   there will always be servers that do  not run the CRC code, having the
   "core" of the  backbone check CRCs and  discard corrupted jobs  should
   considerably  decrease the  duplication  factor and  thus the  overall
   impact of such jobs on the network.
 
3. Have  a positive  impact on  sites which are  starting to  have second
   thoughts about the use of LISTSERV and BITNET in general.
 
4. Considerably  reduce the number of  corrupted mail files shown  to end
   users (but  increase the number  of "lost" postings if  postmasters do
   not edit and resubmit the jobs as they ought to). Corrupted mail files
   are really BAD  press; lost files are common on  the Internet, and not
   quite as visible. This may sound  ridiculous, since I'd rather have my
   mail with some  trash appended to it  than no mail at  all, but that's
   not the way the users react to corrupted vs lost mail.
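
As a rough illustration of points 1 and 2, here is a minimal sketch of a
job being relayed along a DISTRIBUTE path. Everything in it is an
assumption made for illustration: the actual checksum algorithm is not
specified above, so CRC-32 from Python's zlib stands in for it, and the
names job_checksum, relay and trace, as well as the server names in the
usage note, are hypothetical.

```python
import zlib


def job_checksum(records):
    """CRC-32 over all records of a job (stand-in algorithm, an assumption).

    Each record is followed by a newline so that record boundaries
    contribute to the checksum.
    """
    crc = 0
    for record in records:
        crc = zlib.crc32(record.encode("utf-8") + b"\n", crc)
    return crc


def relay(records, claimed_crc, checks_crc=True):
    """Decide what one server does with an incoming job.

    A server without the CRC code (checks_crc=False) forwards the job and
    passes the claimed checksum along untouched. A server with the CRC
    code rejects the job if the recomputed checksum does not match.
    """
    if checks_crc and job_checksum(records) != claimed_crc:
        return ("reject", claimed_crc)
    return ("forward", claimed_crc)


def trace(path, records, claimed_crc):
    """Follow a job along a path and return the first server rejecting it.

    path is a list of (server_name, checks_crc, corrupt) tuples, where
    corrupt is either None or a function applied to the records on the
    link leading into that server. Returns None if nobody rejects.
    """
    for name, checks_crc, corrupt in path:
        if corrupt is not None:
            records = corrupt(records)
        verdict, claimed_crc = relay(records, claimed_crc, checks_crc)
        if verdict == "reject":
            return name
    return None
```

With a hypothetical path A -> B -> C in which B does not run the CRC
code and the link into B appends garbage, trace reports C: the damage is
located somewhere between the last server that accepted the checksum (A)
and the first one that rejected it (C), and the job goes no further.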
 
The cost is about 200 System/370 instructions per record checksummed, i.e.
some 50ms  of CPU time on  a 9370-60 for  your average job, one  tenth of
that on a 3090 - plus 2 extra disk I/Os. That is much less than what the
1.7 performance improvements  will save you on DISTRIBUTE  jobs, but it's
still a pity, especially if you are I/O constrained.
 
This is  a significant  change, since it  can potentially  reject large
numbers of jobs. It must  be beta-tested carefully, and such beta-testing
will require more time  than usual. No data will be lost -  if the job is
rejected,  it ends  up  in your  reader  and you  can  resubmit it  after
removing the '//CRC DD'  card. But you may well end up  having to do that
for hundreds of  jobs (I hope it  won't happen, I've done all  I could to
test it locally, but we won't know until we try on a larger scale). It is
equally  important  to  check  that the  checksums  are  propagated  (but
ignored) by servers not  running with the CRC code -  that means not only
1.6e,  but 1.5o  and LISTEARN,  which  each have  a different  DISTRIBUTE
processor. Finally, you must  be ready to revert to the  old code in case
of  major job  rejection,  especially as  I will  be  in Copenhagen  from
Thursday (910613)  to the  following Friday  (910620) and  available only
during the  evening (there is  a terminal  room on the  conference site).
Still, the problem is pressing and I  would like to begin testing as soon
as possible; ideally,  I ought to be  able to take care  of any "obvious"
problem this evening  or tomorrow, and one can hope  that everything else
should work fine and I could safely  proceed to release 1.7 when I'm back
from Copenhagen. Once  the code is running on the  major hub sites, X-DEL
filters can  be pulled out  and the LISTSERV  network (not to  mention my
mailbox) can resume its quiet everyday mode of operation.
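
If hundreds of rejected jobs do pile up in a reader, the resubmission
step mentioned above (removing the '//CRC DD' card) could be scripted
along these lines. This is only a sketch under a loud assumption: it
treats the card as a single record beginning with '//CRC DD', which the
text above does not confirm. The function name strip_crc_card and the
sample job in the usage note are hypothetical.

```python
def strip_crc_card(records):
    """Return the job's records with any '//CRC DD' card removed.

    Assumption: the card is one record that starts with '//CRC DD'
    (matched case-insensitively, ignoring leading blanks). All other
    records are returned unchanged and in order.
    """
    return [
        record
        for record in records
        if not record.lstrip().upper().startswith("//CRC DD")
    ]
```

For a made-up job such as ["//X JOB", "//CRC DD 1A2B", "//DATA DD *",
"hello", "/*"], the result keeps every record except the second one. If
the real card turns out to span several records, the filter would need
to drop the continuation records as well.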
 
Thanks in advance for your cooperation.
 
  Eric