The beta-1.5n global list registration protocol (i.e. the commands that the
servers send to each other to keep their list of lists up to date) did not
provide any kind of recovery and did not detect inconsistencies between the
information stored by the various backbone servers. If, for example, an X-LUPD
file was lost in a spool crash, the corresponding information would be lost:
forever in the case of a DEL request, or until the list was next updated in
the case of a REP request. Furthermore, whenever requests arrived out of
chronological order,
any 'obsolete' request would be discarded, which works fine for REP requests
but may or may not be correct for DEL and ADD requests. This latter problem
could have been solved by keeping the last update date/time along with EACH
entry in the (large) list of lists, but this still wouldn't solve the lost
files problem, and it wouldn't help new backbone sites at all (they would start
with an empty list of lists and only receive updates from the day they were
added to the backbone).

This has been solved in the following way:
- Whenever updates to the list of (local) lists are being generated, the
sending server (which, by definition, has the latest version of the list of
its own local lists) will send a checksum along with the DEL/REP commands.
- The checksum is checked by the target servers. If it doesn't match, an
informational message is sent to the postmaster, and a 'LSVLDELT REFRESHME'
command is sent to the LISTSERV at the node whose lists failed the check.
- A new command, LSVLDELT, has been implemented. It should normally not be
used by postmasters, so I chose a nasty name. There are four options to this
command:
* 'LSVLDELT REFRESHME' causes a "refresh" X-LUPD job to be sent to the
originator. This job contains a 'DEL *' statement and 'REP' instructions
for each list, and NO checksum (to avoid a loop if there somehow happens
to be a bug in the checksum routine). This should normally be used only by
another LISTSERV.
* 'LSVLDELT REFRESH userid@node' causes a "refresh" job to be sent to the
specified address. It should normally never be used.
* 'LSVLDELT REFRESH' causes a "refresh" job to be distributed to ALL the
backbone servers. It may be used when an inconsistency has been (manually)
detected.
* 'LSVLDELT INIT userid@node' causes a complete copy of GLOBLIST FILE to be
sent to the specified address. This should be used by the LMC when a new
server has been added to the backbone, to speed up its refresh process and
to avoid seeing it sending useless 'LSVLDELT REFRESHME' requests
everywhere in the world.
The LSVLDELT command is restricted to the postmaster, LMC and LCOORD.
Additionally, any LISTSERV may use the REFRESHME option (only).
- Each server will make sure to send at least one X-LUPD job per month. If
there has been no change broadcast in the last 30 days, it will send a
dummy job with just a 'CKS' statement. Thus if some previous X-LUPD job was
lost in a COLD start, a checksum error will be detected within 1 month and
the problem will correct itself automatically.
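
The checksum check and the "refresh" job described above can be sketched
roughly as follows. This is only an illustration in Python, not the actual
LISTSERV code: the checksum algorithm is not specified here, so CRC-32 is used
as a stand-in, and the names checksum, apply_job and refresh_job, the tuple
layout of the statements, and the node and list names are all hypothetical.

```python
import zlib

def checksum(lists):
    """Stand-in checksum over one node's lists (the real algorithm is not given)."""
    text = "\n".join(f"{name}\t{desc}" for name, desc in sorted(lists.items()))
    return zlib.crc32(text.encode("utf-8"))

def apply_job(table, origin, statements, cks=None):
    """Apply a job from `origin` to this server's copy; return commands to send."""
    lists = table.setdefault(origin, {})
    for op, name, desc in statements:
        if op == "DEL" and name == "*":
            lists.clear()
        elif op == "DEL":
            lists.pop(name, None)
        elif op == "REP":
            lists[name] = desc
    if cks is not None and checksum(lists) != cks:
        # Mismatch: ask the owning server for a full refresh.
        # (The informational message to the postmaster is omitted here.)
        return [(origin, "LSVLDELT REFRESHME")]
    return []

def refresh_job(local_lists):
    """Build a refresh job: 'DEL *' followed by a REP for every local list."""
    return [("DEL", "*", None)] + [("REP", n, d) for n, d in sorted(local_lists.items())]

# Hypothetical scenario: the REP for LISTA-L was lost in a spool crash.
sender = {"LISTA-L": "First list", "LISTB-L": "Second list"}
receiver = {}
replies = apply_job(receiver, "NODE1", [("REP", "LISTB-L", "Second list")],
                    checksum(sender))
# The refresh job carries no checksum, so it cannot trigger another request.
apply_job(receiver, "NODE1", refresh_job(sender))
```

After the refresh job has been applied, the receiving copy of NODE1's lists
matches the sender's again, and later jobs pass the checksum test.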
This is a bit complicated because of the distributed approach and asynchronous
nature of the network, but it should work without generating too much traffic.
I couldn't think of anything simpler that would detect lost files.
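
For illustration, the "at least one X-LUPD job per month" rule could look
something like the sketch below. Again this is Python and purely hypothetical:
the function name next_job, the way pending changes are represented, and the
argument format of the 'CKS' statement are assumptions, not the real job
syntax.

```python
from datetime import datetime, timedelta

MAX_SILENCE = timedelta(days=30)  # the 30-day limit described above

def next_job(now, last_sent, pending, cks):
    """Return the statements for the next X-LUPD job, or None if none is due."""
    cks_line = f"CKS {cks:08X}"  # assumed format for the checksum statement
    if pending:
        # Real changes always go out, together with the current checksum.
        return pending + [cks_line]
    if now - last_sent >= MAX_SILENCE:
        # Nothing changed for 30 days: send a dummy job with just the checksum.
        return [cks_line]
    return None
```

A server would call something like this periodically; the dummy job is what
lets the other servers notice a lost file even when the lists never change.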
Eric