A serious problem exists when a peered list has a mix of 1.8 and 1.7
peers, and the 1.8 peers are configured to use Internet addresses
("List-Address= FQDN", or equivalent default). SIGNOFF/DELETE (GLOBAL
commands may cause an infinite loop, which will furthermore fill up the
A-disk of all the peers involved (given enough time). Eventually the
servers will crash and will not be able to restart without manual
intervention.
**************************
* How to restore service *
**************************
If this happens to you, DO NOT DELETE PERMVARS FILE! Use XEDIT or PIPE to
eliminate all the lines containing the string X-REQ (ALL/X-REQ followed
by DEL * under XEDIT). PERMVARS FILE contains important information, and
should be treated as a LIST or FILELIST file, not as LISTSERV NETLOG.
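For readers who want to see exactly what the XEDIT sequence above does, here is a minimal sketch in Python. It is an illustration only, not an L-Soft tool: the function name strip_xreq_lines and the sample records are hypothetical, the real layout of PERMVARS FILE is not reproduced, and the match is assumed to be an exact-case substring test. It keeps every line that does not contain the string X-REQ and leaves the rest of the file alone.

```python
def strip_xreq_lines(lines):
    """Return only the lines that do not contain the string X-REQ."""
    return [line for line in lines if "X-REQ" not in line]


# Hypothetical sample records; the real PERMVARS FILE layout is not shown here.
sample = [
    "SOME-SETTING first value",
    "X-REQ hypothetical looping request entry",
    "OTHER-SETTING second value",
    "X-REQ another hypothetical entry",
]

cleaned = strip_xreq_lines(sample)
print(len(sample) - len(cleaned), "line(s) removed")
```

As with the XEDIT procedure, work on the file in place of deleting it, and keep a backup copy before changing anything.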
Once you have returned PERMVARS FILE to its normal size, you will be able
to start LISTSERV, but the problem will remain until you identify and
remove the looping jobs. They will be in mail files called LISTSERV MAIL,
coming from the servers whose hostnames were in the X-REQ entries in
PERMVARS FILE. The job name will be the same as the name of the list in
question. This may or may not be sufficient to identify the files. There
is only one job to kill per 1.7 peer (jobs from 1.8 peers don't need to
be killed). If you can't identify the jobs, or if you are not sure,
remove all the peer subscriptions from the list in question temporarily.
This will stop the loop.
*************************
* Permanent restriction *
*************************
There are three ways to run a peered list with both 1.7 and 1.8 peers:
1. All 1.8 peers use "List-address= NJE", and all peer subscriptions (the
subscriptions with the name "Peer distribution list") are in BITNET
form. This produces the behaviour the 1.7 servers expect (the only
behaviour they supported) and there is no problem. But, of course, the
1.8 peers cannot take advantage of the "List-address=" support added
in 1.8a and continue to identify themselves under their BITNET
address.
2. The 1.8 peers use "List-address= FQDN", and all peer subscriptions are
in BITNET form. In that case there is no risk of loop, but subscribers
will receive an error message about duplicate postings every time they
post to the list.
3. The 1.8 peers use "List-address= FQDN", and the corresponding peer
subscriptions are changed to their Internet form. This solves the
problem mentioned above, but exposes the list to the loop.
Option 1 is fully supported. Option 2 is supported, but not recommended.
Option 3 is not supported. Note that servers running the base level of
1.8a may experience the symptoms described in option 2. However, they do
not suffer from the loop problem. The message about duplicate postings
does not necessarily mean the loop problem is present.
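The three options above can be summarized as a small lookup. The sketch below is only a restatement of this section in Python, not part of LISTSERV: the function name classify_mixed_peer_list and the string labels "NJE", "FQDN", "BITNET" and "INTERNET" are assumptions chosen for illustration. Combinations this notice does not describe return None rather than a guess.

```python
def classify_mixed_peer_list(list_address, peer_subscription_form):
    """Map a mixed 1.7/1.8 peered-list setup to the option number and
    support status given in this notice, or None if it is not described."""
    key = (list_address.strip().upper(), peer_subscription_form.strip().upper())
    options = {
        # (List-address= value on the 1.8 peers, form of the peer subscriptions)
        ("NJE", "BITNET"): (1, "fully supported"),
        ("FQDN", "BITNET"): (2, "supported, but not recommended"),
        ("FQDN", "INTERNET"): (3, "not supported (exposes the list to the loop)"),
    }
    return options.get(key)


print(classify_mixed_peer_list("FQDN", "Internet"))
```

A site unsure which case applies can check the "List-address=" keyword on its 1.8 peers and the form of the "Peer distribution list" subscriptions, then compare the pair against the table in the code.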
**********************
* Can't it be fixed? *
**********************
The problem cannot be fixed by a change to version 1.8. It is caused by
the algorithm version 1.7 uses to forward certain requests. When the
request comes back to the 1.8a server, it cannot tell whether it is due
to the 1.7 restriction, or caused by a new command from a user, because
the 1.7 server assigns a new ID. The only way to solve this problem in
1.8a would be to disable command forwarding completely.
There are three general strategies to bypass this restriction:
1. Upgrade all participating servers to 1.8.
2. Remove all 1.7 peers.
3. Run the list with "List-address= NJE" until all peers migrate to 1.8.
L-Soft was not aware of this problem until it hit UBVM tonight. This is
the reason why it was not mentioned in the release notes, and why we had
previously advised people to select the options that expose them to the
loop.
As you know, L-Soft offered to deliver 1.8a to all the beneficiaries of
the CREN/L-Soft contract who did not manage to push the paperwork past
their legal department, at L-Soft's own risk (without written agreement).
Because of the use of the word "executed" rather than "agreed" in section
4 of the CREN/L-Soft agreement, L-Soft cannot do this without CREN's
written permission. Regretfully, the last formal response we received
from CREN on this topic was on June 29. CREN did not agree to let us
deliver 1.8a to everyone, although no reason was stated. Instead, CREN
proposed that the deadline be extended by another three months. We turned
down this offer because it would not solve the problem. Fewer than a third
of the beneficiaries returned usable contracts because the contracts are
too complicated and involve three parties (against our recommendation: we
wanted a simple, standalone maintenance agreement with two parties).
Purchasing lawyers are not usually familiar with three-party maintenance
agreements where 18 out of 30 pages are totally out of their control and
not negotiable. Extending the deadline will not
solve this problem. Furthermore, we do not have three months at our
disposal. The backbone must be made LTCP-exploitive by the end of July,
and L-Soft does not believe that 100 universities will return their
contracts over the next two weeks when only about 50 have done so since
March.
What is most unfortunate about this negotiation is that we have still not
been told why CREN opposes our proposal. Not knowing what bothers CREN
about the proposal, and being faced with an unusable counter-proposal,
there is not much we can do in terms of negotiations. We now have no
option but to begin retrofitting LTCP exploitation into 1.7f. Volunteers
for beta-testing are invited to contact L-Soft privately.
Eric