Sun, 3 Nov 1996 15:42:08 +0200
This problem has just been fixed. Please let me know privately if it
happens again (and if the "Date:" field in the error message is later
than 08:30 Eastern time).
PLUM.EASE.LSOFT.COM is the first half of a fully redundant LISTSERV setup
that we are going to build over the next couple of months (the rest of
the hardware will take about one month to arrive, because these are very
new models with an order backlog). Currently we have redundant mail
delivery machines, but the machine running LISTSERV is a single point of
failure. In practice this machine has proved very reliable, and the only
incident was a disk crash in February, which did not cause any loss of
data since the disk was mirrored. But there are a lot of sites for whom
this design is not acceptable; they need to be able to blow up a box
completely and still have a working service, for instance because they
are in the business of sending warnings about natural disasters. Another
issue is that taking the system down for maintenance causes an
interruption of service. In some businesses, a 15-min maintenance window
may just not be acceptable.
So, we are going to build the necessary tools to make it possible to have
a truly uninterrupted, 24x365 LISTSERV service. Currently this can only
be achieved using VMS cluster technology (there is a cluster product for
NT, but in its current version it does not have the necessary
functionality). To give this setup the level of testing it needs, we are
going to use it to take over PEACH's role in the DISTRIBUTE backbone.
Currently this is done by having PEACH forward all its mail to PLUM, so
that we can back out immediately if there should be a problem, or when we
need to do maintenance on PLUM (which we will have to do as the new
components of the setup arrive).
The error that you noticed was a bug in the in-transit spam detector that
we run on PEACH and now on PLUM. This bug had not been detected until now
due to operating system specific considerations, but PEACH would probably
have run into it within the next few months, and its performance was
already being impacted by a side effect of the bug. PLUM is now the
largest LISTSERV site in the world and there are problems that you only
detect with this level of traffic. Some operating systems may hide
certain problems and I was actually expecting to find a couple of bugs
during the transition. I made the switch late at night and monitored the
system for the next 3-4h, and everything worked fine, but obviously this
one decided to strike while I was asleep. At the rate mail is coming
through, it shouldn't take very long for other problems to be found.
Eric