Many people have written to complain about duplicate messages from
LSOFT.COM. Here is what happened. On Thursday night US time, PSUVM began
unleashing some 3.5M deliveries onto our machines in the space of a few
hours, following the resolution of a routing outage at CICnet, PSU's
provider. Our delivery machines are normally configured in redundant
mode, where only one machine is actually delivering mail while the other
acts as a backup. In this mode we normally have a delivery capacity of
around 210k/hour, depending on network outages, route flaps, etc. This is
with a normal queue (at most a couple hundred thousand recipients) and
input stream. With the kind of queue we are talking about here,
performance was reduced to 180k/hour. On Friday morning we reconfigured
the delivery machines in non-redundant mode, increasing overall
throughput to around 300k/hour (again, less than our normal throughput
due to the sheer size of the queue). Even at 300k/hour, a 3.5M backlog
would take around 12h to go through, and naturally we were also getting
our regular, daily ~2.5M deliveries from our everyday operations. The
bottom line is that we had to deliver 6,361,742 messages on 11/22,
instead of the usual 2.5M. So, things were slow, and some of our hosts
were unresponsive yesterday.
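For anyone who wants to check the arithmetic, here is a rough sketch in Python using the figures quoted above. The function name is made up for illustration, and spreading the regular 2.5M deliveries evenly over 24 hours is a simplifying assumption, not a description of our actual traffic pattern.

```python
def hours_to_clear(backlog, delivery_rate, inflow_rate=0.0):
    """Hours needed to drain `backlog` messages.

    delivery_rate and inflow_rate are in messages per hour. The queue
    only shrinks at the difference between the two rates.
    """
    net_rate = delivery_rate - inflow_rate
    if net_rate <= 0:
        raise ValueError("inflow meets or exceeds capacity; queue never drains")
    return backlog / net_rate


BACKLOG = 3_500_000          # deliveries received from PSUVM
NON_REDUNDANT = 300_000      # per hour, both machines delivering
WITH_NEW_SERVER = 800_000    # per hour, estimated total with the new machine
REGULAR_DAILY = 2_500_000    # everyday workload, per day

# Backlog alone at 300k/hour: roughly the "around 12h" quoted above.
backlog_only = hours_to_clear(BACKLOG, NON_REDUNDANT)

# Assumption: regular traffic arrives evenly through the day.
with_regular_load = hours_to_clear(BACKLOG, NON_REDUNDANT, REGULAR_DAILY / 24)

# Backlog alone at the estimated 800k/hour.
with_new_server = hours_to_clear(BACKLOG, WITH_NEW_SERVER)

print(f"backlog only:      {backlog_only:.1f} h")
print(f"with regular load: {with_regular_load:.1f} h")
print(f"with new server:   {with_new_server:.1f} h")
```

Under that even-inflow assumption the regular workload eats about a third of the 300k/hour capacity, which is why the backlog took noticeably longer than 12 hours to clear.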

Unfortunately, some messages were also duplicated Thursday night. This is
because, in redundant mode, all the new messages had to be processed by
one machine, and the configuration we were using did not allow it to
handle a queue of that size on its own. To give you an idea, the second
largest outage/backlog we have had to deal with involved delivering an
extra 1.1M messages, as compared to a normal day. Here we ended up
delivering 3.8M more than on a normal day. The server had been configured
with a large file cache, which still left more RAM for LSMTP than it had
ever attempted to use (even on the worst outage on record). The virtual
storage quota for LSMTP had been set accordingly (no use allowing it to
get into a situation where paging activity would bring the system to its
knees). With a queue this size, LSMTP kept running into that quota and
crashing every 30-45 minutes, and any messages that were in the process
of being sent at the time were resent after the restart. We fixed this
on Friday morning by increasing the amount of
storage available to LSMTP and its virtual storage quota, and by
splitting the queue between the two main delivery machines.
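To illustrate why a crash in the middle of a delivery run produces duplicates, here is a toy model in Python. It is only a sketch of the general "resend whatever was in flight" behavior described above, not LSMTP's actual implementation; the function name, the batch structure and the crash schedule are all hypothetical.

```python
from collections import Counter


def simulate_delivery(messages, batch_size, crashes):
    """Deliver messages in batches, resending a batch after a crash.

    `crashes` maps a batch number to how many messages of that batch had
    already gone out when the process died. The batch is only removed
    from the queue once it completes, so a restart resends all of it.
    """
    crashes = dict(crashes)          # copy so the caller's dict is untouched
    deliveries = Counter()           # message -> number of times delivered
    queue = list(messages)
    batch_no = 0
    while queue:
        batch = queue[:batch_size]
        sent_before_crash = crashes.pop(batch_no, None)
        if sent_before_crash is not None:
            # These went out, but the crash came before the batch was
            # recorded as done, so the whole batch is still queued.
            for msg in batch[:sent_before_crash]:
                deliveries[msg] += 1
            continue                 # restart: same batch is sent again
        for msg in batch:
            deliveries[msg] += 1
        queue = queue[batch_size:]   # batch completed, drop it
        batch_no += 1
    return deliveries


messages = [f"m{i}" for i in range(10)]
result = simulate_delivery(messages, batch_size=4, crashes={1: 3})
duplicates = sorted(m for m, n in result.items() if n > 1)
print("delivered more than once:", duplicates)
```

In this model nothing is ever lost, but every message that had already gone out from the interrupted batch is delivered twice, which matches what people saw Thursday night.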

As you may know, we are about to upgrade this setup to a fully redundant
configuration based on a VMS cluster. The new server was shipped on
Wednesday and should in principle arrive Monday (a bit too late, but it
never rains...). This machine will have 512M of RAM and should be able to
deliver 550k messages an hour. If we had had it yesterday, we would have
been able to clear the backlog in a few hours (our total throughput would
have been around 800k/hour), and there would have been no duplicates. It
will take a couple of months for the new clustered configuration to go
online as we need to make a few software changes (and test them!). We will
migrate the production workload to the new server when it arrives, and
use the old one as a test machine until we are ready to go live with the
clustered setup. The old server has also been upgraded and we expect that
it will be able to handle 400-450k/hour. However, this requires the
installation of VMS 7.1, which will be released in a week or two.

At any rate, we apologize for the inconvenience, but there is no need to
keep reporting duplicates unless they occurred on Saturday or later
(per the time stamp on the "Received:" line for PEACH.EASE.LSOFT.COM). It
is also possible that messages might get duplicated down the line. Since
we sent 2.5 times as much traffic as on a regular day, mail servers all
over the world have received 2.5 times as much traffic from us as on a
regular day. For most sites, this is not likely to make any difference,
but large sites which get large absolute numbers of deliveries from us
may have been impacted. For instance, we sent around 650,000 deliveries
to AOL yesterday. This is not to say that there has been a problem at
AOL, just that the absolute numbers for some sites may have been
significant and may have caused problems for the mail servers in
question.

Eric