LSTSRV-L Archives

LISTSERV Site Administrators' Forum

Eric Thomas
Sat, 23 Nov 1996 17:32:18 +0100
Many  people  have written  to  complain  about duplicate  messages  from
LSOFT.COM. Here is what happened. On  Thursday night US time, PSUVM began
unleashing some 3.5M  deliveries onto our machines in the  space of a few
hours,  following the  resolution of  a routing  outage at  CICnet, PSU's
provider.  Our delivery  machines  are normally  configured in  redundant
mode, where only one machine is  actually delivering mail while the other
acts as a  backup. In this mode  we normally have a  delivery capacity of
around 210k/hour, depending on network outages, route flaps, etc. This is
with a normal queue (at most a couple hundred thousand recipients) and
input  stream.  With  the  kind  of queue  we  are  talking  about  here,
performance was reduced  to 180k/hour. On Friday  morning we reconfigured
the  delivery   machines  in   non-redundant  mode,   increasing  overall
throughput to  around 300k/hour (again,  less than our  normal throughput
due to the  immenseness of the queue). Even at  300k/hour, a 3.5M backlog
would take around  12h to go through, and naturally  we were also getting
our regular,  daily ~2.5M  deliveries from  our everyday  operations. The
bottom  line is  that  we had  to deliver  6,361,742  messages on  11/22,
instead of the  usual 2.5M. So, things  were slow, and some  of our hosts
were unresponsive yesterday.
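
As a rough check on the figures above, here is a minimal sketch of the
drain-time arithmetic. It assumes constant delivery rates and, for the
second figure, that the regular ~2.5M daily deliveries arrive evenly over
24 hours; both are simplifications, and the function name is made up for
illustration.

```python
def hours_to_clear(backlog, rate_per_hour, inflow_per_hour=0.0):
    """Hours needed to drain `backlog` at a constant rate, net of new inflow."""
    net = rate_per_hour - inflow_per_hour
    if net <= 0:
        raise ValueError("queue never drains: inflow >= delivery rate")
    return backlog / net

BACKLOG = 3_500_000   # deliveries received from PSUVM
DAILY = 2_500_000     # regular daily deliveries (assumed evenly spread)

# Backlog alone at the non-redundant rate: about 11.7 hours ("around 12h").
backlog_only = hours_to_clear(BACKLOG, 300_000)

# Same rate, with regular traffic still arriving: roughly 18 hours.
with_regular = hours_to_clear(BACKLOG, 300_000, DAILY / 24)

# Redundant mode at the reduced 180k/hour rate, backlog alone: about 19 hours.
redundant = hours_to_clear(BACKLOG, 180_000)

print(f"{backlog_only:.1f}h, {with_regular:.1f}h, {redundant:.1f}h")
```

The second figure shows why the backlog dominated the whole day: once the
regular input stream is counted, the effective drain rate is well below
the nominal 300k/hour.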
 
Unfortunately, some messages were also duplicated Thursday night. This is
because, in redundant  mode, all the new messages had  to be processed by
one machine,  and the  configuration we  were using did  not allow  it to
handle a queue of  that size on its own. To give you  an idea, the second
largest outage/backlog  we have had  to deal with involved  delivering an
extra  1.1M messages,  as compared  to  a normal  day. Here  we ended  up
delivering 3.8M more than on a normal day. The server had been configured
with a large file cache, which still  left more RAM for LSMTP than it had
ever attempted to  use (even on the worst outage  of record). The virtual
storage quota for  LSMTP had been set accordingly (no  use allowing it to
get into a situation where paging  activity would bring the system to its
knees). So LSMTP was crashing every  30-45 minutes, and any messages that
had been in the  process of being sent at the time  would be resent after
the restart. We fixed this on  Friday morning by increasing the amount of
storage  available  to  LSMTP  and  its virtual  storage  quota,  and  by
splitting the queue between the two main delivery machines.
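
To illustrate why a crash and restart produces duplicates, here is a toy
model. It is not how LSMTP works internally; it simply assumes a sender
that records a batch as done only after the whole batch has gone out, so
a crash mid-batch causes the entire batch to be resent. All names and
parameters are hypothetical.

```python
def deliver_with_crashes(messages, batch_size, crash_every=None):
    """Simulate delivery where every `crash_every`-th batch attempt crashes.

    Assumption: completion is recorded per batch, so a crashed batch is
    retried in full after the restart, duplicating whatever already went out.
    """
    if crash_every is not None and crash_every < 2:
        raise ValueError("crash_every must be >= 2 or the queue never drains")
    delivered = []   # what recipients actually receive, in order
    done = 0         # index of the first message not yet recorded as sent
    attempts = 0
    while done < len(messages):
        batch = messages[done:done + batch_size]
        attempts += 1
        if crash_every is not None and attempts % crash_every == 0:
            # Crash: half the batch went out, but nothing was recorded.
            delivered.extend(batch[:len(batch) // 2])
            continue  # restart; `done` is unchanged, so the batch is retried
        delivered.extend(batch)
        done += len(batch)
    return delivered

out = deliver_with_crashes(list(range(10)), batch_size=4, crash_every=2)
duplicates = sorted({m for m in out if out.count(m) > 1})
print(out)         # every message arrives, some more than once
print(duplicates)  # the messages that were in flight at a crash
```

In this model nothing is lost, but anything in flight at the time of a
crash is delivered twice, which matches the symptom described above.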
 
As you may know  we are about to upgrade this setup  to a fully redundant
configuration  based on  a VMS  cluster. The  new server  was shipped  on
Wednesday and should  in principle arrive Monday (a bit  too late, but it
never rains...). This machine will have 512M of RAM and should be able to
deliver 550k messages an hour. If we  had had it yesterday, we would have
been able to clear the backlog in a few hours (our total throughput would
have been around 800k/hour), and there would have been no duplicates. It
will  take a  couple months  for the  new clustered  configuration to  go
online as we need to make a few software changes (and test them!) We will
migrate the  production workload to the  new server when it  arrives, and
use the old one as a test machine  until we are ready to go live with the
clustered setup. The old server has also been upgraded and we expect that
it will be able to handle 400-450k/hour; however, this requires the
installation of VMS 7.1, which will be released in a week or two.
 
At any rate, we apologize for the  inconvenience, but there is no need to
keep reporting duplicates unless they  have occurred on Saturday or later
(per the time stamp on the "Received:" line for PEACH.EASE.LSOFT.COM). It
is also possible that messages might  get duplicated down the line. Since
we sent 2.5 times  as much traffic as on a regular  day, mail servers all
over the world  have received 2.5 times  as much traffic from us  as on a
regular day. For  most sites, this is not likely  to make any difference,
but large  sites which get large  absolute numbers of deliveries  from us
may have been  impacted. For instance, we sent  around 650,000 deliveries
to AOL  yesterday. This is not  to say that  there has been a  problem at
AOL,  just  that the  absolute  numbers  for  some  sites may  have  been
significant and may have caused problems for the mail servers in
question.
 
  Eric
