On Sep 15, UIUC reported on the core operators' list that their core
system, a dedicated 4381 running the UIUCVM42 core node, could no longer
handle the load it was subjected to. It was quickly established that the
machine is simply out of steam. It is one of the smallest machines on the
core, it is 100% busy 24 hours a day, and it has reached a point where the
smallest number of files in LISTSERV's input queue is on the order of
3000, in the middle of the night on Sundays. This machine and the
manpower to operate it are provided by UIUC on a volunteer basis, and we
can only thank UIUC for their continuing support, dedication, and
generosity. However, this problem does need to be solved. Several L-Soft
customers have complained about LISTSERV delivery delays of up to EIGHT
DAYS. Needless to say, this is totally unacceptable to the average user.

At first, we told people that the UIUCVM42 issue was being investigated.
The issue was being discussed on the core operators' list, and we hoped
that a solution would be found shortly. Unfortunately, this has not
happened. What's worse, nobody seems to have taken ownership of the
problem. We can't even tell our customers that a solution is being
actively implemented and is expected to be ready by a certain date,
because, to the best of our knowledge, nothing at all is being
implemented. This is intolerable. Our customers are not interested in our
explanations of the delicate, volunteer-based core support structure.
They pay us good money for software which happens to use the core. They
demand service. In their opinion, if the core needs 8 days to process a
LISTSERV distribution, the core should be either fixed or terminated,
because a structure that needs 8 days to deliver mail is simply not
useful. And they are right.

In order to ensure that our customers do receive a decent level of
service, we have had no option but to remove UIUCVM42 from the LISTSERV
backbone (and set UIUCVMD to LOCAL distribution mode, to avoid having it
attract the workload of UIUCVM42). This will bypass the LISTSERV@UIUCVM42
backlog and restore the expected level of service.

This is not a satisfactory solution. In fact, it is a last-resort
solution, and this is why we waited 2 weeks before making this decision.
There was simply no other option. Removing UIUC from the backbone will
increase the level of traffic on the core, and break the INTERBIT
symmetry. We do not expect any major disaster, and there is no cause for
panic. This change simply puts UIUC in the same situation as Cornell,
back when it used to be a core site not running LISTSERV. We expect that
this change will solve the problem in the near future, at the expense of
additional traffic that the core structure can support today. However, we
also expect that other sites will find themselves in a situation similar
to UIUC's over the next 6 months. Removing a core site from the backbone
increases traffic in proportion to the number of remaining core sites on
the backbone. That is, it is bearable the first few times you do it, but
every additional removal becomes more expensive than the previous one.
And, since each removal increases traffic and contributes to saturating
machines and requiring another removal, this is a very dangerous
situation which could get out of control in no time.

Again, UIUC is not to blame for this problem. They are not being paid for
this service, which costs them real money. The machine is out of steam;
the traffic simply has to be moved elsewhere. There are several ways to
move SMTP/INTERBIT traffic from a VM system to a workstation. These are
not experimental mechanisms. SUNET has been running its SMTP service on a
workstation since March 1994. Others have started offloading their
mainframes in a similar fashion. The technology is available, today, to
solve problems such as UIUC's. And, if it is not deployed today, we will
not have a core for long. The only obstacle to this deployment is that
software cannot run on thin air, and someone has to purchase the
workstations in question. Some core sites are willing to spend $10-20k to
buy a workstation for the core service, others aren't, and we can't
really blame them for that. This problem has to be solved by the NJE
connectivity providers, who are getting paid by the participating
organizations for the provision of services that their users find useful
- when the turnaround time is within reasonable bounds, that is. A
comprehensive solution would probably cost around $200k and a minimal
solution, $50k. These are mostly one-time charges for the purchase of
equipment. It is estimated that the NJE connectivity providers collect on
the order of 2 million US dollars a year (worldwide). Thus, the
comprehensive solution would cost about 10% of the yearly dues, and again
most of that is a one-time charge. Since the NJE connectivity providers
have a monopoly, it is not possible for other companies to offer more
competitive or better-operated NJE services. Your only option, as a
representative of your dissatisfied users, is to complain to your NJE
provider, and seek alternate solutions if you do not receive a
satisfactory answer.
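
For anyone who wants to check the cost figures quoted above, here is a
minimal sketch of the arithmetic. It uses only the rough estimates given in
this note ($200k comprehensive, $50k minimal, about $2 million a year in
dues worldwide). The names YEARLY_DUES_USD, SOLUTIONS_USD and share_of_dues
are made up for illustration, and the 2.5% figure for the minimal solution
is derived here rather than stated above.

```python
# Rough estimates quoted in this note; none of these are audited figures.
YEARLY_DUES_USD = 2_000_000          # estimated worldwide NJE dues per year
SOLUTIONS_USD = {
    "comprehensive": 200_000,        # estimated cost of a comprehensive fix
    "minimal": 50_000,               # estimated cost of a minimal fix
}


def share_of_dues(cost_usd, yearly_dues_usd=YEARLY_DUES_USD):
    """Return a cost as a percentage of one year's dues."""
    return 100.0 * cost_usd / yearly_dues_usd


# Percentage of one year's dues that each option would consume.
shares = {name: share_of_dues(cost) for name, cost in SOLUTIONS_USD.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of one year's dues")
# comprehensive: 10.0% of one year's dues
# minimal: 2.5% of one year's dues
```

Since most of the cost is equipment purchased once, these percentages apply
to a single year's dues, not to every year.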
Eric