I have received a number of private queries for background information regarding this INTERBIT problem, and I think it will be simpler to answer them on the list. I keep forgetting that, while this list was started by BITNET old-timers, there are many new people here who are not familiar with the way BITNET works and may never even have heard of "INTERBIT" before :-) I am copying LSTOWN-L since I think this background information may prove useful to list owners in general.

BITNET is a network started a long time ago, and the environment for which LISTSERV was originally written in 1986. BITNET machines use IBM's NJE protocol to talk to each other. The bulk of BITNET consists of machines running (from most to least common) VMS, VM and MVS. The network is organized around a "core" of about a dozen tightly interconnected machines. Regional hubs connect to the core sites, and other machines then connect to the hubs. The core machines, however, bear the bulk of the load, and many are now overloaded. This is the cause of most of the delivery delays we have observed since Oct-Nov. There is also an "EARN core" in Europe, but it is not currently overloaded, so I will not discuss it.

The BITNET core is organized into a number of "regions", with (usually) two "core nodes" per region. These core nodes are IBM mainframes running VM. Unfortunately, for practical and/or historical reasons, the regions are not always balanced, and some core nodes carry more traffic than others. Naturally, these are the machines most likely to run out of steam.

The bulk of the load on the core machines is in fact caused not by the BITNET/NJE traffic itself, but by a BITNET service called "INTERBIT". This is essentially a mail gateway between BITNET and the Internet. BITNET sites that are not connected to the Internet, or that do not wish to run their own gateway, can direct their Internet traffic to INTERBIT, which delivers it through well-managed, properly configured systems.
A technically accurate description of the implementation of the INTERBIT service would take up several pages, so let's just say that, to a large extent, this INTERBIT traffic is delivered by the core systems. On a machine like, say, UBVM, there is a factor of 10 in resource costs (CPU, I/O, etc) between the NJE traffic and the INTERBIT traffic.

The reason problems in the BITNET core impact LISTSERV is that a majority of the LISTSERV servers, including most of the top 10 servers, still run the NJE version of LISTSERV, which needs to use BITNET to communicate with other servers. It is estimated that 80-90% of the BITNET traffic is generated by these LISTSERV-NJE servers. Many LISTSERV-NJE sites also take advantage of the INTERBIT service, which they feel they are paying for with their CREN membership fee (CREN is the organization that owns BITNET in the US).

This situation, incidentally, is the reason for some of the not-so-nice exchanges you may have seen in the past couple of months. Some people feel that L-Soft is abusing the BITNET core structure for corporate profit. In fact, what happens is that universities are paying L-Soft for LISTSERV maintenance and CREN for BITNET services. It is a well-publicized fact that LISTSERV-NJE requires BITNET, whereas LISTSERV-TCP/IP uses the Internet directly. With membership fees of up to $8,000/year, BITNET is not precisely a free network. LISTSERV-NJE is "abusing" BITNET in just the same manner as LISTSERV-TCP/IP is "abusing" the Internet service providers by making use of the bandwidth they are selling to their customers.

At any rate, the BITNET core has been seriously overloaded since about Oct-Nov 94. The resulting delays affect most BITNET and LISTSERV users, and also impact the thousands of people who use the mainframes that implement the core function.
These mainframes are usually general-purpose or administrative machines operated by universities that donate a fraction of the machine's resources for the benefit of the BITNET community. Naturally, it becomes harder and harder for these universities to justify their continued contribution to the BITNET core as the people who actually pay for the machine complain about poor response time.

To provide some relief for this problem, CREN has been deploying a number of "P/370" systems to assist the core mainframes in processing the BITNET traffic. A "P/370" is an IBM PS/2 personal computer with a special "P/370 card". The card contains a 4381-13 CPU chipset, 16M of memory, and a Microchannel interface. Special software is installed on the PS/2 to emulate various S/370 peripherals (disk, ethernet adapter, etc). The 4381 chipset thinks it is talking to real peripherals, and the result is a nice little VM system in a PC box.

Naturally, the performance of a P/370 is that of a 4381 in terms of CPU speed, since this is the CPU sitting on the card, but in terms of I/O it clearly underperforms a real 4381. This is because the real thing has all sorts of independent electronic circuits to process I/O operations in parallel, and the peripherals have their own processors as well, whereas the P/370 relies on the PS/2 to emulate the peripherals. This emulation is CPU intensive, and once the 486 or Pentium is 100% busy, the I/O system is effectively saturated. Unfortunately, the delivery of BITNET traffic or INTERBIT mail is a very I/O intensive process: the files have to be read, sent over the network, and discarded.

To date, the P/370 has proved unable to handle INTERBIT traffic for any of the core nodes. It can, however, handle the BITNET traffic of a typical core node. Unfortunately, for the larger core nodes such as UBVM, this is only about 10% of the total cost of the "BITNET services". So much for the technical background.
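The saturation mechanism described above can be illustrated with a back-of-envelope model. Note that the per-delivery CPU cost used here is a purely hypothetical number of my own, not a measurement:

```python
# Illustrative model of the P/370 I/O bottleneck: because S/370
# peripherals are emulated in software on the PS/2's own CPU, the
# sustainable delivery rate is capped by that CPU, no matter how
# fast the 4381 chipset on the card is.

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def max_deliveries_per_day(cpu_seconds_per_delivery):
    """Deliveries/day the emulating PS/2 CPU can sustain at 100% busy."""
    return SECONDS_PER_DAY // cpu_seconds_per_delivery

# Hypothetical: if emulating the disk and network I/O for one file
# costs 2 seconds of PS/2 CPU time, throughput tops out at 43,200
# deliveries/day, regardless of how fast the 4381 CPU itself is.
print(max_deliveries_per_day(2))  # 43200
```

The point of the sketch is that adding a faster S/370 CPU card would not help: the ceiling is set entirely by the PC doing the peripheral emulation.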
In January, CREN deployed two P/370 systems: one called UNLBIT, to replace ARIZVM1, which had decided to leave the core, and one called UICBIT, to replace both UIUCVM42 and UICVM. The transition from ARIZVM1 to UNLBIT went smoothly. Unfortunately, the UICBIT P/370 quickly proved unable to handle the load it was receiving. In fact, you may remember that the UIUCVM42 system was having serious trouble handling its own traffic in October, and had to be removed from the LISTSERV backbone. UIUCVM42 is a "real" 4381, and thus faster than the P/370, so it came as no surprise to me that the P/370 was unable to handle the combined load of UIUCVM42 and UICVM. CREN deployed another P/370, UICBIT2, as quickly as possible, but for one week the files were just not getting through. With UICBIT2 in place, the situation has now stabilized as far as BITNET traffic is concerned, although there is very little spare capacity on the UICBIT* P/370s.

INTERBIT, however, turned out to be an entirely different story. UICVM and UIUCVM42 were both large regions in terms of INTERBIT traffic; the sum of the two came a close second to the UGA region, which is by far the largest for INTERBIT deliveries. Because the P/370s do not have the horsepower to run INTERBIT deliveries, this traffic has to be forwarded to a mainframe-class machine. In the case of UICBIT, the traffic was split between PUCC and UBVM. As a result of this and other minor changes I have not described here, the INTERBIT load on UBVM was multiplied by 2.3 in the space of one week. UBVM is one of the most central and largest LISTSERV sites, and this is one of the main reasons why we have experienced delays in the past couple of weeks.

With the February table update, CREN entered a change in the BITNET tables that essentially tells LISTSERV not to use the UICBIT region for any INTERBIT deliveries, presumably in the hope of relieving PUCC and UBVM of all that traffic.
The management of these INTERBIT registrations is a very technical issue that I will not describe here, but the bottom line is that this change broke the symmetry of the INTERBIT registration which had been taken for granted when designing the current INTERBIT structure, and this will cause the INTERBIT traffic of the UICBIT region to gradually shift to the UGA region as people install the February tables (which usually takes 2-3 weeks). In January, UGA was handling an average of 746,998 deliveries per day, while UICBIT was doing 605,453 (via PUCC/UBVM). Over the next few weeks, a large fraction of those 605,453 daily deliveries will move to UGA. There is serious concern that UGA will not be able to handle all this extra traffic.

To avoid any misunderstanding, I must point out that L-Soft was not consulted about, or even informed of, these changes. I found out about them while processing a delivery delay complaint from a customer. If I had been told in advance, I would have pulled the alarm signal.

This is the context in which CREN's plans for INTERBIT were finally released. What we have here is not a minor long-term problem; it is something that must be solved immediately. In a matter of weeks, UGA may be driven past its maximum capacity. UGA is the fastest machine in the core, which means no other machine could possibly take over its load, and the other machines have little spare capacity with which to assist UGA. In fact, every time a mainframe core node is replaced with a P/370, its share of the INTERBIT traffic has to be routed to one of the core mainframes, because P/370s cannot handle INTERBIT. So the situation only gets worse for the remaining mainframes.

[Note to LSTOWN-L/LSTSRV-M subscribers: CREN's plans and my comments were posted to LSTSRV-L yesterday and can be retrieved from the archives]

CREN's plan for solving the INTERBIT problem revolves around the use of three tools: ListProc, Zmailer, and a future RS/6000-P/370 combination.
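Before turning to the tools themselves, the traffic figures above are worth putting in perspective with a quick calculation. Only the January averages come from the measurements quoted earlier; the "half shift" scenario is my own guess:

```python
# Projected UGA load once the February tables take effect,
# using the January daily averages quoted above.

uga_daily = 746_998      # January average deliveries/day at UGA
uicbit_daily = 605_453   # January average for the UICBIT region (via PUCC/UBVM)

worst_case = uga_daily + uicbit_daily  # if the entire region shifts to UGA
print(f"worst case: {worst_case:,} deliveries/day")  # worst case: 1,352,451 deliveries/day

# Even a partial shift is a large relative increase:
half_shift_increase = (0.5 * uicbit_daily) / uga_daily
print(f"half shift: +{half_shift_increase:.0%}")  # half shift: +41%
```

In other words, even if only half of the UICBIT traffic moves, UGA has to absorb roughly 40% more deliveries than it handled in January, which is why the concern is immediate rather than long-term.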
I will not discuss the third option because it is not available today; in fact, CREN did not have any estimate for the availability of the necessary (IBM-developed) drivers, which probably means months, not weeks. So, CREN's plans for the short term are based on ListProc and Zmailer. CREN owns ListProc, whereas Zmailer is free and was developed in the usual Internet fashion.

Well, ListProc would help because it does not use BITNET. If people were to abandon LISTSERV in favour of CREN's product, they would stop using BITNET and the problem would go away. Naturally, it would be the same if people upgraded to LISTSERV-TCP/IP or migrated to the unix/VMS/Windows NT version of LISTSERV; CREN did not mention these options because they sell ListProc, and it would not be in their interest to suggest the use of L-Soft's products. At any rate, regardless of whether people choose to migrate to ListProc, LISTSERV-TCP/IP for VM or the unix/VMS/Windows NT versions of LISTSERV, this is a good long-term solution, but it does nothing for the short term. CREN's ListProc has been available for, I think, about 8 months now, and in those 8 months it does not seem to have noticeably decreased the volume of traffic processed by the core. The probability that it will solve the core's problems in the next 2-3 weeks is zero.

As for Zmailer, it is a high-volume mail system for unix: a unix system with Zmailer can handle more mail than a unix system with sendmail. When used in conjunction with ListProc or the unix LISTSERV, Zmailer allows you to process more mail. This again is nice, but people aren't going to switch to ListProc or the unix LISTSERV en masse over the next 2-3 weeks, or even over the next 2-3 months.

So, my conclusion after reading CREN's plan is that it contains no credible short-term solution. Unless I have missed something, CREN can only solve the current crisis if people switch to software that no longer uses BITNET.
Or, in other words, BITNET can only continue to exist if people stop using it. This is a situation that I have worked very hard to try to avoid. BITNET represents the last 9 years of my professional life, and I find it sad that things have to end this way.

This is not for lack of technical alternatives. L-Soft made a proposal involving the migration of the bulk of the INTERBIT traffic to a number of unix systems running the unix version of LISTSERV, and offered to donate the necessary licenses. To date, 5 P/370 systems have been deployed, each of which costs $20k in hardware alone. With that $100k, you could have bought 20 90MHz Pentium systems with 48M and 1x1G each, or 10 RS/6000-250s with 64M and 2x1G. A similar PC configuration (with only 32M) was shown to be able to process over 58k deliveries per day; in fact, it appears to deliver about one message per second, i.e. 86k per day, again with 32M. The RS/6000 is known to be able to handle upwards of 150k. Because the bulk of the resource costs at the core nodes come from the INTERBIT traffic, this should have released enough resources to continue hosting the NJE traffic for the foreseeable future. These are not theoretical solutions; they have been deployed by several L-Soft customers as far back as 3Q94, and they work.

But, whatever the reasons, I'm afraid I can't think of any viable plan that can be implemented over the next 2-3 weeks. Even if L-Soft's plan were adopted today, one would still need to purchase and install a large number of computers at a dozen different sites. Besides, it is not realistic to expect CREN to adopt L-Soft's plan when they have just officially published their own. So, the best I can do is look for ways to extricate LISTSERV and its users from this mess.

The first thing I will do is modify LISTSERV-TCP/IP so that it stops using BITNET altogether. In fact, I have already written the code, after the EARN-RPG incident.
LISTSERV-TCP/IP will continue to communicate with LISTSERV-NJE servers and use DISTRIBUTE, of course, but it will not cause any data to be delivered over BITNET, except in the case of recipients with a BITNET-only address, where the only alternative is not to deliver the message at all. The structure for doing this was already present in LISTSERV-TCP/IP; I had just never had time to finish and test the code.

I am also working on changes to LISTSERV-NJE to minimize the use of BITNET, although it will not be possible to eliminate it totally, as LISTSERV-NJE servers need to use BITNET to talk to each other. I am pursuing various options and building models to analyze the probable impact. There will be an emergency release of version 1.8b of LISTSERV-NJE when the core is about to collapse. The reason I am not doing that now is that I need as much time as possible to make measurements, test the new code on selected systems, and study the impact on the core. If I screw up, I may precipitate the fall, and naturally I would then be blamed for what happened. So I want to make sure that I don't screw up, and with all my other commitments, two weeks is far from plenty.

Eric