I have received a number of private queries for background information regarding this INTERBIT problem, and I think it will be simpler to answer them on the list. I keep forgetting that, while this list was started by BITNET old-timers, there are many new people here who are not familiar with the way BITNET works and may never even have heard of "INTERBIT" before :-) I am copying LSTOWN-L since I think this background information may prove useful to list owners in general.

BITNET is a network started a long time ago, and the environment for which LISTSERV was originally written in 1986. BITNET machines use IBM's NJE protocol to talk to each other. The bulk of BITNET consists of machines running (from most to least common) VMS, VM and MVS. The network is organized around a "core" of about a dozen tightly interconnected machines. Regional hubs connect to the core sites, and other machines then connect to the hubs. The core machines, however, bear the bulk of the load, and many are now overloaded. This is the cause of most of the delivery delays we have observed since Oct-Nov. There is also an "EARN core" in Europe, but it is not currently overloaded, so I will not discuss it.

The BITNET core is organized into a number of "regions", with (usually) two "core nodes" per region. These core nodes are IBM mainframes running VM. Unfortunately, for practical and/or historical reasons, the regions are not always balanced, and some core nodes carry more traffic than others. Naturally, these are the machines most likely to run out of steam.

The bulk of the load on the core machines is in fact caused not by the BITNET/NJE traffic itself, but by a BITNET service called "INTERBIT". This is essentially a mail gateway between BITNET and the Internet. BITNET sites that are not connected to the Internet, or that do not wish to run their own gateway, can direct their Internet traffic to INTERBIT, which delivers it through well-managed, properly configured systems.
A technically accurate description of the implementation of the INTERBIT service would take up several pages, so let's just say that, to a large extent, this INTERBIT traffic is delivered by the core systems. On a machine like, say, UBVM, there is a factor of 10 in resource costs (CPU, I/O, etc) between the NJE traffic and the INTERBIT traffic.

The reason problems in the BITNET core impact LISTSERV is that a majority of the LISTSERV servers, including most of the top 10 servers, still run the NJE version of LISTSERV, which needs to use BITNET to communicate with other servers. It is estimated that 80-90% of the BITNET traffic is generated by these LISTSERV-NJE servers. Many LISTSERV-NJE sites also take advantage of the INTERBIT service, which they feel they are paying for with their CREN membership fee (CREN is the organization that owns BITNET in the US).

This situation, incidentally, is the reason for some of the not-so-nice exchanges you may have seen in the past couple of months. Some people feel that L-Soft is abusing the BITNET core structure for corporate profit. In fact, what happens is that universities are paying L-Soft for LISTSERV maintenance and CREN for BITNET services. It is a well-publicized fact that LISTSERV-NJE requires BITNET, whereas LISTSERV-TCP/IP uses the Internet directly. With membership fees of up to $8,000/year, BITNET is not precisely a free network. LISTSERV-NJE is "abusing" BITNET in just the same manner as LISTSERV-TCP/IP is "abusing" the Internet service providers by making use of the bandwidth they are selling to their customers.

At any rate, the BITNET core has been seriously overloaded since about Oct-Nov 94. The resulting delays affect most BITNET and LISTSERV users, and also impact the thousands of people who use the mainframes that implement the core function.
These mainframes are usually general-purpose or administrative machines operated by universities that donate a fraction of the machine's resources for the benefit of the BITNET community. Naturally, it becomes harder and harder for these universities to justify their continued contribution to the BITNET core as the people who actually pay for the machine complain about poor response time.

To provide some relief for this problem, CREN has been deploying a number of "P/370" systems to assist the core mainframes in processing the BITNET traffic. A "P/370" is an IBM PS/2 personal computer with a special "P/370 card". The card contains a 4381-13 CPU chipset, 16M of memory, and a Microchannel interface. Special software is installed on the PS/2 to emulate various S/370 peripherals (disk, ethernet adapter, etc). The 4381 chipset thinks it is talking to real peripherals, and the result is a nice little VM system in a PC box.

Naturally, the performance of a P/370 is that of a 4381 in terms of CPU speed, since this is the CPU sitting on the card, but in terms of I/O it clearly underperforms a real 4381. This is because the real thing has all sorts of independent electronic circuits to process I/O operations in parallel, and the peripherals have their own processors as well, whereas the P/370 relies on the PS/2 to emulate the peripherals. This emulation is CPU intensive, and once the 486 or Pentium is 100% busy, the I/O system is effectively saturated. Unfortunately, the delivery of BITNET traffic or INTERBIT mail is a very I/O intensive process: the files have to be read, sent over the network, and discarded.

To date, the P/370 has proved unable to handle INTERBIT traffic for any of the core nodes. It can, however, handle the BITNET traffic of a typical core node. Unfortunately, for the larger core nodes such as UBVM, this is only about 10% of the total cost of the "BITNET services". So much for the technical background.
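The saturation mechanism described above can be illustrated with a back-of-envelope model. Note that the per-delivery CPU cost used here is a purely hypothetical number of my own, not a measurement:

```python
# Illustrative model of the P/370 I/O bottleneck: because S/370
# peripherals are emulated in software on the PS/2's own CPU, the
# sustainable delivery rate is capped by that CPU, no matter how
# fast the 4381 chipset on the card is.

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def max_deliveries_per_day(cpu_seconds_per_delivery):
    """Deliveries/day the emulating PS/2 CPU can sustain at 100% busy."""
    return SECONDS_PER_DAY // cpu_seconds_per_delivery

# Hypothetical: if emulating the disk and network I/O for one file
# costs 2 seconds of PS/2 CPU time, throughput tops out at 43,200
# deliveries/day, regardless of how fast the 4381 CPU itself is.
print(max_deliveries_per_day(2))  # 43200
```

The point of the sketch is that adding a faster S/370 CPU card would not help: the ceiling is set entirely by the PC doing the peripheral emulation.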
In January, CREN deployed two P/370 systems: one called UNLBIT, to replace ARIZVM1, which had decided to leave the core, and one called UICBIT, to replace both UIUCVM42 and UICVM. The transition from ARIZVM1 to UNLBIT went smoothly. Unfortunately, the UICBIT P/370 quickly proved unable to handle the load it was receiving. In fact, you may remember that the UIUCVM42 system was having serious trouble handling its own traffic in October, and had to be removed from the LISTSERV backbone. UIUCVM42 is a "real" 4381, and thus faster than the P/370, so it came as no surprise to me that the P/370 was unable to handle the combined load of UIUCVM42 and UICVM. CREN deployed another P/370, UICBIT2, as quickly as possible, but for one week the files were just not getting through. With UICBIT2 in place, the situation has now stabilized as far as BITNET traffic is concerned, although there is very little spare capacity on the UICBIT* P/370s.

INTERBIT, however, turned out to be an entirely different story. UICVM and UIUCVM42 were both large regions in terms of INTERBIT traffic; the sum of the two came a close second to the UGA region, which is by far the largest for INTERBIT deliveries. Because the P/370s do not have the horsepower to run INTERBIT deliveries, this traffic has to be forwarded to a mainframe-class machine. In the case of UICBIT, the traffic was split between PUCC and UBVM. As a result of this and other minor changes I have not described here, the INTERBIT load on UBVM was multiplied by 2.3 in the space of one week. UBVM is one of the most central and largest LISTSERV sites, and this is one of the main reasons why we have experienced delays in the past couple of weeks.

With the February table update, CREN entered a change in the BITNET tables that essentially tells LISTSERV not to use the UICBIT region for any INTERBIT deliveries, presumably in the hope of relieving PUCC and UBVM of all that traffic.
The management of these INTERBIT registrations is a very technical issue that I will not describe here, but the bottom line is that this change broke the symmetry of the INTERBIT registration which had been taken for granted when designing the current INTERBIT structure, and this will cause the INTERBIT traffic of the UICBIT region to gradually shift to the UGA region as people install the February tables (which usually takes 2-3 weeks). In January, UGA was handling an average of 746,998 deliveries per day, while UICBIT was doing 605,453 (via PUCC/UBVM). Over the next few weeks, a large fraction of those 605,453 daily deliveries will move to UGA. There is serious concern that UGA will not be able to handle all this extra traffic.

To avoid any misunderstanding, I must point out that L-Soft was not consulted about, or even informed of, these changes. I found out about them while processing a delivery delay complaint from a customer. If I had been told in advance, I would have pulled the alarm signal.

This is the context in which CREN's plans for INTERBIT were finally released. What we have here is not a minor long-term problem; it is something that must be solved immediately. In a matter of weeks, UGA may be driven past its maximum capacity. UGA is the fastest machine in the core, which means no other machine could possibly take over its load, and the other machines have little spare capacity with which to assist UGA. In fact, every time a mainframe core node is replaced with a P/370, its share of the INTERBIT traffic has to be routed to one of the core mainframes, because P/370s cannot handle INTERBIT. So the situation only gets worse for the remaining mainframes.

[Note to LSTOWN-L/LSTSRV-M subscribers: CREN's plans and my comments were posted to LSTSRV-L yesterday and can be retrieved from the archives]

CREN's plan for solving the INTERBIT problem revolves around the use of three tools: ListProc, Zmailer, and a future RS/6000-P/370 combination.
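Before turning to the tools themselves, the traffic figures above are worth putting in perspective with a quick calculation. Only the January averages come from the measurements quoted earlier; the "half shift" scenario is my own guess:

```python
# Projected UGA load once the February tables take effect,
# using the January daily averages quoted above.

uga_daily = 746_998      # January average deliveries/day at UGA
uicbit_daily = 605_453   # January average for the UICBIT region (via PUCC/UBVM)

worst_case = uga_daily + uicbit_daily  # if the entire region shifts to UGA
print(f"worst case: {worst_case:,} deliveries/day")  # worst case: 1,352,451 deliveries/day

# Even a partial shift is a large relative increase:
half_shift_increase = (0.5 * uicbit_daily) / uga_daily
print(f"half shift: +{half_shift_increase:.0%}")  # half shift: +41%
```

In other words, even if only half of the UICBIT traffic moves, UGA has to absorb roughly 40% more deliveries than it handled in January, which is why the concern is immediate rather than long-term.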
I will not discuss the third option because it is not available today; in fact, CREN did not have any estimate for the availability of the necessary (IBM-developed) drivers, which probably means months, not weeks. So, CREN's plans for the short term are based on ListProc and Zmailer. CREN owns ListProc, whereas Zmailer is free and was developed in the usual Internet fashion.

Well, ListProc would help because it does not use BITNET. If people were to abandon LISTSERV in favour of CREN's product, they would stop using BITNET and the problem would go away. Naturally, it would be the same if people upgraded to LISTSERV-TCP/IP or migrated to the unix/VMS/Windows NT version of LISTSERV; CREN did not mention these options because they sell ListProc, and it would not be in their interest to suggest the use of L-Soft's products. At any rate, regardless of whether people choose to migrate to ListProc, LISTSERV-TCP/IP for VM or the unix/VMS/Windows NT versions of LISTSERV, this is a good long-term solution, but it does nothing for the short term. CREN's ListProc has been available for, I think, about 8 months now, and in those 8 months it does not seem to have noticeably decreased the volume of traffic processed by the core. The probability that it will solve the core's problems in the next 2-3 weeks is zero.

As for Zmailer, it is a high-volume mail system for unix: a unix system with Zmailer can handle more mail than a unix system with sendmail. When used in conjunction with ListProc or the unix LISTSERV, Zmailer allows you to process more mail. This again is nice, but people aren't going to switch to ListProc or the unix LISTSERV en masse over the next 2-3 weeks, or even over the next 2-3 months.

So, my conclusion after reading CREN's plan is that it contains no credible short-term solution. Unless I have missed something, CREN can only solve the current crisis if people switch to software that no longer uses BITNET.
Or, in other words, BITNET can only continue to exist if people stop using it. This is a situation that I have worked very hard to try to avoid. BITNET represents the last 9 years of my professional life, and I find it sad that things have to end this way.

This is not for lack of technical alternatives. L-Soft made a proposal involving the migration of the bulk of the INTERBIT traffic to a number of unix systems running the unix version of LISTSERV, and offered to donate the necessary licenses. To date, 5 P/370 systems have been deployed, each of which costs $20k in hardware alone. With that $100k, you could have bought 20 90MHz Pentium systems with 48M and 1x1G each, or 10 RS/6000-250s with 64M and 2x1G. A similar PC configuration (with only 32M) was shown to be able to process over 58k deliveries per day; in fact, it appears to deliver about one message per second, i.e. 86k per day, again with 32M. The RS/6000 is known to be able to handle upwards of 150k. Because the bulk of the resource costs at the core nodes come from the INTERBIT traffic, this should have released enough resources to continue hosting the NJE traffic for the foreseeable future. These are not theoretical solutions; they have been deployed by several L-Soft customers as far back as 3Q94, and they work.

But, whatever the reasons, I'm afraid I can't think of any viable plan that can be implemented over the next 2-3 weeks. Even if L-Soft's plan were adopted today, one would still need to purchase and install a large number of computers at a dozen different sites. Besides, it is not realistic to expect CREN to adopt L-Soft's plan when they have just officially published their own. So, the best I can do is look for ways to extricate LISTSERV and its users from this mess.

The first thing I will do is modify LISTSERV-TCP/IP so that it stops using BITNET altogether. In fact, I have already written the code, after the EARN-RPG incident.
LISTSERV-TCP/IP will continue to communicate with LISTSERV-NJE servers and use DISTRIBUTE, of course, but it will not cause any data to be delivered over BITNET, except in the case of recipients with a BITNET-only address, where the only alternative is not to deliver the message at all. The structure for doing this was already present in LISTSERV-TCP/IP; I had just never had time to finish and test the code.

I am also working on changes to LISTSERV-NJE to minimize the use of BITNET, although it will not be possible to eliminate it totally, as LISTSERV-NJE servers need to use BITNET to talk to each other. I am pursuing various options and building models to analyze the probable impact. There will be an emergency release of version 1.8b of LISTSERV-NJE when the core is about to collapse. The reason I am not doing that now is that I need as much time as possible to make measurements, test the new code on selected systems, and study the impact on the core. If I screw up, I may precipitate the fall, and naturally I would then be blamed for what happened. So I want to make sure that I don't screw up, and with all my other commitments, two weeks is far from plenty.

Eric