While BTI did well selling systems on the BTI 5000 to car dealerships across the country, management boldly decided to move into the super-minicomputer market. This was a multi-year effort, and as the revenue from the older systems peaked and started to decline, the pressure for the 8000 to succeed mounted.
BTI 8000 Timeline
(The following timeline was supplied by Ron Crandall)
While the BTI 3000, 4000, and 5000 were keeping BTI growing, a few people in both the hardware and software departments started thinking about a next generation design, one that broke from the HP CPU heritage. In 1974, Ron Crandall, George Lewis (Lew), Bill Cargile, and Bill Quackenbush started preliminary investigations along these lines. The effort was short lived, as BTI 4000 and 5000 work kept everyone too busy to do much else. Nevertheless, certain important decisions were made and the overall architecture for a new generation machine was mapped out. The machine would have a system backplane into which one of four types of modules would be plugged: Memory, CPU, PPU (peripheral processing unit, basically a DMA engine connecting to peripheral controllers), and SSU (system services unit, essentially what was left over, such as operator interface, boot, remote diagnostic, time of day clock, error handling). Bill Cargile designed an asynchronous backplane to interconnect the major parts of the machine.
The plan for the system never completely went dormant, but there were several major obstacles along the way that consumed inordinate amounts of time, a major one being what we 'affectionately' called the octo-bus. By late 1975 some people began full time work on the 8000. A group of people flew to Corvallis, OR in March of 1975 to meet with Jim Meeker and coax him into working for BTI. He set about specifying the very CISC-y BTI 8000 instruction set. Ron Crandall, frantically busy with system software issues on the 5000, used all of his available spare time to architect a robust file system structure, one that would be "crash-proof". Many of the key components came from the design of the 5000 disk structure, but a lot of thought went into the various faults that could occur. Other people started on various components of the design. Roger Fairfield was given the task of making the asynchronous backplane prototype work.
Roger's prototype was up and running in a few months, but there were persistent failures that no one could track down. The effort continued for most of a year, from mid 1975 to 1976 IIRC. When Bill Quackenbush was finally freed up from the nasty, unworkable octo-bus, he relatively quickly demonstrated that the problem lay in the synchronizers that were a crucial part of the asynchronous priority resolution protocol. Bill built a test rig that ran the two interacting devices off of the same clock but with a variable phase between them. He might have just adjusted two clocks to be as close as possible and then just relied on the drift to vary the phase, but in any event, by hooking up a 'scope to the synchronizers, you could see the output 'fence sit' for way longer than the advertised propagation delay. Not too surprising, since the parts weren't designed to be synchronizers and we were violating the setup and hold times. Bill was able to relatively quickly redesign the bus to be synchronous and another task (system clock) was added to the function of the SSU.
The overall hardware effort was plagued with a series of problems. Realizing early that the software development could in no way wait for actual hardware, an instruction set simulator was built using BTI 5000 hardware. The original simulator just simulated the instruction set and was used to develop software products like the assembler/linker and much of the early operating system kernel. A few I/O devices were added so that OS projects could be completed. At some time, various components of the machine became available. The emulator was modified to connect to a bus interface board and the emulator played the role of CPU while the memory, and later, even the PPU did their job installed in an actual card cage.
In February 1977, Lew decided that the burden of supporting the BTI 5000 was excessively interfering with progress on the 8000. So a pact with the devil was signed and the 5000 development was passed off to another (and incompetent) group. This had repercussions later for the 5000 product line. We also went on a concerted hiring effort and many more joined the 8000 effort.
As the magnitude of the project increased, Lew became overwhelmed and the continuity and logistics of the project started to suffer. Unfortunately, some of these issues resulted in problems that plagued the 8000 throughout its life. The most grievous of these issues involved the schedule for completion. Even as late as February 1980, Lew was insisting that a completed machine would ship by the computer conference in May. Just a walk down the row of offices and labs easily put paid to that idea. Some of the more optimistic schedules for such vital items as a disk controller were September. Unfortunately, marketing geared up for a major marketing push based on Lew's hopelessly unrealistic schedule. Consequently, interest in the machine peaked well before it could be shipped. Also, Lew had stockpiled about a million dollars in inventory for the initial production run of these machines resulting in needless expenses for the company. This is the particular bit of mismanagement that got Lew removed from the management position. He chose not to remain as a contributing staff member.
A few anecdotes about the CPU effort are instructive of the kind of problem that we suffered. Lew had decided that our chief PC board layout guy, John Caris, would do layouts for all of the boards so that we would always be working on close to shippable quality boards. Since John could do a layout of a BTI 5000 board (about 80 square inches and 80 DIPs) in about two weeks, Lew figured he could do a layout of a BTI 8000 board (about 460 square inches and 450 DIPs) in ten weeks (the simple ratio of the parts count). What Lew was smoking to assume such a thing is unknown. But John floundered with the first CPU PC board layout for eight months before Lew would relent and allow a wire wrap prototype. This prototype was brought up in a few months and several more CPUs were then wire-wrapped and debugged so that we could finally get a working, full speed (some devices had to be down-clocked, but this had little effect on the development effort) machine. Even with the new CPUs, the emulator still played several key roles in the system operation. But development could finally go on full bore.
Meanwhile, the PC board layout for the CPU continued. In order to 'help' John Caris, another layout person was brought onto the project and they worked alternating shifts. This worked okay for a while, but then progress slowed to a virtual standstill. It seems that each would have particularly difficult routing issues to resolve. So they would remove some of the other's traces in order to solve their issue. This comedy of errors went on until we finally qualified a vendor who could make four layer boards in the size we required. It was yet another example of a massive expense and, worse, schedule slippage. I should note that when these CPUs came back, nothing worked, even though they were schematically correct versions of the working wire wrap prototypes. It seems that many buses were laid out with parallel traces and the crosstalk was sufficient to induce phantom signals. Since this problem afflicted almost all of the buses, it took a lot of time and effort to fix as well.
Because of Lew's optimistic schedules, BTI prematurely started letting the world know about the 8000 in 1978. They presented papers at technical conferences; the 8000 was mentioned in sales literature; glossy brochures were produced touting its advanced features.
The 8000 didn't really start shipping until June 1981, and even then, the first few systems were moderately unreliable. The bus transfers would suffer from protocol errors whose basic cause was some firmware problems. Making it worse, the remote diagnostic facility (RDF) wasn't in place, making it impossible for failures in the field to be diagnosed and repaired in a timely manner (subsequently, we successfully used the RDF on many occasions to restart a crashed system with no loss of user data; their sessions just resumed where they had stopped). In an odd reverse from the usual case, the operating system proved to be fairly reliable, even in these early times. By late 1981, these issues had been worked out.
BTI ramped up staff in anticipation of the need to ship many of the new BTI 8000 systems in 1982 ... but the orders only trickled in. A massive layoff was the result, and two smaller ones followed the same year. Finally, in March, 1986, the decision was made to halt all development on the 8000 and to simply support any existing customers. Slowly these customers dwindled, and BTI kept downsizing.
In the end, there were perhaps 30 paying customers for the system. BTI built around 15 others, some of which were used internally, others were used as demo systems to bait customers into a purchase.
By 1993, BTI was down to about a dozen employees, but surprisingly, there were still 19 systems in the field as of 1995. Phil Deal supported the few remaining customers, but the final systems were retired in 2002, and the US part of BTI was closed down.
Variable Resource Architecture (VRA)
IBM pioneered the idea of a system architecture with the IBM/360 family. The 360 architecture was an abstract model of computation, where many different machines implementing that model could span a couple orders of magnitude in performance while sharing the cost of developing the OS and other tools and preserving the customer's own software investments.
BTI didn't have the resources to develop a family of computers, and took a different approach. They decided to build a multiprocessor, where a low end system contained a single CPU, a single memory controller, and a single I/O controller. Higher end systems were built by adding more resources, instead of having a family of uniprocessors with a range of performance.
BTI developed a model where a single high speed backplane connected together one to many instances of each of a few computing resources, with each type of resource being identical and treated equally. This is known as a symmetric multiprocessor. BTI didn't invent the idea (other examples include the Burroughs 5000 and Tandem T/16), but it wasn't very common either.
BTI called this idea Variable Resource Architecture, or VRA for short.
Here are some key design features of the BTI 8000 VRA:
- Scalable from low to high end
Although BTI was certainly counting on attracting a new class of customers with the 8000, they needed a path for existing customers to upgrade to the new system. A low end configuration that appealed to those customers and had a legitimate story for painless upgrades (for what company doesn't have plans for growth?) resulted in a compelling sales pitch.
An entry level BTI 8000 was at least three times faster than BTI's then-current flagship, the BTI 5000. While a 5000 could support six interactive users, a low-end 8000 could handle up to 30, and a maximum configuration controlled up to 500 terminals.
- Fully homogeneous & symmetric: CPUs, memory controllers, peripherals
By having a single design for each resource, BTI could develop a scalable system yet have a fixed development cost.
One benefit to the customer was they were able to grow their compute power by simply adding resources, instead of upgrading (i.e., pulling out a slow CPU and adding a fast CPU).
Also, not all customers had the same needs; some might have compute-heavy loads, others I/O intensive loads. By decoupling the resource configuration, a customer could meet their needs without having to beef up the entire system.
- Adding resources required no OS reconfiguration
If each possible hardware configuration needed the operating system to be customized, it would have diluted the benefit of such flexible reconfigurability. Instead, the OS was written from the very start to adapt to whatever resources were installed.
Demonstrating the degree of self configuration, a system with running processes could be halted and powered down, CPUs and memories could be added, and the system rebooted without any intervention other than the actual addition or removal of the hardware. Upon power up, the users at the terminal would resume exactly where they had been, experiencing the reconfiguration only as a pause of a few minutes, and noticing better response times due to the new CPUs. No configuration tables had to be manually edited; nothing had to be recompiled.
- User programs were unaware as resources were added
Just as the OS didn't need any changes to adapt to a different system configuration, user programs were completely insensitive to the number of memory controllers, the amount of memory, and the number of CPUs. Of course adding or removing I/O devices required the program to know if the device existed or not.
- Processes, CPUs, and Memory were Orthogonal
Many multiprocessing systems (then and now) have memory tightly coupled with a given CPU, called NUMA. Because it is more efficient to access the local memory, a program ends up being bound to the particular CPU which is associated with that memory. This is bad because it can happen that a disproportionate number of the active processes happen to map to the same CPU, keeping it over burdened, while other CPUs in the same system are idle. The OS must either migrate some of the active processes to a different CPU, or have some processes running on a CPU which is distant from its memory state.
CPUs had no memory of their own on the BTI 8000. All memory requests went over the backplane to a memory controller. This loose coupling meant that a process could run equally well from any CPU in the system, without any cost of moving between CPUs (frequently described as processes having no CPU affinity). In a system with eight CPUs, a single process might have run on each of the eight CPUs at some point in any given second. This resulted in automatic, seamless load balancing.
No matter how many memory controllers were installed in the system, and no matter how much memory was associated with each, it was all one large pool as far as the operating system was concerned. All memory was visible to all CPUs. One large memory pool is easier to manage than N memory pools each 1/Nth the size.
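The absence of CPU affinity described above can be illustrated with a minimal sketch. It assumes a single shared ready queue and a fixed time slice; the function name `simulate`, the round-robin policy, and the process names are illustrative assumptions, not details of the monitor's actual scheduler.

```python
from collections import deque

def simulate(num_cpus, processes, slices):
    """Dispatch time slices from one shared ready queue (hypothetical policy).

    Any CPU takes whichever process is at the head of the queue, so no
    process is tied to a particular CPU.
    Returns {process name: set of CPU numbers it ran on}.
    """
    ready = deque(processes)
    ran_on = {p: set() for p in processes}
    for _ in range(slices):
        running = []
        # In each time slice, every CPU pulls the next ready process.
        for cpu in range(num_cpus):
            if not ready:
                break
            proc = ready.popleft()
            ran_on[proc].add(cpu)
            running.append(proc)
        # At the end of the slice the processes rejoin the shared queue.
        ready.extend(running)
    return ran_on

history = simulate(num_cpus=8, processes=[f"P{i}" for i in range(9)], slices=9)
print(sorted(history["P0"]))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

With nine processes and eight CPUs, each process drifts across every CPU within nine slices, which is the load-balancing effect the text describes.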
Fail-Soft Behavior
Because of the bank/accounting/business focus, BTI wanted to assure customers that their data was safe. Although not nearly as fault tolerant as the Tandem line of computers, real effort was put into making the system "fail-soft."
BTI defined this to mean that when hardware failed, the system would not cause harm, and it would be easy to repair. Fail-Soft was engineered into different aspects of the system; it was not any one single piece of technology.
- The OS took great pains to store redundant copies of all important tables, both in memory and on disk. Any time one of these tables had to be updated, the "primary" table would be marked as unstable, modifications made to it, then it would be marked stable. Then the "junior" table would go through the same steps. If something went wrong during the update of the primary table (e.g. power loss), the junior table would be used to restore a coherent state. If the junior table was corrupt, it would be refreshed from the primary table.
- Not only the OS, but certain system software was hardened against system failure. Most notably, the DBMS software took great pains to prevent data loss through various redundancies and transaction logs.
- The 8000 employed a remote diagnostic feature, where BTI technicians could connect to systems in the field to check system status, deploy hot patches, and diagnose and repair problems after the fact.
- There was a lot of hardware checking. Every card performed diagnostic checks at start up, and could run them on command later. Every transfer on the backplane bus had parity per byte. Memory used ECC to detect and correct errors. Recoverable errors were logged, and later read out by RDF; if a pattern was found in the failures, a card would be replaced. Unrecoverable errors might require the removal of a card, or in some cases the OS could map around the problem (bad pages in memory; bad blocks on disk).
- The power supply was monitored and if it detected an imminent shut-down, a CPU would be interrupted, state would be saved, and no new process would be initiated.
- Since core memory was used by the 8000, it retained its state during power loss. The later DRAM-based memory board had battery backup.
- Because VRA made reconfiguration painless, it also meant that unplanned removal of resources, due to hard errors, was handled quickly without posing any special complexity.
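The primary/junior update sequence in the first item above can be sketched as follows. This is a minimal model, assuming each copy carries a single "stable" flag and that a crash can strike in the middle of either update; the class and function names are hypothetical, and the real table formats are not described here.

```python
class TableCopy:
    """One copy of a critical table plus its 'stable' marker."""
    def __init__(self, data):
        self.data = dict(data)
        self.stable = True

class Crash(Exception):
    """Stands in for a power loss in the middle of an update."""

def update_copy(table, key, value, crash=False):
    table.stable = False          # mark unstable before touching the data
    table.data[key] = value
    if crash:
        raise Crash()             # the flag is left at 'unstable'
    table.stable = True           # mark stable once the change is complete

def update(primary, junior, key, value, crash_in=None):
    """Update the primary first, then the junior, as the text describes."""
    update_copy(primary, key, value, crash=(crash_in == "primary"))
    update_copy(junior, key, value, crash=(crash_in == "junior"))

def recover(primary, junior):
    """Rebuild a coherent pair from whichever copy is still marked stable."""
    if not primary.stable and junior.stable:
        primary.data, primary.stable = dict(junior.data), True
    elif not junior.stable and primary.stable:
        junior.data, junior.stable = dict(primary.data), True

def demo(crash_in):
    """Run one update with an optional simulated crash, then recover."""
    primary, junior = TableCopy({"quota": 100}), TableCopy({"quota": 100})
    try:
        update(primary, junior, "quota", 250, crash_in=crash_in)
    except Crash:
        pass
    recover(primary, junior)
    return primary.data, junior.data
```

A crash during the primary update rolls both copies back to the old value, while a crash during the junior update rolls both forward to the new one; either way the pair ends up consistent.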
Virtual Machine Multiprocessing (VMM)
It was alluded to above, but one of the concepts of the 8000 was Virtual Machine Multiprocessing, or VMM. This meant that the user programs were entirely unaware how much memory the system had or how many CPUs the system had, and any attempt to manipulate a resource was mediated by the OS.
The virtualization of the user state was and is very common; it is required for protecting processes from each other, due to either malice or errors.
But virtualization was especially important for BTI in that virtualization also meant that a user program couldn't tell if it was running on a single CPU system or one with eight. When a system was reconfigured, either adding or removing resources, user programs didn't need to be modified in any way, and nothing needed to be recompiled.
BTI 8000 Operating System (Monitor)
The BTI 8000 OS was frequently called the monitor, as it monitored and controlled the system activities.
Like user programs, the monitor was distributed and ran on any and all CPUs. The only time there was any asymmetry was at boot time: after boot up diagnostics had finished, the SSU would enable the CPUs, and the CPUs would attempt to lock out all the other CPUs, but only one would succeed. That winning CPU would be responsible for bootstrapping the monitor into memory, and patching various configuration tables based on the resources which were available in the system.
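The boot-time race described above, in which every CPU tries to lock out the others and exactly one succeeds, can be sketched as below. How the 8000 hardware implemented the lock is not described here; this sketch substitutes Python's `threading.Lock`, and the function name `elect_boot_cpu` is hypothetical.

```python
import threading

def elect_boot_cpu(num_cpus):
    """Each simulated CPU tries once to grab a shared lock; one succeeds."""
    boot_lock = threading.Lock()
    winners = []

    def cpu(cpu_id):
        # Non-blocking attempt: the losers simply give up the boot role.
        if boot_lock.acquire(blocking=False):
            # The lock is deliberately never released, so no later CPU
            # can also claim the bootstrap role.
            winners.append(cpu_id)

    threads = [threading.Thread(target=cpu, args=(i,)) for i in range(num_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners

print(len(elect_boot_cpu(8)))   # 1
```

Which CPU wins is not predictable, which matches the symmetric design: any CPU is equally capable of bootstrapping the monitor.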
Pervasive Security Model
A well thought out security model was designed to appeal to the business, accounting, and banking markets.
The core security boundary was the account, which was arranged in a hierarchy of four levels of account control. It is described in the next section.
Like most operating systems, there was the concept of user state and privileged state. The user mode programs had no ability to access resources directly, other than the memory pages owned by that user. All other resources were handled by making an "XREQ" (eXecutive REQuest) call into the monitor.
All files stored on removable media, including the primary disk packs as well as tape backups, were encrypted. It took extra work to save unencrypted files, typically when generating a tape to be exported to a different computer system.
Each account had limits on the resources made available by its superior account, including cumulative and per-session CPU time, cumulative wall-clock time, saved file block limit, and scratch file block limit.
Account Model
The operating system had the concept of "accounts," with four levels of account hierarchy. The "system" accounts were at the top of the pyramid. These in turn granted resources and privileges to division level accounts. Division level accounts delegated to project level accounts, and these controlled user-level accounts.
The default state was for all files to be private to an account. An account had an access control list, basically a list of other accounts which were granted various levels of permission to use all files within the account. For example, an account might permit all people in his division to read all files, and grant a specific list of people read/write privileges.
Beside the per-account access list, there was a per-file access list, offering the same types of privileges. In both per-account and per-file access lists, the permissions could also be tied to a password as an extra measure of security.
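The two access lists just described can be pictured with a small sketch. The rule for combining them is an assumption (here, a grant in either list is enough), as are all the class, function, and account names; only the ideas of private-by-default files, per-account and per-file lists, and optional passwords come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

@dataclass
class Grant:
    """Permissions given to one account, optionally tied to a password."""
    modes: Set[str]                       # e.g. {"read"} or {"read", "write"}
    password: Optional[str] = None

    def allows(self, mode, password=None):
        return mode in self.modes and (
            self.password is None or self.password == password)

@dataclass
class File:
    name: str
    acl: Dict[str, Grant] = field(default_factory=dict)    # per-file list

@dataclass
class Account:
    name: str
    acl: Dict[str, Grant] = field(default_factory=dict)    # per-account list
    files: Dict[str, File] = field(default_factory=dict)

def may_access(owner, filename, requester, mode, password=None):
    """Files are private by default; either list may grant access (assumed)."""
    if requester == owner.name:
        return True
    for acl in (owner.acl, owner.files[filename].acl):
        grant = acl.get(requester)
        if grant is not None and grant.allows(mode, password):
            return True
    return False

payroll = Account("PAYROLL")
payroll.acl["MANAGER"] = Grant({"read", "write"})
payroll.files["LEDGER"] = File("LEDGER", acl={"AUDIT": Grant({"read"}, "s3cret")})
print(may_access(payroll, "LEDGER", "AUDIT", "read"))            # False
print(may_access(payroll, "LEDGER", "AUDIT", "read", "s3cret"))  # True
```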
Although superior accounts were able to do anything an inferior account had permission to do, a superior account could also relinquish some or all of those permissions. Once done, these permissions could be restored by the inferior account.
The security system also recognized that the responsibilities of the system administrators often were different than those of managers. Although system accounts were higher in the security hierarchy, individual administrator accounts were typically set up so they didn't have permission to access private files. Instead, they were in charge of managing print queues, mounting and dismounting disk volumes, and monitoring the process table. There was a MASTER account, though, that had the ability to do anything.
Groups of accounts could be "encapsulated," meaning they were walled off from the rest of the system. An account inside the barrier couldn't share any files with people outside the barrier, and was unable to write to files outside the barrier. This would be useful for ensuring the accounting department files were inaccessible to anyone outside of accounting, even if someone in accounting attempted to defeat security.
File System
The file system was flat for an account, other than the schism between the normal files and library files.
The user process had available 202 virtual I/O channels, known as LUNs (Logical Unit Numbers). By default 1 and 2 were the standard in and out channels, respectively. The first 200 could be assigned by the process to point to a file or other device. LUN 200 was always pointing at the file holding the executable. LUNs 201 and 202 were not assignable, and were always pointing at the user's terminal if in interactive mode, otherwise in a batch process they pointed at the virtual card reader and spooled line printer device.
Beside mapping a LUN to a specific file, it could be mapped to one of a number of logical devices:
- .TERM: user terminal
- .LP: line printer (spooled)
- .MT: magnetic tape drive
- .NULL: "write only memory"
- .CDR: card-image reader's view of .TASK
- .DIR: the directory of an account library
- .LOCK: inter-process semaphore
- .PATH: inter-process communication link
- .CODE: executable program memory image file
- .SAF: sequential access file
- .RAF: random access file
Note that the CDR card reader was used by batch programs, and what it really pointed at wasn't a card reader but a sequential file containing lines of text emulating a card reader.
Different logical devices had various properties associated with the logical file type. For instance, the .TERM type had information about the width of the terminal, the number of lines per page, baud rate, terminal type, etc.
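The LUN mapping described in this section can be sketched as a small per-process table. The description above leaves it unclear whether LUN 200 counts among the assignable channels, so this sketch assumes LUNs 1 to 199 are assignable and treats 200 as reserved for the executable; the class name, method names, and file names are all hypothetical.

```python
LOGICAL_DEVICES = {".TERM", ".LP", ".MT", ".NULL", ".CDR", ".DIR",
                   ".LOCK", ".PATH", ".CODE", ".SAF", ".RAF"}

class LunTable:
    """Per-process map from LUN to a file name or logical device."""

    def __init__(self, executable, interactive=True):
        self.map = {200: executable}          # file holding the executable
        if interactive:
            self.map[201] = self.map[202] = ".TERM"
        else:
            # Batch: virtual card reader and spooled line printer.
            self.map[201], self.map[202] = ".CDR", ".LP"

    @staticmethod
    def is_assignable(lun):
        # Assumption: LUNs 1..199 are free for the process to assign.
        return 1 <= lun <= 199

    def assign(self, lun, target):
        if not self.is_assignable(lun):
            raise ValueError(f"LUN {lun} is not assignable")
        # Assumption: names starting with '.' always denote logical devices.
        if target.startswith(".") and target not in LOGICAL_DEVICES:
            raise ValueError(f"unknown logical device {target}")
        self.map[lun] = target

table = LunTable("PAYROLL")
table.assign(5, ".LP")        # route LUN 5 to the spooled printer
table.assign(6, "LEDGER")     # route LUN 6 to an ordinary file
```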
BTI 8000 Software
This needs to be fleshed out, but in short, major tools were:
- Assembler
single pass
- BASIC-X
extended BASIC dialect, derived from the HP 2100's BASIC with business-class enhancements. It was coded in assembly.
- COBOL 74
Word has it that this was a Ryan McFarland-licensed implementation.
Ron Crandall says BTI never had a COBOL compiler.
- DBMS-X
CODASYL-compliant database management system
- DRAGON
This systems language was originally started by Glen Andert, but ultimately finished by Pat Helland, now of Microsoft. DRAGON was widely used to implement some of the in-house programs and utilities, most notably the DBMS software.
- EDIT
text editor, either line mode or full screen
- FORTRAN 77
Word has it that this was a Ryan McFarland-licensed implementation.
Ron Crandall says BTI started an implementation of FORTRAN, but the developer (Lou DeMartini) left BTI before it was done and the project died.
- PASCAL-X
As I recall, the PASCAL was licensed from somewhere, then heavily modified to the point of nearly being a rewrite. I have a vague recollection that the guy working on this was named Bob.
In 1978, the brochures touted that all of the system software was written in PASCAL. It simply wasn't true. Programs were almost exclusively written in either assembly language or DRAGON.
- RPG-II
Ron Crandall says BTI never had an RPG compiler.
In 1985/1986, there was a project to develop a C compiler for the machine. One engineer worked on it, and supposedly got pretty far. It parsed standard C and had multiple passes making both high level and peephole optimizations. It was canceled before it was production ready, and the engineer working on it left to work developing mapping software for a charitable organization. (Does anybody remember? Perhaps Mike Byron?)
BTI 8000 Compute Modules
The BTI 8000 had four different classes of resources. A minimal system had one of each, and more than one of each could be added to a system to increase throughput and capacity. These four were named:
- SSU: System Services Unit
- MCU: Memory Control Unit
- CPU: Computational Processing Unit
- PPU: Peripheral Processing Unit
Each board in the system was a 20" x 23" card that plugged into a 16 slot backplane. The CPU was self-contained, but some of the others were connected via ribbon cables to more distant resources; for example the memory controller was cabled over to another cabinet containing the core memory modules. Every board in the backplane was microcoded to allow self test and intelligent configuration.
At the time the 8000 was introduced, it wasn't practical to build an eight layer 20" x 23" board. Instead, the bus interface logic and power distribution were laid out using the copper on the board, and the rest of the wiring was wire wrapped by machine. A Plexiglas sheet was mounted on the rear of each board to prevent accidentally snagging any wires while adding or removing boards from the system.
SSU (System Services Unit)
The SSU performed some simple management functions. Normally, only one SSU was used in a system, but a second SSU could be installed, in which case the redundant one was used as a hot backup in case the primary SSU failed.
The SSU contained:
- the master system oscillator
- a real time clock
- an interface to front panel switches and a vacuum fluorescent status display
- a system monitor for abnormal power and temperature conditions
- a remote diagnostic interface, allowing BTI technicians to log in remotely
- a permanent, unique ID, which allowed both BTI and 3rd party software to be locked to a specific machine
At system reset, all the resources would perform self-test. The SSU would wait until all tests were complete, then poll all the slots to figure out which cards were present and healthy. Any failures would halt the system and be reported on the monitor panel, a small vacuum fluorescent display. The SSU would then enable all the CPUs to run. One CPU would lock out the other CPUs from running, and would then start the monitor bootstrap process. The monitor would use the system configuration data collected by the SSU to establish OS configuration tables.
The SSU was usually positioned in one of the middle slots. As the source of the backplane clock, and thus the clock for the entire system, a central position minimized the clock skew between boards.
The SSU used the Signetics 8X300 microcontroller for its intelligence. The 8X300 was one of the earliest microcontrollers, and had a reputation for being an ugly beast to program, and for running quite hot, as it was implemented in bipolar logic.
Ron Crandall adds:
I found it to be just another relatively simple instruction set device. What made this installation interesting is that the instruction word was a full 24 bits... 16 for the 8x300 and 8 more to control various gates on the board. So each instruction was a 16 bit opcode for the 8x300 and 8 more bits that controlled what the 8x300 saw on its buses as it executed the instruction.
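The 24-bit instruction word Ron describes can be pictured with a short sketch. The bit layout is an assumption (the 16-bit 8X300 instruction in the high bits, the 8 board-control bits in the low bits); the description only gives the field widths, and the function name is hypothetical.

```python
def split_ssu_word(word):
    """Split a 24-bit SSU instruction word into its two fields.

    Assumed layout: bits 23..8 hold the 8X300 instruction, bits 7..0 hold
    the control bits that gate what the 8X300 sees on its buses.
    """
    if not 0 <= word < (1 << 24):
        raise ValueError("instruction word must fit in 24 bits")
    opcode = word >> 8          # 16 bits executed by the 8X300
    control = word & 0xFF       # 8 bits driving gates on the board
    return opcode, control

opcode, control = split_ssu_word(0xABCD12)
print(hex(opcode), hex(control))   # 0xabcd 0x12
```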
MCU (Memory Control Unit)
Originally, and for most of the life of the 8000, an MCU was simply an interface, and didn't directly control any memory. The MCU ran some quick diagnostics after reset and took care of the backplane bus protocol.
Requests from the bus were sent via ribbon cables to an external box, mounted in a second cabinet, which contained core memory and the actual core memory timing, driver, and sense circuitry.
The MCU had minimal pipelining. It could accept two operations before it started turning away new requests. Even those two requests weren't pipelined, other than the act of transmitting the request across the bus. While the first command was being processed, the second command simply sat in an input buffer, waiting its turn.
It was a system feature that the MCU directly supported atomic operations. The CPU defined a number of instructions as atomic, meaning a memory location would be read and the MCU would deny all other requests until the card performing the read/modify/write wrote back the updated value.
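The locking behavior behind these atomic operations can be modeled in a few lines. This is a toy sketch under stated assumptions: the class and method names are hypothetical, a denied request is represented by `None` or `False`, and the requester is assumed to retry later; the actual bus signaling is not described here.

```python
class MCU:
    """Toy memory controller that holds off other cards during a
    read/modify/write sequence."""

    def __init__(self, size):
        self.mem = [0] * size
        self.locked_by = None      # slot number of the card holding the lock

    def read_locked(self, card, addr):
        """Atomic read: return the value and lock the MCU, or None if denied."""
        if self.locked_by is not None:
            return None
        self.locked_by = card
        return self.mem[addr]

    def read(self, card, addr):
        if self.locked_by not in (None, card):
            return None            # denied while another card holds the lock
        return self.mem[addr]

    def write(self, card, addr, value):
        if self.locked_by not in (None, card):
            return False           # denied while another card holds the lock
        self.mem[addr] = value
        self.locked_by = None      # the write-back releases the lock
        return True

mcu = MCU(4)
old = mcu.read_locked(card=1, addr=0)      # CPU in slot 1 starts an atomic op
print(mcu.read(card=2, addr=0))            # None: slot 2 is turned away
mcu.write(card=1, addr=0, value=old + 1)   # write-back completes the op
print(mcu.read(card=2, addr=0))            # 1
```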
Although the backplane bus protocol was largely fair, there was a slight latency advantage to cards in the lower slot numbers. Therefore, it was advantageous to place the memory controllers in the lower numbered slots, since read latency critically affected system performance.
The core-based MCUs could be expanded in increments of 128 KB. A minimal system required at least 256 KB total, although practically all systems had more than this.
In 1985 or 1986, BTI designed a new memory controller that had an array of 64Kb DRAM chips mounted on board. SECDED ECC logic performed error checking and correction; a Z80 performed extensive diagnostics at power up of both the DRAM array and the board's control logic. The Z80 also logged any corrected errors so later this log could be inspected to see if a given chip or column was marginal. The board also used idle cycles to sweep through memory, with the hope that single bit errors could be repaired before they turned into double bit (uncorrectable) errors.
CPU (Computational Processing Unit)
A minimal system would have a single CPU, and high end systems could have up to eight CPUs. It wasn't architecturally specified as such, but none of the CPUs designed by BTI had any cache. Every instruction or data operation went over the backplane to an MCU. Instructions were fetched with double word transfers, which both increased backplane efficiency and acted as a kind of prefetch.
Because of the long latency to memory (around 12 cycles, or 750 ns, in the best case), the CPU would initiate the fetch of the next instruction as soon as it knew the instruction wasn't a branch.
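As a quick worked check of the latency figures above, the sketch below derives the cycle time they imply. It assumes the 12 cycles are uniform clock periods and takes both numbers at face value, although the text gives them only as approximate; the function names are illustrative.

```python
def cycle_time_ns(latency_ns, cycles):
    """Length of one cycle implied by a latency quoted both ways."""
    return latency_ns / cycles

def clock_mhz(cycle_ns):
    """Clock rate corresponding to a given cycle time."""
    return 1000.0 / cycle_ns

cycle = cycle_time_ns(750, 12)
print(cycle, clock_mhz(cycle))   # 62.5 16.0
```

Under those assumptions, a best-case memory read cost twelve 62.5 ns cycles, which is why starting the next instruction fetch early mattered so much.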
An earlier CPU design used eight 74F181 4b ALUs as the core computing element.
Five different CPU designs were done, although I'm not sure how many of them were shipped. The last one, CPU5, was completed in 1985. John Kinsel designed the hardware; Jeff Libby wrote the microcode. The heart of the CPU used eight 29C03 4b bit slice chips and a 29C11 (?) sequencer. It also used a Z80 supervisory processor to run diagnostics at power up.
CPU5 was very horizontally microcoded, with a 108 bit wide microword (96 functional, 12 parity). The microcode store was 8K words deep, and all in RAM; this allowed upgrading the microcode in the field by swapping in a different daughter card containing a bank of EPROMs.
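The 96 + 12 split of the microword suggests one parity bit per eight functional bits. That grouping is an assumption, as are the odd-parity default, the placement of the parity field above the functional bits, and all names; the sketch below only shows how such a word could be built and checked.

```python
FUNCTIONAL_BITS = 96
PARITY_BITS = 12

def parity_field(functional, odd=True):
    """One parity bit per 8-bit group of the functional field (assumed)."""
    if not 0 <= functional < (1 << FUNCTIONAL_BITS):
        raise ValueError("functional field must fit in 96 bits")
    result = 0
    for group in range(PARITY_BITS):
        byte = (functional >> (8 * group)) & 0xFF
        ones = bin(byte).count("1")
        # Choose the bit so each 9-bit group has odd (or even) weight.
        bit = (ones + (1 if odd else 0)) % 2
        result |= bit << group
    return result

def make_microword(functional):
    """Pack the parity bits above the functional bits: 108 bits in total."""
    return (parity_field(functional) << FUNCTIONAL_BITS) | functional

def check_microword(word):
    """True if the stored parity matches the functional bits."""
    functional = word & ((1 << FUNCTIONAL_BITS) - 1)
    return (word >> FUNCTIONAL_BITS) == parity_field(functional)

word = make_microword(0x0123456789ABCDEF01234567)
print(check_microword(word), check_microword(word ^ 1))   # True False
```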
In addition to the 2903 ALU, the CPU5 datapath logic contained a barrel shifter and some other assorted logic. The microcode assembler was intelligent and could compute the number of cycles it would take to compute a given result; this value was stored in the microword. Thus, different microinstructions took different amounts of time.
Considering the multicycle microinstructions, the lack of a cache, and the long latency to read memory from an MCU, performance of a given CPU wasn't stellar. A single CPU would typically use less than 10 percent of the available backplane bandwidth. Even though it was slow compared to higher end 32b CPUs of the era, the BTI 8000 was still at least three times faster than the then top-end BTI 5000. Note that the CISC instruction set of the 8000 was designed to maximize the amount of useful work done for each instruction word fetched from memory. This meant that in a head to head comparison with a more conventional machine, the 8000 would do the same work while fetching about half as many instructions. This helped mitigate the speed deficiencies of the architecture.
The slow CPU perversely had a system benefit; it made the simple backplane bus practical as a means of increasing system throughput. If a single CPU had been able to saturate the backplane bandwidth, it would have precluded adding more CPUs as a means of increasing performance. As it was, BTI estimated that seven CPUs in a system running a typical mix of operations ran as fast as about five and a half ideal CPUs.
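As a quick back-of-the-envelope check of the scaling figure above, the following sketch computes the fraction of ideal linear speedup that seven CPUs delivered. The function name is made up for illustration; the only inputs are the two numbers quoted in the text.

```python
def scaling_efficiency(effective_cpus: float, installed_cpus: int) -> float:
    """Fraction of ideal (linear) speedup actually delivered."""
    return effective_cpus / installed_cpus


# Figures quoted above: seven CPUs ran about as fast as 5.5 ideal CPUs.
efficiency = scaling_efficiency(5.5, 7)
print(f"{efficiency:.1%}")  # prints 78.6%
```

In other words, roughly a fifth of the added capacity was lost to contention and other overhead at seven CPUs.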
PPU (Peripheral Processing Unit)
The PPU was essentially a DMA engine. Each PPU could connect to four I/O controllers over two high speed and two low speed channels. For instance, the disk controller used a high speed channel, and the terminal muxes sat on a low speed channel.
A CPU could set up a DMA channel program in memory. Each step consisted of a list of registers to poke in a given I/O controller and a transfer of a given size to or from a given memory block; a sequence of these steps could be chained together. Once the channel program was constructed, the CPU would point the PPU at it and go on to some other process while the PPU took care of it.
Because there were multiple CPUs and a process could switch between CPUs frequently, it made no sense for a completed PPU program to interrupt a CPU. Instead, the PPU channel program would be told to write a given word to a particular location in memory. The next time the monitor program was sweeping the suspended process list, looking for work to do, it would find the notice from the PPU that the requested work was done, and the CPU would remove the process from the suspended process list (or take whatever other action was appropriate).
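The completion-by-memory-write scheme can be sketched in a few lines. This is only an illustration of the idea, not BTI's monitor code: the `Process` record, the `DONE` value, and the function names are all hypothetical, and main memory is modeled as a plain list of words.

```python
from dataclasses import dataclass

DONE = 1  # hypothetical value the channel program writes on completion


@dataclass
class Process:
    name: str
    flag_addr: int  # memory word the channel program writes when finished


def ppu_complete(memory: list, flag_addr: int) -> None:
    """Last step of a channel program: write a word to memory, no interrupt."""
    memory[flag_addr] = DONE


def sweep(memory: list, suspended: list, ready: list) -> None:
    """Monitor sweep: wake every suspended process whose flag word is set."""
    still_waiting = []
    for proc in suspended:
        if memory[proc.flag_addr] == DONE:
            memory[proc.flag_addr] = 0  # consume the notice
            ready.append(proc)
        else:
            still_waiting.append(proc)
    suspended[:] = still_waiting


memory = [0] * 16
suspended = [Process("A", 3), Process("B", 7)]
ready = []

ppu_complete(memory, 7)           # B's I/O finishes
sweep(memory, suspended, ready)   # whichever CPU sweeps next notices it
print([p.name for p in ready], [p.name for p in suspended])  # ['B'] ['A']
```

Because the notice lives in shared memory rather than in an interrupt line, it does not matter which CPU happens to run the sweep.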
The PPU, acting as a DMA engine, had a byte wide interface to each I/O controller (via ribbon cables), with FIFO decoupling on each channel. The PPU took care of all the byte packing and unpacking and address generation so each I/O controller could be simplified. The high speed channels ran at 10 MHz, and the low speed channels at 5 MHz.
The list of I/O controllers included:
- disk controller
- 9-track reel-to-reel tape controller
- cartridge tape controller (good for backups)
- printer controller
- terminal controller (up to 19.2 Kbps)
BMB (Bus Monitor Board)
This board was used only by the developers. In essence, it was a logic analyzer custom made for the BTI 8000 bus protocol. Because only a couple were ever built, it was a wire-wrapped affair.
A Z80 was able to set up a few triggers and capture events meeting some constraint. It was useful for, say, finding all the traffic between the MCU and a given disk controller, or looking for the first read after a certain address was written with a certain value.
It was flexible enough that I was able to write code to do statistical analysis of the mix of reads, double word reads, writes, callbacks, etc. on the bus.
BTI 8000 Instruction Set Architecture (link)
The BTI 8000 was architected in the mid 1970s, when complex instruction sets, as typified by the DEC VAX computer, were state of the art. Memory was a very expensive commodity, and it was thought that highly encoded instruction sets would make the most use of this expensive resource. At the time, core memory was still a viable technology for main memory.
The instruction set was defined by the software architecture group. Many features of the instruction set were chosen for performing OS-centric operations, such as operating on linked lists, performing atomic read/modify/write operations, and automatic subroutine linkage tasks. The focus was on encoding as much information in as few bits as possible, and on making operations on arbitrary sized fields fundamental. While these choices did make efficient use of the limited memory, they greatly complicated the CPU design and made some of the operations very slow.
Here are a few examples of the complications.
- A single instruction could add a register to a 1-32 bit field with arbitrary bit alignment, even if it crossed word boundaries. If the instruction used an index register, the value in the index register was scaled by the field width so that an array of packed bitfields could be addressed in a single instruction.
- An instruction could reference an operand in memory through a pointer in a register, or indirectly through a pointer stored in memory. The pointer, either in a register or in memory, wasn't just a word address; 17 bits contained a word address, and the other 15 bits encoded a variety of addressing options. That is, a pointer wasn't just an address, it was both an address and an extension of the opcode.
- It was possible to form an arbitrary permutation (and/or duplication) of the 32 bits of a register in a single instruction. The instruction would read a 32 word block of memory and form the XOR sum of all words where the corresponding mask bit in a register was '1'.
- Subroutine linkage consisted of a CALL opcode followed by one or more PAR (parameter) opcodes specifying the location and type of each parameter. The list of parameters was marked by an end-of-parameter-list opcode. There were three type bits, address (versus value), double (versus single), and last. They were copied into a register for the callee to check. On the receiving side, the subroutine had an ENTR (entry) opcode which specified which register was to hold the return address, followed by one or more STP (store parameter) opcodes, indicating the type of parameter and where it should be stored. The end of the list was marked by an end-of-parameter-list opcode. The number and types of parameters were checked at runtime, as well as the functional part of moving the parameters to the desired locations. In some cases parameter types that differed between the caller and callee would be coerced to the expected type.

        ...
    C:      CALL   S
            PAR    A1
            PAR    A2
            PAR2   A3
            PARV   A4
            PARL   A5
    C+6:    ...

        ...
    S:      ENTR   REG7
            STP    F1
            STPV   F2
            STP2   F3
            STPV   F4
            STPL   F5
            ... ( subroutine body ) ...
            LEAVE  REG7
The order of execution of the calling sequence is as follows:
- The CALL instruction reads the instruction word at location S and verifies that it is an ENTR instruction. Note that the CALL-ENTR pair assumes parameters to follow. If no parameters are to be passed, the instructions used are CALLNP-ENTRNP. A mismatch causes a fault. It then uses the operand of the ENTR instruction to save REG7. It then places the address of the ENTR instruction plus 1 into REG7. The program counter is advanced to the next instruction.
- The PAR A1 instruction is executed. It places the address of the operand A1 into R0. Then it swaps the incremented program counter with REG7.
- The STP F1 instruction is executed. It verifies that the passed value is an address (not a value). It copies the address of A1 (from R0) into the operand F1. Then it swaps the incremented program counter with REG7.
- The PAR A2 instruction is executed. It places the address of the operand A2 into R0. Then it swaps the incremented program counter with REG7.
- The STPV F2 instruction is executed. It sees that the caller gave an address, so it dereferences the address and stores the resulting value into F2. Then it swaps the incremented program counter with REG7.
- The PAR2 A3 instruction is executed. It places the address of the double operand A3 into R0. Then it swaps the incremented program counter with REG7.
- The STP2 F3 instruction is executed. It verifies that the caller gave an address for a double word, so it stores the address into F3. Then it swaps the incremented program counter with REG7.
- The PARV A4 instruction is executed. It places the contents of operand A4 into R0. Then it swaps the incremented program counter with REG7.
- The STPV F4 instruction is executed. It verifies that the caller gave a value, so it stores the value into F4. Then it swaps the incremented program counter with REG7.
- The PARL A5 instruction is executed. It places the address of operand A5 into R0. Then it swaps the incremented program counter with REG7.
- The STPL F5 instruction is executed. It verifies that the caller gave an address so it stores the address into F5. It verifies that the caller specified that this was the last parameter. Then it continues to the next instruction in the called routine.
- The function runs, and the LEAVE instruction places REG7 into the program counter to resume the caller's context and restores REG7 from the indicated operand.
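The ping-pong between caller and callee is easier to see when traced. The sketch below models only the control flow of the sequence just described (the program counter and REG7 swapping); the parameter moves, the type checks, and the save and restore of the caller's REG7 are left out. The addresses `C` and `S` and the tuple encoding of the program are made up for illustration.

```python
C, S = 100, 200  # hypothetical word addresses of the call site and subroutine

program = {
    C + 0: ("CALL", "S"),
    C + 1: ("PAR", "A1"),
    C + 2: ("PAR", "A2"),
    C + 3: ("PAR2", "A3"),
    C + 4: ("PARV", "A4"),
    C + 5: ("PARL", "A5"),
    S + 0: ("ENTR", "REG7"),
    S + 1: ("STP", "F1"),
    S + 2: ("STPV", "F2"),
    S + 3: ("STP2", "F3"),
    S + 4: ("STPV", "F4"),
    S + 5: ("STPL", "F5"),
    S + 6: ("LEAVE", "REG7"),  # stands in for the whole subroutine body
}


def run_call(program, call_addr, entry_addr):
    """Trace the PC through one CALL ... LEAVE sequence."""
    trace = []
    pc, reg7 = call_addr, None
    while True:
        op, _operand = program[pc]
        trace.append(pc)
        if op == "CALL":
            if program[entry_addr][0] != "ENTR":
                raise RuntimeError("fault: target is not an ENTR")
            reg7 = entry_addr + 1      # first STP in the callee
            pc += 1                    # first PAR in the caller
        elif op.startswith("PAR"):
            pc, reg7 = reg7, pc + 1    # hand control to the matching STP
        elif op.startswith("STP"):
            if op.endswith("L"):
                pc += 1                # last parameter: fall into the body
            else:
                pc, reg7 = reg7, pc + 1  # hand control back to the next PAR
        elif op == "LEAVE":
            return trace, reg7         # REG7 holds the return address


trace, return_pc = run_call(program, C, S)
print(trace)      # [100, 101, 201, 102, 202, 103, 203, 104, 204, 105, 205, 206]
print(return_pc)  # 106, i.e. C+6
```

Each PAR/STP pair costs two changes of control flow, which gives a feel for why this linkage was expensive on a machine with slow instruction fetch.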
Supposedly the person writing the microcode for the first CPU exclaimed, facetiously, that the listing for the CPU microcode was larger than the listing for the OS.
User State
Like most OS's, there was an explicit model of the user state. BTI called this the virtual machine. By having a clear definition of this state, multiple generations of CPUs could run the same code without modification. It was even normal to have a system containing multiple CPUs of different generations with the processes running on a different type of CPU each time slice (typically 100 ms, or until the process blocked on a resource).
The user's concept of the computer was:
- one 17b program counter
17 bits was enough because the virtual address space was 128 Kwords (512 KB).
- one 15b process status register
This held various user-accessible mode bits and flag state. For instance, mode bits indicated if memory locations containing uninitialized values were to cause a trap. Other bits contained the results of the most recent comparison.
- eight 32b registers
All eight registers were nearly general purpose, although a few specialized instructions were hardwired to use certain registers, such as CMOVE (character move). Other registers were assigned a dedicated function via software convention, such as using R6 as the stack pointer.
- a 17b current console area register
This pointed to a 10-word area in the user space where the user state was stored in the event of an interrupt. Ten words were enough to hold the program counter, the process status register, and the eight general purpose registers.
- 512 KB of memory
A system could contain more than 512 KB, but any single process was limited to a total of 512 KB virtual address space to hold all the code and data. Although the user saw 512 KB, it was actually organized into 4 KB pages that could be swapped between main memory and disk. The OS also allowed limiting a given process to less than the total 512 KB.
Memory Paging
With a limited virtual address space of only 128K words, the paging system was very simple: a single table containing the mapping for 128 pages of 4 KB per page was sufficient. This table lived on the CPU in a small SRAM. The bottom 10 bits of the address were unchanged and indexed a word within the page, and the upper 7 bits of the virtual address indexed the mapping table, producing the physical page address and other status.
The page mapper had 256 entries: 128 for the current user space, and 128 for the monitor. One bit in the monitor status register indicated if the CPU was in user mode or privileged mode, and that selected which half of the mapping table was in use.
Each page table entry had 20 bits, with various fields.
- 4 bits indicated which slot to address. Any slot could be specified, not just an MCU
- 12 bits provided the physical page address (up to 16 MB per slot)
- four control bits, indicating whether the page was resident, whether it was modified, whether it had been accessed (useful for LRU aging)
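The translation just described can be sketched as follows. Several details here are assumptions made only so the example runs: the order in which the slot, page, and control fields are packed into the 20-bit entry, the choice of the first 128 entries for user space, and the omission of any check on the control bits. All function names are hypothetical.

```python
WORD_BITS_IN_PAGE = 10   # 1 Kword (4 KB) per page
PAGES_PER_SPACE = 128    # 7-bit virtual page number


def make_entry(slot: int, phys_page: int, control: int = 0) -> int:
    """Pack a 20-bit entry as [4b slot][12b page][4b control] (assumed order)."""
    assert 0 <= slot < 16 and 0 <= phys_page < 4096 and 0 <= control < 16
    return (slot << 16) | (phys_page << 4) | control


def translate(page_table, vaddr: int, privileged: bool = False):
    """Map a 17-bit virtual word address to (slot, word address within slot)."""
    assert 0 <= vaddr < (1 << 17)
    vpage = vaddr >> WORD_BITS_IN_PAGE
    offset = vaddr & ((1 << WORD_BITS_IN_PAGE) - 1)
    # Assumption: user entries are 0..127, monitor entries are 128..255.
    entry = page_table[(PAGES_PER_SPACE if privileged else 0) + vpage]
    slot = (entry >> 16) & 0xF
    phys_page = (entry >> 4) & 0xFFF
    return slot, (phys_page << WORD_BITS_IN_PAGE) | offset


table = [0] * 256
table[3] = make_entry(slot=2, phys_page=0x123)
slot, phys = translate(table, (3 << 10) | 0x2A)
print(slot, hex(phys))  # 2 0x48c2a
```

The 12-bit page number times 4 KB per page gives the 16 MB per slot limit mentioned above.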
Data Types
The BTI 8000 had instructions that operated on a number of data types. Most of them are tersely listed here.
- 32 bit fixed point
- 64 bit fixed point
- 64 bit floating point
- bit field from 1 to 32 bits long
- 8 bit character (extra support vs. the generic bit field addressing)
- 32b pointer
- linked list primitives
- pushdown stack primitives
- miscellaneous
The machine used two's complement arithmetic, but an optional commercial instruction set added extensive operations for supporting variable sized BCD math operations, and things like "FIELD EDIT" opcodes (like a PRINT USING statement in a single instruction).
For integer and floating point values, a unique "uninitialized" value was defined by the instruction set. The uninitialized value was an msb of 1, with 31 or 63 trailing zeros. This corresponds to the most negative value in a two's complement number system. If the uninitialized value checking was enabled, a trap occurred if any operands were seen with that value.
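A small sketch of the uninitialized-value convention, assuming words are held as unsigned Python integers. The exception class and function names are invented for the example; the real hardware raised a trap rather than an exception.

```python
UNINIT32 = 1 << 31   # msb set, 31 trailing zeros (hexadecimal 80000000)
UNINIT64 = 1 << 63   # msb set, 63 trailing zeros


class UninitializedOperandTrap(Exception):
    """Stand-in for the hardware trap."""


def to_signed(word: int, bits: int) -> int:
    """Interpret an unsigned word as a two's complement value."""
    return word - (1 << bits) if word & (1 << (bits - 1)) else word


def check_operand(word: int, bits: int = 32, trap_enabled: bool = True) -> int:
    """Return the signed value, trapping on the uninitialized pattern."""
    if trap_enabled and word == 1 << (bits - 1):
        raise UninitializedOperandTrap(f"{bits}-bit operand is uninitialized")
    return to_signed(word, bits)


print(to_signed(UNINIT32, 32))  # -2147483648, the most negative 32-bit value
```

The cost of the scheme is that the most negative integer cannot be used as ordinary data while checking is enabled.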
Instruction Formats
Lacking an instruction set reference manual, the following information has been paraphrased from a paper BTI presented in AFIPS Volume 48 National Computer Conference (1979, pp. 513-528).
All instructions in the BTI 8000, without exception, were 32 bits wide and aligned on 32b boundaries. The first ten bits supplied the major opcode, but some instruction formats encoded sub-opcodes in other parts of the instruction word.
Like most computers, the BTI 8000 trapped any illegally encoded instructions. The designers designated a word of all 0s or all 1s to be illegal, as well as any opcode that started with 0x20, ASCII space. These values were deemed the most likely data words, and so making them illegal meant that errant programs would more likely get trapped before doing harm.
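As a rough illustration of the checks described above, the sketch below pulls the major opcode out of an instruction word and tests for the three deliberately illegal patterns. It assumes the "first ten bits" are the most significant ten, and it reads "started with 0x20" as meaning the top byte of the word is 0x20; both are interpretations, not documented facts. Other illegal encodings are not modeled.

```python
def major_opcode(word: int) -> int:
    """Top 10 bits of a 32-bit instruction word (assumed msb-first layout)."""
    return (word >> 22) & 0x3FF


def is_deliberately_illegal(word: int) -> bool:
    """True for all 0s, all 1s, or a word whose first byte is ASCII space."""
    return word == 0 or word == 0xFFFFFFFF or (word >> 24) == 0x20


# Four ASCII spaces, a very common data word in text buffers.
print(is_deliberately_illegal(0x20202020))  # True
print(is_deliberately_illegal(0x12345678))  # False
```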
There were a total of about 200 opcodes, and around 30 different addressing modes. Helping keep things sane, just about any address mode that made sense could be used for any opcode.
- Immediate
Format: [10b opcode][5b mode][17b field]
In this format, bits [16:0] are used to form either an immediate value, or an immediate address. Depending on the size of the operand called out by the opcode, the immediate value may be expanded to 32 bits or 64 bits.
- the 17b field is right justified and zero filled to form an immediate
- the 17b field is right justified and ones filled to form an immediate
- the 17b field is left justified and zero filled to form an immediate
- the 17b field is the word address of an operand in memory
- the 17b field is the word address of an indirect pointer in memory
- Indexed Memory
Format: [10b opcode][2b mode][3b idx reg][17b address]
This either supplies the address of a word in memory, or it supplies a location in memory of a pointer to another location in memory. The index register value is then added to that address to provide the location in memory where the operand resides. Instructions with double word length use an offset of two times the index register value.
- 17b direct address
- 17b indirect address
- Base Register
Format: [10b opcode][5b mode][3b base reg][4b submode][10b offset]
There are six different modes that use this format; their behaviors are complicated and not described here.
- register to register
- register indirect
- word array
- character array
- formal parameter
- stack
- Indexed Base Register
Format: [10b opcode][5b mode][3b base reg][3b idx reg][1b submode][10b offset]
This format is like the Base Register format, except there is a smaller offset field, and an index register value is added to the effective address that the plain Base Register format would compute.
- register indirect
- word array
- character array
- formal parameter
- Type Conversion
Format: [10b opcode][5b mode][3b reg][4b submode][3b unused][2b type][5b unused]
This format is used to convert between 32 bit fixed point, 64 bit fixed point, and 64 bit floating point formats. The fixed point formats can be treated as signed or unsigned, and conversions can be specified to round or not in case of loss of precision.
- Byte
Format: [10b opcode][5b mode][3b base reg][5b bit][5b field len][4b offset]
The instruction set has no shift or rotate instruction. Instead, this format is used by some instructions. In one mode a register is viewed as a circular list of bits and the instruction specifies an arbitrary field starting at an arbitrary offset. In the other mode the register specifies a word in memory where the bit field exists and again, an arbitrary field can be extracted. Rotates and shifts can be obtained by using the Load-Effective-Address instruction and this addressing mode.
- register ("circular")
- array ("zigzag")
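To see how a circular bit field can stand in for shifts and rotates, here is a minimal sketch of the register ("circular") mode. It assumes bit positions are counted from the most significant bit, which the text does not state, and it takes already-decoded start and length values rather than modeling the 5-bit instruction fields.

```python
MASK32 = 0xFFFFFFFF


def rotl32(value: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((value << n) | (value >> (32 - n))) & MASK32


def circular_field(reg: int, start: int, length: int) -> int:
    """Extract `length` bits beginning `start` bits from the msb, wrapping."""
    assert 0 <= start < 32 and 1 <= length <= 32
    rotated = rotl32(reg, start + length)   # field now sits in the low bits
    return rotated & ((1 << length) - 1)


r = 0x12345678
print(hex(circular_field(r, 8, 8)))    # 0x34       plain field extract
print(hex(circular_field(r, 4, 32)))   # 0x23456781 rotate left by 4
print(hex(circular_field(r, 0, 24)))   # 0x123456   logical shift right by 8
print(hex(circular_field(r, 28, 8)))   # 0x81       field wrapping past bit 31
```

A full-width field with a nonzero start is a rotate, and a short field starting at the top of the register is a right shift, so no separate shift opcodes are needed.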
Words of memory which are used as pointers are also encoded:
[2b character][3b bit][5b field len][5b mode][17b address or immediate]
The "mode" field is akin to the one in the Immediate format above. Which fields were used and how they were interpreted depended on the operation. Note that a pointer could point to not just a word in memory, but an arbitrary 1-32b field in memory. Other wonders were possible. In array mode, the offset value is multiplied by the field size and the appropriate math is carried out so that a packed array of arbitrary (1-32b) values could be directly addressed.
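The pointer layout and the packed-array arithmetic can be illustrated with a short sketch. It assumes the fields are listed from most significant to least significant bit and that fields within a word are numbered msb-first; how the hardware encoded a 32-bit length in a 5-bit field is not stated, so `packed_element` takes the width as a plain integer. All names are hypothetical.

```python
def decode_pointer(word: int) -> dict:
    """Split a pointer word: [2b char][3b bit][5b len][5b mode][17b address]."""
    return {
        "character": (word >> 30) & 0x3,
        "bit": (word >> 27) & 0x7,
        "field_len": (word >> 22) & 0x1F,
        "mode": (word >> 17) & 0x1F,
        "address": word & 0x1FFFF,
    }


def packed_element(memory, base_word: int, width: int, index: int) -> int:
    """Read element `index` of a packed array of `width`-bit fields."""
    assert 1 <= width <= 32
    bit_addr = base_word * 32 + index * width   # index scaled by field width
    word, offset = divmod(bit_addr, 32)
    crosses = offset + width > 32               # field spills into next word
    pair = (memory[word] << 32) | (memory[word + 1] if crosses else 0)
    return (pair >> (64 - offset - width)) & ((1 << width) - 1)


# Five 12-bit values packed back to back: ABC DE1 234 567 800
memory = [0xABCDE123, 0x45678000]
print([hex(packed_element(memory, 0, 12, i)) for i in range(5)])
# ['0xabc', '0xde1', '0x234', '0x567', '0x800']
```

Element 2 straddles the word boundary, which is the case that made this addressing mode costly to implement in hardware.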
Instruction Set Summary (link)
This set of instructions was lifted from BTI_8000_Technical_Summary_Sep78.pdf.
APPENDIX A: SUMMARY OF USER-MODE CPU INSTRUCTIONS
A.1 Fixed Point Arithmetic
- ADD
- operand added to contents of specified register, result stored back in that register
- ADDM
- ("add to memory") as above, but result replaces operand instead of register
- ADDB
- ("add to both") as in ADDM, but result also stored in register
- ADD2, ADD2M, ADD2B
- double-word analogs of above
- SUB
- operand subtracted from contents of specified register, result stored back in that register
- SUBM, SUB2, SUB2M
- see ADD family
- RSB
- ("reverse subtract") contents of specified register subtracted from operand, result stored back in that register
- RSBM, RSB2, RSB2M
- see SUB family
- MUL, MULM, MUL2, MUL2M
- multiply family (see ADD, SUB)
- DIV, DIVM, DIV2, DIV2M
- divide family
- RDV, RDVM, RDV2, RDV2M
- reverse divide family
- LD, LDN (N="negate"), LD2, LDN2
- load register family
- INCL, INCL2
- Increment operand by 1, then load reg. with this new value
- ST, ST2
- store register (single, double)
- STW, STW2, STMW, STMW2
- store the value "one" (W) or "minus one" (MW)
- STU, STU2
- store the value "undefined" (hexadecimal 80000000)
- STZ, STZ2
- store the value "zero"
- EXCH, EXCH2
- exchange register, operand
- INC, INC2, DEC, DEC2
- increment/decrement operand by one
- INCP, DECP
- increment/decrement pointer. These instructions assume the operand is a pointer. The bit length of the pointed-to entity (carried in the pointer) is added to/subtracted from its bit address, thus moving the pointer forward/backward one entry, no matter what the size of the entry.
A.2 Floating Point Arithmetic
These instructions deal with 64-bit (double word) floating-point operands, which have 11-bit biased exponents and 52-bit mantissas. Double-precision floating-point operands (128 bits) are generated and manipulated by software.
- FAD, FADM, FADB
- floating add ("to memory", "to both")
- FSB, FSBM, FMU, FMUM, FDV, FDVM
- floating subtract, multiply, divide
- FRSB, FRSBM, FRDV, FRDVM
- floating reverse subtract, reverse divide
- FINC, FDEC
- floating increment, decrement memory (by one)
- FINCL
- increment floating-point operand by 1, then load adjacent registers with this new value
A.3 Boolean Arithmetic
- AND, ANDM, AND2
- similar to fixed-point ADD family
- BSUB, BSUBM
- result = register AND NOT operand (Boolean subtract)
- BRSBM
- Boolean reverse subtract to memory
- IOR, IORM, IOR2
- inclusive OR family
- XOR, XORM, XOR2
- exclusive OR family
- SETT
- (set and test) set operand to one after setting condition bits to comparison of register and operand (used for locking of critical regions)
A.4 Jumps
- Unconditional
- JMP (load Program Counter with operand)
- Conditioned on PSR condition bits
- JCC,JCS (if carry clear/set), JOC, JOS (if overflow clear/set), JEQ, JNE, JLT, JGT, JLE, JGE
- Conditioned on comparison of register contents to zero ("Z") or minus one ("MW")
- JEQZ, JEQZ2, JNEZ, JNEZ2, JLTZ, JLTZ2, JGTZ, JGTZ2, JLEZ, JLEZ2, JGEZ, JGEZ2, JEQMW, JNEMW
- Bit tests
- JBT, JBF (if bit in register true/false)
- Address tests
- JZA, JNZA ( if address field of register zero/non-zero)
- Register increment/decrement
- IRJ, DRJ (inc/dec register, then jump if result not equal to zero); JIR, JDR (if register not equal to zero, inc/dec register and jump)
- Linkage jumps, conditioned on zero/non-zero address field fetched through register
- LJZA, LJNA (load register with address field of word it points to, then jump if result zero/non-zero); RLJZA, RLJNA (remember, link, and jump -- save register in adjacent register, then proceed as in LJZA, LJNA)
A.5 Subroutine Linkage
Several instructions are provided for subroutine linkage; they check entrypoints and provide parameter type-checking for the subroutine. The calling sequence and the entry sequence are executed part by part, passing one parameter at a time with the PAR (pass parameter) instructions on the calling side and corresponding STP (store parameter) instructions on the subprogram side. These instructions specify the parameter type (including "2" for doubleword), whether the parameter is being passed by location or value ("V"), and whether this is the last ("L") parameter in the protocol.
- CALL, CALLNP ("NP" = no parameters)
- begin linkage from calling side
- ENTR, ENTRNP, ENTRS ("S" = start, for non-standard parameter passing)
- begin subroutine
- PAR, PAR2, PARL, PAR2L, PARV, PARV2, PARVL, PARV2L
- pass parameter
- STP, STP2, STPL, STP2L, STPV, STPV2, STPVL, STPV2L
- store parameter
- LEAVE
- leave subroutine
- LDPC, LDPCS
- load Program Counter ("S" = also load status bits)
- EXPC, EXPCS
- exchange Program Counter (and status) with operand
- JSR
- jump and save return address in register
A.6 Compare Instructions
- CKB, CKB2, FPCKB, I2CKB
- bounds checking for array indexing
- CPR, CPR2, UCPR, UCPR2
- signed/unsigned compare register with operand
- MCPR
- masked compare register with operand (adjacent register selects bits)
- CMZ, CMZ2
- compare operand ("memory") to zero
- STLEQ
- store logical one ("1") iff condition bits = "EQ", else store zero
- STLNE, STLLT, STLGT, STLLE, STLGE
- as above for other conditions
A.7 Character Instructions
These instructions are interruptible, and deal with character strings whose starting address and length are given by register values. The CMOVE instruction loads and stores whole words and thus is quite efficient no matter what the character alignment might be.
- CSRCH
- search for a specified character in a specified string
- CMS
- compare strings (can be paired with CSRCH to search for substrings)
- CMOVE
- move string
A.8 Miscellaneous Instructions
- LDPSR, STPSR
- load/store Process Status Register
- CLPSR, IORPSR, XORPSR
- PSR bit manipulation
- HIB, HIB2
- find location of leftmost one-bit in operand
- LEA, LEA2
- load effective address (generate a pointer)
- XCT
- execute operand as if it were an instruction (one level only)
- LSRCH
- linked list search. Searches through a linked list of structures for a match between the value in a specified part of each structure and a value in a register (or register pair)
- PMUT
- (permute) Using a 32-word table, this instruction can permute bits in a register, encrypt data, compute parity, and form block checksums.
- NOP
- no operation
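The PMUT entry above is terse, so here is a minimal sketch of the idea as described earlier in this article: the instruction XORs together the table words selected by the set bits of a register. The bit numbering (bit i of the mask selects table word i, counting from the least significant bit) is an assumption, and the function name is made up.

```python
def pmut(mask: int, table: list) -> int:
    """XOR together table[i] for every bit i that is set in mask."""
    assert len(table) == 32
    result = 0
    for i in range(32):
        if (mask >> i) & 1:
            result ^= table[i]
    return result


# Permutation: each table word has exactly one bit set. Here, bit reversal.
reverse_table = [1 << (31 - i) for i in range(32)]
print(hex(pmut(0x0000000F, reverse_table)))  # 0xf0000000

# Parity: every table word is 1, so the result is the XOR of all mask bits.
parity_table = [1] * 32
print(pmut(0b1011, parity_table))  # 1 (three bits set, odd parity)
```

Table words with more than one bit set duplicate a source bit into several result positions, which is how the same instruction covers checksums and simple encryption.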
A.9 Address Modes
In addition to specifying a register, many instructions also specify an operand through an address mode field. Address mode parameters can in turn involve the specification of one or two registers used to arrive at an operand. Indirect addressing proceeds through "pointers", which themselves specify five different methods of addressing. The following summary is by class, with the number in parentheses representing the total number of modes in each major class. The distinction between single-word and double-word addressing (for word-size operands) is not considered in this count, since that distinction is made in the instruction operation-code field.
( 1) DIRECT
( 1) INDEXED
( 3) IMMEDIATE
( 5) INDIRECT
( 2) INDIRECT AND INDEXED (first indirect, then indexed)
( 1) REG1 (register select, with value biased)
( 1) ARWD1 (offset from base register)
( 1) CACH1 (offset to character from base register)
( 5) FPVR1 (offset from base register, then indirect)
( 1) REG2 (as in REG1, but indexed)
( 1) ARWD2 (offset from base register, then indexed)
( 1) CACH2 (offset from base register, then indexed to character)
( 2) FPVR2 (offset from base register, then indirect, then indexed)
( 1) CBM (circular bit-string mode)
( 1) ZBM (zig-zag bit-string mode)
( 1) STK (stack mode)
( 4) TCONV (type conversions: integer/floating-point, etc.)
Totals: 32 address modes through 17 classes
Trivia (link)
-
The systems used by hardware and software in development and debugging were named after the seven deadly sins:
lust, gluttony, greed, sloth, wrath, envy, pride
I don't recall if all seven were actually in operation, or if they only got part way through the sequence.
-
The system clock was nominally 15 MHz, which is a 66.7 ns cycle. All hardware design was done to meet a 60 ns cycle time at worst case operating conditions. The extra 10% was design margin.
-
Although the realtime clock was battery backed up and crystal controlled, it could still drift somewhat. When time corrections were entered, the SSU would slowly drift the clock back into accuracy (rather than jumping the value all at once, which would mess up time stamps and accounting). Further, it would calculate the rate of drift and use that to set up a countering rate of drift to cancel out the inherent drift.
-
Systems were usually purchased on a payment plan, and BTI was worried about scurrilous customers who would stop payment but continue to use their system.
The 8000 had a remote diagnostic facility (RDF), whereby BTI technicians could log in and repair problems remotely. If a customer stopped payment, BTI could log in and shut down the system. However, the RDF could be disabled from the front panel, as a security measure.
Worried that a deadbeat customer might just turn off the RDF access, BTI had a mechanism in the SSU where it would automatically disable the system after N days. Normally the RDF maintenance check would reset the timer if the customer was up to date, but if the customer shut out RDF, the system would be disabled anyway.
BTI never used the feature, though, as having any customers was much more of a concern than hypothetical deadbeat customers.