BTI 8000 Bus Protocol

System Interconnect

The BTI 8000 was built to be a multiprocessing computer. A key feature of any multiprocessor is the interconnection scheme whereby the various resources communicate with each other.

Many multiprocessor systems of the day used either multi-ported memory (which was expensive and didn't scale well) or crossbar switches to interconnect devices. While bandwidth could scale as resources were added in the latter case, it was terribly expensive, because the cost grew geometrically as the system was expanded.

A simple shared broadcast bus structure was used in the 8000. At the heart of it was a 32-bit wide synchronous bus, operating at 15 MHz. Obviously, the peak bandwidth could not be more than 60 MB/second, about half of the first generation PCI bus bandwidth. Each byte of the 32-bit bus was parity checked on every transfer.

This type of solution does not scale well, because the bandwidth provided by the backplane is constant no matter how many resources are added. If many devices are on the bus and all are active, there will be contention, and each device sees low throughput as a result.

This wasn't a problem in the 8000 for a few reasons. The first was that the system was designed to have a maximum of 16 cards, with no attempt to grow beyond that. Another important factor was that any individual device was not very efficient at consuming bandwidth. A given card could have only one request outstanding at a time. With a read latency of about 700 ns best case, a single slot would typically consume less than 10 percent of the theoretical peak bandwidth available on the backplane.
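
As a rough check of that last figure, here is a small sketch of the arithmetic. It assumes a slot issuing back-to-back single word reads, each taking the full 700 ns best-case latency, and it ignores everything else a real card would be doing; the function names are mine, not BTI's.

```python
BUS_WIDTH_BYTES = 4          # 32-bit data path
BUS_CLOCK_HZ = 15_000_000    # 15 MHz synchronous clock
READ_LATENCY_S = 700e-9      # best-case read latency quoted in the text


def peak_bandwidth() -> int:
    """Peak backplane bandwidth in bytes per second."""
    return BUS_WIDTH_BYTES * BUS_CLOCK_HZ


def single_slot_fraction(bytes_per_request: int) -> float:
    """Fraction of peak used by one slot with one request outstanding.

    Assumption: the slot issues its next read the instant the previous
    one completes, so it moves bytes_per_request every READ_LATENCY_S.
    """
    per_slot = bytes_per_request / READ_LATENCY_S
    return per_slot / peak_bandwidth()


print(f"peak: {peak_bandwidth() / 1e6:.0f} MB/s")
print(f"single word reads: {single_slot_fraction(4):.1%} of peak")
```

Under these assumptions a slot doing nothing but single word reads lands a little under 10 percent of peak, which is consistent with the estimate above.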

Another way of stating it is the backplane was way over-designed for a minimal system, and contention wasn't a factor until a system was fully loaded.

Bus Protocol

The backplane in the BTI 8000 was passive; all arbitration and control was distributed on the various boards plugged into the slots. The design was such that an unused slot didn't require a dummy card.

Any card could be plugged into any slot in the system, although there were some practical considerations. Lower numbered slots had somewhat better latency characteristics (described later), so it was best to put memory controllers there. Next came I/O controllers, which had real time constraints. The SSU, which provided the system clock, was usually put in one of the middle slots, to minimize clock skew. Another reason for juggling the order of cards was the sad fact that some systems had marginal timing, and simply swapping cards around often got a flaky system to be reliable.

The bus consisted of a 32-bit payload along with a host of control signals. "Messages" consisted of one or two data beats. All signaling on the backplane was done with open collector drivers, so there was never any worry about bus contention.

A write request consisted of two beats of data; the first beat held the write command and the 22-bit address of the word to be written. The second data beat contained the 32-bit word to be written at the address given in the previous clock cycle. There were no byte write enables; only full words could be written.
A read request consisted of a single beat of data; this 32-bit message held the request type and the 22-bit address of the word being requested. Reads could ask for either a single word or a pair of consecutive words. The request was split, meaning that after the request was made, other boards could make requests of their own, and at some later time the requested word or words would be sent back to the requester in a response transaction.
A read/modify/write sequence was a hybrid of reads and writes. It too was a split transaction transfer, starting off like a normal read command. However, unlike a normal read, the MCU would lock out any new requests until a data transfer containing the modified data came back to the MCU. This was the means for making atomic memory updates.
A set of informational message types made up a small portion of bus traffic. These were called "Who Are You" messages, because at power up each board was queried for its type so that the system could be configured. These messages were also used to ask a card to run its self-test routine, to query the result, and to get interrupt and error status.
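
To make the one-beat and two-beat message shapes concrete, here is a minimal sketch. The only facts taken from the description above are the 32-bit beat, the 22-bit address, and the number of beats per message. The bit positions and the command codes are hypothetical; I don't have the real encoding.

```python
from enum import IntEnum

ADDR_BITS = 22
ADDR_MASK = (1 << ADDR_BITS) - 1
WORD_MASK = 0xFFFFFFFF


class Command(IntEnum):
    # Hypothetical command codes, chosen only for illustration.
    READ_SINGLE = 1
    READ_DOUBLE = 2
    WRITE = 3
    READ_MODIFY_WRITE = 4


def encode_request(cmd: Command, address: int) -> int:
    """Pack a command and a 22-bit word address into one 32-bit beat.

    Assumed layout: address in the low 22 bits, command just above it.
    """
    if not 0 <= address <= ADDR_MASK:
        raise ValueError("address does not fit in 22 bits")
    return (int(cmd) << ADDR_BITS) | address


def decode_request(beat: int) -> tuple[Command, int]:
    """Inverse of encode_request."""
    return Command(beat >> ADDR_BITS), beat & ADDR_MASK


def write_message(address: int, data: int) -> list[int]:
    """A write is two beats: command + address, then the full data word."""
    return [encode_request(Command.WRITE, address), data & WORD_MASK]


def read_message(address: int, double: bool = False) -> list[int]:
    """A read request is a single beat; the data comes back later."""
    cmd = Command.READ_DOUBLE if double else Command.READ_SINGLE
    return [encode_request(cmd, address)]
```

A read/modify/write would start with a one-beat request like `read_message` and finish with a data beat returned to the MCU, which is why it is described as a hybrid.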

Transfers were done in three phases. Although each phase was sequential from the point of view of a given transfer, the three phases were overlapped to increase backplane efficiency.

Arbitration Phase

Assuming an idle bus, any number of slots could make a request to use the bus. The winner was determined in the same cycle that all the requests were made, using a novel arbitration scheme, detailed in BTI's patent for it.

Requests were processed in order of the priority established by their slot numbers: a lower numbered slot had priority over a higher numbered slot. To prevent a group of active higher priority slots from locking out the lower priority slots, the rules of the bus were that once a group of slots started a request arbitration, no slots could join (or rejoin) the group of requesters until after all the original requests had been serviced.

This ensured a good degree of bandwidth fairness, while giving some slight advantage to the real-time cards plugged into the high priority slots.
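
The effect of the "no rejoining" rule is easy to see in a toy model. The sketch below only models the order in which slots are granted the bus; it says nothing about how the hardware picked a winner in a single cycle (that is what the patent covers). The function and parameter names are made up for this example.

```python
def grant_order(always_requesting: set[int], one_shot: set[int], grants: int) -> list[int]:
    """Return the order in which slots win the bus.

    always_requesting: slots that want the bus again as soon as they are served.
    one_shot: slots that make a single request at time zero.
    """
    waiting = set(always_requesting) | set(one_shot)  # slots that want the bus
    group: set[int] = set()                           # current closed group of requesters
    order: list[int] = []
    for _ in range(grants):
        if not group:
            # The previous group has been fully serviced, so every
            # waiting slot may join a new arbitration group now.
            group = set(waiting)
        if not group:
            break                      # nobody wants the bus
        winner = min(group)            # lowest slot number has priority
        group.discard(winner)
        order.append(winner)
        if winner not in always_requesting:
            waiting.discard(winner)    # one-shot requester is finished
        # A slot in always_requesting stays in `waiting`, but it cannot
        # rejoin `group` until the group has drained.
    return order


print(grant_order({0, 1}, {9}, 6))
```

With slots 0 and 1 hammering the bus and slot 9 making one request, slot 9 is served third rather than being starved, and only then do slots 0 and 1 get back in.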

Address Phase

The address phase was where a source (requester) notified the destination that it intended to send something in the next cycle. The values driven in this phase were simply 4-bit slot IDs; despite the name, only the board ID was communicated, not a memory address.

Data Phase

The data phase was when the source sent a "message" to the destination slot. Most often it was a write request, a read request, or the response to an earlier read request. It was always either one (single word transfer) or two clocks (double word transfer) long.

A 32-bit, 15 MHz bus has a peak bandwidth of 60 MB/s. Because the memory address and the data were time multiplexed on the same bus, the actual usable bandwidth was lower. For writes, 30 MB/s was the theoretical best; for reads, it was 30 MB/s for single word reads and 40 MB/s for double word reads.
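
Those numbers fall out of counting data beats. The sketch below assumes the arbitration and address phases are completely hidden by overlap with other transfers, so only data-phase beats cost bandwidth; the helper name is mine.

```python
PEAK_MB_S = 60  # 4 bytes per beat at 15 MHz


def effective_mb_s(payload_words: int, bus_beats: int) -> float:
    """Usable bandwidth when payload_words of data cost bus_beats data beats."""
    return PEAK_MB_S * payload_words / bus_beats


write = effective_mb_s(1, 2)        # command/address beat + data beat
read_single = effective_mb_s(1, 2)  # request beat + one response beat
read_double = effective_mb_s(2, 3)  # request beat + two response beats

print(write, read_single, read_double)
```

In each case the payload is the fraction of data beats that carry memory words rather than a command and address.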

Callbacks

If for any reason the destination slot was not ready to accept a transfer, it raised a "NACK" (negative acknowledge) signal, and issued a four bit "callback" number to the requesting slot. In fact, a single card could receive a number of requests (although never more than one from a given slot) that it denied before it became available again.

A slot could be unavailable for two reasons: either an MCU was tied up in the middle of an atomic read/modify/write, or it had reached its capacity for pipelining requests (two in practice, although this wasn't mandated by the bus protocol).

When the card that NACK'd the request or requests was once again available to process requests, it notified those rejected requesters that it was ready for work. But here was the clever part: it notified them in the same order they were rejected.

During callback transfers, rather than addressing cards by slot, they were addressed by callback number. Since the callback numbers were handed out in a simple sequential order, regenerating the sequence of IDs was trivial. Each rejected card knew the callback ID it was given, so when it saw a callback transfer from the card that rejected it using the same callback number, it knew it was being solicited to resend the previously rejected request.

The slot which did the rejecting drove the arbitration and address phases, then the rejected requester drove the data phase.

The patent doesn't mention it, but the real BTI 8000 added the idea of high priority requests. The MCU kept separate logic for high and low priority requests. The real time I/O devices (like the disk controller) used high priority requests and were handed high priority callback IDs separate from low priority callback IDs. Once a card began issuing callbacks, all high priority callbacks were serviced before any low priority callbacks.
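
Here is a minimal sketch of the bookkeeping on the rejecting card, as I understand it. The class and method names are invented. I am assuming that the high and low priority callback numbers came from two independent counters, and that the 4-bit numbers simply wrapped around; neither detail is confirmed by anything I have.

```python
from collections import deque
from typing import Iterator


class CallbackIssuer:
    """Destination-side model: hand out callback numbers, replay them in order."""

    def __init__(self) -> None:
        # Assumption: one sequential 4-bit counter per priority level.
        self.next_id = {"high": 0, "low": 0}
        self.pending = {"high": deque(), "low": deque()}

    def nack(self, priority: str) -> int:
        """Reject a request and return the callback number given to the requester."""
        cb = self.next_id[priority]
        self.next_id[priority] = (cb + 1) % 16  # 4-bit callback number
        self.pending[priority].append(cb)
        return cb

    def callbacks(self) -> Iterator[tuple[str, int]]:
        """Yield callbacks once the card is free: all high priority first,
        then low priority, each in the order the rejections happened."""
        for priority in ("high", "low"):
            while self.pending[priority]:
                yield priority, self.pending[priority].popleft()


mcu = CallbackIssuer()
mcu.nack("low")    # say, a CPU read rejected first
mcu.nack("high")   # then a disk controller request
mcu.nack("low")    # then another CPU
print(list(mcu.callbacks()))
```

Note that the rejecting card never has to remember which slot it turned away. Each rejected requester remembers its own callback number and watches for it, which is what kept the logic cheap.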

Transfer Diagrams

Here are some simple timelines showing the different types of transfers which were just discussed.

   write:
        src->dst, cycle 1: arbitration
        src->dst, cycle 2: card addressing
        src->dst, cycle 3: write command + address
        src->dst, cycle 4: data word

   single word read:
        src->dst, cycle 1: arbitration
        src->dst, cycle 2: card addressing
        src->dst, cycle 3: read command + address
        ... time passes ...
        dst->src, cycle N+1: arbitration
        dst->src, cycle N+2: card addressing
        dst->src, cycle N+3: data word

   double word read:
        src->dst, cycle 1: arbitration
        src->dst, cycle 2: card addressing
        src->dst, cycle 3: read command + address
        ... time passes ...
        dst->src, cycle N+1: arbitration
        dst->src, cycle N+2: card addressing
        dst->src, cycle N+3: data word 0
        dst->src, cycle N+4: data word 1

   NACK'd transfer:
        src->dst, cycle 1: arbitration
        src->dst, cycle 2: card addressing
        src->dst, cycle 3: command + address ; dst->src NACK and callback #

   callback:
        dst->src, cycle N+1: arbitration
        dst->src, cycle N+2: card addressing using callback #
        src->dst, cycle N+3: data word 0
        src->dst, cycle N+4: data word 1 (if needed, wasted otherwise)

Bus Parking

I believe the protocol also allowed for "bus parking." The arbitration phase added a clock of latency to each request, but sometimes it could be skipped safely. A card was allowed to skip the arbitration phase if it was the most recent winner of arbitration and no other card had initiated arbitration in the previous cycle.

If these conditions were met, the parked card could simply start the address phase by driving ADDR, D[3:0], and SA[3:0] as if it had won arbitration in the previous cycle.
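
Stated as a predicate, the parking rule as I remember it looks like the sketch below. It is only a restatement of the two conditions above, with hypothetical names, and carries the same uncertainty as my recollection.

```python
from typing import Optional


def may_skip_arbitration(slot: int,
                         last_winner: Optional[int],
                         arbitration_last_cycle: bool) -> bool:
    """True if `slot` may go straight to the address phase.

    last_winner: slot number of the most recent arbitration winner,
                 or None if nobody has won since power up.
    arbitration_last_cycle: whether any card initiated arbitration
                            in the previous cycle.
    """
    return slot == last_winner and not arbitration_last_cycle
```

For example, `may_skip_arbitration(3, 3, False)` holds, while the same slot has to arbitrate normally if another card requested the bus in the previous cycle.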

Other

Here are some things I don't recall and that need to be verified:

If a read request was made to a board that wasn't there (or had gone AWOL), did the SSU time out the request and somehow ack it? I can't believe that the requester would simply hang. But I also can't see the SSU tracking enough state to know which transfers were dead. Perhaps each requester had to time out itself.
Deadlock -- was this ever a worry? MCUs were slave-only devices, and only MCUs could receive read/modify/write requests, so it wasn't possible for two cards to attempt a read/modify/write to each other at the same time.