
Architecture
 

 


B&C Transit Office Systems can be designed either as simple stand-alone systems or as full-blown, networked, redundant control systems.

Fault-tolerant systems can be implemented by using dual networked servers and unique LANs to all component workstations.

 

 

If one server is shut down due to failure or maintenance, the standby system automatically takes over, and no data is lost during the transition period.

 

Office Networks

Common Network Components

 

 

Field processors (VPI, MICROLOK, GEOLOC, VHLC, PLC, etc.)

 

One or more UNIX or Microsoft NT servers.

 

A database repository (generally ORACLE or SQL SERVER) residing on each server.

 

A B&C application server (ATS) residing on each server.

 

Routers, network switches, and network fiber or Ethernet cabling.

 

One or more computer workstations.

 

B&C Workstation Control and Monitoring Applications.

 

Overview Monitors.

 

 

Fault Tolerant Systems

 

Data Replication

 

If the server databases acted independently, then data would only be stored in the database when data was transmitted over the pathway and through the server. If one component on the pathway were to go down for any period of time, data could not be stored in the Oracle database on that server for the period of time the pathway was disrupted. As a result, the two databases would quickly become unsynchronized any time a pathway became disrupted. For playback of indications, reports, and journal entries, this would only represent a period of time where data was missing. However, for dynamic car tracking, just a simple missed occupancy indication would totally disrupt the ability of the server to accurately track cars.

This might not be noticed until a fail-over occurred from one server to the other. Suddenly, the operator would notice that all the car information was completely wrong after the fail-over, or that journal and report entries were missing. Therefore, the data paths must be set up such that the two databases always remain synchronized, even when one server is down for a period of time. We do this by using the data replication features built into the RDBMS.

 

Ensuring Data Between Servers Stays Synchronized

 

To ensure data is always synchronized between the Oracle databases on the two servers, the servers must be set up for master-to-master data replication. This means that an update on one Oracle database will automatically make the corresponding update internally on the second Oracle database residing on the other server. If one server is down while the other server is running, and database updates are occurring on the running server, the failed server will automatically be fed the data it missed when it restarts. This ensures data is never lost and the two servers are always synchronized, regardless of whether one instance of the database was not functioning over a certain period of time.
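The catch-up behavior described above can be illustrated with a small toy model. This is not Oracle's replication mechanism; the class `ReplicatedDb` and its methods `apply_local` and `restart` are hypothetical names chosen for illustration. Each database queues changes for its peer while the peer is down and hands them over when the peer restarts.

```python
class ReplicatedDb:
    """Toy model of one side of a master-to-master replicated pair."""

    def __init__(self, name):
        self.name = name
        self.rows = []        # committed records
        self.pending = []     # changes queued while the peer is down
        self.running = True
        self.peer = None

    def apply_local(self, record):
        """Commit a record locally and forward it to the peer."""
        self.rows.append(record)
        if self.peer.running:
            self.peer.rows.append(record)
        else:
            self.pending.append(record)   # hold until the peer restarts

    def restart(self):
        """Bring this database back up and pull what it missed."""
        self.running = True
        self.rows.extend(self.peer.pending)
        self.peer.pending.clear()


normal, standby = ReplicatedDb("normal"), ReplicatedDb("standby")
normal.peer, standby.peer = standby, normal

normal.apply_local("occupancy T1")
standby.running = False               # standby server goes down
normal.apply_local("occupancy T2")
normal.apply_local("occupancy T3")
standby.restart()                     # missed updates are replayed
```

After the restart, both toy databases hold the same three records in the same order, which is the property the office system relies on for car tracking after a fail-over.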

 

However, this also means that only one server can update its Oracle database with data. This is called “single point entry”. With Oracle replication active between the two servers, only one of the two servers can be allowed to update the database; otherwise, duplicate records would occur with every transaction. And because the servers are time synchronized, attempts to insert duplicate records would cause exceptions on the database with identical time stamp records, resulting in inefficient operation or, worse, no operation at all.
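A minimal sketch of why duplicate inserts raise exceptions, under stated assumptions: SQLite stands in for Oracle, and the `indication` table with a key on time stamp and address is a hypothetical schema, not the actual YCS schema. If both ATS applications stored the same indication and replication delivered both rows to the same table, the second insert would violate the key.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical schema: one row per indication, keyed on time stamp and address.
db.execute(
    "CREATE TABLE indication ("
    " ts TEXT, address TEXT, state INTEGER,"
    " PRIMARY KEY (ts, address))"
)

def store(ts, address, state):
    """Try to store an indication; report whether the insert succeeded."""
    try:
        db.execute("INSERT INTO indication VALUES (?, ?, ?)", (ts, address, state))
        return True
    except sqlite3.IntegrityError:
        # Same time stamp and address already present: a duplicate record.
        return False

# Both ATS applications see the same indication at the same (synchronized) time.
first = store("2006-04-07 10:15:00", "SW12", 1)    # written by the "normal" ATS
second = store("2006-04-07 10:15:00", "SW12", 1)   # written by the "standby" ATS
row_count = db.execute("SELECT COUNT(*) FROM indication").fetchone()[0]
```

Here the first insert succeeds and the second is rejected, which is the exception path that single point entry is meant to avoid.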

 

Data replication allows the servers to remain synchronized, but it also means that only one data pathway (“normal” or “standby”) can be responsible for storing data and passing that data on to the workstations.

The following sections will describe all the components of a server. We will explore how the system determines which server is “in control” (can update the Oracle database) to prevent duplicate records in the database.

 

Server Components

 

For the purposes of this document, a yard server will represent all components between the fiber optic switches connected to the workstations, and the terminal servers connected to the non-vital field processors. These components consist of the following:

 

Server computers

 

The servers represent the computer hardware. The server consists of multiple hard drives set up in a RAID 5 fault tolerant configuration. Each server runs the operating system, which is the software platform for running all other applications on the server. Servers also can run NTP to ensure all systems on the network are time synchronized.


 

RDBMS (Relational Database) that runs on each file server

The ORACLE application is an advanced relational database software application. This database stores information for the YCS (yard control system), including system configuration, reporting, playback, security settings, and car tracking data.

                       

ATS (Application Terminal Server) that runs on each file server 

The ATS is the centralized “brain” software for the control system. This software application communicates directly with field processors via a communications link from the hardware terminal servers (which are connected via serial cables to the field processors). The ATS centrally processes all information from the field processors, stores changes in the RDBMS database (if it is in control), provides all car tracking responsibilities, handles alarm acknowledgements, and passes control requests from workstations back to the appropriate field processors.

 

JMS (Java Message Service) that runs on each file server

JMS is the transport layer for messages between the ATS and all workstations. JMS is a software application that must be running on the servers to provide communication between the server and the workstations. JMS provides a “publish” / “subscribe” method of communication, thereby eliminating the need to provide fixed IP addresses at the workstation level for individual communication.

 

Each server in a redundant pair is considered a “pathway” between the workstations and the field processors. If any component in the “normal” pathway fails, the entire pathway fails, and the system is automatically switched to the “standby” pathway. Pathways are identified as the “normal” and “standby” pathways. All components in each corresponding pathway are identified as “normal” or “standby” components.

It is important to note that the “normal” and “standby” pathways do not necessarily include the non-vital processors. More will be discussed about the processor communications, but for now, recall that the non-vital processors are not identified as components in server fail-over. The primary reason is that a fail-over from a “normal” to a “standby” processor does not have to change the data pathway within the control system. The two entities are non-exclusive.

Server automatic fail over is controlled exclusively by the “normal” and “standby” ATS applications running on each server.

 

ATS Fail-over Introduction

 

The mechanism for controlling server fail over resides within the ATS (Application Server) software running on each file server. As far as the office system is concerned, if this software is not running, then the server may as well not be running, since without ATS, the control system cannot function through the corresponding pathway.

As mentioned in the previous section, ATS is the “brain” of the office system. Each pathway, “normal” or “standby”, is controlled exclusively by the ATS application on that pathway.

Under normal operating conditions, when all the components within the “normal” pathway are functioning (again, this does not include non-vital processors), then the “normal” ATS is the controlling entity of the office system. When an ATS is in control, only that ATS application is allowed to update the RDBMS database with processor controls and indications, and pass that data between the systems. Remember this concept, as we will bring it up later in greater detail.

In the event a component on the “normal” pathway fails, the “normal” ATS will relinquish control to the “standby” ATS, and the “standby” pathway then becomes the controlling pathway. However, before getting too deep into the fail over scenarios, we’ll discuss each component of the networked office system and provide a more detailed overview of their function.

 

Field Processor Communication HOT Fail Over

 

It is important to note that field processors can communicate with both the “normal” and “standby” servers simultaneously, provided the ATS software is running on both servers.

 

 Non-Vital processor indications from the field to the Networked Office System

 

The illustration above shows a single “normal” and “standby” non-vital processor pair. The ATS applications on both servers will communicate with all processor pairs through the hardware terminal servers.

 

The non-vital processors are the office system interface into the railway. They provide all indications to the office system, such as switch and train locations, breaker states, etc. Those field processors also receive and process control requests from the office system, such as route requests and breaker commands.

When both ATS applications are running, all non-vital processor pairs are running, and all connections are established, each ATS application will continuously interact with each processor. However, the ATS will only “store” and pass on data from one of the processors in the pair. The ATS will make the following determination for which processor will provide data to be used by the office system.

 

 

Referring to the illustration at right, only one set of the duplicate indications received from a normal and standby processor pair is saved to the database and passed on to the workstations.

 

It is also important that only one of the ATS applications (either “normal” or “standby”) saves the data and passes it on; otherwise, duplicate records would be stored in the RDBMS database as described previously. Note that each field processor pair must “tell” the ATS which processor, “normal” or “standby”, is sending the indications that the office system should use. This determination is made based on the state of an indication bit provided by the processor.

 

Only one ATS can be “in control” at any given time. The term “in control” simply means that the ATS is storing data and passing the data on to the workstations.

 

 

The important thing to note is that if the “normal” field processor fails for any reason, the office system will automatically start using the indications from the “standby” field processor. This switch over is nearly instantaneous because the ATS is always communicating with both processors simultaneously.
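The selection between the two processors in a pair can be sketched as follows. The original logic chart is not reproduced in this text, so the rule below is an assumption made for illustration: use a connected processor whose indication bit marks it as the one to use, checking “normal” first. The names `ProcessorLink`, `active_bit`, and `select_source` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessorLink:
    name: str          # "normal" or "standby"
    connected: bool    # ATS currently has a live connection
    active_bit: bool   # hypothetical "use my indications" bit from the processor

def select_source(normal: ProcessorLink, standby: ProcessorLink) -> Optional[str]:
    """Pick the processor whose indications are stored and passed on.

    Assumed rule, not taken from the original logic chart: use a connected
    processor that reports itself active, checking "normal" first.
    """
    for link in (normal, standby):
        if link.connected and link.active_bit:
            return link.name
    return None   # no usable processor in this pair

both_up = select_source(
    ProcessorLink("normal", True, True), ProcessorLink("standby", True, False))
normal_lost = select_source(
    ProcessorLink("normal", False, False), ProcessorLink("standby", True, True))
```

Because the ATS already holds a connection to both processors, switching sources in this sketch is just a different return value on the next evaluation, with no reconnection step.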

 

Control Requests from Office System to Field processors

 

When a networked workstation sends a control request, such as a command to open a breaker or selection of the gate as the entry of a route, the workstation “publishes” the control request to the server “in control”. The ATS application “in control” receives the control request and passes it on to the field processor pair (or, if no redundant pair exists, to the single processor).

The ATS “in control” will pass the workstation request to both the “normal” and “standby” field processors.

 

The logic for determining this is similar to the logic chart on the illustration above.

It is important to note that the “normal” ATS will not automatically fail over (relinquish control) to the “standby” ATS if it loses communication with both the “normal” and “standby” field processors. Automatic switchover from the “normal” ATS to the “standby” ATS occurs only if one of the system components on the server or network fails. Of course, a manual switch to the “standby” ATS is always an option via selection of the standby server icon in the System Configuration screen on any workstation. More will be explained about this later.

 

Data Flow (Normal ATS “in Control”)

 

 

In the illustration at right, the “normal” ATS is “in control”. This means that data transfer between the RDBMS database and the workstations occurs through the “normal” file server.

 

All indications on all stations and addresses are processed by both ATS applications, but they are only routed to and from workstations through the server “in control”. (Recall that this is necessary because, when replication is used, a single-point update to one RDBMS database forces the same update onto both databases.)

 

Data Flow (Standby ATS “in Control”)

 

When the standby server (ATS) assumes control, either by an automatic fail-over or by request from the office system operator, data is routed through the “standby” pathway. The “standby” server (ATS) then takes on the responsibility of handling communications between the relational database and the workstations.

 

Message Services

 

The Java Message Service (JMS) is the transport layer for communications between all office system software applications. JMS acts as the communications broker to pass information between the two ATS applications (“normal” and “standby”) and all the workstations.

 

JMS operates using publish/subscribe technology. When an office application (such as a workstation) needs to send information to another office system application (such as the ATS), the application “publishes” the message across the entire network. Any application that is “listening” (subscribed to the JMS service) can receive and process the message. The beauty of this approach is that applications do not have to target specific applications using IP addresses. They just send out a message, and every application that is “listening” for that type of message can intercept and receive it. This is most useful when ATS publishes the state of processor indications to workstations. Instead of sending 7 or 8 separate messages to each individual workstation, it simply publishes a single message, and all workstations on the YCS network receive it simultaneously.
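The publish/subscribe pattern described above can be sketched with a minimal in-memory broker. This is not the JMS API; the `Broker` class, the topic name `indications`, and the workstation names are hypothetical stand-ins used only to show that one published message reaches every subscriber without any addressing.

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory publish/subscribe broker (a stand-in for JMS)."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver one message to every subscriber of the topic."""
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


broker = Broker()
received = {}   # workstation name -> messages seen

for name in ("ws1", "ws2", "ws3"):
    received[name] = []
    # Each workstation listens for indication messages; no IP address needed.
    broker.subscribe("indications", received[name].append)

# The ATS publishes once; every subscribed workstation receives the message.
delivered = broker.publish("indications", {"address": "SW12", "state": "reverse"})
```

The publisher never names a recipient, so adding another workstation only requires that workstation to subscribe.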

A separate JMS service will run on each of the “normal” and “standby” file servers.

 

Office system applications will not use both JMS services simultaneously, as this would create duplicate messages across the network. The JMS service that all workstations and ATS applications will use at any given time is determined by the ATS application “in control”.

 

If the “normal” ATS application on the “normal” file server is “in control”, then the “standby” ATS and the workstations will all use the JMS service running on the “normal” file server.

 

If the “standby” ATS running on the “standby” file server is “in control”, then the JMS service running on the “standby” file server will be used by both ATS applications and the workstations.

 

JMS is one of the critical components of a file server. If JMS fails on the “normal” file server, the ATS running on the “normal” file server will relinquish control to the “standby” ATS. From that point, all workstations and ATS applications will use the “standby” JMS service.

 

Office System Network “Alive” Messages

 

To accomplish automated fail-over from the “normal” server to the “standby” server, all software components must always know what components are functioning at any given instant. As mentioned before, the ATS applications are the “brain” of the office network system. The ATS applications must know at all times what components are operating, and what components are not operating. Decisions are then made by the ATS applications, based upon the unavailability of system components, as to which ATS application will be in control of the networked office system.

 

Both ATS applications will “publish” messages onto the office system network every few seconds to let all software components know they are alive and functioning. This “alive” message will also contain information to let other applications know whether the sender is controlling YCS.

 

Workstations also publish “alive” messages onto the network. Because of this, all workstations know which ATS applications are running and who is in control, and both ATS systems know which workstations on the network are running.

If an office system application does not receive an “alive” message from another system component within the specified (configurable) timeout period, the application will know the system component is no longer running.
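A minimal sketch of the “alive” message bookkeeping, assuming each application simply records the arrival time of the last message from every other component. The `AliveMonitor` class, the component names, and the 5-second timeout are hypothetical; times are passed in explicitly so the example is deterministic.

```python
class AliveMonitor:
    """Tracks the last "alive" message seen from each component."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s     # configurable timeout period
        self.last_seen = {}            # component name -> (time, in_control)

    def on_alive(self, component, now, in_control=False):
        self.last_seen[component] = (now, in_control)

    def is_running(self, component, now):
        entry = self.last_seen.get(component)
        return entry is not None and (now - entry[0]) <= self.timeout_s

    def controller(self, now):
        """Return the running component that last reported itself in control."""
        for component, (_, in_control) in self.last_seen.items():
            if in_control and self.is_running(component, now):
                return component
        return None


monitor = AliveMonitor(timeout_s=5.0)
monitor.on_alive("ATS-normal", now=0.0, in_control=True)
monitor.on_alive("ATS-standby", now=0.0)
monitor.on_alive("ATS-standby", now=4.0)   # normal ATS goes silent after t=0
```

In this sketch, a query at `now=3.0` still reports the “normal” ATS as controller, while a query at `now=8.0` reports no controller because the last “alive” message from the “normal” ATS is older than the timeout.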

 

ATS Fail-over Detail

Only one of the two ATS applications, one on each server, can be in control of the office system at any given instant. The fail-over from “normal” to “standby” is automatic. However, to place “normal” back in control when “standby” is in control, manual intervention is required (unless, of course, the “standby” ATS also detects a component failure as described below when the “normal” ATS is alive and well).

 

An automatic fail-over from “normal” to “standby”, or from “standby” to “normal”, requires one of the four following component failure scenarios.

 

The File Server stops functioning as a result of shutdown, power loss, or defective part. In this case, the ATS application “in control” on that server will stop functioning as well. The other ATS application will detect the loss of “alive” messages from the ATS previously in control, and will automatically take control after the timeout period expires. The timeout period is several seconds, but is configurable.

 

The RDBMS database instance on the server fails or is shut down. If the ATS “in control” cannot update the relational database on its server, the ATS will automatically relinquish control to the other ATS. There is no timeout period. The fail-over is immediate.

 

The JMS Service on the “normal” server fails or is shut down. If the JMS is not functioning, the ATS “in control” can no longer communicate with other software applications on the YCS. The other ATS will detect the loss of “alive” messages from the ATS previously in control, and will automatically take control after the timeout period expires. The timeout period is several seconds, but is configurable.

 

The ATS fails or is shut down. The ATS not “in control” will detect the loss of “alive” messages from the ATS that was “in control”, and will automatically take control after the timeout period expires. The timeout period is several seconds, but is configurable.
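The four scenarios above can be condensed into a small decision function evaluated by each ATS. This is a sketch under stated assumptions, not the actual ATS logic: the names `Action` and `decide` are hypothetical, an ATS that relinquishes control is assumed to signal this through the in-control flag in its “alive” messages, and the ATS taking over is assumed to require a reachable database of its own.

```python
from enum import Enum

class Action(Enum):
    KEEP = "keep current role"
    RELINQUISH = "relinquish control immediately"
    TAKE_CONTROL = "take control"

def decide(in_control, db_ok, peer_in_control, peer_alive_age_s, timeout_s):
    """Decide what one ATS should do on each evaluation cycle.

    in_control: this ATS currently controls the office system
    db_ok: this ATS can update the database on its own server
    peer_in_control: the other ATS last reported itself in control
    peer_alive_age_s: age of the last "alive" message from the other ATS
    """
    if in_control:
        # Database failure: hand over at once, with no timeout period.
        return Action.KEEP if db_ok else Action.RELINQUISH
    # From here, a server, JMS, or ATS failure on the other pathway all look
    # the same: the "alive" messages from the other ATS stop arriving.
    peer_silent = peer_alive_age_s > timeout_s
    if db_ok and (peer_silent or not peer_in_control):
        return Action.TAKE_CONTROL
    return Action.KEEP
```

For example, `decide(False, True, True, 6.0, 5.0)` models the ATS not in control seeing the other ATS silent for longer than a 5-second timeout, and yields `Action.TAKE_CONTROL`.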

 

The System Configuration screen on the workstations will show the status of servers and which is in control of the office system network. A green server icon indicates the server (ATS) is running and in control. A blue server icon indicates the server (ATS) is running but is not in control. And finally, a red server icon indicates the server (ATS) is not running. Of course, colors used are completely up to the discretion of the client.

 

The operator, with the proper login security clearance, can manually switch from “normal” to “standby” servers by left clicking on the server icons. Reasons for performing a manual switchover from one server to the other might include the following.

 

The “normal” server (ATS) is in control and the standby is running. Scheduled maintenance is to be performed on the “normal” server, and one or more components on this server must be shut down to perform the maintenance. Rather than simply shut down the “normal” server or its components and wait for the timeout to expire (at which point the “standby” server (ATS) would take control), the operator chooses the faster approach of simply giving control to the “standby” server (ATS).

 

The “normal” server failed or shut down while in control, and the “standby” server is now in control. The “normal” server has been repaired and is once again running. An automatic fail-over would require the “standby” server to go down (or one of its components to go down) before automatically switching control back to the “normal” server. Instead, the operator elects to leave both servers running and just give control back to the “normal” server.

 

Workstation Connections

Workstations need to connect to both the JMS (Java Message Service) for publishing and subscribing to network messages, and to the RDBMS databases for querying reports, saving Journal entries, or inquiring about car locations within the railway.

Since there are two servers, each with its own JMS and relational database, the workstations need to know which of the servers they should connect to in order to join the office system network properly.

As described previously, the ATS applications always decide which ATS (server) is in control of YCS. Earlier we previewed the method by stating that the ATS publishes messages telling workstations whether it is in control of YCS. The workstations use this information to switch to the proper server as required.

 

When a workstation connects to the office system network, the first action it will perform is to “subscribe” itself to the JMS messaging service located on the “normal” file server. It will then “publish” a special message to that JMS service to make an inquiry to the “normal” ATS, the sole purpose being to see whether that ATS application (on the “normal” server) is running and “in control”.

If the “normal” ATS responds and says that it is in control of the office system, the workstation will then make a logical connection to the relational database on the “normal” server. It will retain that database connection until either the workstation is shut down, the “normal” ATS fails or shuts down, or the ATS loses control of the office system (because of a system component failure described previously).

If the “normal” ATS does not respond, or says that it is not in control of the office system, the workstation will unsubscribe from the JMS service on the “normal” server and “subscribe” to the JMS service on the “standby” server. It will then poll the ATS application on the “standby” server to see if it is in control. If so, it will connect to the Oracle instance on the “standby” server and retain that connection until notified to connect differently.
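The connection sequence above can be sketched as a short function. The names `choose_server` and `query_ats` are hypothetical: `query_ats` stands for the whole step of subscribing to a server's JMS service and asking its ATS whether it is in control, and the replies are simulated with a dictionary rather than real network calls.

```python
def choose_server(query_ats, order=("normal", "standby")):
    """Return the server a workstation should connect to, or None.

    query_ats(server) is a hypothetical callable that subscribes to the JMS
    service on that server, asks its ATS whether it is in control, and
    returns True, False, or None (no response).
    """
    for server in order:
        if query_ats(server) is True:
            return server     # use this server's JMS and database
    return None               # neither ATS reports being in control


# Simulated replies: the "normal" ATS does not respond, "standby" is in control.
replies = {"normal": None, "standby": True}
target = choose_server(replies.get)
```

With the simulated replies shown, the workstation in this sketch settles on the “standby” server; a real workstation would then open its database connection there and keep monitoring “alive” messages.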

Since workstations are continually monitoring “alive” messages from the ATS “in control”, a workstation will always know when the ATS has lost control.

If the ATS loses control for any reason during operation, the workstation will reconnect to the server that is currently in control. Refer to the following illustration to visualize this process.

 

Fully redundant fiber network yard control system

 
Last modified: 04/07/06