B&C Transit Office Systems can be designed either as simple stand-alone
systems or as full-blown, networked, redundant control systems.
Fault tolerant systems can be implemented by using dual networked servers
and unique LANs to all component workstations. If one server is shut down
due to failure or maintenance, the standby system automatically takes over,
and no data is lost during the transition period.
Office Networks
Common Network Components
Field processors (VPI, MICROLOK, GEOLOC, VHLC, PLC, etc.).
One or more UNIX or Microsoft NT servers.
A database repository (generally ORACLE or SQL SERVER) residing on each server.
A B&C application server application residing on each server.
Routers, network switches, and network fiber or Ethernet cabling.
One or more computer workstations.
B&C Workstation Control and Monitoring Applications.
Overview Monitors.
Fault Tolerant Systems
Data Replication
If the server
databases acted independently, then data would only be stored in the
database when data was transmitted over the pathway and through the
server. If one component on the pathway were to go down for any
period of time, data could not be stored in the Oracle database on
that server for the period of time the pathway was disrupted. As a
result, the two databases would quickly become unsynchronized any
time a pathway became disrupted. For playback of indications,
reports, and journal entries, this would only represent a period of
time where data was missing. However, for dynamic car tracking, just
a simple missed occupancy indication would totally disrupt the
ability of the server to accurately track cars.
This might not be noticed until a fail-over occurred from one server
to the other. Suddenly, the operator would notice only that all the
car information was completely wrong after the fail-over, or that
journal and report entries were missing. Therefore, the data paths
must be set such that the two databases always remain synchronized
even when one server is down for a period of time. We do this by
using the built-in data replication features of the RDBMS.
To ensure data is always
synchronized between the Oracle databases on the two servers, the servers must
be set up for master-to-master data replication. This means that an update on
one Oracle database automatically makes the corresponding update internally
on the second Oracle database residing on the other server. If one server is
down while the other server is running, and database updates are occurring on
the running server, the failed server will automatically be fed the data it
missed when it restarts. This ensures data is never lost and the two servers are
always synchronized, regardless of whether one instance of the database was not
functioning over a certain period of time.
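The catch-up behaviour described above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the `Replica` class, its `deferred` queue, and `make_pair` are hypothetical names for illustration and do not represent Oracle's replication mechanism or its API.

```python
class Replica:
    """Toy stand-in for the database on one server (hypothetical, not Oracle)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}        # key -> value, the "database" contents
        self.online = True
        self.deferred = []    # changes waiting for the peer to come back
        self.peer = None

    def update(self, key, value):
        """Apply a local update and replicate it to the peer if possible."""
        self.rows[key] = value
        if self.peer.online:
            self.peer.rows[key] = value
        else:
            self.deferred.append((key, value))   # hold until peer restarts

    def restart(self):
        """Come back online and replay every change missed while down."""
        self.online = True
        for key, value in self.peer.deferred:
            self.rows[key] = value
        self.peer.deferred.clear()


def make_pair():
    """Build two replicas that point at each other."""
    normal, standby = Replica("normal"), Replica("standby")
    normal.peer, standby.peer = standby, normal
    return normal, standby
```

In this model, updates made on the running replica while its peer is offline are queued and replayed when the peer calls `restart()`, after which both hold the same rows.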
However, this also means that
only one server can update its Oracle database with data. This is called “single
point entry”. With Oracle replication active between the two servers, only one
of the two servers can be allowed to update the database; otherwise duplicate
records would occur with every transaction. And because the servers are time
synchronized, attempts to insert duplicate records with identical time stamps
would cause exceptions on the database, resulting in inefficient operation
or, worse, no operation at all.
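To illustrate why a second writer causes database exceptions, the following sketch uses Python's built-in `sqlite3` module as a stand-in for the Oracle database. The `indications` table, its columns, and the key on time stamp plus point name are assumptions made for the example, not the actual YCS schema.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory table keyed on (time stamp, point)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE indications ("
        " stamp TEXT, point TEXT, state INTEGER,"
        " PRIMARY KEY (stamp, point))"
    )
    return conn


def store_indication(conn, stamp, point, state):
    """Insert one indication; return False if an identical key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO indications (stamp, point, state) VALUES (?, ?, ?)",
                (stamp, point, state),
            )
        return True
    except sqlite3.IntegrityError:
        # A second writer with the same time stamp lands here.
        return False
```

If both servers stored the same indication with the same time stamp, the second insert would be rejected in this way on every transaction, which is the overhead that single point entry avoids.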
Data replication allows the
servers to remain synchronized, but also means that only one data pathway,
“normal” or “standby”, can be responsible for storing data and passing that data
on to the workstations.
The following sections will
describe all the components of a server. We will explore how the system
determines which server is “in control” (can update the Oracle database) to
prevent duplicate records in the database.
Server Components
For the purposes of this
document, a yard server will represent all components between the fiber optic
switches connected to the workstations, and the terminal servers connected to
the non-vital field processors. These components consist of the following:
Server computers
The servers
represent the computer hardware. Each server contains multiple hard drives set
up in a RAID 5 fault tolerant configuration. Each server runs the operating
system, which is the software platform for running all other applications on the
server. Servers can also run NTP to ensure all systems on the network are time
synchronized.
RDBMS (Relational Database) that runs on each file server
The ORACLE
application is an advanced relational database software application. This
database is used to store information for the YCS (yard control system) for
system configuration, reporting, playback, security settings, car tracking, etc…
ATS (Application Terminal Server) that runs on each file server
The ATS is
the centralized “brain” software for the control system. This software
application communicates directly with field processors via a communications
link from the hardware terminal servers (that are connected via serial cables to
the field processors). The ATS centrally processes all information from the
field processors, stores changes in the RDBMS database (if it is in control),
performs all car tracking, handles alarm acknowledgments, and
passes control requests from workstations back to the appropriate field
processors.
JMS (Java Message Service) that runs on each file server
JMS is the
transport layer for messages between the ATS and all workstations. JMS is a
software application that must be running on the servers to provide
communication between the server and the workstations. JMS provides a “publish”
/ “subscribe” method of communication, thereby eliminating the need to provide
fixed IP addresses at the workstation level for individual communication.
Each server in the redundant pair is considered a
“pathway” between the workstations and the field processors. The two pathways
are identified as the “normal” and “standby” pathways, and all components in
each pathway are identified as “normal” or “standby” components accordingly.
If any component in the “normal” pathway fails, the entire pathway fails, and
the system is automatically switched to the “standby” pathway.
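The rule that one failed component fails the whole pathway can be written down compactly. The sketch below is illustrative only: the `Pathway` class is hypothetical, and the four component flags are taken from the server components listed earlier (a real pathway also includes network hardware).

```python
from dataclasses import dataclass


@dataclass
class Pathway:
    """One data pathway; field names are assumptions for illustration."""
    name: str                 # "normal" or "standby"
    server_up: bool = True    # file server hardware and operating system
    rdbms_up: bool = True     # database instance on that server
    ats_up: bool = True       # ATS application on that server
    jms_up: bool = True       # JMS service on that server

    def healthy(self):
        """The pathway is usable only if every component is up."""
        return all((self.server_up, self.rdbms_up, self.ats_up, self.jms_up))
```

Non-vital field processors are deliberately absent from the model, matching the point that they are not components in server fail over.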
It is important to note that the
“normal” and “standby” pathways do not necessarily include the non-vital
processors. More will be discussed about the processor communications, but for
now, note that the non-vital processors are not identified as components in
server fail over. The primary reason is that a fail-over from a “normal”
to a “standby” processor does not have to change the data pathway within the
control system. The two fail-overs are independent of each other.
Server automatic fail over is
controlled exclusively by the “normal” and “standby” ATS applications running on
each server.
ATS Fail-over Introduction
The mechanism for controlling
server fail over resides within the ATS (Application Server) software running on
each file server. As far as the office system is concerned, if this software is
not running, then the server may as well not be running, since without ATS, the
control system cannot function through the corresponding pathway.
As mentioned in the previous
section, ATS is the “brain” of the office system. Each pathway, “normal” or
“standby”, is controlled exclusively by the ATS application on that pathway.
Under normal operating
conditions, when all the components within the “normal” pathway are functioning
(again, this does not include non-vital processors), then the “normal” ATS is
the controlling entity of the office system. When an ATS is in control, only
that ATS application is allowed to update the RDBMS database with processor
controls and indications, and pass that data between the systems. Remember this
concept, as we will bring it up later in greater detail.
In the event a component on the
“normal” pathway fails, the “normal” ATS will relinquish control to the
“standby” ATS, and the “standby” pathway then becomes the controlling pathway.
However, before getting too deep into the fail over scenarios, we’ll discuss
each component of the networked office system and provide a more detailed
overview of their function.
Field Processor Communication HOT Fail Over
Non-Vital processor indications from the field to the Networked Office System
The illustration above shows a
single “normal” and “standby” non-vital processor pair. The ATS applications on
both servers will communicate with all processor pairs through the hardware
terminal servers.
The non-vital processors are the
office system interface into the railway. They provide all indications to the
office system, such as switch and train locations, breaker states, etc. In turn,
those field processors receive and process control requests from the office
system, such as route requests, breaker states, etc.
When both ATS applications are
running, all non-vital processor pairs are running, and all connections are
established, each ATS application will continuously interact with each
processor. However, the ATS will only “store” and pass on data from one of the
processors in the pair. The ATS will make the following determination for which
processor will provide data to be used by the office system.
Referring to the illustration
at right, only one set of the duplicate indications received from a normal and
standby processor pair is saved to the database and passed on to the
workstations.
It is also important that only one of the ATS applications
(either “normal” or “standby”) save the data and pass it on; otherwise,
duplicate records would be stored in the RDBMS database as described previously.
Note that the field processors must “tell” the ATS which processor, “normal” or
“standby”, is sending the indications that the office system should use. This
determination is made based on the state of an indication bit provided by the
processor.
Only one ATS can be “in control”
at any given time. The term “in control” simply means that the ATS is storing
data and passing the data on to the workstations.
The important thing to note is
that if the “normal” field processor fails for any reason, the office system
will automatically start using the indications from the “standby” field
processor. This switch over is nearly instantaneous because the ATS is always
communicating with both processors simultaneously.
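The selection between the two processors of a pair might be sketched as follows. The actual decision chart lives in the illustration and is not reproduced here, so this is an assumption-laden approximation: `ProcessorLink`, the `use_me_bit` field, and the rule "first connected processor that asserts its bit" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessorLink:
    """The ATS's view of one field processor (names are illustrative)."""
    name: str          # "normal" or "standby"
    connected: bool    # is the ATS currently receiving from this processor?
    use_me_bit: bool   # assumed indication bit: "use my indications"


def select_source(normal: ProcessorLink, standby: ProcessorLink) -> Optional[str]:
    """Return the processor whose indications should be stored, if any."""
    for link in (normal, standby):
        if link.connected and link.use_me_bit:
            return link.name
    return None  # neither processor is usable right now
```

Because the ATS keeps both links open at all times, switching the source is just a matter of re-evaluating this function on the next indication, which is consistent with the near-instantaneous switch over described above.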
When a networked workstation
sends a control request, such as a command to open a breaker or selection of the
gate as the entry of a route, the workstation “publishes” the control request to
the server “in control”. As a result, the ATS application “in control” receives
the control request and passes it on to the field processor
pair (if a redundant pair exists; otherwise it sends the control to the single
processor).
The ATS “in control” will pass
the workstation request to both the “normal” and “standby” field processors.
The
logic for determining this is similar to the logic chart on the illustration
above.
It is important to note that the
“normal” ATS will not automatically fail over (relinquish control) to the
“standby” ATS if it loses communication with both the “normal” and “standby”
field processors. Automatic switchover from the “normal” ATS to the “standby”
ATS occurs only if one of the system components on the server or network fails.
Of course, a manual switch to the “standby” ATS is always an option via selection
of the standby server icon in the System Configuration screen on any
workstation. More will be explained about this later.
In the illustration
at right, the
“normal” ATS is “in control”. This means that data transfer between the RDBMS
database and the workstations occurs through the “normal” file server.
All
indications on all stations and addresses are processed by both ATS
applications, but they are only routed to and from workstations through the
server “in control”. (Recall this is necessary because, when replication is
used, a single update on one RDBMS database applies the same update to both
databases.)
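The split between "both ATS applications process" and "only the ATS in control stores" can be shown with a minimal model. The `Ats` class below is a hypothetical illustration; its `stored` list merely stands in for writes to the RDBMS.

```python
class Ats:
    """Toy ATS: always tracks state, but stores only when in control."""

    def __init__(self, name, in_control=False):
        self.name = name
        self.in_control = in_control
        self.tracked = {}   # latest state per point, kept by both ATS instances
        self.stored = []    # stand-in for rows written to the database

    def on_indication(self, point, state):
        self.tracked[point] = state              # always processed
        if self.in_control:
            self.stored.append((point, state))   # single point entry
```

Keeping `tracked` current on the ATS that is not in control is what would allow it to take over without first rebuilding its picture of the railway.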
When the
standby server (ATS) assumes control, either by an automatic fail-over or by
request from the office system operator, data is routed through the “standby”
pathway. The “standby” server (ATS) then takes on the responsibility of handling
communications between the relational database and the workstations.
The Java Message Service (JMS) is
the transport layer for communications between all office system software
applications. JMS acts as the communications broker to pass information between
the two ATS applications (“normal” and “standby”) and all the workstations.
JMS operates using
publish/subscribe technology. When an office application (such as a workstation)
needs to send information to another office system application (such as the ATS),
the application “publishes” the message across the entire network. Any
application that is “listening” (subscribed to the JMS service) can receive and
process the message. The beauty of this approach is that applications do not
have to target specific applications using IP addresses. They just send out a
message, and every application that is “listening” for that type of message can
intercept and receive it. This is most useful when ATS publishes the state of
processor indications to workstations. Instead of sending 7 or 8 separate
messages to each individual workstation, it simply publishes a single message,
and all workstations on the YCS network receive it simultaneously.
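The publish/subscribe idea can be demonstrated with a few lines of in-process code. This sketch is not the JMS API; the `Broker` class and the topic name used with it are hypothetical, and a real deployment would use a JMS provider across the network.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for a publish/subscribe message service."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register a listener; no address of the publisher is needed."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver one message to every listener; return how many received it."""
        listeners = list(self._subscribers[topic])
        for callback in listeners:
            callback(message)
        return len(listeners)
```

A publisher calls `publish` once, and every workstation subscribed to that topic receives the same message, regardless of how many workstations there are.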
A separate JMS application
service will run on both the “normal” and “standby” file servers.
Office system
applications will not use both JMS services simultaneously, as this would create
duplicate messages across the network. The JMS service that all workstations and
ATS applications will use at any given time is determined by the ATS application
“in control”.
If the “normal” ATS application
on the “normal” file server is “in control”, then the “standby” ATS and the
workstations will all use the JMS service running on the “normal” file server.
If the “standby” ATS running on
the “standby” file server is “in control”, then the JMS service running on the
“standby” file server will be used by both ATS applications and the
workstations.
JMS is one of the critical
components of a file server. If JMS fails on the “normal” file server, the ATS
running on the “normal” file server will relinquish control to the “standby” ATS.
From that point, all workstations and ATS applications will use the “standby”
JMS service.
Office System Network “Alive” Messages
To accomplish automated fail over
from the “normal” server to the “standby” server, all software components must
always know which components are functioning at any given instant. As mentioned
before, the ATS applications are the “brain” of the office network system. The
ATS applications must know at all times which components are operating and which
are not. Based upon the unavailability of system components, the ATS
applications then decide which ATS application
will be in control of the networked office system.
Both ATS applications will
“publish” messages onto the office system network every few seconds to let all
software components know they are alive and functioning. This “alive” message will
also contain information to let other applications know whether the sender is
controlling YCS.
Workstations also publish “alive”
messages onto the network. Because of this, all workstations know which ATS
applications are running and who is in control, and both ATS systems know which
workstations on the network are running.
If an office system application
does not receive an “alive” message from another system component within the
specified (configurable) timeout period, the application will know the system
component is no longer running.
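A listener for “alive” messages could be sketched as below. The class name, the sender labels, and the 5-second default are assumptions for illustration; the source only states that the timeout is configurable. Time is passed in explicitly so the logic stays deterministic.

```python
class AliveMonitor:
    """Tracks the most recent "alive" message from each component."""

    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s   # assumed default; configurable in practice
        self.last_seen = {}          # sender -> (timestamp, in_control flag)

    def on_alive(self, sender, in_control, now):
        """Record an "alive" message received at time `now` (seconds)."""
        self.last_seen[sender] = (now, in_control)

    def is_running(self, sender, now):
        """A sender counts as running if it was heard within the timeout."""
        entry = self.last_seen.get(sender)
        return entry is not None and (now - entry[0]) <= self.timeout_s

    def controller(self, now):
        """Name of the running sender that claims control, or None."""
        for sender, (_, in_control) in self.last_seen.items():
            if in_control and self.is_running(sender, now):
                return sender
        return None
```

Each ATS and each workstation would keep its own monitor, which is how every application can know who is running and who is in control without polling.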
Only one of the two ATS
applications, one on each server, can be in control of the office system at any
given instant. The fail-over from “normal” to “standby” is automatic. However,
to place “normal” back in control when “standby” is in control, manual
intervention is required (unless, of course, the “standby” ATS also detects a
component failure as described below when the “normal” ATS is alive and well).
An automatic fail-over from
“normal” to “standby”, or from “standby” to “normal”, requires one of the four
following component failure scenarios.
The File Server stops
functioning as a result of shutdown, power loss, or a defective part. In
this case, the ATS application “in control” on that server will stop
functioning as well. The other ATS application will detect the loss of
“alive” messages from the ATS previously in control, and will automatically
take control after the timeout period expires. The timeout period is several
seconds, but is configurable.
The RDBMS database
instance on the server fails or is shut down. If the ATS “in control”
cannot update the relational database on its server, the ATS will
automatically relinquish control to the other ATS. There is no timeout
period. The fail-over is immediate.
The JMS Service on the
“normal” server fails or is shut down. If the JMS is not functioning, the
ATS “in control” can no longer communicate with other software applications
on the YCS. The other ATS will detect the loss of “alive” messages from the
ATS previously in control, and will automatically take control after the
timeout period expires. The timeout period is several seconds, but is
configurable.
The ATS fails or is
shut down. The ATS not “in control” will detect the loss of “alive”
messages from the ATS that was “in control”, and will automatically take
control after the timeout period expires. The timeout period is several
seconds, but is configurable.
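The four scenarios differ mainly in how the hand-over is triggered, which the following sketch summarizes. The enum, the `DETECTION` table, and `failover_delay` are hypothetical names; the delays are nominal values read from the descriptions above, not measured figures.

```python
from enum import Enum


class Failure(Enum):
    FILE_SERVER = "file server"
    RDBMS = "rdbms"
    JMS = "jms"
    ATS = "ats"


# How the hand-over is triggered in each scenario, per the text above.
DETECTION = {
    Failure.FILE_SERVER: "peer_timeout",      # other ATS misses "alive" messages
    Failure.RDBMS: "immediate_relinquish",    # controlling ATS gives up at once
    Failure.JMS: "peer_timeout",
    Failure.ATS: "peer_timeout",
}


def failover_delay(failure, timeout_s):
    """Nominal seconds before the other ATS takes control (sketch only)."""
    if DETECTION[failure] == "immediate_relinquish":
        return 0.0
    return timeout_s
```

Only the database failure is handled by the controlling ATS itself; in the other three cases the controlling ATS cannot announce anything, so the other ATS has to infer the failure from silence.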
The System Configuration screen
on the workstations will show the status of servers and which is in control of
the office system network. A green server icon indicates the server (ATS) is
running and in control. A blue server icon indicates the server (ATS) is running
but is not in control. And finally, a red server icon indicates the server (ATS)
is not running. Of course, colors used are completely up to the discretion of
the client.
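As a small illustration, the default icon scheme reduces to a two-input mapping. The function name is hypothetical, and the colors are only the defaults described above, since the client may choose others.

```python
def server_icon_color(running, in_control):
    """Map server (ATS) status to the default icon color from the text."""
    if not running:
        return "red"                      # server (ATS) is not running
    return "green" if in_control else "blue"
```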
The operator, with the proper
login security clearance, can manually switch from “normal” to “standby” servers
by left clicking on the server icons. Reasons for performing a manual switchover
from one server to the other might include the following.
The “normal” server (ATS) is
in control and the standby is running. Scheduled maintenance is to be
performed on the “normal” server, and one or more components on this server
must be shut down to perform the maintenance. Rather than simply shutting down
the “normal” server or its components and waiting for the timeout to expire (when
the “standby” server (ATS) would take control), the operator chooses the
faster approach of simply giving control to the “standby” server (ATS).
The “normal” server failed
or was shut down while in control, and the “standby” server is now in control.
The “normal” server has been repaired and is once again running. An
automatic fail over would require the “standby” server to go down (or one of
its components to go down) before automatically switching control back to
the “normal” server. Instead, the operator elects to leave both servers
running and just give control back to the “normal” server.
Workstation Connections
Workstations need to connect to
both the JMS (Java Message Service) for publishing and subscribing to network
messages, and to the RDBMS databases for querying reports, saving Journal
entries, or inquiring about car locations within the railway.
Since there are two servers,
each with its own JMS and relational database, the workstations need to know
which one of the servers they should connect to -- in order to connect to the
office system network properly.
As you may recall, the ATS
applications always decide which ATS (server) is in control of YCS. Earlier
we previewed the method by stating that the ATS publishes messages
to workstations telling them whether the ATS is in control of YCS. The
workstations use this information to switch to the proper server as
required.
When a workstation connects to
the office system network, the first action it will perform is to “subscribe”
itself to the JMS messaging service located on the “normal” file server. It will
then “publish” a special message to that JMS service to make an inquiry to the
“normal” ATS, the sole purpose of which is to see if that ATS application (on
the “normal” server) is running and “in control”.
If the “normal” ATS responds, and
says that it is in control of the office system, the workstation will then make
a logical connection to the relational database on the “normal” server. It will
retain that database connection until either the workstation is shut down, the
“normal” ATS fails or shuts down, or the ATS loses control of the office system
(because of a system component failure described previously).
If the “normal” ATS does not
respond, or says that it is not in control of the office system, the
workstation will unsubscribe from the JMS service on the “normal” server and
“subscribe” to the JMS service on the “standby” server. It will then poll the ATS
application on the “standby” server to see if it is in control. If so, it will
connect to the Oracle instance on the “standby” server and retain that
connection until notified to connect differently.
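The start-up sequence described above amounts to trying the “normal” server first and the “standby” server second. The sketch below assumes a hypothetical `poll_ats` callable that hides the JMS subscribe and inquiry steps; what a workstation does when neither ATS claims control is not stated in the source, so returning `None` here is an assumption.

```python
def choose_server(poll_ats):
    """Pick the server a workstation should connect to.

    poll_ats(server) is a hypothetical callable that returns True if the
    ATS on that server replies "in control", False if it replies "not in
    control", and None if it does not reply at all.
    """
    for server in ("normal", "standby"):
        # Subscribe to JMS on `server`, then ask its ATS about control.
        if poll_ats(server) is True:
            return server   # connect to the database instance on this server
        # Otherwise unsubscribe and try the next server.
    return None             # assumed outcome when no ATS claims control
```

The same function could be re-run whenever the workstation stops hearing “alive” messages from the ATS it is currently using.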
Since workstations are
continually monitoring “alive” messages from the ATS “in control”, a workstation
will always know when the ATS has lost control.
If the ATS loses control for any
reason during operation, the workstation will reconnect to the server that is
currently in control. Refer to the following illustration to visualize this
process.
Fully redundant fiber network yard control system