Home > A Simplified Approach to Fault Tolerant State Machine
A Simplified Approach to Fault Tolerant State Machine
Design for Single Event Upsets
Melanie Berg
Principle Design
Engineer, Ball Aerospace & Technologies Corp.
As Integrated Circuit (IC)
geometries become smaller and core voltages scale down, the probability
of incurring system faults increases significantly. Errors occur when
charged particles penetrate a memory cell and cross a junction, creating
an aberrant charge that changes the state of the bit. Due to the complexity
and sensitivity of large designs, fault tolerant schemes are evolving
at the gate level.
This paper will address Single
Event Upsets (SEUs) within edge-triggered D-Flip-Flops (DFFs) and assumes
that the upsets are soft (correctable by the following clock edge …
if the DFF is enabled). Solely considering single events is a fair assessment
due to the low probability of having multiple errors occur within one
clock cycle. Due to the radiation effects in space, the Aerospace industry
has always had to design with SEU consideration. As far as gate-level
DFF protection is concerned, Triple Mode Redundant (TMR – voting)
logic is the most commonly used scheme to combat SEUs. However, TMR
can be very area extensive and - in a turbulent environment – may
not fully erase the probability of upsets. As a solution, many error-coding
techniques have been proposed as a compliment (or replacement) to TMR,
however due to their complexity, they are rarely implemented.
A simplified approach to fault tolerant state machine design starting from architectural development through synthesis will also be discussed. Examples of coding schemes that include additional logic for error detection and (in some cases) correction such as One Hot, Sequential, and Hamming will be examined. Due to the fact that users have run into roadblocks with synthesis tools “optimizing” away necessary logic for error handling, special attention will be given to the synthesis tools concerning the necessary techniques involved in producing the correct realization of functionality.
The definition of Fault Tolerance is the ability to mask or recover from erroneous conditions in a system once an error has been detected.
The degree of fault tolerance implementation is defined by your system level requirements… i.e. specifications that clearly state acceptable behavior upon error. The following are some questions that must be answered within the system requirements documentation:
Figure
1: TMR Implementation Diagram
Triple Mode Redundancy (TMR)
is the most commonly implemented solution of SEU tolerance because it
is a very simple solution. One needs to be cautious when implementing
mitigation. It is important that the voting logic is glitch free.
Otherwise, the glitchy TMR voting logic (which can occur due to mitigation
across separate clock domains or hazardous combinational logic) can
be caught by a clock edge. In low frequency designs, the probability
of mitigation glitches being caught by clock edges is low. However,
as we increase the frequency, so does the probability. Probability
of glitch capture is based on clock speed, fan-out, and the width of
the glitch (speed of the gate logic within the voting circuitry).
The following figures demonstrate
a poor implementation of TMR using a 32-bit counter. A counter
was chosen for this example due to the large fanout of the circuitry.
Figure
2: 32 Bit counter with TMR
Figure
3: Waveform Demonstrating the Effects of Glithes in the TMR Logic
It is important to note that SEUs are asynchronous. It is impossible to control when (or where) the event will occur. This is a point that is overwhelmingly overlooked.
Because this paper focuses
on sequential Single Event Upsets (SEUs) within a synchronous design
environment, it is important to discuss the affects of an asynchronous
event (in which every internal DFF is susceptible) within a synchronous
design environment. If the event occurs well within the clock
period, then there may be enough time to deterministically correct the
error (depending on the scheme and clock speed). However, if the
SEU occurs near a clock edge, Metastability (or unpredictable events)
can occur. Current mitigation schemes do not take this into account
and thus decreases the anticipated reliability.
DFFs play an important role
within synchronous designs. They are contained within one clock
domain and act as deterministic timing boundaries. Such a design strategy
(synchronous techniques) increases the verification coverage of a circuit
and enhances production.
Synchronous state machines
use DFFs to hold its current state value. State Machines are generally
used as controllers and are at the heart of most designs. If a synchronous
state machine is not designed to accommodate Single Event Upsets, the
circuit can become locked or produce unpredictable behavior until a
reset is generated. Unfortunately, waiting for a system reset may not
be a suitable solution. Thus, designers should be aware of techniques
for error detection and perhaps correction specifically for erroneous
state conditions.
It is important to note that
current synthesis tools are specifically geared towards area and timing
optimization. Their algorithms will want to “erase” redundant schemes
for fault tolerance that the designer places within the HDL code. Therefore,
the designer must also be familiar with the synthesis package of choice
and apply the necessary directives for preserving redundant logic.
Generally, state machines are utilized as controlling mechanisms within a design. They determine when signals should turn on or off, when to implement (or stop implementing) a function, how long to wait for events, etc…. However, state machines are not necessary to implement such functionality. As a matter of fact, using state machines has the tendency to increase the number of total gates within a design. In addition, if operating a state machine in a disruptive environment (such as a space mission), the entire system can lock up into an unreachable condition. So why use state machines? Utilization of state machines afford the designer: an easy to follow design methodology, manageable design reviews, and a hook into a systematic verification process. It also alleviates the propensity of creating “spaghetti “(out of control) designs.
A synchronous state machine
is designed to deterministically transition through a pattern of defined
states. A state is represented by a register (set of DFFs) and is referred
to as the “current state.”
Figure
4: Traditional State Machine Logic
Diagram
The structure consists of four
parts:
Each state of a state machine
must be mapped into some type of encoding (pattern of bits). Once the
state has been mapped, it is then considered a defined (legal) state.
Depending on the encoding scheme and the number of states within the
design, there can be “unused” encoded patterns – i.e. there is
no defined mapping from the encoded pattern to a state. In such a case,
if there is a fault within the circuitry and the state machine jumps
into one of these unmapped (illegal) states, then the circuit has the
potential to become locked (or stuck) in some unreachable condition
(undefined state).
The current most popular encoding
schemes used by designers are: One Hot, Binary, and Gray. The reader
should be aware that there are many other schemes; however, this paper
will only address Binary and One Hot – plus a special encoding technique
specifically for correction.
Figure
5: Binary and One Hot Encoding Schemes - Mapped States vs. Unmapped
States
The Binary Encoding technique
maps states into a base 2 counting scheme. (Log2(NStates))
registers are required – i.e. for 5 states (NStates), 3 registers
(NReg) are required. There will always be 2NReg –
Nstates unmapped states.
Because 2NReg – Nstates will
always be less than NStates (theory of binary of encoding), the probability
of flipping into a mapped state is higher than flipping into an unmapped
state (IDLE or error state was not including in this deduction).
Plainly speaking, it is more likely to jump into “any” of the mapped
states than the IDLE state (the word “any” brings the non-deterministic
nature to this problem).
Figure 6: SEU and Binary Encoding Response
The One-Hot encoding requires only one bit to be turned on at a time, i.e. each state is mapped to one bit. One Hot state-machines require more registers than Binary or Gray thus ultimately have more “illegal states”. However, it takes two bits to flip in order to transition into a mapped state. This makes detecting errors within One-Hot easy. If implemented correctly, this can make transitioning from an unmapped state deterministic.
There are many debates over
which technique is best for fault detection implementation: one-hot
vs. sequential. The two styles differ such that if a bit has flipped
(a fault) the binary value of the one-hot encoding will no longer be
“One-Hot” (more than one bit will be a ‘1’ or all bits will
be ‘0’). Thus the error detection lies within the binary representation.
With Binary Encoding this is
not true. The reasoning is that a flipped bit (a fault) can represent
a “legal” state. Therefore, combinatorial error detection will not
suffice – the logic needs memory to hold the parity of the good state.
Some designers believe that the use of the extra parity bit (extra DFF within the state machine) can be beneficial for fault detection in sequential machines. This is a false statement. There are many scenarios that can be presented that show the implications of assuming a level of fault detection while implementing sequential encoding and parity. Most of them are based on the premise that multiple paths within the circuitry have different routing delays and if an SEU happens near a clock edge, unpredictable results can occur. The obvious problem is that an incorrect transition can occur without any detection. This is a phenomenon of a “hamming distance of one” encoding scheme such as the sequential state machine. Employing a hamming-1 encode is not a recommended practice while implementing fault detection and correction.
The main objective of a fault tolerant state machine is to be able to detect an error (a flipped bit in the current state registers) and have a deterministic response within a deterministic time frame. The definition of the response and its response time is dependent on the design requirements. For example, the response may be as simple as indicating that an error occurred and waiting for a signal to bring the state machine to a known “working” state… or the response may be as complex as automatic correction of the error within one clock cycle.
Most designers believe that
they are implementing a bucket-approach to fault tolerance when using
the default clause (VHDL- when others) within a CASE statement. However,
the synthesis tools ignore these statements when synthesizing state
machines. Most designers are not aware that these statements (that suggest
“bucketing” unused states and supplying a transition out of a fault
condition) have no bearing on the actual gates being created. The reasoning
behind this is that if the synthesis tool were to implement all unused
states, the required area would be excessive.
Synthesis tools have a directive
that the designer can use called “safe”. When applying this directive
to a state machine, the tool can produce a bucket of illegal states.
However, the designer must be forewarned that using the “safe” directive
has many flaws:
1. It generally creates more gates than desired, i.e. buckets of illegal states can become very large.
2. Synplify will implement
a “safe” one-hot but Leonardo and Precision will only implement
a sequential state machine (binary or Gray encoding) while using the
“safe” directive.
Recently, there has been a
surge of designers using the “safe” options offered by several different
synthesis tools. They have defined a safe state machine as one that
will always transition to a known state - i.e., if an SEU occurs and
an illegal or unmapped state is reached, one will recover to a known
state.
Synthesis tools offer a “Safe” option (demand from the Aerospace industry):
TYPE states IS ( IDLE, GET_DATA, PROCESS_DATA, SEND_DATA, BAD_DATA );
SIGNAL current_state, next_state : states;
attribute SAFE_FSM: Boolean;
attribute SAFE_FSM of states: type is true;
Although this scheme may appear to be “fool-proof”, it is not. The “safe” option will not always transition to a reliable (or known) state. Some have defined the word “known” as mapped. However, the idea is to be deterministic. The idea is to always flip into some “ERROR” or “IDLE” state upon getting an SEU. Flipping into a mapped state, in which the rest of the system may not know the state machine is there, is not deterministic. It is not good enough to transition to a mapped state:
- The machine can get stuck waiting for an input that will never occur (rest of the system doesn’t know that you have flipped into this mapped state).
- Unexpected signals can turn on (or off).
- And many more reasons…
The compilers are geared towards implementing the “safe” option with Binary Encoding. In such a case, there exists a false sense of safety. For example: if an SEU transitions the state machine (i.e. one of the registers gets hit and flips) into a mapped (legal) state (however it is an illegal transition to that state) an extreme fault can occur – there can be severe output behavior with no error detection.
Most importantly, note that in a Binary Encoded State Machine there will always be a higher probability of flipping into a mapped state (“good”) state, than there is flipping into an unmapped state.
Unfortunately, Binary Encoded state machines can be faulty. The designer must consider the list of potential problems before using the safe directive as the method for fault detection.
An alternate approach beyond using the synthesis “safe” directive should be taken to ensure trust worthy fault tolerance. This paper will first address single-bit error detection techniques. Afterwards, a robust single-bit error correction scheme along with a new encoding technique will be presented.
- Single-Bit Error Detection and Recovery
This section will focus on detecting that an error has occurred within one of the state machine DFFs and then recovering from the fault. Remember that it is necessary for the designer to manually provide recovery logic.
- One-Hot
The beauty of the one hot encoding style stems from the fact that each state has a hamming distance of 2 (it takes 2 transitions to get from one state to another) thereby it inherently has SEU error detection. During normal operation of a one hot state machine, only one bit is turned on – indicating which state is current. Thus the current state register should always have an odd parity. Each transition requires 2 bits to flip (one bit turns off while one bit turns on). If there is an SEU within the state machine, only one bit will flip. The parity will then switch to even.
Example: Using the “five state” example, assume the circuit is in GET_DATA (“00010” - odd). AN SEU can cause the state machine to have the encoded pattern of “00000”. However, “00000” is not mapped to any state and is easily detected because it has even parity. It is thus considered an undefined or illegal state for the state machine. The same theory can be used if the state machine was in state GET_DATA and the current state flip flops changed to: “10010”.
- Using Multiplexer Control for Next State Transitions
The most efficient means of SEU error detection and recovery in one-hot state machines is the use of a combinatorial parity checker over the one-hot current state register set. The designer would use the output of the error detection logic to either place the state machine into its normal operational next state at the following clock edge or to set the machine into a designated error state (could also be as simple as going back to a reset state) upon detecting an SEU. The fact that the error detection is purely combinatorial logic is a plus because it contains no DFF’s and thus the detection logic cannot create a fault of its own.
Figure 7: SEU Error Detection within One Hot Encoding
Note: The user must declare the state machine (using attributes) as one-hot to ensure that the encoding is as expected.
- Error Correction
Notice that this section discusses Correction which is distinctively different than detection. There are many publications on the theory of Error Correction. However, none directly address how to correctly implement state machine fault correction while using current day synthesis tools. As mentioned earlier, synthesis tools focus on area reduction and speed – the tools will want to “optimize” away the necessary redundant logic for fault tolerance. Thus the designer is left to manually create the desired fault tolerant logic.
- Error Correction Implementation using Hamming 3 Encoding
One method of correction is based off of creating a substantial hamming distance between state encodings. Each state has a base encoding plus companion encodings.
The companions are every sequence that can be a hamming distance of one away (only one bit is different) from the base state encoding (thus representing a possible SEU). All of the base encodings must be a hamming distance of three away from each other. The following is an example of a four state implementation – States A,B,C,D:
1. Find an encoding such that the states have a hamming distance of 3 (at least
3 bits must be different from state to state)...
- 00000 (state-A)
- 11100(state-B)
- 01111(state-C)
- 10011(state-D)
Five bits are necessary to encode a four-state machine in order to achieve the required hamming distance of three.
2. For each encoding, calculate the companion encodings such that the hamming distance is one... for example - the companion encodings for state A (00000) is: 00001,00010,00100,01000,10000 (only one bit differs from state A for each companion state); B(11100) is 11101,11110,11001,10100,01100 ... and so on...
3. Now when implementing the state machine, state A is actually 00000 plus its companion encodings so that a possible SEU is covered (do the same for all other states).
- Error Correction and Glitch Control
One major issue that is extremely overlooked is SEUs occurring near clock edges. When implementing fault tolerance, the designer must be aware of all scenarios. If the designer does not carefully code the state machine in HDL, glitches (static hazards) can occur. It is left to the designer to shut off the synthesis optimization in order to ensure glitch-free state machine transitions.
In a synchronous design, glitches are generally not a problem as long as all critical paths meet timing, outputs are registered, and any asynchronous signals are synchronized to the system clock. However, an SEU can happen at any time and create dangerous asynchronous behavior – i.e., a glitch near a clock edge. In order to safe guard against glitchy circuitry, the designer must understand how glitches are formed.
Referencing a Karnaugh-Map, prime implicants are formed by grouping together adjacent, equivalent values. When the initial and final inputs are covered by the same prime implicant, no glitch can occur. However, when the input change spans prime implicants, a glitch can occur. The beauty of the proposed method of error correction is that hazard-free prime implicants (SOP – Sum of Products) can be easily derived and hand coded because all companion states are a hamming distance of one away from the base state. For simplicity, the K-Map presented in Figure 3 only shows 4 dimensions.
Figure 8: Karnaugh Map and Sum of Products (SOP) for StateA and its Companion States
As one can see from the K-Map in Figure 8, if the current state is “00000” and an SEU occurs, there will be a glitch-free transition into one of the companion states as long as the SOP contains the prime implicants circled in the K-Map.
- Conclusions
The usage of state machines in order to efficiently control mechanisms within a design has become very popular. This paper proposes methods of Fault Tolerant State Machine implementation due to potential IC SEU susceptibility. Before choosing any particular technique, the designer should first be aware of the probability of electrical faults and then decide the appropriate level of coverage necessary to achieve a robust design.
Special directives must be used in order to drive the synthesis tools when implementing fault tolerant redundant logic because the tools are generally focused on area and speed optimization. Thus, once the gates are produced, the designer should check that no functionality has been “optimized” away and that the appropriate state machine has been realized.
- References
[1] Douglas Smith, HDL Chip Design, Doone
Publications, Madison, AL, USA, 1996
[2] Parag K. Lala, Self-Checking and Fault-
Tolerant Digital Design, Academic Press,
2001
[3] Randy H. Katz, Contemporary Logic
Design, Benjamin/Cummings Publishing
Company, Inc., 1994
[4] PrecisionTM Synthesis Reference
Manual, Mentor Graphics Corp.,
Release 2003b.
All Rights Reserved Powered by Free Document Search and Download
Copyright © 2011