Adaptable Fault Tolerance Configurations for Multiprocessor Systems

Samia A. Ali

International Journal of Applied Information Systems
Foundation of Computer Science (FCS), NY, USA
Volume 3 - Number 2
Year of Publication: 2012
Authors: Samia A. Ali
http:/ijais12-450448


Samia A. Ali 2012. Adaptable Fault Tolerance Configurations for Multiprocessor Systems. International Journal of Applied Information Systems. 3, 2 (July 2012), 1-8. DOI=http://dx.doi.org/10.5120/ijais450448

@article{10.5120/ijais2017451568,
author = {Samia A. Ali},
title = {Adaptable Fault Tolerance Configurations for Multiprocessor Systems},
journal = {International Journal of Applied Information Systems},
issue_date = {July 2012},
volume = {3},
number = {2},
month = {July},
year = {2012},
issn = {},
pages = {1-8},
numpages = {},
url = {/archives/volume3/number2/201-0448},
doi = { http:/ijais12-450448},
publisher = {© 2010 by IJAIS Journal},
address = {}
}

%1 450448
%A Samia A. Ali
%T Adaptable Fault Tolerance Configurations for Multiprocessor Systems
%J International Journal of Applied Information Systems
%@ 
%V 3
%N 2
%P 1-8
%D 2012
%I © 2010 by IJAIS Journal

Abstract

The growing complexity of multiprocessor systems increases the probability of faults occurring in these systems. As a consequence, there is a great need for fault-tolerant processing in multiprocessor systems. Fault tolerance generally requires some form of hardware and/or time redundancy. Two fault-tolerant configurations are proposed that handle both single and double faults, transient or permanent, in any processor of a multiprocessor system. Fault tolerance takes place in three consecutive steps: fault detection, fault diagnosis, and system recovery. For fault detection, the overhead is only 100% hardware in the first configuration and 100% time in the second; fault diagnosis and system recovery cost an extra 100% time, incurred only by the processes running on the faulty processors. The advantages of the proposed configurations are their ease of applicability and their low overhead relative to a system without any fault tolerance. An enhancement to both configurations checks the system state so that faults are detected and recovered from as soon as they affect the system. Simulations are performed to illustrate the usefulness of the proposed configurations.
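
The abstract does not spell out how detection, diagnosis and recovery are carried out. As a minimal sketch of the time-redundancy idea only, assume that a fault is detected by executing a computation twice and comparing the results, and that a mismatch triggers one further execution whose result identifies the faulty run and supplies the recovered value. The names `tolerate_fault` and `make_flaky` are hypothetical and do not come from the paper, and the sketch covers a single transient fault, not the double or permanent faults the paper also addresses.

```python
from typing import Callable, Set, Tuple


def make_flaky(fn: Callable[[int], int], faulty_calls: Set[int]) -> Callable[[int], int]:
    """Wrap fn so that the calls numbered in faulty_calls (1-based) return a
    corrupted result. This is a hypothetical stand-in for a transient fault."""
    state = {"calls": 0}

    def wrapped(x: int) -> int:
        state["calls"] += 1
        result = fn(x)
        # Flip the lowest bit to model a corrupted output.
        return result ^ 1 if state["calls"] in faulty_calls else result

    return wrapped


def tolerate_fault(compute: Callable[[int], int], x: int) -> Tuple[int, int, bool]:
    """Return (result, executions_used, fault_detected).

    Assumed scheme, not taken from the paper:
      detection  - run twice and compare (the 100% time overhead)
      diagnosis  - on mismatch, run a third time (the extra 100% time)
      recovery   - accept the value on which two of the three runs agree
    """
    first = compute(x)
    second = compute(x)
    if first == second:
        return first, 2, False  # no fault detected, no extra cost

    third = compute(x)  # paid only when a mismatch was detected
    if third == first or third == second:
        return third, 3, True
    raise RuntimeError("no two executions agree; beyond a single fault")


def square(x: int) -> int:
    return x * x


if __name__ == "__main__":
    print(tolerate_fault(square, 7))                   # fault-free run
    print(tolerate_fault(make_flaky(square, {2}), 7))  # second run corrupted
```

In the fault-free case the sketch pays only for the duplicate execution; the third execution is incurred only after a mismatch, which mirrors the abstract's claim that diagnosis and recovery overhead applies only to processes affected by a fault. Two runs corrupted in the same way would go undetected by this simple comparison.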

Keywords

Hardware Redundancy, Time Redundancy, Transient Fault, Permanent Fault, Cold Standby Spare

Index Terms

Computer Science

Information Sciences
