Adaptable Fault Tolerance Configurations for Multiprocessor Systems

  • International Journal of Applied Information Systems
  • Foundation of Computer Science (FCS), NY, USA
  • Volume 3 - Number 2
  • Year of Publication: 2012
  • Authors: Samia A. Ali
  • http:/ijais12-450448
  • Samia A. Ali 2012. Adaptable Fault Tolerance Configurations for Multiprocessor Systems. International Journal of Applied Information Systems. 3, 2 (July 2012), 1-8. DOI=http://dx.doi.org/10.5120/ijais450448
  • @article{10.5120/ijais2017451568,
    author = {Samia A. Ali},
    title = {Adaptable Fault Tolerance Configurations for Multiprocessor Systems},
    journal = {International Journal of Applied Information Systems},
    issue_date = {July 2012},
    volume = {3},
    number = {},
    month = {July},
    year = {2012},
    issn = {},
    pages = {1-8},
    numpages = {},
    url = {/archives/volume3/number2/201-0448},
    doi = { http:/ijais12-450448},
    publisher = {© 2010 by IJAIS Journal},
    address = {}
    }
    
  • %1 450448
    %A Samia A.  Ali
    %T Adaptable Fault Tolerance Configurations for Multiprocessor Systems
    %J International Journal of Applied Information Systems
    %@ 
    %V 3
    %N 
    %P 1-8
    %D 2012
    %I © 2010 by IJAIS Journal
    

Abstract

The growing complexity of multiprocessor systems increases the probability of faults occurring in these systems. As a consequence, there is a great need for fault-tolerant processing in multiprocessor systems. Fault tolerance generally requires some form of hardware and/or time redundancy. Two fault-tolerant configurations are proposed for both single and double transient and permanent faults in any processor of a multiprocessor system. Fault tolerance takes place in three consecutive steps: fault detection, fault diagnosis, and system recovery. The overhead cost is only 100% hardware for the first configuration, or 100% time for the second, for fault detection, plus an extra 100% time for fault diagnosis and system recovery, incurred only by those processes running on the faulty processors. The advantages of the proposed configurations are their ease of applicability and their low overhead cost relative to a system without any fault tolerance. An enhancement is developed for both configurations that checks the system state adequately, so that faults are detected and recovered from as soon as they affect the system. Simulations are performed to illustrate the usefulness of the proposed configurations.
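
The three steps named in the abstract (detection, diagnosis, recovery) can be illustrated with a minimal Python sketch. It is not the paper's design: the abstract gives no implementation details, so the duplicate-and-compare detection, the third tie-breaking run, and every name here (`healthy`, `make_transient_faulty`, `detect_diagnose_recover`) are assumptions made for illustration. Passing the same processor twice models time redundancy; passing two different processors models hardware redundancy.

```python
from typing import Callable, Tuple

# Hypothetical model: a processor is a function from an input to a result.
Processor = Callable[[int], int]


def healthy(x: int) -> int:
    """Fault-free processor computing the intended function (here, x squared)."""
    return x * x


def make_transient_faulty(fail_on_call: int) -> Processor:
    """Processor whose result is corrupted on exactly one call (a transient fault)."""
    calls = {"n": 0}

    def proc(x: int) -> int:
        calls["n"] += 1
        result = x * x
        # Flip the lowest bit on the chosen call to simulate the fault.
        return result ^ 1 if calls["n"] == fail_on_call else result

    return proc


def detect_diagnose_recover(
    first_proc: Processor, second_proc: Processor, spare: Processor, x: int
) -> Tuple[int, str]:
    # Detection: run the computation twice and compare (100% redundancy).
    first = first_proc(x)
    second = second_proc(x)
    if first == second:
        return first, "no fault"
    # Diagnosis and recovery: one extra run, paid only when a mismatch occurs.
    third = spare(x)
    if third == first:
        return third, "second run faulty"
    if third == second:
        return third, "first run faulty"
    raise RuntimeError("all three results disagree; cannot recover")


# Time redundancy: the same processor is used for every run.
p = make_transient_faulty(fail_on_call=1)
print(detect_diagnose_recover(p, p, p, 7))  # (49, 'first run faulty')

# Hardware redundancy: a second processor duplicates the work.
q = make_transient_faulty(fail_on_call=1)
print(detect_diagnose_recover(healthy, q, healthy, 7))  # (49, 'second run faulty')
```

In this sketch the extra run happens only after a mismatch, which mirrors the abstract's statement that the diagnosis and recovery overhead falls only on processes running on the faulty processors. A permanent fault would corrupt the third run as well on the same processor, so the tie-breaking run would have to use a spare.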

Keywords

Hardware Redundancy, Time Redundancy, Transient Fault, Permanent Fault, Cold Standby Spare

Index Terms

Computer Science
Information Sciences