Software System Stability
Speaker: Prof. Lui SHA, Department of Computer Science, University of Illinois at Urbana-Champaign
Title: "Software System Stability"
Date: Monday, 20 February 2006
Time: 4:00pm - 5:00pm
Venue: Lecture Theatre F (Leung Yat Sing Lecture Theatre, near lift nos. 25/26), HKUST

ABSTRACT:

The development of large-scale mission-critical systems has emerged as a major scientific and engineering challenge. For example, the FAA's major modernization project, the Advanced Automation System (AAS), was originally estimated to cost $2.5 billion with a completion date of 1996. In 1994, "FAA cancelled the AAS program, casting aside 11 years of development time and, according to GAO, wasting more than $1.5 billion of taxpayer money." The opening of Hong Kong's new airport was similarly chaotic: "Many people could not find their departure gate. The monitor would say Gate 15, but the airline staff would say Gate 43 . . . or 19 or 33. ... Out on the tarmac, some pilots didn't know where to park and passengers sat perspiring in their seats. ... a computer gremlin had prevented the main air cargo operator, HACTL, from retrieving vital shipping information. As a result, freight operations ground to a halt, and containers full of perishables rotted on the steamy tarmac."

These are not isolated incidents. Serious problems in the development and integration of large software systems are in fact very common. The problems are so serious that the US Congress passed laws to mandate the reform of large-scale information system development, acquisition and maintenance. This led to the creation of an architecture framework for system integration. In spite of such efforts, the system development and integration problems remain. Even in less ambitious, typical commercial system development, debugging and testing account for 50-75% of total development cost. Large, networked systems of systems are built from many components that were designed in the past for different requirements, and they contain known and unknown defects.
They often have overly complex and unconstrained interactions. Indeed, major system failures are often traced back to unexpected global interactions that involve many components, not to an isolated defect in one module.

On the other hand, it is worth noting that the current mission management software of civil aviation has hundreds of reported residual bugs, yet flights remain safe. There are at least thousands of residual bugs in the telecommunication network, and it remains highly reliable. There are perhaps millions of bugs in the World Wide Web system of systems, but it works quite acceptably. On an even larger scale, the United States of America is a highly stable and evolvable system. It has grown and made truly remarkable progress by the metric of civilization, even though many problems remain. And its basic components, human beings, are complex, error prone, and hard to test or verify. Complex but stable systems are uncommon, but they have been built.

This talk presents an initial investigation of:
- The root causes of large system failures and what needs to be done.
- Structures that keep highly complex systems stable in spite of residual errors.

****************************

Biography:

Lui Sha graduated with a Ph.D. from Carnegie Mellon University in 1985. He was a member and then a senior member of the Technical Staff at the Software Engineering Institute from 1985 to 1998. Since 1998, he has been a professor and then the Donald B. Gillis Professor of Computer Science at the University of Illinois at Urbana-Champaign. He was elected ACM Fellow in 2005 for contributions to real-time systems and elected IEEE Fellow in 1998 "for technical leadership and research contributions, which enabled the transformation of real-time computing practice from an ad hoc process to an engineering process based on analytic methods." He received the Outstanding Leadership and Technical Contribution Award from the IEEE Technical Committee on Real Time Systems in 2001.
He was cited in UIUC's list of Teachers Ranked as Excellent by Their Students in 1999 and 2000. He was the Chair of the IEEE Technical Committee on Real Time Systems from 1999 to 2000, and was a member of the National Academy of Sciences' study group on software dependability and certification from 2004 to 2005. His work was cited as a notable accomplishment in the Selected Accomplishment section of the 1992 National Academy of Sciences' report, A Broader Agenda for Computer Science and Engineering. His work on real-time computing is supported by most of the open standards in real-time computing. He has made significant contributions to many national high-technology projects, including the GPS upgrade, the Mars Pathfinder, and the International Space Station. He currently works on technologies for the integration and development of robust real-time systems.