Software System Stability

Speaker:	Prof. Lui SHA
		Department of Computer Science
		University of Illinois at Urbana Champaign

Title:		"Software System Stability"

Date:		Monday, 20 February 2006

Time:		4:00pm - 5:00pm

Venue:		Lecture Theatre F
		(Leung Yat Sing Lecture Theatre, near lift nos. 25/26)
		HKUST

ABSTRACT:

The development of large-scale mission-critical systems has emerged as a
major scientific and engineering challenge. For example:

FAA's major modernization project, the Advanced Automation System (AAS),
was originally estimated to cost $2.5 billion with a completion date of
1996. In 1994, "FAA cancelled the AAS program, casting aside 11 years of
development time and, according to GAO, wasting more than $1.5 billion of
taxpayer money."

The chaos at the opening of Hong Kong's new airport: "Many people could
not find their departure gate. The monitor would say Gate 15, but the
airline staff would say Gate 43 . . . or 19 or 33. ... Out on the tarmac,
some pilots didn't know where to park and passengers sat perspiring in
their seats. ... a computer gremlin had prevented the main air cargo
operator, HACTL, from retrieving vital shipping information. As a result,
freight operations ground to a halt, and containers full of perishables
rotted on the steamy tarmac."

These are not isolated incidents. Serious problems in the development and
integration of large software systems are in fact very common. The
problems are so serious that the US Congress passed laws mandating the
reform of large-scale information system development, acquisition and
maintenance. This led to the creation of an architecture framework for
system integration. In spite of such efforts, the system development and
integration problems remain. Even in less ambitious, typical commercial
system development, debugging and testing account for 50-75% of total
development cost.

A large, networked system of systems is built from many components that
were designed in the past for different requirements and that contain
known and unknown defects. These components often have overly complex and
unconstrained interactions. Indeed, major system failures are often traced
back to unexpected global interactions that involve many components, not
to an isolated defect in one module.

On the other hand, it is worth noting that the current mission management
software of civil aviation has hundreds of reported residual bugs, yet
flights remain safe. There are at least thousands of residual bugs in the
telecom network, and it remains highly reliable. There are perhaps
millions of bugs in the World Wide Web system of systems, but it works
quite acceptably. On an even larger scale, the United States of America is
a highly stable and evolvable system. It has grown and made truly
remarkable progress by the metric of civilization, even though many
problems remain. And its basic components, human beings, are complex,
error prone, and hard to test or verify. Complex but stable systems are
uncommon, but they have been built.

This talk presents an initial investigation into:

- The root-causes of large system failures and what needs to be done.
- Structures that keep highly complex systems stable in spite of residual
  errors.



****************************
Biography:

Lui Sha graduated with a Ph.D. from Carnegie Mellon University in 1985. He
was a member and then a senior member of the Technical Staff at the
Software Engineering Institute from 1985 to 1998. Since 1998, he has been
a professor and then Donald B. Gillis Professor of Computer Science at the
University of Illinois at Urbana Champaign. He was elected an ACM Fellow
in 2005 for contributions to real time systems and an IEEE Fellow in 1998
"for technical leadership and research contributions, which enabled the
transformation of real-time computing practice from an ad hoc process to
an engineering process based on analytic methods." He received the
Outstanding Leadership and Technical Contribution Award from the IEEE
Technical Committee on Real Time Systems in 2001. He was cited in UIUC's
list of Teachers Ranked as Excellent by Their Students in 1999 and 2000.

He was the Chair of the IEEE Technical Committee on Real Time Systems from
1999 to 2000, and was a member of the National Academy of Science's study
group on software dependability and certification from 2004 to 2005. His
work was cited as a notable accomplishment in the Selected Accomplishment
section of the 1992 National Academy of Science's report, A Broader Agenda
for Computer Science and Engineering. His work on real time computing is
supported by most of the open standards in real time computing. He has
made significant contributions to many national high technology projects,
including the GPS upgrade, the Mars Pathfinder, and the International
Space Station. He currently works on technologies for the integration and
development of robust real time systems.