ITGS

Information Technology in a Global Society at Kelly High School

Ethics: Reliability and Safety

Areas of Impact

Reliability matters most in systems where a failure would be dangerous.

Mean time between failures

Reliability refers to the ability of an information system to keep working without failing. It is often measured as MTBF: mean time between failures. The technical staff record each time the system fails and average the time between failures. The result might be measured in years, or in days.
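The averaging described above can be sketched in a few lines of Python. This is a simplified illustration, not a standard tool: the function name and the failure log are made up, and repair time is ignored, so the gaps are measured simply from one failure to the next.

```python
from datetime import datetime


def mean_time_between_failures(failure_times):
    """Return the average gap, in hours, between consecutive failures.

    failure_times: a list of datetime objects, one per recorded failure.
    """
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    ordered = sorted(failure_times)
    # Pair each failure with the next one and measure the gap in hours.
    gaps = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return sum(gaps) / len(gaps)


# Hypothetical failure log, for illustration only.
log = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 3),  # 48 hours after the first failure
    datetime(2024, 1, 4),  # 24 hours after the second
    datetime(2024, 1, 8),  # 96 hours after the third
]
print(mean_time_between_failures(log))  # 56.0
```

Notice that one long quiet stretch pulls the average up, so the same MTBF can describe very different failure patterns.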

Is MTBF a good guide to the reliability of software?  Give reasons for and against.

Why are computers unreliable?

Software cannot be touched or seen as a physical entity. You have a feeling that you don't know what you are dealing with. Is it really less reliable than mechanical machinery?  In some cases, yes.

Why do we settle for less reliable machines? What is our trade-off?

Here are some reasons why computers are unreliable:

Computers are flexible, but change brings errors

Computers are designed to be flexible; that is why they are programmable.  But changes to a system have unintended consequences.  Problems arise when systems become complex because the specifications change after the initial design.  Also, over time, software from older systems is maintained as part of newer systems, adding to the complexity and the chances of failure.  As software becomes more complex, the chance of having at least one error in the system approaches 100%.
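A rough way to see why the chance of an error approaches 100% is to assume each part of a program has a small, independent chance of containing a defect. Both the independence assumption and the 0.1% defect rate below are made up for illustration; real defects are not this tidy.

```python
def chance_of_at_least_one_defect(parts, defect_rate):
    """Probability that at least one of `parts` components has a defect.

    Assumes every part has the same, independent chance of a defect.
    """
    # The system is defect-free only if every single part is defect-free.
    chance_all_clean = (1 - defect_rate) ** parts
    return 1 - chance_all_clean


# Hypothetical defect rate of 0.1% per part.
for parts in (10, 100, 1000, 10000):
    chance = chance_of_at_least_one_defect(parts, 0.001)
    print(f"{parts:>6} parts: {chance:.1%} chance of at least one defect")
```

Even with a tiny defect rate per part, the total climbs quickly as the number of parts grows.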

In January 1990, code something like this was running in a piece of AT&T switching equipment:

switch (errorCode)
{
   case 32:
      bla bla bla;
      break;

   case 33:
      bla bla bla;
      break;

   case 34:
      bla bla bla;
      if (condition)
      {
        bla bla bla;
        break;
      }

   case 35:
      bla bla bla;
      break;

    etc;

}

One line was entered in the wrong place, and it disrupted AT&T's long-distance service across the United States for about nine hours.

Can you guess where the error is?

Read this story about the Therac-25.

Why was the software-controlled machine more dangerous than the previous, mechanical Therac?

Why would they choose to use software instead of mechanical devices when building the new Therac?

What changes in the development of the machine could have prevented the failures?

What can be done to make systems more reliable?

Redundancy and distribution

A simple way of making a system more reliable is to duplicate it.  If something goes wrong with one system and it crashes, just switch to a second system.  This is very common with hard drives (RAID and mirrored drives), systems (mirrored servers), databases, networks, etc.  

Redundancy works better with hardware than with software. Why do you think that is?

What is a RAID?  What is a mirrored system or disk?

If one system has a 1% chance of failure, and then you add another system as a backup, also with 1% chance of failure, is the new theoretical failure rate 1% times 1%, or 0.01%?  Why or why not?  (discussion on probability).
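One way to explore this question is to compare two cases: failures that are completely independent, and failures that sometimes share a cause (the same power cut, or the same software bug in both copies). The sketch below is a simple model for discussion, not a formula from any standard; the `p_common_cause` parameter and its value are assumptions.

```python
def both_fail_probability(p_primary, p_backup, p_common_cause=0.0):
    """Chance that the primary and the backup are down at the same time.

    p_common_cause: chance of a shared event that takes out both systems
    at once (hypothetical parameter for this illustration).
    """
    # If nothing is shared, the two failures simply multiply.
    independent = p_primary * p_backup
    # Otherwise a shared event fails both, regardless of their own odds.
    return p_common_cause + (1 - p_common_cause) * independent


print(f"independent:       {both_fail_probability(0.01, 0.01):.2%}")
print(f"with shared cause: {both_fail_probability(0.01, 0.01, 0.005):.2%}")
```

The multiplication only holds when nothing links the two systems. Two copies of the same program running the same input share every bug, which is worth keeping in mind for the hardware versus software question above.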

Distributed systems are very common today.  Distribution allows for scalability and reliability.  If one element of the system fails, the other parts can pick up the slack.  Picture 100 small computers instead of 1 huge computer.  If one of the 100 small computers fails, you lose only 1% of the power and you would probably not even notice it.  If the one huge computer fails, the whole system fails.  Do not put all your eggs in one basket.
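The eggs-in-one-basket arithmetic can be written out as a small sketch. It assumes every machine contributes an equal share of the work, which real systems only approximate; the function name is made up for this example.

```python
def capacity_remaining(total_machines, failed_machines):
    """Fraction of computing power left when identical machines share the load."""
    if not 0 <= failed_machines <= total_machines:
        raise ValueError("failed_machines must be between 0 and total_machines")
    return (total_machines - failed_machines) / total_machines


print(capacity_remaining(100, 1))  # 0.99, one of 100 small computers fails
print(capacity_remaining(1, 1))    # 0.0, the one huge computer fails
```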

System Monitoring

The mechanical Therac system had built-in fail-safe mechanisms that protected against overexposure to radiation.  Software systems can have the same.  Modern software systems are often built in layers, with each layer performing error checking and verifying correct input from another level or output to another level.  Systems are more reliable when they monitor themselves with sensors (input) from the real world. Self-monitoring is not done more extensively because it is more expensive, in both hardware and software development.  Also, false positives (reporting that something is wrong when it isn't) could lead to more apparent system failures and downtime.
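Here is a minimal sketch of layered checking with a sensor, written for discussion only. It is not how the Therac or any real medical device works: the limit, the tolerance, and every name below are invented. One layer checks the operator's request, a second layer compares it with what a sensor reports, and any disagreement leaves the machine in its safe state.

```python
MAX_SAFE_SETTING = 200  # hypothetical limit, in arbitrary units


class UnsafeRequest(Exception):
    """Raised when a layer refuses to pass a request on."""


def check_request(requested):
    """Input layer: reject settings outside the allowed range."""
    if not 0 < requested <= MAX_SAFE_SETTING:
        raise UnsafeRequest(f"requested setting {requested} is out of range")
    return requested


def check_sensor(requested, measured, tolerance=0.05):
    """Monitoring layer: the hardware must report what was requested."""
    if abs(measured - requested) > tolerance * requested:
        raise UnsafeRequest(f"sensor reads {measured}, expected {requested}")
    return measured


def deliver(requested, read_sensor):
    """Switch the beam on only if every layer agrees; otherwise fail safe."""
    try:
        setting = check_request(requested)
        check_sensor(setting, read_sensor())
    except UnsafeRequest as problem:
        return f"beam off: {problem}"
    return "beam on"


print(deliver(100, lambda: 100))  # sensor agrees with the request
print(deliver(100, lambda: 150))  # sensor disagrees, so the beam stays off
print(deliver(500, lambda: 500))  # request itself is out of range
```

The `tolerance` value shows the trade-off mentioned above: make it too strict and a slightly noisy sensor will shut the system down when nothing is actually wrong.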

What is a fail-safe?  

Give examples of systems that use extensive monitoring.

Here is one school of thought on what needs to be done to make software development more reliable.

Testing

Testing systems is time-consuming and expensive, but it is a very good way of improving reliability.  Today, many companies rely on consumers to test their software for them.  They release a "beta" version that is often free. The consumers trade their willingness to test a new product for the reduced cost and the advantage of having the new, presumably superior software.  Microsoft has released operating systems this way, which is one reason to be cautious about buying the first edition of a new Windows operating system.

Five Levels of Software Engineering Development

Describe each stage of software development in the model we used to develop your projects.

An essential element of making software more reliable is changing the way programs are written in large companies.  Too many cooks spoil the broth, but we need many cooks to get these things done.  Software engineering outlines how to do that reliably.  One such model is the Capability Maturity Model (CMM), conceived by Watts Humphrey and developed at the Software Engineering Institute at Carnegie Mellon, which is funded by the US Department of Defense.  Getting an organization to the top level, five, should be the goal of all companies that need to make very reliable software.

Why is the US Department of Defense so interested in software engineering?

The CMM has five levels you need to memorize:

Level 1 - Initial

Level 1 is the baseline level, where basically no good practices are in place.  This was how I normally worked when programming.  There might be a process for setting requirements, making a plan, designing, implementing, and testing a system, but it was normally not followed because of time constraints.  Everything was needed yesterday, so we just cranked out what we could in the fastest way possible.  Fortunately, I never worked for the Department of Defense.  

Often, companies working at this level go over budget, and their projects take longer than expected because they have no real plan.  As things start going wrong, it costs more to fix the problems.  So it is difficult to determine how much something will cost and whether or not it is worth the effort.  Consequently, programs are often unreliable.

Can you relate this level to things you do in your own life?

When you test your software, should you be happy when defects are found? Should you be looking for defects or making sure it works correctly in the test? Why?

Level 2 - Repeatable

At level two, a process for developing software is followed to the extent that if the project were developed a second time, the development cycle would be done the same way.  There is a consistent plan, so that if the project were done twice it would be repeated the same way, like a science experiment.  The company uses some planning and tracking of costs. There is a process and it is mostly followed.  The company uses some project management (timelines, testing, etc.) but still faces budget overruns and unreliable timelines.

Did you develop your project for this class at level 1 or level 2? Explain.

Level 3 - Defined

At level three, a standard process of software development is used throughout the organization, and it improves somewhat over time.  The standards are followed across the whole company, and budgets and timelines are frequently met.

What would you have needed to reach level three in this class?

Level 4 - Managed

There are people managing and following the process.  Measurements are collected about the development cycle (time spent on each project and on each stage of development), and all development is documented (including the defects found).  The process is checked for ways of improving it, based on the information collected on previous projects.  Everything is quantitative, meaning measurable.  

In what ways did you document your project's development?  In what ways could you have improved the way you kept track of what you were doing in development to reach this level?

At level four, you can predict the probability of a successful outcome for your project based on the measurements you collect during development.
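As a very rough illustration of predicting from measurements, the sketch below estimates the chance of finishing on time from records of past projects. The records, the 10% allowance, and the function name are all made up; a real level-four organization would collect far more data and use more careful statistics.

```python
def on_time_rate(past_projects, allowed_overrun=0.10):
    """Share of past projects whose actual hours stayed near the estimate.

    past_projects: list of (estimated_hours, actual_hours) pairs.
    allowed_overrun: how far over the estimate still counts as on time.
    """
    if not past_projects:
        raise ValueError("no measurements collected yet")
    on_time = sum(
        1
        for estimated, actual in past_projects
        if actual <= estimated * (1 + allowed_overrun)
    )
    return on_time / len(past_projects)


# Hypothetical measurements from four earlier projects.
history = [(100, 105), (80, 120), (200, 210), (50, 49)]
print(on_time_rate(history))  # 0.75
```

Without the recorded estimates and actual hours there would be nothing to calculate, which is why measurement comes before prediction.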

Level 5 - Optimizing

At level five, the organization is concerned with improving the process itself, to the extent that it keeps statistical measures of process improvement.  It has people in place who can quickly change processes to optimize results based on objective measurements.  In short, it has a process for changing and improving the process: a meta-meta-programming level.

Why would moving your company up to higher levels make your software more reliable?