The Technical Gains of Systems Management
Technical failures of aerospace projects are hard to hide. Rockets and missiles explode. Satellites stop sending signals back to Earth. Pilots and astronauts die. To the extent that systems management helped prevent these events, it must be deemed a technical success. Systems management methods such as quality assurance, configuration control, and systems integration testing were among the primary factors in the improved dependability of ballistic missiles and spacecraft. Missile reliability in air force and JPL missile programs increased from the 50% range up to the 80% to 95% range, where it remains to this day. JPL’s spacecraft programs suffered numerous failures from 1958 to 1963, but after that JPL’s record dramatically improved, with a nearly perfect record of success for the next three decades. The manned programs suffered a number of testing failures at the start but had an enviable flight record with astronauts, with the one glaring exception of the Apollo 204 fire. A strong correlation exists between systems management and reliability improvements.
The nature of reliability argues for the positive influence of systems methods. For aerospace projects to succeed, there must be high-quality components, proper integration of these components, and designed-in backups in case failures occur. Only the last of these is a technology issue in the design sense. The selection and proper integration of components has more to do with rigorous compliance with design and manufacturing standards than it does with new technology. High component quality comes through unflagging attention to manufacturing processes, backed by testing and selection of the best parts. In a nutshell, it is easy to solder a joint or crimp a connector pin but extremely difficult to ensure that workers perform thousands or millions of solders and crimps correctly. Even a worker with the best skills and motivation will make occasional mistakes. In systems management, social processes to rigorously inspect and verify all manufacturing operations ensure high quality across the thousands of workers involved in the process.
Similarly, ensuring proper integration is a matter of making sure that each and every joint is properly soldered, every pin and connector properly crimped, every structure properly handled at all times, and all of these operations rigorously tested. On top of this, “systems testing” checks for design flaws and unexpected interactions among components. In all of these issues, procedures and processes, not new technology, are the keys to success. Systems management provided these rigorous processes and tests.
Once organizations dealt with component problems, they ran into the next most likely cause of failure: interface problems caused by mismatches between designs. By the mid-1960s, both the air force and NASA obsessively concentrated on interface problems, which resulted ultimately from poor communication, poor organization, or both. Engineers and managers recognized that differences in organizational cultures and methods made communication between organizations more difficult than communication within an organization. Miscommunication led to incompatibilities between components and subsystems — incompatibilities often found when components were first connected and tested. More technology was not the solution. Instead, engineers needed improved communication through social processes.
Engineers enforced better communication by creating standard documents and processes. They required that one organization be responsible for analyzing both sides of an interface and that the specifications and analyses be documented in a formal Interface Control Document. Many interface problems were subtler than simple mismatches between physical or electrical components.
For example, engineers at Marshall Space Flight Center found that a “nonliftoff” of a Mercury-Redstone test vehicle occurred because the Mercury capsule had a different weight than the Redstone’s normal warhead, changing the time it took for the launch vehicle to separate from the launch tower. Because the combined launch complex-launch vehicle electronics required that the vehicle lift off at a certain rate, the changed rate led to a shutdown of the launch vehicle as emergency electronics kicked in to abort the launch.
Problems such as these were solvable not through technology but through better engineering communication and better design analysis. Once engineers understood all of the factors, the design solution was usually simple. The problem was making sure the right people had the right information and that someone had responsibility for investigating the entire situation. As ELDO’s history shows, getting an organization to pay for a change in an interface was often more difficult than formulating a technical solution. Authority and communication matter most in interface problems and solutions. Better organization and better systems, not better technology, made for reliability in large aerospace projects by standardizing the processes and providing procedures to cross-check and verify each item, from solder joints to astronaut flight procedures. These methods essentially provided insurance for technical success.
How much did this insurance cost? Did systems management lower costs or speed development compared to earlier processes and methods? Concurrency in the 1950s was widely believed to shorten development times, but at enormous cost. The secretary of the air force admitted that the air force could afford only one or two such programs. Schriever contended that concurrency saved money because it shortened development time. Because R&D costs are spent mostly on engineering labor, Schriever argued, shortening development time would reduce labor hours and hence cost. Most other experts then and later disagreed with him. Political scientist Michael Brown contends that concurrency actually led to further schedule slips because problems in one part of the system led to redesigns of other parts, often several times over.3
On any given design, having systems management undoubtedly costs more than not having systems management, just as buying insurance costs more than not buying insurance. The real question is whether systems management reduced the number of failures sufficiently, so that it counterbalanced the replacement cost. For example, a 50% rate of reliability for a missile system such as Atlas in the late 1950s meant that every other missile failed. With this failure rate, the air force and its contractors could afford to spend up to the cost of an entire second missile in improvements to management processes, if these processes could guarantee success. In other words, at a 50% reliability rate and a cost of $10 million per missile, each successful launch costs $20 million. Thus, if process improvements can guarantee success, then spending $10 million or less per missile in management process improvements is cost-effective.
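The arithmetic in this example can be written out as a short sketch. The function names below are illustrative, and the simple model is an assumption made for the sake of the example: every missile costs the same, each launch succeeds independently with the stated reliability, and the improved process is taken to guarantee success.

```python
def cost_per_success(unit_cost: float, reliability: float) -> float:
    """Expected spending per successful launch.

    Assumes each launch succeeds with probability `reliability`, so on
    average 1 / reliability missiles are consumed per success.
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in the interval (0, 1]")
    return unit_cost / reliability


def max_affordable_overhead(unit_cost: float, reliability: float) -> float:
    """Largest extra spend per missile that still breaks even.

    Assumes (hypothetically) that the extra spending guarantees success,
    so the improved missile costs unit_cost + overhead per success.
    """
    return cost_per_success(unit_cost, reliability) - unit_cost


if __name__ == "__main__":
    unit_cost = 10_000_000  # dollars per missile, the figure used in the text
    print(cost_per_success(unit_cost, 0.5))         # 20000000.0
    print(max_affordable_overhead(unit_cost, 0.5))  # 10000000.0
```

Under these assumptions the break-even overhead equals the price of one additional missile, which is the point the paragraph makes in words.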
In fact, the early Atlas, Titan, and Corporal projects achieved roughly 40-60% reliability. Reliability improvement programs (that is, systems management processes) improved reliability into the 60-80% range during the 1950s and early 1960s and into the 85-95% range thereafter.4 The reliability improvement meant that roughly nine out of ten launches succeeded, instead of one out of two. Therefore, systems management could easily have added more than 50% to each missile’s cost and still been cost-effective. NASA’s efforts to “man-rate” Atlas and Titan could have added 100% to costs for Atlas and Titan and still been cost-effective, because success had to be guaranteed. In fact, considering the potential loss of not only the launchers but also the manned capsules and astronauts, NASA could likely spend 200-500% on launcher improvements and still be cost-effective, considering the low reliability of these vehicles at that time. Pending detailed cost analysis, systems management was probably cost-effective if costs were measured for each successful launch.
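The break-even reasoning here can be made explicit with a rough sketch. It assumes a simplified model that is not in the source: a missile of unit cost c and reliability r costs c / r per successful launch, so added process cost, expressed as a fraction f of unit cost, pays for itself as long as (1 + f) / r_after is no greater than 1 / r_before. The function name is illustrative, and the value of the payload or crew is ignored.

```python
def break_even_overhead(r_before: float, r_after: float) -> float:
    """Largest cost increase (as a fraction of unit cost) that leaves
    the cost per successful launch no higher than before.

    Derived from (1 + f) * c / r_after <= c / r_before,
    which gives f <= r_after / r_before - 1.
    """
    for r in (r_before, r_after):
        if not 0 < r <= 1:
            raise ValueError("reliabilities must be in the interval (0, 1]")
    return r_after / r_before - 1


if __name__ == "__main__":
    # One success in two launches improved to roughly nine in ten:
    # the break-even overhead is about 0.8, or 80% of unit cost.
    print(round(break_even_overhead(0.5, 0.9), 3))
    # Man-rating, where success is treated as guaranteed:
    # the break-even overhead is 1.0, or 100% of unit cost.
    print(break_even_overhead(0.5, 1.0))
```

On this model, moving from 50% to 90% reliability justifies up to about 80% in added cost per missile, consistent with the claim that more than 50% could have been added. The larger 200-500% figures depend on the value of the capsule and crew, which this sketch leaves out.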
Another way to assess systems management is to compare missile and space programs that implemented systems management methods with programs that did not. ELDO provides the most extreme example of little or no systems management. None of its rockets ever succeeded, despite piecemeal introduction of some systems management methods. Comparison of JPL’s Ranger program with the contemporary Mariner program provides another example, because the Mariner design was a modification of the Ranger spacecraft. With less systems management, Ranger’s first six flights failed, whereas Mariner achieved a 50% success rate, with later Mariner spacecraft performing almost perfectly. After JPL strengthened systems management on Ranger, the program’s record was three successes out of four launches.5 Assuming Ranger and Mariner costs per spacecraft were roughly equal, Mariner cost less per successful flight than early Ranger.
Aside from pure cost considerations, failures hurt an organization’s credibility. In the rush to beat the Soviets, early space programs lived the old adage “There is never time to do it right, but there is always time to do it over.” They had many failures, but in the early days executive managers were not terribly concerned. By the early 1960s, however, failures mattered; they led to congressional investigations and ruined careers. Systems management responded to the need for better reliability by trying to make sure that engineers “did it right” so they would not have to “do it over.”
It is no coincidence that engineers developed systems management for missiles and spacecraft that generally cannot be recovered. When each flight test and each failure means the irretrievable loss of the entire vehicle, thorough planning is much more cost-effective than it is for other technologies that can be tested and returned to the designers. This helps to explain why the bureaucratic methods of systems management work well for space systems but seem too expensive for many Earth-bound technologies. For most technologies, building a few prototypes and performing detailed tests with them before manufacturing is feasible and sensible. Lack of coordination and planning (each of which costs a great deal) can be overcome through prototype testing and redesign of the prototype. This option is not available for most space systems, because they never return.
The evidence suggests that systems management was successful in improving reliability sufficiently to cover the cost per successful vehicle. Although systems management methods were not the only factor involved in these improvements, from the standpoint of reliability, they were critical. Process improvements, not technology improvements, ensured the proper connection and integration of thousands of components. Systems management increased vehicle costs on a per vehicle basis compared to previous methods but reduced costs when reliability is factored in.