System Documentation – A Better way
Rahul Sarangdhar
Language Processing Group
Tata Consultancy Services
Hadapsar, Pune
91-20-4042434
rahul.sarangdhar@tcs.com
Nitin Daptardar
Language Processing Group
Tata Consultancy Services
Hadapsar, Pune
91-20-4042428
nitin.d@tcs.com

ABSTRACT
In today’s competitive environment, business requirements change at a rapid pace to meet the demands of the marketplace. Organizations rely increasingly on their underlying software systems to stay ahead of the competition, and these systems must therefore adapt to the new requirements. The software systems embody the essential know-how of an organization’s business processes and rules. Defining requirements afresh is not only a costly exercise, but also risks missing business rules or process information, leading to business losses. Reverse engineering to extract the complete and accurate information embedded in existing software systems is one of the major challenges businesses face. A powerful reverse engineering tool can help achieve this goal to a great extent.
Revine is a reverse engineering tool, developed in-house, that automates much of the reverse engineering documentation process. In this paper, we share our experiences from large reverse engineering projects in TCS that benefited extensively from using Revine. These projects achieved system documentation productivity of up to three times what a fully manual process achieves.

Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement – Documentation, reverse engineering

General Terms
Documentation, Design

Keywords
Reverse Engineering, Information extraction, Tool-based methodology for System documentation, Program understanding

INTRODUCTION

Agility is the buzzword in today’s world. Businesses want to adapt to the changing environment with the least time and effort. Software systems form the backbone of most large organizations, and it is therefore critical that these systems are transformed seamlessly. These software systems contain the important know-how of the organization’s business processes and are perhaps the only entity that holds complete and accurate information on all the critical business functions.

Effecting change in these software systems is a multi-fold process: one has to understand the existing system, determine the scope of the change, assess its impact on the existing system, carry out the change and deploy the new system.

Understanding the existing system is a non-trivial task. Many factors contribute to the difficulty of this understanding, including, but not limited to:

• Size and complexity

• Legacy technology/ programming languages used

• Scarcity of business domain expertise among the IT teams

• Lack of documentation

• Lack of structure due to patches applied

Thus, IT teams face a huge challenge in coping with the demands of the business and responding to changing requirements.

The following sections elaborate various aspects of system understanding and documentation for legacy systems, introduce a tool that facilitates the process, and present a few experiences within TCS that substantiate and illustrate a process which has contributed significantly to large productivity gains. This paper illustrates system documentation for COBOL-based mainframe systems.

PROCESS OF UNDERSTANDING

System documentation has many flavours, and they are primarily driven by the context in which the documentation exercise is undertaken. Documentation exercises are typically initiated for:
• Re-engineering of applications to newer technologies
• Reference for maintenance of the applications
• Replacing certain aspects of the existing system, such as web-enabling the screen components while retaining the business components

The output of this understanding exercise therefore depends upon the context in which the exercise is carried out. Artefacts are created for the following:
1. System documentation for re-engineering, which consists of building use cases for the various components present in the system.
2. System documentation for web-enabling the screens of a legacy application, which requires the edit checks for all screen fields to be documented.
3. System documentation from the maintenance perspective, for which it is necessary to document the cross-references of entities such as programs, copybooks, files and database tables.

All system documentation exercises require the source code of the programs, and associated files/copybooks, and data entity definitions, etc., that comprise the system. This implies that source code investigation and drawing inferences from the source inventory constitute the major activity from the documentation perspective.
The understanding process typically encompasses the following activities:
• Form a higher-level mental model of the system
• Understand the components and their interfaces
• Map the lower-level inventory components, such as programs, copybooks and data entities, to the components identified above
• Understand the functionality in the context of the components, mainly through source code investigation
Robitaille et al. observe: “Understanding source code plays a prominent role during software maintenance and evolution. It is a time consuming activity, especially when dealing with large scale software systems, and can rapidly turn into a major bottleneck for software evolution” [1].
Singer et al. define just-in-time comprehension as the source code exploration activity performed by software engineers that leads to software understanding [2]. Software engineers perform just-in-time comprehension by repeatedly searching for source code artefacts and navigating through their relationships. According to Sim et al., this activity can be split into two navigation styles normally used for information retrieval [3]: browsing, an exploratory and unstructured activity with no specific goal, and searching, a planned activity with a specific goal.
According to [3], browsing of information views aims at exploring high-level elements in the system and searching focuses on retrieving low-level details. Using a combination of these two styles, software engineers can investigate source code and build up a mental model of the system.
While building up the mental model, it is necessary that the software engineer remains focused on the understanding task and does not get distracted; he/she must have easy access to any information needed. For example, while looking at a particular source statement, he/she may want to understand the data structure being referred to in that statement. Many a time, this simple requirement is not as straightforward as it seems. First, the software engineer has to locate the copybook in which the variable is declared. This in itself is a non-trivial task, as COBOL allows a variable with the same name to be declared in multiple records. He/she has to find the variable in all the copybooks, determine the right one and then understand the data structure to which it belongs. He/she can spend anywhere between two and ten minutes just locating the variable, by which time he/she is likely to have lost the context in which the understanding occurred, and would typically need to backtrack and re-synchronize. These repetitive digressions hamper the understanding process, and the person wears out mentally over a period. Imagine a person gaining understanding of a system all day long over a couple of months and you will get the point. This simple example illustrates how critical it is for users to have a powerful browser that allows extremely easy navigation across the various elements and entities in the system.
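To make the cost of such a digression concrete, the sketch below shows the kind of lookup a browser automates. It is not Revine code: the directory layout, the .cpy extension and the simple regular expression for level-number declarations are all assumptions, and real COBOL parsing has to deal with far more (continuation lines, COPY REPLACING, FILLER items, and so on).

import re
from pathlib import Path

# Hypothetical declaration pattern: a two-digit level number followed by a data name.
DECL_PATTERN = re.compile(r"^\s*(\d{2})\s+([A-Z0-9-]+)", re.IGNORECASE)

def find_declarations(copybook_dir, variable):
    """Return (copybook, line number, level) for every declaration of `variable`."""
    hits = []
    for copybook in sorted(Path(copybook_dir).glob("*.cpy")):
        lines = copybook.read_text(errors="ignore").splitlines()
        for lineno, line in enumerate(lines, start=1):
            match = DECL_PATTERN.match(line)
            if match and match.group(2).upper() == variable.upper():
                hits.append((copybook.name, lineno, match.group(1)))
    return hits

if __name__ == "__main__":
    # Assumed directory and variable name, purely for illustration.
    for name, lineno, level in find_declarations("copybooks", "SCR-MSG"):
        print(f"{name}:{lineno}  level {level}  SCR-MSG")

Even this crude search returns every candidate record at once, which is exactly the step an engineer otherwise performs by hand for each variable of interest.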

REVINE

In the preceding sections, we established the need for system documentation and presented its various flavours and aspects. In this section, we present Revine, a reverse engineering tool for understanding legacy systems.
According to Chikofsky and Cross, reverse engineering is “the process of analyzing a subject system to (a) identify the system’s components and their inter-relationships and (b) create representations of a system in another form at a higher level of abstraction” [4]. Revine is a reverse engineering framework aimed at extracting information from existing source code to assist reverse engineering. Revine aids programmers in understanding an existing application by investigating its source code, and thereby:
• Analyzing the system to identify its components and their inter-relationships.
• Understanding the program structure and the flow of data and control within a batch job or online transaction. This allows the formation of a top-down model [5] of the system.
• Analyzing and understanding the program logic in relation to a particular point of interest, such as a problem or a change. This allows the formation of a bottom-up model [6].
• Documenting the relationships (cross-references) between all components within the system using the reports. This also provides a powerful means to document program understanding and program complexity, and allows the formation of a knowledge-based understanding model [7].

While some of the above are provided as direct document outputs, others need to be derived manually using the information provided by Revine.
Revine has two primary components. The first is the documentation component, which generates program specifications and system-level documents containing pictorial call graphs, flow graphs, various cross-references, interface diagrams, and data entities and their relationships. The second is a browser that allows seamless navigation across the various inventory artefacts and also provides contextual information associated with various entities; for example, for a particular data file, the list of all programs accessing it and the mode of access. Similarly, having selected a database table, it lists the programs that access it and the columns accessed by those programs. This information is necessary to correlate the dependencies between programs and gain a cohesive understanding of the system. Figure 1 shows a screenshot of the Revine browser.


Figure 1. Revine Browser Screenshot
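As an illustration of the kind of cross-reference the browser exposes, the following sketch indexes data entities by the programs that access them, together with the access mode. The access records are invented sample data; Revine derives them from parsed source (file assignments, EXEC SQL blocks, CICS commands, etc.), not from hand-written lists.

from collections import defaultdict

# Invented sample access records: (program, data entity, access mode).
accesses = [
    ("PGM001", "CUSTOMER-FILE", "READ"),
    ("PGM002", "CUSTOMER-FILE", "WRITE"),
    ("PGM002", "JOURNAL-TABLE", "UPDATE"),
    ("PGM003", "JOURNAL-TABLE", "READ"),
]

def cross_reference(records):
    """Index each data entity by the programs that access it and the access mode."""
    index = defaultdict(list)
    for program, entity, mode in records:
        index[entity].append((program, mode))
    return index

xref = cross_reference(accesses)
for program, mode in xref["JOURNAL-TABLE"]:
    print(f"JOURNAL-TABLE accessed by {program} ({mode})")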
Revine currently supports several legacy mainframe languages, such as COBOL, PL/1, RPG and Synergy, and databases such as DB2 (relational), IMS (hierarchical) and IDMS (network). Some of the features provided by Revine, grouped by application area, are listed below.

System Documentation
In this area, the following features are useful:
• System-level call graph
• System inventory
• Data files, their layout and access information
• Database entities, their layout and cross-references
• Screens and access information
• Jobs and access information
• Copybooks
• System-level metrics

Program Specification
In this area, the following features are provided:
• Program and job call graph
• Paragraph flow graph
• Metrics
• Copybooks
• Databases and data files referred to, and their access modes
• Screens and their access modes
• Unused and uninitialized variables
• Dead code

To use Revine effectively, one needs to follow a certain process to extract the knowledge. The process includes gathering the inventory using Revine and generating the documents manually or automatically. In the following sections, we present our experiences of using Revine in large system documentation exercises, and also highlight various aspects of the methodology followed by the projects.

CASE STUDIES

1 DOCUMENTATION FOR USE CASES

This case study represents reverse engineering from a re-engineering project’s perspective. The output of this system documentation exercise was a set of use case documents that would form inputs to the forward engineering phase.
The system under study was a large legacy manufacturing process control system developed for a gigantic automobile manufacturing company. It consisted of several mainframe applications developed in the COBOL and PL/1 programming languages with CICS screens. It employed relational as well as hierarchical databases, namely DB2 and IMS, along with data files, to handle the data. The system comprised over 2600 programs with around 3.7 million lines of code (MLOC).
As the application evolved over the years, the maintenance and development teams had patched in new applications that had no architectural commonality. The default approach was simply to modify the current application on the mainframe. Given this environment:
• The system was unable to support new business requirements because of the complexity of the system implementation.
• Release management was difficult because of the monolithic nature of the system and its considerable breadth.
• Maintenance costs associated with the application were very high.
• There was a lack of proper documentation.
The aim was to re-engineer the applications into a component framework, resulting in a common architecture that covered application componentization, data management and a web interface, together with a migration approach. This was intended to resolve the issues noted above, as well as to prepare the system for future enhancement.
Considering the complex nature of the application, it was critical to understand the way the user and/or other applications interact with the system. Therefore, it was decided to prepare use case documents that give information on how the system reacts to a specific input.
The methodology followed by the project to document the above use cases is described in the following sections.
The first phase of the project was the source code inventory collection phase. The source code inventory consists of the program files and all included copybooks. A list of all missing components was generated using Revine. This ensured that all the necessary items were obtained upfront and that there were no discoveries of missing components at a later stage. This is a very important aspect of any reverse engineering exercise: in earlier experiences, there were instances where, due to the lack of an automated process, the identification and organization of the source inventory could not be carried out and a lot of time was wasted at a later stage. It is a tedious job to find the relevant missing components manually without visualizing the call sequences.
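The missing-component check itself amounts to a set difference between the programs referenced by CALL statements (or job steps) and the programs present in the inventory. The sketch below illustrates the idea with invented names; it is not how Revine is implemented, and in practice the call targets are derived from parsed source rather than hard-coded lists.

# Programs for which source was actually supplied.
inventory = {"PGM001", "PGM002", "PGM003"}

# Call targets extracted per program (invented sample data).
call_targets = {
    "PGM001": {"PGM002", "PGM009"},   # PGM009 is called but not in the inventory
    "PGM002": {"PGM003"},
    "PGM003": set(),
}

referenced = set().union(*call_targets.values())
missing = sorted(referenced - inventory)
print("Missing components:", missing)   # ['PGM009']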
In a project of such large dimensions, organizing the huge source inventory and distributing the documentation work among the various documentation teams was a challenging job. Using the call graph structures, it was possible to break up the system into three distinct parts. Further, based on the naming conventions used in the system, it was possible to divide and distribute the applications among the individual team members of the three primary teams. Figure 2 shows one such small cluster in an application.
Figure 2. Cluster in a Call Graph
Job flows were particularly useful for understanding the top-level structure of programs. The JCL (Job Control Language) jobs are those that invoke the COBOL/PL/1 programs and mainframe utilities. It was necessary to understand which top-level programs were present in the system and how those programs accessed the various data files. The JCL call graph generated by Revine gave a clear picture of the calling sequence and showed how the programs were clustered into particular groups. These clusters helped the team focus on programs that belonged to the same cluster.
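Conceptually, splitting a system into parts along the call graph is a connected-components computation over the (undirected) graph of call relationships, as in the sketch below. The edges are invented; in the project the clustering was driven by the Revine call graph together with the system’s naming conventions.

from collections import defaultdict, deque

# Invented call edges: (caller, callee).
edges = [
    ("JOB-A", "PGM001"), ("PGM001", "PGM002"),
    ("JOB-B", "PGM010"), ("PGM010", "PGM011"), ("PGM011", "PGM012"),
]

graph = defaultdict(set)
for caller, callee in edges:
    graph[caller].add(callee)
    graph[callee].add(caller)          # treat edges as undirected for clustering

def clusters(g):
    """Return the connected components of the undirected call graph."""
    seen, components = set(), []
    for start in g:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            queue.extend(g[node] - seen)
        components.append(component)
    return components

for component in clusters(graph):
    print(sorted(component))
# One cluster around JOB-A, a second around JOB-B.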
The application structure view of the system helped in understanding the inter-relationships among the various programs and showed the details of entities such as programs, jobs, copybooks, database tables, screens and queues present in the system. The study of the various cross-references between programs and data entities helped in getting an overview of the system components.
Figure 3. Execution Flow Graph
At program level, it was necessary to understand the internal execution flow within the program. This provided an overview of the program and allowed the team members to identify critical parts of the program that needed to be studied in detail. Figure 3 shows one such execution flow graph for a simple program.
While documenting programs, it is necessary to understand how a program interfaces with various data entities, such as storage files and database tables, and the sequence of execution flow across these entities. This information is necessary to understand how data flows between the various entities, as shown in Figure 4 below.
Figure 4. Data entity Flow Graph

Browsing a cluster of programs to infer their behaviour is the major activity during source code investigation. Revine provides an extremely user-friendly GUI that allowed seamless browsing between entities such as programs, copybooks, variables and paragraphs. It also maintained a history of user actions, so that the user could move backward and forward along the path already traced. It displayed a paragraph-level nested tree outline showing the internal control flow, in execution sequence, within a COBOL program. This was a particularly useful feature, as it provided the user with a higher-level mental model of the program, and the user could browse any particular paragraph simply by clicking on it in the structural browser. This allowed the software engineer to form mental pivot points, which are very useful while understanding a program. One is able to form a detailed understanding by studying the source code in execution sequence. At this point, variable-level analysis information is required. Revine provides quick answers to questions like the following (a simplified sketch of such a query is given after the list):

• “Where is this particular variable used or modified in this program?” In response to this query, one gets all the statements that affect the value of the variable directly or indirectly by a group-level data movement or through a redefined variable access.

• “How is the value being assigned to a variable being used further in the execution of the program?” This shows all the statements participating in the forward data flow from the currently selected statement.
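The forward data-flow query can be pictured as a taint-style walk over the statements that follow the selected one, as in the simplified sketch below. The statement model and variable names are invented, and a straight-line model like this ignores group moves, REDEFINES and control flow, all of which the real analysis has to handle.

# Each statement is modelled as (line, targets, sources),
# e.g. MOVE SCR-AMOUNT TO WS-AMOUNT has target {WS-AMOUNT}, source {SCR-AMOUNT}.
statements = [
    (10, {"WS-AMOUNT"}, {"SCR-AMOUNT"}),
    (20, {"WS-TOTAL"}, {"WS-AMOUNT", "WS-TOTAL"}),
    (30, {"SCR-MSG"}, {"WS-ERROR-TEXT"}),        # unrelated to WS-AMOUNT
    (40, {"OUT-TOTAL"}, {"WS-TOTAL"}),
]

def forward_flow(stmts, start_line, variable):
    """Lines whose computations depend, transitively, on `variable` set at `start_line`."""
    tainted, affected = {variable}, []
    for line, targets, sources in stmts:
        if line <= start_line:
            continue
        if sources & tainted:
            affected.append(line)
            tainted |= targets        # the targets now carry the traced value
    return affected

print(forward_flow(statements, 10, "WS-AMOUNT"))   # [20, 40]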

Having gained an understanding of the programs, the team documented the program functionality manually. This involved documenting the various process flows present in the programs as well as the exception handling paths.
Finally, the information obtained from the steps above is used to form a use case document containing:
• Overview
• Stimulus
• Pre-conditions and post-conditions
• Process flows (primary, alternate and exception)
• Program specification containing:
  o Screen references
  o Call graph
  o Called programs and calling programs
  o Program flow
  o Sequenced processing steps
  o Input and output variables
  o Tables/files and their usage
  o List of error messages, etc.

Before this large project, a small pilot was undertaken in which a team documented two similar small systems independently. Manual productivity was calculated at 600 LOC per person day, while the team using the tool-based methodology obtained a productivity of around 2000 LOC per person day. This methodology was followed for the large project, which reported an actual productivity of 2500 LOC per person day.

2 DOCUMENTATION OF EDIT CHECKS

This case study presents the extraction of edit checks (validations) and the resulting error messages output by a transaction-based journal handling system. Typical operations carried out on a particular journal type include displaying journal information, adding a new journal, and updating or deleting an existing journal. The system was being re-engineered to a new web-enabled J2EE platform. One of the key requirements of the new system was that it replicate the behaviour of the original system and that all the error messages arising from validations of the screen fields be retained in the new system.
The system under study was a large journal processing system covering banking accounts, credit card processing, security handling, etc., and provided services to a large set of customer types. It was a legacy mainframe application written in COBOL, with embedded CICS for online screens and VSAM for file handling. The application had evolved over the years, making it quite unstructured. The relevant part of the system to be studied was around 328,000 expanded lines of code (LOC) spread across 50 programs and 285 copybooks. There were around 85 journal types, and the validations and edit checks applicable to the screen fields of these journal types needed to be documented through source code investigation.
As the system had evolved over the years, the code was poorly structured; one particular program had code for over 35 journal types embedded in a scattered fashion, making the documentation task even more challenging. This program had around 28,000 LOC, while the average size of the remaining programs was around 6,000 LOC. The other programs each contained code for a single journal type. Revine was used as an aid for documentation and review purposes. It was critical to assess the quality of the edit check documentation with respect to completeness and correctness, and hence a comprehensive review process was needed.
Once the source inventory had been obtained and the tool-based analysis repository created, the next step was to schedule the program comprehension activity with a limited number of resources. Scheduling is an important facet of a reverse engineering exercise. Estimating the number of resources required depends largely on how the system is logically clustered and how many software engineers can work in parallel to obtain the maximum throughput.
The Complexity Metrics report, containing cyclomatic complexity and Halstead metrics for the source inventory, was generated using Revine; it aided in analyzing the size and complexity of individual programs. Using these metrics, one could classify the programs into simple, medium and complex categories. The decision table used for COBOL programs is given below.

Table 1. Classification based on Complexity Metrics
| Cyclomatic complexity | Classification                                   |
| < 100                 | Simple program, easy to understand               |
| 100 – 300             | Medium complexity, moderately easy to understand |
| 300 – 600             | Complex, difficult program                       |
| > 600                 | Highly unstructured, extremely complex           |
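Table 1 translates directly into a small classification helper, which is handy when assigning programs to team members. The thresholds are those of the table; the program names and complexity values below are invented sample data.

def classify(cyclomatic_complexity):
    """Map a cyclomatic complexity value to the category of Table 1."""
    if cyclomatic_complexity < 100:
        return "Simple"
    if cyclomatic_complexity <= 300:
        return "Medium"
    if cyclomatic_complexity <= 600:
        return "Complex"
    return "Highly unstructured"

# Invented sample programs and their measured complexities.
programs = {"JRNL01": 620, "JRNL02": 240, "JRNL03": 75}
for name, complexity in sorted(programs.items()):
    print(f"{name}: cyclomatic complexity {complexity} -> {classify(complexity)}")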

Using the call graph, the various structural clusters of programs were identified. This information was useful in splitting the system into logical clusters, scheduling the resources and drawing up the documentation plan. More experienced software engineers were assigned to the more complex or critical clusters, while simpler clusters were distributed among less experienced team members.

Browsing the programs to understand the detailed functionality is a very important activity. The powerful navigation capabilities of Revine allowed painless browsing between various entities like programs, copybooks, variables, and paragraphs. Team members were able to form detailed understanding by studying the source code in execution sequence. The following figure illustrates the use of the browser for understanding edit checks.


Figure 5. Use of Browser for Edit check understanding

Extraction of edit checks from the system consisted of identifying the error messages that are output as a result of an edit check and working backwards to identify the conditions that govern that particular edit check. Edit checks can depend on multiple conditions, and all the conditions need to be documented. Consider the following example:

0314 IF WS-DEBIT-ACCT AND WS-CREDIT-ACCT
0315 MOVE 'CANNOT PERFORM CREDIT AS WELL AS DEBIT '
0316 TO SCR-MSG
0317 MOVE -1 TO SCR-DB-BRANCH-L . . .
0326 GO TO 13000-SEND-DATAONLY
0327 ELSE
0328 IF AJS-CREDIT-ACCT AND BRANCH-USER
0329 MOVE 'BRANCH USER CANNOT PERFORM CREDIT '
0330 TO SCR-MSG
0331 MOVE -1 TO SCR-DB-BRANCH-L . . .
0341 GO TO 13000-SEND-DATAONLY
0342 ELSE . . .
0370 GO TO 14000-SEND-CR-DB-TXN.

In the original listing, highlighting distinguished the edit check conditions (validations), the error paths and the success path.

This code snippet represents one set of edit checks pertinent to a security journal; the checks are listed below:
• Credit and debit operations cannot be performed in a single journal transaction
• A credit operation cannot be performed by a branch user
• Debit operations can be performed by any user type
These checks are performed in sequence and are spaced apart in the code, and they are difficult to understand if one cannot study all the control paths of the edit check. It is necessary to determine these paths and document the conditions along all of them. Revine provides a slicing feature that shows all the paths leading to the execution of the current statement.
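In essence, the slicing feature answers a reachability question over the control-flow graph: which conditions lie on some path to the statement of interest. The sketch below shows a coarse version of that backward walk on a miniature graph modelled loosely on the snippet above; the node names are invented, and Revine’s slicing is considerably more precise than this.

from collections import defaultdict

# Invented miniature control-flow graph: node -> successor nodes.
cfg = {
    "ENTRY": ["IF-DEBIT-AND-CREDIT"],
    "IF-DEBIT-AND-CREDIT": ["ERR-CREDIT-AND-DEBIT", "IF-CREDIT-BRANCH-USER"],
    "IF-CREDIT-BRANCH-USER": ["ERR-BRANCH-CREDIT", "SEND-CR-DB-TXN"],
    "ERR-CREDIT-AND-DEBIT": [], "ERR-BRANCH-CREDIT": [], "SEND-CR-DB-TXN": [],
}

def nodes_reaching(graph, target):
    """Every node from which `target` can be reached (a coarse backward slice)."""
    reverse = defaultdict(set)
    for node, successors in graph.items():
        for successor in successors:
            reverse[successor].add(node)
    result, stack = set(), [target]
    while stack:
        node = stack.pop()
        for predecessor in reverse[node] - result:
            result.add(predecessor)
            stack.append(predecessor)
    return result

print(sorted(nodes_reaching(cfg, "ERR-BRANCH-CREDIT")))
# ['ENTRY', 'IF-CREDIT-BRANCH-USER', 'IF-DEBIT-AND-CREDIT']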
While documenting the edit checks, it was necessary to understand them in execution sequence; however, since understanding the code is a largely manual activity, it is prone to errors. Certain edit checks may be missed altogether, while others may be documented incorrectly. Only reviews can identify such discrepancies, so a mechanism that guaranteed completeness and correctness was needed. While investigating the system, a typical pattern was identified in the programs: all error messages resulting from edit check validations were output to the screen using a single variable, and all the messages used in a program were defined in a single record. With this pattern, it was easy to establish a completeness measure and verify the results for completeness.

For example, as illustrated in Table 2, for one particular journal program, 57 error messages were defined, of which five were unused in the program, meaning that 52 error messages could be displayed as a result of edit check validations. This formed the lower bound on the number of edit checks to be documented. When the actual use points of the messages were counted, it was found that messages were displayed at 60 places, meaning that some messages were shared, accounting for eight additional use points. Thus, the upper bound was 60 edit checks. Upon actual verification, it could then easily be determined whether all the edit checks had been documented, ensuring completeness.

Table 2. Completeness Review
|   | Factor                                  | Value |
| A | Total number of error messages defined  | 57    |
| B | Number of unused messages               | 5     |
| C | Number of messages displayed            | 60    |
| D | Number of duplicate messages            | 4     |
| E | Lower bound (A – B)                     | 52    |
| F | Upper bound (C)                         | 60    |
| G | Number of edit checks required (C – D)  | 56    |
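The bounds in Table 2 follow from simple arithmetic over the message counts, as the sketch below shows using the figures reported in the table.

# Figures taken from Table 2.
defined_messages = 57
unused_messages = 5
display_points = 60
duplicate_messages = 4

lower_bound = defined_messages - unused_messages            # 52 distinct messages actually shown
upper_bound = display_points                                # 60 places where a message is displayed
required_edit_checks = display_points - duplicate_messages  # 56 checks to document

print(lower_bound, upper_bound, required_edit_checks)       # 52 60 56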

Correctness, on the other hand, needed a detailed review; commonly occurring fault patterns were detected and the documentation process was refined to reduce that kind of error. For example, the condition involved in an edit check was sometimes documented in the reverse sense: instead of documenting that “Debit IRA account not allowed for BOSS users”, it was documented as “Debit IRA account allowed for BOSS users”. Such errors could only be caught by rigorous manual reviews.
The documentation team achieved an end-to-end productivity of around 1300 LOC per person day and delivered high-quality edit check documentation without any schedule or effort slippage.

CONCLUSION

In this paper, we started by examining various aspects of, and the need for, reverse engineering. We described the process of system documentation for large systems and the pain areas associated with it. We presented various useful features of Revine, a reverse engineering tool for legacy applications developed in-house. Further, we shared our experiences from a few reverse engineering projects and put forward some of the finer aspects of reverse engineering methodology. We also cited some of the key benefits of using tools for reverse engineering and highlighted the productivity improvements based on real-life experience.

ACKNOWLEDGMENTS

We would like to thank Mr. Ram Godbole, Mr. Arunashish Majumdar and Mr. Mandar Bhatavdekar, who provided a number of valuable suggestions, which helped improve this paper. We would also like to thank the members of the Revine team who shared their valuable experiences with us.

REFERENCES

[1] Robitaille, S., Schauer, R., and Keller, R. K. Bridging program comprehension tools by design navigation. In Proceedings of the International Conference on Software Maintenance (ICSM 2000), pages 22-32, San Jose, California, 2000.
[2] Singer, J., Lethbridge, T., Vinson, N., and Anquetil, N. An examination of software engineering work practices. In Proceedings of CASCON ’97, pages 209-223, Toronto, ON, Canada, 1997.
[3] Sim, S. E., Clarke, C. L. A., Holt, R. C., and Cox, A. M. Browsing and searching software architectures. In Proceedings of the International Conference on Software Maintenance (ICSM ’99), pages 381-390, Oxford, England, August 1999.
[4] Chikofsky, E. J. and Cross II, J. H. Reverse engineering and design recovery: a taxonomy. IEEE Software 7(1), pages 13-17, January 1990.
[5] Brooks, R. Towards a theory of the comprehension of computer programs. International Journal of Man-Machine Studies 18, pages 543-554, 1983.
[6] Shneiderman, B. Software Psychology: Human Factors in Computer and Information Systems. Winthrop Publishers, 1980.
[7] Letovsky, S. Cognitive processes in program comprehension. In Empirical Studies of Programmers, pages 59-79, 1986.
[8] Soloway, E., Pinto, J., Letovsky, S., Littman, D., and Lampert, R. Designing documentation to compensate for delocalized plans. Communications of the ACM 31(11), pages 1259-1267, 1988.
[9] von Mayrhauser, A. and Vans, A. M. Program comprehension during software maintenance and evolution. IEEE Computer 28(8), pages 44-55, 1995.
[10] von Mayrhauser, A. and Vans, A. M. Comprehension processes during large-scale maintenance. In Proceedings of the International Conference on Software Engineering (ICSE ’94), pages 39-48, 1994.

