Reverse engineering a mainframe application with over 1.5 million lines of code (LOC) spread across multiple components is no small feat. This blog delves into a sophisticated approach to tackling this challenge, aiming to extract critical artifacts such as Business Requirement Documents (BRDs), Technical Specification Documents (TSDs), detailed application and program-level call graphs, and dependency graphs for both batch and online systems.
We leveraged Large Language Models (LLMs) such as Gemini 1.5 Flash and GPT-4o, along with an on-premise solution built on Llama models. This blog details the methodology, challenges, and results achieved in keeping client data secure while enabling comprehensive reverse engineering.
1. Introduction
Mainframe systems are the backbone of many enterprise environments, hosting critical applications with complex inter-dependencies. Reverse engineering these systems to extract meaningful insights, such as BRDs and TSDs, is an essential but challenging task.
Traditional methods are time-intensive and error-prone. With advancements in AI, particularly LLMs, automated reverse engineering has become a reality, enabling:
- Automated extraction of technical and business-level documentation.
- Visual representations of call and dependency graphs.
- Insights into programmatic inter-dependencies across the application landscape.
2. Application Overview
The mainframe application consists of:
- Batch and CICS Programs: Core components of transaction and batch processing.
- Job Control Language (JCLs) and Procedures (Procs): Scripts defining job workflows.
- CSD Dumps: Configuration details for CICS.
- Flat and VSAM Files: Data storage mechanisms.
- Copybooks: Reusable code templates.
- DB2: Relational database back-end.
- Control-M Job Scheduler: Batch job orchestration.
Reverse engineering requires parsing, analyzing, and correlating these heterogeneous components to produce accurate and meaningful documentation.
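As a small illustration of what correlating these heterogeneous components can look like in practice, the sketch below scans COBOL sources for static CALL statements and JCL members for EXEC PGM= steps to build a rough caller-to-callee map. The directory name, file extensions, and regular expressions are assumptions for illustration only; dynamic calls through working-storage variables, Procs, and Control-M definitions would need additional handling.

```python
import re
from collections import defaultdict
from pathlib import Path

# Static CALL statements in COBOL, e.g. CALL 'SUBPROG' USING ...
COBOL_CALL = re.compile(r"\bCALL\s+'([A-Z0-9#@$-]+)'", re.IGNORECASE)
# Batch job steps in JCL, e.g. //STEP010 EXEC PGM=MYPROG
JCL_EXEC = re.compile(r"\bEXEC\s+PGM=([A-Z0-9#@$]+)", re.IGNORECASE)

def build_call_map(source_dir: str) -> dict:
    """Map each artifact (program or JCL member) to the programs it invokes."""
    calls = defaultdict(set)
    for path in Path(source_dir).rglob("*"):
        if path.suffix.lower() not in {".cbl", ".cob", ".jcl"}:
            continue
        text = path.read_text(errors="ignore")
        pattern = JCL_EXEC if path.suffix.lower() == ".jcl" else COBOL_CALL
        for callee in pattern.findall(text):
            calls[path.stem.upper()].add(callee.upper())
    return calls

if __name__ == "__main__":
    # "mainframe_src" is an assumed local folder containing the extracted artifacts.
    for caller, callees in build_call_map("mainframe_src").items():
        print(f"{caller} -> {sorted(callees)}")
```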
3. Challenges
- Complexity of the Mainframe Code Base:
  - Large-scale code base with intricate inter-dependencies.
  - Variety of programming languages and paradigms.
- Security Concerns:
  - Client restrictions against exposing code to cloud environments.
  - Need for strict data privacy and compliance.
- Model Performance:
  - Adapting LLMs for domain-specific mainframe terminology.
  - Fine-tuning models to maintain accuracy across various components.
4. Solution Overview
4.1. Cloud-based LLMs
We experimented with Gemini 1.5 Flash and GPT-4o, processing mainframe code through their APIs. These models facilitated (a minimal sketch of this chunk-and-consolidate flow follows the list):
- Chunking programs exceeding token limits to ensure comprehensive processing.
- Generating individual summaries for Business Requirements and Technical Specifications at the program level.
- Consolidating these summaries to produce final BRDs, TSDs, and call graphs.
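The sketch below outlines how such a chunk-and-consolidate flow can be wired together. The `chunk_by_tokens` heuristic and the `call_llm` callable are hypothetical stand-ins for whichever tokenizer and LLM API (Gemini 1.5 Flash or GPT-4o) is in use, and the prompts are illustrative, not the actual prompts we used.

```python
def chunk_by_tokens(text: str, max_tokens: int = 6000, chars_per_token: int = 4):
    """Approximate token-bounded chunking by character count, splitting on line breaks."""
    max_chars = max_tokens * chars_per_token
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if size + len(line) > max_chars and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

def summarize_program(source: str, call_llm) -> str:
    """Map step: summarize each chunk; reduce step: consolidate into one summary."""
    partial = [
        call_llm("Summarize the business rules and technical logic in this "
                 "COBOL fragment:\n\n" + chunk)
        for chunk in chunk_by_tokens(source)
    ]
    return call_llm("Consolidate these partial summaries into a single "
                    "program-level summary:\n\n" + "\n\n".join(partial))
```

The same program-level summaries are then consolidated once more at the application level to produce the final BRDs, TSDs, and call graphs.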
The individual summaries at the program level enhanced the review process, enabling quick insights into specific mainframe artifacts. Additionally, we incorporated a dynamic prompting feature in the front-end interface, allowing users to provide specific context to the LLM for tailored results. User feedback was actively incorporated to iteratively refine prompts and improve document quality until it met expectations.
We also designed a user-friendly front-end interface with the following capabilities (a minimal API sketch follows the list):
- Uploading all mainframe artifacts.
- Dynamic prompting for flexible input.
- Downloading and storing the generated documents at desired locations.
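A minimal sketch of such an interface is shown below, using FastAPI purely as an example stack; the blog does not specify the actual front-end technology, and `generate_documents` is a hypothetical stand-in for the LLM pipeline.

```python
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import PlainTextResponse

app = FastAPI()

def generate_documents(source: str, prompt: str) -> str:
    """Hypothetical stand-in: chunk the artifact, call the LLM, and return a BRD/TSD."""
    return f"# Generated document\n\nUser context: {prompt}\n\n(pipeline output here)"

@app.post("/generate", response_class=PlainTextResponse)
async def generate(artifact: UploadFile = File(...), prompt: str = Form("")):
    # The uploaded mainframe artifact plus the user's dynamic prompt drive generation;
    # the plain-text response can be saved wherever the reviewer chooses.
    source = (await artifact.read()).decode(errors="ignore")
    return generate_documents(source, prompt)
```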
4.2. On-Premise LLMs
Due to the client's data-security concerns, we transitioned to an on-premise solution. Key highlights include:
- Model Selection: Llama models with 70 billion parameters.
- Model Fine-tuning: To enhance the LLMs' understanding of mainframe code and improve the accuracy of the generated documentation, we performed fine-tuning using a curated dataset (an illustrative training sketch follows this list). This involved:
  - Dataset Preparation: We compiled a dataset of mainframe code samples and corresponding documentation, ensuring diversity and relevance to the client's specific environment.
  - Fine-tuning Process: We fine-tuned the selected LLMs on this dataset, adjusting their parameters to better capture the nuances of mainframe code and generate more accurate documentation.
  - Evaluation and Iteration: We evaluated the performance of the fine-tuned models and iteratively refined the fine-tuning process to achieve optimal results.
- Hardware Configuration for the on-premise Llama model: 8 NVIDIA A100 GPUs (80 GB each) for high-speed inference and training, 512 GB of RAM, and an 8 TB SSD for high-speed data access.
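The exact training stack is not published; the sketch below shows one common way such parameter-efficient fine-tuning can be set up, using Hugging Face transformers, datasets, and peft with LoRA adapters. The base checkpoint name, hyperparameters, dataset path, and prompt format are assumptions for illustration, not the client's actual configuration.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")

# Low-rank adapters keep the trainable parameter count manageable for a 70B model.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

def to_features(example):
    # Assumed JSONL schema: {"code": <mainframe source>, "documentation": <target text>}
    text = (f"### Mainframe source:\n{example['code']}\n\n"
            f"### Documentation:\n{example['documentation']}")
    return tokenizer(text, truncation=True, max_length=4096)

dataset = (load_dataset("json", data_files="mainframe_pairs.jsonl")["train"]
           .map(to_features, remove_columns=["code", "documentation"]))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-mainframe-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=2, learning_rate=2e-4, bf16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```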
4.3. Deployment and Integration in the On-Premise Environment
Once the build pipeline and fine-tuning process were finalized, we deployed the solution within the client's on-premises environment. This involved:
- Infrastructure Setup: We collaborated with the client's IT team to set up the necessary hardware and software infrastructure, including powerful GPUs and optimized network configurations.
- Solution Deployment: We deployed the solution components, including the LLM engine, data processing modules, and user interface, within the client's secure environment (an illustrative serving sketch follows this list).
- Integration: We integrated the solution with the client's existing tools and workflows, ensuring seamless access and utilization of the generated documentation.
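For the LLM engine itself, one way to serve a 70B model on an 8-GPU A100 node like the one described above is to shard it across GPUs with tensor parallelism. The sketch below uses vLLM purely as an example; the blog does not name the serving stack actually deployed, and the checkpoint name and prompt are assumptions.

```python
from vllm import LLM, SamplingParams

# Shard the model across the node's 8 GPUs (one shard per A100).
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",  # assumed checkpoint
          tensor_parallel_size=8)

prompts = ["Summarize the business rules in this COBOL paragraph:\n..."]
for output in llm.generate(prompts, SamplingParams(temperature=0.1, max_tokens=1024)):
    print(output.outputs[0].text)
```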
4.4. Validation and Refinement
Throughout the development and deployment process, we conducted rigorous validation and refinement to ensure the quality and accuracy of the generated documentation. This involved:
- Expert Review: We engaged experienced mainframe developers and subject-matter experts (SMEs) to review the generated documentation and provide feedback on its accuracy, completeness, and clarity.
- Iterative Feedback: We incorporated the feedback from the SMEs and iteratively refined the solution, adjusting the LLMs, prompts, and templates to improve the documentation quality.
- Continuous Monitoring: We established a process for continuous monitoring and evaluation of the solution's performance, ensuring its ongoing effectiveness and addressing any emerging issues.
5. Methodology
5.1. Prompt Engineering
Effective prompt engineering ensured precise outputs, including context-aware parsing of code components and targeted generation of documentation and graphs according to the template design.
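The actual prompts are not published; the template below is an illustrative example of what a context-aware, template-driven prompt for TSD generation could look like, with a `user_context` slot mirroring the dynamic-prompting feature described earlier. The headings and placeholders are assumptions.

```python
TSD_PROMPT = """You are a mainframe modernization analyst.
Analyze the following {artifact_type} source and produce a Technical
Specification Document section with exactly these headings:
1. Purpose
2. Inputs and Outputs (files, DB2 tables, copybooks)
3. Processing Logic (step by step)
4. Called Programs and External Dependencies
5. Error Handling

Additional reviewer context: {user_context}

Source:
{source_code}
"""

def build_prompt(artifact_type: str, source_code: str, user_context: str = "none") -> str:
    """Fill the fixed template with the artifact type, its source, and optional user context."""
    return TSD_PROMPT.format(artifact_type=artifact_type,
                             source_code=source_code,
                             user_context=user_context)
```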
5.2. Pipeline Build
We constructed a robust build pipeline to automate documentation generation from the prepared mainframe code base (a consolidated code sketch follows the list). The pipeline consisted of the following stages:
- Code Analysis: The extracted code chunks were fed into the LLM engine. The LLMs, fine-tuned on mainframe-specific data, analyzed the code to understand its structure, logic, and dependencies.
- Feature Extraction: The LLMs extracted relevant features from the code, such as program names, data structures, function calls, business logic, and control flow.
- Document Generation: Based on the extracted features and predefined templates, the LLMs generated various types of documentation, including BRDs, technical specifications, call graphs, and dependency graphs.
- Output Formatting: The generated documentation was formatted and organized into human-readable formats using Python for easy review and consumption.
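The sketch below ties these stages together for a batch of COBOL programs. `call_llm` is again a hypothetical stand-in for the fine-tuned model's API, and the JSON feature schema, file layout, and output naming are assumptions for illustration.

```python
import json
from pathlib import Path

def run_pipeline(source_dir: str, out_dir: str, call_llm) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    call_graph_edges = []
    for path in sorted(Path(source_dir).rglob("*.cbl")):
        source = path.read_text(errors="ignore")
        # Code analysis + feature extraction: ask the model for structured facts as JSON.
        features = json.loads(call_llm(
            "Return JSON with keys program, files, db2_tables, called_programs, "
            "business_rules for this COBOL program:\n\n" + source))
        call_graph_edges += [(features["program"], callee)
                             for callee in features["called_programs"]]
        # Document generation from the extracted features and a fixed template.
        tsd = call_llm("Write a Technical Specification Document from these "
                       "extracted features:\n" + json.dumps(features, indent=2))
        # Output formatting: one Markdown file per program for easy review.
        (Path(out_dir) / f"{features['program']}_TSD.md").write_text(tsd)
    # Dependency edges can then be rendered with Graphviz or similar tooling.
    (Path(out_dir) / "call_graph.csv").write_text(
        "\n".join(f"{a},{b}" for a, b in call_graph_edges))
```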
6. Results and Achievements
- Documentation:
  - High-quality BRDs and TSDs generated in a fraction of a minute, with humans in the loop.
- Graph Visualization:
  - Comprehensive call and dependency graphs.
  - Improved understanding of inter-dependencies.
- Security:
  - Complete client satisfaction with the on-premise solution.
7. Market Potential
The global mainframe modernization market is projected to grow significantly, with reverse engineering services anticipated to capture a substantial share. Enterprises seeking to migrate to modern platforms represent a key demand driver. This positions reverse engineering solutions as critical for:
- Streamlining legacy application understanding through generated reverse-engineering documents.
- Easier analysis of mainframe applications.
- Reducing modernization project costs by giving forward-engineering teams the documents they need.
8. Key Learnings
8.1. Using Gemini 1.5 Flash
Pros:
- Speed & Optimization:
  - Typically optimized for real-time, high-speed processing.
  - Faster inference times than GPT-4o, which can be crucial for large-scale document generation.
- Cost-Effective:
  - Likely more cost-efficient, especially if it offers better enterprise-use licensing options.
  - Efficient token utilization can reduce API costs if hosted on similar computing resources.
- Domain-Specific Fine-Tuning:
  - May offer better fine-tuning capabilities for technical or domain-specific tasks like COBOL documentation.
  - Could integrate with custom vocabularies, useful for legacy code.
- Custom Integrations:
  - Strong support for embedding into enterprise-grade pipelines and workflow documentation.
Cons:
- Model Maturity:
  - Fewer community resources and pre-trained datasets compared to GPT-4o.
- Language Comprehension:
  - Might fall short of GPT-4o in nuanced or complex natural language generation tasks.
  - Risk of producing less coherent or overly verbose explanations.
- Ecosystem:
  - Limited ecosystem support (plugins, frameworks) compared to the mature OpenAI ecosystem.
8.2. Using GPT-4o
Pros:
- State-of-the-Art Language Understanding:
  - Exceptional natural language understanding and generation capabilities.
  - Handles complex, context-heavy tasks effectively, including COBOL code documentation and BRD generation.
- Community & Ecosystem:
  - Supported by a vast ecosystem (OpenAI APIs, plugins, integrations).
  - Rich community resources and pre-built solutions to accelerate development.
- Flexibility:
  - Supports fine-tuning for niche use cases.
  - Can integrate with Azure's ecosystem seamlessly (if you're using Azure App Services).
- Consistency in Output:
  - High reliability for producing structured and context-aware content.
  - Ideal for workflows demanding high accuracy and minimal revisions.
Cons:
- Cost:
  - Typically more expensive than competing models, especially for large-scale document generation tasks.
  - Higher token costs could impact the budget for processing long COBOL files.
- Inference Time:
  - Slower than models like Gemini 1.5 Flash, especially for multi-step workflows.
- Customization:
  - Less flexibility in adding domain-specific vocabularies or bespoke features compared to models designed for specific enterprise tasks.
8.3. Llama 3.2 On-Premise
Pros:
- Cost-Effective: Potentially lower cost than cloud LLMs if you have existing on-premise infrastructure, especially for long-term projects.
- Data Security: Full control over your data, eliminating concerns about exposing sensitive code to external parties.
- Customization: Greater flexibility to fine-tune the model specifically for your mainframe environment and COBOL code.
- Integration: Can be tightly integrated with your existing mainframe workflows and tools.
Cons:
- Hardware Requirements: Significant upfront investment in powerful GPUs and servers to handle the 70B parameter model.
- Maintenance: Requires ongoing maintenance and updates of the on-premise infrastructure and software.
- Expertise: Specialized AI/ML expertise is needed to set up, fine-tune, and manage the model effectively.
- Scalability: Scaling resources might be less flexible compared to the cloud.
9. Conclusion
The successful reverse engineering of a 1.5 million LOC mainframe application demonstrates the transformative potential of LLMs. By adapting to client-specific security requirements, this solution provides a scalable and secure approach to automating the generation of documentation and call-graph visualizations, ultimately reducing cost and improving accuracy. Furthermore, the growing market potential underscores the relevance and scalability of such solutions.
To learn more about the transformative power of LLMs and Generative AI, visit Quasar AI.