Quality Engineering led DORA Operational Resilience Testing

Quick Glance.

The Digital Operational Resilience Act (DORA) mandates that financial institutions in the EU regularly test their operational resilience to handle ICT-related incidents, ensuring they can withstand and recover from disruptions. By leveraging AI and machine learning, financial entities can enhance their resilience testing and improve their preparedness against cyber threats and operational disruptions. This blog provides a guide to DORA compliance, covering key testing areas such as vulnerability assessments, capacity testing, network resilience, and more. It also shares practical insights to help readers improve their organization’s resilience and meet regulatory requirements effectively.

The Digital Operational Resilience Act (DORA) is an EU regulation that requires financial entities to test their operational resilience regularly in line with regulatory requirements. The goal of DORA's operational resilience testing is to ensure that financial institutions are prepared to handle ICT-related incidents and to identify any weaknesses in their digital resilience. 

DORA's operational resilience testing program includes a variety of tests, such as: 

  • Vulnerability assessments and Penetration Testing: Identify and classify vulnerabilities in a company's information systems 
  • Network security assessments: Check the robustness of a company's lines of defence 
  • ICT Systems Capacity Testing: Planning and implementing appropriate capacity to handle stress scenarios.
  • Source Code Reviews: Assess the security of applications produced or acquired by financial entities 
  • Scenario-based tests: Test a company's resilience against higher-level risks, such as
    • Data Centre and Cloud Services Resilience.
    • Third-party Service Resilience.

DORA also requires some financial entities to undergo threat-led penetration testing (TLPT) to test their resilience against higher-level risks, such as ransomware attacks.

1. Types of tests recommended for DORA compliance

1.1 Vulnerability Assessment & Penetration Testing

  • Validate static and dynamic security vulnerabilities. It is pertinent to perform SAST at the build stage.
  • Evaluate opportunities for automated DAST before exploiting vulnerabilities as part of penetration testing.
  • Once penetration testers have uncovered significant vulnerabilities in the system, stringent vulnerability management should be followed to address and patch them. This involves containing the issue, eliminating the threat, and recovering from the vulnerability to prevent similar problems in the future.

Detailed reports should be prepared on the results of testing, including any vulnerabilities discovered and recommendations for handling them. Key decision-makers can use these documents for both short-term incident response and long-term strategic planning for DORA compliance.

1.2 ICT Systems Capacity Testing

  • Assess the ability of ICT systems to handle increased transaction volumes, data processing, and user loads.
  • Evaluate system performance under normal and peak conditions and identify the scaling factors across components. Note that resilience needs to be built into every infrastructure and software component so that operations remain seamless even in stress simulations.
  • Identify potential bottlenecks and capacity limits and triage analysis findings with all ICT stakeholders (incl. third-party providers).

Coforge recommends using appropriate extrapolation techniques where a production-like environment is not available:

  • Use the linear extrapolation technique for Throughput, Hits per second, TPS, Java Heap Size etc.
  • Use S-curves or the mixed-mode technique for Response Time, Latency, CPU Utilization and Memory Utilization.
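As a rough illustration of the two extrapolation styles above, the sketch below projects throughput with a straight line through two measured points and projects CPU utilisation with a logistic (S-shaped) curve. The function names, the sample measurements and the curve parameters (`floor`, `ceiling`, `midpoint`, `steepness`) are assumptions made for illustration; in practice the curve parameters would be fitted to measurements from the scaled-down environment.

```python
import math


def linear_extrapolate(x1, y1, x2, y2, x_target):
    """Extend the straight line through two measured points to x_target."""
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (x_target - x1)


def s_curve(x, floor, ceiling, midpoint, steepness):
    """Logistic curve: flat at low load, steepest at midpoint, saturating at ceiling."""
    return floor + (ceiling - floor) / (1.0 + math.exp(-steepness * (x - midpoint)))


# Hypothetical measurements from a scaled-down test environment:
# 100 users -> 250 TPS, 200 users -> 500 TPS.
tps_at_target = linear_extrapolate(100, 250.0, 200, 500.0, 800)

# Hypothetical curve parameters; in practice these would be fitted to data.
cpu_at_target = s_curve(800, floor=5.0, ceiling=100.0, midpoint=600, steepness=0.01)

print(f"Projected TPS at 800 users: {tps_at_target:.0f}")
print(f"Projected CPU utilisation at 800 users: {cpu_at_target:.1f}%")
```

The straight line suits metrics that grow roughly in proportion to load, while the S-curve reflects metrics that saturate as a resource approaches its limit.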

1.3 Network Infrastructure Resilience

  • Test the resilience of network connections, including redundancy and failover mechanisms. This applies to planned, unplanned and disaster recovery scenarios.
  • Evaluate the impact of network disruptions on critical business processes and record potential impacts and workarounds (automated or manually triggered).
  • Assess the effectiveness of backup communication channels.

The STRIDE methodology is recommended for threat modelling. It helps to analyze systems and networks and to classify threats in a prioritized list, based on the likelihood of them occurring and the scale of their potential impact.
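A minimal sketch of such a prioritized list follows. The six STRIDE categories are standard, but the 1 to 5 scales, the likelihood-times-impact score and the sample threats are assumptions chosen for illustration, not a scoring scheme prescribed by DORA or by the STRIDE methodology itself.

```python
from dataclasses import dataclass

# The six STRIDE threat categories.
STRIDE_CATEGORIES = (
    "Spoofing",
    "Tampering",
    "Repudiation",
    "Information disclosure",
    "Denial of service",
    "Elevation of privilege",
)


@dataclass(frozen=True)
class Threat:
    description: str
    category: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def risk_score(self) -> int:
        # Assumption: risk is scored as likelihood multiplied by impact.
        return self.likelihood * self.impact


def prioritise(threats):
    """Return threats sorted from highest to lowest risk score."""
    for threat in threats:
        if threat.category not in STRIDE_CATEGORIES:
            raise ValueError(f"Unknown STRIDE category: {threat.category}")
    return sorted(threats, key=lambda t: t.risk_score, reverse=True)


# Hypothetical threats for a payment gateway network segment.
threats = [
    Threat("Forged session token on API gateway", "Spoofing", 3, 4),
    Threat("Volumetric attack on primary network link", "Denial of service", 4, 5),
    Threat("Missing audit trail for config changes", "Repudiation", 2, 3),
]

for threat in prioritise(threats):
    print(f"{threat.risk_score:>2}  {threat.category:<22}  {threat.description}")
```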

1.4 Data Center and Cloud Services Resilience

  • Conduct failover tests for primary and secondary data centers.
  • Evaluate the resilience of cloud-based services and their ability to maintain operations during outages.
  • Test data replication and recovery processes.

Unless detailed performance validation is carried out for cloud-based data centres, Kubernetes autoscaling may cause issues at the cluster layer (Cluster Autoscaler) or at the pod layer (HPA or VPA). These could affect HPA/CA reaction time, node provisioning time or pod creation time. Timely testing can identify issues with autoscaling behaviour and configuration, and can also help optimize scaling time and cost.
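The timings mentioned above can be combined into a simple scale-out budget check, sketched below. The replica calculation follows the basic rule given in the Kubernetes documentation for the Horizontal Pod Autoscaler (ceiling of current replicas multiplied by the ratio of current to target metric); the sketch ignores the HPA's tolerance and stabilization settings. The timing values, the `BUDGET_S` threshold and the helper names are hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    """Basic HPA scaling rule: ceil(current_replicas * current_metric / target_metric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def scale_out_seconds(hpa_reaction_s, pod_creation_s, node_provisioning_s=0.0):
    """Time until new pods can serve traffic. Node provisioning applies only
    when the Cluster Autoscaler has to add a node first."""
    return hpa_reaction_s + node_provisioning_s + pod_creation_s


# Hypothetical values captured during a load test (metric in % CPU, times in seconds).
replicas = hpa_desired_replicas(current_replicas=4, current_metric=90, target_metric=60)
with_spare_capacity = scale_out_seconds(hpa_reaction_s=30, pod_creation_s=20)
with_new_node = scale_out_seconds(hpa_reaction_s=30, pod_creation_s=20,
                                  node_provisioning_s=120)

BUDGET_S = 90  # assumed maximum acceptable scale-out time

print(f"Desired replicas: {replicas}")
print(f"Without new node: {with_spare_capacity:.0f}s, "
      f"within budget: {with_spare_capacity <= BUDGET_S}")
print(f"With new node:    {with_new_node:.0f}s, "
      f"within budget: {with_new_node <= BUDGET_S}")
```

Separating the node-provisioning term makes it visible when a scale-out only meets its budget if spare node capacity already exists.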

1.5 Application and Service Availability

  • Perform stress tests on critical applications and services under simulated impacts and scenarios.
  • Assess the impact of application failures on interconnected systems.
  • Benchmark service recovery times and procedures. Report any aberrations against existing baseline monitors.

It is also recommended that service resiliency and observability controls are validated under simulated chaos: chaos engineering can expose potential pitfalls that are usually not covered by traditional testing techniques. It can also lead to improved recovery techniques and updated runbooks, and thereby to better responses from support teams in the event of an unplanned ICT event.
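One way to report aberrations in recovery times after a failure-injection run is to compare each measured recovery time with its recorded baseline, as in the sketch below. The service names, the timings and the 20% tolerance are assumptions made for illustration.

```python
def find_recovery_aberrations(baseline_s, measured_s, tolerance=0.20):
    """Return services whose measured recovery time exceeds the baseline by
    more than `tolerance` (a fraction), mapped to the overshoot in seconds."""
    aberrations = {}
    for service, measured in measured_s.items():
        baseline = baseline_s.get(service)
        if baseline is None:
            continue  # no baseline recorded yet, nothing to compare against
        if measured > baseline * (1 + tolerance):
            aberrations[service] = measured - baseline
    return aberrations


# Hypothetical recovery times in seconds.
baseline_s = {"payments-api": 60, "ledger-db": 120, "notifications": 30}
measured_s = {"payments-api": 65, "ledger-db": 200, "notifications": 31, "fx-rates": 45}

for service, overshoot in find_recovery_aberrations(baseline_s, measured_s).items():
    print(f"{service}: recovery exceeded baseline by {overshoot}s")
```

Services without a baseline are skipped rather than flagged, so a first run can be used to establish the baseline itself.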

1.6 Third-Party Service Provider Resilience

  • Evaluate the operational resilience of critical third-party service providers.
  • Test contingency plans for third-party service disruptions.
  • Assess the effectiveness of service level agreements (SLAs) in maintaining operational resilience.

Third-party service resilience can be tested along multiple dimensions. Some of the critical ones recommended by Coforge (the list is not exhaustive) are as follows:

  1. Data Dimension – Integrity and volume checks for data processed, transmitted, stored or accessed by third-party vendors.
  2. Customer Dimension – vulnerability checks if third-party systems have direct interface with FI customer base/prospects.
  3. Infra & Cloud Dimension – Reliability of provision of third party ICT system, website or application infrastructure from performance & security perspective.
  4. IT System Dimension – Connectivity tests between third-party IT infrastructure and FI group IT infrastructure, including simulation of various failover scenarios.
  5. BCP/DR Dimension – Third party support for business continuity & disaster recovery plans for the parent organisation.
  6. Service Impact – If there is a loss or deterioration of service, then qualification of impact to parent organisation.
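To make the SLA assessment in this section concrete, the sketch below computes the availability a third-party service achieved over a reporting period and compares it with the contracted target. The 99.9% target, the 30-day period and the outage durations are hypothetical values, not figures from any particular contract.

```python
def availability_pct(period_minutes, downtime_minutes):
    """Achieved availability as a percentage of the reporting period."""
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def allowed_downtime_minutes(period_minutes, sla_pct):
    """Downtime permitted by an availability SLA over the reporting period."""
    return period_minutes * (100.0 - sla_pct) / 100.0


PERIOD_MIN = 30 * 24 * 60        # 30-day reporting period, in minutes
SLA_PCT = 99.9                   # hypothetical contracted availability
outages_min = [12, 25, 18]       # hypothetical outage durations, in minutes

downtime = sum(outages_min)
achieved = availability_pct(PERIOD_MIN, downtime)
allowed = allowed_downtime_minutes(PERIOD_MIN, SLA_PCT)
sla_breached = achieved < SLA_PCT

print(f"Downtime: {downtime} min (allowed: {allowed:.1f} min)")
print(f"Achieved availability: {achieved:.3f}% (target: {SLA_PCT}%)")
print(f"SLA breached: {sla_breached}")
```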

1.7 Cybersecurity Incident Response

  • Conduct simulated cyber-attack scenarios to test incident response capabilities.
  • Evaluate the effectiveness of detection, response, and recovery procedures.
  • Test communication protocols during cybersecurity incidents.

Coforge recommends simulating various incident scenarios based on historical data and threat intelligence, followed by analyzing test results to identify areas for improvement in response plans and processes. These improvements can then be tracked and their effectiveness measured through data-driven metrics, using an actual baseline from the response to similar incidents. Specific additional provisions are required when testing for higher-level threats, as per the guidelines for TLPT (threat-led penetration tests).
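Two common data-driven metrics for this purpose are mean time to detect (MTTD) and mean time to recover (MTTR). The sketch below derives both from simulated incident timestamps and compares them with a baseline. The timestamps, the baseline values and the choice to measure recovery from the moment of detection are assumptions made for illustration.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean([(inc[end_key] - inc[start_key]).total_seconds() / 60
                 for inc in incidents])


# Hypothetical timestamps from two simulated incidents.
incidents = [
    {"started": datetime(2025, 1, 10, 9, 0),
     "detected": datetime(2025, 1, 10, 9, 10),
     "recovered": datetime(2025, 1, 10, 10, 0)},
    {"started": datetime(2025, 1, 17, 14, 0),
     "detected": datetime(2025, 1, 17, 14, 20),
     "recovered": datetime(2025, 1, 17, 15, 30)},
]

# Hypothetical baseline from earlier responses to similar incidents (minutes).
baseline = {"mttd": 12.0, "mttr": 75.0}

mttd = mean_minutes(incidents, "started", "detected")
# Assumption: recovery time is measured from detection, not from incident start.
mttr = mean_minutes(incidents, "detected", "recovered")

print(f"MTTD: {mttd:.1f} min (baseline {baseline['mttd']:.1f}, delta {mttd - baseline['mttd']:+.1f})")
print(f"MTTR: {mttr:.1f} min (baseline {baseline['mttr']:.1f}, delta {mttr - baseline['mttr']:+.1f})")
```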

2. Testing Methodology

  • Planning and Scoping: Define the scope of testing, including systems, applications, and processes to be evaluated. This begins early in the service lifecycle. Validation activities confirm that the business needs, contracts and service attributes (specified in the Service Package) are incorporated correctly into the service design as service level requirements (SLRs) and constraints, e.g. capacity and demand limitations.
  • Risk Assessment: Identify potential risks and vulnerabilities in the ICT infrastructure.
  • Test Design: Develop specific test scenarios and scripts for each focus area. These are designed based on various test models. The validity of the test models is ascertained as follows:
    • Test Model has adequate and appropriate test coverage for the risk profile of the service (Levels of Criticality)
    • Test Model covers the key integration aspects and interfaces e.g. Service Provider Interfaces (SPIs)
    • Test Procedures are accurate.
  • Test Execution: Conduct tests in a controlled environment, minimizing impact on production systems. The exception is TLPT, where test guidance is to prove resiliency in a production or production-like system.
  • Results Analysis: Evaluate test results against predefined performance and resilience metrics.
  • Reporting: Document findings, identify gaps, and propose remediation measures.
  • Continuous Improvement: Implement a regular testing schedule and update the approach based on evolving threats and regulatory requirements.

3. Key Considerations

The key considerations required to ensure appropriateness of testing activities are as follows:

  • Ensure testing aligns with client’s overall risk management framework.
  • Involve relevant stakeholders, including IT, risk management, and business units.
  • Maintain detailed documentation of all testing procedures and results.
  • Regularly review and update the testing approach to address new technologies and emerging risks.

4. Compliance and Reporting

  • Align testing procedures with DORA requirements and other relevant EU financial regulations.
  • Prepare comprehensive reports for internal stakeholders and regulatory authorities.
  • Establish a process for timely implementation of identified improvements.
  • Collect relevant metrics, especially for any deviations or potential defects/aberrations found, and incorporate them into the next cycle of tests.

5. Tools and technology for DORA Testing

To effectively implement DORA operational resilience testing, Coforge recommends leveraging a variety of technologies and tools across different testing aspects:

5.1 Vulnerability Assessment & Penetration Testing

  • Static Application Security Testing (SAST) tools: SonarQube, Checkmarx, Fortify
  • Dynamic Application Security Testing (DAST) tools: OWASP ZAP, Burp Suite, Acunetix
  • Network vulnerability scanners: Nessus, OpenVAS, Qualys
  • Penetration testing frameworks: Metasploit, Cobalt Strike, Core Impact

5.2 ICT Systems Capacity Testing

  • Performance testing tools: Apache JMeter, Gatling, LoadRunner
  • Monitoring and observability platforms: Prometheus, Grafana, Datadog
  • Capacity planning software: TeamQuest, BMC TrueSight Capacity Optimization

5.3 Network Infrastructure Resilience

  • Network simulation tools: GNS3, Cisco Packet Tracer
  • Network monitoring software: SolarWinds, PRTG, Nagios
  • Failover testing tools: Chaos Monkey, Gremlin

5.4 Data Center and Cloud Services Resilience

  • Cloud management platforms: AWS CloudFormation, Azure Resource Manager, Google Cloud Deployment Manager
  • Disaster recovery tools: Veeam, Zerto, VMware Site Recovery Manager
  • Container orchestration: Kubernetes, Docker Swarm

5.5 Application and Service Availability

  • Application Performance Monitoring (APM) tools: New Relic, AppDynamics, Dynatrace
  • Chaos engineering platforms: Chaos Toolkit, Litmus Chaos, Chaos Mesh
  • Service mesh technologies: Istio, Linkerd, Consul

5.6 Third-Party Service Provider Resilience

  • Third-party risk management platforms: OneTrust, Prevalent, RiskRecon
  • API testing tools: Postman, SoapUI, Katalon Studio
  • Service level agreement (SLA) monitoring tools: Monitis, Pingdom, Uptrends

5.7 Cybersecurity Incident Response

  • Security Information and Event Management (SIEM) systems: Splunk, IBM QRadar, LogRhythm
  • Incident response platforms: TheHive, DFIRTrack, CyberCPR
  • Threat intelligence platforms: ThreatConnect, Recorded Future, Anomali

6. AI and Machine Learning in DORA Testing

Coforge leverages AI and machine learning technologies to enhance various aspects of DORA operational resilience testing. By integrating these AI-driven approaches, organizations can significantly enhance their operational resilience testing, ensuring they are better prepared to handle disruptions and maintain continuous operations.

  • Testing Predictive Analytics: Organizations are mitigating risks before they occur by leveraging AI. AI can analyze historical data to predict potential disruptions and their impacts. But testing these simulations is important to ensure that AI models are working as expected.
  • Automated Stress Testing: AI-driven tools can be used to simulate various stress scenarios, such as high traffic loads or cyberattacks, to test the resilience of systems. This helps identify vulnerabilities and ensure systems can handle peak loads especially in unplanned failover situations.
  • Continuous Monitoring and Anomaly Detection: AI can continuously monitor systems for unusual patterns or anomalies that might indicate a potential issue, thereby improving the observability of platforms and systems. It is key to verify these monitors in production-like environments so that monitoring and detection can be tuned to the criticality of the organisation’s needs.
  • Incident Response Automation: Test and validate incident response runbooks that use AI to automate parts of the incident response process, such as initial analysis and categorization of incidents.
  • AI-enabled Chaos Engineering: Using AI-based modelling, simulations of various disruption scenarios can be created, allowing organizations to test their response strategies in a controlled environment. This helps improve response effectiveness.
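Verifying an anomaly monitor, as suggested above, can be as simple as injecting a known anomaly into otherwise normal data and confirming that it is flagged. The sketch below uses a plain rolling z-score detector as a stand-in for an AI-based monitor; the detector, the window size, the threshold and the latency figures are all assumptions made for illustration.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the preceding
    `window` points by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples (ms) with one injected spike at index 8.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 100, 400, 101, 99]
INJECTED_INDEX = 8

detected = zscore_anomalies(latency_ms)
print(f"Detected anomaly indices: {detected}")
print(f"Injected spike detected: {INJECTED_INDEX in detected}")
```

The same inject-and-verify pattern applies to a real monitoring stack: the injected fault is known in advance, so a missed detection or a false alarm points directly at a threshold or configuration that needs tuning.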

Conclusion

As a key component of DORA compliance, operational resilience testing presents financial organisations with a unique set of challenges as well as opportunities. At Coforge, we're committed to guiding financial institutions through every step of their DORA compliance journey, leveraging our expertise, partnerships, and innovative solutions to ensure your success in this new regulatory landscape.

Anustup Ray

Anustup Ray is the Global Head of Quality Engineering for BFS at Coforge. A seasoned head of test and QE transformation leader who has advised and consulted tier-1 banks in the US, Europe and Asia, Anustup is known for his unique perspective on the shift-left approach, building an intersection between domain expertise and technical capabilities. As an agile evangelist, he has been responsible for building IT assurance capabilities in unprecedented timescales in major institutions based on enterprise agile best practices.

Sanjiv Roy

Sanjiv is a seasoned professional with over 25 years of experience in Banking and Financial Services Technology. His career spans work with global universal banks, investment banks, innovative neo-banks, and cutting-edge fintech companies. Currently, Sanjiv heads the BFS Solutions practice at Coforge, where he leads efforts to help clients solve complex business problems using advanced technology levers. His expertise lies in crafting custom technology solutions to address critical business challenges in the financial sector. Sanjiv possesses a deep understanding of artificial intelligence and its practical applications within the banking industry, positioning him at the forefront of technological innovation in finance.
