Background
With more and more organizations delivering digital services through the cloud, information systems are generally perceived to be centrally deployed and managed. However, emerging data security restrictions and compliance requirements around the handling of personally identifiable data in IT systems are significantly affecting system design.
Many countries have been enacting legislation that requires personal data to be kept within geographic boundaries. These regulations have increased the cost, complexity, and legal implications of transferring personal data across borders. In recent customer engagements, we have come across the following instances:
Almost all data-processing IT systems face the above issues while meeting business and legislative requirements. Cloud-based solutions that work with PII data also need to support rapidly evolving data storage and processing requirements. The combination of the above-mentioned regulatory, technical, and compliance elements requires robust and secure data hosting and processing solutions that can accommodate customer preferences.
The scope of this white paper is limited to outlining cloud-based reference solutions for the geo-localization of personally identifiable data. For illustration purposes, Azure Cloud service references will be used. For an exact implementation, the solution requirements will also be informed by other data requirements and policies (e.g., privacy, security, and archiving).
The document outlines requirements identified by a group of architects and a few customer representatives at Coforge, along with their potential solution options. Operational cost implications are considered for solutions that deliver increased flexibility of data storage and/or security of data in transit. The document also presents cloud-based solutions and recommends the most preferred solution, subject to conditions.
Local Data Residency of Personally Identifiable Information (PII)
For any given customer, it must be possible to physically store all PII data in a geographical area, whether that is a specific continent, country, or region within a country, and whether this storage is legally mandated (Data Localization law), legally encouraged (Data Sovereignty laws), or simply the choice of the Client (Data Residency).
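The requirement above can be expressed as a small data structure. The following is a minimal sketch only: the class names, field names, and sample customer are hypothetical and not taken from any customer system. It records, for each customer, the geographic granularity (continent, country, or region within a country) and the basis for the storage constraint, and checks whether a candidate storage location satisfies it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Basis(Enum):
    """Why the data must stay in a location (the three cases named above)."""
    LOCALIZATION = "legally mandated"
    SOVEREIGNTY = "legally encouraged"
    RESIDENCY = "client choice"


@dataclass(frozen=True)
class ResidencyPolicy:
    """Hypothetical per-customer record of where PII may be stored."""
    customer_id: str
    basis: Basis
    continent: str
    country: Optional[str] = None  # None: anywhere within the continent
    region: Optional[str] = None   # None: anywhere within the country

    def permits(self, continent: str, country: str, region: str) -> bool:
        """Return True if the given storage location satisfies this policy."""
        if continent != self.continent:
            return False
        if self.country is not None and country != self.country:
            return False
        if self.region is not None and region != self.region:
            return False
        return True


# Illustrative customer whose PII must stay within one country.
policy = ResidencyPolicy("customer-001", Basis.LOCALIZATION, "Europe", country="Germany")
print(policy.permits("Europe", "Germany", "Hesse"))    # True
print(policy.permits("Europe", "France", "Brittany"))  # False
```

Keeping the basis alongside the location lets a deployment treat a legal mandate more strictly than a client preference, should that distinction matter for a given engagement.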
Subject to customer-specific requirements, the following types of data meet the definition of personal data:
Personal data definition: Personal data is any information that relates to an identified or identifiable living individual. Different pieces of information that, when collected together, can lead to the identification of a particular person also constitute personal data.
Privacy laws around the globe can vary in their definition of PII/Personal Data. For example, GDPR defines personal data as any information relating to an identified or identifiable natural person, whereas the California Consumer Privacy Act (CCPA) defines personal information as “information that identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.” In addition to data that is more intuitive (name, email address, social security number), PII under CCPA includes inferences drawn to create a profile about a consumer reflecting the consumer’s preferences, characteristics, psychological trends, predispositions, behavior, attitudes, intelligence, abilities, and aptitudes.
In Canada, the definition in the Personal Information Protection and Electronic Documents Act (PIPEDA) includes any factual or subjective information, recorded or not, about an identifiable individual. This includes opinions, evaluations, records of a dispute between a consumer and a merchant, and intentions (e.g., intentions to acquire goods or services, or change jobs).
China’s PIPL follows the GDPR in its definition of PII, though its definition of sensitive data is much broader, including "information that once leaked or abused may cause damage to personal reputation or seriously endanger personal and property safety" as well as race, nationality, religion, biometric information, health, financial account, personal whereabouts, and other information.
These differences in definition can create challenges in the development of a globally applicable policy on personal data localization; therefore, the solution must provide for geo-specific adaptation.
The traditional approach to designing a cloud-based or on-premises system is to centralize services, applications, and databases, whereas the new privacy and compliance requirements need architects to think otherwise. A balance between a centralized, cost-efficient solution and a decentralized, privacy-compliant solution needs to be struck, bearing in mind the nature of the system and the budget.
The solution largely depends on the amount of engineering effort budgeted for the initiative and the current state of the system. A system that is still being designed or built has more options than an existing system that is being adapted to accommodate data localization requirements.
The following are a few architecture options that have been explored and applied by our internal practice teams; each of these solutions is relevant in its own space, subject to the customer, project, and other circumstances.
Option 1: A central system, with geo-spread DB.
Solution Overview
All services are hosted in a central environment. The central environment acts as the ‘default’ environment and other environments are created when required for serving a new region.
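As an illustration of this option, the centrally hosted services need some way to decide which regional database holds a given customer's data. The sketch below is one possible shape for that lookup, not a prescribed design: the class name, the region codes, and the host names are all hypothetical. It falls back to the default environment when no regional database has been created yet, mirroring the overview above.

```python
from typing import Dict

DEFAULT_REGION = "central"  # hypothetical name for the default environment


class RegionalDatabaseRouter:
    """Maps a customer's region to the database that holds that customer's data."""

    def __init__(self, default_dsn: str) -> None:
        self._dsn_by_region: Dict[str, str] = {DEFAULT_REGION: default_dsn}

    def add_region(self, region: str, dsn: str) -> None:
        """Register a regional database when a new region has to be served."""
        self._dsn_by_region[region] = dsn

    def dsn_for(self, region: str) -> str:
        """Return the regional database, or the default one if none exists."""
        return self._dsn_by_region.get(region, self._dsn_by_region[DEFAULT_REGION])


# Illustrative host names only.
router = RegionalDatabaseRouter("db-central.example.internal")
router.add_region("eu", "db-eu.example.internal")
print(router.dsn_for("eu"))    # db-eu.example.internal
print(router.dsn_for("apac"))  # db-central.example.internal
```

Where a region is subject to a legal localization mandate, a stricter variant might raise an error instead of falling back to the default environment; which behaviour is appropriate depends on the customer's requirements.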
Benefits
Constraints
Recommended scenarios
Option 2: A decentralized system with locally hosted services and data
Solution Overview
All services and all data are hosted and processed locally. The central environment will act as the ‘default’ environment and other environments will be created when required for serving a region.
All services, databases, and other repositories are deployed regionally.
Users will be redirected automatically based on their source IP address.
A landing page will redirect users who log in from outside their home region / geography.
Master data replication will have to be designed separately, but this does not impact transactional performance.
Consolidated data will be masked and replicated to a central data mart / data lake for reporting.
When reports are viewed in the central environment, region-specific redirection will be needed to enable operational reporting.
To release new features, applications and services will need to be deployed to multiple target environments to keep all regions current.
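Two of the steps above, routing a user by source IP address and masking data before it is replicated centrally, can be sketched as follows. This is a sketch under stated assumptions: the network-to-region table uses documentation-only address ranges, the URLs and the set of PII field names are hypothetical, and salted hashing is just one possible masking technique whose adequacy under a given regulation has to be assessed separately.

```python
import hashlib
import ipaddress
from typing import Dict

DEFAULT_REGION = "central"

# Hypothetical mapping of source networks to regions (documentation-only ranges).
REGION_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "eu",
    ipaddress.ip_network("198.51.100.0/24"): "apac",
}

# Hypothetical regional entry points and landing page.
REGION_URLS = {
    "central": "https://central.example.com",
    "eu": "https://eu.example.com",
    "apac": "https://apac.example.com",
}
LANDING_PAGE = "https://www.example.com/choose-region"

# Hypothetical set of fields treated as PII in this sketch.
PII_FIELDS = {"name", "email"}


def region_for_ip(source_ip: str) -> str:
    """Return the region whose network contains the IP, else the default."""
    address = ipaddress.ip_address(source_ip)
    for network, region in REGION_BY_NETWORK.items():
        if address in network:
            return region
    return DEFAULT_REGION


def entry_point(source_ip: str, home_region: str) -> str:
    """Send users to their regional environment, or to the landing page
    when they connect from outside their home region."""
    if region_for_ip(source_ip) == home_region:
        return REGION_URLS[home_region]
    return f"{LANDING_PAGE}?home={home_region}"


def mask_record(record: Dict[str, object], salt: str) -> Dict[str, object]:
    """Replace PII fields with a truncated salted hash before central replication."""
    masked: Dict[str, object] = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            masked[key] = digest[:16]
        else:
            masked[key] = value
    return masked


print(entry_point("192.0.2.10", "eu"))     # https://eu.example.com
print(entry_point("198.51.100.7", "eu"))   # https://www.example.com/choose-region?home=eu
print(mask_record({"name": "Example User", "amount": 120}, salt="demo"))
```

In practice, the IP-to-region lookup would typically come from a geolocation service or a traffic-routing product rather than a hand-maintained table; the table here only keeps the example self-contained.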
Benefits
Having independent environments ensures full compliance since region-specific tools and policies can be utilized.
Simple architecture without any specialized engineering, resulting in low-cost operational support.
Regional environments can be scaled up, down, or decommissioned quickly, providing agility for service delivery to new regions.
No performance penalties since data transfer across regions is not required.
Constraints
Replicating the entire environment locally will incur higher technical architecture costs than other options.
New environments need to be created only when a new customer is onboarded, so the investment is made ‘just in time’.
The capacity and technical services plans can be selected and scaled up or down based on the expected volume of new business
A centralized solution with multiple tenants will have to be scaled up.
A more complex system (options 1 or 3) would cost much more to develop and maintain, almost certainly more than replicating new localized environments on demand
The creation of a Disaster Recovery environment will provide more clarity about cost and support implications
With multi-region decentralized systems, operations become more complicated, and the upkeep of services across multiple environments adds operational overhead.
Recommended scenarios
Existing systems that cannot be subjected to re-engineering.
Systems whose compliance and data security requirements are ongoing, changing, and in need of regional treatment.
Systems whose user groups can be partitioned by region, with regions allocated to the partitioned user base.
Systems that need to process large volumes of data that include both PII and non-PII data.
Systems requiring real-time updates.
Option 3: A central system with PII data geo-located and the rest of the data held centrally
Solution Overview
All services and non-PII data will be hosted centrally. PII data and repositories will be hosted regionally. The services will fetch PII data on demand and will cache the reference data.
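The access pattern described above can be sketched as follows. The sketch assumes that "reference data" means slowly changing, non-PII lookup values, which is an interpretation rather than something the overview states; the class, the order fields, and the stand-in fetch functions are hypothetical. PII is fetched from the regional store on every request and is not retained centrally, while reference data is loaded once and cached.

```python
from typing import Callable, Dict

PiiFetcher = Callable[[str], Dict[str, str]]


class CentralOrderService:
    """Central service that joins central non-PII data with regional PII."""

    def __init__(self, pii_fetchers: Dict[str, PiiFetcher],
                 load_reference: Callable[[str], str]) -> None:
        self._pii_fetchers = pii_fetchers      # region -> fetch function
        self._load_reference = load_reference  # loads one reference value
        self._reference_cache: Dict[str, str] = {}

    def _reference(self, code: str) -> str:
        """Return a reference value, loading it only on first use."""
        if code not in self._reference_cache:
            self._reference_cache[code] = self._load_reference(code)
        return self._reference_cache[code]

    def order_view(self, order: Dict[str, str]) -> Dict[str, str]:
        """Build a response; PII is fetched on demand and never stored here."""
        pii = self._pii_fetchers[order["region"]](order["customer_id"])
        return {
            "order_id": order["order_id"],
            "status": self._reference(order["status_code"]),
            "customer_name": pii["name"],
        }


# Stand-ins for a regional PII repository and a central reference table.
def fetch_from_eu(customer_id: str) -> Dict[str, str]:
    return {"name": "Example Customer"}


def load_status_label(code: str) -> str:
    return {"S1": "Shipped"}[code]


service = CentralOrderService({"eu": fetch_from_eu}, load_status_label)
order = {"order_id": "o-1", "customer_id": "c-1", "region": "eu", "status_code": "S1"}
print(service.order_view(order))
```

Because every PII read crosses a region boundary in this pattern, the number of on-demand fetches per request is a natural place to look when assessing its performance cost.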
Benefits
Constraints
Recommended scenarios
When designing modern digital service delivery platforms, compliance requirements such as GDPR and the handling of PII must be considered to ensure the longevity of the solution. Customer and legislative demands increasingly encourage privacy-by-design principles. Modern architecture patterns like microservices are already compatible with the above-mentioned approaches since they readily allow data partitioning and the separation of data processing from data storage.