Data Replication

Sameer Bhujade
Nov 23, 2020

Introduction

Data replication is the process of making multiple copies of data and storing them at different locations to improve the data's overall accessibility across a network.

Every organization needs to protect its critical data across physical, virtual, and cloud environments to minimize the risk of business disruption. If a disaster occurs, fast data restore and restart capabilities are required to keep the business running. Replication is one way to ensure business continuity when a disaster occurs.

Having a replica can also make data access faster, especially in organizations with a large number of locations. Users in Asia or Europe may experience latency when reading data from North American data centers. Putting a replica of the data closer to the user can improve access times and balance the network load.

Replication Techniques Overview

Based on the organization's data requirements, data can be replicated to one or more locations. For example, data can be replicated within a data center, between data centers, from a data center to a cloud, or between clouds. In a software-defined data center environment, organizations can automate the replication process through policies. These policies decide how many replicas are created and where each replica is stored.
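To make the idea of a replication policy concrete, here is a minimal sketch in Python. The `ReplicationPolicy` class, the `place_replicas` function, and the site names are all hypothetical and exist only for illustration; real software-defined storage products expose their own policy formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    """Hypothetical policy: how many replicas to keep and which sites may hold them."""
    replica_count: int
    allowed_sites: tuple


def place_replicas(policy, source_site):
    """Choose target sites for the replicas, never reusing the source site."""
    candidates = [site for site in policy.allowed_sites if site != source_site]
    if len(candidates) < policy.replica_count:
        raise ValueError("not enough sites to satisfy the policy")
    # Take the first N candidates; a real system might rank by cost or latency.
    return candidates[:policy.replica_count]


# Illustrative site names, not taken from any real deployment.
policy = ReplicationPolicy(replica_count=2,
                           allowed_sites=("dc-east", "dc-west", "cloud-eu"))
targets = place_replicas(policy, source_site="dc-east")
print(targets)
```

The policy only states the intent (two replicas, three permitted sites); the placement function turns that intent into concrete targets, which is the part that policy-based automation takes off the administrator's hands.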

Typically, organizations providing cloud services have multiple data centers across the world and they can provide options to customers for choosing the location to which the data is to be replicated.

Data can be copied on demand or be transferred in bulk or in batches according to a schedule, or be replicated in real time as the data is written, changed, or deleted in the master source.

Use of Replication

A replica can be used as an alternative source for backup. Under normal backup operations, data is read from the production LUNs (logical unit numbers) and written to the backup device. This places additional load on the production infrastructure because the production LUNs are simultaneously serving production operations and supplying data for backup operations. To avoid this, a replica can be created from the production LUN and used as the source for backup operations. This removes the backup I/O workload from the production LUNs.

It can be used for fast recovery and restart. For critical applications, replicas can be taken at short, regular intervals. This allows easy and fast recovery from data loss. If the production LUN fails completely, the replication solution makes it possible to restart production operations on the replica, which reduces the recovery time objective (RTO).

It can be used for generating reports. Running reports using the data on the replicas greatly reduces the I/O burden placed on the production device.

It can be used for testing new business models. Replicas are also used for testing new applications or upgrades. For example, an organization may use the replica to test the production application upgrade; if the test is successful, the upgrade may be implemented on the production environment.

Categories of Replication

Replication can be classified into local and remote replication.

Local Replication

Local replication refers to replicating data within the same storage system or the same data center. Local replicas help restore the data after data loss, or allow the application to be restarted immediately to ensure business continuity. Local replication can be implemented at the server, storage, and network levels.

Remote Replication

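Remote replication refers to replicating data to remote locations. Remote replication helps organizations mitigate the risks associated with regional outages resulting from natural or human-made disasters. During a disaster, services can be failed over to a remote location to ensure continuous business operation. Remote replication also allows organizations to replicate their data to the cloud for disaster recovery (DR) purposes. In remote replication, data can be replicated synchronously or asynchronously. With synchronous replication, a write is acknowledged only after it has been committed at both the source and the target; with asynchronous replication, the write is acknowledged once the source has committed it and is sent to the target afterwards. Like local replication, remote replication can be implemented at the server, storage, and network levels.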
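The difference between synchronous and asynchronous replication can be sketched with a toy in-memory store. The `ReplicatedStore` class below is purely illustrative and assumes a single primary and a single replica held in Python dictionaries; it ignores networking, failures, and ordering guarantees that a real product has to handle.

```python
from collections import deque


class ReplicatedStore:
    """Toy primary/replica pair; not a real replication protocol."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # writes accepted but not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # Synchronous: the replica is updated before the write is acknowledged.
            self.replica[key] = value
        else:
            # Asynchronous: acknowledge now, ship the write to the replica later.
            self.pending.append((key, value))
        return "ack"

    def drain(self):
        """Apply queued writes to the replica (the asynchronous transfer step)."""
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value


sync_store = ReplicatedStore(synchronous=True)
sync_store.write("order-1", "paid")

async_store = ReplicatedStore(synchronous=False)
async_store.write("order-1", "paid")
lag_before_drain = len(async_store.pending)    # writes the replica is behind by
replica_before_drain = dict(async_store.replica)
async_store.drain()
```

In the asynchronous case the replica is briefly behind the primary, so a failure before `drain()` runs would lose the queued write. That gap is the trade-off for not making every write wait on the remote site.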

Replication Procedures

Full Replication

Full replication copies the entire data set from the source to the target system, including new, updated, and unchanged data. This technique requires more processing power and increases the load on the network. Costs also tend to rise, because keeping the copies consistent becomes harder as the volume of data grows.

Partial Replication

In this technique, only part of the data is replicated, such as the data that has been updated. It is therefore faster than full replication because it deals with a comparatively smaller volume, which reduces network load and consistency issues.
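The contrast between the two procedures can be sketched with plain dictionaries. The function names `full_replicate` and `partial_replicate` are hypothetical, and the sketch assumes the caller already knows which keys changed; tracking those changes is a separate problem that real systems solve with logs or change tracking.

```python
def full_replicate(source):
    """Copy every record, changed or not. Returns the new target and records copied."""
    return dict(source), len(source)


def partial_replicate(source, target, changed_keys):
    """Copy only the records named in changed_keys. Returns records copied."""
    copied = 0
    for key in changed_keys:
        if key in source:
            target[key] = source[key]
            copied += 1
        else:
            # The record was deleted at the source, so remove it from the target too.
            target.pop(key, None)
    return copied


source = {"a": 1, "b": 2, "c": 3}
target, full_count = full_replicate(source)   # initial full copy

source["b"] = 20                               # one update
del source["c"]                                # one delete
partial_count = partial_replicate(source, target, changed_keys=["b", "c"])
print(full_count, partial_count, target == source)
```

Here the full copy moves every record, while the partial pass moves only the one updated record and applies the delete, yet the target still ends up identical to the source.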

Advantages of Replication

Database replication is often overseen by a database or replication administrator. A properly implemented replication system can offer several advantages, including the following:

• Load reduction: Because replicated data can be spread over several servers, it reduces the likelihood that any one server will be overwhelmed with user queries for data.

• Efficiency: Servers that are less burdened with queries can offer better performance to more users.

• High availability: Employing multiple servers with the same data supports high availability, meaning that if one server goes down, the remaining servers can keep the system running with acceptable performance.
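The load reduction and high availability points above can be illustrated with a simple round-robin read router. This is only a sketch: the `route_reads` function and the replica names are hypothetical, and real deployments usually rely on a load balancer or the database driver for this.

```python
from collections import Counter
from itertools import cycle

# Hypothetical replica names, for illustration only.
replicas = ["replica-1", "replica-2", "replica-3"]


def route_reads(servers, n_queries, down=frozenset()):
    """Spread n_queries over the healthy servers in round-robin order."""
    healthy = [server for server in servers if server not in down]
    if not healthy:
        raise RuntimeError("no healthy replica available")
    rotation = cycle(healthy)
    return Counter(next(rotation) for _ in range(n_queries))


load_all = route_reads(replicas, 9)                          # all replicas up
load_degraded = route_reads(replicas, 9, down={"replica-2"})  # one replica down
print(load_all)
print(load_degraded)
```

With all three replicas up, each serves a third of the reads. With one replica down, the remaining two absorb its share and the reads still succeed, at the price of more load per server.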

Disadvantages of Replication

Many disadvantages of database replication stem from poor general data governance practices. These disadvantages include the following:

• Data loss: Data loss can occur during replication when incorrect data, or the wrong version or update of a database, is copied and important data is consequently deleted or unaccounted for. This can happen if the primary key used to verify the quality of data in the replica is faulty or incorrect. It can also occur if database objects are incorrectly configured within the source database.

• Data inconsistency: Similarly, incorrect or out-of-date replicas can cause different sources to fall out of sync with each other. This can waste data warehouse spending on analyzing and storing irrelevant data.

• Multiple servers: Running multiple servers carries inherent maintenance and energy costs. Either the organization or a third party has to cover these costs. If a third party handles them, the organization runs the risk of vendor lock-in or service issues beyond the organization’s control.
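One common way to catch the data inconsistency described above is to compare the source and the replica periodically. The sketch below assumes both copies fit in memory as Python dictionaries with JSON-serializable values; the `checksum` and `diverged_keys` helpers are hypothetical names chosen for illustration.

```python
import hashlib
import json

_MISSING = object()  # sentinel so a missing key is never confused with a stored None


def checksum(rows):
    """Order-independent SHA-256 digest of a dictionary of records."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def diverged_keys(primary, replica):
    """Return the keys whose values differ or that exist on only one side."""
    all_keys = primary.keys() | replica.keys()
    return sorted(
        key for key in all_keys
        if primary.get(key, _MISSING) != replica.get(key, _MISSING)
    )


primary = {"u1": "alice", "u2": "bob"}
replica = {"u1": "alice", "u2": "bobby", "u3": "carol"}  # stale and extra records

in_sync = checksum(primary) == checksum(replica)
stale = diverged_keys(primary, replica)
print(in_sync, stale)
```

Comparing a single digest is a cheap first check. Only when the digests differ is it worth walking the records to find which ones have diverged.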
