Cross Environmental Syncing

Syncing two databases between environments is one of the most challenging systems I have ever attempted. It sounds easy right? Database A has been updated, and now I want to sync all changes (since the last sync) from Database A to Database B by running a polling script of some sort.

The biggest issue that comes to mind, when attempting to poll changes from one database to another, is understanding the expected behavior when making those updates. If a sync is intended to always happen in one direction, one must assume that the receiver of that sync may have their changes overwritten by the incumbent synced changes. That level of behavior is a business decision, and likely configurable. Which makes things more complicated, mostly on that level.

More food for thought: what effect do primary keys have in each database table, during the sync? The users table in Database A for John Doe has a primary key of 100. But the corresponding record in Database B has a primary key of 101 (there are several reasons why a primary key could become offset, but we’ll discuss that later). This a disturbing issue to deal with, knowing that primary key relationships only extend within the scope of a single database - they are not managed across separate databases. One way to circumvent this issue would be to have a third-party system somewhere between the two databases. Its role would be to solely map the relationships between the two tables. A referee. Something with the middle perspective that can enforce that intended relationship. In just one table, the referee can map the primary key (or some unique key) from one database table to the other.

A different approach altogether, would be to create a new identifier field (call it sync_id) on every table - in both parties - and preemptively match corresponding rows using some identifier in that field. Which ultimately works, but at some point early on, we run into the circular question of how do we even relate those corresponding rows in the first place to make those identifiers match up? That question applies to every approach I have ever evaluated. There are several approaches, and they all have their pros and cons.

Here is a scenario: the business needs are - there is one authoritarian server that hosts a database of all sources of truth. That server is accessible by several child servers, each with a similar database (same structure, and the data is mostly identical with some exceptions). The business needs to update all child databases simultaneously with an important data change that should reflect everywhere in this scenario. In this case, an administrator would make the update on the authoritarian server. Each child would be prompted, in some way or another, to poll the big server upstairs for any new updates since the last time that child checked. Bingo, it finds a recent update (as applied by that administrator) and makes a decision: does this record already exist? Yes - update the record that has a matching sync_id value. Or, no - create a new record, using the sync_id that was presented during the sync.

The above story contains holes within the logic (bet you can find a couple), but the undisclosed business needs are key to the functionality and will, in fact, fill in those holes altogether. Trust that the story works in this case and that some details were purposely left out ;)