What We Have Learned Building Bi-Directional Sync Systems at 137Foundry
We have built bi-directional data syncs across a range of client contexts at 137Foundry: CRM-to-billing syncs for SaaS companies, inventory syncs between warehouse management systems and e-commerce platforms, and customer data syncs between support platforms and backend databases.
The first integration always seems straightforward. It is the one running six months later that reveals what we got wrong at the start.
The conflict resolution question is a business question first
Every team we work with wants to talk about the technical implementation of conflict resolution before they have answered the business question: when both systems update the same record simultaneously, which system should win?
We have learned to force this question early. If the CRM team and the billing team disagree about which system owns customer_status, the sync layer cannot resolve that disagreement -- it will just consistently pick the wrong answer for half the stakeholders.
We now require a field authority document before any code is written: a spreadsheet mapping every shared field to an owner and a conflict rule, signed off by both sides. The PostgreSQL logical replication documentation describes the mechanics of change capture, but it does not tell you which system is authoritative for which fields. Getting the business question answered in writing first saves enormous pain downstream.
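To make the idea concrete, here is a minimal sketch of how a field authority document might be loaded into the sync layer once it has been signed off. The field names other than customer_status, the system names, and the two conflict rules are illustrative assumptions, not the contents of any real client document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class ConflictRule(Enum):
    OWNER_WINS = "owner_wins"    # the owning system's value always wins
    DEAD_LETTER = "dead_letter"  # route the conflict to a human for review


@dataclass(frozen=True)
class FieldAuthority:
    owner: str          # system that is authoritative for the field
    rule: ConflictRule  # what to do when both systems changed the field


# Hypothetical entries; the real map comes from the signed-off spreadsheet.
FIELD_AUTHORITY: Dict[str, FieldAuthority] = {
    "customer_status": FieldAuthority("billing", ConflictRule.OWNER_WINS),
    "billing_email": FieldAuthority("crm", ConflictRule.OWNER_WINS),
    "company_name": FieldAuthority("crm", ConflictRule.DEAD_LETTER),
}


def resolve(field: str, values: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Decide what to do with a field that both systems changed.

    `values` maps system name -> the value that system proposes.
    Returns ("apply", value) or ("dead_letter", None).
    """
    authority = FIELD_AUTHORITY.get(field)
    if authority is None:
        # Nobody signed off on an owner for this field, so do not guess.
        return ("dead_letter", None)
    if authority.rule is ConflictRule.OWNER_WINS and authority.owner in values:
        return ("apply", values[authority.owner])
    return ("dead_letter", None)
```

The design choice worth noting is that an unmapped field goes to review rather than falling back to a default winner: the code only ever enforces decisions that someone made in writing.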
Sync loops are invisible until they are catastrophic
On every new bi-directional sync integration, we add sync loop detection before we add anything else. The detection is simple: tag every change the sync layer writes with an application identifier, then filter events carrying that identifier out of the CDC stream before propagating them.
The reason this is the first thing we add: a sync loop does not cause obvious errors. The sync runs, data looks correct, metrics look fine. What accumulates is API call volume and change event volume -- both growing exponentially with every sync cycle. We have seen sync loops run for 18 hours before anyone noticed, generating millions of change events and exhausting the API quota for three downstream integrations.
The fix once it is running is to stop the sync, clear the change log, and restart. The fix before it runs is one conditional check in the CDC consumer. We treat this as a non-negotiable day-one requirement for every integration we ship.
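The conditional check in the CDC consumer can be sketched as follows. This is a simplified in-memory illustration: the ChangeEvent shape, the origin field, and the "sync-layer" identifier are assumptions, and how the tag actually travels with a change (a column, a header, a session setting) depends on the systems involved.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical identifier the sync layer stamps on everything it writes.
SYNC_ORIGIN = "sync-layer"


@dataclass(frozen=True)
class ChangeEvent:
    record_id: str
    field: str
    value: str
    origin: str  # identifier of whoever made the change


def tag_outgoing(record_id: str, field: str, value: str) -> ChangeEvent:
    """Build a change the sync layer is about to write, tagged as its own."""
    return ChangeEvent(record_id, field, value, origin=SYNC_ORIGIN)


def without_sync_echoes(events: Iterable[ChangeEvent]) -> Iterator[ChangeEvent]:
    """Drop events that the sync layer itself produced.

    Without this check, a change written to system B comes back through
    B's change stream and is propagated to system A again, indefinitely.
    """
    for event in events:
        if event.origin == SYNC_ORIGIN:
            continue  # our own write echoing back; do not propagate
        yield event
```

For example, given one user edit and one echo of the sync layer's own write, only the user edit survives the filter and is propagated.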
Replication slot lag is a disk space problem waiting to happen
For syncs using PostgreSQL WAL-based CDC, the replication slot holds WAL segments until the consumer acknowledges them. If the consumer stops processing -- a deployment, a bug, an overloaded consumer -- the slot accumulates unprocessed WAL segments. By default, PostgreSQL will not clean up WAL that a replication slot still needs, even when no consumer is connected to that slot.
In a high-write production database, an unmonitored lagging slot can fill disk in hours. We add a monitoring check on pg_replication_slots within 48 hours of any WAL-based CDC setup, computing each slot's lag in bytes as the distance between the current WAL position and the slot's restart_lsn (the view has no ready-made lag column). Alert at 1 GB lag, page at 5 GB. The two times this alert has fired in production, it prevented a disk-full outage.
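A minimal sketch of such a check is below. The query uses pg_wal_lsn_diff and pg_current_wal_lsn, which are available in PostgreSQL 10 and later and must be run on the primary. The function name, the "unknown" state, and treating 1 GB as 1024^3 bytes are assumptions made for illustration; running the query and delivering the alert are left to whatever driver and alerting stack is already in place.

```python
from typing import Optional

# Retained WAL per slot, in bytes. restart_lsn is the oldest WAL position
# the slot still needs, so the difference is what PostgreSQL cannot recycle.
SLOT_LAG_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
"""

GIB = 1024 ** 3
ALERT_BYTES = 1 * GIB  # notify the team
PAGE_BYTES = 5 * GIB   # wake someone up


def classify_slot_lag(lag_bytes: Optional[int]) -> str:
    """Map a slot's retained WAL size to an alert level."""
    if lag_bytes is None:
        # restart_lsn is NULL for a slot that has not reserved WAL yet.
        return "unknown"
    if lag_bytes >= PAGE_BYTES:
        return "page"
    if lag_bytes >= ALERT_BYTES:
        return "alert"
    return "ok"
```

It is worth including the active column in the alert message: a large lag on a slot with no connected consumer usually means the consumer is down rather than slow.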
The dead-letter queue needs a human interface before it has human traffic
Every bi-directional sync we ship now includes a dead-letter queue and a lightweight admin interface for reviewing DLQ entries. We build the interface before we have any DLQ traffic, not after.
The reason: the first time an unresolvable conflict lands in the DLQ, you want someone to be able to inspect it and decide whether to retry, manually resolve, or dismiss it. If the first DLQ entry arrives and there is no UI to review it, it sits in Redis accumulating siblings until someone builds the interface under pressure.
The interface does not need to be sophisticated. A table showing the record ID, the two conflicting versions, the conflict reason, and buttons to apply version A, apply version B, or dismiss is sufficient. What matters is that it exists before it is needed, not after the first incident that makes it necessary.
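The data behind that table can be very small. The sketch below models one DLQ entry and the three button actions in memory. The class and function names are hypothetical, and storage and the actual write-back to both systems are deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class Resolution(Enum):
    APPLY_A = "apply_a"
    APPLY_B = "apply_b"
    DISMISS = "dismiss"


@dataclass
class DlqEntry:
    record_id: str
    version_a: Dict[str, Any]  # the record as system A wants it
    version_b: Dict[str, Any]  # the record as system B wants it
    reason: str                # why automatic resolution failed
    resolution: Optional[Resolution] = None


def resolve_entry(entry: DlqEntry, resolution: Resolution) -> Optional[Dict[str, Any]]:
    """Record the reviewer's decision and return the version to write.

    Returns None for a dismissal, meaning nothing should be written.
    """
    if entry.resolution is not None:
        # Guard against two reviewers acting on the same entry.
        raise ValueError(f"entry {entry.record_id} is already resolved")
    entry.resolution = resolution
    if resolution is Resolution.APPLY_A:
        return entry.version_a
    if resolution is Resolution.APPLY_B:
        return entry.version_b
    return None
```

Each button in the admin table maps to one call to resolve_entry, and the stored resolution doubles as an audit trail of who chose what.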
We documented the architecture and code patterns for all of this in How to Build a Bi-Directional Data Sync Between Business Applications. That guide reflects what we actually deploy, not the idealized version.
What Changed After We Built the Infrastructure First
After establishing this foundation -- field authority document before code, sync loop detection on day one, replication slot monitoring, DLQ with UI before first traffic -- our subsequent bi-directional sync projects have been substantially less eventful operationally.
The first incident on a new integration used to happen within 2 to 3 weeks: either a sync loop, a replication slot growing unnoticed, or a conflict resolution bug discovered by a user. Now the first incident on a new integration typically happens at month 2 or 3, is detected by monitoring rather than user report, and resolves in minutes rather than hours.
The cost of building the foundation upfront is roughly 20 to 30 percent more initial implementation time. The cost of not building it is one 4am incident call, a Sunday afternoon cleanup, and a retro where the team agrees to "add better monitoring" -- which then happens under time pressure and is therefore incomplete.
We have documented all of this in the implementation guide at 137Foundry for the teams who want to build it themselves: How to Build a Bi-Directional Data Sync Between Business Applications. For teams who want help designing or implementing the integration, the 137Foundry data integration team is available for architecture reviews and implementation work.
Why This Matters for Production Reliability
The failure modes of bi-directional sync are almost always discovered in production, not in testing. Test environments rarely replicate the exact conditions that cause clock skew conflicts -- a test setup usually runs on one machine or a few closely synchronized hosts, so the skew seen across production infrastructure never shows up. Test environments rarely replicate the specific bulk operation patterns that create consistency gaps. And test environments rarely run long enough to reveal the slow drift that accumulates when a field authority map is not updated after a schema change.
This is not an argument against testing -- it is an argument for investing in observability alongside testing. The monitoring patterns in this guide (sync lag, DLQ depth, conflict rate, record count parity) give you visibility into problems that tests will not catch before they affect users.
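As a rough illustration of how those four signals can be evaluated together, the sketch below checks one snapshot of sync metrics against thresholds. The snapshot fields and every default threshold are placeholder assumptions; sensible values depend on the integration's volume and tolerance for staleness.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SyncSnapshot:
    sync_lag_seconds: float  # age of the oldest unpropagated change
    dlq_depth: int           # entries waiting for human review
    conflicts: int           # conflicts seen in the window
    records_synced: int      # records propagated in the window
    count_system_a: int      # row count of the synced entity in system A
    count_system_b: int      # row count of the synced entity in system B


def health_warnings(
    snapshot: SyncSnapshot,
    max_lag_seconds: float = 300.0,   # placeholder threshold
    max_dlq_depth: int = 0,           # placeholder threshold
    max_conflict_rate: float = 0.01,  # placeholder threshold
) -> List[str]:
    """Return the names of the signals that are outside their thresholds."""
    warnings: List[str] = []
    if snapshot.sync_lag_seconds > max_lag_seconds:
        warnings.append("sync_lag")
    if snapshot.dlq_depth > max_dlq_depth:
        warnings.append("dlq_depth")
    # Avoid dividing by zero when nothing was synced in the window.
    rate = snapshot.conflicts / snapshot.records_synced if snapshot.records_synced else 0.0
    if rate > max_conflict_rate:
        warnings.append("conflict_rate")
    if snapshot.count_system_a != snapshot.count_system_b:
        warnings.append("record_count_parity")
    return warnings
```

Record count parity is the crudest of the four checks, but it is the one most likely to catch drift that produces no errors at all.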
For teams building a bi-directional sync for the first time, the practical recommendation is: build the operational baseline (DLQ, monitoring, idempotency, loop prevention) before the first production deployment, not after the first production incident. The upfront cost is modest. The incident prevention value is significant.
For technical implementation guidance, see 137Foundry and the data integration resources. For production architecture review and implementation support, the 137Foundry services team works with teams across the integration lifecycle.