Sanofi · Pharmaceuticals · 6 months

8M+ Records, 7 Legacy Systems, Zero Data Loss

Enterprise pharma platform migration with zero data loss across 8M+ records

Org Health & Recovery Architecture · Data Cloud & Multi-Cloud Architecture
0 data loss incidents · 8M+ records migrated · 7 legacy systems retired

Situation

Sanofi’s commercial operations across Europe and APAC had accumulated seven legacy CRM and data systems over a decade of acquisitions, regional buildouts, and platform migrations that never fully completed. Customer and HCP (Healthcare Professional) data existed in fragments — some in aging on-premises systems, some in regional Salesforce sandboxes, some in spreadsheet-based processes that had never been formalized.

The mandate was to consolidate all commercial data into a single Salesforce platform, retire the seven legacy systems, and do so with zero tolerance for data loss. In pharmaceuticals, data integrity is not a performance metric — it is a regulatory requirement. Records of HCP interactions, prescribing data, promotional compliance documentation, and adverse event history must be complete and auditable. Any migration that loses, corrupts, or cannot account for a record creates regulatory exposure.

The scale was 8 million+ records across the seven source systems, with overlapping data models, inconsistent field mappings, duplicate entities, and no agreed canonical schema for what a “healthcare professional contact record” should contain. Three of the seven systems had no API access, requiring custom extraction tooling.

Diagnosis

The primary risk was not technical complexity — it was the combination of regulatory non-negotiability and data model divergence. Standard migration approaches (bulk export, transform, load) carry a level of data loss risk that is tolerable in ordinary commercial contexts but unacceptable under pharmaceutical compliance.

The source systems had three distinct data quality problems. First, duplicate records: the same HCP existed in multiple systems under slightly different names, addresses, and specialization codes. Second, structural inconsistency: what one system called “account type” another called “customer segment,” with different value sets and no cross-reference. Third, historical gaps: several systems had been used inconsistently, creating records with missing mandatory fields that would fail validation against the target Salesforce schema.

A migration that loaded these records as-is would result in thousands of validation failures on day one, requiring manual remediation after go-live — precisely the scenario that regulators and business stakeholders could not accept. The solution required resolving data quality problems before migration, not after.

The three systems with no API access added extraction risk. Custom connectors would need to be built and tested, with every extraction validated against source system record counts and checksums.

Action

Phase 1: Data Architecture and Schema Design

Before writing a single line of migration code, a canonical data model was designed for the target Salesforce environment. This established the authoritative definition of each entity: Healthcare Professional, Account (facility), Product, Interaction, Consent record, and Compliance documentation. Every source system field was mapped to a target field, with explicit decisions about how conflicts between systems would be resolved.

The canonical model was documented, reviewed by compliance, legal, and IT stakeholders, and signed off before migration work began. This created a single source of truth for what the migrated data should look like — not what the source systems contained.
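
To make the mapping concrete, here is a minimal sketch (in Python) of how a source-to-canonical field mapping with explicit conflict-resolution rules can be represented. All system names, field names, and rules below are illustrative stand-ins, not Sanofi's actual schema:

```python
# Illustrative source-to-canonical field mapping. Every name here is
# hypothetical; the point is that each rule is an explicit, reviewable record.
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldMapping:
    source_system: str   # which legacy system this rule applies to
    source_field: str    # field name as it appears in the source extract
    target_field: str    # canonical field on the target Salesforce object
    conflict_rule: str   # how to resolve when systems disagree on a value

# Several source fields can feed one canonical field; the conflict rule
# documents which value wins when the same HCP exists in multiple systems.
HCP_MAPPINGS = [
    FieldMapping("crm_emea",   "acct_type",        "account_type", "most_recent_wins"),
    FieldMapping("crm_apac",   "customer_segment", "account_type", "most_recent_wins"),
    FieldMapping("legacy_db3", "natl_id",          "national_id",  "source_of_truth"),
]

def mappings_for(target_field: str) -> list[FieldMapping]:
    """All source rules that feed a given canonical field."""
    return [m for m in HCP_MAPPINGS if m.target_field == target_field]

print(mappings_for("account_type"))  # both legacy fields that feed account_type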

Phase 2: Extraction and Profiling

Custom extraction connectors were built for the three API-less legacy systems, using direct database queries where possible and screen-scraping automation where database access was unavailable. Each extraction was validated: record counts matched, checksum validation confirmed completeness, and a sample audit compared extracted records against source system UI snapshots.
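
A minimal sketch of the count-and-checksum validation described here, assuming both sides can produce the same canonical serialization of a record; the hashing scheme is illustrative, not the one used on the engagement:

```python
# Order-independent extract validation: row counts must match the source
# system's own count, and a deterministic checksum over the records guards
# against truncated or corrupted extracts. Names are illustrative.
import hashlib
import json

def extract_checksum(records: list[dict]) -> str:
    """Hash each record canonically, sort the digests, hash the sorted list."""
    digests = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()

def validate_extract(records: list[dict], source_count: int, source_checksum: str) -> None:
    if len(records) != source_count:
        raise RuntimeError(f"count mismatch: extracted {len(records)}, source reports {source_count}")
    if extract_checksum(records) != source_checksum:
        raise RuntimeError("checksum mismatch: extract is incomplete or corrupted")
```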

Data profiling was run against all extracted data to quantify the three quality problems: 340,000 potential duplicates identified, 12,000 records with missing mandatory fields, and 89,000 records with structural inconsistencies requiring value normalization.
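
A sketch of what such a profiling pass can look like; the mandatory fields and canonical value set below are hypothetical stand-ins:

```python
# Profiling pass quantifying the three problem classes: candidate duplicates
# (same national ID seen more than once), missing mandatory fields, and
# values outside the canonical value set. Field names are illustrative.
from collections import Counter

MANDATORY = {"national_id", "last_name", "specialty_code"}
CANONICAL_ACCOUNT_TYPES = {"hospital", "clinic", "pharmacy"}

def profile(records: list[dict]) -> dict[str, int]:
    id_counts = Counter(r["national_id"] for r in records if r.get("national_id"))
    return {
        "duplicate_candidates": sum(c for c in id_counts.values() if c > 1),
        "missing_mandatory": sum(1 for r in records if any(not r.get(f) for f in MANDATORY)),
        "unmapped_values": sum(1 for r in records if r.get("account_type") not in CANONICAL_ACCOUNT_TYPES),
    }
```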

Phase 3: Transformation and Quality Resolution

A deterministic deduplication process resolved the 340,000 potential duplicates. Matching rules used a combination of national identifier (where available), name + address proximity scoring, and specialty code matching. Each deduplication decision was logged with the matching criteria used, creating an audit trail for regulatory review.
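
A simplified illustration of deterministic matching with an audit trail, in the spirit of the rules above; the weights, threshold, and field names are illustrative, not the production values:

```python
# Deterministic matcher: an exact national-ID match wins outright; otherwise
# a weighted name/address similarity score plus specialty agreement must
# clear a fixed threshold. Every decision is logged with the rule that
# fired, so it can be replayed for audit. Records carry a stable "id".
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(rec_a: dict, rec_b: dict, audit_log: list[dict]) -> bool:
    if rec_a.get("national_id") and rec_a["national_id"] == rec_b.get("national_id"):
        rule, matched = "national_id_exact", True
    else:
        score = (0.6 * similarity(rec_a["name"], rec_b["name"])
                 + 0.4 * similarity(rec_a["address"], rec_b["address"]))
        matched = score >= 0.92 and rec_a.get("specialty_code") == rec_b.get("specialty_code")
        rule = f"name_address_score={score:.3f}"
    audit_log.append({"a": rec_a["id"], "b": rec_b["id"], "rule": rule, "matched": matched})
    return matched
```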

Missing mandatory fields were remediated through a combination of automated enrichment (using public HCP registry data where available) and a structured review process for records that could not be automatically completed. The 12,000 records with missing fields were reduced to 847 that required human review before migration.
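
A sketch of that routing logic; `registry_lookup` is a hypothetical stand-in for whatever public HCP registry interface is available:

```python
# Remediation split: attempt automated enrichment first, queue the residue
# for human review. registry_lookup is a hypothetical function that returns
# a dict of registry data for a national ID, or None if not found.
def remediate(records, mandatory_fields, registry_lookup):
    review_queue = []
    for rec in records:
        missing = [f for f in mandatory_fields if not rec.get(f)]
        if not missing:
            continue
        enriched = registry_lookup(rec.get("national_id"))  # may return None
        for f in missing:
            if enriched and enriched.get(f):
                rec[f] = enriched[f]
        if any(not rec.get(f) for f in mandatory_fields):
            review_queue.append(rec)  # could not auto-complete: human decision
    return review_queue
```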

Structural inconsistencies were resolved through value mapping tables: every source system value was mapped to the target canonical value, with unmappable values flagged for business owner decision. No record was migrated with an unresolved mapping.
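
A minimal sketch of a value mapping table with flagging; all systems, fields, and values are invented for illustration:

```python
# Value normalization against a mapping table: every (system, field, value)
# triple either maps to a canonical value or is flagged; flagged records are
# held back until a business owner supplies a mapping.
VALUE_MAP = {
    ("crm_emea", "acct_type", "HOSP"):     "hospital",
    ("crm_apac", "customer_segment", "H"): "hospital",
    ("crm_apac", "customer_segment", "C"): "clinic",
}

def normalize(system: str, field: str, value: str, flagged: list[tuple]) -> str | None:
    canonical = VALUE_MAP.get((system, field, value))
    if canonical is None:
        flagged.append((system, field, value))  # held for business-owner decision
    return canonical
```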

Phase 4: Load, Validation, and Go-Live

Migration was executed in staged waves: non-production environments first, with full validation before production load. Each wave used a load-validate-reconcile cycle: load a batch, run automated validation queries comparing source and target record counts and field values, and reconcile any discrepancies before proceeding.

A zero-tolerance reconciliation threshold was set: any batch where source-to-target reconciliation failed by even one record would halt migration and trigger investigation. Over the course of the migration, three batches triggered halts — all three were traced to extraction issues in legacy systems (not transformation errors), corrected, and re-loaded.
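
A sketch of the load-validate-reconcile cycle with the zero-tolerance halt, assuming `load_batch` and `query_loaded` wrap the actual load and query plumbing (both hypothetical here, as is the assumption that both sides return records in the same canonical shape):

```python
# Batch loop with a zero-tolerance halt: one unaccounted-for or drifted
# record stops the run for investigation before any further batch loads.
class ReconciliationFailure(Exception):
    pass

def migrate(batches, load_batch, query_loaded):
    for i, batch in enumerate(batches):
        load_batch(batch)
        loaded = {r["id"]: r for r in query_loaded(batch)}
        missing = [r["id"] for r in batch if r["id"] not in loaded]
        drifted = [r["id"] for r in batch if r["id"] in loaded and loaded[r["id"]] != r]
        if missing or drifted:
            raise ReconciliationFailure(
                f"batch {i}: missing={missing[:5]} drifted={drifted[:5]}"
            )
```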

Post-migration, a 30-day parallel operation period ran the legacy systems and Salesforce simultaneously, with automated comparison queries confirming data consistency between them. Sign-off on decommissioning required 100% reconciliation across all 8M+ records.
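
A sketch of the kind of scheduled comparison such a parallel run implies, assuming both systems can be read in full and serialized the same way; all function names are illustrative:

```python
# Daily consistency check during the parallel-run window: compare every
# legacy record against its Salesforce counterpart by stable ID and content
# hash. fetch_legacy / fetch_target stand in for the real system interfaces.
import hashlib, json

def record_hash(rec: dict) -> str:
    return hashlib.sha256(json.dumps(rec, sort_keys=True).encode()).hexdigest()

def daily_consistency_check(fetch_legacy, fetch_target) -> list[str]:
    legacy = {r["id"]: record_hash(r) for r in fetch_legacy()}
    target = {r["id"]: record_hash(r) for r in fetch_target()}
    missing = [i for i in legacy if i not in target]
    drifted = [i for i in legacy if i in target and legacy[i] != target[i]]
    return missing + drifted  # empty list == 100% reconciliation for the day
```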

Result

More than eight million records were migrated from seven legacy systems with zero data loss incidents. The parallel operation period confirmed complete reconciliation — every record in every legacy system was accounted for in the Salesforce target, with a full audit trail from source extraction through transformation decisions to final load.

Seven legacy systems were decommissioned on schedule, eliminating the maintenance costs and compliance risk associated with aging infrastructure. The consolidated Salesforce platform reduced the annual IT overhead for maintaining multiple CRM systems by €1.2M.

The migration framework developed for Sanofi was built for reuse: the extraction connectors, transformation logic, and validation suite were packaged as assets for future migration programs. The data quality resolution methodology — profile, resolve deterministically, flag residuals for human review, document every decision — has since been applied to two subsequent migration programs at other clients.

The migration also established a clean, well-governed data foundation that positions Sanofi’s commercial operations for future AI and Data Cloud initiatives. Clean data with complete audit trails is the prerequisite for intelligent automation; Sanofi’s commercial org is now in a position to deploy Agentforce agents on a foundation that can support them.

Technologies used: Salesforce Sales Cloud, Service Cloud, Data Loader, custom extraction connectors, Apex validation frameworks, DataWeave transformations, external data profiling tooling
