Entity Resolution Without Fuzzy Matching: How Registry Identifiers Solve the Duplicate Problem
1. Your Entity Resolution Problem Isn’t a Matching Problem. It’s a Data Problem.
Most entity resolution guides start with algorithms. Levenshtein distance. Jaro-Winkler. Cosine similarity. They treat the problem like a math puzzle.
It’s not.
Entity resolution fails because the input data is bad. Not because the matching logic is wrong.
Think about it. Your CRM says “Acme Corp.” Your finance team booked “ACME CO.” Your compliance system has “Acme Corporation, New York.” Three records. Same company. Nobody knows.
Now multiply that by 50,000 counterparties across 30 countries. Each with its own naming conventions, legal form abbreviations, and character sets.
Fuzzy matching tries to solve this by comparing strings. It scores how similar two names look. But similarity is not identity. A 92% match score doesn’t mean you’re 92% sure it’s the same company. It means the letters are 92% alike. Those are very different things.
The real fix isn’t a better algorithm. It’s better data.
When every company record includes a government-issued registration number and jurisdiction code, you don’t need fuzzy matching. You have a deterministic identifier. The kind that doesn’t change when someone misspells a name.
That’s what this article is about. Not another tutorial on string comparison. A fundamentally different approach to resolving company entities — one that starts at the source.
2. How Fuzzy Matching Actually Works (And Where It Breaks)
Before we explain the alternative, let’s be fair to fuzzy matching. It’s not useless. It solves a real problem. But it was designed for a world where unique identifiers don’t exist or can’t be trusted.
Here’s how it works at a high level:
The main algorithms
- Levenshtein Distance counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another. “Smith” to “Smyth” = 1 edit. Works well for typos. Falls apart with abbreviations.
- Jaro-Winkler gives extra weight to matches at the start of the string. Better for names. Still struggles with reordered words or foreign transliterations.
- Cosine Similarity converts strings into token vectors and measures the angle between them. Good for company names where word order varies. Breaks when names overlap but entities are different.
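To make two of these concrete, here is a minimal Python sketch of Levenshtein distance and token-based cosine similarity. It is an illustration under simple assumptions (whitespace tokenization, raw word counts), not how any particular matching product implements them. The company names are made up.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def token_cosine(a: str, b: str) -> float:
    """Cosine of the angle between the word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# A typo costs one edit; an expanded legal form costs many.
print(levenshtein("Smith", "Smyth"))
print(levenshtein("Siemens AG", "Siemens Aktiengesellschaft"))
# Word order does not affect the token-based score.
print(token_cosine("Acme Trading Holdings", "Holdings Acme Trading"))
```

The second call shows the abbreviation problem: the strings name the same legal form, yet the edit distance is large.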
Where it breaks for company data
Here’s the thing. None of these algorithms was designed with company registries in mind. Jaro-Winkler, for instance, came out of record linkage on person names. Company data breaks them in specific ways:
- Legal form variations. “Siemens AG” vs “Siemens Aktiengesellschaft” vs “Siemens Limited.” Same group. Different legal entities in different countries. Fuzzy matching sees high similarity. A compliance officer needs to know they’re distinct legal persons.
- Non-Latin scripts. “シーメンス株式会社” is Siemens Japan. It shares no characters with “Siemens”, so a character-based comparison will not catch that without transliteration.
- Common name collisions. There are over 3,000 companies registered as some variation of “First National” across US state registries. Fuzzy matching thinks they’re all related. They’re not.
- Abbreviation chaos. “JPM” vs “JPMorgan” vs “J.P. Morgan Chase & Co.” vs “JPMorgan Chase Bank, N.A.” The edit distance between these varies wildly, but they’re all part of the same corporate family.
- The confidence score trap. An 85% match score doesn’t mean 85% probability of being correct. It means the two strings are 85% similar under whatever metric produced the score. Two completely different companies can score 90%+ if their names share common words.
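The score trap is easy to reproduce. The sketch below uses Python’s standard `difflib.SequenceMatcher` as a stand-in for a fuzzy matcher. Both records, their names, and their registration numbers are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]. It says nothing about legal identity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical records: the names are nearly identical, the registry keys are not.
record_a = {"name": "First National Trust of Riverton",
            "jurisdiction": "US-NE", "reg_no": "A-0001"}
record_b = {"name": "First National Trust of Riverdale",
            "jurisdiction": "US-TX", "reg_no": "B-0002"}

score = name_similarity(record_a["name"], record_b["name"])
same_entity = (record_a["jurisdiction"], record_a["reg_no"]) == (
    record_b["jurisdiction"], record_b["reg_no"])

print(f"string similarity: {score:.2f}")
print(f"same registered entity: {same_entity}")
```

A threshold of 85% would merge these two records. The registry keys say they are different companies.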
In compliance and KYB workflows, false positives create manual review backlogs. False negatives create regulatory risk. Neither is acceptable at scale.
3. What Registry Identifiers Are and Why They Exist
Every country that requires company registration assigns identifiers to each entity. These are not marketing names. They’re legal identifiers issued by a government authority.
They don’t change when a company rebrands. They don’t vary by who entered the data. They’re stable, unique within their jurisdiction, and machine-readable.
Primary identifiers
These are the most reliable data points for matching. They’re assigned at incorporation and stay the same for the entity’s entire lifecycle:
| Identifier | Issuer | Example | Coverage |
|---|---|---|---|
| Company Registration Number | National or state registry (Companies House, US Secretaries of State, etc.) | UK: 07973805 | Country-specific |
| Tax ID / EIN / VAT | Tax authority | US EIN: 13-4922250 | Country-specific |
| LEI (Legal Entity Identifier) | GLEIF-accredited organizations | 5493001KJTIIGC8Y1R12 | ~2.5M entities globally |
| D-U-N-S Number | Dun & Bradstreet | 15-048-3782 | ~500M+ entities |
| Zephira Company ID | Zephira.ai | ZPH-GB-07973805 | 100+ countries |
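Some identifiers can be sanity-checked before any lookup. The LEI, for example, is 18 alphanumeric characters followed by two check digits computed with ISO 7064 MOD 97-10. Here is a small sketch of that check in Python. Passing it only shows the code is well-formed, not that the LEI has been issued or is current.

```python
import re

# 18 alphanumeric characters followed by 2 numeric check digits.
LEI_PATTERN = re.compile(r"[A-Z0-9]{18}[0-9]{2}")


def is_valid_lei(lei: str) -> bool:
    """Check the 20-character shape and the MOD 97-10 check digits."""
    lei = lei.strip().upper()
    if not LEI_PATTERN.fullmatch(lei):
        return False
    # Letters map to two-digit numbers (A=10 ... Z=35); digits map to themselves.
    numeric = "".join(str(int(ch, 36)) for ch in lei)
    return int(numeric) % 97 == 1


print(is_valid_lei("5493001KJTIIGC8Y1R12"))  # the example LEI from the table
print(is_valid_lei("5493001KJTIIGC8Y1R13"))  # same LEI with the last digit altered
```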
Secondary identifiers
These add context but aren’t definitive on their own. They help confirm a match when primary identifiers aren’t available:
- Registered address and operational addresses
- Director and officer names
- Incorporation date
- Legal form (Ltd, GmbH, S.A., Inc.)
- Industry codes (NAICS, NACE, SIC)
- Website domain
Key insight: Governments solved entity resolution decades ago. Every company has a unique registration number in its home jurisdiction. The problem is that this data sits in 200+ separate registries, in different formats, behind different APIs. The solution isn’t a better matching algorithm. It’s a platform that aggregates and normalizes these identifiers at the source.
4. Deterministic vs Probabilistic: When to Use Which
Not every entity resolution scenario is the same. Sometimes you have a registration number. Sometimes all you have is a company name. The approach should match the data you have.
Deterministic matching
If you have a company registration number and its jurisdiction, you can match with near-perfect accuracy. No scoring. No thresholds. No manual review.
Registration number 07973805 + jurisdiction GB = one exact entity in Companies House. Done.
This works because registration numbers are unique within each jurisdiction. They don’t change. They can’t collide. This is the gold standard for entity resolution in compliance.
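In code, deterministic matching is little more than a keyed lookup. The sketch below assumes a hypothetical in-memory index, a placeholder legal name, and a simple cleanup rule for GB numbers (left-padding all-digit input to 8 characters). A real pipeline would need per-jurisdiction rules.

```python
from typing import Optional

# Hypothetical index keyed by (jurisdiction, registration number).
# The legal name is a placeholder, not the real holder of this number.
REGISTRY = {
    ("GB", "07973805"): {"legal_name": "Example Trading Limited", "status": "active"},
}


def entity_key(jurisdiction: str, reg_no: str) -> tuple[str, str]:
    """Build the composite key after basic cleanup of both parts."""
    jurisdiction = jurisdiction.strip().upper()
    reg_no = "".join(reg_no.split()).upper()
    # Assumption: GB numbers are stored as 8 characters, so all-digit input
    # that lost its leading zeros in a spreadsheet is padded back.
    if jurisdiction == "GB" and reg_no.isdigit():
        reg_no = reg_no.zfill(8)
    return jurisdiction, reg_no


def lookup(jurisdiction: str, reg_no: str) -> Optional[dict]:
    """Exact match or nothing. No scores, no thresholds."""
    return REGISTRY.get(entity_key(jurisdiction, reg_no))


print(lookup("gb", " 7973805 "))  # messy input, same key
print(lookup("DE", "07973805"))   # same number, different jurisdiction
```

The jurisdiction is part of the key on purpose: the same number in another registry is a different entity, or no entity at all.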
Probabilistic matching
When you only have a name and country — or just a name — you need to score candidates and rank them by confidence.
A good probabilistic workflow:
- Normalize the input (strip legal suffixes, standardize casing, transliterate non-Latin scripts)
- Search against the registry for that jurisdiction
- Return top candidates with confidence scores
- Flag low-confidence matches for human review
- Once confirmed, anchor the record to the registration number for all future lookups
The critical difference: probabilistic matching should be a one-time bridge to a deterministic identifier. Not a permanent matching strategy.
Once you’ve resolved a company to its registration number, every future lookup should use that number. You should never re-run fuzzy matching on the same entity twice.
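The workflow above can be sketched in a few lines of Python. Everything here is illustrative: the suffix list is partial, the 0.85 review threshold is arbitrary, the registry rows are invented, and transliteration of non-Latin scripts is left out.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Partial, illustrative list of legal-form suffixes.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "incorporated", "corp", "corporation",
                  "co", "company", "llc", "plc", "gmbh", "ag"}


def normalize(name: str) -> str:
    """Fold accents and case, drop punctuation, strip trailing legal suffixes."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    tokens = re.sub(r"[^\w\s]", " ", text).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def rank_candidates(query: str, country: str, rows: list, review_below: float = 0.85) -> list:
    """Score registry rows for one jurisdiction and flag weak matches for review."""
    q = normalize(query)
    ranked = []
    for row in rows:
        if row["jurisdiction"] != country:
            continue
        score = SequenceMatcher(None, q, normalize(row["legal_name"])).ratio()
        ranked.append({**row, "score": round(score, 3), "needs_review": score < review_below})
    return sorted(ranked, key=lambda r: r["score"], reverse=True)


# Source record id -> (jurisdiction, reg_no). Later lookups use this, not the name.
ANCHORS: dict[str, tuple[str, str]] = {}


def confirm(record_id: str, candidate: dict) -> None:
    """Anchor a confirmed match so the name never has to be matched again."""
    ANCHORS[record_id] = (candidate["jurisdiction"], candidate["reg_no"])


# Hypothetical registry rows.
rows = [
    {"jurisdiction": "GB", "reg_no": "00000001", "legal_name": "Acme Corporation Limited"},
    {"jurisdiction": "GB", "reg_no": "00000002", "legal_name": "Acme Cooperative Society"},
    {"jurisdiction": "US", "reg_no": "9000001", "legal_name": "Acme Corp"},
]
candidates = rank_candidates("Acme Corp.", "GB", rows)
confirm("crm-17", candidates[0])
print(candidates)
print(ANCHORS)
```

Even a top score is only a candidate. The anchor is written after someone, or some rule you trust, confirms it.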
Decision tree
| You Have | Method | Accuracy | Manual Review? |
|---|---|---|---|
| Registration number + jurisdiction | Deterministic | ~99.9% | No |
| Company name + country | Probabilistic → anchor to reg number | 85–95% | Low-confidence only |
| Company name only | Probabilistic + jurisdiction inference | 60–80% | Yes, most results |
| Partial name or abbreviation | Probabilistic + enrichment | 40–65% | Yes, all results |
5. The Cross-Border Problem: Why Global Entity Resolution Is Different
Matching companies within a single country is manageable. Matching across 200+ jurisdictions is a fundamentally different challenge.
No universal company identifier exists
The US has no national company register: companies incorporate at state level, and the EIN is a federal tax ID. The UK uses Company Registration Numbers. Germany uses the Handelsregisternummer. France uses SIREN. Japan uses corporate numbers. Brazil uses CNPJ. Each system is local. None talks to the others.
The LEI system tried to solve this. It’s a 20-character global identifier for legal entities involved in financial transactions. But only about 2.5 million entities have LEIs worldwide. That’s less than 1% of registered companies globally.
Legal names aren’t standardized
The same corporate group can appear as:
- “Siemens Aktiengesellschaft” in the German registry
- “Siemens AG” in a contract
- “Siemens Limited” in the UK
- “Siemens” in your CRM
- “シーメンス株式会社” in Japan
Each of these is a different legal entity. Fuzzy matching might merge them. A registry-first approach knows they’re separate because each has a different registration number in a different jurisdiction.
Format chaos
Dates come in DD/MM/YYYY, MM/DD/YYYY, YYYY-MM-DD, or even just a year. Addresses follow local conventions. Legal forms use local abbreviations. Character encodings vary. Without normalization at the source, every downstream system inherits this mess.
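Dates are a good example of why normalization needs to know the source. “03/04/2012” is 3 April in one registry and 4 March in another. Here is a minimal sketch, assuming a hypothetical per-jurisdiction table of day/month order. It refuses to guess when the jurisdiction is unknown.

```python
from datetime import date, datetime

# Assumption for illustration: day/month order used by each source.
DAY_FIRST = {"GB": True, "DE": True, "US": False}


def normalize_date(raw: str, jurisdiction: str) -> dict:
    """Return an ISO 8601 value plus the precision the source actually gave."""
    raw = raw.strip()
    if raw.isdigit() and len(raw) == 4:
        return {"value": raw, "precision": "year"}  # year-only records stay year-only
    try:
        return {"value": date.fromisoformat(raw).isoformat(), "precision": "day"}
    except ValueError:
        pass  # not YYYY-MM-DD, fall through to the slash formats
    if jurisdiction not in DAY_FIRST:
        raise ValueError(f"ambiguous date {raw!r}: unknown jurisdiction {jurisdiction!r}")
    fmt = "%d/%m/%Y" if DAY_FIRST[jurisdiction] else "%m/%d/%Y"
    return {"value": datetime.strptime(raw, fmt).date().isoformat(), "precision": "day"}


print(normalize_date("03/04/2012", "GB"))
print(normalize_date("03/04/2012", "US"))
print(normalize_date("2012", "DE"))
```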
This is why registry-first platforms matter. They normalize data at the point of collection — not after the fact.
6. How to Build an Entity Resolution Workflow Using Registry Data
Here’s a practical workflow that replaces fuzzy matching with registry-anchored resolution:
Step 1: Collect what you have
For each entity in your system, gather whatever identifiers you already have. Company name, country, registration number, tax ID, LEI, domain — anything. The more you have, the faster the resolution.
Step 2: Match against the registry via API
Send your data to a registry API. Three paths:
- If you have a registration number + jurisdiction: Deterministic lookup. Instant. Exact.
- If you have a name + country: The API runs a normalized search against the registry and returns ranked candidates with confidence scores.
- If you have minimal data: The API searches across relevant jurisdictions and returns a candidate list for review or automated filtering.
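The routing between these paths is simple enough to sketch. The function below is hypothetical: the field names and the shape of the returned plan are assumptions for illustration, not any provider’s request format.

```python
from typing import Optional


def plan_lookup(name: Optional[str] = None,
                country: Optional[str] = None,
                reg_no: Optional[str] = None) -> dict:
    """Pick a resolution path from whichever fields are present."""
    if reg_no and country:
        return {"path": "deterministic",
                "query": {"jurisdiction": country.upper(), "reg_no": reg_no},
                "manual_review": "none"}
    if name and country:
        return {"path": "probabilistic",
                "query": {"name": name, "jurisdiction": country.upper()},
                "manual_review": "low-confidence candidates"}
    if name:
        return {"path": "cross-jurisdiction",
                "query": {"name": name},
                "manual_review": "most candidates"}
    raise ValueError("need a name, or a registration number with its jurisdiction")


print(plan_lookup(reg_no="07973805", country="gb"))
print(plan_lookup(name="Acme Corp.", country="US"))
print(plan_lookup(name="Acme"))
```

The registration number wins whenever it is present with a jurisdiction, even if a name is supplied too.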
Step 3: Enrich the matched record
Once matched, pull the full entity profile from the registry:
- Legal name (as registered, not as typed in your CRM)
- Registration number and jurisdiction code
- Company status (active, dissolved, dormant, in liquidation)
- Registered address and operational addresses
- Directors and officers
- Beneficial ownership / UBO (where the registry publishes it)
- Filing history and incorporation date
- Corporate linkages (parent, subsidiary, branch relationships)
Step 4: Anchor and monitor
Store the registration number as the primary key in your system. This is now the anchor. Every future reference to this entity should resolve to this identifier — not a name string.
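One way to picture the anchor is a table keyed on jurisdiction plus registration number, with name strings stored as aliases that point at it. Here is a sketch using Python’s built-in `sqlite3`. The schema, the jurisdiction code, and the registration number are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE entity (
    jurisdiction TEXT NOT NULL,
    reg_no       TEXT NOT NULL,
    legal_name   TEXT NOT NULL,
    PRIMARY KEY (jurisdiction, reg_no)
);
CREATE TABLE alias (
    name         TEXT NOT NULL,
    jurisdiction TEXT NOT NULL,
    reg_no       TEXT NOT NULL,
    FOREIGN KEY (jurisdiction, reg_no) REFERENCES entity (jurisdiction, reg_no)
);
""")

# One registered entity, three spellings seen across internal systems.
conn.execute("INSERT INTO entity VALUES (?, ?, ?)",
             ("US-NY", "1234567", "Acme Corporation"))
conn.executemany("INSERT INTO alias VALUES (?, 'US-NY', '1234567')",
                 [("Acme Corp.",), ("ACME CO.",), ("Acme Corporation, New York",)])


def resolve(name: str):
    """Map a known name string to the anchored entity row, or None."""
    return conn.execute(
        """SELECT e.jurisdiction, e.reg_no, e.legal_name
           FROM alias a
           JOIN entity e ON e.jurisdiction = a.jurisdiction AND e.reg_no = a.reg_no
           WHERE a.name = ?""",
        (name,),
    ).fetchone()


print(resolve("ACME CO."))

# A rebrand updates the legal name. The key and every alias stay valid.
conn.execute("UPDATE entity SET legal_name = 'Acme Global Inc.' "
             "WHERE jurisdiction = 'US-NY' AND reg_no = '1234567'")
print(resolve("Acme Corp."))
```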
Set up monitoring. Companies change. Directors resign. Entities get dissolved. Names and statuses change. A registry-first platform can push alerts when these changes happen, so your data stays current without periodic batch reloads.
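Under the hood, change monitoring comes down to comparing two snapshots of the same anchored entity. A minimal sketch follows, with hypothetical field names and invented snapshot data. How a platform actually delivers the alert (webhook, queue, email) is out of scope here.

```python
WATCHED_FIELDS = ("legal_name", "status", "registered_address", "directors")


def diff_snapshots(old: dict, new: dict, watched=WATCHED_FIELDS) -> list:
    """List the watched fields whose values differ between two snapshots."""
    changes = []
    for field in watched:
        if old.get(field) != new.get(field):
            changes.append({"field": field, "old": old.get(field), "new": new.get(field)})
    return changes


# Hypothetical snapshots of one entity, taken at two points in time.
last_month = {
    "legal_name": "Example Trading Limited",
    "status": "active",
    "registered_address": "1 High Street, London",
    "directors": ["A. Example", "B. Sample"],
}
today = {
    "legal_name": "Example Trading Limited",
    "status": "in liquidation",
    "registered_address": "1 High Street, London",
    "directors": ["A. Example"],
}

for change in diff_snapshots(last_month, today):
    print(change)
```

Because both snapshots are keyed by the same registration number, there is no question about whether they describe the same company.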
7. Fuzzy Matching vs Registry Identifiers: Head-to-Head Comparison
Here’s how the two approaches compare across the dimensions that matter for enterprise compliance and KYB:
| Dimension | Fuzzy Matching | Registry Identifiers |
|---|---|---|
| Match accuracy | 60–90% (varies by data quality) | 99.9% with registration number |
| False positive rate | High (common names, abbreviations) | Near zero (unique ID per jurisdiction) |
| False negative rate | Moderate (non-Latin, abbreviations) | Near zero (if registry is sourced) |
| Cross-border support | Poor (no standard across countries) | Strong (each registry has its own IDs) |
| Audit trail | Confidence scores (hard to explain) | Registry number + jurisdiction (deterministic proof) |
| Speed at scale | Slows with dataset size | Constant-time lookup |
| Requires training data | Yes (for ML-based approaches) | No |
| Handles name changes | Breaks (old name ≠ new name) | Works (registration number stays the same) |
| Data provenance | Unknown (depends on input source) | Government registry (highest authority) |
| Ongoing monitoring | Manual re-runs needed | Automated alerts on entity changes |
When fuzzy matching still makes sense
Fuzzy matching is the right tool when:
- You’re matching consumer records (people, not companies) where no universal ID exists
- You’re deduplicating internal datasets before resolving against an external source
- You’re doing initial candidate generation before deterministic verification
- Your data is so messy that you can’t identify the jurisdiction (rare, but it happens)
When it doesn’t
For KYB, compliance, onboarding, and any workflow where you need to prove to a regulator which entity you verified — fuzzy matching isn’t enough. You need a deterministic link to an authoritative source. That means registry identifiers.
8. What to Look for in an Entity Resolution API
Not all entity resolution APIs are equal. Here’s a checklist for evaluating providers:
Must-haves
- Registry-sourced data. The API should pull from government registries, not scraped or aggregated sources. Ask where the data comes from. If the answer is vague, move on.
- Deterministic + probabilistic matching. Support both. Deterministic for when you have IDs. Probabilistic for when you don’t. Confidence scores on probabilistic results.
- Global coverage. 100+ countries minimum. Ask specifically about the countries your business operates in.
- Normalized output. Legal forms, addresses, industry codes, and names should be normalized into a consistent schema. You shouldn’t have to do this yourself.
- UBO and ownership data. If the API can’t return beneficial ownership structures where published, it’s incomplete for compliance use cases.
- Monitoring and alerts. Companies change. Directors resign. Entities dissolve. The API should push change notifications, not require you to re-query manually.
Red flags
- No data provenance. If the provider can’t tell you which registry a data point came from, your audit trail is broken.
- Stale data. Ask about refresh frequency. If it’s quarterly or annual, it’s not suitable for onboarding or continuous monitoring.
- Restrictive licensing. Some providers limit how you can use the data — no resale, no derivative products, no customer-facing displays. Check the terms.
- No API documentation. If you can’t review endpoint specs, authentication, and response schemas before buying, that’s a red flag.
9. How Zephira.ai Solves Entity Resolution at Scale
Zephira.ai is built for teams that need to resolve, verify, and monitor company entities across 100+ countries — without building a fuzzy matching pipeline from scratch.
Here’s how it works:
Registry-first data
Zephira sources company data directly from government registries worldwide. Not scraped. Not aggregated from third-party datasets. Every data point links back to its official filing.
This means the registration numbers, legal names, addresses, and statuses you get through the API are the same ones the government authority has on record.
Deterministic + probabilistic matching
Send a registration number and jurisdiction → instant deterministic match.
Send a company name and country → normalized search against the registry with ranked candidates and confidence scores.
Send minimal data → cross-jurisdiction search with a candidate list for review or automated rules.
Full entity enrichment
Once matched, every record is enriched with:
- Legal name, trading name, and historical names
- Registration number, jurisdiction code, and Zephira Company ID
- Company status (active, dissolved, dormant, in liquidation)
- Registered and operational addresses
- Directors, officers, and authorized signatories
- Beneficial ownership / UBO (where published by registries)
- Corporate linkages — parent, subsidiary, and branch relationships
- Industry codes (NAICS, NACE, SIC — normalized)
- Revenue and employee estimates
- Filing history and incorporation date
Monitoring
Zephira’s monitoring API watches your matched entities and pushes alerts when something changes. A director resignation. A status change to dissolved. An address update. A name change.
You set the rules. Zephira watches the registries. No batch reloads. No manual re-queries.
API-first delivery
Everything is available via REST API. Bulk resolution, webhook notifications, batch processing, autocomplete for onboarding forms, and credit reports. Full documentation. No black boxes.
Ready to replace fuzzy matching with registry-sourced resolution?
Request a demo at zephira.ai or explore the API documentation at zephira.ai/api.
10. FAQs: Entity Resolution
- What is entity resolution?
Entity resolution is the process of identifying whether two or more records refer to the same real-world entity. For company data, this means determining if “Acme Corp.” in your CRM and “Acme Corporation” in a contract are the same legal entity. The goal is a single, accurate “golden record” for each entity across all your systems.
- What’s the difference between entity resolution and deduplication?
Deduplication removes duplicate records within a single database. Entity resolution goes further: it matches records across multiple databases, systems, or external sources to a single real-world entity. Deduplication is one step within the broader entity resolution process.
- Why does fuzzy matching fail for company data?
Fuzzy matching compares strings character by character. It was designed for person names, not company data. It fails with legal form variations (“Ltd” vs “Limited” vs “GmbH”), non-Latin scripts, common name collisions (thousands of “First National” entities), and abbreviations. A high similarity score doesn’t prove two records are the same legal entity.
- What are registry identifiers?
Registry identifiers are unique codes assigned to companies by government authorities at the time of incorporation. Examples include Company Registration Numbers (UK), EINs (US), Handelsregisternummer (Germany), and CNPJ (Brazil). They don’t change when a company rebrands and are unique within their jurisdiction.
- How accurate is deterministic matching compared to probabilistic matching?
Deterministic matching using a registration number and jurisdiction code achieves near-perfect accuracy (~99.9%). Probabilistic matching based on company names typically achieves 60–95% accuracy depending on data quality, and requires confidence thresholds and manual review for borderline cases.
- Can entity resolution work across countries?
Yes, but it’s significantly harder. There’s no universal company identifier. Each country uses its own registration system, naming conventions, legal form abbreviations, and data formats. Cross-border entity resolution requires a platform that aggregates and normalizes data from registries in every jurisdiction you operate in.
- What’s the best entity resolution method for KYB and compliance?
Registry-anchored deterministic matching. Compliance teams need to prove which entity they verified, with an audit trail that links to an authoritative source. Fuzzy matching confidence scores aren’t sufficient evidence for regulators. A registration number from a government registry is.
- How do I verify a company if I don’t have a registration number?
Use a probabilistic search against the registry for the company’s jurisdiction. Send the company name and country to a registry API. The API normalizes the input, searches the registry, and returns ranked candidates with confidence scores. Once you confirm the right match, anchor the record to the registration number for all future lookups.
- What is a golden record?
A golden record is the single, most accurate and complete representation of an entity across all your systems. It combines the best data from every source into one definitive profile. In a registry-first approach, the golden record is anchored to a government-issued registration number, the most authoritative identifier available.
- How does Zephira.ai handle entity resolution?
Zephira.ai sources company data directly from government registries in 100+ countries. It supports deterministic matching (registration number + jurisdiction), probabilistic matching (name + country with confidence scores), and full enrichment (legal name, status, directors, UBO, corporate linkages). Matched entities are monitored for changes with automated alerts. Everything is delivered via REST API.