Probabilistic Geocoder
Resolves noisy, ambiguous & privacy-protected location references from web data into a best-effort polygon & confidence-weight.
- NLP
- Geospatial
- Probabilistic Modeling
Overview
Web data — blogs, social posts, classifieds, news, registries — constantly reference where something happened, but rarely with a clean street address. Locations are implied through landmarks (“in front of the McDonald’s”), relations (“near X”, “between A & B”), or coordinates that are jittered, IP-derived, or otherwise imprecise by design. This project is a probabilistic geocoder: an event-location disambiguator that takes such fuzzy references & returns a best-effort polygon plus a confidence weight, rather than forcing a single false-precision point. It underpins the spatial-resolution layer of Neighbourhood Vibes.
The Challenge
- Location signals are inherently fuzzy — GPS error, untrusted or biased sources (e.g. a promoter’s post tagged with their location instead of the venue), & privacy practices that deliberately jitter or blurr coordinates
- Off-the-shelf geocoders expect a well-formed address or a POI name & return a single coordinate or polygon representing the precise know location for that entity — they can’t resolve contextual references like “in front of the McDonald’s”, “near downtown” or “within 200m”.
- The same place name can refer to multiple real locations (e.g. several areas called “Windsor” within Canada), requiring context to disambiguate
- Any NLP-based location extraction carries false positives & false negatives — the system has to decide what to do with both
- No labeled “true location” dataset exists for this kind of text, so accuracy can’t be checked directly against ground truth
Approach
- Extracted candidate place mentions using traditional NER: gazetteer-based place lists, heuristic/regex patterns (e.g. “-ville”, “-opolis”), grammar rules, & co-reference resolution for relational phrases (“near”, “corner of”)
- Scored every candidate with a 0-100% confidence; mentions below ~50% (e.g. an unqualified “Main Street”) were dropped rather than guessed
- Converted relational phrases into geometric buffers/corridors around the resolved entity, sized by that entity’s level in the geometry hierarchy — e.g. ~200m around a POI like a McDonald’s, ~5km around a city like Montreal
- Disambiguated repeated place names using other context within the same event (e.g. a co-mentioned city); if no such hint existed, the mention was dropped
- Mapped each event to the smallest geometry in the hierarchy (block → neighbourhood → city → metro → province → country) that fully contained it — sometimes multiple geometries for events spanning an area (e.g. a route between two neighbourhoods)
- Stored each event once at its natural-fit level; on read, finer-granularity queries inherit the full event & coarser-granularity queries dilute its weight by the parent area’s size
Results
- Every event resolves to a polygon (or set of polygons) plus a 0-100% confidence weight reflecting how likely it truly occurred there
- False negatives (dropped ambiguous mentions) were an acceptable cost — downstream consumers never assumed a complete view of all events
- False positives were the real risk, particularly systematic ones (e.g. a source that consistently mis-locates events) — random noise washes out at scale via the law of large numbers, but systematic bias compounds instead
- Proven end-to-end as the spatial-resolution layer of Neighbourhood Vibes