0: https://manifold.net/doc/mfd8/zip_codes_are_not_areas.htm
They don't represent geography at all; they represent the organizational structure of USPS.
They work by making the address on a letter almost meaningless. For some smaller population zip codes you can practically just put the name and zip code down and achieve delivery.
You're not going to wind up with a situation where zip codes with the same regional marker end up on different coasts.
Couldn't this happen for military or proxy codes (PO boxes or other)?
In other words, is it safe to assume that any entity in a zip code is less than x distance away from the closest entity in the same zip code?
Please see: https://opencagedata.com/guides/how-to-think-about-postcodes-and-geocoding
I write this as someone who grew up in the ZIP code 09180
Zip codes don't even need to be contiguous. It's a mail delivery route, not a polygon.
There are 5 cases where the assumption is violated:
- Non-contiguous areas
- Zip codes that are a single point (some big companies get their own zip with a single mailbox, e.g. GE in Schenectady, NY is zip 12345)
- Zip codes that are a single line (highway-based delivery routes)
- Overlapping boundaries (since mail routes are linear, choosing a polygon representation is arbitrary and often not unique in space)
- Residents of some zip codes are not stationary (e.g. houseboats)
In short, asking questions about the area of a zip code is a category error - zip codes do not have a uniform representation in space. And we should be highly skeptical of any geospatial analysis that assumes polygons.
What they do not have is any sort of spatial consistency; they are a convenience for mail sorting. So if you start analyzing patterns across zip codes, you are pulling in information that is likely useless for or harmful to answering your question.
A 5+4 formatted ZIP code maps to just a handful of addresses. In cities with larger populations, the +4 could map to a single building, and in more sparsely populated places, it might include houses on a handful of roads.
For smaller datasets, ZIP+4 might as well be a unique household identifier. I just checked a 10 million address database and 60% of entries had a unique ZIP+4, so one other bit of PII would be enough to be a 99.99% unique identifier per person.
With a geo-coded ZIP+4 database, you could locate people with a precision that's proportional to the population density of their region.
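The uniqueness check described above can be sketched in a few lines. This is a minimal illustration, not the commenter's actual method: the function name `share_with_unique_zip4` and the sample codes are made up for the example.

```python
from collections import Counter


def share_with_unique_zip4(zip4_codes):
    """Return the fraction of records whose ZIP+4 appears exactly once."""
    codes = list(zip4_codes)
    if not codes:
        return 0.0
    counts = Counter(codes)
    # Count records (not distinct codes) that share their ZIP+4 with nobody else.
    unique_records = sum(1 for code in codes if counts[code] == 1)
    return unique_records / len(codes)


# Made-up placeholder values, purely for illustration.
SAMPLE_ZIP4 = ["00001-0001", "00001-0002", "00001-0002", "00002-0001"]
print(share_with_unique_zip4(SAMPLE_ZIP4))
```

A high share means the ZIP+4 column alone comes close to identifying a household, which is the privacy concern raised here.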
Yeah, but any analysis you're likely to perform is approximate enough that the fact that ZIP codes aren't polygons is basically a rounding error.
Plus, it's a lot easier to get ZIP codes, and they're more reliably correct, so you might still get better results than you would with another indicator that is either (a) less reliable or (b) less available.
The big problem is zip codes are defined in terms of convenient postal routes and aren't suitable for most geospatial analysis. Census units, as the article explains, are a much better choice.
Use Addresses; Use Census Units; Use your own Spatial Index
Why not lat, long?
You also have to decide how you'll do that binning. Can bins overlap? What do you do at the poles? H3 provides some reasonable default choices for you so you don't have to worry about that part of your solution design.
Btw, I recently needed to compute the shortest distance from a point to a line defined by two points, all in lat/lon. Does anyone have a lead on how to do it?
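To make those binning decisions concrete, here is a rough sketch of the naive alternative: a fixed-degree lat/lon grid. It is not H3 and does not use the H3 library; `grid_cell`, `cell_width_km`, and the constant are hypothetical names chosen for the example. Cells do not overlap, but their east-west width shrinks toward the poles, which is one of the problems a purpose-built index handles for you.

```python
import math

# Approximate length of one degree of latitude, assuming a spherical
# Earth with a mean radius of 6371 km.
KM_PER_DEGREE = 111.195


def grid_cell(lat, lon, cell_deg):
    """Assign a point to a non-overlapping square cell of cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def cell_width_km(lat, cell_deg):
    """East-west width of a cell at the given latitude."""
    return KM_PER_DEGREE * cell_deg * math.cos(math.radians(lat))


print(grid_cell(47.6, -122.3, 0.5))
print(cell_width_km(0.0, 1.0), cell_width_km(60.0, 1.0))
```

A one-degree cell at 60° latitude is about half as wide as one at the equator, so equal-looking bins cover very different areas.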
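One common approach to this question is the cross-track distance on a sphere. The sketch below makes two assumptions: the Earth is a sphere of radius 6371 km, and the "line" is the full great circle through the two points rather than only the segment between them. The function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius


def _angular_distance(lat1, lon1, lat2, lon2):
    """Haversine angular distance in radians; inputs in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * math.asin(min(1.0, math.sqrt(a)))


def _initial_bearing(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2 in radians; inputs in radians."""
    dlon = lon2 - lon1
    y = math.sin(dlon) * math.cos(lat2)
    x = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(dlon))
    return math.atan2(y, x)


def cross_track_distance_km(point, line_start, line_end):
    """Distance from point to the great circle through line_start and line_end.

    Each argument is a (lat, lon) pair in degrees.
    """
    lat1, lon1 = map(math.radians, line_start)
    lat2, lon2 = map(math.radians, line_end)
    lat3, lon3 = map(math.radians, point)
    d13 = _angular_distance(lat1, lon1, lat3, lon3)
    b13 = _initial_bearing(lat1, lon1, lat3, lon3)
    b12 = _initial_bearing(lat1, lon1, lat2, lon2)
    return abs(math.asin(math.sin(d13) * math.sin(b13 - b12))) * EARTH_RADIUS_KM


# A point one degree north of a line that runs along the equator.
print(cross_track_distance_km((1.0, 5.0), (0.0, 0.0), (0.0, 10.0)))
```

If you need the distance to the segment only, you would additionally compute the along-track distance and fall back to the nearer endpoint when the closest point lies outside the segment.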
CGP Grey has a great video on this: https://m.youtube.com/watch?v=1K5oDtVAYzk
That data linked with the payment method that the register collects pretty much gives the store exactly who you are and where you live even if you chose not to sign up to the store's loyalty program.
No one has census blocks.
And coordinates can work but lack some inherent advantages, such as human readability and a semblance of pop density normalization.
[1]: https://www.npr.org/2025/01/08/1223466587/zip-code-history
- Well-known (everybody knows their zip code)
- Easily extracted (they're part of every address, no geocoding required)
- Uniform-enough (not perfect, but in most cases close)
- Granular-enough
- Contiguous-enough by travel time
Notably, the alternatives the author proposes all fail on one or more of these:
- Census units: almost nobody knows what census tract they live in, and it can be non-trivial to map from address to tract
- Spatial cells: uneven distribution of population, and arbitrary division of space (boundaries pass right through buildings), and definitely nobody knows what S2 or H3 cell they live in.
- Address: this option doesn't even make sense. Yes, you can geocode addresses, but you still need to aggregate by something.
Fact is, a lot of web data contains a zip, but if you can collect something better it will usually yield better results. The exception is when you are analyzing shipments; then zip codes are fine.
Another consideration is what kind of reference information is available at different spatial units. There are plenty of Census Bureau data available by ZCTA but some data may only be available at other aggregate units. Zip Codes are often used as political boundaries.
I'd also mention the "best" areal unit depends on the data. There is a well known phenomenon called the modifiable areal unit problem in which spatial effects appear and vanish at different spatial resolutions. It can sort of be thought of as a spatial variation of the ecological fallacy.
At that point you need something like Smarty[1] to validate and parse addresses.
Similar issues for city name, of course.
According to one commenter on the subject:
It doesn't matter, as long as the zip code is correct
[0]: https://www.city-data.com/forum/boston/601106-mailing-address-jp-dorchester-etc.html
[1]: https://www.city-data.com/forum/boston/601106-mailing-address-jp-dorchester-etc.html
If you get close enough, it usually gets handled in the local sort, but not always.
On cities, the mailing address city really is the name of the post office that handles your delivery route. Often there's a relationship with the city you live in, but there are cases both ways. I used to live outside city limits: we had a census designated place name, a municipal sanitary district, and had a fire department at one time, but never a post office, so our mailing address used the nearby city name, where our post office resided. The place name had an incorporated city on the other side of the state, so using that wouldn't be great.
Nowadays, post offices often have a list of alternative place names, so where I live now, I can pick between the incorporated city name, the nearby large city where a post office that processes all my mail is located, or any of the numerous small post offices that once served my city.
Bigger cities can have multiple post offices and zip codes with the same mail address city.
Most sites/apps will let me override the validator, but a few won't. The most common ones that insist on using the wrong address are financial institutions that say the law requires them to have my proper physical address and therefore they go with the (incorrectly) validated version.
USPS does not do home delivery in our area, and UPS/FedEx/etc. usually figure it out given that street numbers alone uniquely identify properties in our town.
I’d love to have you email your mailing address to support@smarty.com with a link to this HN thread. We may be able to help fix some of this.
We have non-postal addresses and a lot of other mechanisms to help here. We also have contacts at the USPS and others to help fix addresses.
The US Postal Service has a team of people that handle address updates. This team is localized to different regions so that they are generally aware of local nuances. If you need to talk to the USPS about getting an address issue resolved, simply go to this USPS AMS site and enter your ZIP code to find the team that handles addresses in that area:
https://postalpro.usps.com/ppro-tools/address-management-system
If they don't answer, leave a message. They have helped me thousands of times in my last 14 years working with address validations.
And in this case the fire companies had no problem finding my house in spite of the incorrect information in town records. As you suggest the field people on the ground generally know what the ground truth is.
Every customer I've worked with insisted on having all addresses run through the USPS verification API so they could get their bulk mailing discounts.
Even if you get the delivery/cost side under control, you still have to make sure you are talking about the right address from a logical perspective. Mailing, physical, seasonal, etc. address types add a whole extra dimension of fun.
Regarding the article, whether to use ZIP Code (TM), postal code, Canada Post Forward Sortation Area, lat/lon, Census Bureau block and tract, etc. really depends on the use case.
As has been noted, the ZIP Code is often good enough for aggregating data together and can be a good first step if you don’t know where to start.
The real problem is ever using an average without also specifying some sort of bounds. For median-based data, this probably means the upper and lower quartiles (or possibly other percentiles); for mean-based data, this probably means standard deviation.
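As a small illustration of reporting bounds alongside an average, the sketch below uses only Python's standard `statistics` module. The `summarize` helper and its output keys are hypothetical names; the quartiles use the module's default "exclusive" method, and the input needs at least two values.

```python
import statistics


def summarize(values):
    """Report central tendency together with a measure of spread."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "median": median,
        "q1": q1,          # lower quartile, pairs with the median
        "q3": q3,          # upper quartile, pairs with the median
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),  # sample std dev, pairs with the mean
    }


print(summarize([1, 2, 3, 4, 5, 6, 7]))
```

Applied per ZIP code (or per any other aggregation unit), a wide gap between `q1` and `q3` is a hint that the single average is hiding a lot of variation inside the unit.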
The functionality of it is closer to the "Zip+4" with extension used to have a more granular routing of physical mail for USPS.
But broadly speaking, nobody knows what their ZIP+4 is, while I imagine that most people in Canada know their postal code by heart.
It is interesting.
To the point that StatCan and other agencies have rules on the number of characters that are collected/disseminated with other data to make sure it's not too identifying:
* https://www12.statcan.gc.ca/nhs-enm/2011/ref/DQ-QD/guide_2-eng.cfm
And don’t forget sales tax. Which is state + county + city
Zip and distance as the crow flies often give shit data. My zip suggests I'm off in bum fuck, and since I'm on the Puget Sound, things that are relatively near as the crow flies can actually be hours away.
That's only true if you can also access the spatial boundaries of the zipcodes themselves.
In Australia, this turns out not to be true: the postal system considers their boundaries to be commercial confidential information and doesn't share them. The best we can do is the Australian Bureau of Statistics' approximations of them, which they dub "postal areas".
This makes joining disparate data sources quite easy. And this also lets you do all sorts of cool stuff like aggregations, smoothing, flow modeling, etc.
We do some geospatial stuff and I wrote a polars plugin to help with this a while back [1].
Anything else is a loose correlation at best, that will likely change over time.
Though ultimately it was far too granular (for example the Bay Area would be so many different zip codes). Instead we went with Nielsen's DMA (Designated Market Area) mappings within the US to abstract aggregated data a bit better. And of course this DMA dataset also had a different original use case. It was used for TV / media market surveys so it has some weird vestiges. Some regions are grouped very far and wide (you'll notice there's a bit of Denver within Nevada and it's just a remnant of how it used to be categorized), but it still provides a bit of a broader level grouping than something acute like zip code.
I do like this map from the article though and the granularity you can get with zip code when zooming: https://clausa.app.carto.com/map/29fd0873-64cb-42a6-a90d-c83a8840bbfe?lat=37.176174&lng=-120.862076&zoom=7
We've also been considering using Combined Statistical Areas using population instead. This is something that is under way, and in the interim we've considered charting styles that don't necessarily need borders (for example this bubble map: https://www.levels.fyi/bubble-plot/europe/). The benefit with DMAs is that it offers full border coverage of the entire US whereas some hubs can still be missing from CSAs if relying on a population threshold. But the plan is to create some of our own regional definitions and borders using our own submissions combined with population. Will be an interesting project.
GeoJSON data for the map borders: https://github.com/PublicaMundi/MappingAPI/blob/master/data/geojson/us-states.json
Nielsen DMA regions: https://blocks.roadtolarissa.com/simzou/6459889
The alternatives that the author suggests are much more complicated, both in terms of the implementation and in terms of convincing the user to give you their full address.
http://federalgovernmentzipcodes.us/free-zipcode-database-Primary.csv
Which is derived from longitude and latitude.
Always sad when these schemes don't include a check digit in them though, even if the layout of this one gets typo'd codes pretty close to their intended destination.
Reading their alternatives, it strikes me that "ZCTAs are the worst form of small area aggregation except for all others."
It's not a great geography to use, but it is quite useful if you know its limitations and inaccuracies when you get into it. Stuff like multipolygon entities, island-polys, etc. isn't fun to resolve but can be accounted for.
Add to that the fact that ZCTAs historically follow some sort of actual boundary (rivers, highways, etc.), so they can tell a story in a way Census tracts can't.
a. Excel treats them as numbers instead of strings of digits and thus drops the leading 0
b. Developers make assumptions about postal codes based on how they work (or more usually how the developer incorrectly thinks they work) in their own country and these assumptions absolutely do NOT hold in other countries.
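For the leading-zero problem in (a), a minimal repair sketch follows. It assumes US ZIP codes only (per (b), this logic should not be applied to other countries' postal codes), and `normalize_us_zip` is a hypothetical helper, not part of any library. The safer fix is to keep the column as text from the start so the zero is never dropped.

```python
def normalize_us_zip(value):
    """Restore leading zeros on a US ZIP or ZIP+4 that was stored as a number.

    Accepts an int or a string; raises ValueError for anything that does not
    look like a 5-digit ZIP or a hyphenated ZIP+4.
    """
    text = str(value).strip()
    if "-" in text:
        base, plus4 = text.split("-", 1)
        if not (base.isdigit() and plus4.isdigit()
                and len(base) <= 5 and len(plus4) <= 4):
            raise ValueError(f"not a US ZIP+4: {value!r}")
        return f"{base.zfill(5)}-{plus4.zfill(4)}"
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a US ZIP code: {value!r}")
    return text.zfill(5)


print(normalize_us_zip(2134))      # a ZIP that a spreadsheet turned into a number
print(normalize_us_zip("09180"))   # already a string, left unchanged
```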
A relevant guide to geocoding and postal codes: https://opencagedata.com/guides/how-to-think-about-postcodes-and-geocoding
Until very recently I naively assumed that the area of a given zip code would be entirely within the area of some single city or town which would then be entirely within the area of a single county.
It was quite a rude awakening working with software that tries to apply the correct local taxes to a given address and finding that the statement “A given X can contain multiple Y” is true for every possible combination of zip, city, and county.
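A quick way to test that assumption against your own address data is to group one field by another and look for keys with more than one value. The sketch below is illustrative only: `multi_valued` is a hypothetical helper and the rows are made-up placeholders, not real places.

```python
from collections import defaultdict


def multi_valued(records, key_field, value_field):
    """Return keys that map to more than one distinct value."""
    seen = defaultdict(set)
    for row in records:
        seen[row[key_field]].add(row[value_field])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Made-up placeholder rows, purely for illustration.
SAMPLE_ROWS = [
    {"zip": "Z1", "city": "A", "county": "X"},
    {"zip": "Z1", "city": "B", "county": "Y"},
    {"zip": "Z2", "city": "A", "county": "X"},
    {"zip": "Z3", "city": "B", "county": "X"},
]

for key, value in [("zip", "city"), ("zip", "county"), ("city", "county"),
                   ("city", "zip"), ("county", "zip"), ("county", "city")]:
    print(key, "->", value, multi_valued(SAMPLE_ROWS, key, value))
```

If any of the six results is non-empty for real data, a lookup table keyed on that field alone cannot give a single correct tax jurisdiction.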
It's so well written and informative that I completely didn't mind the "and here's how to do it in Carto" bit in the middle. Instead I thought they earned it.
I didn't need much precision so truncating seemed an easy way to group stuff.
Oh the surprise. I never again made such assumptions, let's just say I should have gotten a clue from Corsica being 2A and 2B.
Edit: I wanted to point out that I recall that ESRI maps used to come “out of the box” with zip code polygon layers. While I agree they are technically not polygons in the strictest sense, they often are or they are fully closed shapes or close enough to it - and even if they are missing a few nodes to make a complete polygon, whoever did the digitizing probably manually closed the loop so to speak. Remember, geospatial maps are used for many different purposes, likely none of them having anything to do with postal routes, so in that sense they are “good enough” for most purposes.
It was a really useful platform for uploading spatial data with a decent range of visualisation tools that didn't need code. You could do SQL if you wanted.
Then they got rid of the free tier, and set the cheapest tier at (iirc) $150 USD per month. And that was the end of that.
[1] https://techbio.org/wiki/Addresses/finding-addresses-in-webpages
Stop using zip codes
BANG! Why should I stop using zip codes? Must read!!!
And then you go to the old (2019) page, that appears to be filled with useless clicks and arguments that appear biased.
"Do not use zip codes for geospatial analysis."
If used for other purposes they fall short.
So for example, if you are sorting “rural zips” vs “urban zips” it will only take you so far, and may actually be harmful.
Same goes with MSAs/DMAs (media markets). These have to be used for buying media, but for geospatial analysis they are suboptimal for the same reasons.
Easiest way to dip your toe into the water of something better is to start with A-D census counties.