How do you handle production webhook delivery reliability in your apps?
Tanjim
a day ago
Hey everyone,

I’ve been thinking a lot about webhook delivery reliability lately. In many projects I’ve worked on, building robust webhook infra turned out to be deceptively complex:

- Retry logic (exponential backoff, timeouts)
- Handling non-2xx responses
- Delivery monitoring and alerting
- Back-pressure or queueing to avoid overwhelming receivers
- Secure signing and validation flows
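To make the retry-logic item concrete, here is a minimal Python sketch of capped exponential backoff with full jitter, plus a rule for which responses are worth retrying. The names `backoff_delays` and `should_retry`, the default values, and the choice to retry 408/429/5xx are illustrative assumptions, not anything a particular provider mandates.

```python
import random


def backoff_delays(max_attempts=5, base=1.0, cap=60.0, rng=random.random):
    """Yield one delay in seconds per retry: capped exponential backoff with full jitter.

    All defaults are illustrative assumptions; tune them per receiver.
    """
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
        yield rng() * ceiling                      # full jitter: uniform in [0, ceiling)


def should_retry(status_code):
    """Assumed policy: retry timeouts (None), 408, 429, and 5xx; other non-2xx are permanent."""
    if status_code is None:
        return True
    return status_code in (408, 429) or 500 <= status_code < 600
```

Jitter matters because many deliveries that fail together during an outage would otherwise all retry at the same instants and hit the recovering receiver in waves.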

In one project, a failed webhook delayed payment processing for hours because the retry logic was buggy. Another time, burst traffic took down the receiver endpoint, and we had no dead-letter queue (DLQ) strategy in place.
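As a rough illustration of what a DLQ strategy means here, the Python sketch below tries a delivery, retries on a fixed schedule, and parks the event in a dead-letter store once attempts are exhausted, so it can be inspected and replayed instead of silently lost. The `send` callable, the in-memory `dead_letters` deque, and the `deliver` function are hypothetical stand-ins; a real system would use a durable queue.

```python
import time
from collections import deque

# Assumption: an in-memory stand-in for a durable dead-letter queue.
dead_letters = deque()


def deliver(event, send, delays, sleep=time.sleep):
    """Call send(event) once, then once more after each delay; dead-letter on exhaustion.

    send is a hypothetical callable that returns an HTTP status code
    or raises (for example on a timeout).
    """
    last_error = None
    for delay in [0.0] + list(delays):
        if delay:
            sleep(delay)
        try:
            status = send(event)
        except Exception as exc:  # timeout, connection reset, etc.
            last_error = repr(exc)
            continue
        if 200 <= status < 300:
            return True
        last_error = f"HTTP {status}"
    # Out of attempts: keep the event and the last failure for later replay.
    dead_letters.append({"event": event, "error": last_error})
    return False
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.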

I’ve been researching different approaches teams here use:

- Do you build your own custom webhook delivery queue and monitoring system?
- Use cloud solutions like AWS EventBridge or Step Functions to orchestrate?
- Or integrate third-party tools that handle delivery, retries, and observability for you?

I’m curious about how you ensure production-grade reliability at scale without burning dev hours on plumbing. Recently, I’ve been working on a tool in this space to handle these issues automatically, but would love to hear:

- What architecture have you found most reliable?
- What are the edge cases you’ve encountered (e.g. signature mismatches, downstream outages)?
- Any horror stories or lessons learned from webhook failures in production?
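On the signature-mismatch edge case specifically, a common cause is verifying against a re-serialized body instead of the raw bytes that were signed. The Python sketch below shows one assumed scheme (HMAC-SHA256 over `timestamp.body`, hex-encoded, with a replay window); the message layout, the 300-second tolerance, and the function names are illustrative assumptions rather than any specific provider's format.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Assumed scheme: hex HMAC-SHA256 over b"<timestamp>.<raw body>"."""
    msg = str(timestamp).encode() + b"." + body
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def verify(secret, timestamp, body, signature, tolerance=300, now=None):
    """Return True only if the signature matches the raw body and the timestamp is fresh."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance:  # reject stale or replayed deliveries
        return False
    expected = sign(secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

For example, a signature computed over `b'{"amount": 100}'` will not verify against `b'{"amount":100}'`, which is what tends to happen when a framework parses and re-encodes the JSON before the check runs.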

Looking forward to learning from your experiences and best practices around webhook infra!

tasn, a day ago
Very biased, but I think you should just use Svix[1].

Though if you're interested, I recorded a video about webhook architecture a while back that you may find useful: https://m.youtube.com/watch?v=4jvV75OD620

1: https://www.svix.com

kasey_junk, replying to tasn, a day ago
I’m not affiliated with svix but a happy customer.

It has just worked for us in production for years. We’ve never had an issue.

Now, our use case is pretty simple, but for us it’s a piece of infrastructure we never worry about.

leakycap, a day ago
I think your questions beg another: where can we just take out this layer of complexity, and how?

Sometimes rather than chasing edge cases, I find another way to do the same thing using a routine or library that already has all the edge cases ironed out.

If you're a small team or one person, you can't expect to stay on top of something that starts broken.

ezekg, replying to leakycap, 21 hours ago
Totally agree. For me, with a vanilla Rails app, I leaned on Sidekiq to handle webhook queueing, processing, and retries: https://keygen.sh/blog/how-to-build-a-webhook-system-in-rails-using-sidekiq/

It's scaled quite well. Billions of webhooks. I barely ever think about it.