Over the last week I hit a cluster of production problems that looked unrelated at first: intermittent deploys that “completed” but left the app unresponsive, email blasts that “finished” with zero sends, and a Sidekiq dashboard screaming 60–70% failure rates. This post is my straight‑from-the-console write‑up of what actually broke, how I tracked it down, and the simple changes that stabilized everything.
If you’re running Rails on Nginx + Passenger with MySQL, Redis, and Sidekiq (and some Delayed Job still hanging around), this will probably feel familiar.
## The symptoms
- Deploy “succeeds,” site feels sluggish or unresponsive afterward.
- Sidekiq shows high failure rates, idle workers, and no throughput.
- Email blasts target thousands of customers but report zero sent.
- Logs full of noise in test and prod, making the real issues easy to miss.
## The real root cause
It wasn’t a single bug. It was a perfect storm:
- A Ruby upgrade changed memory/boot characteristics.
- I had both Sidekiq and Delayed Job running (historical reasons).
- Passenger was running multiple app processes.
- And the big one: the database pool was way too small.
Sidekiq wasn’t “broken.” It was starved. Jobs were dying with:

```
could not obtain a database connection within 5.000 seconds
```
Once I stopped staring at Sidekiq’s failure counter and looked at actual error classes in Retry/Dead sets, the picture was obvious: ActiveRecord connection timeouts across the board.
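Before blaming Sidekiq itself, it’s worth tallying what’s actually in those sets. A minimal console sketch, assuming the standard `sidekiq/api` that ships with Sidekiq:

```ruby
# Count error classes across the Retry and Dead sets (run in a Rails console).
require "sidekiq/api"

errors = Hash.new(0)
[Sidekiq::RetrySet.new, Sidekiq::DeadSet.new].each do |set|
  set.each { |job| errors[job["error_class"]] += 1 }
end

# Most common failures first; mine was dominated by connection timeouts.
errors.sort_by { |_, n| -n }.each { |klass, n| puts "#{n}\t#{klass}" }
```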
## The simple calculation that matters
This is the mental model I now use; I’ve also written it into a deployment guide for future reference.
You need at least:
- Passenger processes (each wants a DB connection)
- plus Sidekiq threads across all processes
- plus a little buffer (console, rake tasks, admin scripts, spikes)
Example from my production box:
- Passenger: 5 processes
- Sidekiq: 1 process × 3 threads (I dialed this down)
- Buffer: 5–10 (call it 7 for “stuff you forgot”)
So minimum pool = 5 (Passenger) + 3 (Sidekiq) + 7 (buffer) = 15. I set it to 25 because the server has plenty of RAM (32 GB) and I’d rather have headroom than timeouts.
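Spelled out as code, with my numbers plugged in (swap in your own counts):

```ruby
passenger_processes = 5      # each Passenger process wants a DB connection
sidekiq_connections = 1 * 3  # Sidekiq processes × threads (concurrency)
buffer              = 7      # console, rake tasks, admin scripts, spikes

minimum_pool = passenger_processes + sidekiq_connections + buffer
# => 15, which I rounded up to pool: 25 for headroom
```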
## Production `database.yml`
It ended up like this:
```yaml
production:
  adapter: mysql2
  pool: 25
  host: localhost
  # …rest omitted
```
After bumping the pool, the connection timeouts disappeared immediately.
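To verify the pool at runtime rather than trusting the YAML, Rails (5.1+) exposes pool statistics; the output below is illustrative, not from my box:

```ruby
# In a Rails console on the app server:
pp ActiveRecord::Base.connection_pool.stat
# e.g. {:size=>25, :connections=>4, :busy=>1, :dead=>0, :idle=>3,
#       :waiting=>0, :checkout_timeout=>5}
# A nonzero :waiting under normal load means the pool is still too small.
```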
## The Sidekiq side of the house
I also simplified my worker setup. I don’t need hero numbers here—predictability beats bragging rights. I now run:
- Concurrency (threads): 3
- Processes: start one Sidekiq process (or two if I’m pushing a big blast)
My `sidekiq.yml` reflects that:
```yaml
:concurrency: 3
:queues:
  - [critical, 10]
  - [email_blast, 6]
  - [default, 5]
```
If I want more fault tolerance or better CPU spread, I start a second Sidekiq process with the same concurrency. Two procs × 3 threads = 6 total DB connections, still fine under a 25‑connection pool.
## The email blast gotcha (Sidekiq vs Delayed Job)
My email blasts were “completing” with zero sends. The reason: the controller path had been switched to run synchronously during an earlier incident, while mailers still used `deliver_later` (which goes through ActiveJob → Delayed Job in this app). That left me in a half‑migrated state: some work in Sidekiq, some in DJ, both competing for the same small pool.
I re‑enabled async sending for blasts and let Sidekiq handle it. The worker is straightforward, and failures surface cleanly in the retry set:
```ruby
class SendEspecialsWorker
  include Sidekiq::Worker
  sidekiq_options queue: :email_blast, retry: 3

  def perform(account_id, coupons, email_blast_id = nil)
    # Look up records inside the job; only IDs cross the Redis boundary.
    account = Account.find(account_id)
    email_blast = email_blast_id && EmailBlast.find(email_blast_id)
    account.send_especial(coupons, true, email_blast)
  end
end
```
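Enqueuing the blast is then one `perform_async` per account. A sketch under assumptions: `blast.accounts` and the `coupons` payload are stand‑ins for however your app scopes a blast.

```ruby
# Fan out one job per account; only IDs and simple values go into Redis.
blast = EmailBlast.find(blast_id)
blast.accounts.find_each do |account|  # hypothetical association
  SendEspecialsWorker.perform_async(account.id, coupons, blast.id)
end
```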
Key point: pick one background system per workflow. Mixing Sidekiq and Delayed Job on the same hot paths is a great way to create invisible contention.
## What I run before deploy now
I pulled these steps into the deployment guide so I don’t “cowboy fix” at 2am again. The gist:
- Clean up any straggler processes (nginx, passenger, delayed_job, ruby)
- Drop caches and restart in a clean order
- Start Sidekiq to the target concurrency
```bash
sudo systemctl stop nginx
sudo pkill -f delayed_job; sudo pkill -f passenger; sudo pkill -f ruby
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo systemctl start nginx
# start sidekiq with -c 3 and proper logs/PIDs
```
Then I watch the first few minutes like a hawk:
- `passenger-status` for process health and queue length
- Sidekiq stats for processed/failed trends and queue sizes (see the loop after this list)
- Memory and swap to catch early pressure
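For the Sidekiq side of the watching, a crude console loop over `Sidekiq::Stats` does the job (interval and iteration count are arbitrary):

```ruby
# Print throughput every 30 seconds for the first few minutes after deploy.
require "sidekiq/api"

10.times do
  s = Sidekiq::Stats.new
  puts "processed=#{s.processed} failed=#{s.failed} " \
       "enqueued=#{s.enqueued} retries=#{s.retry_size}"
  sleep 30
end
```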
## Quick checks that saved me time
- Confirm Sidekiq’s “Failed” vs Retry/Dead sets. Dashboard counters are cumulative. The real signal is “do new jobs fail?”
- Use `Sidekiq::Stats.new` and print before/after numbers around a test job (sketch after this list). If processed goes up without failed, you’re good.
- Don’t trust logger levels alone in tests if you’re using `puts`. `puts` goes to STDOUT regardless; either switch to `Rails.logger` or redirect STDOUT in test.
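The before/after check, as a minimal sketch; the job arguments are placeholders, and the sleep assumes a quick job:

```ruby
require "sidekiq/api"

before = Sidekiq::Stats.new
SendEspecialsWorker.perform_async(test_account_id, test_coupons)  # placeholder args
sleep 5  # long enough for a quick job to complete

after = Sidekiq::Stats.new
puts "processed: #{before.processed} -> #{after.processed}"  # should tick up
puts "failed:    #{before.failed} -> #{after.failed}"        # should not move
```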
## What I’d do differently next time
- Size the DB pool first. It’s the cheapest lever with the highest upside.
- Avoid half‑migrations. If a workflow starts on Sidekiq, finish the job and remove the Delayed Job path (or vice versa).
- Keep Sidekiq boring: low concurrency, more processes only when needed, queue weights tuned to business priorities.
- Document the decision math in the deployment guide, so future me doesn’t have to reconstruct it under pressure.
## TL;DR
- The chronic failures weren’t Sidekiq being flaky—they were ActiveRecord connection timeouts.
- Passenger procs + Sidekiq threads + a buffer must fit under your DB `pool`.
- On a 32 GB host, a `pool: 25` is a rounding error in memory and buys you stability.
- Run email blasts async in Sidekiq; don’t split them across Sidekiq and Delayed Job.
- Keep Sidekiq to 3 threads per process unless you’ve proven you need more.
If you want the full step‑by‑step, let me know. I put everything into one document: cleanup commands, monitoring, and the connection math. It’s the checklist I wish I’d had before the week started.