What finally fixed our deployment and Sidekiq mess

Over the last week I hit a cluster of production problems that looked unrelated at first: intermittent deploys that “completed” but left the app unresponsive, email blasts that “finished” with zero sends, and a Sidekiq dashboard screaming 60–70% failure rates. This post is my straight‑from-the-console write‑up of what actually broke, how I tracked it down, and the simple changes that stabilized everything.

If you’re running Rails on Nginx + Passenger with MySQL, Redis, and Sidekiq (and some Delayed Job still hanging around), this will probably feel familiar.

The symptoms

  • Deploy “succeeds,” site feels sluggish or unresponsive afterward.
  • Sidekiq shows high failure rates, idle workers, and no throughput.
  • Email blasts target thousands of customers but report zero sent.
  • Logs full of noise in test and prod, making the real issues easy to miss.

The real root cause

It wasn’t a single bug. It was a perfect storm:

  • A Ruby upgrade changed memory/boot characteristics.
  • I had both Sidekiq and Delayed Job running (historical reasons).
  • Passenger was running multiple app processes.
  • And the big one: the database pool was way too small.

Sidekiq wasn’t “broken.” It was starved. Jobs were dying with:

could not obtain a database connection within 5.000 seconds

Once I stopped staring at Sidekiq’s failure counter and looked at actual error classes in Retry/Dead sets, the picture was obvious: ActiveRecord connection timeouts across the board.
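
If you'd rather see that from a console than from the dashboard, the retry and dead sets expose the error class on every job. A minimal sketch (nothing here is app-specific):

require "sidekiq/api"

# Tally error classes across the retry and dead sets to see what is actually failing.
[Sidekiq::RetrySet.new, Sidekiq::DeadSet.new].each do |set|
  counts = Hash.new(0)
  set.each { |job| counts[job.item["error_class"]] += 1 }
  puts "#{set.class.name}: #{counts.sort_by { |_, n| -n }.inspect}"
end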

The simple calculation that matters

This is the mental model I now use; I've documented it in a deployment guide for future reference.

You need at least:

  • Passenger processes (each wants a DB connection)
  • plus Sidekiq threads across all processes
  • plus a little buffer (console, rake tasks, admin scripts, spikes)

Example from my production box:

  • Passenger: 5 processes
  • Sidekiq: 1 process × 3 threads (I dialed this down)
  • Buffer: 5–10 (call it 7 for “stuff you forgot”)

So minimum pool = 5 (Passenger) + 3 (Sidekiq) + 7 (buffer) = 15. I set it to 25 because the server has plenty of RAM (32 GB) and I’d rather have headroom than timeouts.
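
The same math in code, if you like keeping it next to the config. One wrinkle worth remembering: the ActiveRecord pool is per process, so each process really only needs to cover its own threads, and the sum below is better read as the floor for MySQL's max_connections.

passenger_processes = 5
sidekiq_processes   = 1
sidekiq_threads     = 3
buffer              = 7  # console sessions, rake tasks, admin scripts, spikes

minimum_pool = passenger_processes + (sidekiq_processes * sidekiq_threads) + buffer
# => 15; I rounded up to 25 for headroom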

Production database.yml ended up like this:

production:
  adapter: mysql2
  pool: 25
  host: localhost
  # …rest omitted

After bumping the pool, the connection timeouts disappeared immediately.
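
If you want to verify the pool from a console rather than trust the YAML, ActiveRecord reports live pool stats (Rails 5.1+). The 5.000 seconds in that error message is the pool's checkout_timeout; the output below is illustrative, not from my box.

# From rails console on the app server:
ActiveRecord::Base.connection_pool.stat
# => {:size=>25, :connections=>4, :busy=>1, :dead=>0, :idle=>3, :waiting=>0, :checkout_timeout=>5.0}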

The Sidekiq side of the house

I also simplified my worker setup. I don’t need hero numbers here—predictability beats bragging rights. I now run:

  • Concurrency (threads): 3
  • Processes: start one Sidekiq process (or two if I’m pushing a big blast)

My sidekiq.yml reflects that:

:concurrency: 3
:queues:
  - [critical, 10]
  - [email_blast, 6]
  - [default, 5]

If I want more fault tolerance or better CPU spread, I start a second Sidekiq process with the same concurrency. Two procs × 3 threads = 6 total DB connections, still fine under a 25‑connection pool.
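
To sanity-check how many Sidekiq threads are actually live (and could therefore hold a DB connection), the process set API lists every running Sidekiq process and its concurrency. A console sketch:

require "sidekiq/api"

total = 0
Sidekiq::ProcessSet.new.each do |process|
  total += process["concurrency"]
  puts "#{process['hostname']} pid=#{process['pid']} concurrency=#{process['concurrency']} busy=#{process['busy']}"
end
puts "Sidekiq threads that can hold a DB connection: #{total}"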

The email blast gotcha (Sidekiq vs Delayed Job)

My email blasts were “completing” with zero sends. The reason: the controller path had been switched to run synchronously during an earlier incident, while mailers still used deliver_later (which goes through ActiveJob → Delayed Job in this app). That left me in a half‑migrated state: some work in Sidekiq, some in DJ, both competing for the same small pool.

I re‑enabled async sending for blasts and let Sidekiq handle it. The worker is straightforward; retries and failure logging come from Sidekiq itself:

class SendEspecialsWorker
  include Sidekiq::Worker
  # Dedicated queue for blasts; retry a few times, then the job lands in the Dead set.
  sidekiq_options queue: :email_blast, retry: 3

  # email_blast_id is optional so one-off sends can reuse the same worker.
  def perform(account_id, coupons, email_blast_id = nil)
    account = Account.find(account_id)
    email_blast = email_blast_id && EmailBlast.find(email_blast_id)
    account.send_especial(coupons, true, email_blast)
  end
end
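
Enqueuing from the blast path is then a plain perform_async call. A sketch, with accounts, coupons, and email_blast standing in for whatever the controller already has; note the arguments must be JSON-serializable (ids and plain data, not ActiveRecord objects):

accounts.find_each do |account|
  SendEspecialsWorker.perform_async(account.id, coupons, email_blast.id)
end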

Key point: pick one background system per workflow. Mixing Sidekiq and Delayed Job on the same hot paths is a great way to create invisible contention.

What I run before deploy now

I pulled these steps into the deployment guide so I don’t “cowboy fix” at 2am again. The gist:

  • Clean up any straggler processes (nginx, passenger, delayed_job, ruby)
  • Drop caches and restart in a clean order
  • Start Sidekiq to the target concurrency

In shell terms, roughly:

sudo systemctl stop nginx
sudo pkill -f delayed_job; sudo pkill -f passenger; sudo pkill -f ruby
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo systemctl start nginx
# start sidekiq with -c 3 and proper logs/PIDs
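# For example (a sketch, not my exact invocation; paths and service management will differ on your box):
bundle exec sidekiq -e production -C config/sidekiq.yml -c 3 >> log/sidekiq.log 2>&1 &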

Then I watch the first few minutes like a hawk:

  • passenger-status for process health and queue length
  • Sidekiq stats for processed/failed trends and queue sizes (a console sketch follows this list)
  • Memory and swap to catch early pressure
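
The Sidekiq piece of that is easy to script. A console sketch that prints queue depth and latency at a glance:

require "sidekiq/api"

# Queue depth and latency for every known queue, right after a deploy.
Sidekiq::Queue.all.each do |queue|
  puts format("%-12s size=%-6d latency=%.1fs", queue.name, queue.size, queue.latency)
end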

Quick checks that saved me time

  • Compare Sidekiq’s “Failed” counter against the Retry/Dead sets. The dashboard counter is cumulative; the real signal is “do new jobs fail?”
  • Use Sidekiq::Stats.new and print before/after numbers around a test job (see the sketch after this list). If processed goes up without failed, you’re good.
  • Don’t trust logger levels alone in tests if you’re using puts. puts goes to STDOUT regardless; either switch to Rails.logger or redirect STDOUT in test.
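
Here is the before/after check I mean, as a console sketch; the job and its arguments are placeholders, and any cheap job will do:

require "sidekiq/api"

before = Sidekiq::Stats.new
SendEspecialsWorker.perform_async(test_account_id, test_coupons) # placeholder test args
sleep 10 # give a worker a moment to pick it up

after = Sidekiq::Stats.new
puts "processed: #{before.processed} -> #{after.processed}"
puts "failed:    #{before.failed} -> #{after.failed}"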

What I’d do differently next time

  • Size the DB pool first. It’s the cheapest lever with the highest upside.
  • Avoid half‑migrations. If a workflow starts on Sidekiq, finish the job and remove the Delayed Job path (or vice versa).
  • Keep Sidekiq boring: low concurrency, more processes only when needed, queue weights tuned to business priorities.
  • Document the decision math in the deployment guide, so future me doesn’t have to reconstruct it under pressure.

TL;DR

  • The chronic failures weren’t Sidekiq being flaky—they were ActiveRecord connection timeouts.
  • Passenger procs + Sidekiq threads + a buffer must fit under your DB pool.
  • On a 32 GB host, a pool: 25 is a rounding error in memory and buys you stability.
  • Run email blasts async in Sidekiq; don’t split them across Sidekiq and Delayed Job.
  • Keep Sidekiq to 3 threads per process unless you’ve proven you need more.

If you want the full step‑by‑step, let me know. I put everything into a document, including cleanup commands, monitoring, and the connection math. It’s the checklist I wish I’d had before the week started.