Issues with Redis cache and DeviceAuth service

Incident Report for Hosted Mender

Postmortem

This morning, the operations team performed a planned Redis Cluster upgrade, starting at 04:40 UTC. Around 04:56 UTC, one of the Redis pods was killed after running out of memory (OOMKilled), causing the Device Auth service to experience connection failures. To resolve this, the operations team increased the memory allocated to the Redis Cluster, starting at 05:05 UTC. The change was fully rolled out by 05:14 UTC, after which the Device Auth service logged no further errors and returned to normal operation.

Posted Jun 23, 2025 - 08:56 UTC

Resolved

This incident has been resolved.
Posted Jun 23, 2025 - 05:39 UTC

Monitoring

The issue has been identified: a new Redis pod was restarting because it was being OOMKilled.
More memory has been allocated to the Redis cluster and the services are now up. We are monitoring the results.
Posted Jun 23, 2025 - 05:19 UTC

Investigating

We are investigating an issue with the Redis cluster and the Device Auth service, which is in a degraded state.
Posted Jun 23, 2025 - 05:08 UTC
This incident affected: Hosted Mender US.