API issues with Mender Server

Incident Report for Hosted Mender

Postmortem

Date: October 22, 2025

Duration: 78 minutes (08:10 - 09:28 UTC)

Severity: Major service disruption

Executive Summary

A database migration in release v4.1.0-saas.16 caused a complete failure of the Device Authentication service across US and EU hosted Mender clusters. The migration incorrectly deleted a critical uniqueness constraint during online operations, leading to database corruption that prevented service recovery. We restored service by performing a point-in-time database rollback, resulting in 78 minutes of data loss.

Customer Impact: Device authentication was unavailable for 78 minutes. New device enrollments were blocked, and existing device operations may have been disrupted during this period.

Root cause

Release v4.1.0-saas.16 included migration 2.0.1 for the Device Auth database. The migration was designed to replace a uniqueness constraint on device authentication records, but it dropped the old index and created the new one as two separate operations. Because the migration ran online, duplicate device entries were written in the window between the drop and the recreation. These duplicates corrupted the database state and blocked both completion of the forward migration and rollback. The only viable solution was therefore to roll back both the Mender Server version and the database.
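The failure mode described here can be illustrated with a small in-memory model. This is a sketch only: the `Collection` class is a toy stand-in rather than a MongoDB client, and the index names and the `tenant` and `id_data` fields are hypothetical, since the report does not state the actual index definitions.

```python
class DuplicateKeyError(Exception):
    """Raised when a write or an index build violates uniqueness."""


class Collection:
    """Toy in-memory collection with named unique indexes (not a real driver)."""

    def __init__(self):
        self.docs = []
        self.unique_indexes = {}  # index name -> tuple of field names

    @staticmethod
    def _key(doc, fields):
        return tuple(doc.get(f) for f in fields)

    def create_unique_index(self, name, fields):
        # Building a unique index fails if duplicates already exist.
        keys = [self._key(d, fields) for d in self.docs]
        if len(keys) != len(set(keys)):
            raise DuplicateKeyError(f"cannot build {name}: duplicates exist")
        self.unique_indexes[name] = tuple(fields)

    def drop_index(self, name):
        del self.unique_indexes[name]

    def insert(self, doc):
        # Every unique index currently in place must accept the document.
        for name, fields in self.unique_indexes.items():
            key = self._key(doc, fields)
            if any(self._key(d, fields) == key for d in self.docs):
                raise DuplicateKeyError(f"{name} rejects {key}")
        self.docs.append(doc)


DEVICE = {"tenant": "t1", "id_data": "mac-01"}  # hypothetical record


def drop_then_create():
    """Drop first, create later: a write in the gap goes unchecked."""
    c = Collection()
    c.create_unique_index("old_idx", ["id_data"])
    c.insert(dict(DEVICE))
    c.drop_index("old_idx")
    c.insert(dict(DEVICE))  # concurrent write lands in the unprotected window
    for name, fields in (("new_idx", ["tenant", "id_data"]), ("old_idx", ["id_data"])):
        try:
            c.create_unique_index(name, fields)
            return "ok"
        except DuplicateKeyError:
            continue  # neither the forward step nor the rollback can succeed
    return "stuck"


def create_then_drop():
    """Create the replacement first, so a constraint is always in force."""
    c = Collection()
    c.create_unique_index("old_idx", ["id_data"])
    c.insert(dict(DEVICE))
    c.create_unique_index("new_idx", ["tenant", "id_data"])
    try:
        c.insert(dict(DEVICE))  # the same concurrent write is now rejected
    except DuplicateKeyError:
        c.drop_index("old_idx")
        return "ok"
    return "stuck"
```

In this model `drop_then_create()` ends up unable to build either index, while `create_then_drop()` rejects the duplicate write. Whether building the replacement first is possible in practice depends on the two index definitions being compatible, which this sketch assumes.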

Resolution and recovery

With duplicate records preventing normal rollback procedures, we performed a point-in-time database restore to 08:10 UTC, a safe timestamp before the migration ran. This restored database integrity but permanently discarded all data created between 08:10 and 09:28 UTC.
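The size of the data-loss window follows directly from the two timestamps given in this report. The sketch below reproduces that arithmetic; the function names are illustrative, and treating the window as open at the restore target and closed at the completion time is an assumption.

```python
from datetime import datetime, timezone

# Timestamps taken from this report (UTC).
RESTORE_TARGET = datetime(2025, 10, 22, 8, 10, tzinfo=timezone.utc)
RESTORE_DONE = datetime(2025, 10, 22, 9, 28, tzinfo=timezone.utc)


def lost_minutes(target=RESTORE_TARGET, done=RESTORE_DONE):
    """Length of the data-loss window in whole minutes."""
    return int((done - target).total_seconds() // 60)


def was_lost(written_at, target=RESTORE_TARGET, done=RESTORE_DONE):
    """True if a write made at `written_at` falls inside the lost window.

    Boundary handling (exclusive start, inclusive end) is an assumption.
    """
    return target < written_at <= done
```

Here `lost_minutes()` evaluates to 78, matching the duration stated above, and `was_lost()` can be used to check whether a given write timestamp falls inside the window.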

Incident timeline (UTC)

  • 08:35 - the new v4.1.0-saas.16 version was published, and both hosted Mender US and EU started the automated upgrade.
  • 08:40 - the upgrade failed and rolled back automatically to v4.1.0-saas.15 because the deviceauth service could not complete the migration job.
  • 08:42 - the on-call team acknowledged a possible issue with the upgrade. Meanwhile, the deviceauth service and MongoDB were at 100% load because of the missing index.
  • 09:16 - we decided to restore the MongoDB database to the point-in-time timestamp 08:10:00, and the restoration process started.
  • 09:28 - the MongoDB restoration process finished.

What went wrong

  • Migration Strategy: The migration required an offline window or an atomic operation strategy, but this requirement was not identified during development or code review.
  • Testing Gaps: Pre-release testing did not simulate highly concurrent writes during the migration, so it did not trigger the race condition that occurred in production.
  • Data loss: We failed to export a snapshot of the corrupted state before the point-in-time retention window expired.
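A pre-release test for the testing gap above could drive concurrent writers while a migration runs and then check for duplicates. The following is a minimal sketch under stated assumptions: `Store`, `run_migration_under_load`, and both migration functions are hypothetical stand-ins, not Mender code, and the `pause` hook deliberately holds the migration mid-flight so the outcome is deterministic.

```python
import threading
from collections import Counter


class Store:
    """Toy thread-safe store with one switchable uniqueness constraint."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = []
        self.unique = True

    def insert(self, key):
        with self._lock:
            if self.unique and key in self._rows:
                return False  # rejected by the constraint
            self._rows.append(key)
            return True

    def duplicates(self):
        with self._lock:
            return sorted(k for k, n in Counter(self._rows).items() if n > 1)


def run_migration_under_load(migrate, writers=8, keys_per_writer=50):
    """Run `migrate` while several writers insert the same keys concurrently."""
    store = Store()
    in_window = threading.Event()
    writers_done = threading.Event()

    def pause():
        # Called by the migration mid-flight: release the writers, then wait.
        in_window.set()
        writers_done.wait(timeout=5)

    def writer():
        in_window.wait(timeout=5)
        for key in range(keys_per_writer):
            store.insert(key)

    threads = [threading.Thread(target=writer) for _ in range(writers)]
    migration = threading.Thread(target=migrate, args=(store, pause))
    for t in threads:
        t.start()
    migration.start()
    for t in threads:
        t.join()
    writers_done.set()
    migration.join()
    return store.duplicates()


def drop_then_create(store, pause):
    store.unique = False  # constraint removed
    pause()               # writes arrive while nothing enforces uniqueness
    store.unique = True   # constraint "recreated" over already-duplicated data


def constraint_kept(store, pause):
    pause()               # the constraint stays in force for the whole migration
```

With these definitions, `run_migration_under_load(drop_then_create)` reports every key as duplicated, while `run_migration_under_load(constraint_kept)` returns an empty list. A real test would target a staging database rather than an in-memory store.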

Action Items

  • Enhance Load Testing: Pre-release tests do not simulate the production environment closely enough to catch this kind of issue at an early stage. We plan to run load testing and chaos testing more often and more extensively to mitigate this risk.
  • Update the rollback playbook: Mandate that a snapshot of the "corrupted" database state be taken immediately following a destructive point-in-time recovery, so that the data is preserved and can be recovered if necessary.

We sincerely apologize for the disruption to your operations and, specifically, for the data loss that occurred during the recovery window.

Posted Dec 11, 2025 - 20:29 UTC

Resolved

This incident has been resolved.
Posted Oct 22, 2025 - 11:32 UTC

Monitoring

The restore to 08:10:00 UTC completed at 09:28:49 UTC, and the server has been scaled back up. We will continue to monitor the situation.
Posted Oct 22, 2025 - 09:44 UTC

Identified

The issue has been identified. A migration triggered by an upgrade caused an index to be removed prematurely. This in turn caused data corruption. We have initiated a database restore and rolled back the upgrade. We apologize for the inconvenience.
Posted Oct 22, 2025 - 09:20 UTC

Update

We are continuing to investigate this issue.
Posted Oct 22, 2025 - 09:07 UTC

Investigating

We noted a spike in API error metrics; we are investigating the issue.
Posted Oct 22, 2025 - 08:56 UTC
This incident affected: Hosted Mender US.