Certificate issue on Hosted Mender EU

Incident Report for Hosted Mender

Postmortem

On the 8th of October, a critical disruption occurred in our services due to the unexpected expiration of an SSL certificate.

In Northern.tech, we deeply rely on automation, and we're currently using the Cert-Manager tool to automate the renewal process of the TLS certificates on our Hosted Mender cluster in Azure.

The tool was tested and deployed during the setup phase of the Hosted Mender EU cluster, and it was successfully renewing the certificates every three months. Still, it failed on the last cycle, which unfortunately went unnoticed.

Root Cause

In Northern.tech, we use Terraform to distribute changes in our Infrastructures. On the 22nd of September, an operator applied a planned change to the Hosted Mender cluster in Azure related to a new Cert-Manager instance alongside the current one dedicated to Hosted Mender.

Unfortunately, this change wiped out an Azure Role Assignment that gives DNS write permissions to the Cert-Manager agent to update TXT DNS challenges and get certificates signed by a CA.

The operator had not noticed this Azure Role Assignment deletion, and at that time, the Cluster was running fine, so the change was set as concluded successfully.

We know things could go wrong, so we have a monthly task to verify all the Northern.tech TLS endpoints manually. We're using this tool for that. The calendar schedule is set for every second Tuesday of the month: the 12th of September check was okay because the certificate was still in its grace period. The next run would have been on the 10th of October.

Meanwhile, the primary Hosted Mender EU certificate expiration was planned for the afternoon of the 8th of October; when the automated renewal started, the Cert-Manager agent tried to renew it, but it failed since it could not write in the DNS again.

Unfortunately, this also went unnoticed, and eventually, the TLS certificate expired.

Our external uptime checker, Statuscake, sends alerts to our On-Call system and files up a ticket to the on-call SRE member during downtime. We thought this was sufficient, but the simple Uptime check ignores TLS issues.

Immediate action taken

Upon discovering the issue on Monday morning at about 8:00 UTC, we took immediate steps to address the situation: we found that Cert-Manager was not working correctly and checked the error logs to understand the failure.

We found a permission error on the Azure DNS resource, so we issued a Terraform plan to see if there was any IaC drift, and we found a missing Azure Role Assignment. We applied the Terraform code, and the Azure Role Assignment was restored. Ultimately, we forced a new certificate renewal from Cert-Manager, and the new valid certificate was distributed at about 8:30 UTC.

Steps to Avoid Recurrence:

To prevent similar incidents from occurring in the future, we are implementing the following measures:

Change the external SSL monitoring system from Statuscake to New Relic
Add more internal custom alerts for the Cert-Manager

Conclusion:

We deeply regret the inconvenience and frustration this incident has caused. This situation is a stark reminder of the critical importance of meticulous certificate management, and its monitor. By implementing these measures and conducting regular reviews, we are committed to preventing similar incidents, ensuring uninterrupted service delivery, and upholding our users' trust in our services.

Posted Oct 11, 2023 - 11:34 UTC

Resolved

This incident has been resolved.

Posted Oct 09, 2023 - 09:00 UTC

Monitoring

A fix has been implemented and we're monitoring the results.

Posted Oct 09, 2023 - 08:53 UTC

Investigating

We are currently investigating the issue

Posted Oct 09, 2023 - 08:11 UTC

This incident affected: Hosted Mender EU.