Degradation

Incident Report for Yousign

Postmortem

Incident Impact and Resolution Report

Dear Yousign Users,

Following our network migration (https://yousign.statuspage.io/incidents/7dkq2kd8msv7), multiple unexpected degradations of our services occurred between 2024-06-27 13:02 CEST and 2024-06-28 17:12 CEST (1 day, 4 hours, 10 minutes).

This incident affected all Yousign v3 and v2 customers and also included their sandbox/staging environments.

The degradation started with seven global unavailabilities of the service, each lasting approximately three minutes.

List of unavailabilities:

  • 2024-06-27 13:02 CEST
  • 2024-06-27 13:57 CEST
  • 2024-06-27 18:05 CEST
  • 2024-06-27 21:30 CEST
  • 2024-06-28 10:02 CEST
  • 2024-06-28 10:55 CEST
  • 2024-06-28 13:26 CEST
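The figures above can be cross-checked with a few lines of date arithmetic. The following is a minimal sketch, not part of the original report: it recomputes the overall incident window from the stated start and end times, and estimates the cumulative downtime under the assumption that each unavailability lasted exactly three minutes (the report only says "approximately"). The names `OUTAGE_STARTS`, `ASSUMED_OUTAGE` and `incident_window` are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Start times of the seven global unavailabilities (all CEST), as listed above.
OUTAGE_STARTS = [
    "2024-06-27 13:02",
    "2024-06-27 13:57",
    "2024-06-27 18:05",
    "2024-06-27 21:30",
    "2024-06-28 10:02",
    "2024-06-28 10:55",
    "2024-06-28 13:26",
]

# Assumption: exactly three minutes per outage; the report says "approximately".
ASSUMED_OUTAGE = timedelta(minutes=3)


def incident_window(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps in the same time zone."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


window = incident_window("2024-06-27 13:02", "2024-06-28 17:12")
total_outage = ASSUMED_OUTAGE * len(OUTAGE_STARTS)

print(window)        # 1 day, 4:10:00
print(total_outage)  # 0:21:00
```

Under that assumption, full unavailability adds up to roughly 21 minutes of the 28-hour window; the rest of the window corresponds to delays and slowdowns during load recovery.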

Each unavailability had a higher impact on v2 as a side effect, because it led to an increased load on the services upon recovery. To reduce the delays caused by this abnormal load, we temporarily placed our v2 product under maintenance twice.

  • Once on 2024-06-27 at 13:10 CEST, to mitigate the load,
  • A second time on 2024-06-28 at 13:00 CEST, to permanently fix the load issue.

The root cause was quickly determined; however, implementing the fix took some time, and multiple mitigations were set up in the meantime.

A crisis management team was assembled to provide the most suitable solutions and to focus on two tracks:

  • Customer impact mitigation,
  • Permanent fix.

Customer impact mitigations were finalized on 2024-06-28 at 13:26 CEST, which helped us stabilize the network components until the permanent fix.

The permanent fix was deployed on 2024-06-28 during a planned maintenance between 22:00 and 00:00 CEST.

We want to assure you that despite the disruptions, no information was lost, and all received signatures were properly processed.

Root Cause Analysis

Following the migration of our network, a security component experienced unexpected crashes and restarts, making the whole v2 and v3 applications unavailable for a short while. The degradations and slowdowns (mostly on v2) are mainly linked to the load recovery after each incident.

The first root cause hypothesis was formulated on 2024-06-27 at 13:30 CEST. The issue is related to unexpected and erratic behaviour in the interaction between our virtualization solution and the security component. Despite all our tests (stress tests, automated and manual tests, etc.) prior to migration, we were unable to detect this behaviour outside the production environment.

The hypothesis was confirmed on 2024-06-28 at 01:00 CEST by our engineers. As a precaution, we have decided not to make any immediate changes and to cancel the maintenance operation initially scheduled on 2024-06-28 at 04:30 CEST. The stabilization plan required more preparation time in order to guarantee the control and success of the operation.

Correction

As soon as we identified and confirmed the issue (2024-06-28 at 01:00 CEST), we focused on several short-term mitigations to limit the impact on service, while also working on the permanent fix to solve it definitively.

Mitigation

The following mitigations were implemented to maximise the availability of our services:

  • 2024-06-27 13:10 CEST - (v2 only) non-critical services were temporarily down-scaled to reduce the pressure
  • 2024-06-28 13:00 CEST - (v2 only) v2 database capacity was significantly increased
  • 2024-06-28 13:26 CEST - ad-hoc patches and configuration tuning were deployed to our security components

Permanent Fix

On 2024-06-28 at 22:00 CEST, during a dedicated, pre-announced maintenance window, we completely isolated the security components and redeployed them on dedicated hardware without virtualization.

As CTO of Yousign, I'm aware of the impact that the incident may have had on your service. However, I would like to assure you that all technical and human resources were engaged to minimise the impact of this incident. The problem was not visible during our tests, and we had to take sufficient time to diagnose it, define a short-term and medium-term remediation plan, and then implement it safely. We tried to reduce the effects of the incident during the day and scheduled our maintenance slots in the evening to take advantage of the lower level of activity.

I would like to apologise for the inconvenience caused to your service and thank you for your continuing support for Yousign.

Nicolas Baron - CTO Yousign

Posted Jul 02, 2024 - 10:33 CEST

Resolved

The service has been completely stable since the intervention of the technical teams on Friday from 22:00 to 00:00 CEST (https://yousign.statuspage.io/incidents/0s8xy4nzl0tw). The problem is permanently resolved. We will publish a post-mortem in the coming days.
We would like to thank you once again for your confidence in us and apologise for any inconvenience caused.
Posted Jul 01, 2024 - 13:51 CEST

Update

Product v2 & v3:
The situation is back to normal; there are no more delays on signatures or eSeals.
Posted Jun 28, 2024 - 17:12 CEST

Update

Product v2 & v3:
Users will experience some delays in the processing of signatures and eSeals in v2 and v3, but no data is lost.
The fix to reduce delays has been found, and the situation will be back to normal shortly.
Posted Jun 28, 2024 - 17:07 CEST

Update

Product v2 App & API:

The pending requests have been processed, and v2 is back to normal. The API webhooks are currently being processed and should be completed within the next few minutes. Please note that all requests will be processed.

Our teams remain vigilant.
Posted Jun 28, 2024 - 13:50 CEST

Update

Our v2 product has a large number of pending requests. We will be carrying out exceptional corrective maintenance for 15 minutes at 1.15pm to reduce delays and increase the overall capacity of our system. We apologise for any inconvenience this might cause.
Posted Jun 28, 2024 - 13:10 CEST

Update

Some v2 users may experience delays in the processing of their requests. Rest assured that these requests have been taken into account and will be processed later. SMS delivery will also be delayed.
Some API calls may also fail with 500 errors and will need to be submitted again.
Our teams are working to stabilise the service as soon as possible.
Posted Jun 28, 2024 - 11:13 CEST
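For API integrators, the resubmission of calls that failed with a 500 error, as mentioned in the update above, can be automated on the client side. The following is a minimal sketch under stated assumptions, not an official Yousign recommendation or SDK feature: `send_request` is a hypothetical callable wrapping your own HTTP client and returning a `(status_code, body)` tuple, and the retry count and delays are arbitrary illustrative values. Before retrying, check that resubmitting the request cannot create duplicates on your side.

```python
import time
from typing import Callable, Tuple


def retry_on_500(
    send_request: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Resubmit a request while it returns HTTP 500, with exponential backoff.

    send_request is a hypothetical wrapper around your own HTTP client.
    Returns the first non-500 response, or the last 500 response once
    max_attempts has been reached.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        status, body = send_request()
        if status != 500:
            return status, body
        # Wait 1x, 2x, 4x, ... base_delay before the next attempt,
        # but do not sleep after the final one.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting; in production the default `time.sleep` applies.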

Update

Delays returned to normal from 3.45pm.
Our teams are committed to continuing to monitor the service.
Posted Jun 27, 2024 - 16:00 CEST

Monitoring

Between 2.27pm and 3.30pm, v2 users experienced delays in creating eSeals and retrieving audit trails.
Posted Jun 27, 2024 - 15:43 CEST

Identified

V2 Services (API and APP) are unavailable.
Posted Jun 27, 2024 - 15:25 CEST

Monitoring

The service is working; we are continuing to monitor it.
Posted Jun 27, 2024 - 14:38 CEST

Update

We are continuing to work on a fix for this issue.
Posted Jun 27, 2024 - 14:04 CEST

Identified

The issue has been identified and a fix is being implemented.
Posted Jun 27, 2024 - 14:01 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 27, 2024 - 13:34 CEST

Investigating

Some customers may be experiencing an issue with our service. There will be a delay in processing signatures, but no loss of data. Our engineers are working to identify and resolve this as fast as possible.

We will update you again shortly here.

If you have urgent questions, please contact our support team by sending an email to support@yousign.com.
Posted Jun 27, 2024 - 13:23 CEST
This incident affected: Yousign V2 (API V2 - https://api.yousign.com, APP V2 - https://webapp.yousign.com) and Yousign V3 (APP V3 - https://yousign.app, API V3 - https://api.yousign.app).