Dear Yousign Users,
On 2024-06-27, following our network migration (https://yousign.statuspage.io/incidents/7dkq2kd8msv7), multiple unexpected degradations of our services occurred, from 13:02 CEST that day until 2024-06-28 at 17:12 CEST (1 day, 4 hours, 10 minutes).
This incident affected all Yousign v3 and v2 customers, including their sandbox/staging environments.
The degradation started with seven global outages of the service, each lasting approximately three minutes.
Unavailabilities list:
Each incident had a higher impact on v2 as a side effect, because it led to an increased load on the services upon recovery. To reduce the delays caused by this abnormal load, we temporarily placed our v2 product under maintenance twice.
The root cause was quickly determined; however, implementing the fix took some time, and multiple mitigations were set up in the meantime.
A crisis management cell was set up to provide the most suitable solutions and focus on two tracks:
Customer impact mitigations were finalized on 2024-06-28 at 13:26 CEST, which helped us stabilize the network components until the permanent fix.
The permanent fix was deployed on 2024-06-28 during a planned maintenance between 22:00 and 00:00 CEST.
We want to assure you that despite the disruptions, no information was lost, and all received signatures were properly processed.
Following the migration of our network, a security component experienced unexpected crashes and restarts, making the v2 and v3 applications entirely unavailable for a short while each time. The degradations and slowdowns (mostly on v2) are mainly linked to the load recovery after each incident.
The first hypothesis about the root cause was established on 2024-06-27 at 13:30 CEST. The issue is related to an unexpected and erratic interaction between our virtualization solution and the security component. Despite all our tests prior to the migration (stress tests, automatic and manual tests, etc.), we were unable to detect this behaviour outside the production environment.
The hypothesis was confirmed on 2024-06-28 at 01:00 CEST by our engineers. As a precaution, we decided not to make any immediate changes and to cancel the maintenance operation initially scheduled for 2024-06-28 at 04:30 CEST. The stabilization plan required more preparation time to guarantee the control and success of the operation.
As soon as we identified and confirmed the issue (2024-06-28 at 01:00 CEST), we focused on several short-term mitigations to limit the impact on service, while also working on the permanent fix to resolve it definitively.
The following mitigations were implemented to ensure maximum availability of our services:
On 2024-06-28 at 22:00 CEST, during a dedicated, pre-announced maintenance window, we completely isolated the security components and redeployed them on dedicated hardware without virtualization.
As CTO of Yousign, I'm aware of the impact that the incident may have had on your service. I would like to assure you that all technical and human resources were engaged to minimise it. The problem was not visible during our tests, and we had to take sufficient time to diagnose it, define a short-term and medium-term remediation plan, and then implement it safely. We worked to reduce the effects during the day and scheduled our maintenance slots in the evening to take advantage of the lower level of activity.
I would like to apologise for the inconvenience caused to your service and thank you for your continuing support for Yousign.
Nicolas Baron - CTO Yousign