Summary of Impact:
Between 11:17 UTC and 15:21 UTC on 22 June 2021, Vtiger DNS experienced a service availability issue. As a result, customers were unable to resolve domain names for the services they use and saw intermittent failures when accessing Vtiger services. Due to the nature of DNS, the impact was observed across multiple regions. Recovery time varied by service, but the majority of services recovered by 14:15 UTC.
Root Cause:
Vtiger DNS servers experienced an anomalous surge in DNS queries (possibly a DDoS attack) from across the globe, targeting a set of domains hosted on Vtiger. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS servers. As the DNS service became overloaded, DNS clients began retrying their requests frequently, which added further load to the service. Because client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. The resulting increase in traffic further reduced the availability of our DNS service.
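The retry feedback loop described above can be illustrated with a small model. This is only a sketch under stated assumptions, not a reconstruction of the actual traffic: the function name, the fixed capacity, and the idea that a constant fraction of failed queries is retried in the next time step are all hypothetical simplifications.

```python
def simulate_retry_load(base_qps, capacity_qps, retry_fraction, steps):
    """Return the offered load (queries per step) for each time step.

    Hypothetical model: every step, clients send `base_qps` new queries
    plus retries of a fraction of the queries that failed in the
    previous step. The server answers at most `capacity_qps` per step.
    """
    loads = []
    retries = 0.0
    for _ in range(steps):
        offered = base_qps + retries           # new queries plus retries
        served = min(offered, capacity_qps)    # server cannot exceed capacity
        failed = offered - served              # unanswered queries
        retries = failed * retry_fraction      # retried in the next step
        loads.append(offered)
    return loads


if __name__ == "__main__":
    # Below capacity: no failures, so the load stays flat.
    print(simulate_retry_load(100.0, 120.0, 1.0, 3))
    # Above capacity: each step's failures are added to the next step's load.
    print(simulate_retry_load(150.0, 120.0, 1.0, 3))
```

In this model, a base load of 150 against a capacity of 120 with every failed query retried produces offered loads of 150, 180, and 210 over three steps: once demand exceeds capacity, retries keep raising the load even though the underlying demand is constant.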
Mitigation:
The decrease in service availability triggered our monitoring systems and engaged our engineers. Our team restored DNS services by enabling backup DNS and recovering the existing infrastructure by 15:21 UTC; the majority of services were fully recovered by 14:15 UTC. This recovery time exceeded our design goal. In case further mitigation steps were needed, our engineers also prepared additional serving capacity and the ability to answer DNS queries from the volumetric spike mitigation system. Immediately after the incident, we updated the logic on the volumetric spike mitigation system to protect the DNS service from excessive retries.
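One common way to limit excessive retries is a token bucket kept per client and query name. The sketch below shows that general technique only; it is an assumption for illustration and does not describe the logic actually deployed on the mitigation system. The class name, parameters, and the choice of key are hypothetical.

```python
class RetryLimiter:
    """Per-(client, query name) token bucket. Hypothetical sketch."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum tokens a bucket can hold
        self.buckets = {}     # key -> (tokens, time of last update)

    def allow(self, client, qname, now):
        """Return True if the query may pass, False if it should be dropped."""
        key = (client, qname.lower())  # DNS names are case-insensitive
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[key] = (tokens - 1.0, now)
            return True
        self.buckets[key] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = RetryLimiter(rate=1.0, burst=2.0)
    for t in (0.0, 0.0, 0.0, 1.0):
        print(t, limiter.allow("198.51.100.7", "example.com", t))
```

A client that repeats the same query faster than the refill rate is throttled after its burst allowance is used, while other clients and other names are unaffected. A production system would also need to expire idle buckets, which this sketch omits.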
Next Steps:
We apologize for the impact to affected customers. We are continuously taking steps to improve the Vtiger Platform and our processes to help ensure such incidents do not occur in the future. In this case, these steps include (but are not limited to):
Repair the code defect so that all requests can be efficiently handled in cache.
Improve the automatic detection and mitigation of anomalous traffic patterns.
Please contact our support team if you need any assistance.
Update: We want to point out that no data breach occurred during this incident, and we continue to closely monitor our service.