coercitiv
More updates. In summary, the problem was related to the Linux kernel, and a kernel patch has been issued.
For those new to the thread, the original article on the Algolia blog contains the full story + updates.
UPDATE July 13:
Since the last update of this blog post, we have been cooperating with Samsung to help them find the issue. During this investigation, we agreed with Samsung not to communicate until we had their approval. As the issue was not reproduced on our server in Singapore, the reproduction is now running under Samsung's supervision in Korea, outside our environment. Although Samsung requested access to our software and corrupted data multiple times, we could not provide it, in order to protect the privacy and data of our customers.
Samsung asked us to inform you about this:
- Samsung tried to duplicate the failure with the latest script provided to them, but not a single failure has been reproduced so far.
After unsuccessful attempts to reproduce the issue with Bash scripts, we decided to help them by creating a small C++ program that simulates the writing style and pattern of our application (no files are opened with O_DIRECT). We believe that if the issue comes from the specific way we use the standard kernel calls, reproducing it might take a couple of days and terabytes of data written to the drive. Samsung has informed us that no issue of this kind has been reported to them. Our server provider has modified its Ubuntu 14.04 images to disable the fstrim cron job in order to avoid this issue. In the couple of months since we stopped using trim, we have not seen the issue again.
- Samsung will do further tests, most likely from week 29 onwards, with a much more intensive script provided by Algolia.
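To make the idea of a write-and-verify reproduction concrete, here is a minimal Python sketch. It is not Algolia's C++ program: the function names, file naming, sizes, and the use of SHA-256 digests are all assumptions made for illustration. It only shows the general shape of such a test: write known data with ordinary buffered I/O (no O_DIRECT), remember a digest of each file, and later re-read the files to see whether anything changed.

```python
import hashlib
import os
import tempfile


def write_files(directory, count, size, seed=0):
    """Write `count` files of `size` bytes; return {path: sha256 hex digest}.

    Hypothetical helper for illustration, not part of any real test suite.
    """
    digests = {}
    for i in range(count):
        # Deterministic pseudo-random content derived from the seed and index.
        data = hashlib.shake_256(f"{seed}-{i}".encode()).digest(size)
        path = os.path.join(directory, f"chunk_{i:06d}.bin")
        with open(path, "wb") as f:  # ordinary buffered I/O, no O_DIRECT
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data to the device
        digests[path] = hashlib.sha256(data).hexdigest()
    return digests


def verify_files(digests):
    """Re-read every file; return the paths whose content no longer matches."""
    corrupted = []
    for path, expected in digests.items():
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != expected:
            corrupted.append(path)
    return corrupted


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        digests = write_files(workdir, count=8, size=1 << 20)
        print("corrupted files:", verify_files(digests))
```

A read-back done right after the write is normally served from the page cache, so this sketch on its own would not catch on-disk corruption. A real reproduction would presumably have to run for days on the affected drive, write far more data, and verify older files after trim operations have taken place.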
UPDATE July 17:
We have just finished a conference call with Samsung about the failure analysis of this issue. The Samsung engineering team was able to reproduce the issue with the latest binary we provided. Samsung concluded that the issue is not related to Samsung SSDs or Algolia software but to the Linux kernel. Samsung has developed a kernel patch to resolve this issue, and the official statement with details will be released tomorrow, July 18, to the Linux community along with the Linux patch guide. Our testing code is available on GitHub.
This has been an amazing ride. Thank you, everyone, for joining; we have arrived at the destination.