Dear DigiServ Resellers,
As many of you are aware by now, we've experienced prolonged outages on one of our Germany Reseller servers, namely freedom.dns-guards.com. I would like to personally apologize for these extended outages; I truly feel your pain! The day shift technicians and I worked until 4am this morning to ensure that the server was running smoothly.
The problem started early yesterday morning with the server locking up intermittently. It got worse, and eventually the entire server locked up. Our initial attempts to reboot the server failed, so we asked for assistance from the data centre in Germany. A technician was standing by and worked with us to bring the server back on-line.
We went through our checklist to verify everything on the server, including going through the server logs to see what caused the server to lock up. We also ran an extensive hard drive test to ensure all drives were in good working order. The results showed that one of the primary drives was showing signs of possible failure in the near future. I took the decision to order a replacement drive immediately to prevent further downtime on this server in future.
The drive was replaced by a data centre technician within 2 hours and we restored the Operating System. The server was brought back on-line and everything seemed to be running smoothly.
Less than 24 hours before this outage, we performed scheduled maintenance (on Sunday, 15 July 2012) to upgrade the Operating System kernel. With the outage being so close to the kernel upgrade, we became suspicious that the new kernel might be the underlying cause of the problem.
Late last night the server started locking up again. We took the server off-line for the second time yesterday and cross-checked everything. The kernel logs showed significant output errors and a high possibility of a corrupt kernel. We did a complete re-install of the OS, which took some time. We eventually got the server back on-line just before 11pm last night.
Multiple techs and I worked on the server through the night to check everything. At 4am I went home for much needed sleep, having been awake for nearly 24 hours. Approximately 2 hours later I was alerted by the techs that the server was showing signs of locking up again.
We had communicated this problem to the Operating System engineers yesterday, and we received word that we must change the kernel back to the previous version, installed a month ago. We reverted to the previous kernel build and checked all server logs for further errors. While the server was down for this maintenance we ran checks again on all hardware, and the tests showed that the secondary drive was showing errors which were not hardware related. These were caused directly by the OS kernel upgrade. Data written to this drive showed minimal errors, but enough to warrant caution, as files were corrupt.
We are actively monitoring the server, and if we detect further problems we may order a replacement for the secondary drive, which hosts all website files and emails. I am already in communication with the data centre to have a replacement drive ready should this happen again later today or tonight. If this happens, we will take the server completely off-line and rebuild the entire drive configuration on the server.
We are checking our off-server data backups for any errors and will use these to restore data onto the new drive. We want to avoid restoring corrupt files to the new drive, as this may cause further problems. Should this happen, we will update our Network Status area and keep resellers informed.
For the time being the server is running smoothly, and we can all hold thumbs and pray this outage is now permanently resolved. I would also like to thank everyone for their patience during this difficult outage. I know how frustrating it is, and believe me, I hate any amount of downtime. Any amount of downtime means your businesses are down, as well as those of your clients. We are therefore working continuously to ensure the server is running smoothly at all times. The technicians and I will work through tonight and monitor for any possible signs of the server locking up again.
Thank you for your time.
DigiServ Technologies CC - Director
Tuesday, July 17, 2012