Dediserve Issues
If you are on Lon2, the main issue is low disk I/O speed, which leaves the processors backlogged in I/O wait, slowing everything down and exhausting RAM if you don't have good limits on Apache. You can check this for yourself using "hdparm -t /dev/xvda1", "dd if=/dev/zero of=test bs=64k count=64k conv=fdatasync" (note that this writes a 4GB file called "test"), or "latencytop". At times I found my disk speed went down to 1MB/s. Read the "client friendly version" of this at http://bluelightstudios.co.uk/dublin-migration/
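If you would rather not write a multi-gigabyte test file, a rough equivalent of the "dd" check can be sketched in Python. This is only a minimal sketch under my own assumptions: the function name, the default block count and the use of a temporary file are made up for illustration, and the figure it gives is a sequential write speed for a small file, so treat it as a sanity check rather than a benchmark.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory=None, block_size=64 * 1024, blocks=256):
    """Write `blocks` blocks of zeros, force them to disk, return MB/s.

    `directory` should be on the disk you want to test. The default
    (None) uses the system temp directory, which may be a RAM-backed
    filesystem on some machines and would then give a misleading figure.
    """
    payload = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(blocks):
                handle.write(payload)
            handle.flush()
            # Force the data out to the device before stopping the clock,
            # similar in intent to dd's conv=fdatasync.
            os.fsync(handle.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    megabytes = (block_size * blocks) / (1024 * 1024)
    # Guard against a zero reading on very coarse clocks.
    return megabytes / max(elapsed, 1e-9)
```

Calling write_throughput_mb_s("/var/www") (the path is just an example) writes 16MB by default; raise "blocks" if you want a longer run that is less affected by caching.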
If you need any help moving, or tiding over services until the refit, let me know.
Do not ask them to IP forward - they can't do it.
---------------- SUPPORT TICKET TO DEDISERVE ------------
Hi,
Please read this query again:
The services bondsworldwide.co.uk and naturana.co.uk are assigned to the old IPs 109.104.119.111 and 109.104.118.122. These IPs were meant to be forwarded to 217.78.0.61 and 217.78.0.62 respectively.
Please also note:
Changing the password of the VM was not needed. Looking at the problem, the IPs 217.78.0.61-64 have working SSH, and therefore the server configuration was not the issue. Furthermore, Apache was listening properly on 217.78.0.62:80 (before it was reconfigured), and Varnish was listening on 217.78.0.61:80, as confirmed by a simple "netstat -plt". This means that tonight's change, which broke the Apache configuration (along with the addition of a superfluous "/etc/apache2/httpd.conf"), was not needed. The only part of the system that needed checking was the underlying network protocols and the programs that configure them, such as "iptables". I would also ask: how does changing the password affect the service?! I have since reset the root password to its original setting.
As for the first point: during the course of this evening my Apache server was reconfigured to listen on 109.104.118.111, hence when it was restarted by yourselves the service went offline. My logs show this happened at 02/09/2013 07:18:10PM, during your login time, and that the outage lasted 7 minutes. After this time I fixed the configuration myself, as I knew the response to such an incident could take a long time.
Your plans for the refit, however great the improvements will be, were not thought out. No thought was given to the impact on the SANs. Whether or not "hdparm" gives a valid real-world value for the disk I/O, the performance was not acceptable, even by comparison to VMs, or when using the "dd" test suggested on your knowledge wiki. A warning should have been issued about this. Furthermore, blaming the outage on a program that was clearly not using any resources (tmux) to try to cover up the issue is not helpful. I would prefer to know about downtime up front and how long it will last, and for you to update me, as a fellow technician, on how long you expect the problems to continue. "latencytop" clearly showed that disk access was the issue.
The IP forwarding has not been, and is not being, tested thoroughly. This is a major issue currently, and has been since Thursday (07/02/2013 10:19:19) when this service was offered to me. HTTP/SSH is not being forwarded on 109.104.118.111. It would have been better not to offer me this service, as you did to begin with, if you could not check it sufficiently. This is still an issue now on 109.104.118.111 and needs fixing.
Tickets are only being dealt with during office hours. My clients require high availability, and this is why I work through the night to get the services sorted. Your autoresponses are misleading when you say "Many thanks for your update, we are reviewing this for you and one of my colleagues will update you shortly." at 02:37:36 and your response doesn't arrive until 08:46:11. If a critical machine went down for this long I would have severe problems. Not having an ETA also makes it very hard to schedule periods of downtime for my clients. When performing my own upgrade operations I schedule a maintenance window, normally in the early hours between 1am and 3am. My clients also know beforehand what is happening to the service.
Furthermore, when I agree to maintenance being done on my server through a ticket response, I expect it to be carried out promptly, so I can monitor it. I also expect the service to be fully operational at the end. I cannot have a ticket response from yourselves halfway through the work asking me if I want this or that, with the server then being left in an unusable state until I respond. A possible solution would be to use the instant messaging chat you have on your support website to better connect your technician with your client (me).
The backup you took was 24 hours in advance of the move, which meant that, had I not intervened, I would have lost 24 hours' worth of server data. After you offered to delete my machine, I took my own backup, which I have since used to restore those 24 hours' worth of emails. Had I lost this it would have been completely unacceptable.
I am extremely unhappy that your web support helpdesk panel keeps redirecting me on refresh and cutting out due to redirection loops. Pingdom has logged 502 downtime periods with your panel since 05/02/2013 17:28:19. If you need to outsource the work to someone, I am more than willing to fix it at my normal rate, but you can't just leave it any longer.
My clients are now looking for compensation, and the only reason they aren't asking for more is because I stuck a caching proxy on the old server when I first found your SAN problem, with a page TTL of 5 days, so that hard disk use was kept as low as possible. As such I have kept my uptime at 97% (currently below my 99% SLA); however, I have calculated that without the measures I added I would have been at 90% and falling, due to the continued outages related to your SAN upgrades.
Please sort this out. If you need to outsource some work, then feel free, but these issues have been running on for days because, I suspect, the technical staff are at their limits with all the retrofit (and the issues it is causing to clients). If you don't have 24 hour technicians this is a major issue. I understand you don't want all your staff touching the routing configuration, but you should always have someone on hand who is able to do so if needed.
Please give me a proper ETA for a response to this email, and complete the fix for 109.104.118.111 in a way that does not cause downtime for my server. I think a passive look at the current configuration on the VM should be enough for you to see that the server is not the issue. I do not wish to see more downtime because of this IP forwarding problem. Please advise me of a time window should you need to disable, reconfigure or reboot my server in a way that will take it offline for longer than 2 minutes.
Will Tinsdeall
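
The ticket above leans on "netstat -plt", which only shows what is listening on the VM itself. To see whether HTTP/SSH actually gets through on an old IP, you can try a TCP connection from a machine outside the network. The following is a minimal sketch, not something that was sent to Dediserve: the function name and the timeout are my own choices. A successful connection only tells you that something accepted it at that address, not that it was forwarded to the right backend.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake;
        # the socket is closed again as soon as the "with" block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

For example, port_is_reachable("109.104.118.111", 80) checks HTTP on the old IP and port_is_reachable("109.104.118.111", 22) checks SSH; comparing the results with the same calls against 217.78.0.61 or 217.78.0.62 shows whether the forward or the server is at fault.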













