flexmls website operations @flexmlsops-blog-blog - Tumblr Blog

Server Reboot - May 29, 2014 (7:26a-7:58a)

An unplanned reboot of one of our database servers was necessary this morning. This caused an outage event from 7:26-7:58 AM (Central) for some of our customers.

Outage Summary - July 2nd, 2013

As referenced in the previous post, our data center ISP experienced routing problems today that impacted flexmls users. The problems occurred three times (all times Central). The first occurred at about 12:39pm and was fairly short-lived (5-10 minutes). The second came about an hour later, at 1:48pm, and was the most significant episode of the day, affecting users for up to 40 minutes. There was one more small blip at around 4:30pm which, like the first, was brief, lasting under five minutes. Since 4:30pm, there have been no further outages or degraded states. Our provider continues to work with their hardware vendor(s) to ensure a permanent solution is put in place. We'll provide more information here when we've received an update from them.

Network Degradation

There were some networking issues in our primary data center this afternoon. Things are back to normal now (since about 3:00p Central) but we're still working with our colocation provider on the details of the outage. We will provide more information here as it becomes available.

UPDATE (4:36PM): The network routing problems appear to be back. We're working on it.

UPDATE (4:52PM): That last one was minor and our ISP is working with the hardware vendor to fully address the problems.

Network Maintenance - Sunday 12/9 and 12/16

Our colocation provider has informed us that there are two upcoming maintenance windows that may affect the primary hosting facility for flexmls Web:

* Sunday, December 9th - 12am-1am: There may be brief periods of intermittent network interruptions.
* Sunday, December 16th - 12am-1am: Service interruption of up to 30 minutes.

(All times Central)

Connectivity Issues - Friday, August 24th

Some customers experienced issues connecting to the flexmls Web system on the afternoon of August 24th. These issues were the result of an upstream routing issue in Sprint's Chicago, IL network infrastructure. The event began around 1:30PM. By 3:15PM our ISP had turned off their Sprint connection and service was restored to almost all of our affected users. Around 4:00PM Sprint had their problems resolved and our ISP re-enabled their Sprint uplink. At this time, all affected users would have had their service restored. Our ISP has three redundant network connections to insulate us from failure. Unfortunately, when there is a routing problem like this, nothing is technically down, and so traffic was still traveling over the degraded infrastructure in Chicago. That stopped once our ISP turned off their Sprint connection. (All times Central)

Photo servers - August 22, 2012 - 11:03-11:07pm

We had a minor issue affect our ability to serve photos tonight from 11:03-11:07PM (Central). This issue lasted approximately four minutes and was caused by a configuration change in our storage system. This change was not expected to be service affecting. Future maintenance on this hardware will be scheduled more appropriately, with advance notification provided.

ISP Maintenance - Thursday July 19

Our ISP has notified us of a maintenance window in the early morning hours of Thursday, July 19th. The full scope of this window is 12:00am-6:00am (Central). Service interruption is only expected to last for a few minutes (if at all), but could happen any time within the window.

Database Server Reboot - 6/27

One of our database servers required an unplanned reboot this afternoon. This would have caused problems accessing flexmls Web for some of our customers between 2:00-2:10pm (Central).

ISP Maintenance - June 27, 2012 - 2:00-4:00am (Central)

On Wednesday, June 27 between 2am-4am (Central) our ISP will be performing maintenance on one of their core routers. Internet services will be served by alternate carriers during this maintenance. No service interruption is anticipated.

Software Updates - 5/24 and 5/25

We will be doing some software upgrades to the back-end systems this week. Some flexmls Web users may not be able to use the system during the following times:

* Thursday, May 24th - 4:00-4:30am (Central)
* Friday, May 25th - 4:00-4:30am (Central)

Network Maintenance - Tuesday, May 22, 2012

We're going to be doing some network maintenance between 4:00-4:30am (Central) on the morning of Tuesday, May 22nd. The flexmls Web system may become unavailable during this time.

Scheduled Maintenance - Friday, May 11 (4:30-5a Central)

We'll be doing some software updates to one of our back-end database systems early Friday morning. Most flexmls Web customers will be unaffected, but if you have problems working in the system between 4:30a-5:00a (Central), please try again after the maintenance window is complete.

Outage - Monday, April 9 2012 (1pm-2pm Central)

We experienced a failure yesterday from about 1:00PM-2:00PM (Central). This problem was related to a connection limit in one of our back-end performance cache systems. This negatively impacted incoming flexmls traffic and inundated our authentication systems. We have increased this limit and will more proactively monitor the number of active connections to the service in the future. To reduce the likelihood of recurrence:

We have doubled the number of connections that the caching systems will accommodate.

We have added an event to our notification system so that appropriate FBS Hosting Staff members will be notified when the number of active connections reaches 80% of the configured max.

We will be reviewing the authentication systems to ensure faults in the cache will not harm the end user experience.

Sincerely, Jaison Freed FBS, Vice President of Hosting
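The 80% connection alert described in the post above can be illustrated with a minimal sketch. This is not the actual FBS monitoring code: the function name `check_connections`, the `ConnectionAlert` type, and the example connection counts are all hypothetical, and the real notification system is not described in the post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectionAlert:
    """Hypothetical alert payload; not taken from the real notification system."""
    active: int
    configured_max: int
    utilization_percent: float


def check_connections(active: int, configured_max: int,
                      threshold_percent: int = 80) -> Optional[ConnectionAlert]:
    """Return an alert once active connections reach the threshold, else None."""
    if configured_max <= 0:
        raise ValueError("configured_max must be positive")
    # Integer comparison avoids floating-point edge cases exactly at the threshold.
    if active * 100 >= configured_max * threshold_percent:
        return ConnectionAlert(active, configured_max,
                               100.0 * active / configured_max)
    return None


# Assumed example numbers: a cache configured for 1000 connections.
alert = check_connections(active=800, configured_max=1000)
if alert is not None:
    print(f"notify hosting staff: {alert.active}/{alert.configured_max} connections in use")
```

Alerting at a fraction of the limit, rather than at the limit itself, leaves time to react before new connections start being refused.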

Network outage - Oct 24 5:48p-5:54p

We had another firewall issue, similar to the one from about three weeks ago. This afternoon's outage lasted six minutes from approximately 5:48p-5:54p (Central). We'll be contacting our hardware vendor for further assistance to ensure this problem is not repeated.

Firewall Reboot - 10/4/2011

A minor configuration adjustment on our firewall this morning resulted in an unanticipated reboot. This occurred at approximately 5:17AM with service being restored at 5:39AM. ** All times Central

Outage Report

Unfortunately, in trying to fix the false alarms referenced in my earlier post, we ended up causing a real outage. The report is below.

Outage Report - September 15, 2011

We experienced a site-wide outage today. The problems were sporadic at first, starting around 11:35AM, and were across our entire web server farm by around 12:00PM. The problem was resolved by rolling back changes, and the system was returned to service at approximately 12:37PM.

Why did this happen?

We were trying to make web server management easier by cleaning up the configuration and providing an automated way to manage all the servers. This rollout went fine, but caused our availability tests to begin sending out false alarms. These are just tests that are used to confirm that the flexmls Web system is up and running. The problem came when we began trying to modify our new, clean configs to fix the false alarms we were seeing from our uptime monitoring service. A bad config was then introduced to the farm and made its way to all servers, thus rendering the flexmls Web system unavailable.

The problem was fixed by rolling back our web servers to the known good configuration from earlier this morning. We'll be doing an in-depth review of the files that caused the problems. We will not put them back into production until a thorough review is completed to identify what caused today's problems.

NOTE: There was another outage at 1:25-1:35PM as we discovered the rollback procedure performed earlier was not complete.

The Hosting team at FBS strives to provide our customers with 100% system availability. We failed on that today, but we will learn from the mistakes so that they are not repeated in the future. We know how important availability of the flexmls site is to your business.

** All times referenced in this report are US/Central.

False Alarms this Morning (Sept 15)

Sorry for the spam on the Twitter account this morning. We're making some changes to the testing procedure and it keeps thinking the system is down (when it's not). We've turned off the automatic feed to Twitter while we work to resolve this problem. Again, the system is not experiencing any problems; it's just this test that is failing to correctly see that the server is up and running. I apologize for the confusion.

#ARMLS
