12/17/13 – Incident Report On Our Gallery Issues


Hey again,

Our gallery is back up for the moment!

Our host is still analyzing the machine to see exactly what went wrong and to harden it against the same failure in the future.

But I figured I’d at least post something in the meantime.

This article is going to be mostly for the tech geeks among us,
but feel free to read along if you are just curious what’s been going on with the gallery or what we experience behind the scenes.

 

We had three major issues that caused our gallery to experience downtime, slowness, and errors for the past few days.

 

The first issue was that the driver we were using for our hard drive was running in legacy mode, more or less.
This caused uploads to lag randomly and error out at times, along with slow image and thumbnail loads.
We became aware of this problem rather quickly and had it ironed out within about 36 hours.
This one is on us; I should have been more aware of the difference in speed between the two modes.
I am acutely aware now and this should not happen again.

 

The second issue we ran into was a failure in our host’s RAID array.
This caused complete and total downtime of the gallery, with 502 Bad Gateway error messages when you tried to go to it.
The RAID array manages the hard drives, and if anything is wrong with it, it shuts down immediately to prevent data corruption.
Bringing the RAID array back online takes them a couple of hours at minimum.
Unfortunately, this has happened twice in three days, and three times in two weeks.
They are currently investigating the RAID array to try to remedy the problem.
Honestly, this may happen again sometime in the next week or two. I am not convinced they have it stabilized yet.
We are hoping they can keep it under control though.

 

The third and final issue we’ve run into was a glitch in our host’s internal networking.
This has caused the on-and-off lag and long page loads in the gallery.
We have an internal network to send data between our servers in the datacenter.
For some reason, the machine that runs the gallery is having difficulty reaching our database via the internal network.
It has intermittent periods of extremely high connection latency (lag).
It took us quite a while and a lot of back and forth with our host’s tech team to narrow it down to the internal network as the culprit.
We managed to re-route the connection so it bypasses the bad network, and it appears to have fixed the latency issues.
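
For the tech geeks who want to see what intermittent connection latency looks like in practice, here is a rough sketch of the kind of probe that can expose it. It is not the tooling we or our host actually used; the function names, the 50 ms "slow" threshold, and the host and port in the usage note are all made up for illustration. It simply times repeated TCP connections to a server and counts how many were slow or failed outright.

```python
import socket
import statistics
import time


def connect_latency_ms(host, port, timeout=2.0):
    """Time one TCP connect to host:port. Returns milliseconds, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass  # we only care how long the connection took to open
    except OSError:
        return None  # refused, unreachable, or timed out
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms, slow_ms=50.0):
    """Summarize samples. None means a failed attempt; slow_ms is an assumed threshold."""
    ok = [s for s in samples_ms if s is not None]
    return {
        "attempts": len(samples_ms),
        "failures": len(samples_ms) - len(ok),
        "slow": sum(1 for s in ok if s > slow_ms),
        "median_ms": statistics.median(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }


def probe(host, port, attempts=20, pause_s=0.5):
    """Connect repeatedly with a short pause between tries, then summarize."""
    samples = []
    for _ in range(attempts):
        samples.append(connect_latency_ms(host, port))
        time.sleep(pause_s)
    return summarize(samples)
```

Calling something like `probe("db.internal.example", 3306)` (a hypothetical hostname and port) from the gallery machine over each network path would give two summaries to compare. A low median with a very high maximum and a handful of slow attempts is the usual signature of lag that comes and goes rather than a link that is slow all the time.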

 

 

So that’s where we are currently at.

It’s been a long couple of days for us here trying to keep up with the bugs and issues that have been popping up.

Things seem to be calming down a bit for now, but I could write another one of these tomorrow saying the exact opposite.

 

We will keep you posted. Enjoy the improvements for now.
Hopefully things stay stable and we can move forward with other features soon.

–FDT
