High Availability - Database Connectivity Detection


cwaters

We have a high availability setup in Azure for Passwordstate that consists of a Traffic Manager, and 2 sets of App Gateway load balanced clusters in different regions.  We have encountered an issue with what happens when the DB is unreachable from a single region.  Currently in Azure, the load balancing health checks available only look for "good" HTTP status codes and strings to determine health for the backend pools.  In the case of Passwordstate, if one of the redundant servers/clusters behind the load balancing loses the ability to connect to the DB, the webserver still produces a good code, the health probe doesn't fail, and requests are still sent to the available server/cluster.  There doesn't seem to be a way in Azure to probe, for example, a 200 webserver code and match a string from the DB error page to have the load balancer trip (again, it only operates on good results for a health check).

 

Any thoughts on how to solve for this?  The limitations are in Azure, but I'm wondering if it makes more sense (or can have an option) to throw a true HTTP error code for that condition?  A 500 or a 503 (I'd settle for a 418 just to get past this), just something outside of 200-399 range which is the default "good" range for the Azure App Gateways and Traffic Managers.
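
To make the request concrete, here is a minimal Python sketch of a health endpoint that answers 503 instead of 200 when the database check fails. It is not Passwordstate's actual code: `health_status`, `make_handler`, and the `check_db` callable are hypothetical names used only for illustration, and the real database check is left to the caller.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, Tuple


def health_status(check_db: Callable[[], bool]) -> Tuple[int, str]:
    """Map a database check to an HTTP status code and a short body.

    check_db is a hypothetical callable supplied by the caller; it should
    return True when the database is reachable. An exception raised by the
    check is treated the same as an unreachable database.
    """
    try:
        reachable = check_db()
    except Exception:
        reachable = False
    if reachable:
        return 200, "OK"
    # 503 falls outside the 200-399 range that the probes treat as healthy.
    return 503, "Application is up, database is down"


def make_handler(check_db: Callable[[], bool]):
    """Build a request handler class that serves the health result on GET."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            status, body = health_status(check_db)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler
```

The handler class could be passed to `http.server.HTTPServer` to try it locally. The point is only that the status code, not the page text, carries the "unhealthy" signal.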

 

I'm digging into a string match, but as we use SAML and AAD for authentication, the number of redirects makes it difficult to determine what would be a consistent string in a page that won't be on the DB Error page that gets served.  This is probably doable but seems like a long way to go.

 

Anyway, appreciate any thoughts here.  Thanks.


3 hours ago, cwaters said:

Any thoughts on how to solve for this?

This is an excellent point and I hadn't considered it for our own HA implementation.
We use F5s across two DCs rather than having Passwordstate in Azure, but we'd have the same issue.

 

3 hours ago, cwaters said:

500 or a 503 (I'd settle for a 418 just to get past this), just something outside of 200-399 range which is the default "good" range for the Azure App Gateways and Traffic Managers.

A 503 should be thrown, with a custom error page detailing the exact nature of the failure. For example: "Application is up, database is down".

 

Suggest this be moved to feature requests for implementation.


Hi Guys,

 

We're not sure about this sorry - we do not have much experience with Azure. Is it normal for an Azure DB to be unreachable from a single region - with High Availability, we would not have thought you should experience this?

Regards

Click Studios


1 hour ago, support said:

We're not sure about this sorry - we do not have much experience with Azure. Is it normal for an Azure DB to be unreachable from a single region - with High Availability, we would not have thought you should experience this?

Azure aside, it's an issue with the application in HA configuration. If one of the web nodes can't get a connection to the database for whatever reason then the health checks continue to work as the application still returns an HTTP status 200 - even though it can't reach the database.


If Passwordstate deems it can't connect to the database it should be returning a status code 503 instead of 200, that way when the load balancers perform their health checks they'll get hit with a 503 status code and redirect traffic to the web node that is returning status 200.
As it stands, users hit the load balancers, get directed to a web node that returns status 200, only to be greeted with the application saying the database isn't accessible.
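
The routing behaviour described above can be sketched in a few lines of Python. This is a simplified model of a status-code-only health check, not any vendor's implementation; the node names and the helper names `is_healthy` and `routable_nodes` are made up for the example.

```python
from typing import Dict, List

# 200-399 is the default "good" range mentioned earlier in this thread.
HEALTHY_RANGE = range(200, 400)


def is_healthy(status_code: int) -> bool:
    """A status-code-only probe: anything in 200-399 counts as healthy."""
    return status_code in HEALTHY_RANGE


def routable_nodes(probe_results: Dict[str, int]) -> List[str]:
    """Return the nodes a load balancer would keep sending traffic to."""
    return sorted(node for node, code in probe_results.items() if is_healthy(code))


# Today: the node without database access still answers 200, so it stays in rotation.
print(routable_nodes({"web-a": 200, "web-b": 200}))
# With a 503 from the broken node, only the working node receives traffic.
print(routable_nodes({"web-a": 200, "web-b": 503}))
```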

 

 

Screen Shot 2019-05-15 at 3.37.32 pm.png


Hi Sarge,

 

If the database is unavailable, it should redirect you to a screen informing you there are database connectivity issues - the one you see above of dbconnectivityerror.aspx.

 

Are your Load Balancers smart enough to be able to check for this page? We'd prefer to present a meaningful error page, instead of the other status codes - as most customers would not know what they mean. And the majority of customers don't use HA with Load Balancers.
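
For probes that can be scripted, one way to check for this page is to look at where the response redirects rather than at the page text. The Python sketch below assumes the redirect is a normal HTTP 3xx response with a `Location` header pointing at dbconnectivityerror.aspx; that behaviour, the example paths, and the helper names `is_db_error_redirect` and `effective_status` are assumptions, not confirmed Passwordstate details.

```python
from typing import Optional
from urllib.parse import urlsplit

# Page name quoted in this thread; the directory it lives in is not assumed.
DB_ERROR_PAGE = "dbconnectivityerror.aspx"


def is_db_error_redirect(status_code: int, location: Optional[str]) -> bool:
    """True when a 3xx response points at the database connectivity error page."""
    if not (300 <= status_code < 400) or not location:
        return False
    # Compare only the last path segment, ignoring case and any query string.
    last_segment = urlsplit(location).path.rsplit("/", 1)[-1]
    return last_segment.lower() == DB_ERROR_PAGE


def effective_status(status_code: int, location: Optional[str]) -> int:
    """Status a scripted probe could report: 503 for the DB error redirect."""
    if is_db_error_redirect(status_code, location):
        return 503
    return status_code
```

A load balancer that only understands status codes cannot run this itself, which is why a 503 from the application would still be simpler.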

 

I guess if you're seeing this page though, then there is a larger issue with the SQL HA implementation, which should be addressed.

 

Regards

Click Studios


12 minutes ago, support said:

If the database is unavailable, it should redirect you to a screen informing you there are database connectivity issues. Are your Load Balancers smart enough to be able to check for this page? 

Yes, but some load balancers can't check for string matches on the page - only status codes.

 

13 minutes ago, support said:

We'd prefer to present a meaningful error page, instead of the other status codes - as most customers would not know what they mean.

You should still be able to do that through IIS. https://docs.microsoft.com/en-us/iis/configuration/system.webserver/httperrors/

 

13 minutes ago, support said:

I guess if you're seeing this page though, then there is a larger issue with the SQL HA implementation, which should be addressed.

Or the networks team has accidentally misconfigured something, the trust relationship of the web node to the domain controllers has been isolated, or someone did something on one of the nodes they shouldn't have.


This issue would extend to maintenance mode as well - enter a web node into maintenance mode, it still returns status 200 - thus load balancers don't know to send traffic to the other node. Again, 503 would be the correct code to return here.


Thanks for the feedback sarge. So you would prefer some sort of 503 error page from IIS, and not one of our custom error pages? I'm not sure if we could provide both - we'd need to look into that.

Regards

Click Studios


Just now, support said:

Thanks for the feedback sarge. So you would prefer some sort of 503 error page from IIS, and not one of our custom error pages? I'm not sure if we could provide both.

Ideally yes.

It should be an option at setup - If you plan to run in HA, would you rather status 200 and dynamic pages, or 503 and static pages?

 

If the user chooses 503 and static pages, then your setup installer would configure the attached.
If the user chooses 200 and dynamic pages, then leave it as it currently is.

 

How this could be retrofitted to existing installs I'm not sure.

Screen Shot 2019-05-15 at 4.01.40 pm.png


Hi Sarge,

 

If we cannot do this programmatically, then we could not do it - unless we instructed all customers to make manual modifications in IIS.

 

We'll let you know if we have any luck with this, once we find the time to work on it.

Regards

Click Studios


Glad to see this discussion moving! 

 

Just to share the info (though we seem to be past these specifics):

 

To answer the question previously about the HA capabilities of App Gateways in Azure, yes, they are somewhat limited as far as the health checks are concerned.  The best way to describe them is "If I see this thing, you're healthy and I can send you traffic."  There's no concept of "You're sick, I shouldn't send you traffic."

 

On the question about a web node not being able to get to the DB in a region: the DBs are replicated and available across multiple regions and have endpoints in the same regions as the web clusters.  In our case it really had to do with DNS resolution.  In one region, we lost a tunnel that we talk to our own DNS servers (internal ranges) through.  In the other region, that tunnel was up and would have been able to service the requests.  Since the web nodes are accessed publicly, all the web nodes in the bad region showed as up to the load balancer, but since the region couldn't resolve the DB name, traffic was still being sent to a node that couldn't get data.  Azure has also recently experienced some DNS related outages for their service so, this scenario is also possible if we were using the "built-in" DNS features from Azure.
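
The DNS failure mode described here could be folded into a node-level health check. The Python sketch below only tests whether the database hostname resolves; `can_resolve` and `dns_health_status` are hypothetical helpers, and the resolver is injectable so the logic can be exercised without real DNS.

```python
import socket
from typing import Callable


def can_resolve(hostname: str, resolver: Callable[..., object] = socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves, False on a name resolution error."""
    try:
        resolver(hostname, None)
    except socket.gaierror:
        return False
    return True


def dns_health_status(db_hostname: str, resolver: Callable[..., object] = socket.getaddrinfo) -> int:
    """Status a health endpoint could return based on DNS resolution alone."""
    return 200 if can_resolve(db_hostname, resolver) else 503
```

Resolving the name does not prove the database is reachable, so this would complement a real connection test rather than replace it.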


Archived

This topic is now archived and is closed to further replies.