Frequent full-screen "Error: the server connection failed" messages

Bug #1930001 reported by Nathaniel W. Turner
This bug affects 4 people
Affects    Status         Importance  Assigned to     Milestone
MAAS       In Progress    Medium      Unassigned
maas-ui    Fix Committed  Medium      Peter Makowski
3.4        Fix Committed  Medium      Peter Makowski

Bug Description

The MAAS 3.0-rc1 UI frequently disappears and is replaced by a message that says something like "Error: the server connection failed". After about 1 second, this message usually goes away and is replaced by the page that was being viewed before.

When this happens, if any data entry was in progress (e.g. entering text in a form, configuring something, etc.), the form is re-displayed and all edits are lost.

I don't recall this happening in 2.9.

Could some timeout in the UI logic be a bit too short now?

I notice these errors no matter how I connect to MAAS, but they seem more frequent when connected from home via a VPN.

tags: added: ui
Changed in maas-ui:
importance: Undecided → Unknown
Huw Wilkins (huwshimi) wrote :

Hi Nathaniel, this message gets displayed when the UI loses the websocket connection.

Would you be able to load MAAS and then open your browser's dev tools and go to the network tab and see if the websocket connection is interrupted (it should try and reconnect while the websocket is down)?

Nathaniel W. Turner (nturner) wrote :

I don't see any clear indication that the websocket is failing, but every 50s (almost exactly), the UI shows this message and creates a new websocket.

Attached shows the Timing info for a representative sample. The Messages info just shows normal looking messages, and there is no new message info in the old ws when the UI shows the failure; a new ws simply appears. No idea if this is helpful. If I can do a more specific test, let me know.

Huw Wilkins (huwshimi) wrote :

Thanks Nathaniel, that's helpful. As you say, the regularity makes it look like a timeout somewhere. It could be a proxy or the server causing the issue.

Have you done any manual config of proxies or the like? I'll ask around, but do let me know if you think of anything specific to your setup that might be related.

Lee Trager (ltrager) wrote :

To add to that: are you connecting to MAAS directly, or is there a proxy in front of MAAS that is forwarding the connection? I would look at regiond.log and maas.log to see if there are any clues as to what is going on.

Jeff Pihach (hatch) wrote :

I have seen this exact issue with certain Juju environments and the Juju dashboard. It's been caused by apache timing out idle websocket connections after 50s. In Juju we have a pinger interface that is used in these environments to keep the connection from timing out. You may also be able to increase this timeout on apache (or whatever is fronting the connection) to alleviate this.
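
The pinger idea can be sketched roughly as follows. This is a minimal illustration, not the Juju or MAAS implementation: `SocketLike`, `pingIntervalMs`, `startKeepAlive`, and the `{ type: "ping" }` payload are all hypothetical names, and the server would have to accept (or at least ignore) such a message. Depending on the proxy, traffic may also need to flow from the server side to reset the idle timer, so a reply from the server might be required.

```typescript
// Minimal keepalive sketch. All names here are hypothetical, not MAAS or Juju APIs.

/** The subset of the browser WebSocket interface this sketch relies on. */
interface SocketLike {
  readyState: number;
  send(data: string): void;
}

const OPEN = 1; // Value of WebSocket.OPEN in the browser API.

/**
 * Pick a ping interval comfortably below the proxy's idle timeout.
 * Half the timeout leaves some margin for a delayed ping.
 */
function pingIntervalMs(proxyIdleTimeoutMs: number): number {
  return Math.floor(proxyIdleTimeoutMs / 2);
}

/**
 * Send a small message on a fixed interval while the socket is open.
 * Returns a function that stops the keepalive.
 */
function startKeepAlive(
  socket: SocketLike,
  proxyIdleTimeoutMs: number,
  payload: string = JSON.stringify({ type: "ping" }) // Assumed payload format.
): () => void {
  const timer = setInterval(() => {
    if (socket.readyState === OPEN) {
      socket.send(payload);
    }
  }, pingIntervalMs(proxyIdleTimeoutMs));
  return () => clearInterval(timer);
}
```

With the 50s timeout seen in this report, `startKeepAlive(socket, 50_000)` would send a ping every 25s until the returned stop function is called.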

Nathaniel W. Turner (nturner) wrote :

Ah, yes, I am using a reverse proxy in front of maas for https support. I used haproxy per the recommendations at https://maas.io/docs/snap/2.9/ui/hardening-your-maas-installation#heading--tls (which seem reasonable).

I'll attach my haproxy.cfg template --- it's just the stock Ubuntu 20.04 config with a couple additions to add the maas frontend/backend. The problem might be obvious to someone familiar with haproxy and the way it deals with websockets.

Nathaniel W. Turner (nturner) wrote :

I'm testing now with a "timeout tunnel 600000" setting added to that config. So far, I haven't seen the 50s timeout, so I think this may be the right direction.

I'm not sure if 10m is the right value, though. Does maas do any kind of periodic messaging on the websocket on which we can rely to keep it alive? If not, a longer timeout might be better.

Bill Wear (billwear)
Changed in maas:
status: New → Triaged
status: Triaged → New
Jeff Pihach (hatch) wrote :

MAAS does not have any system for periodic messaging like Juju does, although it will periodically send updates to the UI for machine status updates. The server typically keeps the connection open indefinitely and we rely on the client to reconnect in the event of a connection drop (which you've already experienced).

A solution here is twofold.
- We need to do a better job handling this timeout as it should be mostly transparent.
- In the interim, you can set your haproxy `timeout tunnel` to some high value to reduce the probability of a timeout. Choose this value according to the level of access to this server, because keeping connections open like this can expose you to certain types of attacks. If you increase this value, you may also want to set `timeout client-fin` so that haproxy doesn't wait for an extended period for dropped connections to come back.

Sorry I can't be much more help with haproxy config, it's been a while since I've looked into it seriously.
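
As a rough illustration of the first point (handling the drop so that it is mostly transparent), a client could retry quietly and only interrupt the user if the outage lasts. The sketch below is an assumption about how that might look, not what maas-ui does: `reconnectDelayMs`, `shouldShowError`, and the default values are hypothetical.

```typescript
// Hypothetical reconnect policy; names and defaults are illustrative only.

interface BackoffOptions {
  baseMs: number; // Delay before the first retry.
  maxMs: number; // Upper bound on any single delay.
}

/** Exponential backoff: base, 2x base, 4x base, ... capped at maxMs. */
function reconnectDelayMs(
  attempt: number,
  { baseMs, maxMs }: BackoffOptions = { baseMs: 1000, maxMs: 30000 }
): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

/**
 * Decide whether the UI should interrupt the user at all.
 * A drop that recovers within the grace period stays invisible.
 */
function shouldShowError(disconnectedForMs: number, graceMs = 5000): boolean {
  return disconnectedForMs > graceMs;
}
```

Since the message in this report usually disappears after about a second, a grace period of a few seconds would hide most of these drops, provided the form state survives the reconnect.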

Brandon Marler (pbmarler) wrote :

This issue also exists when running the MAAS: 3.1.0~beta2 snap after following the official documentation on using nginx to add https. https://maas.io/docs/snap/3.1/ui/configuring-tls-encryption#heading--nginx
In nginx, the default proxy read timeout is 60 seconds.
Default: proxy_read_timeout 60s;

I assume setting the nginx proxy_read_timeout to a higher value would work around the issue.

Having the form data reset every 60 seconds does encourage expedient work habits so maybe we just call it a feature.

Bill Wear (billwear)
summary: - Frequent full-screen "Error: the server connection failed" messages
+ UI: Frequent full-screen "Error: the server connection failed" messages
Jerzy Husakowski (jhusakowski) wrote :

Does this still happen on MAAS 3.2? This version introduced native TLS support, removing the need for a haproxy, and increased timeout values.

summary: - UI: Frequent full-screen "Error: the server connection failed" messages
+ Frequent full-screen "Error: the server connection failed" messages
Changed in maas:
status: New → Incomplete
Tobias McNulty (tobias-mcnulty) wrote :

FWIW, I noticed this on 3.2 as soon as I switched to having haproxy in front of MAAS. I don't recall seeing it at all before that. I imagine it has something to do with the haproxy configuration.

Jerzy Husakowski (jhusakowski) wrote :

We don't know how to reproduce it, but it deserves a look.

Changed in maas:
importance: Undecided → Medium
milestone: none → 3.4.0
status: Incomplete → Triaged
zrsolis (zrsolis) wrote :

I'm still experiencing this issue in 3.2.6

Thorsten Merten (thorsten-merten) wrote :

From the MAAS side, it looks as though we could increase our timeouts and document how to properly set timeouts when using proxies/LBs.
From the UI side, we could have a look at whether we can show the message without blocking/reloading the page.

Changed in maas-ui:
status: New → Triaged
milestone: none → 3.4.0
importance: Unknown → Medium
Changed in maas-doc:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.4.0
Changed in maas-doc:
milestone: 3.4.0 → 3.5.0
milestone: 3.5.0 → 3.4.0
Thorsten Merten (thorsten-merten) wrote :

Max, can you have a look at whether we can solve this better?

Changed in maas-ui:
assignee: nobody → Maximilian Blazek (maximilian-blazek)
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.x
Bill Wear (billwear)
Changed in maas-doc:
status: Triaged → Incomplete
Nathaniel W. Turner (nturner) wrote (last edit ):

What information is missing?

For what it's worth, after adding "timeout tunnel 600000" to my custom haproxy configuration, I never saw this again.

I have another maas instance that uses the omnibus package's built-in nginx ("native TLS support") configuration, and that seems OK too (though I did see one full-screen timeout message early on).

Bill Wear (billwear) wrote :

- with credit to @cgrabowski for feeding me intel:

wrt comment #14 and the assignment to maas-doc: bumping up the default 90s in haprox.conf may help in HA deployments, but we do not know this; i see no other timeouts here. where are these timeout values to which you are referring?

nginx times out at a hardcoded 900s (line 9 in src/maasserver/templates/http/regiond.nginx.conf.template); with a comment "to match the Twisted one".

also note that this error happens to me from time to time, using one rack and one region. would the haproxy actually help in that circumstance?

moving this back to incomplete. i have researched this quite a bit and consulted with others, and haven't found a clear answer as to what i should recommend settings for, yet. lmk and i'll doc it.

Christian Grabowski (cgrabowski) wrote :

In non-HA deployments, it is nginx's timeout closing the connection (which coincides with Twisted's timeout as well), so HAProxy wouldn't make a difference in this case.
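
One way to reason about which timeout matters: an idle websocket is closed by whichever hop in the path has the shortest idle timeout. The sketch below only illustrates that reasoning. The `Hop` type and `tightestHop` function are hypothetical, and the values are the 50s observed by the reporter behind haproxy and the 900s nginx value mentioned in this thread.

```typescript
// Illustrative only: hop names and values are taken from this thread or assumed.

interface Hop {
  name: string;
  idleTimeoutSeconds: number;
}

/** Return the hop with the shortest idle timeout, which closes the connection first. */
function tightestHop(hops: Hop[]): Hop {
  if (hops.length === 0) {
    throw new Error("at least one hop is required");
  }
  return hops.reduce((min, hop) =>
    hop.idleTimeoutSeconds < min.idleTimeoutSeconds ? hop : min
  );
}

// Reporter's setup: a custom haproxy in front of MAAS.
const withHaproxy: Hop[] = [
  { name: "haproxy", idleTimeoutSeconds: 50 }, // Observed by the reporter.
  { name: "nginx (regiond)", idleTimeoutSeconds: 900 }, // Value quoted in this thread.
];

// Non-HA setup with no extra proxy: only nginx is in the path.
const direct: Hop[] = [{ name: "nginx (regiond)", idleTimeoutSeconds: 900 }];
```

Under these assumptions, raising the haproxy timeout only helps while haproxy is the tightest hop. In the direct case the nginx value is the one that applies.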

Bill Wear (billwear) wrote :

i could try adding timeouts to the /snap/maas/current/etc/nginx/nginx.conf like this:

    proxy_pass http://backend;
    proxy_connect_timeout 75s;
    proxy_send_timeout 900s;
    proxy_read_timeout 900s;

except that it's readonly, and i'm not sure we should be telling users to circumvent this.
i'm also not sure it's appropriate to add timeouts to the squid.conf file that's autogenerated in /snap/maas/<instance-number>/usr/lib/tmpfiles.d/, given its ephemeral nature.

"increase our timeouts and document to properly set timeouts when using proxies/LBs" seems like the right answer, just not sure how to do this, and whether part of this would require adding handles to the UI/CLI that we don't yet have. i'll gladly take a public swig of "slap ya mama" hot sauce if someone can tell me which timeouts a user can change, how to change them, and verify that it has an effect. :)

tags: added: bug-council
tags: removed: ui
Peter Makowski (petermakowski) wrote :

I had a quick look at this, adding what I gathered so far:

The logic around the websocket is a little bit convoluted, and for that reason it might be difficult to change.

Currently we're forcibly recreating the entire Redux store whenever a websocket disconnection occurs. Adjusting this behaviour will have some side effects, many of which may not be immediately obvious.

Also, status/websocketDisconnected is not synonymous with actual disconnection as we dispatch this manually.

https://github.com/canonical/maas-ui/blob/main/src/root-reducer.ts#L102-L116
https://github.com/canonical/maas-ui/blob/main/src/app/base/sagas/websockets/websockets.ts#L171
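
To illustrate why recreating the whole store loses form edits, and what a narrower reset might look like, here is a simplified sketch. It is not the maas-ui code: the `AppState` shape, slice names, and both root reducers are hypothetical, and only the `status/websocketDisconnected` action type comes from the description above. As noted, changing the real behaviour may have side effects this sketch does not capture.

```typescript
// Sketch only. The state shape and slice names are hypothetical, not maas-ui's.

interface AppState {
  machines: string[]; // Server-synced data, safe to refetch.
  formDrafts: Record<string, string>; // Unsaved user input, lost if reset.
}

type Action =
  | { type: "status/websocketDisconnected" }
  | { type: "machines/loaded"; payload: string[] }
  | { type: "form/draftChanged"; payload: { field: string; value: string } };

const initialState: AppState = { machines: [], formDrafts: {} };

function appReducer(state: AppState = initialState, action: Action): AppState {
  switch (action.type) {
    case "machines/loaded":
      return { ...state, machines: action.payload };
    case "form/draftChanged":
      return {
        ...state,
        formDrafts: {
          ...state.formDrafts,
          [action.payload.field]: action.payload.value,
        },
      };
    default:
      return state;
  }
}

/** Resets everything on disconnect, similar to the behaviour described above. */
function resetAllRootReducer(state: AppState | undefined, action: Action): AppState {
  if (action.type === "status/websocketDisconnected") {
    return appReducer(undefined, action);
  }
  return appReducer(state, action);
}

/** Alternative: drop only server-synced slices and keep unsaved input. */
function preserveDraftsRootReducer(state: AppState | undefined, action: Action): AppState {
  if (action.type === "status/websocketDisconnected" && state) {
    return { ...initialState, formDrafts: state.formDrafts };
  }
  return appReducer(state, action);
}
```

In this sketch the second reducer clears `machines` so it can be refetched after reconnecting, while `formDrafts` survives the disconnect.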

Thorsten Merten (thorsten-merten) wrote :

@peter: can we add a lifesign/ping every x seconds to resolve the issue?

Changed in maas-ui:
assignee: Maximilian Blazek (maximilian-blazek) → Peter Makowski (petermakowski)
milestone: 3.4.0 → 3.5.0
no longer affects: maas-doc
Changed in maas:
milestone: 3.4.x → 3.5.0
tags: removed: bug-council
Changed in maas-ui:
status: Triaged → Fix Committed
Changed in maas:
status: Triaged → Fix Committed
Bill Wear (billwear) wrote :

can confirm that this is still happening in MAAS 3.4.0~rc1 as of current date.

Changed in maas:
status: Fix Committed → Confirmed
Changed in maas-ui:
status: Fix Committed → Confirmed
Changed in maas:
status: Confirmed → Triaged
Changed in maas-ui:
status: Confirmed → Triaged
Peter Makowski (petermakowski) wrote (last edit ):
Changed in maas-ui:
status: Triaged → Fix Committed
Changed in maas:
status: Triaged → Fix Committed
Peter Makowski (petermakowski) wrote :

The committed bug fix reconnects automatically on an interrupted WebSocket connection, but MAAS UI sometimes does not react to notifications past that point. Setting back to in progress to investigate.

Changed in maas:
status: Fix Committed → In Progress
Changed in maas-ui:
status: Fix Committed → In Progress
Peter Makowski (petermakowski) wrote :
Changed in maas-ui:
status: In Progress → Fix Committed