From keystone/1:
(keystone.server.flask.request_processing.middleware.auth_context): 2023-04-01 08:36:25,698 ERROR (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
(Background on this error at: https://sqlalche.me/e/14/e3q8)
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
self.dialect.do_execute(
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 148, in execute
result = self._query(query)
File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 310, in _query
conn.query(q)
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 548, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 775, in _read_query_result
result.read()
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1163, in read
self._read_result_packet(first_packet)
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1236, in _read_result_packet
self._read_rowdata_packet()
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1270, in _read_rowdata_packet
packet = self.connection._read_packet()
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 725, in _read_packet
packet.raise_for_error()
File "/usr/lib/python3/dist-packages/pymysql/protocol.py", line 221, in raise_for_error
err.raise_mysql_exception(self._data)
File "/usr/lib/python3/dist-packages/pymysql/err.py", line 143, in raise_mysql_exception
raise errorclass(errno, errval)
pymysql.err.OperationalError: (1053, 'Server shutdown in progress')
So "Server shutdown in progress" is a bit ominous.
From mysql-innodb-cluster/0 at the same time:
2023-04-01T08:36:23.178312Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.61:3306'
2023-04-01T08:36:23.178353Z 0 [System] [MY-011500] [Repl] Plugin group_replication reported: 'Primary server with address 192.168.33.61:3306 left the group. Electing new Primary.'
2023-04-01T08:36:23.178515Z 0 [System] [MY-011507] [Repl] Plugin group_replication reported: 'A new primary with address 192.168.33.83:3306 was elected. The new primary will execute all previous group transactions before allowing writes.'
2023-04-01T08:36:23.178770Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.83:3306, 192.168.33.84:3306 on view 16803355385089891:10.'
2023-04-01T08:36:23.179472Z 68 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2023-04-01T08:36:23.179579Z 68 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address 192.168.33.83:3306.'
2023-04-01T08:37:31.674902Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.61:3306, 192.168.33.83:3306, 192.168.33.84:3306 on view 16803355385089891:11.'
2023-04-01T08:37:36.190735Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 192.168.33.61:3306 was declared online within the replication group.'
2023-04-01T08:38:15.493860Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.83:3306'
2023-04-01T08:38:15.493903Z 0 [System] [MY-011500] [Repl] Plugin group_replication reported: 'Primary server with address 192.168.33.83:3306 left the group. Electing new Primary.'
i.e. 192.168.33.84 (which is mysql-innodb-cluster/0's local address) left the cluster and then rejoined during the attempted keystone-manager.py connection. With the cluster down to 2 units it would go read-only (RO), and the out-of-cluster unit (this one) would refuse connections if that is what the mysql-router was connected to.
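(For reference, a minimal pymysql sketch of how that state could be confirmed from a client; the host and credentials below are placeholders, not values taken from this deployment:)

    import pymysql

    # Placeholders - substitute the unit address and cluster credentials.
    conn = pymysql.connect(host="192.168.33.84", user="clusteruser",
                           password="secret", database="mysql")
    with conn.cursor() as cur:
        # super_read_only=ON is what the MY-011565 log line above reports
        # when the member is acting as a secondary / has just rejoined.
        cur.execute("SELECT @@global.super_read_only")
        print("super_read_only:", cur.fetchone()[0])
        # MEMBER_STATE shows whether this server is ONLINE in the group.
        cur.execute("SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE "
                    "FROM performance_schema.replication_group_members")
        for row in cur.fetchall():
            print(row)
    conn.close()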
Why did it leave the cluster and then rejoin?
It appeared to be handling a certificates-relation-changed hook from vault/0 at the same time:
2023-04-01 08:36:07 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 certificates:454: Reactive main running for hook certificates-relation-changed
and from the machine.lock:
2023-04-01 08:36:15 unit-mysql-innodb-cluster-0: mysql-innodb-cluster/0 uniter (run relation-changed (454; unit: vault/0) hook), waited 0s, held 9s
So the cluster dropped out because vault provided certificates and mysql-innodb-cluster/0 restarted. It then took roughly 2 minutes 30 seconds to re-form the cluster, by which time the keystone unit had run out of retries and given up connecting to the cluster.
Additionally, mysql-innodb-cluster/1 dropped out at 08:36:
2023-04-01T08:36:22.413986Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.32-0ubuntu0.20.04.2).
And mysql-innodb-cluster/2:
2023-04-01T08:37:20.236588Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.32-0ubuntu0.20.04.2) (Ubuntu).
...
2023-04-01T08:39:28.616528Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 192.168.33.83:3306 was declared online within the replication group.'
So the cluster was still up/down until 08:39:28.
--
Summary & thoughts
------------------
3 minutes is actually a long time for the mysql servers to update certificates and re-form as a cluster. Essentially, the cluster was offline for 3 minutes, so any hook in the system that needed to write to a mysql database would fail and would have to be deferred.
It may be that the system was being deployed in a resource-constrained environment, and thus the timeouts on database access were exceeded.
keystone (as a classic charm) either has to wait longer in the hook, or Juju's hook retries have to be relied on to re-run the erroring hook until it succeeds; both are suboptimal.
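For illustration, a rough sketch of the "retry in the hook" option (not what the charm does today; the timings and DSN are made up), using the same SQLAlchemy/pymysql stack as the traceback above:

    import time

    import sqlalchemy
    from sqlalchemy import exc, text

    def execute_with_retry(engine, statement, attempts=30, delay=10):
        """Retry transient connection failures for up to attempts*delay seconds."""
        for attempt in range(1, attempts + 1):
            try:
                with engine.connect() as conn:
                    return conn.execute(text(statement)).fetchall()
            except exc.OperationalError:
                # Covers both errors seen above: 2013 (lost connection)
                # and 1053 (server shutdown in progress).
                if attempt == attempts:
                    raise
                time.sleep(delay)

    engine = sqlalchemy.create_engine(
        "mysql+pymysql://keystone:secret@127.0.0.1/keystone",
        pool_pre_ping=True)  # also discards dead pooled connections before reuse
    rows = execute_with_retry(engine, "SELECT 1")

With attempts=30 and delay=10 this would ride out a ~3 minute outage, at the cost of a hook that can block for up to 5 minutes.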
Alternatively, the hook could be deferred; in a reactive or ops-framework charm this is reasonably easy to achieve. Unfortunately, classic charms do not have the capability to do this, and there are no (current) plans to add it.
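As an illustration of what the deferral route looks like in an ops-framework charm (the names here are invented; keystone itself is classic, so it cannot do this):

    import pymysql
    from ops.charm import CharmBase
    from ops.main import main

    class ExampleCharm(CharmBase):
        def __init__(self, *args):
            super().__init__(*args)
            self.framework.observe(self.on.config_changed, self._on_config_changed)

        def _on_config_changed(self, event):
            try:
                self._sync_database()   # hypothetical DB work the hook needs
            except pymysql.err.OperationalError:
                # Cluster is mid-restart (e.g. picking up TLS certs from vault):
                # defer and let a later event retry instead of failing the hook.
                event.defer()

        def _sync_database(self):
            ...

    if __name__ == "__main__":
        main(ExampleCharm)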
The mysql-innodb-cluster charm could defer all router-db relations until it has achieved a TLS-enabled cluster if vault is related to it; this would prevent any other related charm from attempting to access the database until the cluster has formed AND upgraded to TLS (when vault is related). This might be achieved using goal-state to decide whether certificates are going to arrive.
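One possible shape for that check, using the goal-state hook tool (a sketch only; the relation name and surrounding logic are assumptions, not the charm's actual code):

    import json
    import subprocess

    def vault_certificates_expected():
        """True if goal-state reports a 'certificates' relation is expected."""
        raw = subprocess.check_output(["goal-state", "--format=json"])
        goal = json.loads(raw)
        return "certificates" in goal.get("relations", {})

    def ready_for_router_db(cluster_formed, tls_enabled):
        """Only publish router-db data once the cluster (and TLS, if coming) is up."""
        if not cluster_formed:
            return False
        if vault_certificates_expected() and not tls_enabled:
            return False  # hold off until the TLS restart has happened
        return True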
As a mitigation strategy, it may be preferable to deploy vault + mysql-innodb-cluster *first*, wait for them to settle into a cluster with TLS enabled, and then do the rest of the deployment. This would at least work around the race hazard of the mysql servers bootstrapping themselves into TLS mode.