[3.3] Duplicate IPMI BMC IP + Username with unique passwords causes migration failure on upgrade to 3.3

Bug #2025026 reported by Trent Lloyd
Affects              Status     Importance  Assigned to
MAAS                 Won't Fix  Medium      Unassigned
MAAS documentation   Triaged    Medium      Bill Wear

Bug Description

If you have an environment where multiple machines have an IPMI BMC with the same IP address & username, but a different password, this will work on 3.2 but cause a migration failure on upgrade to 3.3.

The cause is 0290_migrate_node_power_parameters which moves the password from power_parameters to the secret store. There is a UNIQUE constraint on (power_type, md5(power_parameters::text)) which previously passed but now fails because the power_pass field is removed on migration to the secret store and it was the only unique value.
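The collision can be sketched in a few lines of Python (a minimal illustration with hypothetical parameter values; the real index expression is md5(power_parameters::text) computed by Postgres, not Python's JSON serialization):

```python
import hashlib
import json

# Hypothetical parameter values for two machines that share a BMC IP and
# username but have different passwords (key names match the real MAAS fields).
a = {"power_address": "10.0.0.10", "power_user": "admin", "power_pass": "pw-one"}
b = {"power_address": "10.0.0.10", "power_user": "admin", "power_pass": "pw-two"}

def key(params):
    # Stand-in for the index expression md5(power_parameters::text).
    return hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()

print(key(a) == key(b))  # False on 3.2: the password still differentiates them
a.pop("power_pass")      # 0290 moves the password out of power_parameters...
b.pop("power_pass")
print(key(a) == key(b))  # True on 3.3: the two rows now violate the unique index
```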

While it is almost certainly "incorrect" to have multiple machines with the same BMC IP address and username but a different password:

- It works without any warning on 3.2.
- The database migration fails, leaving you without access to the UI to fix the relevant machines. It's also not at all obvious why it's failing, since the duplicate key reported is an MD5 hash.

This was experienced during a production upgrade from 3.2.7 to 3.3.3 in a large environment where 5 sets of 2 machines had such an issue.

While it generally makes sense to avoid duplicate IPMI IP/username/password combinations, the current unique constraint does not really achieve that properly either: any differing config value in power_parameters causes the unique constraint to pass, and there are many such fields that rarely vary in practice (because they are automatically created) but could. So I also think this particular unique constraint is not really ideal anyway.
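The converse case can be sketched as well (a hypothetical illustration; 'power_driver' is used here only as an example of an automatically populated field): two BMCs with identical IP, username and password still pass the md5-of-the-whole-document check if any other field differs.

```python
import hashlib
import json

# Hypothetical example: identical IP/user/password, but one auto-created
# field ('power_driver' here, purely illustrative) differs between the rows.
a = {"power_address": "10.0.0.10", "power_user": "admin",
     "power_pass": "s3cret", "power_driver": "LAN_2_0"}
b = dict(a, power_driver="LAN")

def digest(params):
    # Stand-in for the index expression md5(power_parameters::text).
    return hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()

print(digest(a) != digest(b))  # True: the duplicate IP/user/pass slips through
```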

Perhaps it would make sense to drop the unique constraint but enforce it at the API/UI level, to avoid this upgrade issue?

Tags: sts
Revision history for this message
Trent Lloyd (lathiat) wrote :

== Postgres Query ==

The following Postgres query will locate any such duplicate machines:

SELECT
    maasserver_bmc.id,
    maasserver_staticipaddress.ip,
    maasserver_node.system_id,
    maasserver_node.hostname,
    maasserver_bmc.power_parameters ->> 'power_user' AS power_user,
    maasserver_bmc.power_parameters ->> 'power_pass' AS power_pass,
    maasserver_bmc.power_parameters ->> 'power_address' AS power_address,
    maasserver_nodemetadata1.meta ->> 'system_vendor',
    maasserver_nodemetadata1.meta ->> 'system_product',
    maasserver_nodemetadata1.meta ->> 'system_serial',
    maasserver_bmc.power_parameters,
    maasserver_node.power_state_queried,
    maasserver_node.power_state_updated,
    maasserver_node.created,
    maasserver_node.updated
FROM
    maasserver_bmc
    JOIN (
        SELECT ARRAY_AGG(id) AS aggregated_ids
        FROM maasserver_bmc
        WHERE power_type = 'ipmi'
        GROUP BY power_parameters - 'power_pass'
        HAVING COUNT(*) > 1
    ) AS dup_rows ON maasserver_bmc.id = ANY(dup_rows.aggregated_ids)
    JOIN maasserver_staticipaddress
        ON maasserver_staticipaddress.id = maasserver_bmc.ip_address_id
    JOIN maasserver_node
        ON maasserver_node.bmc_id = maasserver_bmc.id
    JOIN (
        SELECT maasserver_nodemetadata.node_id,
               jsonb_object_agg(maasserver_nodemetadata.key, maasserver_nodemetadata.value) AS meta
        FROM maasserver_nodemetadata
        GROUP BY maasserver_nodemetadata.node_id
    ) AS maasserver_nodemetadata1
        ON maasserver_nodemetadata1.node_id = maasserver_node.id
ORDER BY
    ip;

=== Upgrade error log ===

Setting up maas-region-controller (1:3.3.3-13184-g.3e9972c19-0ubuntu1~22.04.1) ...
Operations to perform:
Apply all migrations: auth, contenttypes, maasserver, metadataserver, piston3, sessions, sites
Running migrations:
Applying auth.0006_default_auto_field... OK
Applying maasserver.0277_replace_nullbooleanfield... OK
Applying maasserver.0278_generic_jsonfield... OK
Applying maasserver.0279_store_vpd_metadata_for_nodedevice... OK
Applying maasserver.0280_set_parent_for_existing_vms... OK
Applying maasserver.0281_secret_model... OK
Applying maasserver.0282_rpc_shared_secret_to_secret... OK
Applying maasserver.0283_migrate_tls_secrets... OK
Applying maasserver.0284_migrate_more_global_secrets... OK
Applying maasserver.0285_migrate_external_auth_secrets... OK
Applying maasserver.0286_node_deploy_metadata... OK
Applying maasserver.0287_add_controller_info_vault_flag... OK
Applying maasserver.0288_rootkey_material_secret... OK
Applying maasserver.0289_vault_secret... OK
Applying maasserver.0290_migrate_node_power_parameters...Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "maasserver_bmc_power_type_parameters_idx"
DETAIL: Key (power_type, md5(power_parameters::text))=(ipmi, d8e8fca2dc0f896fd7cb4cb0031ba249) already exists.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/sbin/maas-region", line 33, in <module>
    sys.exit(load_entry_point('maas==3.3.3', 'console_scripts', 'maas-region')())
  File "/usr/lib/python3/dist-packages/maasserver/region_script.py",...


Changed in maas:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Igor Brovtsin (igor-brovtsin)
Revision history for this message
Trent Lloyd (lathiat) wrote :

Workaround:

Update the BMC details of one of the machines to use the 'manual' power type. They will no longer have BMC access, but the upgrade can be completed.

You could potentially also change this query to instead update the IP address of the BMC, but I have not tested that. You'd need to at least update both the power_parameters field and the ip_address_id field, which links to the maasserver_staticipaddress table.

sudo -u postgres -i psql maasdb

====
UPDATE maasserver_bmc
SET
    power_type = 'manual',
    power_parameters = '{}',
    ip_address_id = NULL,
    capabilities = '{}',
    version = '',
    pool_id = NULL,
    created_by_commissioning = NULL
FROM
    maasserver_node
WHERE
    maasserver_node.bmc_id = maasserver_bmc.id
    AND maasserver_node.system_id IN ('xxxxxx', 'yyyyyy');
====

Revision history for this message
Trent Lloyd (lathiat) wrote :

Initially I thought you could not change the power configuration of a node, even if you knew about this issue ahead of time. That was actually incorrect: in my testing the nodes had been locked (Take action -> Lock). If unlocked, you can just edit the power configuration and adjust it or set it to manual.

However, in practice most users will only find out about this issue partway through the upgrade, when the UI is not working. So this only helps if someone can revert to a backup to get the installation working again.

For everyone else, the PostgreSQL query I posted is the best workaround.

description: updated
tags: added: sts
tags: added: bug-council
Revision history for this message
Thorsten Merten (thorsten-merten) wrote :

@bill: can you add this to the list of known issues for 3.3: if multiple machines' BMCs share the same IP address and username but have different passwords, this needs to be disentangled.

@trent: could you clarify why they ended up in this situation?

Changed in maas-doc:
importance: Undecided → Medium
milestone: none → 3.3.x
Changed in maas:
status: Triaged → Won't Fix
assignee: Igor Brovtsin (igor-brovtsin) → nobody
Changed in maas-doc:
assignee: nobody → Bill Wear (billwear)
status: New → Incomplete
status: Incomplete → Triaged
tags: removed: bug-council
Revision history for this message
Trent Lloyd (lathiat) wrote :

@Thorsten: I asked them and basically there is no good reason, it just happened by accident.

It's a very large environment with many thousands of nodes, upgraded through many MAAS versions with many people involved, and it seems that in 5 cases, for whatever reason, the old BMCs got disconnected and their IPs were later re-used. Since BMC responsiveness is not actively monitored and the BMCs are not regularly used to power machines, no one noticed. They were happy to switch those machines to manual, since it mostly doesn't matter (they aren't frequently re-deployed), and they may fix things up later to reconnect the BMCs.

While they intend to keep the BMC function working, and for most nodes it does, it's not hard to see this happening occasionally in such a large and long-lived environment.

It seems this was handled in the past: 0190_bmc_clean_duplicates.py specifically found various duplicate cases and, for BMCs, unassigned all but one of them when the unique key was first added. Something similar could simply be added to 3.3:
https://github.com/maas/maas/blob/master/src/maasserver/migrations/maasserver/0190_bmc_clean_duplicates.py
https://github.com/maas/maas/commit/f39e29d623

One thing not done there, but which would probably make sense, is to try to keep the most recently created (or perhaps most recently updated or deployed) node, which is more likely to be the correct one in most cases.
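That suggestion could be sketched roughly as follows (a hypothetical standalone sketch, not the actual 0190 migration code; the dict-based rows and the clean_duplicates name stand in for whatever ORM objects a real migration would operate on):

```python
from collections import defaultdict

def clean_duplicates(bmcs):
    """Demote all but the newest BMC in each duplicate group to 'manual'.

    bmcs: list of dicts with 'id', 'created', 'power_type' and
    'power_parameters' keys, a simplified stand-in for the real rows.
    Returns the list of demoted BMCs, mutated in place.
    """
    groups = defaultdict(list)
    for bmc in bmcs:
        if bmc["power_type"] != "ipmi":
            continue
        # Group key: every power parameter except the password, mirroring
        # GROUP BY power_parameters - 'power_pass' in the query above.
        key = tuple(sorted((k, v) for k, v in bmc["power_parameters"].items()
                           if k != "power_pass"))
        groups[key].append(bmc)
    demoted = []
    for dupes in groups.values():
        dupes.sort(key=lambda b: b["created"], reverse=True)
        for bmc in dupes[1:]:  # keep only the most recently created one
            bmc["power_type"] = "manual"
            bmc["power_parameters"] = {}
            demoted.append(bmc)
    return demoted
```

Run before the unique index is tightened, this would leave at most one IPMI BMC per (parameters minus password) group, matching the spirit of the 0190 cleanup.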

Revision history for this message
Trent Lloyd (lathiat) wrote :

I'm confused at this being a Won't Fix. At least something like 0190_bmc_clean_duplicates.py to avoid breaking upgrades seems sensible and not particularly difficult to implement.

Revision history for this message
Thorsten Merten (thorsten-merten) wrote :

Hi Trent. We discussed this in the team and we will make sure to cover this situation properly in the documentation. As you said, the unique constraint helps to avoid a configuration that is not particularly "correct".

That said, an automatic resolution would remove one of the passwords, and we do not really know which one to remove. That is why we think the docs are the best place to tackle this.
