NetApp: delete share "Failed to rename ruleset: duplicate entry"

Bug #1966176 reported by Maurice Escher
This bug affects 2 people
Affects: OpenStack Shared File Systems Service (Manila)
Status: Triaged
Importance: Medium
Assigned to: Felipe Rodrigues

Bug Description

Hi,

I'm running Xena with NetApp in DHSS=true (but I remember seeing this in earlier releases, too).
Sometimes, after a manila-share service restart, I find myself unable to delete some shares.

The trace is the following:
https://paste.opendev.org/show/813416/

The api_args of rename_nfs_export_policy look like the following:

{
'new-policy-name': 'policy_<uuid>',
'policy-name': 'temp_<other_uuid>'
}
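
For reference, as far as I can tell the driver builds this request roughly like the following (my paraphrase of the cmode client, not an exact copy):

def rename_nfs_export_policy(self, old_policy_name, new_policy_name):
    api_args = {
        'policy-name': old_policy_name,
        'new-policy-name': new_policy_name,
    }
    # ONTAP answers 13001 "duplicate entry" when a policy named
    # new_policy_name already exists on the vserver
    self.send_request('export-policy-rename', api_args)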

I can only get out of this by manually deleting the respective export-policy 'policy_<uuid>' on the NetApp storage backend.

I wonder if there could be a shortcut: when the request is to delete all the rules anyway, remove them directly instead of going through the complex update-via-rename process.
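
Something along these lines, purely as a sketch on my side (the helper name is made up and this is not actual driver code):

from manila.share.drivers.netapp.dataontap.client import api as netapp_api

def _drop_export_policy_on_delete(client, policy_name):
    # Hypothetical shortcut: the share is going away anyway, so skip the
    # temp-policy/rename dance and destroy the policy directly, best effort.
    try:
        client.send_request('export-policy-destroy',
                            {'policy-name': policy_name})
    except netapp_api.NaApiError:
        # ignore errors, e.g. the policy is already gone
        pass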

_update_rules_through_share_driver() gets called with:
https://paste.opendev.org/show/bYrFNrFhETMFCdH8mtqB/

BR,
Maurice

tags: added: netapp
tags: added: access-rules
Vida Haririan (vhariria)
Changed in manila:
importance: Undecided → Medium
Changed in manila:
assignee: nobody → Nahim Alves de Souza (nahimsouza)
Revision history for this message
Vida Haririan (vhariria) wrote :
Changed in manila:
assignee: Nahim Alves de Souza (nahimsouza) → Andre Luiz Beltrami Rocha (andrebeltrami)
Revision history for this message
Andre Luiz Beltrami Rocha (andrebeltrami) wrote :

Hi Maurice,
Do you have any additional information about how to replicate this issue? Were the shares in 'deleting' state during the share service restart? Does the environment have a lot of shares being deleted? How many? Do they have several export policy rules? I would really appreciate any feedback on that.
Thanks, André.

Revision history for this message
Maurice Escher (maurice-escher) wrote :

I did not try to reproduce it.
The only additional info I can share is that (in addition to the real customer workload) we have periodic tests running that create a share, add access rules, mount the share and delete the share. I can imagine that the restart happened after the access-rule API call had been made.
And then the test sends a delete-share API call, because the rule does not become active within our configured test timeout (1 minute in our case).

We run this every 5 minutes in every availability zone, so there is not too much going on in parallel; only the timing of the restart is sometimes bad luck.

Roughly estimated, I saw this error only once in 200 manila-share restarts.

Revision history for this message
Andre Luiz Beltrami Rocha (andrebeltrami) wrote :

Hi Maurice, thank you for this feedback. I tried to reproduce the bug, but without success. I added a 180-second sleep in the NetApp driver's update-access-rules method, but I did not see the reported behavior. In my tests, after creating a share, adding an access rule, deleting the share while the access rules are in 'applying' state, and then restarting the manila-share service, I cannot get the reported exception. In this scenario, after the restart the access rule goes to 'denying' state and the share gets stuck in 'deleting' state. I think we can drop the importance of this bug to Low, since there is no deterministic way to reproduce it yet.

Revision history for this message
Dmitry Galkin (galkindmitrii) wrote :

We have this happening again. The NetApp driver tries to rename the temporary policy while an old policy that has been disassociated from a share has not been removed. No rule changes can be applied until we go to the NetApp and manually find and delete the old policy.

2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server [req-0c05e614-4312-4343-abe4-b5ada9f065b6 gNone ff9770e6ecfe4689ba14b6fc796d380701f664462022ec14e804effbb319503d ad7380891cd144ee9611a3b68d286842 - - -] Exception during message handling: manila.share.drivers.netapp.dataontap.client.api.NaApiError: NetApp API failed. Reason - 13001:Failed to rename ruleset: duplicate entry
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/oslo_messaging/rpc/server.py", line 167, in _process_incoming
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/manila/share/manager.py", line 232, in wrapped
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server return f(self, *args, **kwargs)
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/manila/utils.py", line 573, in wrapper
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server return func(self, *args, **kwargs)
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/manila/share/manager.py", line 4085, in update_access
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server self.update_access_for_instances(context, [share_instance_id],
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/manila/share/manager.py", line 4099, in update_access_for_instances
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server self.access_helper.update_access_rules(
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/manila/share/access.py", line 313, in update_access_rules
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server self._update_access_rules(context, share_instance_id,
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/manila/share/access.py", line 350, in _update_access_rules
2022-07-08 10:02:52,417 24 ERROR oslo_messaging.rpc.server drive...


Changed in manila:
assignee: Andre Luiz Beltrami Rocha (andrebeltrami) → Felipe Rodrigues (felipefutty)
Revision history for this message
Felipe Rodrigues (felipefutty) wrote (last edit):

Hi folks,

I am trying to reproduce the issue and I am at the same stage as andrebeltrami: I could not reproduce it. I tried some sleeps in update_access together with several create, delete and add/remove access operations on the share, but I have not hit the bug.

Reading the code, it seems that the reported issue happens when a share's export policy is for some reason created but not assigned to the share, or is assigned but gets lost at some point: it stays on the storage without being assigned to the share. When a later update_access runs, it sees that the share does not have the expected policy name and tries to rename the temporary policy to it, which fails because a policy with that name already exists on the storage.
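
In other words, the sequence I suspect looks roughly like this (simplified illustration, not the exact driver code):

client.create_nfs_export_policy('temp_<other_uuid>')    # temp policy holding the new rules
client.rename_nfs_export_policy('temp_<other_uuid>',    # rename it to the share's policy name
                                'policy_<uuid>')
# -> fails with 13001 "Failed to rename ruleset: duplicate entry", because a
#    stale 'policy_<uuid>' is still present on the storage even though it is
#    not assigned to the share's volume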

This scenario is not easy to reach, so I would like to ask you for a better log. Could you please turn debug mode on?

To get the debug mode:
- add debug = True to the [DEFAULT] section of manila.conf
- add trace = True and netapp_trace_flags = method,api to the backend stanza
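
For example, something like this in manila.conf (the backend stanza name below is just a placeholder, use your own):

[DEFAULT]
debug = True

[netapp_backend]
trace = True
netapp_trace_flags = method,api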

You can send only the part of the log from when the share is created until the bug is hit for that share.

With this log trace we can understand how the policy is getting lost.

Thank you all.

Revision history for this message
Maurice Escher (maurice-escher) wrote :

Turning trace flags on in production is sadly not an option.

We see this error in roughly 1 of 200k share creations. My assumption is that it is related to high CPU load on the nodes. The last occurrence is too far in the past to make useful correlations; we are waiting for it to happen again...

Can you nevertheless try to improve the delete use case, as I suggested in the bug description? When the share instance is going to be removed anyway, there is no need to update export policies one by one; just get rid of all of them and ignore errors, no?

Vida Haririan (vhariria)
Changed in manila:
status: New → Triaged