Zaza tests need to implement retries when external problems occur (test env flakiness)

Bug #1887510 reported by Alvaro Uria
Affects: OpenStack Cinder Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

In the last week, I have tried to re-run all gate tests for PR [1]. However, the tests fail in different ways:
* sample [2]: the xenial-pike gate test creates a volume in an error state (all focal, bionic, and the two other xenial deployments [queens, ocata] worked fine; furthermore, the xenial-pike gate test passed when I ran it manually from my serverstack bastion host)
* sample [3]: the first gate bundle did not reach idle before the timeout

So far, I've tried 4 times, and only the first failure was legitimate (the Func-Test-PR regexp failed because of a trailing slash, which I fixed in a subsequent patchset).

1. https://review.opendev.org/#/c/739182/
2. https://pastebin.ubuntu.com/p/KMP9xNJQgW/
3. https://pastebin.ubuntu.com/p/sTqtgzJ9Tw/
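The retry behaviour the bug title asks for could be sketched roughly as follows. This is a hypothetical helper, not something zaza currently ships: it re-invokes an operation when a transient environment error (e.g. the ECONNREFUSED seen in the logs below) occurs, instead of failing the test on the first hit.

```python
import time


def retry_on_env_error(func, attempts=3, delay=0.0,
                       retry_exceptions=(ConnectionRefusedError, TimeoutError)):
    """Call ``func`` and retry when a transient test-env error occurs.

    Illustrative sketch only; the function name, signature, and the
    exception list are assumptions, not zaza API.
    """
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_exceptions as exc:
            last_exc = exc
            if attempt < attempts:
                time.sleep(delay)  # back off before retrying
    raise last_exc


# Example: a call that fails twice with ECONNREFUSED, then succeeds.
calls = {"n": 0}


def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionRefusedError(111, "ECONNREFUSED")
    return "connected"


result = retry_on_env_error(flaky_connect, attempts=5)
```

In a real zaza test helper, the retried exception set would need to be restricted to errors that genuinely indicate environment flakiness, so that real product failures (like the volume-in-error-state case) are still reported.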

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Have you looked through the logs about why the newly created volume went into an Error state?

Revision history for this message
Alvaro Uria (aluria) wrote :

@chris yes, although I did not consider it relevant, because the same test passed in previous gate runs and when run manually.

cinder-volume.log shows:
"""
2020-07-13 11:39:17.610 18925 ERROR cinder.cmd.volume [req-ecfbb173-3674-46e2-ba8d-b3f89555c23e - - - - -] Volume service juju-3c71ff-zaza-8c0ecc927ef0-0@lvm failed to start.: DBNonExistentTable: (sqlite3.OperationalError) no such table: services [SQL: u'SELECT services.created_at AS services_created_at, services.deleted_at AS services_deleted_at, services.deleted AS services_deleted, services.id AS services_id, services.cluster_name AS services_cluster_name, services.host AS services_host, services.binary AS services_binary, services.updated_at AS services_updated_at, services.topic AS services_topic, services.report_count AS services_report_count, services.disabled AS services_disabled, services.availability_zone AS services_availability_zone, services.disabled_reason AS services_disabled_reason, services.modified_at AS services_modified_at, services.rpc_current_version AS services_rpc_current_version, services.object_current_version AS services_object_current_version, services.replication_status AS services_replication_status, services.active_backend_id AS services_active_backend_id, services.frozen AS services_frozen \nFROM services \nWHERE services.deleted = 0 AND services.binary = ?'] [parameters: ('cinder-scheduler',)]
"""

And the same log also shows:
"""
2020-07-13 11:42:40.214 24040 ERROR oslo.messaging._drivers.impl_rabbit [req-f26ade91-573f-4847-b041-6e81199e6d95 - - - - -] [1abe28df-dca5-406e-b261-16179df0d178] AMQP server on 127.0.0.1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
"""

This happened after a successful test that paused and resumed the cinder services.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Given that it sounds like the service isn't actually available, I'd suggest that the pause/resume test is incorrectly passing.

Revision history for this message
Alvaro Uria (aluria) wrote :

I tried to run charm-recheck-full, and it now failed on xenial-queens with the error:
"""
OSError: [Errno 113] Connect call failed ('252.10.0.1', 17070)
"""

That error happened when async_block_until_wl_status_info_starts_with tried to connect to the model. Every time, it is a different error at a different point.
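Since the failing calls here are coroutines (e.g. async_block_until_wl_status_info_starts_with), a retry wrapper would need an async form. A minimal sketch, assuming a hypothetical retry_await helper (not zaza API) that retries on OSError, which covers both Errno 111 and the Errno 113 seen above:

```python
import asyncio


async def retry_await(coro_factory, attempts=3, delay=0.0):
    """Await coro_factory() and retry on OSError (e.g. Errno 113).

    Illustrative sketch only; the name and signature are assumptions.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await coro_factory()
        except OSError:
            if attempt == attempts:
                raise  # out of retries: surface the real error
            await asyncio.sleep(delay)


# Example: a model call that fails once with Errno 113, then succeeds.
state = {"n": 0}


async def flaky_model_call():
    state["n"] += 1
    if state["n"] < 2:
        raise OSError(113, "Connect call failed")
    return "active"


result = asyncio.run(retry_await(flaky_model_call, attempts=3))
```

In practice the factory would be something like `lambda: model.async_block_until_wl_status_info_starts_with(...)`, so each retry re-establishes the model connection rather than reusing a dead one.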

WRT pause/resume not working: I manually ran xenial-pike (commenting out all gate tests except the xenial-pike one) via "tox -e func", and it passed.
