SSH to guest sometimes fails publickey authentication: AuthenticationException: Authentication failed.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack-Gate | New | Undecided | Unassigned |
Bug Description
Seen in the gate today in the tempest-slow job [1]:
2021-01-13 05:10:39.881333 | controller | 2021-01-13 05:10:23,697 8834 ERROR [tempest.
2021-01-13 05:10:39.881352 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881370 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881392 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881411 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881429 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881447 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881465 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881483 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
Logstash query:
44 hits in the last 7 days, but only 3 unique changes. All failures
It looks like there are a variety of messages in the guest console output indicating why ssh auth ended up failing, depending on how far the boot got. All of them have to do with failures to retrieve data from the metadata service.
Here are some examples. First [2]:
2021-01-07 16:15:28.308383 | controller | WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
2021-01-07 16:15:28.308395 | controller | cirros-ds 'net' up at 11.93
2021-01-07 16:15:28.308405 | controller | checking http://
2021-01-07 16:15:28.308414 | controller | successful after 1/20 tries: up 12.27. iid=i-00000082
2021-01-07 16:15:28.308423 | controller | failed to get http://
2021-01-07 16:15:28.308432 | controller | warning: no ec2 metadata for public-keys
2021-01-07 16:15:28.308441 | controller | failed to get http://
2021-01-07 16:15:28.308450 | controller | warning: no ec2 metadata for user-data
2021-01-07 16:15:28.308460 | controller | found datasource (ec2, net)
2021-01-07 16:15:28.308469 | controller | Top of dropbear init script
2021-01-07 16:15:28.308478 | controller | Starting dropbear sshd: WARN: generating key of type ecdsa failed!
This shows it successfully got the instance id from the metadata service, but then failed to get the public keys (which I think is what causes the ssh failure).
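For reference when reading these console logs, here is a rough sketch (in Python, not the actual cirros shell code) of the fetch order the cirros ec2 datasource follows against the metadata service. The 169.254.169.254 address and the 2009-04-04 paths are the standard EC2-style endpoints that nova's metadata API serves; the fetch() helper is just for illustration:

```python
# Minimal sketch of the cirros ec2 datasource fetch order against the
# nova metadata API (EC2-compatible, 2009-04-04 paths). Error handling
# is simplified for illustration.
import urllib.request
import urllib.error

MD_BASE = "http://169.254.169.254/2009-04-04"

def fetch(path, timeout=10):
    """Return the body of a metadata GET, or None on failure."""
    try:
        with urllib.request.urlopen(f"{MD_BASE}/{path}", timeout=timeout) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, OSError):
        return None

# 1. instance-id succeeds ("successful after 1/20 tries: ... iid=i-00000082")
iid = fetch("meta-data/instance-id")

# 2. public-keys then fails ("warning: no ec2 metadata for public-keys"),
#    so no key is ever written to authorized_keys and pubkey auth fails.
keys = fetch("meta-data/public-keys/0/openssh-key")
user_data = fetch("user-data")
print(iid, bool(keys), bool(user_data))
```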
Second [3]:
2021-01-12 17:47:52.961860 | primary | WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
2021-01-12 17:47:52.961875 | primary | cirros-ds 'net' up at 9.04
2021-01-12 17:47:52.961890 | primary | checking http://
2021-01-12 17:47:52.961904 | primary | failed 1/20: up 9.32. request failed
2021-01-12 17:47:52.961919 | primary | failed 2/20: up 11.83. request failed
2021-01-12 17:47:52.961933 | primary | failed 3/20: up 14.15. request failed
2021-01-12 17:47:52.961948 | primary | failed 4/20: up 16.43. request failed
2021-01-12 17:47:52.961963 | primary | failed 5/20: up 18.84. request failed
2021-01-12 17:47:52.961977 | primary | failed 6/20: up 21.30. request failed
2021-01-12 17:47:52.961992 | primary | failed 7/20: up 23.73. request failed
2021-01-12 17:47:52.962007 | primary | failed 8/20: up 26.06. request failed
2021-01-12 17:47:52.962021 | primary | failed 9/20: up 28.41. request failed
2021-01-12 17:47:52.962048 | primary | failed 10/20: up 30.77. request failed
2021-01-12 17:47:52.962065 | primary | failed 11/20: up 32.98. request failed
2021-01-12 17:47:52.962080 | primary | failed 12/20: up 35.33. request failed
2021-01-12 17:47:52.962095 | primary | failed 13/20: up 37.77. request failed
2021-01-12 17:47:52.962109 | primary | failed 14/20: up 40.18. request failed
2021-01-12 17:47:52.962124 | primary | failed 15/20: up 42.42. request failed
2021-01-12 17:47:52.962139 | primary | failed 16/20: up 44.71. request failed
2021-01-12 17:47:52.962153 | primary | failed 17/20: up 47.16. request failed
2021-01-12 17:47:52.962168 | primary | failed 18/20: up 49.63. request failed
2021-01-12 17:47:52.962183 | primary | failed 19/20: up 52.07. request failed
2021-01-12 17:47:52.962197 | primary | failed 20/20: up 54.46. request failed
2021-01-12 17:47:52.962212 | primary | failed to read iid from metadata. tried 20
2021-01-12 17:47:52.962227 | primary | no results found for mode=net. up 56.78. searched: nocloud configdrive ec2
2021-01-12 17:47:52.962241 | primary | failed to get instance-id of datasource
2021-01-12 17:47:52.962256 | primary | Top of dropbear init script
2021-01-12 17:47:52.962278 | primary | Starting dropbear sshd: failed to get instance-id of datasource
2021-01-12 17:47:52.962292 | primary | WARN: generating key of type ecdsa failed!
This shows it failing to get the instance id from the metadata service at all, after which it never tries to get the public keys.
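The "failed N/20" lines above correspond to a bounded retry loop for the instance-id. A minimal Python approximation of that loop (cirros itself does this in shell; the 2-second spacing is an assumption eyeballed from the "up" timestamps):

```python
# Rough sketch of the 20-attempt retry loop visible in the console log
# ("failed 1/20 ... failed 20/20 ... failed to read iid from metadata").
import time
import urllib.request
import urllib.error

MD_URL = "http://169.254.169.254/2009-04-04/meta-data/instance-id"

iid = None
for attempt in range(1, 21):
    try:
        with urllib.request.urlopen(MD_URL, timeout=10) as resp:
            iid = resp.read().decode()
            break
    except (urllib.error.URLError, OSError):
        print(f"failed {attempt}/20: request failed")
        time.sleep(2)  # assumed spacing between attempts

if iid is None:
    # Without an instance-id the datasource is abandoned, so the
    # public-keys lookup is never even attempted.
    print("failed to read iid from metadata. tried 20")
```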
Third [1]:
2021-01-13 05:10:39.886171 | controller | WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
2021-01-13 05:10:39.886185 | controller | cirros-ds 'net' up at 6.93
2021-01-13 05:10:39.886195 | controller | checking http://
2021-01-13 05:10:39.886204 | controller | successful after 1/20 tries: up 7.25. iid=i-0000002c
2021-01-13 05:10:39.886213 | controller | failed to get http://
2021-01-13 05:10:39.886223 | controller | warning: no ec2 metadata for public-keys
2021-01-13 05:10:39.886232 | controller | failed to get http://
2021-01-13 05:10:39.886248 | controller | warning: no ec2 metadata for user-data
2021-01-13 05:10:39.886258 | controller | found datasource (ec2, net)
2021-01-13 05:10:39.886267 | controller | Top of dropbear init script
2021-01-13 05:10:39.886276 | controller | Starting dropbear sshd: WARN: generating key of type ecdsa failed!
This shows it failing to get a specific public key from the metadata service.
In all of these jobs, force_config_drive is not set in nova-cpu_conf.txt (so it defaults to False), which is why the guest goes to the metadata service for its data.
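For context on why that default matters: if force_config_drive were set to True in the compute service's config, the guest would read its keys from a config drive and skip the metadata service entirely. A sketch of what that would look like in nova-cpu.conf (noted only to explain the default code path, not proposed as a fix):

```ini
# nova-cpu.conf on the compute node: with the default (False) the guest
# must fetch instance-id/public-keys over HTTP from the metadata service,
# which is the path failing in the logs above.
[DEFAULT]
force_config_drive = True
```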
[1] https:/
[2] https:/
[3] https:/
e-r query proposed here:
https://review.opendev.org/c/opendev/elastic-recheck/+/770688