SSH to guest sometimes fails publickey authentication: AuthenticationException: Authentication failed.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack-Gate | New | Undecided | Unassigned |
Bug Description
Seen in the gate today in the tempest-slow job [1]:
2021-01-13 05:10:39.881333 | controller | 2021-01-13 05:10:23,697 8834 ERROR [tempest.
2021-01-13 05:10:39.881352 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881370 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881392 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881411 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881429 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881447 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881465 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
2021-01-13 05:10:39.881483 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.
Logstash query:
44 hits in the last 7 days, but only 3 unique changes. All failures
It looks like there are a variety of messages in the guest console output indicating why ssh auth ended up failing, depending on how far the boot got. All of them have to do with failures to retrieve data from the metadata service.
Here are some examples. First [2]:
2021-01-07 16:15:28.308383 | controller | WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
2021-01-07 16:15:28.308395 | controller | cirros-ds 'net' up at 11.93
2021-01-07 16:15:28.308405 | controller | checking http://
2021-01-07 16:15:28.308414 | controller | successful after 1/20 tries: up 12.27. iid=i-00000082
2021-01-07 16:15:28.308423 | controller | failed to get http://
2021-01-07 16:15:28.308432 | controller | warning: no ec2 metadata for public-keys
2021-01-07 16:15:28.308441 | controller | failed to get http://
2021-01-07 16:15:28.308450 | controller | warning: no ec2 metadata for user-data
2021-01-07 16:15:28.308460 | controller | found datasource (ec2, net)
2021-01-07 16:15:28.308469 | controller | Top of dropbear init script
2021-01-07 16:15:28.308478 | controller | Starting dropbear sshd: WARN: generating key of type ecdsa failed!
This shows it successfully got the instance id from the metadata service, but then failed to get the public keys (which I think is what causes the ssh failure).
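For reference when reading these console logs, here is a rough sketch (in Python, not the actual cirros shell code) of the fetch order the cirros ec2 datasource follows against the metadata service. The 169.254.169.254 address and the 2009-04-04 paths are the standard EC2-style endpoints that nova's metadata API serves; the fetch() helper is just for illustration:

```python
# Minimal sketch of the cirros ec2 datasource fetch order against the
# nova metadata API (EC2-compatible, 2009-04-04 paths). Error handling
# is simplified for illustration.
import urllib.request
import urllib.error

MD_BASE = "http://169.254.169.254/2009-04-04"

def fetch(path, timeout=10):
    """Return the body of a metadata GET, or None on failure."""
    try:
        with urllib.request.urlopen(f"{MD_BASE}/{path}", timeout=timeout) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, OSError):
        return None

# 1. instance-id succeeds ("successful after 1/20 tries: ... iid=i-00000082")
iid = fetch("meta-data/instance-id")

# 2. public-keys then fails ("warning: no ec2 metadata for public-keys"),
#    so no key is ever written to authorized_keys and pubkey auth fails.
keys = fetch("meta-data/public-keys/0/openssh-key")
user_data = fetch("user-data")
print(iid, bool(keys), bool(user_data))
```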
Second [3]:
2021-01-12 17:47:52.961860 | primary | WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
2021-01-12 17:47:52.961875 | primary | cirros-ds 'net' up at 9.04
2021-01-12 17:47:52.961890 | primary | checking http://
2021-01-12 17:47:52.961904 | primary | failed 1/20: up 9.32. request failed
2021-01-12 17:47:52.961919 | primary | failed 2/20: up 11.83. request failed
2021-01-12 17:47:52.961933 | primary | failed 3/20: up 14.15. request failed
2021-01-12 17:47:52.961948 | primary | failed 4/20: up 16.43. request failed
2021-01-12 17:47:52.961963 | primary | failed 5/20: up 18.84. request failed
2021-01-12 17:47:52.961977 | primary | failed 6/20: up 21.30. request failed
2021-01-12 17:47:52.961992 | primary | failed 7/20: up 23.73. request failed
2021-01-12 17:47:52.962007 | primary | failed 8/20: up 26.06. request failed
2021-01-12 17:47:52.962021 | primary | failed 9/20: up 28.41. request failed
2021-01-12 17:47:52.962048 | primary | failed 10/20: up 30.77. request failed
2021-01-12 17:47:52.962065 | primary | failed 11/20: up 32.98. request failed
2021-01-12 17:47:52.962080 | primary | failed 12/20: up 35.33. request failed
2021-01-12 17:47:52.962095 | primary | failed 13/20: up 37.77. request failed
2021-01-12 17:47:52.962109 | primary | failed 14/20: up 40.18. request failed
2021-01-12 17:47:52.962124 | primary | failed 15/20: up 42.42. request failed
2021-01-12 17:47:52.962139 | primary | failed 16/20: up 44.71. request failed
2021-01-12 17:47:52.962153 | primary | failed 17/20: up 47.16. request failed
2021-01-12 17:47:52.962168 | primary | failed 18/20: up 49.63. request failed
2021-01-12 17:47:52.962183 | primary | failed 19/20: up 52.07. request failed
2021-01-12 17:47:52.962197 | primary | failed 20/20: up 54.46. request failed
2021-01-12 17:47:52.962212 | primary | failed to read iid from metadata. tried 20
2021-01-12 17:47:52.962227 | primary | no results found for mode=net. up 56.78. searched: nocloud configdrive ec2
2021-01-12 17:47:52.962241 | primary | failed to get instance-id of datasource
2021-01-12 17:47:52.962256 | primary | Top of dropbear init script
2021-01-12 17:47:52.962278 | primary | Starting dropbear sshd: failed to get instance-id of datasource
2021-01-12 17:47:52.962292 | primary | WARN: generating key of type ecdsa failed!
This shows it failing to get the instance id from the metadata service at all, after which it never tries to get the public keys.
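The "failed N/20" lines above correspond to a bounded retry loop for the instance-id. A minimal Python approximation of that loop (cirros itself does this in shell; the 2-second spacing is an assumption eyeballed from the "up" timestamps):

```python
# Rough sketch of the 20-attempt retry loop visible in the console log
# ("failed 1/20 ... failed 20/20 ... failed to read iid from metadata").
import time
import urllib.request
import urllib.error

MD_URL = "http://169.254.169.254/2009-04-04/meta-data/instance-id"

iid = None
for attempt in range(1, 21):
    try:
        with urllib.request.urlopen(MD_URL, timeout=10) as resp:
            iid = resp.read().decode()
            break
    except (urllib.error.URLError, OSError):
        print(f"failed {attempt}/20: request failed")
        time.sleep(2)  # assumed spacing between attempts

if iid is None:
    # Without an instance-id the datasource is abandoned, so the
    # public-keys lookup is never even attempted.
    print("failed to read iid from metadata. tried 20")
```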
Third [1]:
2021-01-13 05:10:39.886171 | controller | WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
2021-01-13 05:10:39.886185 | controller | cirros-ds 'net' up at 6.93
2021-01-13 05:10:39.886195 | controller | checking http://
2021-01-13 05:10:39.886204 | controller | successful after 1/20 tries: up 7.25. iid=i-0000002c
2021-01-13 05:10:39.886213 | controller | failed to get http://
2021-01-13 05:10:39.886223 | controller | warning: no ec2 metadata for public-keys
2021-01-13 05:10:39.886232 | controller | failed to get http://
2021-01-13 05:10:39.886248 | controller | warning: no ec2 metadata for user-data
2021-01-13 05:10:39.886258 | controller | found datasource (ec2, net)
2021-01-13 05:10:39.886267 | controller | Top of dropbear init script
2021-01-13 05:10:39.886276 | controller | Starting dropbear sshd: WARN: generating key of type ecdsa failed!
This shows it failing to get a specific public key from the metadata service.
In all of these jobs, force_config_drive is not set in nova-cpu_conf.txt (so it defaults to False), which is why the guest goes to the metadata service for its data.
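For context on why that default matters: if force_config_drive were set to True in the compute service's config, the guest would read its keys from a config drive and skip the metadata service entirely. A sketch of what that would look like in nova-cpu.conf (noted only to explain the default code path, not proposed as a fix):

```ini
# nova-cpu.conf on the compute node: with the default (False) the guest
# must fetch instance-id/public-keys over HTTP from the metadata service,
# which is the path failing in the logs above.
[DEFAULT]
force_config_drive = True
```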
[1] https:/
[2] https:/
[3] https:/
e-r query proposed here:
https://review.opendev.org/c/opendev/elastic-recheck/+/770688