To improve cluster stability, please create a test that randomly kills or stops cluster services.
Services to stop or kill: all OpenStack services, one service at a time.
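A minimal sketch of how the random, one-at-a-time selection could be generated. The function name fault_schedule, the inventory layout, and the node names are hypothetical and only illustrate the idea; the actual stop/kill mechanism is left out.

```python
import random

def fault_schedule(services_by_node, actions=("stop", "kill"), seed=None):
    """Yield (node, service, action) tuples in random order, one per step.

    services_by_node is a hypothetical inventory: {node name: [service names]}.
    Passing a seed makes a failing run reproducible.
    """
    rng = random.Random(seed)
    # Flatten the inventory so every service on every node is hit exactly once.
    targets = [(node, svc)
               for node, svcs in services_by_node.items()
               for svc in svcs]
    rng.shuffle(targets)
    for node, svc in targets:
        yield node, svc, rng.choice(actions)

# Hypothetical inventory used only for illustration.
inventory = {
    "controller-1": ["nova-api", "keystone"],
    "compute-1": ["nova-compute"],
}
schedule = list(fault_schedule(inventory, seed=42))
for node, service, action in schedule:
    print(f"{action} {service} on {node}")
```

Consuming the generator one item at a time keeps only a single service down at any moment, provided the caller waits for recovery before requesting the next item.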
What the test should verify:
A stopped or killed service comes back online if an automatic recovery procedure exists for it.
The cloud monitoring system reports the problem with a detailed explanation of which service failed on which cluster nodes. If an automatic recovery procedure exists, the cloud monitoring system should also report the recovery.
The test log should contain, for each service and node: the timestamp when the stop/kill command was sent, the timestamp when the service stopped, the timestamp when the cloud monitoring system reported the problem, and the timestamp when the service recovered (if an automatic recovery procedure exists).
The time differences in seconds between points p1-p2, p2-p3, p3-p4, and p1-p3 should be logged as well:
point 1 - service was stopped
point 2 - cloud monitoring system was able to report the problem
point 3 - service was recovered, either automatically (if an automatic recovery procedure exists) or manually (only for services without one)
point 4 - cloud monitoring system reported the service recovery
For services with an automatic recovery procedure, the time differences should satisfy p1-p2 < p1-p3, i.e. the monitoring system should report the problem before the service recovers.
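The interval bookkeeping for points 1 to 4 above could look roughly like the following sketch. The class FaultTimeline and its field names are hypothetical; timestamps are assumed to be seconds since the epoch, and the sample values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class FaultTimeline:
    """Timestamps (seconds) for one injected fault; all names are illustrative."""
    service: str
    node: str
    p1_stopped: float            # point 1: service was stopped
    p2_problem_reported: float   # point 2: monitoring reported the problem
    p3_recovered: float          # point 3: service was recovered
    p4_recovery_reported: float  # point 4: monitoring reported the recovery

    def intervals(self):
        """Return the four time differences the test should log, in seconds."""
        return {
            "p1-p2": self.p2_problem_reported - self.p1_stopped,
            "p2-p3": self.p3_recovered - self.p2_problem_reported,
            "p3-p4": self.p4_recovery_reported - self.p3_recovered,
            "p1-p3": self.p3_recovered - self.p1_stopped,
        }

    def detected_before_recovery(self):
        """Check p1-p2 < p1-p3 (meant for services with automatic recovery)."""
        deltas = self.intervals()
        return deltas["p1-p2"] < deltas["p1-p3"]

# Made-up timestamps, not measurements from a real cluster.
timeline = FaultTimeline("nova-api", "controller-1", 100.0, 104.0, 130.0, 133.0)
deltas = timeline.intervals()
print(timeline.service, timeline.node, deltas, timeline.detected_before_recovery())
```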
If a service has no automatic recovery procedure, the test should start it again only after the cloud monitoring system has reported a problem related to that service.
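One possible way to enforce this ordering is to poll the monitoring system and restart the service only once the problem has been reported. This is a sketch under assumptions: problem_reported and start_service are hypothetical callables supplied by the test harness, and the timeout behaviour (raise and leave the service stopped) is a design choice not stated in the request.

```python
import time

def restart_after_report(problem_reported, start_service,
                         timeout=300.0, poll_interval=1.0):
    """Start the service only after monitoring has reported the failure.

    problem_reported: hypothetical callable, True once the alert is visible.
    start_service: hypothetical callable that starts the stopped service.
    Raises TimeoutError if no report arrives within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if problem_reported():
            start_service()
            return
        time.sleep(poll_interval)
    raise TimeoutError("monitoring did not report the failure in time")

# Stand-ins for the real monitoring query and service start command.
polls = iter([False, False, True])
events = []
restart_after_report(lambda: next(polls),
                     lambda: events.append("started"),
                     poll_interval=0.0)
print(events)
```

A timeout here indicates a monitoring failure rather than a service failure, so the caller may want to log it as a separate test result before restarting the service.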
What is the benefit?
This test will help us check:
Do all services recover as expected?
Is each service's recovery time as expected?
How long does automatic recovery take for each service?
Does the cloud monitoring system report issues in the cloud (which service on which node)?
What is the delay between the actual problem and its report (which service on which node)?
What is the delay between service recovery and its report (which service on which node)?