[Heat] Failing on creation 100+ VMs

Bug #1475274 reported by Sergey Kraynev
Affects              Status        Importance  Assigned to       Milestone
Mirantis OpenStack   Fix Released  High        Peter Razumovsky
  6.1.x              Won't Fix     High        Sergey Kraynev
  7.0.x              Fix Released  High        Denis Egorenko
  8.0.x              Fix Released  High        Peter Razumovsky

Bug Description

This issue happens when creating more than 100 VMs in one template.

oslo.messaging or HAProxy reports a timeout error:

Traceback (most recent call last):
  File "/usr/bin/heat", line 10, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/dist-packages/heatclient/shell.py", line 657, in main
    HeatShell().main(args)
  File "/usr/lib/python2.7/dist-packages/heatclient/shell.py", line 607, in main
    args.func(client, args)
  File "/usr/lib/python2.7/dist-packages/heatclient/v1/shell.py", line 114, in do_stack_create
    hc.stacks.create(**fields)
  File "/usr/lib/python2.7/dist-packages/heatclient/v1/stacks.py", line 119, in create
    data=kwargs, headers=headers)
  File "/usr/lib/python2.7/dist-packages/heatclient/common/http.py", line 254, in json_request
    resp = self._http_request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/heatclient/common/http.py", line 344, in _http_request
    raise exc.from_response(resp)
heatclient.exc.HTTPException: ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

This happens because, during the stack-create request, Heat validates the template and checks all constraints (currently it checks the same image for each Server resource):
  https://github.com/openstack/heat/blob/master/heat/engine/service.py#L708-L711

This validation takes more than 1 minute, and as a result we get the timeout error.

Note that the template does not use a resource group; it simply copy-pastes a part of the template (server + port).
A template to reproduce is attached to the bug.
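
For illustration only (the real reproducer is the template attached to the bug; the resource and parameter names below are made up), the copy-pasted pattern looks roughly like this, where every server references the same image/flavor/network parameters, so the corresponding constraints are validated once per copied pair:

    heat_template_version: 2013-05-23
    parameters:
      image:   {type: string}
      flavor:  {type: string}
      network: {type: string}
    resources:
      port_1:
        type: OS::Neutron::Port
        properties:
          network: {get_param: network}
      server_1:
        type: OS::Nova::Server
        properties:
          image: {get_param: image}
          flavor: {get_param: flavor}
          networks: [{port: {get_resource: port_1}}]
      # ...the port/server pair is repeated ~100 times as port_2/server_2,
      # port_3/server_3, and so on, each copy triggering its own image,
      # flavor and network constraint checks during stack-create validation.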

Revision history for this message
Sergey Kraynev (skraynev) wrote :

(Attached the template to reproduce the issue.)
Revision history for this message
Sergey Kraynev (skraynev) wrote :

MOS 6.1 - it affects only the Shaker tests:

- It is not possible to fix this in Heat, so the bug is marked as Won't Fix.
- The following recommendations apply for users/customers:
   * Increase the HAProxy timeout (the default is 1 minute) to more than 2 minutes, and
      increase Heat's oslo.messaging timeout (rpc_response_timeout, 1 minute by default) to more than 2 minutes.
      This approach will be used in the scale lab for the Shaker tests for now.

  * Use in-template resources instead of passing the IDs of existing resources via parameters.
     In this case validation is skipped in the pre-create phase and takes only a couple of seconds, so the timeout exception is not raised.
     In MOS 6.1 Heat has the following list of constraints: https://github.com/openstack/heat/blob/stable/juno/setup.cfg#L53-L60 , but not
     all of them are used by resources by default.

  * Use OS::Heat::ResourceGroup instead of copy-pasting part of the template; in this case validation is skipped as well (a rough sketch follows below).
     This approach still allows referencing external resources (resource IDs) via parameters.
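
As a rough sketch (the resource and parameter names are illustrative, and in practice the server + port pair would go into a small nested template referenced from resource_def), the ResourceGroup variant of the same template could look like:

    resources:
      server_group:
        type: OS::Heat::ResourceGroup
        properties:
          count: 100
          resource_def:
            type: OS::Nova::Server
            properties:
              image: {get_param: image}
              flavor: {get_param: flavor}
              networks: [{network: {get_param: network}}]

Here the pre-create validation only sees the single group definition instead of ~100 hand-copied resources, which is why it stays fast.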

Revision history for this message
Sergey Kraynev (skraynev) wrote :

MOS 7.0 - it affects the Shaker tests and also Sahara resources, when the Heat engine is chosen instead of Sahara's direct engine:

- It is not possible to fix this in Heat, and Sahara does not use resource groups yet, so the bug is marked as Won't Fix.

The recommendations are the same as for MOS 6.1.

Revision history for this message
Sergey Kraynev (skraynev) wrote :

MOS 8.0 and later:

 - It will be fixed in Liberty in Heat via caching of constraint validation requests:
   https://blueprints.launchpad.net/heat/+spec/constraint-validation-cache
   https://review.openstack.org/#/c/166810/

 - Sahara also plans to migrate to the resource-group approach instead of copy-pasting parts of the template.

description: updated
Revision history for this message
Sergey Kraynev (skraynev) wrote :

Note that for the recommendation in comment #2 we have to increase both options, rpc_response_timeout in Heat and the timeout in HAProxy, because:

heatclient calls heat-api and waits for its answer within the HAProxy timeout.

heat-api sends an RPC call to heat-engine and waits for its answer within rpc_response_timeout (which is set in the Heat configuration file).

Dina Belova (dbelova)
summary: - [Heat] Failing on creation 100+ vms in shaker test and without it.
+ [Heat] Failing on creation 100+ VMs
tags: added: heat scale
tags: added: murano sahara
Revision history for this message
Sergey Kraynev (skraynev) wrote :

Marked the bug as Confirmed for 7.0 so as not to forget to add a nomination for 8.0.

Revision history for this message
Sergey Kraynev (skraynev) wrote :

Moved it back to Won't Fix and added a personal reminder for myself (add a nomination for MOS 8.0).

Changed in mos:
status: New → Won't Fix
Revision history for this message
Dina Belova (dbelova) wrote :

Not sure it's a good idea to move this bug to 8.0. It's blocking both creation of big Sahara clusters and Murano environments.

Revision history for this message
Sergey Kraynev (skraynev) wrote :

Dina, please read the comments above; they contain a description for each MOS version, with the proposed solutions and explanations of why we should handle it this way.

Revision history for this message
Sergey Kraynev (skraynev) wrote :

There are a couple of notes about the comments and statuses:
 - This will be fixed on the deployment side for MOS 7.0:
    a) Add "timeout server 11m" to /etc/haproxy/conf.d/160-heat-api.cfg
    b) Increase the rpc_response_timeout parameter in heat.conf to 600
    (a rough sketch of both changes is shown after this comment)

A patch will be on review soon.

 - For MOS 8.0 this workaround should be removed and the new caching mechanism should be used instead:
   https://blueprints.launchpad.net/heat/+spec/constraint-validation-cache
   https://review.openstack.org/#/c/166810/
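
As referenced in items a) and b) above, a rough sketch of the two MOS 7.0 deployment-side changes (the file locations are the ones named above; the HAProxy section name may differ per deployment):

    # /etc/haproxy/conf.d/160-heat-api.cfg -- inside the heat-api listen/backend section
    timeout server 11m

    # heat.conf (typically /etc/heat/heat.conf)
    [DEFAULT]
    rpc_response_timeout = 600

HAProxy and the Heat services need to be restarted/reloaded for the new timeouts to take effect.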

Revision history for this message
Sergey Kraynev (skraynev) wrote :

Fix was committed to fuel-library for MOS 7.0:
https://review.openstack.org/#/c/212023/

We need to re-check the fix on the MOS side in MOS 8.0 (and, if it works, remove the workaround on the Fuel side).

Revision history for this message
Evgeny Sikachev (esikachev) wrote :

No verification is needed: in MOS 7.0 the direct engine is used instead of the Heat engine.

Revision history for this message
Sergey Kraynev (skraynev) wrote :

Evgeny: Thx for the reminder :)

It's true, but we want to make sure that the same fix can be implemented in MOS 9.0, when Sahara fully migrates to the Heat engine.
We also need to check it without the fix in the Puppet manifests; if it works, we should ask the Puppet team to revert that fix, because caching is the more correct way of handling this issue.

Short summary:
 - wait for the first ISO with MOS 8.0
 - check caching without the Puppet workaround
 - if it works, assign the bug to the Puppet team and ask them to remove the workaround in the manifests
 - then close this bug and do not nominate it for MOS 9.0

Revision history for this message
Peter Razumovsky (prazumovsky) wrote :

A template with 100 resources that uses parameters has been checked on DevStack (constraint caching works successfully).

Now we are waiting for the scale lab to check it on MOS 8.0 with a huge template.

Revision history for this message
Sergey Kraynev (skraynev) wrote :

We have checked the caching feature at scale with 50 nodes.
The template from https://bugs.launchpad.net/mos/+bug/1475274/comments/1 was used for testing.

Without caching, validation takes ~4 min; with caching enabled, ~40 sec.

This result shows that we have improved the existing behavior, but it does not solve the whole issue.
So, after a discussion with the Puppet team and the Sahara team, we decided not to revert the patches in Puppet and to keep the 10 min timeout for HAProxy and rpc_response_timeout for Heat. The same decision should be applied to the template size and resource limits in the Heat configs.
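
For illustration, the Heat options that implement the template size and resource limits are, to the best of my knowledge, max_template_size and max_resources_per_stack; the values below are examples only, not the ones chosen for MOS:

    [DEFAULT]
    # default is 524288 bytes; raise it for very large templates
    max_template_size = 5242880
    # default is 1000 resources per stack
    max_resources_per_stack = 10000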

Also, on the Puppet side, we plan to fix bug https://bugs.launchpad.net/fuel/+bug/1534510 (i.e. turn on caching in MOS by default).
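
A sketch of what enabling the constraint validation cache in heat.conf might look like, assuming the section/option names introduced by the upstream caching patch (they should be verified against the Heat version actually shipped in MOS):

    [cache]
    enabled = True
    backend = dogpile.cache.memory

    [constraint_validation_cache]
    caching = True
    # how long a successful constraint lookup is kept, in seconds
    expiration_time = 60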

Currently this bug and https://bugs.launchpad.net/mos/+bug/1483833 should be marked as resolved.

In the future we plan to implement the blueprint https://review.openstack.org/#/c/234240/ in Fuel, which will allow correct limits to be used for different deployments.

Also, from the Heat side, I think it makes sense to investigate refactoring the validation behavior (it requires big architectural changes, so it cannot be done soon).
