Delayed S3 query processing due to memcached token locks

Bug #1489797 reported by Dmitriy Novakovskiy
This bug affects 2 people
Affects             Status     Importance  Assigned to       Milestone
Mirantis OpenStack  Invalid    High        Boris Bobrov
6.0.x               Won't Fix  High        Alexey Stupnikov
6.1.x               Invalid    High        Alexey Stupnikov
7.0.x               Invalid    High        Boris Bobrov

Bug Description

The customer has reported an issue where queries from a single user to the S3 interface (exposed by Ceph/RadosGW) experience processing delays. Switching keystone token storage from memcached to MySQL resolves the problem.

Steps to reproduce:
1. Deploy vanilla MOS (for the customer it is 6.0 deployed by Mirantis; whether 6.1 is affected is not yet clear)
2. Specify Ceph to back all storage needs (including the S3 API endpoint)
3. Create a user that the application will use to upload data into Ceph object storage via the S3 interface (keystone ec2-credentials-create)
4. Start uploading objects (in this case, graphical images) using the credentials from step 3, with multiple parallel requests sharing the same credentials (a minimal sketch follows this list)
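A minimal sketch of step 4, assuming boto 2.x and a plain-HTTP RadosGW endpoint; the endpoint, keys, bucket name and thread counts below are placeholders, not values from this report:

    # Parallel S3 upload sketch against RadosGW (boto 2.x assumed).
    # Endpoint, keys, bucket name and counts are placeholders.
    import threading

    import boto
    import boto.s3.connection

    S3_HOST = "rgw.example.local"     # hypothetical RadosGW S3 endpoint
    ACCESS_KEY = "EC2_ACCESS_KEY"     # from `keystone ec2-credentials-create`
    SECRET_KEY = "EC2_SECRET_KEY"
    BUCKET = "upload-test"
    THREADS = 20
    OBJECTS_PER_THREAD = 50

    def connect():
        # boto 2.x connections are not thread-safe, so each thread makes its own
        return boto.connect_s3(
            aws_access_key_id=ACCESS_KEY,
            aws_secret_access_key=SECRET_KEY,
            host=S3_HOST,
            is_secure=False,  # assumes plain HTTP; drop if the endpoint uses TLS
            calling_format=boto.s3.connection.OrdinaryCallingFormat(),
        )

    def uploader(thread_id):
        bucket = connect().get_bucket(BUCKET)
        payload = b"x" * 64 * 1024    # small object, roughly a thumbnail-sized image
        for i in range(OBJECTS_PER_THREAD):
            key = bucket.new_key("img-%d-%d" % (thread_id, i))
            key.set_contents_from_string(payload)

    if __name__ == "__main__":
        connect().create_bucket(BUCKET)   # create the target bucket once up front
        threads = [threading.Thread(target=uploader, args=(n,))
                   for n in range(THREADS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

While this runs, timing keystone token-get with the same user versus a different user should show the difference described below.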

Actual result:

keystone token-get takes anywhere from 0 to 9 seconds; request processing is consequently slow and requests tend to queue up.

Expected result:

keystone token-get is expected to take less time and to be consistent (not vary from 0 to 9 seconds). For example, if the same parallel upload test is executed with multiple S3 user accounts chosen at random, processing always takes < 1 s.

Suspected root cause:

When keystone validates logins coming through the RADOS gateway, it appears to take a per-user lock while handling the token back-end store. In MOS, memcached is used for storing tokens. Changing driver=keystone.token.backends.memcache.Token to driver=keystone.token.persistence.backends.sql.Token increases performance considerably (validated on the same cloud).
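For illustration, the workaround above corresponds to the following keystone.conf change (a sketch; the option sits in the [token] section in the keystone releases of that era, verify against the deployed version):

    # /etc/keystone/keystone.conf -- token storage switched from memcached to SQL
    [token]
    # driver = keystone.token.backends.memcache.Token       (previous value)
    driver = keystone.token.persistence.backends.sql.Token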

Revision history for this message
Dmitriy Novakovskiy (dnovakovskiy) wrote :

Full story from customer

"When we use the S3 interface we use it from our application which saves a lot of small images. Naturally we set up a user that our application can use to access the S3 interface with one access and corresponding secret key (keystone ec2-credentials-create). The sending of the images (objects) is multi-threaded to overcome the natural latency of the object storage. What I have found out is that when keystone validates the log-ins through the rados gateway it seems to take some lock on user level when handling the back-end store of the tokens. In MOS it seems to be memcached that is used for storing tokens.

All keystone requests using the same user that is storing a lot of images through the S3 interface are slow when the application is sending stuff to the S3 interface. keystone token-get can take between 0 and 9 seconds when we send images to the S3 interface if you do it with the same user. If you try keystone token-get with another user (OS_USERNAME=some other user) it always takes < 1s.

I have been able to verify this in a virtual env. as well. It's very easy to see that the performance of the S3 interface goes down when issuing a lot of parallel requests to the S3 interface with the same user.

I have also tried to change the storage back end for the tokens to mysql in the virtual env. Changing driver=keystone.token.backends.memcache.Token to driver=keystone.token.persistence.backends.sql.Token seems to increase the performance quite a lot. Of course the load on mysql goes up significantly and you have to purge expired tokens from the keystone database, but all in all it works better. Would it be ok to change that in the real env. as well, or do you see any pitfalls with that?

The LDAP queries (keystone is configured with an LDAP backend) don't seem to have anything to do with the bottleneck in the S3 interface we see right now."
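For reference, expired tokens in the SQL backend are usually purged with keystone-manage token_flush; a sketch of a cron entry (the schedule and log path are arbitrary examples, not taken from this report):

    # /etc/cron.d/keystone-token-flush (example schedule: hourly)
    0 * * * * keystone /usr/bin/keystone-manage token_flush >> /var/log/keystone/token-flush.log 2>&1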

tags: added: customer-found
Changed in fuel:
assignee: nobody → MOS Maintenance (mos-maintenance)
milestone: none → 6.0-mu-6
status: New → Confirmed
Changed in fuel:
importance: Undecided → Medium
Boris Bobrov (bbobrov)
affects: fuel → mos
Changed in mos:
milestone: 6.0-mu-6 → none
no longer affects: mos/6.0-updates
Revision history for this message
Boris Bobrov (bbobrov) wrote :

To get this fixed, I need to have [token]revoke_by_id = false. This needs to be done in fuel.
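In keystone.conf terms, the setting named above is:

    # /etc/keystone/keystone.conf
    [token]
    revoke_by_id = false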

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/keystone (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Boris Bobrov <email address hidden>
Review: https://review.fuel-infra.org/10967

Revision history for this message
Boris Bobrov (bbobrov) wrote :

> multiple requests using same credentials going in parallel

How many requests happen per second in your case?

no longer affects: fuel
Revision history for this message
Boris Bobrov (bbobrov) wrote :

The bug also affects fuel because in order to fix it we need to make changes to keystone config. Please see comment #2.

Changed in fuel:
assignee: nobody → Boris Bobrov (bbobrov)
milestone: none → 7.0
Revision history for this message
Boris Bobrov (bbobrov) wrote :

Nope, I need someone from fuel folks to do it for me

Changed in fuel:
assignee: Boris Bobrov (bbobrov) → Fuel Library Team (fuel-library)
Revision history for this message
Boris Bobrov (bbobrov) wrote :

I am marking the bug report as invalid for 7.0 because it is not clear whether the bug can be reproduced there: the locks the reporter describes were heavily changed in 7.0. The original report also mentions only 6.0 and 6.1.

no longer affects: fuel
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Reassigning to the keystone team for 6.0/6.1, as the maintenance team needs a fix to start backporting.

Revision history for this message
Alexander Makarov (amakarov) wrote :

Please follow the suggestion mentioned in comment #2.
I am not sure which team owns this particular puppet manifest, but it should set [token]revoke_by_id = False in /etc/keystone/keystone.conf upon initialization.
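A sketch of how that could look, assuming the manifests use the puppet-keystone module's keystone_config provider (not confirmed in this report):

    # hypothetical fragment for the keystone manifest in fuel-library
    keystone_config {
      'token/revoke_by_id': value => false;
    }
    # the keystone service must be restarted (or notified) for the change to apply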

Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

Setting the bug's status to "Won't Fix" for MOS 6.0, since at this point we only fix security issues there and this bug is not a security issue.

Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

Also, we don't update puppet manifests in 6.0, and fixing this issue requires updating puppet manifests.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Retargeted to 6.1-updates as we need more time for investigation.

Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

I have set this bug's status to Invalid for MOS 6.1 because I have tested the S3 API performance of MOS 6.1 using the code from the attachment. The results (http://pastebin.com/9cnN6KbP) are pretty consistent and don't differ much from a simple file-by-file upload.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/keystone (openstack-ci/fuel-7.0/2015.1.0)

Change abandoned by Boris Bobrov <email address hidden> on branch: openstack-ci/fuel-7.0/2015.1.0
Review: https://review.fuel-infra.org/10967
