rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8394566 bytes) since Victoria Upgrade, ceph v nautilus

Bug #1916482 reported by How about no
52
This bug affects 11 people
Affects Status Importance Assigned to Milestone
Glance
Confirmed
Undecided
Unassigned
glance_store
In Progress
Undecided
Unassigned

Bug Description

full glance log:
https://paste.ubuntu.com/p/fVrj5vfc7m/

Feb 20 15:18:35 xxx-glance-container-071516d6 glance-wsgi-api[55]: 2021-02-20 15:18:35.577 55 DEBUG glance_store._drivers.rbd [req-663e7646-b12e-4e4c-ad18-d4b23644a504 23fd1e8e87af40b88418b118085104cf 946f0e543169462596dbaaf7504f7a4a - default default] creating image fdae789d-10ce-40f5-9c50-c8b206b81531 with order 23 and size 0 add /openstack/venvs/glance-22.0.1/lib/python3.8/site-packages/glance_store/_drivers/rbd.py:555
Feb 20 15:18:35 xxx-glance-container-071516d6 glance-wsgi-api[55]: 2021-02-20 15:18:35.578 55 WARNING glance_store._drivers.rbd [req-663e7646-b12e-4e4c-ad18-d4b23644a504 23fd1e8e87af40b88418b118085104cf 946f0e543169462596dbaaf7504f7a4a - default default] Since image size is zero we will be doing resize-before-write which will be slower than normal
Feb 20 15:18:35 xxx-glance-container-071516d6 glance-wsgi-api[55]: 2021-02-20 15:18:35.642 55 DEBUG glance_store._drivers.rbd [req-663e7646-b12e-4e4c-ad18-d4b23644a504 23fd1e8e87af40b88418b118085104cf 946f0e543169462596dbaaf7504f7a4a - default default] resizing image to 8192.0 KiB _resize_on_write /openstack/venvs/glance-22.0.1/lib/python3.8/site-packages/glance_store/_drivers/rbd.py:505
Feb 20 15:18:35 xxx-glance-container-071516d6 glance-wsgi-api[55]: 2021-02-20 15:18:35.761 55 ERROR glance_store._drivers.rbd [req-663e7646-b12e-4e4c-ad18-d4b23644a504 23fd1e8e87af40b88418b118085104cf 946f0e543169462596dbaaf7504f7a4a - default default] Failed to store image fdae789d-10ce-40f5-9c50-c8b206b81531 Store Exception RBD incomplete write (Wrote only 8388608 out of 8394566 bytes): rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8394566 bytes)
Feb 20 15:18:35 xxx-glance-container-071516d6 glance-wsgi-api[55]: 2021-02-20 15:18:35.809 55 DEBUG glance_store._drivers.rbd [req-663e7646-b12e-4e4c-ad18-d4b23644a504 23fd1e8e87af40b88418b118085104cf 946f0e543169462596dbaaf7504f7a4a - default default] Snap Operating Exception [errno 2] RBD image not found (error unprotecting snapshot b'fdae789d-10ce-40f5-9c50-c8b206b81531'@b'snap') Snapshot does not exist. _delete_image /openstack/venvs/glance-22.0.1/lib/python3.8/site-packages/glance_store/_drivers/rbd.py:464
Feb 20 15:18:35 xxx-glance-container-071516d6 glance-wsgi-api[55]: 2021-02-20 15:18:35.984 55 ERROR glance.api.v2.image_data [req-663e7646-b12e-4e4c-ad18-d4b23644a504 23fd1e8e87af40b88418b118085104cf 946f0e543169462596dbaaf7504f7a4a - default default] Failed to upload image data due to internal error: rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8394566 bytes)
Feb 20 15:18:36 xxx-glance-container-071516d6 glance-wsgi-api[55]: 2021-02-20 15:18:36.037 55 ERROR glance.common.wsgi [req-663e7646-b12e-4e4c-ad18-d4b23644a504 23fd1e8e87af40b88418b118085104cf 946f0e543169462596dbaaf7504f7a4a - default default] Caught error: RBD incomplete write (Wrote only 8388608 out of 8394566 bytes): rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8394566 bytes)

Revision history for this message
How about no (howaboutno) wrote :
summary: - Snapshots fail to be written to ceph/rbd backend since upgrade to
- Victoria release. Ceph v Nautilus
+ rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out
+ of 8394566 bytes) since Victoria Upgrade, ceph v nautilus
Revision history for this message
admin0 (shashi-eu) wrote :
Revision history for this message
admin0 (shashi-eu) wrote :

This one is on victoria + ceph octopus

Revision history for this message
admin0 (shashi-eu) wrote :
Download full text (53.3 KiB)

Mar 11 19:17:31 cloud-c1-1030989-glance-container-071516d6 uwsgi[32133]: /openstack/venvs/glance-22.1.0/lib/python3.8/site-packages/oslo_policy/policy.py:743: UserWarning: Policy "default":"role:admin" was deprecated in Ussuri in favor of "default":"". Reason: In order to allow operators to accept the default policies from code by not defining them in the policy file, while still working with old policy files that rely on the ``default`` rule for policies that are not specified in the policy file, the ``default`` rule must now be explicitly set to ``"role:admin"`` when that is the desired default for unspecified rules.. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
Mar 11 19:17:31 cloud-c1-1030989-glance-container-071516d6 uwsgi[32133]: warnings.warn(deprecated_msg)
Mar 11 19:17:31 cloud-c1-1030989-glance-container-071516d6 glance-wsgi-api[32135]: 2021-03-11 19:17:31.701 32135 ERROR glance_store._drivers.rbd [req-400c49d7-2e76-49ce-b344-08908ae1d764 2e99c29bb33b45c7a001b1e21ec5e279 9d9e18f6ee4d4bb1bf7491a9c1bcad92 - default default] Failed to store image 0a8a79d0-afe7-402f-b256-80b31d0dea0d Store Exception RBD incomplete write (Wrote only 8388608 out of 8402336 bytes): rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8402336 bytes)
Mar 11 19:17:31 cloud-c1-1030989-glance-container-071516d6 glance-wsgi-api[32135]: 2021-03-11 19:17:31.855 32135 ERROR glance.api.v2.image_data [req-400c49d7-2e76-49ce-b344-08908ae1d764 2e99c29bb33b45c7a001b1e21ec5e279 9d9e18f6ee4d4bb1bf7491a9c1bcad92 - default default] Failed to upload image data due to internal error: rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8402336 bytes)
Mar 11 19:17:31 cloud-c1-1030989-glance-container-071516d6 glance-wsgi-api[32135]: 2021-03-11 19:17:31.897 32135 ERROR glance.common.wsgi [req-400c49d7-2e76-49ce-b344-08908ae1d764 2e99c29bb33b45c7a001b1e21ec5e279 9d9e18f6ee4d4bb1bf7491a9c1bcad92 - default default] Caught error: RBD incomplete write (Wrote only 8388608 out of 8402336 bytes): rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8402336 bytes)
                                                                                   2021-03-11 19:17:31.897 32135 ERROR glance.common.wsgi Traceback (most recent call last):
                                                                                   2021-03-11 19:17:31.897 32135 ERROR glance.common.wsgi File "/openstack/venvs/glance-22.1.0/lib/python3.8/site-packages/glance/common/wsgi.py", line 1347, in __call__
                                                                                   2021-03-11 19:17:31.897 32135 ERROR glance.common.wsgi action_result = self.dispatch(self.controller, action,
                                                                                   2021-03-11 19:17:31.897 32135 ERROR glance.common.wsgi File "/openstack/venvs/glance-22.1.0/lib/python3.8/site-packages/glance/common/wsgi.py", line 1391, in dispatch
                                                                                   2021-03-11 19:17:31.897 ...

Revision history for this message
Erik (est-it) wrote :

I had the same problem as you and made some "research".
I think it's not a problem between glance and ceph, but between the uploading client and glance.

Do you have a reverse proxy in front of the uwsgi processes that are used by glance? I mean something like apache or haproxy?

For apache, there is a note in the glance-api-uwsgi.ini on my debian system with victoria backport. It says, glance is not compatible for running behind apache because of two options for http chunked upload. I would say, this comment literally applies to every http reverse proxy
I use HAProxy in front of glance-api to make it HA compatible and transparently switch glance instances etc.

I tried to upload a image direct to an uwsgi network socket, which works fine. With HAProxy as L7 reverse proxy, i got the same errors as you.

TL;DR: When using HAProxy use "mode tcp".

I made TCPDumps on both sides:

Client---HAProxy and HAProxy---Glance

The data stream (that is the image that we are trying to upload) is send to the proxy over http. The proxy sends that data to glance over it's own http connection.
My glance+ceph lab cannot write back the data stream as fast as new data is arriving over network (that's okay, because it's a lab). The consequence is, that our TCP window fills up, until it's full. The HAProxy have to wait with sending new data until the buffer from the glance server is freed. The uploading client must wait too, of course. And because we have two independent HTTP connections, this wait/send/wait... procedure ist somewhat asynchron.

This confuses our client: When the glance server accepts new data (sends an TCP ACK), our client didn't answer with new data but he sends TCP keepalives. TCP keepalives are not the segments, that our glance server wants in a chunked data-stream.
Finally the glance server sends a http error 500 (internal server error).
Our data stream wasn't complete, so we get the error message, you've posted above.

I tried some HAProxy options, the only way I got it working, was to set the frontend/backend to TCP mode ("mode tcp"). That don't proxies HTTP (L7), it redirects a complete TCP stream to a glance server. That works so far.

Revision history for this message
Alexander Binzxxxxxx (devil000000) wrote :

I can confirm the issue when using haproxy with default config from openstack ansible.
"mode tcp" did fix it for me.
I think this issue may be one of two things:
* within haproxy or its configuration
* on glance side - miscalculating size when more/different headers are sent.

Revision history for this message
Alexander Binzxxxxxx (devil000000) wrote :

Let me add some more info:
I did see this on xena but I guess this is a issue with all versions. Storage backend used is ceph.
The first bytes value in the error is the object size used (e.g. 8388608, 4194304, 2097152 or 1048576) and the second is somehow bigger. So it may actually fail on the first or second object write to ceph.
I don't think this has anything to do with writing speeds as suggested by erik.

Revision history for this message
Bjoern (bjoern-t) wrote :

I do not experience this issue when Using Openstack V and Ceph O, we have been using l7 mode for years now but we also separate the storage and management traffic into different network bridges with different MTU sizes. I'm don't think that a TCP keepalive packet will mess with HTTP chunking these are on different OSI levels in general, I would much rather believe you could run into a timeout in which the next chunk is expected or premature port closures. Apart from other feature such as HTTP keepalive and pipelining.
You are using glance 2.1 API with show_image_direct_url set to true I assume

Revision history for this message
Alexander Binzxxxxxx (devil000000) wrote :

This happend on iso image uploads via horizon using legacy methond and also with volume to image upload. for used software versions see xena. I do still not think this has anything to do with speeds.

Revision history for this message
Robert Varjasi (robert.varjasi) wrote :

This BUG introduced here in glance_store RBD driver: https://opendev.org/openstack/glance_store/commit/c43f19e8456b9e20f03709773fb2ffdb94807a0a

For the first resize the "resize_amount" will be using the configured store_chunk_size (in my case its rbd_store_chunk_size from glance-api.conf), and only start doubling it after the first resize.

This BUG remains in hide until you start using some load-balancer/proxy where the client buffer size can be lower(for haproxy: bufsize = 16384) which can cause unaligned reads (https://github.com/openstack/glance/blob/master/glance/common/wsgi.py#L1028) and the response length can be bigger than the store_chunk_size for the first time, so at the end the RBD write will fail because it wants to write more data than the actual RBD image size after the first resize.

Revision history for this message
Robert Varjasi (robert.varjasi) wrote (last edit ):

Chunk octets sent by the client depends on client buffer size. Its limited when using a proxy, like haproxy.

HQ-A root@haproxy1-hq-a:~# haproxy -vv|grep buf
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Revision history for this message
Robert Varjasi (robert.varjasi) wrote (last edit ):

My working fix is to double the resize_amount for the first time. You can do it by:

* self.resize_amount = self.WRITE_CHUNKSIZE * 2

OR

* do self.resize_amount = min(self.resize_amount * 2, 8 * units.Gi) first and then do the actual resize

Revision history for this message
Robert Varjasi (robert.varjasi) wrote :
Changed in glance-store:
status: New → Confirmed
Changed in glance:
status: New → Confirmed
Revision history for this message
Nilesh (cnilesh) wrote :

I soled it using:

```
root@firoz-osa-infra-0:~# lxc-attach infra1_glance_container-e4081cf4
```

FROM:
```
root@infra1-glance-container-e4081cf4:~# grep -i "chunk" /etc/glance/glance-api.conf
rbd_store_chunk_size = 8
root@infra1-glance-container-e4081cf4:~#
```
TO
```
root@infra1-glance-container-e4081cf4:~# grep -i "chunk" /etc/glance/glance-api.conf
rbd_store_chunk_size = 16
root@infra1-glance-container-e4081cf4:~#
```

```
systemctl restart glance-api.service
```

```
root@infra1-utility-container-13e6957f:~# openstack image list
+--------------------------------------+--------------+--------+
| ID | Name | Status |
+--------------------------------------+--------------+--------+
| 725adddb-e538-4161-8855-aee00da2e5c8 | cirros_0.3.4 | active |
+--------------------------------------+--------------+--------+
root@infra1-utility-container-13e6957f:~#
```

Revision history for this message
Robert Varjasi (robert.varjasi) wrote :

@cnilesh It was just luck I think. I cannot understand how can it be a real solution or a workaround.

Revision history for this message
Nilesh (cnilesh) wrote :

Hello All,

Again I faced the same issue, and this time the above trick did not work for me, So changed in haproxy.

from:
```
HAProxy use "mode http".
```

to:
```
HAProxy use "mode tcp".
```

restarted haproxy,

Yes worked.

Revision history for this message
Kohei Nakahara (nakkoh) wrote :

@robert.varjasi

Will the following fix add upstream?
https://bugs.launchpad.net/glance/+bug/1916482/comments/13

Revision history for this message
Cyril Roelandt (cyril-roelandt) wrote :

Any fix you'd like to see included upstream should be sent for review on gerrit.

Revision history for this message
Kohei Nakahara (nakkoh) wrote :

Thank you. I understood.

Revision history for this message
Robert Varjasi (robert.varjasi) wrote (last edit ):

Yes I really like to see my patch in upstream but I am afraid the onboarding process would take too much time. Feel free to pick my patch and open a PR if anyone here have the time and already onboard.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance_store (master)
Changed in glance-store:
status: Confirmed → In Progress
Revision history for this message
Andrew Bogott (andrewbogott) wrote :

This issue affects me as well (running glance Zed with haproxy). Robert's fix works for me so I've submitted it, here:

https://review.opendev.org/c/openstack/glance_store/+/885581

Sorry for the stolen valor, Robert!

Revision history for this message
Robert Varjasi (robert.varjasi) wrote :

Thank you Andrew!

Revision history for this message
James Black (jblackosa) wrote (last edit ):

The fix works for me too, and Antelope appears to have the fix in it, though the order of the code is not what is listed in https://review.opendev.org/c/openstack/glance_store/+/885581 and causes the fix not to work.

By default install it is showing up as:
    def _resize_on_write(self, image, image_size, bytes_written, chunk_length):
        """Handle the rbd resize when needed."""
        if image_size != 0 or self.size >= bytes_written + chunk_length:
            return self.size
        new_size = self.size + self.resize_amount
        LOG.debug("resizing image to %s KiB" % (new_size / units.Ki))
        image.resize(new_size)
        # Note(jokke): We double how much we grow the image each time
        # up to 8gigs to avoid resizing for each write on bigger images
        self.resize_amount = min(self.resize_amount * 2, 8 * units.Gi)
        return new_size

Where it should be:

def _resize_on_write(self, image, image_size, bytes_written, chunk_length):
    """Handle the rbd resize when needed."""
    if image_size != 0 or self.size >= bytes_written + chunk_length:
        return self.size
    self.resize_amount = min(self.resize_amount * 2, 8 * units.Gi)
    # Note(jokke): We double how much we grow the image each time
    # up to 8gigs to avoid resizing for each write on bigger images
    new_size = self.size + self.resize_amount
    LOG.debug("resizing image to %s KiB" % (new_size / units.Ki))
    image.resize(new_size)
    return new_size

The order appears to be wrong, self.resize_amount is in the wrong position during install, but in the right oder in https://review.opendev.org/c/openstack/glance_store/+/885581

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.