occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

Bug #2003851 reported by Walter
62
This bug affects 11 people
Affects Status Importance Assigned to Milestone
apt (Ubuntu)
Confirmed
Undecided
Unassigned

Bug Description

Hi!

Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms.

We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

Example:
```
# apt-get update
Reading package lists... Done
E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get)
N: Be aware that removing the lock file is not a solution and may break your system.
E: Unable to lock directory /var/lib/apt/lists/
```

Task listing:
```
root 153929 \_ /usr/sbin/CRON -f -P
root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron
root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
root 154026 \_ apt-get update
_apt 154029 \_ /usr/lib/apt/methods/http
_apt 154030 \_ /usr/lib/apt/methods/http
_apt 154031 \_ /usr/lib/apt/methods/http
_apt 154033 \_ /usr/lib/apt/methods/gpgv
```
Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
```
# netstat -apn | grep -E '154026|154029|154030|154031|154033'
tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http
tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http
tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http
```
All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL).

The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=500000000}, NULL).

The http sockets in the children were at fd=3.

Parent lsof:
```
# lsof -p 154026 +E
...
apt-get 154026 root 4uW REG 8,1 0 262281 /var/lib/apt/lists/lock
apt-get 154026 root 5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w
apt-get 154026 root 6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w
apt-get 154026 root 7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w
apt-get 154026 root 8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r
apt-get 154026 root 9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w
apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r
apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r
apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r
http 154029 _apt 0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w
http 154029 _apt 1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r
http 154030 _apt 0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w
http 154030 _apt 1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r
http 154031 _apt 0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w
http 154031 _apt 1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r
gpgv 154033 _apt 0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w
gpgv 154033 _apt 1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r
```
So:
- apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9)
- http and gpgv are waiting for any data written by their parent (at their respective fd 0)

Parent backtrace:
```
#0 0x00007f420116a74d in select ()
   from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f420153fb5d in pkgAcquire::Run(int) ()
   from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#2 0x00007f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
   from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#3 0x00007f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) ()
   from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#4 0x00007f42016d127b in DoUpdate (CmdL=...)
    at ./apt-private/private-update.cc:73
#5 0x00007f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) ()
   from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#6 0x00007f420169fa97 in DispatchCommandLine (CmdL=...,
    Cmds=std::vector of length 27, capacity 32 = {...})
    at ./apt-private/private-cmndline.cc:588
#7 0x0000561fc06bafbd in main (argc=2, argv=0x7ffc0e14aa38)
    at ./cmdline/apt-get.cc:447
```
Child backtrace:
```
#0 0x00007f58b1c9b74d in select ()
   from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f58b237cb43 in WaitFd(int, bool, unsigned long) ()
   from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#2 0x00005643e37bc7b3 in BaseHttpMethod::Loop (
    this=0x7ffdeaed56c0) at ./methods/basehttp.cc:611
#3 main (argv=<optimized out>) at ./methods/http.cc:1052
```
Where this is:
```
(gdb) print *this
$1 = {<aptAuthConfMethod> = {<aptMethod> = {<pkgAcqMethod> = {<No data fields>}, <aptConfigWrapperForMethods> = {
        methodNames = std::vector of length 2, capacity 2 = {
          "http", "http::XXXXXX.nl"}}, Binary = "http",
      SeccompFlags = 6},
    authconfs = std::vector of length 1, capacity 1 = {
      std::unique_ptr<FileFd> = {get() = 0x5643e5343100}}},
  Server = std::unique_ptr<ServerState> = {
    get() = 0x5643e534fba0}, NextURI = "",
  AllowRedirect = true, Debug = false, PipelineDepth = 10,
  static FailFile = {static npos = 18446744073709551615,
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
      _M_p = 0x5643e534d650 "/var/lib/apt/lists/partial/apt.osso.nl_ubuntu-security_dists_jammy-security_InRelease"},
    _M_string_length = 85, {
      _M_local_buf = "\226", '\000' <repeats 14 times>,
      _M_allocated_capacity = 150}}, static FailFd = -1,
  static FailTime = 1674071760}
```

Relevant code in parent:
```
pkgAcquire::RunResult pkgAcquire::Run(int PulseInterval)
{
...
   // Run till all things have been acquired
   struct timeval tv = SteadyDurationToTimeVal(std::chrono::microseconds(PulseInterval));
   while (ToFetch > 0)
...
      int Res;
      do
      {
         Res = select(Highest+1,&RFds,&WFds,0,&tv);
      }
      while (Res < 0 && errno == EINTR);
```
Relevant code in child:
```
int BaseHttpMethod::Loop()
{
   signal(SIGTERM,SigTerm);
   signal(SIGINT,SigTerm);

   Server = 0;

   int FailCounter = 0;
   while (1)
   {
      // We have no commands, wait for some to arrive
      if (Queue == 0)
      {
         if (WaitFd(STDIN_FILENO) == false)
            return 0;
```

Versions:
```
Ubuntu 22.04.1 LTS (Jammy Jellyfish)
apt 2.4.8
```

Preliminiary conclusion/notes:

- I don't know why it stalls, but it does. And this was not a single incident. All the stalling apt processes had stopped at a different time/day.

- Apparently there is a situation possible where there are no commands queued from the parent, and the parent isn't planning on sending any either.

- Our apt-get is invoked by a cron job which calls either `update-notifier-common` or `apt-get update`: https://github.com/ossobv/zabbix-agent-osso/blob/c61aee6087c0d03e66d03013c61acd9f65d0eaab/scripts/dpkg.updates#L14-L31 -- on the machines where apt-get was hanging, there was no `update-notifier-common`.

- I don't know if this is 2.4.x related (on Focal we have apt 2.0.x), but it sure looks like it. I did check a few other Focal machines that lacked `update-notifier-common`, but they were not having a stalled apt-get (although _not_ seeing a hung process proves nothing)

Let me know if there's anything I can get you. Should this be filed elsewhere? (Debian Salsa?)

Cheers,
Walter Doekes
OSSO B.V.

Tags: jammy
Revision history for this message
Julian Andres Klode (juliank) wrote :

Patches are welcome but unfortunately there isn't much we can do otherwise, this highly depends on network specifics like latency and mtus and is never reliably reproducible.

Revision history for this message
Walter (wdoekes) wrote :

Thanks for the reply. Julian.

Let's assume that the problem is indeed latency/dropped packets/whatever on our side. IMO, an (occasionally) broken network should not cause apt-get to hang indefinitely. Do you think it should?

Also, it doesn't address that the behaviour seems recent. We have not observed anything like this with Focal or Bionic.

In the mean time, I've added notification code so at least we can we track when this occurs. I'll fill you in if I get more details (like versions that are (not) affected).

----

Additional debug info:

I called recv(3..) on the http socket with a recv-q. It returns 0 (no error, EOF).
```
(gdb) call (ssize_t)recv(3, "abcdef", 1, 0)
$1 = 0
```
```
# netstat -apn | grep -E '154026|154029|154030|154031|154033'
tcp 1 0 10.91.52.91:60868 217.21.205.139:80 CLOSE_WAIT 154030/http
tcp 1 0 10.91.52.91:40756 178.128.6.101:80 CLOSE_WAIT 154029/http
tcp 0 0 10.91.52.91:56818 185.37.124.14:80 CLOSE_WAIT 154031/http
```

Next, I called close(0) on the stdin socket on one of the processes, and this awoke the whole list of tasks:
```
(gdb) call (int)close(0)
$1 = 0
```
This yielded the following error message:
```
E: Method http has died unexpectedly!
```
And the apt-get process got unstuck / was reaped succesfully.

Revision history for this message
Walter (wdoekes) wrote :

Checking in.

I've had an `apt-get update` running in a while loop since Feb 6 on a machine: no luck getting it to stall.

Halfway, I added a `iptables -A INPUT -m statistic --mode random --probability 0.05 -j DROP` but this did not affect anything.

But, in the mean time, on production I have had two more of these since filing this bug report...

Revision history for this message
Walter (wdoekes) wrote :
Download full text (7.2 KiB)

Okay. I got some pcaps now. And apparently our apt-cacher-ng proxy sometimes serves:

- an InRelease
- with a 200 OK
- and a non-zero Content-Length
- BUT, no content, but instead a TCP-FIN.

Under the right circumstances, this causes apt to stall forever.

The right circumstances seem to include:

- a valid download (or 304) from the same host
- a valid download from another host

---

I created a test server, and got this to reproduce with apt 2.4.9 (from jammy-proposed):

=====================================
sources.list
=====================================
deb http://junk.devs.nu:3851/ubuntu jammy main
deb http://junk.devs.nu:3851/ubuntu jammy-updates main
deb [arch=amd64 signed-by=/etc/apt/keyrings/zabbix.gpg] http://repo.zabbix.com/zabbix/6.2/ubuntu jammy main

=====================================
server.py logs
=====================================
13:31:58 [2771009] client connected: ('217.21.199.56', 42838)
13:31:58 [2771009] client said: b'GET /ubuntu/dists/jammy/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:31:58 [2771009] we say: 200 and serve local InRelease
13:31:58 [2771009] client said: b'GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:31:58 [2771009] we say: 200 but serve a FIN
13:31:58 [2771009] client said: b''
13:31:58 [2771009] client disconnected: ('217.21.199.56', 42838)

=====================================
apt-get update -o Debug::Acquire::http=1
=====================================
0% [Connecting to junk.devs.nu] [Connecting to repo.zabbix.com (2604:a880:2:d0::2062:d001)]GET /zabbix/6.2/ubuntu/dists/jammy/InRelease HTTP/1.1
Host: repo.zabbix.com
Cache-Control: max-age=0
Accept: text/*
If-Modified-Since: Wed, 01 Feb 2023 15:02:57 GMT
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive

Answer for: http://repo.zabbix.com/zabbix/6.2/ubuntu/dists/jammy/InRelease
HTTP/1.1 304 Not Modified
Server: nginx/1.14.2
Date: Tue, 21 Feb 2023 12:31:54 GMT
Last-Modified: Wed, 01 Feb 2023 15:02:57 GMT
Connection: keep-alive
ETag: "63da7f21-135e"

Hit:1 http://repo.zabbix.com/zabbix/6.2/ubuntu jammy InRelease
0% [Connecting to junk.devs.nu]GET /ubuntu/dists/jammy/InRelease HTTP/1.1
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive

Answer for: http://junk.devs.nu:3851/ubuntu/dists/jammy/InRelease
HTTP/1.1 200 OK
date: Fri, 17 Feb 2023 15:11:49 GMT
content-type: octet/stream
content-length: 270087
last-modified: Fri, 17 Feb 2023 13:29:00 GMT
x-original-source: http://de.archive.ubuntu.com/ubuntu/dists/jammy/InRelease

Get:2 http://junk.devs.nu:3851/ubuntu jammy InRelease [270 kB]
0% [2 InRelease 0 B/270 kB 0%]GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive

Answer for: http://junk.devs.nu:3851/ubuntu/dists/jammy-updates/InRelease
...

Read more...

Revision history for this message
Walter (wdoekes) wrote :

Got some more pcaps. I can also reproduce by sending a 503 after a 200, without aborting during/before body transfer (*).

============================
server.py logs
============================
13:29:55 [3666956] client said: b'GET /ubuntu/dists/jammy/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:29:55 [3666956] we say: 200 and serve local InRelease
13:29:55 [3666956] client said: b'GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:29:55 [3666956] we say: 503

============================
503 request/response (valid length, no early FIN)
============================
GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive

HTTP/1.1 503 OK
date: Fri, 17 Feb 2023 15:11:49 GMT
content-type: text/plain
content-length: 19

something is broken

============================

I present to you:

- https://junk.devs.nu/2023/lp2003851/not-reproduced-port-34634.txt
- https://junk.devs.nu/2023/lp2003851/not-reproduced-port-34634.pcap
- https://junk.devs.nu/2023/lp2003851/reproduced-port-58014.txt
- https://junk.devs.nu/2023/lp2003851/reproduced-port-58014.pcap

In the not-reproduced case, the error is acknowledged and the download is retried. Then it fails quickly.

In the reproduced case, the error is not acknowledged, and the result is a hanging apt-get. Logs ending with:

Hit:3 http://repo.zabbix.com/zabbix/6.2/ubuntu jammy InRelease
0% [Working]

============================
(*) observed in the wild with these req/resp:
============================
GET /ubuntu/dists/jammy-backports/InRelease HTTP/1.1
Host: apt.osso.nl
Cache-Control: max-age=0
Accept: text/*
If-Modified-Since: Tue, 21 Feb 2023 19:32:00 GMT
User-Agent: Debian APT-HTTP/1.3 (2.4.8) non-interactive

HTTP/1.1 200 OK
date: Wed, 22 Feb 2023 02:36:08 GMT
content-type: octet/stream
content-length: 106807
last-modified: Wed, 22 Feb 2023 01:49:00 GMT
x-original-source: http://nl.archive.ubuntu.com/ubuntu/dists/jammy-backports/InRelease
(+ valid body)

GET /ubuntu-security/dists/jammy-security/InRelease HTTP/1.1
Host: apt.osso.nl
Cache-Control: max-age=0
Accept: text/*
If-Modified-Since: Tue, 21 Feb 2023 23:11:00 GMT
User-Agent: Debian APT-HTTP/1.3 (2.4.8) non-interactive

HTTP/1.1 503 Connection timeout
date: Wed, 22 Feb 2023 02:36:08 GMT
content-type: text/html
content-length: 486
(+ valid body)

============================
extra contemplations:
============================
We've only recently added the /ubuntu-security/ (extra path, on the same server). Maybe that is the reason why this is biting us right now.

Revision history for this message
Walter (wdoekes) wrote :
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in apt (Ubuntu):
status: New → Confirmed
Ken Sharp (kennybobs)
tags: added: jammy
Revision history for this message
Walter (wdoekes) wrote :

Hi Julian,

> [There] isn't much we can do otherwise, this highly depends on network specifics like latency and mtus and is never reliably reproducible.

So, with my bogus python PoC server I mentioned in the comments, it is quite easily reproducible.

If I can assist you in getting it up and running, do let me know.

Cheers,
Walter

Revision history for this message
Alexey Gavrilov (le9i0nx) wrote (last edit ):

I suggest adding a parameter or setting Timeout=120 in default settings.

apt -o Acquire::http::Timeout=120 update

https://askubuntu.com/questions/141513/how-to-lower-wait-time-for-repository-updates
added Acquire::http::Timeout "120"; in file apt.conf or /etc/apt/apt.conf.d/99timeout

Revision history for this message
Jay D.R. (vtsj) wrote :

FYI, ran into this same issue with Debian bookworm's apt-get (2.6.1) hanging indefinitely talking to an apt-cacher-ng 3.7.x instance, easily reproducible by having ansible run an `apt-get update` on ~20-25 machines simultaneously. Problem disappears when switching back to apt-cacher-ng 3.6.x, so +1 for the apt-cacher-ng bug linked above.

Still, the problem persists even when invoking apt-get with `-o Acquire::http::Timeout=120`, so presumably there must still be something bugged on the side of apt itself not properly adhering to the configured timeout.

Revision history for this message
anarcat (anarcat) wrote :

We're seeing a similar issue here. At first we thought it was an issue specific to a prometheus collector (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028212 / https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/179) but now that I see this bug report, I can't help but think this is an issue in apt itself.

I should also mention this seems like a regression between bullseye and bookworm. For context, we have a nightly job that runs apt-update here as well, a home-grown (`dsa-update-apt-status`, from DSA) script that spews warnings through cron when there's an issue. In my mailbox where I track those, I do have instances of this before we started the bookworm upgrade, but those were rare. Starting from the beginning of our bookworm deployment though, we are seeing more and more of this as we upgrade machines over. We're now seeing daily warnings, as the `dsa-update-apt-status` runs into lock contention with another job (`apt_info.py` from the above collector, runs every 15m) more frequently. We were getting daily warnings from the fleet, all bookworm machines, with locks sometimes being held for hours.

Our current workaround has been to set a time limit to the `apt_info.py` job, but we're *still* seeing errors, which is interesting in itself, as it means the issue is *not* specific to that script: it's a global apt issue. We've had unattended-upgrades.py hanging forever as well now, which we've never seen before.

So I think this is an apt issue. Perhaps Acquire::http::Timeout=120 is a valid workaround, but I can't help but think this is an issue that was specifically introduced between bullseye and bookworm (2.2 vs 2.6).

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.