ESM archive getting DoSed with legitimate traffic every day at 06:25 (cron.daily time)

Bug #2034656 reported by Haw Loeung
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
cloud-images
New
Undecided
Unassigned
apt (Ubuntu)
New
Undecided
Unassigned
ubuntu-advantage-tools (Ubuntu)
Invalid
Undecided
Unassigned

Bug Description

Hi,

We're seeing frequent alerts on the Ubuntu ESM archive servers due to surges in requests. On two systems, I'm seeing this:

| Sep 6 05:47:16 machine-2 systemd[1]: Starting Update the local ESM caches...
| Sep 6 05:47:17 machine-2 systemd[1]: Finished Update the local ESM caches.
| Sep 6 10:49:35 machine-2 systemd[1]: Starting Update the local ESM caches...
| Sep 6 10:49:35 machine-2 systemd[1]: Finished Update the local ESM caches.
| Sep 6 17:17:16 machine-2 systemd[1]: Starting Update the local ESM caches...
| Sep 6 17:17:17 machine-2 systemd[1]: Finished Update the local ESM caches.
| Sep 6 23:47:16 machine-2 systemd[1]: Starting Update the local ESM caches...
| Sep 6 23:47:17 machine-2 systemd[1]: Finished Update the local ESM caches.
| Sep 7 01:55:02 machine-2 systemd[1]: Starting Update the local ESM caches...
| Sep 7 01:55:02 machine-2 systemd[1]: Finished Update the local ESM caches.

On another:

| Sep 6 02:41:02 is-bastion-ps5 systemd[1]: Starting Update the local ESM caches...
| Sep 6 02:41:03 is-bastion-ps5 systemd[1]: Finished Update the local ESM caches.
| Sep 6 09:02:40 is-bastion-ps5 systemd[1]: Starting Update the local ESM caches...
| Sep 6 09:02:41 is-bastion-ps5 systemd[1]: Finished Update the local ESM caches.
| Sep 6 15:32:40 is-bastion-ps5 systemd[1]: Starting Update the local ESM caches...
| Sep 6 15:32:41 is-bastion-ps5 systemd[1]: Finished Update the local ESM caches.
| Sep 6 22:02:40 is-bastion-ps5 systemd[1]: Starting Update the local ESM caches...
| Sep 6 22:02:41 is-bastion-ps5 systemd[1]: Finished Update the local ESM caches.
| Sep 7 04:32:40 is-bastion-ps5 systemd[1]: Starting Update the local ESM caches...
| Sep 7 04:32:42 is-bastion-ps5 systemd[1]: Finished Update the local ESM caches.

This is all from `/usr/lib/systemd/system/esm-cache.service` which calls `/usr/lib/ubuntu-advantage/esm_cache.py`.

Can we please have this run less frequent? Perhaps only once daily which aligns with APT and apt-daily-upgrade.service / unattended-upgrades?

Perhaps check existence of a file and run if not, then age of that same file and only run if it's older than a day?

I think, from what I can see, this may be triggered from /lib/systemd/system/ua-timer.timer and /etc/apt/apt.conf.d/20apt-esm-hook.conf?

See also LP:1554848 which was for APT.

On Trusty and Xenial clients we only seem to update daily, but the problem is worse as it's a cron.daily job, so all clients fire simultaneously - could we get this changed to a cron.d job with a randomised firing time instead?

Haw Loeung (hloeung)
description: updated
Haw Loeung (hloeung)
description: updated
Barry Price (barryprice)
description: updated
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu-advantage-tools (Ubuntu):
status: New → Confirmed
Revision history for this message
Haw Loeung (hloeung) wrote :

From what I've been told, during the large cron.daily spikes, it's mostly Trusty clients. This is due to Trusty having /etc/cron.daily/apt and ubuntu-advantage-tools ships out an apt hook - /etc/apt/apt.conf.d/20apt-esm-hook.conf.

Any chance we could randomise that to ensure not all clients hit us all at once?

Revision history for this message
Grant Orndorff (orndorffgrant) wrote :

Thank you for reporting this and sorry for the delay - for some reason this bug hasn't shown up in the team's daily summary email :( (I'll look into why)

There are a couple cases to consider here:
1. Xenial onward all have v28.1 in -updates
2. Trusty is back on v19.6 in -updates
3. Cloud Pro Trusty instances I think got v26.3 but I don't seem to be able to launch a new one to verify

Xenial onward with v28.1 have the esm-cache.service but it is only supposed to run when `apt update` runs. Is there something on these systems that is triggering an apt update every few hours? On these instances, a `pro collect-logs` will generate a tarball of logs named ua_logs.tar.gz - the contents of that may be helpful in figuring out what is causing this.

On Trusty v19.6 as well as 26.3 there is no esm-cache.service and esm sources are configured as normal apt sources (but pinned to 'never'), these should also only get updated on an `apt update`.

So I'm pretty sure that in all cases the esm cache is only updated when `apt update` runs. My initial thought is that any staggering or randomization should be implemented at the level that calls apt update, not at the level of the apt hook.

Or if esm is updating more frequently than on `apt update`, then that is likely a bug in pro-client and the logs from `pro collect-logs` would be helpful there.

Revision history for this message
Haw Loeung (hloeung) wrote :

Thanks for following up and commenting.

Currently, the largest request spike seems to be the 0600 - 0700 window. On Xenial at least, while it spreads downloads over a larger window (throughout the day), they are applied within that window, IIUC. That would be in `/lib/systemd/system/apt-daily-upgrade.timer` which then starts `apt-daily-upgrade.service`. An example on a Xenial host we have available:

| [hloeung@nernst ~]$ systemctl status apt-daily-upgrade.service
| ● apt-daily-upgrade.service - Daily apt upgrade and clean activities
| Loaded: loaded (/lib/systemd/system/apt-daily-upgrade.service; static; vendor preset: enabled)
| Active: inactive (dead) since Tue 2023-09-19 06:27:06 UTC; 3h 30min ago
| Docs: man:apt(8)
| Process: 14530 ExecStart=/usr/lib/apt/apt.systemd.daily install (code=exited, status=0/SUCCESS)
| Main PID: 14530 (code=exited, status=0/SUCCESS)
|
| Sep 19 06:26:57 nernst systemd[1]: Starting Daily apt upgrade and clean activities...
| Sep 19 06:27:05 nernst apt.systemd.daily[14530]: An error occurred: 503 Service Unavailable
| Sep 19 06:27:05 nernst apt.systemd.daily[14530]: The URI https://esm.ubuntu.com/infra/ubuntu/pool/main/b/binutils/binutils_2.26.1-1ubuntu1~16.04.8+esm7_amd64.deb failed to download, aborting
| Sep 19 06:27:06 nernst systemd[1]: Started Daily apt upgrade and clean activities.

Yeah, the 503 is due to the surge / request spike causing the service to not cope with the load, hence this bug.

I've added more metrics to our grafana dashboard to see in particular what release and pocket so we can shift our focus there. Will report back.

Revision history for this message
Junien Fridrick (axino) wrote :

Grant, thanks for your reply. I'll add one datapoint here, which I would say is our biggest concern right now : a lot of clients still run their daily APT activities (I presume through unattended-upgrades) at cron.daily time, ie 06:25.

This causes a massive storm of traffic on the ESM servers every day at that time, and depending on whether there are actual package upgrades available or not, can cause the servers to be overloaded.

This has been the case for the last 3 days, and looking at the logs on one server during the cron.daily spike we can see that the vast majority of the traffic is client downloading linux-azure-5.4 version 5.4.0-1116.123~18.04.1, from bionic-infra-security.

Now my understanding was that we had worked around this problem in general for APT (i.e. for archive.ubuntu.com and security.ubuntu.com) in bionic by having two different timers for APT activities : one that essentially does "apt update + apt upgrade --download-only" (the tasks that interact with the archive servers), spread over 12h - that's apt-daily.timer. And another timer that does "apt upgrade" at around 06:25 - that's apt-daily-upgrade.timer

Given the bionic-infra-security spike mentioned above, it's clear that something is not respecting these timers and doing "apt update + apt upgrade" at cron.daily time. Would you have any idea what it could be ?

Thanks

Revision history for this message
Grant Orndorff (orndorffgrant) wrote :

pro-client doesn't run apt upgrade in any scenario or configure the system to do so.

Do we have control over any of the offending machines? I'd still like to see a `pro collect-logs` from one of them to be sure nothing unexpected is happening pro-client related.

That the major offending package is azure-specific may be a clue that it could be something happening only on Azure. We can talk to the cloud team about it. We would only see the spike on esm.ubuntu.com because clouds use mirrors for the main archive. Adding cloud-images here which I think will notify them.

I went ahead and checked default apt configuration on a couple machines and found the following:
on Azure Bionic Pro:
$ apt-config dump | grep Periodic
APT::Periodic "";
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Download-Upgradeable-Packages "0";
APT::Periodic::AutocleanInterval "0";
APT::Periodic::Unattended-Upgrade "1";

on GCP Bionic Pro:
$ apt-config dump | grep Periodic
APT::Periodic "";
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Download-Upgradeable-Packages "0";
APT::Periodic::AutocleanInterval "1";
APT::Periodic::Unattended-Upgrade "1";

So at least in those two places, "Download-Upgradeable-Packages" is disabled ("0"), which I believe is what corresponds to the apt upgrade --download-only in apt.daily, but I'm out of my depth here. Just to be sure though, I'm also adding this bug to apt in case this is a problem with default configuration there.

Changed in ubuntu-advantage-tools (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Julian Andres Klode (juliank) wrote :

Yes so esm archive is contacted whenever main archive is, as the service is started by a hook to apt update to run an apt update on a subcache.

And APT daily only runs once on a given calendar day, even if you configure lower intervals.

So this seems to me like misconfiguration of those systems and not a pro issue. Adding checks to pro would be dangerous, it's easy to get this wrong in the current architecture, it may then only run every 2 days if done the wrong way (you can't just check if you last checked 24 hours ago, because if the daily run runs 22 hours due to randomisation you wouldn't run, hence why apt normalizes the time stamps to midnight so it sees if the date is different).

And really we want you to get the up-to-date information for your ESM too if you run apt update (also when ESM is actually enrolled, it runs in the main update anyhow).

Azure has its own update management solution afaik, which will do its own update checks and update management, it's possible it checks multiple times per day and should be relying on the randomised once a day apt-daily instead.

Revision history for this message
Junien Fridrick (axino) wrote :

Thanks both for your updates, I'll try to see if someone at Azure can help.

Changed in ubuntu-advantage-tools (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
Junien Fridrick (axino) wrote :

Also @orndorffgrant "Do we have control over any of the offending machines?" sadly no...

Revision history for this message
Junien Fridrick (axino) wrote :

@juliank : would you know how a trusty machine does its daily apt activities (including ESM) ? Are the downloads spread out during the day, or is everything done in cron.daily ?

Revision history for this message
Éric St-Jean (esj) wrote :

i do think somehow randomizing the actual times so not all machines ping at the same time would be good
i would not be in favor of greatly reducing the frequency though

the real solution, however, is ESM archives in cloud mirrors

Junien Fridrick (axino)
summary: - Update Ubuntu ESM cache only once daily?
+ ESM archive getting DoSed with legitimate traffic every day at 06:25
+ (cron.daily time)
Revision history for this message
Junien Fridrick (axino) wrote :

@esj : Hi ! Thanks for chiming in. I think doing APT activities once a day is fine indeed.

However, I think even if we had ESM archives in cloud mirrors, having a large amount of clients do their updates all at the same time would lead to said cloud mirrors being overwhelmed, unless we throw an absurd amount of resources at the problem - resources that wouldn't be used during the rest of the day.

I think we do need both : spread updates throughout the day (this bug), and have ESM archives in cloud mirrors.

Revision history for this message
Julian Andres Klode (juliank) wrote :

Unfortunately as far as trusty is concerned, it uses the legacy cron job and not the new systemd timers which spread the load over 24h, so it only is able to randomize over 30 minutes starting from the cron.daily run time (6am IIRC, probably UTC as everyone runs servers in UTC).

I don't think anyone remembers why we didn't backport the switch to timers to it (did we not have [working] timers in 18.04), but also they shouldn't then see more load than the archive mirror, this shouldn't be a new problem in that sense.

Revision history for this message
Junien Fridrick (axino) wrote :

@juliank : can we not spread the load over 24h without systemd timers ? With a random sleep for example ? Surely people were spreading tasks over a specific period before systemd got invented.

Also you say "they shouldn't then see more load than the archive mirror" but archive mirrors and ESM mirrors are vastly different, in terms of traffic, in terms of hardware, but also because ESM mirrors are HTTPS-only with an authentication backend whereas the archive mirrors are HTTP-only without authentication.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.