passthrough devices cause >17min boot delay
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Confirmed | Low | Unassigned |
qemu (Ubuntu) | Invalid | Low | Unassigned |
Bug Description
Adding passthrough devices to a guest introduces a boot delay of > 17 minutes. During this time libvirt reports the guest to be "paused". The delay does not seem to scale with the # of hostdevs - that is, 1 hostdev causes about a 17min delay, while 16 only bumps that up to ~18min. Removing all hostdevs avoids the delay.
dann frazier (dannf) wrote : | #1 |
dann frazier (dannf) wrote : | #2 |
Changed in qemu (Ubuntu): | |
importance: | Undecided → Medium |
assignee: | nobody → Rafael David Tinoco (rafaeldtinoco) |
status: | New → Confirmed |
Rafael David Tinoco (rafaeldtinoco) wrote : | #3 |
> > sudo perf record -a -F $((250*100)) -e cycles:u --
> > /usr/bin/
>
> I instead attached perf to the qemu process after it was spawned by
> libvirt so I didn't have to worry about producing a working qemu
> cmdline. I let it run for several seconds while still in the paused
> state.
Sure!
> > And check where we're spending so much time for the boot to happen.
> > Note I'm using "cycles:u", getting only userland stack trace probes
> > every (HZ*100).
>
> This showed no samples - apparently all samples were kernel side.
Either kernel or VMX/SVM mode.
> > That might give us enough information.
> >
> > Would you mind opening a launchpad bug (can be private if you prefer)
> > so Christian and I can take a look ?
>
> Sure:
> https:/
Err, sorry to ask you this, but I'm getting magic/endian check errors
on perf report, maybe something related to the arch and perf. Would
you mind reporting to stdio and pasting it into the case? Are you using
a fully updated Eoan (userland/kernel) for everything?
(c)inaddy@
report -f -v
magic/endian check failed
Tks!
Rafael David Tinoco (rafaeldtinoco) wrote : | #4 |
The attached file is already the stdio output, sorry for the previous message.
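For reference, a minimal sketch of the record/report invocation being discussed here; the PID, sampling frequency, duration and file names are placeholders, not the exact values used:
# attach to the already-running qemu process and sample for ~10 seconds
$ sudo perf record -p <qemu-pid> -F 25000 -e cycles -o perf.data -- sleep 10
# a plain-text report avoids having to read the binary perf.data on another machine/arch
$ sudo perf report -i perf.data --stdio > perf-report.txt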
Christian Ehrhardt (paelzer) wrote : | #5 |
I have seen slow boots before, but they scaled with the number of devices and the amount of memory, reaching ~5min (for >1TB memory init) and ~2.5min for 16-device pass-through.
Nvidia has seen those times (the ones I mentioned) and has more or less accepted them. I discussed with Anish how using HugePages helps to reduce a lot of these times.
But 17min for just one device certainly feels wrong.
Christian Ehrhardt (paelzer) wrote : | #6 |
Since I knew memory often is the more painful part, I started with 512MB, 1 CPU, 1 PCI passthrough device.
Note: I installed debug symbols for glibc and qemu
On init I find initially the guests CPU thread rather busy (well, booting up)
80.66% CPU 0/KVM [kernel]
Passthrough is successful - lspci from guest:
00:07.0 3D controller: NVIDIA Corporation Device 1db8 (rev a1)
Usually you have a few sections of execution, here are my timings:
1. from virsh start until there is a console connected
That is the libvirt prep and the early startup of the qemu process
until it provides something on the socket representing the console
=> 4 seconds
2. from #1 until we see anything in the console moving
That usually is mostly consumed by KVM (host) to prep mappings
and potentially scrub the memory and prepare devices.
=> 12 seconds
3. from #2 until we can log in
That is the guest kernel, followed by guest userspace initializing
=> 17 seconds
So overall I needed 33 seconds for a guest with one device passed through, which is not too bad.
Your example was the full-size (1.4TB) guest, and I'd expect that this is what makes the difference here.
Trying a few more tests ...
Christian Ehrhardt (paelzer) wrote : | #7 |
I bumped up to the config you had (but with one PT device):
- Host phys bits machine type for larger mappings
- more CPUs: 1 -> 32
Adding/removing a PT device in the configs above doesn't change a lot.
As assumed, none of these increased the time tremendously.
Then I went to bump up the memory size.
T1: use 1.2 TB but no PT device
#1: 6 sec
#2: 21 sec
#3: 16 sec
This only very slightly increases #2, due to the additional memory that needs to be set up.
T2: use 1.2 TB with one PT device
#1: 253 sec
#2: 20 sec
#3: 18 sec
The time-consuming part is a single process, with kernel-side load.
The associated userspace process is qemu, but the load is in the kernel, close to 100%.
Samples: 62K of event 'cycles:ppp', Event count (approx.): 34521154809
Overhead Shared Object Symbol
73.91% [kernel] [k] clear_page_erms
9.53% [kernel] [k] clear_huge_page
1.65% [kernel] [k] follow_
Christian Ehrhardt (paelzer) wrote : | #8 |
IMHO this is the same issue we discussed around March this year
=> https:/
In an associated mail thread we even discussed the pro/con of changes like
<allocation mode="immediate
We can see above that this is about (transparent huge page) setup for all of the memory.
We could disable huge pages entirely, but that would make this even less efficient.
Christian Ehrhardt (paelzer) wrote : | #9 |
T3: use 1.2 TB with one PT device - THP=off
#1: 476 seconds
#2: 31 seconds
#3: 20 seconds
ubuntu@akis:~$ echo never | sudo tee /sys/kernel/
never
ubuntu@akis:~$ cat /sys/kernel/
always madvise [never]
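The sysfs path is truncated in the paste above; for reference, this is the standard transparent-hugepage control file (assuming the usual location):
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ cat /sys/kernel/mm/transparent_hugepage/enabled   # expected: always madvise [never]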
Samples: 88K of event 'cycles:ppp', Event count (approx.): 44410625251
Overhead Shared Object Symbol
22.12% [kernel] [k] _raw_spin_lock
14.56% [kernel] [k] try_charge
13.00% [kernel] [k] clear_page_erms
11.48% [kernel] [k] get_page_
8.10% [kernel] [k] handle_pte_fault
2.04% [kernel] [k] __pagevec_
1.80% [kernel] [k] follow_page_pte
We still see page scrubbing and mapping initialization, but no more transparent pages.
As expected this changed the perf profile but didn't provide much gain
(actually a loss, on init time and later on as well).
Christian Ehrhardt (paelzer) wrote : | #10 |
Now, to get a good result, let's use 1G Huge Pages.
Kernel cmdline:
default_
Gives:
HugePages_Total: 1210
HugePages_Free: 1210
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Guest config extra:
<memoryBacking>
<hugepages/>
</memoryBacking>
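The kernel cmdline is truncated above; a hedged sketch of a matching setup - the page count 1210 mirrors the HugePages_Total shown above, and the parameter names are the standard kernel hugepage boot options:
# /etc/default/grub (then run update-grub and reboot)
GRUB_CMDLINE_LINUX_DEFAULT="... default_hugepagesz=1G hugepagesz=1G hugepages=1210"
# verify after reboot
$ grep -i huge /proc/meminfo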
Slightly changed profile:
Samples: 28K of event 'cycles:ppp', Event count (approx.): 19134377327
Overhead Shared Object Symbol
27.02% [kernel] [k] follow_hugetlb_page
11.23% [kernel] [k] __get_user_pages
9.10% [kernel] [k] _raw_spin_lock
7.58% [kernel] [k] down_read
7.53% [kernel] [k] up_read
6.14% [kernel] [k] huge_pte_offset
T4: use 1.2 TB with one PT device, 1G sized explicit HugePages
#1: 59 seconds
#2: 21 seconds
#3: 18 seconds
This is what you'd expect from such a huge guest, and it works just as expected.
Christian Ehrhardt (paelzer) wrote : | #11 |
I mentioned in the last discussion around this that the one thing that could be done is to make this single-threaded memory init a multi-threaded action (in the kernel). I doubt that we can make it omit the initialization entirely. And even though it is faster, the 1G Huge Page setup could still be more efficient.
To consider that, we need to know what exactly it does.
The last calls from userspace before this long one-thread-busy period starts are:
openat(AT_FDCWD, "/dev/vfio/vfio", O_RDWR|O_CLOEXEC) = 57
ioctl(57, VFIO_GET_
ioctl(57, VFIO_CHECK_
ioctl(57, VFIO_CHECK_
ioctl(56, VFIO_GROUP_
ioctl(57, VFIO_SET_IOMMU, 0x3) = 0
ioctl(57, VFIO_DEVICE_
ioctl(12, KVM_CREATE_DEVICE, 0x7fff4499dbd4) = 0
ioctl(58, KVM_SET_
ioctl(57, VFIO_DEVICE_
ioctl(57, VFIO_DEVICE_
ioctl(57, VFIO_DEVICE_
ioctl(57, VFIO_DEVICE_
ioctl(57, VFIO_DEVICE_
Note: from our last thoughts on this, it is quite possible that the granularity at which this could be split depends on the locks being used, e.g. per node, as arbitrary chunks might just lock content and ping-pong those locks.
It might be similar to [1], which ended up in [2], after all. Not sure yet whether the kernel side does reset or map here.
[1]: https:/
[2]: https:/
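A hedged sketch of how the per-call timings shown in the later comments can be captured; the qemu binary, guest size and device address here are placeholders taken from the examples in this thread:
# -T prints the time spent in each syscall, -tt adds wall-clock timestamps
$ sudo strace -f -tt -T -e trace=ioctl,openat -o strace.log \
    qemu-system-x86_64 -m 4096 -smp 1 -display none \
    -device vfio-pci,host=0000:21:00.1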
Christian Ehrhardt (paelzer) wrote : | #12 |
A trace into the early phase (first some other init with ~16 CPUs), then the long phase of one thread blocking.
So we will see it "enter" the slow phase as well as iterating within it.
There seem to be two phases: one around alloc_pages_current and then one around slot_rmap_
I'd appreciate it if the kernel team could take a look at the details here, at what the call chains actually are,
and at how this is related to the ioctls that seem to trigger it.
After all the testcase is rather simple:
uvt-simplestrea
uvt-kvm create --password ubuntu test-slowstart arch=amd64 release=bionic label=daily
Then add to guest config:
<memory unit='KiB'
<currentMemory unit='KiB'
<type arch='x86_64' machine=
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x34' slot='0x00' function='0x0'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</hostdev>
A guest like that is already on the system for testing.
ftrace log (compressed 179M, full 4.8G) can be found on the system.
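A hedged sketch of the reproduction as one sequence; the guest name and hostdev address are the ones from the example above, and the exact memory value (elided in the paste) stays a placeholder:
$ uvt-simplestreams-libvirt sync arch=amd64 release=bionic label=daily
$ uvt-kvm create --password ubuntu test-slowstart arch=amd64 release=bionic label=daily
$ virsh edit test-slowstart    # raise <memory>/<currentMemory>, add the <hostdev> block shown above
$ virsh start test-slowstart && virsh console test-slowstart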
Christian Ehrhardt (paelzer) wrote : | #13 |
Summary:
As I mentioned before (on the other bug that I referred to):
the problem is that with a PT device it needs to reset and map the VFIO devices.
So with >0 PT devices attached it needs an init that scales with the memory size of the guest (see my fast results with PT but small guest memory).
As I outlined before the "fix" for that is either:
- (admin) use Huge pages (1G if possible) for the very huge guest which will make this way more efficient.
- (kernel development) get the kernel to use more cores for this initialization (currently 1)
Last time we discussed this, Nvidia said they consider 1G Hugepages the recommended setup for huge guests (due to the benefits), but not for guests in general (due to the complexity of managing the memory).
I don't know where it ended up, but I expect it to be in some guidelines.
That is why I wondered about the lack of huge pages in the spec to test qemu passthrough perf, if you remember.
I don't think there is much more qemu can do at this point (at least not that I know of), but we can use this re-discovery of the issue as a chance, if you want to dive into the kernel side (or ask one of your teammates to) and check whether the initialization could be made multi-CPU for better scaling.
I'm adding a kernel task for your consideration.
Changed in linux (Ubuntu): | |
status: | New → Confirmed |
Changed in qemu (Ubuntu): | |
assignee: | Rafael David Tinoco (rafaeldtinoco) → nobody |
status: | Confirmed → Incomplete |
Colin Ian King (colin-king) wrote : | #14 |
Naive question: can we tweak the hugepage file settings at run time via
/proc/sys/
Changed in linux (Ubuntu): | |
importance: | Undecided → Medium |
assignee: | nobody → Colin Ian King (colin-king) |
Christian Ehrhardt (paelzer) wrote : | #15 |
You can do so even per-size via e.g. /sys/kernel/
As discussed, the later the allocation the higher the chance it fails, so re-check the sysfs file after each change to see whether it actually got that much memory.
The default size is only a boot-time parameter,
but you can specify explicit sizes in the libvirt XML.
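A minimal sketch of the runtime, per-size allocation being described, assuming the standard sysfs layout and 1G pages; the count of 16 is only an example:
# request 16 x 1G pages at runtime, then re-check how many were actually allocated
$ echo 16 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages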
Christian Ehrhardt (paelzer) wrote : | #16 |
As assumed, this really seems to be cross-arch and to apply for all sizes.
Here 16 CPUs, 128G on ppc64el:
#1: 54 seconds
#2: 7 seconds
#3: 23 seconds
Upped to 192GB this has:
#1: 75 seconds
#2: 5 seconds
#3: 23 seconds
As a note, in this case I checked that there are ~7 seconds before it goes into this "single thread, kernel busy" load. Those 7 seconds might be other/usual init in qemu/kvm while the long load is the
The linear scaling is what breaks it: if the same speed applies, 10 times as much memory (+1.28TB) would be in the 10-minute range as well. The remaining difference might be DIMM speed and bandwidth.
Christian Ehrhardt (paelzer) wrote : | #17 |
On this platform strace still confirms the same paths:
And perf as well (slight arch differences, but still mem setup).
46.85% [kernel] [k] lruvec_lru_size
16.89% [kernel] [k] clear_user_page
5.74% [kernel] [k] inactive_
4.48% [kernel] [k] find_next_bit
3.74% [kernel] [k] cpumask_next
2.72% [kernel] [k] copypage_power7
2.40% [kernel] [k] kvm_age_rmapp
1.94% [kernel] [k] snapshot_refaults
1.28% [kernel] [k] shrink_page_list
Note: This is 5.2.0-10-generic
Also cross-arch: as soon as no hostdev is passed, the init can be skipped. The 75 seconds above drop down to the ~7 seconds I have seen before the single-thread init starts.
Christian Ehrhardt (paelzer) wrote : | #18 |
Hmm, with strace showing almost a hang on a single one of those ioctl calls you'd think that this is easy to spot :-/
But this isn't as clear as expected:
sudo trace-cmd record -p function_graph -l vfio_pci_ioctl -O graph-time
Disable all but 1 CPU to have less concurrency in the trace.
=> Not much better
But it is the handler:
1276    .ioctl = vfio_pci_ioctl,
There seem to be a lot of calls to vfio_pci_ioctl and not just a few slow ones.
Whatever strace suggested, that is not how it actually looks.
Starting again with wider traces to check what the entry might be.
sys_splice is high on the radar atm.
As a bonus, the test env now broke and can't do passthrough anymore;
the Emulex firmware breaks and fails.
I'll abort for now, focus on other things, and re-set up the testbed another day.
Christian Ehrhardt (paelzer) wrote : | #19 |
I built qemu head from git
$ export CFLAGS="-O0 -g"
$ ./configure --disable-user --disable-
$ make -j
$ virsh nodedev-detach pci_0005_01_00_0 --driver vfio
$ virsh nodedev-detach pci_0005_01_00_1 --driver vfio
$ virsh nodedev-detach pci_0005_01_00_2 --driver vfio
$ virsh nodedev-detach pci_0005_01_00_3 --driver vfio
$ virsh nodedev-detach pci_0005_01_00_4 --driver vfio
$ virsh nodedev-detach pci_0005_01_00_5 --driver vfio
$ sudo ppc64-softmmu/
I found VFIO_IOMMU_
96783 0.000088 readlink(
96783 0.000066 openat(AT_FDCWD, "/dev/vfio/8", O_RDWR|O_CLOEXEC) = 16 <0.000025>
96783 0.000059 ioctl(16, VFIO_GROUP_
96783 0.000050 openat(AT_FDCWD, "/dev/vfio/vfio", O_RDWR|O_CLOEXEC) = 17 <0.000014>
96783 0.000049 ioctl(17, VFIO_GET_
96783 0.000039 ioctl(17, VFIO_CHECK_
96783 0.000040 ioctl(17, VFIO_CHECK_
96783 0.000037 ioctl(17, VFIO_CHECK_
96783 0.000037 ioctl(16, VFIO_GROUP_
96783 0.000037 ioctl(17, VFIO_SET_IOMMU, 0x7) = 0 <0.000039>
96783 0.000070 ioctl(17, VFIO_IOMMU_
96785 10.019032 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <10.022751>
96785 0.053520 madvise(
96785 0.007283 exit(0) = ?
96785 0.000072 +++ exited with 0 +++
96783 276.894553 <... ioctl resumed> , 0x7fffe3fd6b70) = 0 <286.974436>
96783 0.000107 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
This is repeatable and explains why I haven't seen exactly the same thing that I saw on x86.
VFIO_IOMMU_
Christian Ehrhardt (paelzer) wrote : | #20 |
On x86 this looks pretty similar and at the place we have seen before:
45397 0.000073 readlink(
45397 0.000053 openat(AT_FDCWD, "/dev/vfio/45", O_RDWR|O_CLOEXEC) = 31 <0.000033>
45397 0.000060 ioctl(31, VFIO_GROUP_
45397 0.000041 openat(AT_FDCWD, "/dev/vfio/vfio", O_RDWR|O_CLOEXEC) = 32 <0.000019>
45397 0.000046 ioctl(32, VFIO_GET_
45397 0.000038 ioctl(32, VFIO_CHECK_
45397 0.000035 ioctl(31, VFIO_GROUP_
45397 0.000033 ioctl(32, VFIO_SET_IOMMU, 0x3) = 0 <0.000174>
45397 0.000199 ioctl(32, VFIO_DEVICE_
45397 0.000038 ioctl(14, KVM_CREATE_DEVICE, 0x7ffd774f8094) = 0 <0.000015>
45397 0.000039 ioctl(33, KVM_SET_
45397 0.000054 ioctl(32, VFIO_DEVICE_
45397 0.000312 ioctl(32, VFIO_DEVICE_
45397 0.000058 ioctl(32, VFIO_DEVICE_
45397 0.000060 ioctl(32, VFIO_DEVICE_
45397 0.334593 ioctl(32, VFIO_DEVICE_
45397 0.000088 ioctl(32, VFIO_DEVICE_
45402 9.654538 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <10.000112>
45402 0.000144 madvise(
45402 0.000076 exit(0) = ?
45402 0.000074 +++ exited with 0 +++
45397 45.903415 <... ioctl resumed> , 0x7ffd774f7f20) = 0 <55.558237>
Christian Ehrhardt (paelzer) wrote : | #21 |
The above was through libvirt; now doing it directly with qemu to make it easier to throw into debugging:
$ virsh nodedev-detach pci_0000_21_00_1 --driver vfio
$ qemu/x86_
The speed differs a lot between retries:
First run (after boot):
5204 3.793765 <... ioctl resumed> , 0x7ffffa6687b0) = 0 <13.464192>
Second run (after aborting the former qemu)
5251 179.361434 <... ioctl resumed> , 0x7ffed49fbc70) = 0 <189.030433>
Third run (after aborting the former qemu)
5728 703.886935 <... ioctl resumed> , 0x7fff9f6a7120) = 0 <713.556876>
Maybe a former unclean stop makes it take extra time?
Or there is still cleanup going on in the iommu which stalls the follow-on mappings from being established.
Forth run (after proper guest shutdown + 10 min wait)
6171 11.167221 <... ioctl resumed> , 0x7ffe597f03e0) = 0 <20.838401>
Fifth run (again kill and restart)
6215 1182.244403 <... ioctl resumed> , 0x7fff20a49b10) = 0 <1191.917291>
10 sec iommu counters while in the hang:
34330395280 cycles
16 cmd_processed (33.33%)
6 cmd_processed_inv (33.37%)
0 ign_rd_
45 int_dte_hit (33.44%)
0 int_dte_mis (33.44%)
345 mem_dte_hit (33.44%)
907 mem_dte_mis (33.44%)
111 mem_iommu_
102 mem_iommu_
7927 mem_iommu_
1761 mem_iommu_
0 mem_pass_excl (33.30%)
0 mem_pass_pretrans (33.29%)
0 mem_pass_untrans (33.29%)
0 mem_target_abort (33.29%)
8492 mem_trans_total (33.30%)
0 page_tbl_read_gst (33.30%)
168 page_tbl_read_nst (33.30%)
129 page_tbl_read_tot ...
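The counters above look like the AMD IOMMU PMU; a hedged sketch of how such a 10-second snapshot can be taken, assuming the amd_iommu_0 PMU is exposed on the host:
$ sudo perf stat -a -e 'amd_iommu_0/cmd_processed/,amd_iommu_0/mem_trans_total/,amd_iommu_0/mem_dte_hit/,amd_iommu_0/mem_dte_mis/' -- sleep 10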
Christian Ehrhardt (paelzer) wrote : | #22 |
Just when I thought I understood the pattern.
Sixth run (again kill and restart)
6384 9.826097 <... ioctl resumed> , 0x7ffcc8ed6e20) = 0 <19.495688>
So for now let's summarize that it varies :-/
But it always seems slow.
Christian Ehrhardt (paelzer) wrote : | #23 |
Many ioctls (as expected) but they are all fast and match what we knew from strace.
Thread 1 "qemu-system-x86" hit Catchpoint 1 (call to syscall ioctl), 0x00007ffff72fae0b in ioctl () at ../sysdeps/
78 in ../sysdeps/
(gdb) bt
#0 0x00007ffff72fae0b in ioctl () at ../sysdeps/
#1 0x0000555555914217 in vfio_init_container (container=
#2 0x000055555591445e in vfio_connect_
#3 0x0000555555914f35 in vfio_get_group (groupid=45, as=0x5555565ca5a0 <address_
#4 0x0000555555920575 in vfio_realize (pdev=0x5555576
#5 0x0000555555b853d7 in pci_qdev_realize (qdev=0x5555576
#6 0x0000555555aac178 in device_set_realized (obj=0x55555760
#7 0x0000555555cb857a in property_set_bool (obj=0x55555760
#8 0x0000555555cb6731 in object_property_set (obj=0x55555760
#9 0x0000555555cb9965 in object_
#10 0x0000555555cb6a2a in object_
#11 0x0000555555a12cb1 in qdev_device_add (opts=0x5555566
#12 0x0000555555a1b11f in device_init_func (opaque=0x0, opts=0x5555566a
#13 0x0000555555e1c7fd in qemu_opts_foreach (list=0x5555564
at util/qemu-
#14 0x0000555555a2009d in main (argc=14, argv=0x7fffffff
The slow one is again the full iommu mapping of the guest memory:
(gdb) bt
#0 0x00007ffff72fae0b in ioctl () at ../sysdeps/
#1 0x0000555555911a9e in vfio_dma_map (container=
#2 0x000055555591299b in vfio_listener_
#3 0x000055555587dfb0 in listener_
#4 0x000055555587e3dd in memory_
#...
Christian Ehrhardt (paelzer) wrote : | #24 |
Reference:
The call from qemu that I think we see above (on x86) is at [1].
If the assumption is correct this time, the kernel-side place would be at vfio_iommu_
For debugging:
$ gdb qemu/x86_
(gdb) catch syscall 16
(gdb) run -m 131072 -smp 1 -no-user-config -device vfio-pci,
(gdb) continue 22
# Now we are "in front" of the bad call.
Christian Ehrhardt (paelzer) wrote : | #25 |
Next I could build a test kernel with some debug around the vfio iommu dma map to check how the time below that call is spent.
I'm sure that data is already hidden in some of my trace data, but to eventually change/experiment I need to build one anyway.
I expect to summarize and go into a discussion at some point in any case, since even if one would split the map calls, IOMMUs are full of quirks and constraints that might bite us later.
Christian Ehrhardt (paelzer) wrote : | #26 |
Each qemu (version) takes a slightly different road to this point, but then seems to behave the same.
This approach is slightly better for getting "in front" of the slow call that maps all the memory.
$ virsh nodedev-detach pci_0000_21_00_1 --driver vfio
$ gdb /usr/bin/
(gdb) b vfio_dma_map
(gdb) command 1
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>if size != 134217728000
>cont
>end
>end
(gdb) run -m 131072 -smp 1 -no-user-config -device vfio-pci,
(gdb) catch syscall 16
(gdb) c
Also correct the link to qemu - it is the one from vfio_listener_
https:/
Christian Ehrhardt (paelzer) wrote : | #27 |
The iommu is locked in there early and the iommu element is what is passed from userspace.
That represents the vfio container for this device (container->fd)
qemu:
if (ioctl(
kernel:
static long vfio_iommu_
unsigned int cmd, unsigned long arg)
struct vfio_iommu *iommu = iommu_data;
[... into vfio_dma_do_map ...]
mutex_
There isn't much divide and conquer splitting that seems easily possible for now :-/
Down there, while this lock is held, all of the memory must be pinned
-> vfio_pin_
which gets the biggest chunk it can and then maps it
-> vfio_iommu_map
This is repeated until all of the requested size is handled.
Establishing iommu maps is known to be expensive; an assumption would be that in the semi-fast cases either:
- memory is still non-fragmented, so we only need a few calls
- the iommu is sort of async-busy from the former work (same calls, but taking longer)
That should be visible in the number of vfio_pin_
Christian Ehrhardt (paelzer) wrote : | #28 |
I modified the kernel to make a few functions non-inlined so they are better traceable:
vfio_dma_do_map
vfio_dma_do_unmap
mutex_lock
mutex_unlock
kzalloc
vfio_link_dma
vfio_pin_map_dma
vfio_pin_
vfio_iommu_map
Then run tracing on this load with limited to the functions in my focus:
$ sudo trace-cmd record -p function_graph -l vfio_iommu_
This I traced twice to compare it:
- after a clean boot; duration ~12 seconds
- right afterwards (rerun from gdb); duration ~175 seconds
It shows that the locks are interesting for concurrency considerations, but here they are mostly noise.
OTOH it is interesting that the slow case has many more of them; maybe it needs to unlock more often in between for interactivity. I'll count them in post-processing like this:
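(A sketch of the kind of counting meant here, assuming the trace-cmd run above wrote the default trace.dat; the grep patterns are just the traced function names:)
$ trace-cmd report -i trace.dat | grep -c ' mutex_lock'
$ trace-cmd report -i trace.dat | grep -c ' vfio_iommu_map'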
The result shows that the locking involved scaled much more.
Duration is ~14.5x and locks are ~34x; the average lock activity between vfio calls grows even more, from
1.52 -> 6.32.
But the calls to vfio_pin_
went from 64020 -> 327852, which means smaller chunks, I guess.
The timing of these functions changes slightly; initially it is quite different from later in the trace.
Let's take a clean shot at just those durations and sizes:
Christian Ehrhardt (paelzer) wrote : | #29 |
(systemtap)
probe module(
printf("New vfio_iommu_
start_
}
probe module(
timer=
printf(
stop_
}
probe module(
timer=
printf("%ld: %s\n", timer, $$parms);
}
The overhead was significant: the fast case this time was 132 seconds,
and the following slow case took 209 seconds (still slower, but not by factors).
While the overhead might have influenced the absolute timing too much,
it can still indicate the size/frequency distribution of these calls that we wanted to know.
Christian Ehrhardt (paelzer) wrote : | #30 |
This is a silly but useful distribution check with log10 of the allocation sizes:
Fast:
108 3
1293 4
12133 5
113330 6
27794 7
1119 8
Slow:
194 3
1738 4
17375 5
143411 6
55 7
3 8
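(A sketch of how such a histogram can be produced, assuming the map sizes were extracted one per line into a file; sizes.txt is a hypothetical name:)
$ awk '{ print int(log($1)/log(10)) }' sizes.txt | sort -n | uniq -c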
I got no warnings about missed calls, but always be aware that some numbers might be off - usually they are fine for relative comparisons.
So yeah, the slow case just has to map more, smaller pieces, as that is all it can find.
That explains the "getting worse with system runtime", and I don't think there is much that can be done here.
I'll discuss whether there is any gain in splitting this from one thread into many (for very huge memory sizes).
P.S. Finally I just want to re-iterate that using hugepages - due to their pre-allocation, less fragmentation and less mapping work - really is the "config" way out of this.
Christian Ehrhardt (paelzer) wrote : | #31 |
As qemu seems to be unable to do much, I'll set it to Triaged (we understand what is going on) and Low (we can't do much about it).
Changed in qemu (Ubuntu): | |
status: | Incomplete → Triaged |
importance: | Medium → Low |
Christian Ehrhardt (paelzer) wrote : | #32 |
As outlined in the past, conceptually there is nothing that qemu can do.
The kernel can in theory make memory zeroing concurrent and thereby scale with CPUs, but that is an effort that was already started twice and hasn't landed in the kernel yet.
Workarounds are known to shrink that time massively (huge pages).
Also, this is nothing the kernel team would usually "do" themselves; they would follow upstream on it, so I'm leaving it open but Low for them.
I realized those discussions were going on but not linked here.
=> https://<email address hidden>/T/
=> https:/
=> https:/
That is one example. I'd ask the kernel team to keep an eye open for that and be encouraging in the discussions. This really has a huge benefit in those use cases.
Changed in linux (Ubuntu): | |
importance: | Medium → Low |
Changed in qemu (Ubuntu): | |
status: | Triaged → Invalid |
Changed in linux (Ubuntu): | |
assignee: | Colin Ian King (colin-king) → nobody |
Here's a perf report of a 'sudo perf record -p 6949 -a -F 25000 -e cycles', after libvirt spawned qemu w/ pid 6949.