Testflinger disconnections

Bug #1905094 reported by Jeff Lane 
Affects: Testflinger
Status: Invalid
Importance: Undecided
Assigned to: Paul Larson

Bug Description

Lately I've been seeing a lot of disconnection error messages during testflinger runs. Some recover, others seem to fail completely. Below are two examples. The first one had several disconnections and eventual reconnections, and that one seems OK otherwise, as testing did complete.

The second one, however, seems to have completely disconnected. Interestingly, I was able to reconnect using checkbox-cli manually, and found that once I did, testing resumed. So that's an additional issue: the checkbox test in progress seems to have paused when testflinger lost the connection, and sat there paused until I manually connected more than a day later.

This run disconnected and then later reconnected multiple times:
stress-ng: info: [3609835] dispatching hogs: 8 qsort
stress-ng: info: [3609835] successful run completed in 300.00s (5 mins, 0.00 secs)
stress-ng: info: [3609850] dispatching hogs: 8 stack
Reconnecting...
Rejoined session.
In progress: com.canonical.certification::memory/memory_stress_ng (59/104)

stress-ng: info: [3609850] successful run completed in 300.82s (5 mins, 0.82 secs)
stress-ng: info: [3609892] dispatching hogs: 8 str
stress-ng: info: [3609892] successful run completed in 300.00s (5 mins, 0.00 secs)
stress-ng: info: [3609906] dispatching hogs: 8 stream
stress-ng: info: [3609908] stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info: [3609908] stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info: [3609908] stress-ng-stream: Using CPU cache size of 8192K
stress-ng: info: [3609911] stress-ng-stream: memory rate: 1293.94 MB/sec, 517.58 Mflop/sec (instance 3)
stress-ng: info: [3609914] stress-ng-stream: memory rate: 999.71 MB/sec, 399.88 Mflop/sec (instance 6)
stress-ng: info: [3609908] stress-ng-stream: memory rate: 1593.38 MB/sec, 637.35 Mflop/sec (instance 0)
stress-ng: info: [3609915] stress-ng-stream: memory rate: 910.95 MB/sec, 364.38 Mflop/sec (instance 7)
stress-ng: info: [3609912] stress-ng-stream: memory rate: 1199.38 MB/sec, 479.75 Mflop/sec (instance 4)
stress-ng: info: [3609913] stress-ng-stream: memory rate: 1100.02 MB/sec, 440.01 Mflop/sec (instance 5)
stress-ng: info: [3609910] stress-ng-stream: memory rate: 1399.39 MB/sec, 559.76 Mflop/sec (instance 2)
stress-ng: info: [3609909] stress-ng-stream: memory rate: 1498.41 MB/sec, 599.37 Mflop/sec (instance 1)
stress-ng: info: [3609906] successful run completed in 300.02s (5 mins, 0.02 secs)
stress-ng: info: [3609917] dispatching hogs: 8 tsearch
stress-ng: info: [3609917] successful run completed in 300.07s (5 mins, 0.07 secs)
stress-ng: info: [3609934] dispatching hogs: 8 vm-rw
stress-ng: info: [3609934] successful run completed in 300.01s (5 mins, 0.01 secs)
stress-ng: info: [3609953] dispatching hogs: 8 wcs
stress-ng: info: [3609953] successful run completed in 300.00s (5 mins, 0.00 secs)
stress-ng: info: [3609968] dispatching hogs: 8 zero
stress-ng: info: [3609968] successful run completed in 300.00s (5 mins, 0.00 secs)
stress-ng: info: [3609979] dispatching hogs: 8 mlock
stress-ng: info: [3609979] successful run completed in 300.26s (5 mins, 0.26 secs)
stress-ng: info: [3610002] dispatching hogs: 8 mmapfork
stress-ng: info: [3610002] successful run completed in 300.48s (5 mins, 0.48 secs)
stress-ng: info: [3692061] dispatching hogs: 8 mmapmany
stress-ng: info: [3692061] successful run completed in 300.03s (5 mins, 0.03 secs)
stress-ng: info: [3692083] dispatching hogs: 8 mremap
stress-ng: info: [3692083] successful run completed in 300.70s (5 mins, 0.70 secs)
stress-ng: info: [3692103] dispatching hogs: 8 shm-sysv
stress-ng: info: [3692103] successful run completed in 301.00s (5 mins, 1.00 secs)
stress-ng: info: [3692127] dispatching hogs: 8 vm-splice
stress-ng: info: [3692127] successful run completed in 300.00s (5 mins, 0.00 secs)
stress-ng: info: [3692137] dispatching hogs: 8 malloc
stress-ng: info: [3692137] successful run completed in 377.02s (6 mins, 17.02 secs)
stress-ng: info: [3692161] dispatching hogs: 8 mincore
stress-ng: info: [3692161] successful run completed in 377.00s (6 mins, 17.00 secs)
stress-ng: info: [3692177] dispatching hogs: 8 vm
stress-ng: info: [3692177] successful run completed in 377.01s (6 mins, 17.01 secs)
stress-ng: info: [3692197] dispatching hogs: 8 bigheap
stress-ng: info: [3692197] successful run completed in 377.22s (6 mins, 17.22 secs)

stress-ng: info: [3692221] dispatching hogs: 8 brk
Reconnecting...
Reconnecting...
Reconnecting...
Rejoined session.
In progress: com.canonical.certification::memory/memory_stress_ng (59/104)
stress-ng: info: [3692221] successful run completed in 379.32s (6 mins, 19.32 secs)

This run disconnected and never recovered:
stress-ng: info: [3349656] successful run completed in 922.02s (15 mins, 22.02 secs)
stress-ng: info: [3349749] dispatching hogs: 40 bigheap
stress-ng: info: [3349749] successful run completed in 922.55s (15 mins, 22.55 secs)
stress-ng: info: [3349845] dispatching hogs: 40 brk
Reconnecting...
Reconnecting...
Reconnecting...
Reconnecting...
Reconnecting...
Reconnecting...
Reconnecting...
Connection lost!
Service explicitly disconnected you. Possible reason: new remote connected to the service
+ EXITCODE=0
+ mkdir -p artifacts
+ cp launcher artifacts
+ find /home/ubuntu/ -name 'submission_*.junit.xml' -exec mv '{}' artifacts/junit.xml ';'
+ find /home/ubuntu/ -name 'submission_*.html' -exec mv '{}' artifacts/submission.html ';'
+ find /home/ubuntu/ -name 'submission_*.xlsx' -exec mv '{}' artifacts/submission.xlsx ';'
+ find /home/ubuntu/ -name 'submission_*.tar.xz' -exec mv '{}' artifacts/submission.tar.xz ';'
+ tar -xf artifacts/submission.tar.xz submission.json
tar: artifacts/submission.tar.xz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
+ mv submission.json artifacts
mv: cannot stat 'submission.json': No such file or directory
++ _run grep /boot/efi /proc/mounts
++ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ServerAliveInterval=30 -o ServerAliveCountMax=3 ubuntu@10.245.130.20 grep /boot/efi /proc/mounts
++ grep -o '.*[^0-9]'
++ cut -d ' ' -f 1
+ ROOT_DISK=/dev/sda
+ echo 'Zeroing Disk /dev/sda'
Zeroing Disk /dev/sda
+ _run sudo sgdisk -Z /dev/sda
+ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ServerAliveInterval=30 -o ServerAliveCountMax=3 ubuntu@10.245.130.20 sudo sgdisk -Z /dev/sda
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
2020-11-20 11:15:21,439 drapion INFO: DEVICE AGENT: END testrun
*************************************************

* Starting testflinger cleanup phase on drapion *

*************************************************

Cleaning up container if it exists...
drapion
complete
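
For context on the timing of the "Reconnecting..." lines: the device agent's ssh invocations in the log above pass -o ServerAliveInterval=30 -o ServerAliveCountMax=3, and with those options ssh drops an unresponsive connection after roughly interval times count seconds of silence. A minimal sketch of that arithmetic (the checkbox remote reconnect loop itself is likely separate behavior, as suggested in the comment below):

```shell
# Keepalive math for the options seen in the agent's ssh commands above:
# ServerAliveInterval=30 sends a probe after every 30s of inactivity, and
# ServerAliveCountMax=3 tolerates 3 unanswered probes before giving up,
# so an unresponsive server is dropped after roughly 30 * 3 = 90 seconds.
interval=30
count=3
echo "ssh gives up after about $((interval * count))s without a server response"
```

This only governs the agent's ssh sessions to the DUT; it does not explain why the checkbox session itself never recovered in the second run.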

Revision history for this message
Jeff Lane  (bladernr) wrote :

Note, in the drapion example, I did NOT at any time connect to the checkbox remote session from another machine. I kicked it off via testflinger, started polling the output and walked away waiting for the job to complete.

tags: added: hwcert-server
Revision history for this message
Paul Larson (pwlars) wrote:

Hi, I think the disconnects you are seeing here are actually coming from checkbox. I'm guessing you are running remote? I'm wondering if the stress run is maybe causing it. If you investigate and find that testflinger itself is having an issue, feel free to reopen, though.

Changed in testflinger:
assignee: nobody → Paul Larson (pwlars)
status: New → Invalid