ubuntu-22.10-live-server-s390x.iso installation crashed

Bug #1996006 reported by liwbj@cn.ibm.com
This bug affects 1 person
Affects                  Status  Importance  Assigned to  Milestone
Ubuntu on IBM z Systems  New     Undecided   Unassigned
subiquity                New     Undecided   Unassigned

Bug Description

I am trying to install Ubuntu 22.10 (ubuntu-22.10-live-server-s390x.iso) in my s390x environment.
When using a SCSI FCP disk, the installer crashes with the error below right after I enter the proxy server address.
I have no idea what causes this; could you take a look? If you need more info, please let me know.

BTW: The installation succeeds on NVMe storage, but not on SCSI FCP.

Welcome to Ubuntu 22.10 (GNU/Linux 5.19.0-21-generic s390x)

 * Documentation: https://help.ubuntu.com
 * Management: https://landscape.canonical.com
 * Support: https://ubuntu.com/advantage

  System information as of Wed Nov 9 02:34:31 UTC 2022

  System load: 0.0 Memory usage: 9% Processes: 1076
  Usage of /home: unknown Swap usage: 0% Users logged in: 0

0 updates can be applied immediately.

The list of available updates is more than a week old.
To check for new updates run: sudo apt update

Last login: Wed Nov 9 02:28:02 2022 from 10.20.92.70
connecting...
generating crash report
report saved to /var/crash/1667961356.095709324.ui.crash
Traceback (most recent call last):
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquity/client/controllers/filesystem.py", line 258, in _guided_choice
    self.ui.set_body(FilesystemView(self.model, self))
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquity/ui/views/filesystem/filesystem.py", line 476, in __init__
    self.refresh_model_inputs()
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquity/ui/views/filesystem/filesystem.py", line 522, in refresh_model_inputs
    self.avail_list.refresh_model_inputs()
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquity/ui/views/filesystem/filesystem.py", line 410, in refresh_model_inputs
    for obj, cells in summarize_device(device, filter):
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquity/ui/views/filesystem/helpers.py", line 34, in summarize_device
    anns = labels.annotations(device) + labels.usage_labels(device)
  File "/snap/subiquity/4005/usr/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquity/common/filesystem/labels.py", line 96, in _annotations_vg
    member = next(iter(vg.devices))
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/snap/subiquity/4005/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/snap/subiquity/4005/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquity/__main__.py", line 5, in <module>
    sys.exit(main())
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquity/cmd/tui.py", line 150, in main
    subiquity_interface.run()
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquity/client/client.py", line 407, in run
    super().run()
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquitycore/tui.py", line 381, in run
    super().run()
  File "/snap/subiquity/4005/lib/python3.8/site-packages/subiquitycore/core.py", line 135, in run
    raise exc
RuntimeError: coroutine raised StopIteration
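
A note on reading the two tracebacks above: the inner failure is next(iter(vg.devices)) being called on an empty collection, which raises StopIteration. Because this happens inside a coroutine, Python replaces it with "RuntimeError: coroutine raised StopIteration" (the same rule that PEP 479 introduced for generators). The following is a minimal sketch that reproduces the pattern. FakeVolumeGroup and all function names are hypothetical stand-ins, not subiquity's actual code, and the default-value variant is only an illustration, not necessarily the fix applied upstream.

```python
import asyncio


class FakeVolumeGroup:
    """Hypothetical stand-in for a volume group; not subiquity's real class."""

    def __init__(self, devices):
        self.devices = set(devices)


def first_member_unsafe(vg):
    # Same pattern as in the traceback: raises StopIteration when empty.
    return next(iter(vg.devices))


def first_member_safe(vg):
    # Illustrative alternative: a default value avoids StopIteration.
    return next(iter(vg.devices), None)


async def annotate(vg):
    # A StopIteration escaping a coroutine is replaced by RuntimeError.
    return first_member_unsafe(vg)


def describe_failure(vg):
    # Run the coroutine and report how it failed, if it did.
    try:
        asyncio.run(annotate(vg))
    except RuntimeError as exc:
        return f"{type(exc).__name__}: {exc}"
    return "no error"
```

With an empty FakeVolumeGroup, describe_failure returns the same RuntimeError text that ends the traceback, which is consistent with a volume group that has no member devices (for example, leftover LVM metadata on a previously used disk).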

liwbj@cn.ibm.com (liwbj)
affects: subiquity → ubuntu-z-systems
Dan Bungert (dbungert) wrote :

This looks like a known issue that I would like to fix soon (LP: #1993633)

To confirm:
1) Did you choose Custom storage layout?
2) If you wipe the device first with wipefs -a does the situation improve?

Thanks!

liwbj@cn.ibm.com (liwbj) wrote :

1) Did you choose Custom storage layout?
Yes. I tried the standard mode, LVM mode and Custom storage mode, but got the same result each time.

2) If you wipe the device first with wipefs -a does the situation improve?
After connecting with ssh installer@xxxxxxx, there is no chance to enter the wipefs -a command; I can only follow the installer process, which then crashes. Could you tell me how to execute it?

Dan Bungert (dbungert) wrote :

Thank you for retesting.

> After connecting with ssh installer@xxxxxxx, there is no chance to enter the wipefs -a command; I can only follow the installer process, which then crashes. Could you tell me how to execute it?

If you use the keyboard up arrow to the top right "Help" button, you'll get a menu with "Enter Shell" as one of the options.

liwbj@cn.ibm.com (liwbj) wrote :

Thank you Dan. I tried it, but it did not help; I got the same result.

Frank Heimes (fheimes) wrote :

Hello 'liwbj',
just to be sure that we understood you correctly (and that it's really similar to LP#1993633):

- so you started an installation (IPLed the installer)
- logged in to the installer from remote using ssh
- ran through the installer steps until reaching the 'zdev' screen
- enabled your LUN, by enabling the zfcp-host (aka HBAs)
- then you switched to the installer shell (via the Help menu or just by using 'F2')
- there you wiped the entire disk/LUN manually like
  (starting with potential partitions, and then the entire disk device):
  wipefs -a -f /dev/sda1
  wipefs -a -f /dev/sda2
  wipefs -a -f /dev/sda
- and then you restarted the installation from scratch?
  This is important - just leaving the installer shell with 'exit'
  and proceeding with the current install might not be sufficient for this workaround.
  You need to re-IPL the installation after having wiped out the disk/LUN manually.

Please could you confirm this?
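
The wipe order described above (partitions first, then the whole disk) can be sketched as a small helper that only builds the command lines and does not execute anything. The function names are hypothetical, and /dev/sda with two partitions is just the example from this comment; the real device names on a given system may differ.

```python
import shlex


def wipefs_commands(disk, partitions):
    """Build wipefs invocations: every partition first, then the disk itself."""
    targets = list(partitions) + [disk]
    return [["wipefs", "-a", "-f", target] for target in targets]


def render(commands):
    """Turn argument lists into shell-style strings for review before running."""
    return [shlex.join(command) for command in commands]


if __name__ == "__main__":
    for line in render(wipefs_commands("/dev/sda", ["/dev/sda1", "/dev/sda2"])):
        print(line)
```

Printing the commands first makes it easy to double-check the device names before wiping anything; after running them in the installer shell, the installation still has to be re-IPLed as described above.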

Changed in ubuntu-z-systems:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
tags: added: installer s390x
Frank Heimes (fheimes) wrote :

And btw. meanwhile Ubuntu Server 22.04.1 got released.
So you should use the updated ISO from the "point 1" release:
https://cdimage.ubuntu.com/releases/22.04.1/release/ubuntu-22.04.1-live-server-s390x.iso

liwbj@cn.ibm.com (liwbj) wrote :

@Frank, thank you very much. Right, I had not re-IPLed the installation after wiping the disk; I just continued with the current process.

This time the installation succeeded.

liwbj@cn.ibm.com (liwbj) wrote :

Hi Frank,

Thank you for the installation workaround. However, I now have another issue with the IP configuration, and I am not sure whether I need to open a separate bug for it. If I do, please let me know.

After the installation, I found that I cannot ssh into the system, and the OS has no VLAN or IP configured. During the installation, however, the VLAN setup was clearly able to reach the repo server and download packages, which means the IP and VLAN worked in the installer. After the reboot, it looks like the OS lost the IP and VLAN.

So I had to edit the netplan file manually with vi (changing routes to gateway4) and apply it, but that did not work; there is still no IP. (I opened a similar bug, https://bugs.launchpad.net/ubuntu-z-systems/+bug/1996007, but that system does get an IP after netplan apply.)

So there is no way to set an IP for this OS.
Could you take a look at this? Or do I need to open another bug?

These are the OS boot messages. During boot there is also an "A start job is running for Wait" message that stays until it times out (2 min).
I am not sure what it is or whether it is the problem.

Message
[ 0.072523] Linux version 5.19.0-23-generic (buildd@bos02-s390x-002) (s390x-linux-gnu-gcc-12 (Ubuntu 12.2.0-3ubuntu1) 12.2.0, GNU ld (GNU Binutils for Ubuntu) 2.39) #24-Ubuntu SMP Fri Oct 14 15:39:36 UTC 2022 (Ubuntu 5.19.0-23.24-generic 5.19.7)
[ 0.072527] setup: Linux is running natively in 64-bit mode
[ 0.072528] setup: Linux is running with Secure-IPL disabled
[ 0.072528] setup: The IPL report contains the following components:
[ 0.072529] setup: 0000000000002000 - 0000000000006000 (not signed)
[ 0.072530] setup: 000000000000f000 - 0000000000010000 (not signed)
[ 0.072532] setup: 000000000000a000 - 000000000000e000 (not signed)
[ 0.072533] setup: 0000000000009000 - 0000000000009200 (not signed)
[ 0.072533] setup: 0000000000010000 - 0000000000825000 (not signed)
[ 0.072535] setup: 0000000000826000 - 0000000000826200 (not signed)
[ 0.072535] setup: 0000000000840000 - 00000000022b9400 (not signed)
[ 0.072536] setup: 0000000000836000 - 0000000000837000 (not signed)
[ 0.074849] setup: The maximum memory size is 32768MB
[ 0.074850] setup: Relocating AMODE31 section of size 0x00003000
[ 0.075297] cpu: 140 configured CPUs, 0 standby CPUs
[ 0.075480] cpu: The CPU configuration topology of the machine is: 0 0 4 4 2 8 / 4
[ 0.076122] Write protected kernel read-only data: 20336k
[ 0.076266] Zone ranges:
[ 0.076267] DMA [mem 0x0000000000000000-0x000000007fffffff]
[ 0.076270] Normal [mem 0x0000000080000000-0x00000007ffffffff]
[ 0.076271] Movable zone start for each node
[ 0.076272] Early memory node ranges
[ 0.076272] node 0: [mem 0x0000000000000000-0x00000005ffffffff]
[ 0.076276] Initmem setup node 0 [mem 0x0000000000000000-0x00000005ffffffff]
[ 0.282818] percpu: cpu 0 has no node -1 or node-local memory
[ 0.285008] percpu: Embedded 32 pages/cpu s91392 r8192 d31488 u131072
[ 0.285316] Fallback order for Node 0: 0
[ 0.285331] Built 1 zonelists, mobility grouping on. Total pages: 6193152
[ 0.285333] Policy zone: Normal
[ 0.285334] Kernel command line: root=/dev/disk/by-id/dm-uuid-part1-mpath-20017380030bb141b
[ 0.285367] printk: log_buf_len individual max cpu contribution: 4096 b...

Frank Heimes (fheimes) wrote :

Hello Wei WA,
it's a bit confusing to see your last post about the netplan issue:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1996006/comments/9
here in this bug, since this bug is about the partitioning (and the wipe-out as workaround).

Am I right that your post here:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1996006/comments/9
is meant to be a response to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1996007
?

And do I get it right that the netplan issue does not happen on the machine where you had to wipe out the disk (LP#1996006), but only on a different/second machine (LP#1996007)?

If that's the case - what is the difference between these two, that may cause netplan apply to fail on one of them?

liwbj@cn.ibm.com (liwbj) wrote :

Hi Frank,

There are some differences between the two network issues.

1996006:
1) Ubuntu 22.10. First I got the crash and followed your workaround. The installation is now done, but I have a further IP issue.
2) After the reboot, the system does not get an IP. There is no IP address at all in the ip addr list.
3) Even when I edit the file in /etc/netplan/ and run netplan apply again, I still cannot set an IP in this partition. In other words, I cannot set an IP for Ubuntu 22.10.

1996007:
1) Ubuntu 22.04.1
2) After the reboot, I just need to run netplan apply to get an IP, and then I can ssh into it.

Comment #9 is an Ubuntu 22.10 log, so I posted it here.

Frank Heimes (fheimes) wrote :

Hello Wei WA,
okay, so let's stick here in this Launchpad bug to:
"
1996006:
1) Ubuntu 22.10. First I got the crash and followed your workaround. The installation is now done, but I have a further IP issue.
2) After the reboot, the system does not get an IP. There is no IP address at all in the ip addr list.
3) Even when I edit the file in /etc/netplan/ and run netplan apply again, I still cannot set an IP in this partition. In other words, I cannot set an IP for Ubuntu 22.10.
"

After 2), so after the reboot, can you please log in to the console ('Operating System Messages' or 'Integrated ASCII Console', both should work) and:
- check if a netplan yaml file exists at all:
  $ ls -l /etc/netplan/
  -rw-r--r-- 1 root root 638 Nov 17 13:12 00-installer-config.yaml
- and if so, please share its content:
  $ cat /etc/netplan/*.yaml
  <if there's no netplan yaml file,
   it's an issue with the installer not having created it properly>
- then would you please check which (ccw-) devices are available and online with:
  $ lszdev
  Esp. double-check if the device ('1000') of your network interface ('enc1000')
  is listed there as an online device. The output should look like this for qeth:
  TYPE ID ON PERS NAMES
  qeth 0.0.1000:0.0.1001:0.0.1002 yes yes enc1000
  (and will have more entries in your case, for example for all the zFCP devices)
- and what the current state of the active network configuration is:
  $ ip a

(In case your device is not listed as an online device, set it manually online with:
$ sudo chzdev -e 1000
but this should not be needed.
)
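
The lszdev check described above can also be sketched as a small parser. This is only an illustration under the assumption that the output has the whitespace-separated columns shown in the sample (TYPE ID ON PERS NAMES); the function names are hypothetical, and real output may contain additional device types and columns.

```python
def parse_lszdev(text):
    """Parse lszdev-style rows into dictionaries (format assumed from the sample)."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a device row.
        if len(fields) < 4 or fields[0] == "TYPE":
            continue
        rows.append({
            "type": fields[0],
            "ids": fields[1].split(":"),
            "online": fields[2] == "yes",
            "persistent": fields[3] == "yes",
            "names": fields[4:],
        })
    return rows


def device_online(rows, bus_id):
    """Return True if a row contains bus_id and is marked as online."""
    return any(bus_id in row["ids"] and row["online"] for row in rows)


SAMPLE = """\
TYPE ID ON PERS NAMES
qeth 0.0.1000:0.0.1001:0.0.1002 yes yes enc1000
"""

if __name__ == "__main__":
    print(device_online(parse_lszdev(SAMPLE), "0.0.1000"))
```

If the device does not show up as online here, setting it online with chzdev as mentioned above would be the next step.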

Then (in case there is a netplan yaml file in /etc/netplan) try to apply it in debug mode and monitor what's written in the syslog at the same time (well, usually I would just open two remote shells for that, but since this is not possible here, just do it like this:)

$ sudo netplan apply --debug # maybe initially with a '--dry-run', then without
<and please share the output>

<and assuming it takes less than 2 mins>
$ journalctl --since "2 minutes ago"
<and please share the output>

And please also share the kernel messages (content of the kernel ring buffer):
sudo dmesg -H -P
<that might help to identify any issues during boot that prevent the network from being properly setup...>

If netplan was successful, please check and share again:
$ ip a
___

On a high level if I read "can not set IP" I 'assume' there is either no active/enabled network device (lszdev) or the netplan yaml is broken or wrong.

The start job that you see 'waiting' is probably cloud-init, I see/have that on my system as well, and it's not causing an issue for my system.

I just did a manual install on my DPM system with 22.10 and everything was ok (I also saw that start job thing) - afterwards I changed the 'Boot from' to 'Storage SAN', shut down the system, which landed in Paused mode, stopped it, and restarted it, and the network came up properly in my case. And lszdev listed all my devices as online (but I only have qeth devices, since I can only use NVMe disk storage and don't have zFCP in that system).

liwbj@cn.ibm.com (liwbj) wrote :

Hi Frank,

Sorry, I still have the problem; this is my run log.

Frank Heimes (fheimes) wrote :

Hi, please note that the argument "--dry-run" is just an initial test, that for example checks the syntax of the '/etc/netplan/00-installer-config.yaml' config file, but does not do any changes.
If netplan apply using '--dry-run' does not show any errors, you can (and need to) call it again without '--dry-run' - otherwise nothing will change.
I haven't seen that you called it without '--dry-run' in your runlog.
So please run:
$ sudo netplan apply --debug
and then
$ journalctl --since "2 minutes ago"
and
$ ip addr

In addition I've noticed that 'sudo dmesg -H -P' shows a block like this:
"qeth 0.0.1000: A recovery process has been started for the device
...
[ +0.000025] qeth 0.0.1000: Device successfully recovered!"
That is unusual and there should not be a device recovery for your qeth 1000 device.

Regarding the boot log:

This looks fine:
"Begin: Starting firmware auto-configuration ... QETH device 0.0.1000:0.0.1001:0.0.1002 configured"
So the device is automatically enabled (as it's supposed to be).

The system has no hostname?!
"[ 7.998415] systemd[1]: Hostname set to ."
Have you specified a hostname at install time? (or is it because the network is down?)
(But unsure which consequences that may have ...)

What I am really missing are messages like these (here from my system, hence with hostname and with device '1300' and interface 'enc1300'):
"
Nov 18 09:43:18 testlpar3 systemd-networkd[719]: enc1300: Link UP
Nov 18 09:43:18 testlpar3 systemd-networkd[719]: enc1300: Gained carrier
"
and
"
Nov 18 09:43:18 testlpar3 kernel: [ 5.480435] qeth: register layer 2 discipline
Nov 18 09:43:18 testlpar3 kernel: [ 5.482461] qeth 0.0.1300: CHID: 130 CHPID: 8
"
That lets me assume that there is no link / no connection.
(That would also explain the "A start job is running for Wait for" messages and the lines:
"[FAILED] Failed to start Wait for Network to be Configured."
"See 'systemctl status systemd-networkd-wait-online.service' for details." )

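The check for the missing link messages can be sketched as a small scan over saved journal text. This is a rough illustration that assumes the systemd-networkd line format quoted above from the 'testlpar3' system; the function name and the regular expression are hypothetical, and only the two events mentioned in this comment are matched.

```python
import re

# Matches lines such as:
# "Nov 18 09:43:18 testlpar3 systemd-networkd[719]: enc1300: Link UP"
LINK_EVENT = re.compile(
    r"systemd-networkd\[\d+\]: (?P<iface>\S+): (?P<event>Link UP|Gained carrier)"
)


def link_events(journal_text, iface):
    """Return the link events found for one interface, in order of appearance."""
    return [
        match.group("event")
        for match in LINK_EVENT.finditer(journal_text)
        if match.group("iface") == iface
    ]


SAMPLE = """\
Nov 18 09:43:18 testlpar3 systemd-networkd[719]: enc1300: Link UP
Nov 18 09:43:18 testlpar3 systemd-networkd[719]: enc1300: Gained carrier
"""

if __name__ == "__main__":
    print(link_events(SAMPLE, "enc1300"))
```

An empty result for the interface in question (for example 'enc1000') would match the observation that no link came up.
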
From an Ubuntu/Linux point of view things look otherwise okay, except that the gateway and name server (in the netplan yaml) are unusual (but not necessarily wrong).
Well, the gateway can generally be any unique address within the subnet itself, but most network administrators designate the first number of the subnet as the gateway. So just double-check that your gateway is really '10.20.103.254' and not '10.20.103.1' - same for the name server?
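
The point about the gateway can be illustrated with Python's ipaddress module: both candidate addresses are valid hosts of the same subnet, so neither is wrong by itself. The prefix length is not stated in this thread, so the /24 below is purely an assumption for the example, and the function name is hypothetical.

```python
import ipaddress


def gateway_in_subnet(gateway, cidr):
    """Check that gateway is a usable host address inside the given subnet."""
    network = ipaddress.ip_network(cidr, strict=False)
    address = ipaddress.ip_address(gateway)
    reserved = (network.network_address, network.broadcast_address)
    return address in network and address not in reserved


if __name__ == "__main__":
    # 10.20.103.0/24 is an assumed subnet; only the gateway candidates
    # come from the discussion above.
    for candidate in ("10.20.103.254", "10.20.103.1"):
        print(candidate, gateway_in_subnet(candidate, "10.20.103.0/24"))
```

Such a check only confirms that the address fits the subnet; which address actually is the gateway still has to be confirmed with the network administrator.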

Other than that, please also double-check:
- that there is a link and proper network connectivity
- go to the 'Partition Details' at the HMC (DPM) and verify
  a) that the NIC is configured with the correct adapter AND port (0/1)
     and that the port that is configured is the one that is cabled and has the link
  b) that you have your VLAN id '1300' enforced in that adapter
     (compare with:
      https://launchpadlibrarian.net/633928551/Screenshot-20221116133530-640x356.png)

liwbj@cn.ibm.com (liwbj) wrote :

Hi Frank,

Sorry for missing the apply without '--dry-run'.
I have uploaded the run log again.

liwbj@cn.ibm.com (liwbj) wrote :

I think the DPM side is OK; we use the same config for RHEL, SUSE and Ubuntu 20.04, and they work fine.
The IP gateway should also be fine; we have lots of active partitions running with 103 IPs.

Frank Heimes (fheimes) wrote :

Okay, the screenshot of the NIC configuration in DPM looks good (esp. with having the "VLAN Enforcement" set) - and assuming that the port is correct.

Yes, I believe that other Ubuntu versions and even other distributions work fine, but do they also work flawlessly in this exact same LPAR? For example, does an Ubuntu 20.04.5 (https://cdimage.ubuntu.com/releases/focal/release/ubuntu-20.04.5-live-server-s390x.iso) work fine on the exact same LPAR?

I'm still wondering about:
"qeth 0.0.1000: A recovery process has been started for the device"
That indicates some issue with the NIC or qeth device - I think ...

liwbj@cn.ibm.com (liwbj) wrote :

Hi Frank,

Thank you for your comments. There is an Ubuntu 20.04 KVM host; I rebooted it and the IP settings work fine. But this KVM host is using VLAN1292, which is the same as VLAN1300.

There are also some Ubuntu KVM guests on this partition.

Frank Heimes (fheimes) wrote :

Well, your vlan interface on this system looks non-standard: "vlan1292@enc1000"
I assume you configured that yourself after the installation?
It's supposed to look like this: "enc1000@1292" - like on the other installation.
Nevertheless, it might work - and looks like it works in your case.
But that lets me assume that there is an issue in the network environment itself, maybe in the switch configuration and/or routing for vlan 1300.

And I do not see/find any indication of a qeth recovery in the logs, which is good.
So something is definitely different - and actually better.

liwbj@cn.ibm.com (liwbj) wrote :

>I assume you configured that yourself after the installation?
Yes, I configured it manually after a standard installation.

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
assignee: Skipper Bug Screeners (skipper-screen-team) → nobody