maas commisioning fails
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
MAAS |
Won't Fix
|
Undecided
|
Unassigned |
Bug Description
Steps to reproduce...
1 - setup a maas server and a empty node to pxeboot.
2 - start the commission process
3 - the empty node pxeboots and os installs
4 - once os installs ssh from maas controller to node works fine
5 - the node has internet access through maas as gateway and using maas dns services
6 - the node shows "commission failed" status
how can we troubleshoot this?
we tried with both maas version 1.8 and 1.9.
the environment is live on the internet if you can help us login and troubleshoot
I am working with Narinder Gupta from Canonical on this issue - he asked me to submit this issue.
ubuntu@maas:~$ dpkg -l '*maas*'|cat
Desired=
| Status=
|/ Err?=(none)
||/ Name Version Architecture Description
+++-===
ii maas 1.9.0+bzr4533-
ii maas-cli 1.9.0+bzr4533-
ii maas-cluster-
ii maas-common 1.9.0+bzr4533-
ii maas-dhcp 1.9.0+bzr4533-
ii maas-dns 1.9.0+bzr4533-
ii maas-proxy 1.9.0+bzr4533-
ii maas-region-
ii maas-region-
ii python-django-maas 1.9.0+bzr4533-
ii python-maas-client 1.9.0+bzr4533-
ii python-
ubuntu@maas:~$ zip -r ~/maas.zip /var/log/maas/*
adding: var/log/
adding: var/log/
adding: var/log/
adding: var/log/
adding: var/log/
adding: var/log/
adding: var/log/
adding: var/log/
adding: var/log/
adding: var/log/maas/proxy/ (stored 0%)
adding: var/log/
adding: var/log/
ubuntu@maas:~$ ls -al /var/log/maas
total 9708
drwxr-xr-x 4 root root 4096 Jan 14 06:47 .
drwxrwxr-x 15 root syslog 4096 Jan 14 06:47 ..
lrwxrwxrwx 1 root root 16 Jan 6 17:38 apache2 -> /var/log/apache2
-rw-r--r-- 1 maas maas 8515 Jan 14 06:14 clusterd.log
-rw-r--r-- 1 syslog syslog 0 Jan 14 06:47 maas.log
-rw-r--r-- 1 syslog syslog 31775 Jan 14 06:44 maas.log.1
-rw-r--r-- 1 syslog syslog 1708 Jan 13 06:40 maas.log.2.gz
-rw-r--r-- 1 syslog syslog 1345 Jan 10 06:45 maas.log.3.gz
-rw-r--r-- 1 syslog syslog 1729 Jan 9 06:45 maas.log.4.gz
-rw-r--r-- 1 syslog syslog 1792 Jan 8 06:45 maas.log.5.gz
-rw-r--r-- 1 syslog syslog 6388 Jan 7 06:19 maas.log.6.gz
drwxr-x--- 2 proxy proxy 4096 Jan 14 06:47 proxy
-rw-r--r-- 1 maas maas 9848693 Jan 14 06:47 regiond.log
drwxr-xr-x 2 syslog syslog 4096 Jan 5 17:45 rsyslog
ubuntu@maas:~$ ubuntu@maas:~$
Changed in maas: | |
milestone: | 1.9.1 → 1.9.2 |
Changed in maas: | |
milestone: | 1.9.2 → 1.9.3 |
Changed in maas: | |
milestone: | 1.9.3 → 1.9.4 |
Changed in maas: | |
milestone: | 1.9.4 → 1.9.5 |
Hi Iben,
In the regiond.log I see:
Jan 14 06:13:29 maas maas.node: [INFO] bootstrap: Status transition from FAILED_ COMMISSIONING to COMMISSIONING b490-11e5- a57b-2cc2601000 09) COMMISSIONING
Jan 14 06:13:29 maas maas.power: [INFO] Changing power state (on) of node: bootstrap (node-689fa9b8-
Jan 14 06:13:29 maas maas.node: [INFO] bootstrap: Commissioning started
[...]
Jan 14 06:33:29 maas maas.node: [INFO] bootstrap: Status transition from COMMISSIONING to FAILED_
Jan 14 06:33:29 maas maas.node: [ERROR] bootstrap: Marking node failed: Node operation 'Commissioning' timed out after 0:20:00.
In the clusterd.log I see:
2016-01-14 06:13:41+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 19705): <RRQDatagram( filename= pxelinux. 0, mode=octet, options={'blksize': '1432', 'tsize': '0'})> client] RemoteOriginRea dSession starting on 39738 client] Starting protocol <tftp.bootstrap .RemoteOriginRe adSession instance at 0x7f7a3874fb90> adSession (UDP)] Final ACK received, transfer successful .RemoteOriginRe adSession instance at 0x7f7a3874fb90> filename= pxelinux. cfg/00000000- 0000-0000- 0000-0000000000 00, mode=octet, options={'tsize': '0', 'blksize': '1408'})> filename= pxelinux. cfg/01- 2c-c2-60- 20-00-13, mode=octet, options={'tsize': '0', 'blksize': '1408'})> filename= pxelinux. cfg/01- 2c-c2-60- 20-00-13, mode=octet, options={'tsize': '0', 'blksize': '1408'})> filename= pxelinux. cfg/01- 2c-c2-60- 20-00-13, mode=octet, options={'tsize': '0', 'blksize': '1408'})> ,client] RemoteOriginRea dSession starting on 42410 adSession (UDP)] Final ACK received, transfer successful .RemoteOriginRe adSession instance at 0x7f7a397ba878> filename= ubuntu/ amd64/generic/ trusty/ daily/boot- kernel, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2016-01-14 06:13:41+0000 [ClusterClient,
2016-01-14 06:13:41+0000 [ClusterClient,
2016-01-14 06:13:41+0000 [RemoteOriginRe
2016-01-14 06:13:41+0000 [-] (UDP Port 39738 Closed)
2016-01-14 06:13:41+0000 [-] Stopping protocol <tftp.bootstrap
2016-01-14 06:13:42+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 49152): <RRQDatagram(
2016-01-14 06:13:42+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 49153): <RRQDatagram(
2016-01-14 06:13:42+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 49153): <RRQDatagram(
2016-01-14 06:13:42+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 49153): <RRQDatagram(
2016-01-14 06:13:42+0000 [HTTPPageGetter
[...]
2016-01-14 06:13:42+0000 [RemoteOriginRe
2016-01-14 06:13:42+0000 [-] (UDP Port 42410 Closed)
2016-01-14 06:13:42+0000 [-] Stopping protocol <tftp.bootstrap
2016-01-14 06:13:42+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 49154): <RRQDatagram(
Which leads me to believe the machine, after set to COMMISSIONING, it tried to PXE boot, however, the commissioning process timeout out after 20 mins.
Things to consider:
Are you working behind a proxy or behind a firewall that might be preventing the macjhine from accessing the internet?
What's needed from you:
- Console log, to see whether it is booting fine or where it is failing in the boot process/ commissioning process.
- Full node event log.