Networking breaks after a while in KVM guests that use virtio networking. We run data-intensive jobs on our virtual cluster (OpenStack Grizzly installed on Ubuntu 12.04 Server). A job runs fine on a single worker VM (no data transfer involved), but as soon as I add more nodes and the workers need to exchange data, one of the worker VMs drops off the network. Pinging it returns 'host unreachable'. Logging in via the serial console shows nothing obviously wrong: eth0 is up and the guest can ping itself, but it has no outside connectivity. Restarting the network (/etc/init.d/networking restart) does nothing; only a reboot brings the machine back.
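For reference, this is roughly the triage we run from the serial console when a worker drops off the network. It is only a sketch of the checks described above: the interface name eth0 matches our guests, but the gateway address 10.20.20.1 is a placeholder for whatever the tenant network actually uses.

  # run from the serial console of the affected guest
  ip link show eth0                  # link is reported UP, no obvious errors
  ping -c 3 127.0.0.1                # loopback still answers
  ping -c 3 10.20.20.1               # placeholder gateway: no reply once the bug hits
  /etc/init.d/networking restart     # restarting networking inside the guest does not help
  reboot                             # only a full reboot restores connectivity

The driver log below is from a two-worker Spark-on-YARN run: executor 1 stops heart-beating about a minute after the transfer starts, and executor 2 is eventually dropped as well.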
14/06/01 18:30:06 INFO YarnClientClusterScheduler: YarnClientClusterScheduler.postStartHook done
14/06/01 18:30:06 INFO MemoryStore: ensureFreeSpace(190758) called with curMem=0, maxMem=308713881
14/06/01 18:30:06 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 186.3 KB, free 294.2 MB)
14/06/01 18:30:06 INFO FileInputFormat: Total input paths to process : 1
14/06/01 18:30:06 INFO NetworkTopology: Adding a new node: /default-rack/10.20.20.28:50010
14/06/01 18:30:06 INFO NetworkTopology: Adding a new node: /default-rack/10.20.20.23:50010
14/06/01 18:30:06 INFO SparkContext: Starting job: count at hello_spark.py:15
14/06/01 18:30:06 INFO DAGScheduler: Got job 0 (count at hello_spark.py:15) with 2 output partitions (allowLocal=false)
14/06/01 18:30:06 INFO DAGScheduler: Final stage: Stage 0 (count at hello_spark.py:15)
14/06/01 18:30:06 INFO DAGScheduler: Parents of final stage: List()
14/06/01 18:30:06 INFO DAGScheduler: Missing parents: List()
14/06/01 18:30:06 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at count at hello_spark.py:15), which has no missing parents
14/06/01 18:30:07 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[2] at count at hello_spark.py:15)
14/06/01 18:30:07 INFO YarnClientClusterScheduler: Adding task set 0.0 with 2 tasks
14/06/01 18:30:08 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://<email address hidden>:44417/user/Executor#-1352071582] with ID 1
14/06/01 18:30:08 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 1: host-10-20-20-28.novalocal (PROCESS_LOCAL)
14/06/01 18:30:08 INFO TaskSetManager: Serialized task 0.0:0 as 3123 bytes in 14 ms
14/06/01 18:30:09 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager host-10-20-20-28.novalocal:42960 with 588.8 MB RAM
14/06/01 18:30:16 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_1_0 in memory on host-10-20-20-28.novalocal:42960 (size: 308.2 MB, free: 280.7 MB)
14/06/01 18:30:17 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://<email address hidden>:58126/user/Executor#1079893974] with ID 2
14/06/01 18:30:17 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 2: host-10-20-20-23.novalocal (PROCESS_LOCAL)
14/06/01 18:30:17 INFO TaskSetManager: Serialized task 0.0:1 as 3123 bytes in 1 ms
14/06/01 18:30:17 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager host-10-20-20-23.novalocal:56776 with 588.8 MB RAM
14/06/01 18:31:20 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, host-10-20-20-28.novalocal, 42960, 0) with no recent heart beats: 55828ms exceeds 45000ms
14/06/01 18:42:23 INFO YarnClientSchedulerBackend: Executor 2 disconnected, so removing it
14/06/01 18:42:23 ERROR YarnClientClusterScheduler: Lost executor 2 on host-10-20-20-23.novalocal: remote Akka client disassociated
The same job finishes flawlessly on a single worker.
System Information:
==================
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Linux 3.8.0-35-generic #52~precise1-Ubuntu SMP Thu Jan 30 17:24:40 UTC 2014 x86_64
libvirt-bin:
--------------
  Installed: 1.1.1-0ubuntu8~cloud2
  Candidate: 1.1.1-0ubuntu8.7~cloud1
  Version table:
     1.1.1-0ubuntu8.7~cloud1 0
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu/ precise-updates/havana/main amd64 Packages
 *** 1.1.1-0ubuntu8~cloud2 0
        100 /var/lib/dpkg/status
     0.9.8-2ubuntu17.19 0
        500 http://se.archive.ubuntu.com/ubuntu/ precise-updates/main amd64 Packages
     0.9.8-2ubuntu17.17 0
        500 http://security.ubuntu.com/ubuntu/ precise-security/main amd64 Packages
     0.9.8-2ubuntu17 0
        500 http://se.archive.ubuntu.com/ubuntu/ precise/main amd64 Packages

qemu-kvm:
---------------
  Installed: 1.5.0+dfsg-3ubuntu5~cloud0
  Candidate: 1.5.0+dfsg-3ubuntu5.4~cloud0
  Version table:
     1.5.0+dfsg-3ubuntu5.4~cloud0 0
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu/ precise-updates/havana/main amd64 Packages
 *** 1.5.0+dfsg-3ubuntu5~cloud0 0
        100 /var/lib/dpkg/status
     1.0+noroms-0ubuntu14.15 0
        500 http://se.archive.ubuntu.com/ubuntu/ precise-updates/main amd64 Packages
     1.0+noroms-0ubuntu14.14 0
        500 http://security.ubuntu.com/ubuntu/ precise-security/main amd64 Packages
     1.0+noroms-0ubuntu13 0
        500 http://se.archive.ubuntu.com/ubuntu/ precise/main amd64 Packages
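For completeness, the OS details and the package versions above come from standard tools on the compute host; a sketch of how to regenerate them (package names as on our Ubuntu 12.04 install with the Ubuntu Cloud Archive enabled):

  lsb_release -d -r                        # distribution description and release
  uname -a                                 # running kernel
  apt-cache policy libvirt-bin qemu-kvm    # installed vs. candidate package versions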
XML DUMP for a VM
-----------------
<domain type='kvm' id='7'>
  <name>instance-000001b6</name>
  <uuid>731c2191-fa82-4a38-9f52-e48fb37e92c8</uuid>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>OpenStack Foundation</entry>
      <entry name='product'>OpenStack Nova</entry>
      <entry name='version'>2013.2.3</entry>
      <entry name='serial'>01d3d524-32eb-e011-8574-441ea15e3971</entry>
      <entry name='uuid'>731c2191-fa82-4a38-9f52-e48fb37e92c8</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-i440fx-1.5'>hvm</type>
    <boot dev='hd'/>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-model'>
    <model fallback='allow'/>
  </cpu>
  <clock offset='utc'>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='rtc' tickpolicy='catchup'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/kvm-spice</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/var/lib/nova/instances/731c2191-fa82-4a38-9f52-e48fb37e92c8/disk'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <alias name='usb0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci0'/>
    </controller>
    <interface type='bridge'>
      <mac address='fa:16:3e:a7:de:97'/>
      <source bridge='qbr43f8d3a5-e4'/>
      <target dev='tap43f8d3a5-e4'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='file'>
      <source path='/var/lib/nova/instances/731c2191-fa82-4a38-9f52-e48fb37e92c8/console.log'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <serial type='pty'>
      <source path='/dev/pts/6'/>
      <target port='1'/>
      <alias name='serial1'/>
    </serial>
    <console type='file'>
      <source path='/var/lib/nova/instances/731c2191-fa82-4a38-9f52-e48fb37e92c8/console.log'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='5904' autoport='yes' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-731c2191-fa82-4a38-9f52-e48fb37e92c8</label>
    <imagelabel>libvirt-731c2191-fa82-4a38-9f52-e48fb37e92c8</imagelabel>
  </seclabel>
</domain>
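The dump above was taken on the compute node hosting the instance. Roughly the same output can be regenerated with virsh (domain name as reported by Nova), which is also a quick way to confirm that the guest NIC is the virtio model:

  virsh list --all                       # find the libvirt domain name, e.g. instance-000001b6
  virsh dumpxml instance-000001b6        # full domain XML, as pasted above
  virsh dumpxml instance-000001b6 | grep -A 6 '<interface'    # just the NIC: <model type='virtio'/>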
I am reporting this with Spark as the workload, but it should apply to any application that involves fast data transfer between VMs. The bug has been reported on the CentOS forums as well:
http://bugs.centos.org/view.php?id=5526
and an older bug report on Launchpad:
https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/997978?comments=all
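Since the trigger appears to be sustained VM-to-VM traffic rather than anything Spark-specific, a Spark-free reproduction sketch is to push bulk traffic between two guests on different compute nodes. This is only a suggestion using iperf, which is not part of our original setup, and 10.20.20.28 below is a placeholder for the second worker's tenant-network address:

  # on worker VM A: start an iperf server
  iperf -s
  # on worker VM B: push sustained traffic to A for ten minutes
  iperf -c 10.20.20.28 -t 600
  # if this is the same virtio issue, one of the guests should drop off the
  # network partway through, just as the Spark worker does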