Use namespace autocleanup ability from DHCP and L3 agents

Bug #1444978 reported by Eugene Nikanorov
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Mirantis OpenStack | Fix Released | Medium | Sergey Kolekonov |
  6.1.x | Invalid | Critical | Eugene Nikanorov |

Bug Description

Namespace cleanup in the DHCP agent is available but turned off by default because of a bug in older iproute packages ('12 or '13) that prevented namespaces from being deleted properly.

With the current iproute there is no such problem, so this functionality should be turned on.
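Turning the cleanup on comes down to setting one boolean option per agent. The sketch below is a minimal illustration, assuming GNU sed and that the option lives in the [DEFAULT] section. The option names dhcp_delete_namespaces (dhcp_agent.ini) and router_delete_namespaces (l3_agent.ini) are the Juno/Kilo-era Neutron names; the set_default_option helper itself is hypothetical and is not part of Neutron or Fuel.

```shell
# Hypothetical helper (not part of Neutron or Fuel): set "KEY = VALUE" in an
# agent ini file. Assumes the option belongs in [DEFAULT], which holds for
# dhcp_delete_namespaces (dhcp_agent.ini) and router_delete_namespaces
# (l3_agent.ini). Requires GNU sed for in-place editing.
set_default_option() {
  local file=$1 key=$2 value=$3
  if grep -Eq "^[#[:space:]]*${key}[[:space:]]*=" "$file"; then
    # Uncomment and overwrite an existing (possibly commented-out) setting.
    sed -i -E "s|^[#[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
  else
    # Option not present at all: insert it right after the [DEFAULT] header.
    sed -i "/^\[DEFAULT\]/a ${key} = ${value}" "$file"
  fi
}
```

On a controller this would be called as `set_default_option /etc/neutron/dhcp_agent.ini dhcp_delete_namespaces True` (and likewise with `router_delete_namespaces` in `/etc/neutron/l3_agent.ini`), followed by an agent restart so the new value is read.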

Changed in mos:
assignee: nobody → Eugene Nikanorov (enikanorov)
Eugene Nikanorov (enikanorov) wrote :
Changed in mos:
status: New → In Progress
summary: - Use namespace autocleanup ability from DHCP agent
+ Use namespace autocleanup ability from DHCP and L3 agents
Changed in mos:
milestone: none → 6.1
Dmitry Mescheryakov (dmitrymex) wrote :

Discussed the bug with Oleg Bondarev and we agreed that it should be moved to 7.0

Changed in mos:
milestone: 6.1 → 7.0
Changed in mos:
assignee: Eugene Nikanorov (enikanorov) → Sergey Kolekonov (skolekonov)
Changed in mos:
status: In Progress → Fix Committed
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (openstack-ci/fuel-6.1/2014.2)

Change abandoned by Alexander Ignatov <email address hidden> on branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/5825
Reason: Enabled in puppet manifests for neutron

Anna Babich (ababich)
tags: added: on-verification
Anna Babich (ababich) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "187"
  build_id: "2015-08-18_03-05-20"
  nailgun_sha: "4710801a2f4a6d61d652f8f1e64215d9dde37d2e"
  python-fuelclient_sha: "4c74a60aa60c06c136d9197c7d09fa4f8c8e2863"
  fuel-agent_sha: "57145b1d8804389304cd04322ba0fb3dc9d30327"
  fuel-nailgun-agent_sha: "e01693992d7a0304d926b922b43f3b747c35964c"
  astute_sha: "e24ca066bf6160bc1e419aaa5d486cad1aaa937d"
  fuel-library_sha: "0062e69db17f8a63f85996039bdefa87aea498e1"
  fuel-ostf_sha: "17786b86b78e5b66d2b1c15500186648df10c63d"
  fuelmain_sha: "c9dad194e82a60bf33060eae635fff867116a9ce"

Verified on cluster: Neutron with VxLAN+L2pop, 3 controllers, 2 computes

Verification scenario
1. Create router01 and networks net01 (subnet net01__subnet, 192.168.1.0/24) and net02 (subnet net02__subnet, 192.168.2.0/24), and attach both networks to router01.

2. Check that DHCP agents are hosting both created networks:
root@node-1:~# NET_ID1=$(neutron net-list | grep net01 | awk '{print$2}')
root@node-1:~# NET_ID2=$(neutron net-list | grep net02 | awk '{print$2}')
root@node-1:~# neutron dhcp-agent-list-hosting-net net01
+--------------------------------------+-------------------+----------------+-------+
| id | host | admin_state_up | alive |
+--------------------------------------+-------------------+----------------+-------+
| 09543aed-00bf-4c0a-a05b-14e3617885f8 | node-1.domain.tld | True | :-) |
| ef925501-46c8-4dba-b830-d2b3675d3de5 | node-3.domain.tld | True | :-) |
+--------------------------------------+-------------------+----------------+-------+
root@node-1:~# neutron dhcp-agent-list-hosting-net net02
+--------------------------------------+-------------------+----------------+-------+
| id | host | admin_state_up | alive |
+--------------------------------------+-------------------+----------------+-------+
| 09543aed-00bf-4c0a-a05b-14e3617885f8 | node-1.domain.tld | True | :-) |
| ef925501-46c8-4dba-b830-d2b3675d3de5 | node-3.domain.tld | True | :-) |
+--------------------------------------+-------------------+----------------+-------+
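The check in step 2 can also be scripted instead of read off the table. A hedged sketch: alive_dhcp_hosts is a hypothetical function that reads the table printed by `neutron dhcp-agent-list-hosting-net` on stdin and prints the host of every agent whose alive column is ":-)". It assumes the id | host | admin_state_up | alive column layout shown above.

```shell
# Hypothetical helper: print hosts of alive agents from the hosting table.
# Splitting on "|" turns a data row "| id | host | admin | alive |" into 6
# fields with empty $1 and $6. Border lines (+---+) contain no "|" and are
# skipped, as is the header row whose first cell is "id".
alive_dhcp_hosts() {
  awk -F'|' '
    NF >= 6 && $2 !~ /^[ \t]*id[ \t]*$/ {
      host = $3; alive = $5
      gsub(/^[ \t]+|[ \t]+$/, "", host)
      gsub(/^[ \t]+|[ \t]+$/, "", alive)
      if (alive == ":-)") print host
    }'
}
```

Usage would look like `neutron dhcp-agent-list-hosting-net net01 | alive_dhcp_hosts`, which for the output above should list node-1.domain.tld and node-3.domain.tld.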

3. Check that namespaces for both networks appear on both controllers hosting them:
root@node-1:~# ip netns list | grep qdhcp-$NET_ID1
qdhcp-59373701-3ff7-450e-8b29-8cca8b1a2482
root@node-1:~# ip netns list | grep qdhcp-$NET_ID2
qdhcp-46edbf75-14ef-4e84-b351-250e33c7490d

root@node-3:~# ip netns list | grep qdhcp-$NET_ID1
qdhcp-59373701-3ff7-450e-8b29-8cca8b1a2482
root@node-3:~# ip netns list | grep qdhcp-$NET_ID2
qdhcp-46edbf75-14ef-4e84-b351-250e33c7490d
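Step 3 can be turned into a single pass/fail check per controller. A minimal sketch, assuming only that `ip netns list` output is piped in: missing_qdhcp_ns is a hypothetical helper that reports each given network ID lacking a qdhcp-<net-id> namespace and returns non-zero if any is missing.

```shell
# Hypothetical helper: read "ip netns list" on stdin and report which of the
# given network IDs have no qdhcp-<net-id> namespace on this host.
# Returns 0 only if every network has its namespace.
missing_qdhcp_ns() {
  local ns_list rc=0 net
  ns_list=$(cat)
  for net in "$@"; do
    # Newer iproute2 appends " (id: N)" to names, so compare the first word only.
    if ! printf '%s\n' "$ns_list" | awk '{print $1}' | grep -qx "qdhcp-${net}"; then
      echo "missing: qdhcp-${net}"
      rc=1
    fi
  done
  return "$rc"
}
```

On each of node-1 and node-3 this would be run as `ip netns list | missing_qdhcp_ns "$NET_ID1" "$NET_ID2"`.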

4. Check which controller router01 is currently hosted on:
root@node-1:~# router_id=$(neutron router-show router01 | grep ' id ' | awk '{print $4}')
root@node-1:~# neutron l3-agent-list-hosting-router $router_id
+--------------------------------------+-------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+-...


Changed in mos:
status: Fix Committed → Fix Released
tags: removed: on-verification
Polina Petriuk (ppetriuk) wrote :

We ran into a situation where VMs periodically lost connectivity because the old namespace still existed with a duplicate IP address.

tags: added: customer-found support
Alexey Stupnikov (astupnikov) wrote :

Polina Petriuk (ppetriuk): please describe the customer's environment and provide steps to reproduce this issue.

Eugene Nikanorov (enikanorov) wrote :

In fact, you can reproduce the issue on any environment (6.1+):
Set dhcp_delete_namespaces = False in /etc/neutron/dhcp_agent.ini on the controllers and restart the DHCP agents via pcs.

Shut down one DHCP agent that has some networks scheduled on it. Wait for the neutron server to reschedule the networks to a different agent (this usually takes less than 1 minute).
On the host that ran the first DHCP agent, note that the namespaces remain, with active ports in them.

Change the setting in the conf file to True and restart the DHCP agent. It will remove the namespaces of the networks that are not assigned to it.
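To spot the leftover namespaces from the steps above without restarting the agent, one could compare `ip netns list` with the networks the local agent still hosts. A minimal sketch: stale_qdhcp_ns is a hypothetical helper that only reports candidates and deletes nothing; the list of hosted network IDs is assumed to be taken from `neutron net-list-on-dhcp-agent <agent-id>` for this host's agent.

```shell
# Hypothetical helper: read "ip netns list" on stdin and print every
# qdhcp-<net-id> namespace whose network is NOT among the given IDs
# (the networks still scheduled to this host's DHCP agent).
# Reporting only: nothing is deleted.
stale_qdhcp_ns() {
  local ns net keep
  # Read the first word of each line (newer iproute2 appends " (id: N)").
  while read -r ns _; do
    case $ns in qdhcp-*) ;; *) continue ;; esac
    keep=0
    for net in "$@"; do
      if [ "$ns" = "qdhcp-${net}" ]; then keep=1; fi
    done
    if [ "$keep" -eq 0 ]; then echo "$ns"; fi
  done
}
```

For example, `ip netns list | stale_qdhcp_ns <net-id> ...` on the host of the stopped agent should print the namespaces that remained after rescheduling, which is exactly what dhcp_delete_namespaces = True is meant to prevent.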

Alexey Stupnikov (astupnikov) wrote :

Eugene, thanks a lot for your answer. My teammates told me that the MOS redundancy mechanism (pacemaker+corosync) cleans up deleted network environments, but it has its drawbacks. So I needed to confirm this.

Alexey Stupnikov (astupnikov) wrote :

At our team meeting we discussed different options to solve this bug. There are actually two ways to solve it: 1) use the set_override method to override the configured values of the dhcp_delete_namespaces and router_delete_namespaces options, or 2) change the docs, so that affected environments are fixed by their respective owners.
We have decided that we shouldn't use self-written patches to solve this issue, since only a few customers are affected. That is why we should take the second path and describe this issue and a possible workaround in the MU5 Release Notes.

Eugene Nikanorov (enikanorov) wrote :

To clarify...
pacemaker+corosync is not a redundancy mechanism; it doesn't clean up anything when a resource is rescheduled, it only does so on resource restart (which is indeed a plain wrong thing to do).
It was a bad design from the beginning (but a necessary evil because of some issues in Ubuntu).

This issue should have been fixed in the 6.1 timeframe, when it was discovered, but unfortunately we decided that it wasn't important enough and moved it to 7.0.

> We have decided that we shouldn't self-written patches to solve this issue since only few customers are affected.

In fact, every 6.1 customer is affected. The fact that we are not getting many escalations on this issue just tells us that there are still very few 6.1 production envs.
Way (1) is perfectly fine under these conditions: it will help anyone who is going to deploy 6.1 avoid the issue.
Documentation, especially for an MU... nobody will read it.

Alexey Stupnikov (astupnikov) wrote :

Eugene, thanks for the clarification. I will raise this question again and will advocate for a patch.

Alexey Stupnikov (astupnikov) wrote :

Eugene, I couldn't reproduce this bug for the L3 or DHCP agents. I tried rescheduling them using the neutron CLI, but didn't find any issues: none of the namespaces that weren't deleted had any interfaces other than the loopback interface. Our cluster software handled agent restarts without any issues: old namespaces were deleted, and new ones were scheduled correctly.

At this point I can't say that this issue has a massive impact. Please provide me with steps to reproduce for MOS 6.1.

Alexey Stupnikov (astupnikov) wrote :

I have set the bug's status to Incomplete, since I couldn't reproduce it and need additional information from the bug's owner.

Alexey Stupnikov (astupnikov) wrote :

Eugene, please provide steps-to-reproduce for MOS 6.1.

Roman Podoliaka (rpodolyaka) wrote :

Returning to Confirmed; otherwise we would have had to close it as Invalid after a month without feedback...

Vitaly Sedelnik (vsedelnik) wrote :

Retargeted to 6.1-updates as there are no clear steps to reproduce the issue

Vitaly Sedelnik (vsedelnik) wrote :

Closed as invalid per comments in the abandoned patch https://review.fuel-infra.org/5825
