domain name completion broken when dnsmasq is used
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
dnsmasq (Ubuntu) | Confirmed | Undecided | Unassigned | |
Bug Description
Dnsmasq sometimes does not resolve DNS names correctly.
Sometimes it seems that if name resolution has never worked, dnsmasq never learns the DNS names.
Setup:
private network: 192.168.0.x/24
domain mydomain.intern
server: 192.168.0.1 hostname s1
dhcp (.100 - .200) and bind running, postfix and dovecot running
client: 192.168.0.100 (dhclient)
/etc/resolv.conf
...
nameserver 127.0.0.1
search mydomain.intern
/var/run/
server=192.168.0.1
Open Thunderbird -> Thunderbird fails to open s1
ssh admin@s1 -> ssh: Could not resolve hostname s1: Name or service not known
Adding
nameserver 192.168.0.1
to /etc/resolv.conf
resolves the issue immediately
calling sudo resolvconf -u
creates the lookup problem immediately again
This is a critical error
ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: dnsmasq-base 2.59-4
ProcVersionSign
Uname: Linux 3.2.0-24-generic x86_64
NonfreeKernelMo
ApportVersion: 2.0.1-0ubuntu7
Architecture: amd64
Date: Sun May 13 11:43:02 2012
InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Release amd64+mac (20111012)
SourcePackage: dnsmasq
UpgradeStatus: Upgraded to precise on 2012-04-29 (13 days ago)
Wolf Rogner (war-rsb) wrote : | #1 |
- Dependencies.txt (553 bytes, text/plain; charset="utf-8")
- ProcEnviron.txt (279 bytes, text/plain; charset="utf-8")
Simon Kelley (simon-thekelleys) wrote : Re: [Bug 998712] [NEW] dnsmasq integration into name resolution broken | #2 |
Changed in dnsmasq (Ubuntu): | |
status: | New → Incomplete |
Wolf Rogner (war-rsb) wrote : Re: dnsmasq integration into name resolution broken | #3 |
I recreated the situation by restarting the network manager.
resolv.conf contains link to 127.0.0.1
/run/nm-
However, even dig does not resolve correctly. Here are the results (my network is actually 10.x.x.x):
wolf@mbp:~$ ping s4
ping: unknown host s4
wolf@mbp:~$ dig s4
; <<>> DiG 9.8.1-P1 <<>> s4
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 27930
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;s4. IN A
;; Query time: 3 msec
;; SERVER: 127.0.0.
;; WHEN: Thu May 17 11:07:39 2012
;; MSG SIZE rcvd: 20
wolf@mbp:~$ dig @10.1.0.4 s4
; <<>> DiG 9.8.1-P1 <<>> @10.1.0.4 s4
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 34081
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;s4. IN A
;; AUTHORITY SECTION:
. 10800 IN SOA a.root-servers.net. nstld.verisign-
;; Query time: 21 msec
;; SERVER: 10.1.0.
;; WHEN: Thu May 17 11:07:50 2012
;; MSG SIZE rcvd: 95
wolf@mbp:~$ dig @10.1.0.4 s4.rsb.intern
; <<>> DiG 9.8.1-P1 <<>> @10.1.0.4 s4.rsb.intern
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35717
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;s4.rsb.intern. IN A
;; ANSWER SECTION:
s4.rsb.intern. 34000 IN A 10.1.0.4
;; AUTHORITY SECTION:
rsb.intern. 34000 IN NS s4.rsb.intern.
;; Query time: 3 msec
;; SERVER: 10.1.0.
;; WHEN: Thu May 17 11:08:03 2012
;; MSG SIZE rcvd: 61
wolf@mbp:~$ less /run/nm-
wolf@mbp:~$ dig s4
; <<>> DiG 9.8.1-P1 <<>> s4
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 18553
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;s4. IN A
;; AUTHORITY SECTION:
. 10725 IN SOA a.root-servers.net. nstld.verisign-
;; Query time: 14 msec
;; SERVER: 127.0.0.
;; WHEN: Thu May 17 11:09:05 2012
;; MSG SIZE rcvd: 95
wolf@mbp:~$ ping s4
PING s4.rsb.intern (10.1.0.4) 56(84) bytes of data.
^X^C64 bytes from 10.1.0.4: icmp_req=1 ttl=64 time=0.792 ms
--- s4.rsb.intern ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.792/0.
wolf@mbp:~$
I have not quite figured out what exactly happens. It takes about 1 to 30 minutes to resolve the issue. On some machines it never settles itself.
Certainly, if I manually adjust /etc/resolv.conf everything works fine immediately (name resolution, access to services). If I keep the files the way they are, it is pure coincidence whether DNS works (I had the chance to use a wired LAN recently and it seems to be the same issue there).
My guess would be that network manager and dnsmasq do not work together in all cases (in fact, they cooperate in just one case, after a reboot). As I never reboot machines (if I don't have t...
Simon Kelley (simon-thekelleys) wrote : Re: [Bug 998712] Re: dnsmasq integration into name resolution broken | #4 |
On 17/05/12 10:19, Wolf Rogner wrote:
> I recreated the situation by restarting the network manager.
>
> resolv.conf contains link to 127.0.0.1
> /run/nm-
>
> However, even dig does not resolv correctly. Here are the results (my
> network is 10.x.x.x actually)
>
> wolf@mbp:~$ ping s4
> ping: unknown host s4
> wolf@mbp:~$ dig s4
>
> ; <<>> DiG 9.8.1-P1 <<>> s4
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 27930
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;s4. IN A
>
> ;; Query time: 3 msec
> ;; SERVER: 127.0.0.
> ;; WHEN: Thu May 17 11:07:39 2012
> ;; MSG SIZE rcvd: 20
>
> wolf@mbp:~$ dig @10.1.0.4 s4
>
> ; <<>> DiG 9.8.1-P1 <<>> @10.1.0.4 s4
> ; (1 server found)
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 34081
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;s4. IN A
>
> ;; AUTHORITY SECTION:
> . 10800 IN SOA a.root-servers.net. nstld.verisign-
>
> ;; Query time: 21 msec
> ;; SERVER: 10.1.0.
> ;; WHEN: Thu May 17 11:07:50 2012
> ;; MSG SIZE rcvd: 95
>
> wolf@mbp:~$ dig @10.1.0.4 s4.rsb.intern
>
> ; <<>> DiG 9.8.1-P1 <<>> @10.1.0.4 s4.rsb.intern
> ; (1 server found)
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35717
> ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;s4.rsb.intern. IN A
>
> ;; ANSWER SECTION:
> s4.rsb.intern. 34000 IN A 10.1.0.4
>
> ;; AUTHORITY SECTION:
> rsb.intern. 34000 IN NS s4.rsb.intern.
>
> ;; Query time: 3 msec
> ;; SERVER: 10.1.0.
> ;; WHEN: Thu May 17 11:08:03 2012
> ;; MSG SIZE rcvd: 61
>
> wolf@mbp:~$ less /run/nm-
> wolf@mbp:~$ dig s4
>
> ; <<>> DiG 9.8.1-P1 <<>> s4
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 18553
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;s4. IN A
>
> ;; AUTHORITY SECTION:
> . 10725 IN SOA a.root-servers.net. nstld.verisign-
>
> ;; Query time: 14 msec
> ;; SERVER: 127.0.0.
> ;; WHEN: Thu May 17 11:09:05 2012
> ;; MSG SIZE rcvd: 95
>
> wolf@mbp:~$ ping s4
> PING s4.rsb.intern (10.1.0.4) 56(84) bytes of data.
> ^X^C64 bytes from 10.1.0.4: icmp_req=1 ttl=64 time=0.792 ms
>
> --- s4.rsb.intern ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.792/0.
> wolf@mbp:~$
>
The difference between the working and non-working examples is that the
non-working ones are looking up
s4.
and the working ones are looking up
s4.rsb.intern.
Getting from "ssh s4" to a DNS lookup of the A record s4.rsb.intern is
the responsibility of the C library resolver, which is configured by
/etc/resolv.conf. There are a few parameters in there that can affect
things, look for domain, se...
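The name completion described here can be illustrated with a minimal Python sketch. It assumes the behaviour documented in resolv.conf(5): a name with fewer dots than `ndots` (default 1) is tried with each search domain first and as-is last, a name with at least `ndots` dots is tried as-is first, and a name with a trailing dot is never completed. The function names `parse_resolv_conf` and `candidate_names` are hypothetical and not part of glibc or dnsmasq; the sketch ignores other options such as `sortlist`.

```python
def parse_resolv_conf(text):
    """Return (nameservers, search_domains, ndots) from resolv.conf-style text."""
    nameservers, search, ndots = [], [], 1
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        keyword, args = fields[0], fields[1:]
        if keyword == "nameserver" and args:
            nameservers.append(args[0])
        elif keyword in ("search", "domain"):
            search = args  # the last search/domain line wins
        elif keyword == "options":
            for opt in args:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return nameservers, search, ndots


def candidate_names(name, search, ndots=1):
    """List the fully qualified names to try, in order, for `name`."""
    if name.endswith("."):
        return [name]  # already absolute: no completion
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = name + "."
    if name.count(".") >= ndots:
        return [as_is] + expanded
    return expanded + [as_is]
```

With the resolv.conf from this report (`search rsb.intern rsb.at`), `candidate_names("s4", ["rsb.intern", "rsb.at"])` yields `s4.rsb.intern.`, `s4.rsb.at.` and then `s4.`. Plain `dig s4` does not perform this expansion unless `+search` is given, so it asks for `s4.` directly.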
Thomas Hood (jdthood) wrote : Re: dnsmasq integration into name resolution broken | #5 |
Wolf: Please post the FULL contents of your /etc/resolv.conf file as it is when the reported problem occurs.
Thomas Hood (jdthood) wrote : | #6 |
Wolf: In #3 you post some output but I don't know how to interpret it. You start with a "ping s4" which yields "unknown host". You end with "ping s4" which successfully pings. What happened in the meantime to change the results? Did you edit something?
Wolf Rogner (war-rsb) wrote : | #7 |
The /etc/resolv.conf held just a reference to 127.0.0.1
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 127.0.0.1
search rsb.intern rsb.at
I copied a working version over it:
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.1.0.4
nameserver 10.1.0.254
nameserver 195.202.128.3
search rsb.intern rsb.at
That way, name resolution worked immediately.
This brings back the old and stable name resolution. As far as I can observe, there are no side effects like those you discussed in #1003842.
So far I have disabled dnsmasq in /etc/NetworkMan
Currently my view is that dnsmasq has some serious issues:
- name resolution in mixed DNS setups
- handling of refreshes of network manager (happens periodically with calls to resolvconf, aperiodically when new SSIDs emerge, network media changes or the machine is put into sleep mode)
- VPN management (which requires several different domains to be intersected)
- caching of resolved names
- handling of /etc/hosts
Hope this helps.
Thomas Hood (jdthood) wrote : | #8 |
Wolf: You list problems with dnsmasq. In this report (#998712) let's continue to discuss the name resolution failure that you originally reported. One of the other problems you listed is being discussed in #1003842. For the remaining problems you listed, please submit your information to bug reports (possibly newly opened by you) focusing on those problems.
Let's talk about the situation where name resolution fails on your system. When you have the following...
/etc/resolv.conf:
nameserver 127.0.0.1
search rsb.intern rsb.at
/var/run/
server=192.168.0.1
... what is the output of "ps -elf|grep dnsmasq"?
Wolf Rogner (war-rsb) wrote : | #9 |
My /var/run/
server=10.1.0.4
server=10.1.0.254
server=
wolf@mbp:~$ ps -elf | grep dnsmasq
4 S nobody 25661 25624 0 80 0 - 7579 poll_s 15:20 ? 00:00:00 /usr/sbin/dnsmasq --no-resolv --keep-
0 S wolf 25774 25489 0 80 0 - 2720 pipe_w 15:21 pts/1 00:00:00 grep --color=auto dnsmasq
wolf@mbp:~$
Thomas, please understand that I don't have the time to reiterate on situations that I have solved by reverting back to the original state. I gave you enough hints to solve the issue. You can set up a situation like mine in a bunch of virtual machines and simulate the effects.
I have a solution by disabling dnsmasq in network manager.
You might as well close this incident. But this will not fix the bug.
Thomas Hood (jdthood) wrote : | #10 |
> Please understand that I don't have the time
OK, marking this as invalid.
Changed in dnsmasq (Ubuntu): | |
status: | Incomplete → Invalid |
Thomas Hood (jdthood) wrote : | #11 |
I'd just like to note here before leaving this issue that the submitter originally said that he was running bind. If bind was set up to listen on 127.0.0.1#53 but was not correctly set up to provide name service then I imagine that this could have interfered with the NM-controlled dnsmasq. Exploring this possibility would have required cooperation by the submitter, but, unfortunately, the submitter is no longer willing to cooperate because he lacks time and has found a workaround that he is satisfied with.
Wolf Rogner (war-rsb) wrote : | #12 |
In fact my point of view is that I have submitted all information I could provide in reasonable time.
Give me a set of experiments to carry out and I'll spend an afternoon helping to resolve the issue.
From my point of view, I have given you more than enough input to localise the issue (see other bugs as well).
Feel free to drop. The issue is still here.
Changed in dnsmasq (Ubuntu): | |
status: | Invalid → Incomplete |
Thomas Hood (jdthood) wrote : | #13 |
On the affected system when it is manifesting the problem you reported (can't resolve names), is named running locally? (In your original submission you said that bind was running.) What role does this named play in your network? How is it configured? What happens if you stop this named? And then restart network-manager?
All these questions pertain, obviously, to the hypothesis that local named may be interfering with local dnsmasq.
Wolf Rogner (war-rsb) wrote : | #14 |
Thomas,
there is no local named on any notebook here.
dnsmasq does not interfere with named on any of my machines.
It should actually not even be possible. As far as I am concerned, the bind() call should fail with EADDRINUSE if the address was already bound by something else.
Unless dnsmasq uses a different technique to attach itself to port 53.
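The bind() behaviour referred to here can be checked with a small Python sketch. It uses an arbitrary free UDP port on 127.0.0.1 rather than port 53, so it needs no root privileges; the helper name `second_bind_errno` is hypothetical. It assumes neither socket sets SO_REUSEADDR or SO_REUSEPORT, which is the default for Python sockets on Linux.

```python
import errno
import socket


def second_bind_errno(host="127.0.0.1"):
    """Bind one UDP socket, then try to bind a second one to the same address.

    Returns the errno raised by the second bind(), or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        first.bind((host, 0))  # port 0: let the kernel pick a free port
        port = first.getsockname()[1]
        try:
            second.bind((host, port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()
```

On a typical Linux system `second_bind_errno()` should equal `errno.EADDRINUSE`, which is the failure a second daemon would see when trying to take an address that another process already holds.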
Thomas Hood (jdthood) wrote : | #15 |
The hypothesis was that named started before dnsmasq, preventing dnsmasq from binding port 53 on 127.0.0.1. But the hypothesis is false, since you are not running named after all.
Returning to your dig output, it can be summarized as follows.
dig s4 -> FAILURE
dig @10.1.0.4 s4 -> FAILURE
dig @10.1.0.4 s4.rsb.intern -> SUCCESS
(Wolf did something here)
dig s4 -> FAILURE
ping s4 -> SUCCESS
Notice that domain name completion failed even when the external server was specified. As Simon wrote in #4, domain name completion happens in the resolver library. So there seems to be something wrong with the resolver library: it doesn't complete domain names with the domain search suffixes when dnsmasq is in use. Am I right?
Does "dig @127.0.0.1 s4.rsb.intern" work on the affected system running NM-controlled dnsmasq?
summary: |
- dnsmasq integration into name resolution broken
+ domain name completion broken when dnsmasq is used
Thomas Hood (jdthood) wrote : | #16 |
Wolf: I forgot to mention earlier that the reason I have to keep asking questions is that I am unable to reproduce the problem here. On my system, domain name completion works as expected with NetworkManager+
I just tried installing nscd to see if that made any difference, but it did not seem to do so; so I don't need to ask you if you are using nscd. Also adding entries to /etc/hosts didn't seem to make any difference, so I don't have to ask about /etc/hosts either.
Does "dig @127.0.0.1 s4.rsb.intern" work on the affected system running NM-controlled dnsmasq? If it does work then I think we can conclude that the dnsmasq at 127.0.0.1 behaves the same way as the nameserver at 10.1.0.4 and that your issue is not dnsmasq-related; then we can start looking elsewhere for the cause of the problem.
My next guess would be that the resolver library on the affected system can't read /etc/resolv.conf.
Thomas Hood (jdthood) wrote : | #17 |
Having just re-read the discussion, I realize that I may have misunderstood the problem. I'll try to summarize it.
Wolf, are you saying that when using the NM-enslaved dnsmasq, fully qualified domain names can always be resolved using the resolver(3) but short domain names cannot be resolved using the resolver, despite the correct "search" option being present in /etc/resolv.conf; that this anomaly does not occur at boot time, but does occur later and lasts one to thirty minutes?
What triggers the anomaly to occur?
I will assume that in the anomalous situation, /etc/resolv.conf contains the following
nameserver 127.0.0.1
search rsb.intern rsb.at
and /run/nm-
server=10.1.0.4
and the following fails with "unknown host s4".
ping s4
When the anomaly occurs, does "dig s4.rsb.intern" work on the affected system?
When the anomaly occurs, does "dig s4 +search" work on the affected system?
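Because dig talks to the nameserver directly while ping goes through the C library, it can help to test the library path on its own. A minimal Python sketch follows; `socket.getaddrinfo` calls the system resolver, so it should follow the same lookup path as ping. The helper name `resolve_with_libc` is hypothetical.

```python
import socket


def resolve_with_libc(name):
    """Resolve `name` through the system resolver.

    Returns the sorted unique IPv4 addresses, or an empty list on failure.
    """
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_with_libc("s4")` and `resolve_with_libc("s4.rsb.intern")` during the anomaly would show whether the short name fails while the fully qualified name succeeds.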
Reggie McMurtrey (reggie-mcmurtrey) wrote : | #18 |
I'm having a very similar issue with the new DNS setup in 12.04 on my laptop. I run a server at home which runs bind. The server is set up correctly, all my machines with 11.04 installed work as expected, but the machines I have upgraded to 12.04 have issues. My server provides name resolution for machines spread out in my house. A working /etc/resolv.conf for 11.04 machines looks like:
# Generated by NetworkManager
domain home.lan
search home.lan
nameserver 192.168.99.2 #my bind server (Provide DNS for local IP's 192.168.99.x )
nameserver 24.177.176.38 #dns server provided by ISP (Provide DNS for Internet IP's)
nameserver 97.81.22.195 #dns server provided by ISP (Provide DNS for Internet IP's)
My new machines with 12.04 have a /etc/resolv.conf that looks like:
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 127.0.0.1
search home.lan
/run/nm-
server=192.168.99.2
server=
server=97.81.22.195
All DNS requests for outside IPs (non-192.168.x.x) work. My server has the IP 192.168.99.2 with an FQDN of linux.home.lan. If I do "ping linux" this fails; if I do "ping linux.home.lan" this works.
If I change my /etc/resolv.conf to the following:
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 127.0.0.1
nameserver 192.168.99.2
search home.lan
I'm back in business. Let me know what else I can do to help resolv (no pun intended :) this issue.
Thomas Hood (jdthood) wrote : | #19 |
Reggie: First of all, thanks for providing information about the malfunction on your system. We will get to the bottom of this!
To get very clear on what's happening I will summarize. Let me know if any of the following is wrong.
With the following resolv.conf (omitting comments)
nameserver 127.0.0.1
search home.lan
libc resolution of "lenin" repeatedly fails but "lenin.home.lan" repeatedly succeeds, whereas with
nameserver 127.0.0.1
nameserver 192.168.99.2
search home.lan
both of them repeatedly succeed.
Dnsmasq listens at 127.0.0.1, configured with the following.
server=192.168.99.2
server=
server=97.81.22.195
Only 192.168.99.2 can resolve the name "lenin.home.lan"; the others cannot.
Now, what's going on here?
First I'd like to rule out the possibility of side-effects of #1003842. Please eliminate the lines
server=
server=97.81.22.195
from /run/nm-
In the case where you add "nameserver 192.168.99.2" to /etc/resolv.conf and name resolution subsequently succeeds, is there a noticeable delay?
After you have edited /etc/resolv.conf, is /etc/resolv.conf still a symbolic link to /run/resolvconf
According to resolv.conf(5) the following environment variables can affect the behavior of the resolver: LOCALDOMAIN, RES_OPTIONS. Is either of these set in your environment? (Run, e.g., the
env |grep '\(DOMAIN\|RES\)'
command to check this.)
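The same environment check can be written as a short Python sketch that looks only for the two variables named in resolv.conf(5), which is narrower than the grep pattern above. The helper name `resolver_overrides` is hypothetical.

```python
import os

# Environment variables that resolv.conf(5) says can override the resolver setup.
RESOLVER_ENV_VARS = ("LOCALDOMAIN", "RES_OPTIONS")


def resolver_overrides(environ=None):
    """Return the resolver-related variables that are set, as a dict."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in RESOLVER_ENV_VARS if name in env}
```

An empty result from `resolver_overrides()` means neither variable is set in the current environment, so the search list comes from /etc/resolv.conf alone.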
Thomas Hood (jdthood) wrote : | #20 |
Reggie, I wrote:
> First I'd like to rule out the possibility of side-effects of #1003842. Please eliminate the lines
>
> server=
> server=97.81.22.195
>
> from /run/nm-
Just thought of a hitch.
After removing these lines dnsmasq has to be restarted or it won't notice the change. But you can't just kill the NM-controlled dnsmasq because NM then gets horribly confused. So you have to eliminate those lines in another way, the safest way probably being: temporarily reconfigure your DHCP server so that it only sends one nameserver address (192.168.99.2) to the client, then restart network-manager on the client. Make sure that /run/nm-
Wolf Rogner (war-rsb) wrote : | #21 |
Thomas,
I understand that you have not set up a simulation test bed and that your questions are directed to understand the problem. You have found out that you may have misinterpreted some pieces.
Reggie seems to have the same problem as I (and to my knowledge more than a dozen of others) have.
I have described the issues dnsmasq has somewhere else.
Here is another observation I can offer:
Eliminating dnsmasq makes machines respond within seconds. Network reconnect after resume from sleep as well as after a reboot works immediately. Thunderbird (another issue I submitted) connects to the mail server without problems.
Using dnsmasq, network reconnect (over WLAN) takes significantly longer (almost 5 secs).
Even though ping and dig resolve host names correctly (in our case s4 or mail.rsb.intern), Thunderbird does not connect. That implies that the connection between dnsmasq and the resolver libraries is broken as well. It works eventually (say after 1 or 5 minutes, not reproducibly). Curiously, it stops working again even while the machine continues to operate.
Again: This was tested on several machines.
And, yes: Disabling dnsmasq in NetworkManager.conf resolves ALL issues at once: Network reconnect in less than a second, name resolution to mail and mail.rsb.intern works fine (what else should the search path hold?).
Regarding my reluctance to give more than the basic information:
Everyone can read Launchpad entries. This is a severe security issue.
I try to be helpful but if you do not have the means to provide significant testing equipment, maybe taking dnsmasq out of an LTS would be the better solution.
Thomas Hood (jdthood) wrote : | #22 |
Wolf, dnsmasq is not going to be taken out of the distribution. Probably you meant that NM-driven dnsmasq shouldn't be enabled by default. If so then please file another report against network-manager with the title "Please don't enable dnsmasq by default so long as it's so buggy." But that measure will do nothing to address the issue in this report (#998712) which is the fact that domain name completion gets broken under certain circumstances when dnsmasq *is* used.
To remind you, we are still in the phase where we are trying to figure out exactly what those circumstances are. I believe you when you say that the failure occurs. But usually name service doesn't fail; so what is the trigger?
Finding some reproducible circumstances that trigger the malfunction is a precondition for preparing "test equipment". Characterizing those circumstances is also useful for isolating the bug.
In addressing this issue I know you'd like to be helpful, but for various reasons you can't be helpful. You can't provide the information I asked for in #17, for example, for security reasons, and because you are too busy, and so on. I understand.
Reggie McMurtrey (reggie-mcmurtrey) wrote : Re: [Bug 998712] Re: domain name completion broken when dnsmasq is used | #23 |
Thomas,
I'm not sure if an update has been pushed that has fixed my problem,
but I'm not seeing the issue at the moment. I was going to work on the
issue earlier this weekend, but some events came up. I have rebooted
several of my systems and I'm just not seeing the issue right now. I'll
make sure to follow your directions below if my system gets in this
state again and report back.
--Reggie
On 06/08/2012 04:44 AM, Thomas Hood wrote:
> Reggie: First of all, thanks for providing information about the
> malfunction on your system. We will get to the bottom of this!
>
> To get very clear on what's happening I will summarize. Let me know if
> any of the following is wrong.
>
> With the following resolv.conf (omitting comments)
>
> nameserver 127.0.0.1
> search home.lan
>
> libc resolution of "lenin" repeatedly fails but "lenin.home.lan"
> repeatedly succeeds, whereas with
>
> nameserver 127.0.0.1
> nameserver 192.168.99.2
> search home.lan
>
> both of them repeatedly succeed.
>
> Dnsmasq listens at 127.0.0.1, configured with the following.
>
> server=192.168.99.2
> server=
> server=97.81.22.195
>
> Only 192.168.99.2 can resolve the name "lenin.home.lan"; the others
> cannot.
>
> Now, what's going on here?
>
> First I'd like to rule out the possibility of side-effects of #1003842.
> Please eliminate the lines
>
> server=
> server=97.81.22.195
>
> from /run/nm-
> repeatedly (at least twice) try "ping lenin" and "ping lenin.home.lan"
> both with only nameserver 127.0.0.1 listed in /etc/resolv.conf, and with
> nameserver 127.0.0.1 and nameserver 192.168.99.2 listed in
> /etc/resolv.conf. Report the results back here.
>
> In the case where you add "nameserver 192.168.99.2" to /etc/resolv.conf
> and name resolution subsequently succeeds, is there a noticeable delay?
>
> After you have edited /etc/resolv.conf, is /etc/resolv.conf still a
> symbolic link to /run/resolvconf
> edit /run/resolvconf
>
> According to resolv.conf(5) the following environment variables can
> affect the behavior of the resolver: LOCALDOMAIN, RES_OPTIONS. Is
> either of these set in your environment? (Run, e.g., the
>
> env |grep '\(DOMAIN\|RES\)'
>
> command to check this.)
>
Thomas Hood (jdthood) wrote : | #24 |
Thanks for the update, Reggie.
Wolf, can you please put me in touch with one or more of the dozens of people you mentioned above (#21) who have this (#998712) problem?
Mathieu Trudel-Lapierre (cyphermox) wrote : | #25 |
What this particular bug looks like is that after trying to resolve s4. (as a TLD of some sort), dnsmasq simply returns NXDOMAIN (as it should). Then libc goes and tries to resolve the names with the search domains appended and that somehow also fails (or it's never tried).
I wonder if this could be due to specific settings in /etc/nsswitch.conf or in /etc/hosts?
If, however, this is time-sensitive, as in, not working when NetworkManager is restarted but eventually starting to work again after a few minutes, then it would be an issue with how dnsmasq gets restarted when routes are added or addresses change (especially with IPv6). It warrants more investigation. FWIW, I think those issues are fixed in Quantal too, so testing with a LiveCD would be appreciated.
Launchpad Janitor (janitor) wrote : | #26 |
[Expired for dnsmasq (Ubuntu) because there has been no activity for 60 days.]
Changed in dnsmasq (Ubuntu): | |
status: | Incomplete → Expired |
Changed in dnsmasq (Ubuntu): | |
status: | Expired → New |
Thomas Hood (jdthood) wrote : | #27 |
I have seen a couple more reports of this phenomenon in other fora and I have invited the parties in question to submit their information here.
Changed in dnsmasq (Ubuntu): | |
status: | New → Incomplete |
Wolf Rogner (war-rsb) wrote : | #28 |
In Ubuntu 12.10 this issue is as prominent as ever.
On resume from RAM or after boot, dnsmasq requires about 5 minutes to resolve names correctly.
Workaround: For ssh I use IP addresses.
Thunderbird requires a 5-minute waiting period after resume.
Evolution requires two or three attempts to start, then it works.
On most of my machines I disabled dnsmasq.
Thomas Hood (jdthood) wrote : | #29 |
@Wolf: Is it the case that when using the NM-controlled dnsmasq, fully qualified domain names can always be resolved using the glibc resolver but short domain names cannot be resolved using the resolver, despite the correct "search" option being present in /etc/resolv.conf; that this anomaly does not always occur, but when it does occur it lasts about five minutes?
Please correct me if any of the following is not (no longer) true. Please try to provide as complete information as possible.
In the anomalous situation, your /etc/resolv.conf contains the following
nameserver 127.0.1.1
search rsb.intern rsb.at
and "nm-tool|grep DNS" shows the following
DNS: 10.1.0.4
and the following command fails with "unknown host s4"
ping s4
but the following two commands succeed.
ping s4.rsb.intern
ping s4.rsb.at
When the anomaly occurs, does "dig s4.rsb.intern" work on the affected system?
When the anomaly occurs, does "dig s4.rsb.at" work on the affected system?
When the anomaly occurs, does "dig s4 +search" work on the affected system?
Wolf Rogner (war-rsb) wrote : | #30 |
nm-tool | grep DNS gives
DNS: 10.1.0.4
DNS: 10.1.0.254
DNS: 195.202.128.3
dig s4 gives nothing
wolf@mbp:~$ dig s4
; <<>> DiG 9.8.1-P1 <<>> s4
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 60009
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;s4. IN A
;; Query time: 1 msec
;; SERVER: 127.0.1.
;; WHEN: Mon Dec 10 21:52:10 2012
;; MSG SIZE rcvd: 20
but
dig s4 +search returns the IP of s4.rsb.intern and the domain rsb.intern
wolf@mbp:~$ dig s4 +search
; <<>> DiG 9.8.1-P1 <<>> s4 +search
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28303
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;s4.rsb.intern. IN A
;; ANSWER SECTION:
s4.rsb.intern. 34000 IN A 10.1.0.4
;; AUTHORITY SECTION:
rsb.intern. 34000 IN NS s4.rsb.intern.
;; Query time: 1 msec
;; SERVER: 127.0.1.
;; WHEN: Mon Dec 10 21:51:24 2012
;; MSG SIZE rcvd: 61
Thomas Hood (jdthood) wrote : | #31 |
My guess is that this bug is fundamentally bug #1003842 with the twist that sometimes the lookup succeeds if the fully qualified internal domain name is given on the command line explicitly.
(I don't think it's a coincidence that both Wolf and Reggie are on networks with non-equivalent nameservers.)
That is, using Wolf's example, "ping s4" fails because 's4' can't be resolved (as expected) and then 's4.rsb.intern' also can't be resolved because of bug #1003842.
The remaining question is why "ping s4.rsb.intern" *does* work. And why did "dig s4 +search" succeed in Wolf's last experiment? It's possible that in these cases dnsmasq happened to talk to the internal nameserver instead of the external one.
If this hypothesis is correct then running nm-dnsmasq in strict-order mode should also fix the problem. If you are running Ubuntu 12.10 then please try running nm-dnsmasq in strict-order mode. To put nm-dnsmasq into strict-order mode, create a file /etc/NetworkMan
Wolf Rogner (war-rsb) wrote : | #32 |
Now I understand what you are getting at (took me a long time).
I can confirm that my DNS server serves any request from inside the network. I have a log on the router monitoring outgoing traffic. Under NO circumstances is a DNS request going out UNLESS the internal server is down but the clients still have their DHCP settings.
I verify this regularly as part of my error testing procedures. This is why I even use an external DNS server as my bind forwards DNS requests in case it cannot resolve them itself.
Now there could be a client timeout (which occurs in very rare cases, say once a year per client). In that case, the browser (which is the only app that might get affected) will simply provide a 404 page and a reload usually works.
None of my dnsmasq-enabled clients sends requests to the external server when our internal bind is up.
I will try to put dnsmasq into strict-order mode for testing.
Get back to you with results after some time.
Merry Christmas
Wolf Rogner (war-rsb) wrote : | #33 |
Did some testing
I set up the strict-order using the file approach as described above.
I see no different behaviour.
My bind does not get any queries at all. dnsmasq does not forward requests.
Applications like Evolution or Thunderbird break on every reboot or resume. It takes up to 3 minutes for them to get a hold of the right server settings.
The only working solution is turning dnsmasq off.
Thomas Hood (jdthood) wrote : | #34 |
Thanks for testing. My hypothesis from comment #31 is false.
Is there anything unusual about your resolver configuration? Do you have a "sortlist", "options", etc., line in resolv.conf? What is the "hosts" line in your /etc/nsswitch.conf?
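The "hosts" line asked about here decides which lookup sources the C library consults and in what order. A small Python sketch for extracting it is given below; the helper name `hosts_sources` is hypothetical, and the sample line in the usage note is only an illustration, not taken from any system in this report.

```python
def hosts_sources(nsswitch_text):
    """Return the entries on the 'hosts:' line of nsswitch.conf-style text.

    Action items such as [NOTFOUND=return] are kept as separate entries.
    """
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("hosts:"):
            return line[len("hosts:"):].split()
    return []
```

For example, `hosts_sources("hosts: files dns\n")` returns `["files", "dns"]`. If `dns` were missing from the list, or an action item stopped the lookup before it, short names could fail even with a correct resolv.conf.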
Thomas Hood (jdthood) wrote : | #35 |
BTW I just discovered that "restart network-manager" no longer suffices to reload the configuration of nm-dnsmasq because nm-dnsmasq doesn't get killed on stopping network-manager. So if you edit files in /etc/NetworkMan
Wolf Rogner (war-rsb) wrote : | #36 |
I restarted my machine in order to test under correct circumstances. I did restart the network manager but did not kill all dnsmasq explicitly.
Now after a reboot I can give you better results. Give me until tomorrow for a complete test run
So far, Thunderbird did find the server immediately and Evolution could connect to my internal mail server as well.
I will do suspend and resume tomorrow.
Wolf Rogner (war-rsb) wrote : | #37 |
carried out some suspend/resume tests
worked fine.
tried some pings with name resolution:
It takes a long time to resolve the name but it works
seems that strict-order is the way to go in installations where an internal DNS is supported by external backups / extensions
Can that be automated during setup?
It should be pretty obvious that if there is a DNS server on the same LAN segment that it is preferable to any DNS on a different LAN. This would actually mirror the behaviour of DNS according to the RFCs
Just a thought.
Thanks for the help and clarification
Thomas Hood (jdthood) wrote : | #38 |
> seems that strict-order is the way to go [...] Can that be automated [...]?
There has been a discussion about the problem and possible solutions in bug #1003842. I most recently expressed my opinion in comment #41 of that ticket.
> It takes a long time to resolve the name
Why do you think there is a long delay in resolving the name? Does the first nameserver not respond right away?
Is the delay resolving internal names different from the delay resolving external names?
Is there also a delay when you disable nm-dnsmasq (comment out "dns=dnsmasq", etc.) so that resolv.conf lists multiple nameservers? Is this delay just as long as the delay when you are using nm-dnsmasq?
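One way to put numbers on these questions is to compare the query time that dig reports for the local forwarder and for the internal nameserver. The helper below is only a sketch: it assumes dig is installed, and query_time_ms is a hypothetical name introduced here for illustration.

```shell
# Print the "Query time" value (in msec) found in dig output read on stdin.
query_time_ms() {
    awk '/Query time:/ { print $4 }'
}

# Example usage, with $name set to the name being tested
# (the addresses are the ones quoted in this report):
#   dig "$name" @127.0.1.1 | query_time_ms   # through nm-dnsmasq
#   dig "$name" @10.1.0.4  | query_time_ms   # directly against the internal server
```

Running both lines for an internal and an external name should show whether the delay is added by the forwarder or by the upstream server.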
Wolf Rogner (war-rsb) wrote : | #39 |
First the bad news.
Name resolution drops after a few minutes of inactivity (approx. 30mins).
Same phenomenon as before.
Second: The name server answers as soon as the request arrives. dnsmasq obviously takes some time to determine if it can serve the name itself.
So back to square one: with dnsmasq disabled, name resolution is immediate and correct (I am talking about no noticeable delay).
enabling dnsmasq the problems begin (with a difference if set to strict-order). Name resolution takes
Name resolution after reboot and resume now work fine. Unfortunately after some time it does not any more. Same symptoms as without strict-order.
Just read the man-pages:
-o serves in the order resolv.conf offers!
but here is what my resolv.conf looks like
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 127.0.1.1
search rsb.intern rsb.at
So it refers to itself.
There is no dnsmasq.conf file where the dns servers are defined (not installed by default).
Network manager just starts dnsmasq but has no information about the network either.
Thomas Hood (jdthood) wrote : | #40 |
The "-o" option is the same as the strict-order option.
The dnsmasq man page says that in strict-order mode dnsmasq uses the order from "/etc/resolv.conf" but we shouldn't take that too literally since dnsmasq obtains nameserver addresses from sources other than /etc/resolv.conf. The dnsmasq process controlled by NetworkManager obtains its nameserver addresses from NetworkManager over D-Bus. I *presume* that nameserver addresses obtained by dnsmasq over D-Bus are checked in the order received in strict-order mode, but I haven't confirmed this.
To see what nameserver addresses NM has sent to nm-dnsmasq, run the nm-tool command and look for the "DNS:" line.
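To get just the addresses, in the order in which they are listed, the nm-tool output can be filtered. This is a small sketch; list_dns is a hypothetical helper name, and it assumes the addresses appear on lines whose first field is "DNS:".

```shell
# Print the nameserver addresses from nm-tool output (read on stdin),
# one per line, in the order in which they appear.
list_dns() {
    awk '$1 == "DNS:" { print $2 }'
}

# Example usage:
#   nm-tool | list_dns
```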
Thomas Hood (jdthood) wrote : | #41 |
When dnsmasq is malfunctioning, does sending the dnsmasq process a SIGHUP fix it?
sudo kill -HUP $(pidof dnsmasq)
This signal causes dnsmasq to clear its cache, but I imagine it might also kick dnsmasq out of whatever faulty state it has got into.
Thomas Hood (jdthood) wrote : | #42 |
By the way, is it still the case that only domain name completion is malfunctioning, not DNS lookups in general and lookups of fully qualified domain names in particular?
Thomas Hood (jdthood) wrote : | #43 |
Wolf, you wrote:
> nm-tool | grep DNS gives
>
> DNS: 10.1.0.4
> DNS: 10.1.0.254
> DNS: 195.202.128.3
You later wrote:
> I can confirm that my DNS server serves any request from inside the network.
Which server serves the requests, 10.1.0.4 or 10.1.0.254? Are these nameservers completely equivalent?
Thomas Hood (jdthood) wrote : | #44 |
The unusual time-dependent character of the malfunction makes me speculate about more exotic possibilities such as misconfigured firewalls or flaky hardware.
Wolf Rogner (war-rsb) wrote : | #45 |
Lots of speculations here.
My internal DNS server is 10.1.0.4. My fallback is the secondary 10.1.0.254 which acts as DNS forwarder and proxy to the third and others.
The resolver works its way down: All things well => 10.1.0.4
Main server down: 10.1.0.254 will serve rudimentary internal services and redirects all requests to external DNSs
The third server is there as we need two DNS servers for official domain name registrations.
I have another issue: If dnsmasq is on via Network Manager opening a VPN connection to a remote site violates all name resolution to internal addresses (10.x.x.x).
Here is the catch:
If I turn off dnsmasq, all things work as expected. Names get resolved correctly in all networks (internal, remote and external).
I travel a lot and have my notebook set to attach in all these networks automatically. It worked fine until dnsmasq was introduced.
I doubt that dnsmasq queries D-Bus for name resolution. And even if so, I question if there is an order that says D-bus, then resolv.conf or vice versa. To verify this, I will download the source and look into how dnsmasq works internally.
I even question whether my current understanding of how DNS works is accurate. There are so many RFCs covering DNS, mDNS and others that I need to update my knowledge first. I would not want you to search for something that actually does not exist.
All I can confirm at the moment is that disabling dnsmasq (even if that implies doing this on a multitude of machines) leads to a constantly working infrastructure with far better performance.
Thomas Hood (jdthood) wrote : | #46 |
Well, we are lucky that you do have a good workaround for the problem, even if we don't yet fully understand it.
Do I understand correctly that your two internal nameservers can resolve exactly the same domain names, neither one more names than the other? If that is not the case then bug #1003842 could still be part of the problem if both nameservers are online at the same time.
> If dnsmasq is on via Network Manager opening a
> VPN connection to a remote site violates all name
> resolution to internal addresses (10.x.x.x).
I am not sure I know what you mean by "violates". If you mean something like "causes to break" then I would guess that what happens is that the remote LAN's nameserver is used for all name resolution, and the remote LAN's nameserver doesn't know any of your internal names. If you configure search domain names for that VPN in NetworkManager's Connection Editor then NetworkManager will so configure nm-dnsmasq that the remote LAN's nameserver is used only to resolve names in those domains; non-VPN nameservers will be used to resolve other names. That's the advantage of dnsmasq: it can route DNS requests in that way.
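For reference, dnsmasq's own syntax for this kind of per-domain routing is the server=/domain/address option. The sketch below only prints such lines. The helper name, the domain names and the nameserver address are hypothetical placeholders, and NetworkManager may hand the same information to nm-dnsmasq by means other than a configuration file.

```shell
# Emit one dnsmasq "server=/domain/address" line per domain. Each line tells
# dnsmasq to send queries for names under that domain to the given nameserver.
# All values passed in are illustrative placeholders.
domain_servers() {
    ns=$1
    shift
    for domain in "$@"; do
        printf 'server=/%s/%s\n' "$domain" "$ns"
    done
}

# Example usage:
#   domain_servers 10.8.0.1 vpn.example
```

Names outside the listed domains would still go to the other configured nameservers.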
> I doubt that dnsmasq queries D-Bus for name resolution
D-Bus is only used to send nameserver addresses to dnsmasq. This method replaces /run/nm-
Launchpad Janitor (janitor) wrote : | #47 |
[Expired for dnsmasq (Ubuntu) because there has been no activity for 60 days.]
Changed in dnsmasq (Ubuntu): | |
status: | Incomplete → Expired |
Arno Peters (awpeters) wrote : | #48 |
I was experiencing a problem similar to the one described by the submitter.
The name resolution of the short name failed, subsequently the name resolution of the fqdn also failed. After 10-15 seconds, the fqdn resolved again, only to fail again after trying the short name.
The solution described by Thomas Hood:
sudo kill -HUP $(pidof dnsmasq)
resolved this issue for me. It is unknown to me how dnsmasq got itself into trouble.
Changed in dnsmasq (Ubuntu): | |
status: | Expired → Confirmed |
description: | updated |
Thomas Hood (jdthood) wrote : | #49 |
Arno,
Does disabling NetworkManager-
Hypothesis: dnsmasq is given two nameserver addresses. The first nameserver listed, which is the one always consulted first if dnsmasq is not used, functions correctly. The second one malfunctions. Dnsmasq sometimes consults the second one and therefore sometimes returns incorrect results.
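A simple way to test this hypothesis is to bypass dnsmasq and ask each nameserver directly for the same name. The following is a sketch that assumes dig is installed; ask_each is a hypothetical helper name.

```shell
# Ask each listed nameserver directly for the same name and print the answer.
# Note: dig does not apply the resolv.conf search list unless +search is
# given, so pass a fully qualified name.
ask_each() {
    name=$1
    shift
    for ns in "$@"; do
        answer=$(dig +short "$name" "@$ns" | paste -sd ' ' -)
        [ -n "$answer" ] || answer='(no answer)'
        printf '%s -> %s\n' "$ns" "$answer"
    done
}

# Example usage, with the nameserver addresses received from DHCP:
#   ask_each host.example.com 192.0.2.1 192.0.2.2
```

If one server answers and the other prints "(no answer)" or a different address, the two nameservers are not equivalent.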
On 13/05/12 11:00, Wolf Rogner wrote:
> Public bug reported:
>
> dnsmasq does not resolve DNS names correcty.
>
> Applications like Thunderbird or tools like ssh rely on working name
> resolution. However, if there never was a working name resolution,
> dnsmasq never gets to know about the DNS names.
>
> Setup:
>
> private network: 192.168.0.x/24
> domain mydomain.intern
> server: 192.168.0.1 hostname s1
> dhcp (.100 - .200) and bind running, postfix and dovecot running
> client: 192.168.0.100 (dhclient)
>
> /etc/resolv.conf
> ...
> nameserver 127.0.0.1
> search mydomain.intern
>
> /var/run/
> server=192.168.0.1
>
> Open Thunderbird -> Thunderbird fails to open s1
> ssh admin@s1 -> ssh: Could not resolve hostname s1: Name or service not known
>
> Adding
> nameserver 192.168.0.1
> to /etc/resolv.conf
>
> resolves the issue immediately
>
> calling sudo resolvconf -u
>
> creates the lookup problem immediately again
Please could you add the output from
dig s1
run when DNS is broken to this bug report, also
dig @192.168.0.1 s1
Cheers,
Simon.