Discussion:
[9fans] dns
(too old to reply)
erik quanstrom
2012-08-21 01:59:02 UTC
Permalink
i'm using a modified version of dns. i found that aktomi
redirections too unreliable. even so, i still get crashes, which have
become more frequent in recent weeks. i've attached a copy of
"restartdns" which is ment to be called from cron on short intervals.

contrib quanstro/ndb has the whole nine yards.

one of these days i will redo dns with a better (and maintainable)
structure. :-). but please beat me to it.

- erik
arisawa
2012-08-21 04:23:19 UTC
Permalink
Hello

Recent version of dns crashes.
Old version (for example, Apr this year) is OK.
It seems crashed dns make other side effect:
Many processes stop with Fauth status. # I don't know the reason.

Kenji Arisawa
Has anyone else been seeing their Plan 9 dns servers work for a
little while and then stop responding? I've seen this happen
Aug 20 00:24:10 delegation loop 68.23.115.in-addr.arpa ns rev1.kornet.net -> ns H.ROOT-SERVERS.NET from 211.216.50.180
and no answers
Aug 20 00:24:11 delegation loop 68.23.115.in-addr.arpa ns rev1.kornet.net -> ns J.ROOT-SERVERS.NET from 211.216.50.170
Aug 20 00:24:11 and no answers
Aug 20 00:24:11 delegation loop 68.23.115.in-addr.arpa ns rev1.kornet.net -> ns G.ROOT-SERVERS.NET from 211.216.50.180
Aug 20 00:24:11 and no answers
Aug 20 00:24:13 delegation loop 68.23.115.in-addr.arpa ns rev1.kornet.net -> ns J.ROOT-SERVERS.NET from 211.216.50.170
Aug 20 00:24:13 and no answers
Aug 20 00:24:15 delegation loop 68.23.115.in-addr.arpa ns rev1.kornet.net -> ns H.ROOT-SERVERS.NET from 211.216.50.180
Jeff Sickel
2012-08-21 05:08:48 UTC
Permalink
As in using:

Apr 12 22:46:05 CDT 2012 /n/sourcesdump/2012/0501/plan9/386/bin/ndb/dns 310819 [jmk]
Oct 14 13:32:38 CDT 2011 /n/sourcesdump/2012/0412/plan9/386/bin/ndb/dns 310519 [sys]

There are a few differences if you compare with sourcesdump. Knowing one that
works should help narrow down the changes to get the latest working cleanly again.

-jas
Post by arisawa
Hello
Recent version of dns crashes.
Old version (for example, Apr this year) is OK.
Many processes stop with Fauth status. # I don't know the reason.
Kenji Arisawa
Kenji Arisawa
2012-08-21 08:42:26 UTC
Permalink
Hello,

I suspect the problem is in:
--rwxrwxr-x M 21231 glenda sys 311768 Jul 14 11:46 dns

I don't have 2012/0501 version. I am lazy sorry.

My current dns is
--rwxrwxr-x M 131526 sys sys 310819 Apr 13 12:46 /386/bin/ndb/dns
That works 12 days since last boot without any troubles.

ar% ps
....
none 99 0:00 0:00 184K Open listen
none 100 0:00 0:00 184K Open listen
bootes 540 0:00 0:01 240K Await patrol
none 241197 0:00 0:00 240K Await tcp110
arisawa 241202 0:00 0:00 360K Pread pop3

where patrol is a program that reboot cpu server if it detects a process that stops
with Fauth status. the contents is
ar% cat /usr/bootes/bin/rc/patrol
#!/bin/rc
rfork e
while(sleep 60){
a=`{ps|grep Fauth|wc}
if(! ~ $a(1) 0){
/usr/local/bin/386/logit -l reboot Fauth:$a(1) # for logging (my program). don't mind
reboot
}
}
ar%

Kenji Arisawa
Post by Jeff Sickel
Apr 12 22:46:05 CDT 2012 /n/sourcesdump/2012/0501/plan9/386/bin/ndb/dns 310819 [jmk]
Oct 14 13:32:38 CDT 2011 /n/sourcesdump/2012/0412/plan9/386/bin/ndb/dns 310519 [sys]
There are a few differences if you compare with sourcesdump. Knowing one that
works should help narrow down the changes to get the latest working cleanly again.
-jas
Post by arisawa
Hello
Recent version of dns crashes.
Old version (for example, Apr this year) is OK.
Many processes stop with Fauth status. # I don't know the reason.
Kenji Arisawa
c***@gmx.de
2012-08-21 11:27:55 UTC
Permalink
nothing wrong with diffing the changes and see if theres a clue, but
to solve this one really needs to find the underlying cause no matter
what. changes can just hide bugs or make them more or less likely to
appear. can anyone provide at least a stacktrace or process snapshot
of the crashed dns processes? from that you try to build a theory of
what might be going wrong by thinking really really hard... (the
thinking should be directly proportional to the time it takes to
reproduce the bug) and then you work on how to prove that theory.
just changing stuff without knowing what exactly was the problem with
the old code is sometimes tempting, but wrong and dangerous.

--
cinap
Lucio De Re
2012-08-21 12:18:00 UTC
Permalink
Post by c***@gmx.de
can anyone provide at least a stacktrace or process snapshot
of the crashed dns processes?
I've had double-free errors from dns on my server, but the same
machine suffers from fossil/venti errors, so I can't use it as
diagnostics.

I do think that Plan 9 dns is a bit fragile, though not in the way
BIND is monstrously complex.

++L
arisawa
2012-08-21 12:37:08 UTC
Permalink
Hello,

Files in /sys/src/cmd/ndb are not be changed since this Mar 31.
However dns in /n/sources/plan9/386/bin/ndb have changed two or three times since that day.
Probably the difference comes from others (library?).

hera% pwd
/sys/src/cmd/ndb
hera% ls -lt
--rw-rw-r-- M 21422 sys sys 34365 Mar 31 05:41 cs.c
--rw-rw-r-- M 21422 sys sys 40329 Mar 29 08:07 dn.c
--rw-rw-r-- M 21422 sys sys 20706 Mar 29 08:01 dns.c
--rw-rw-r-- M 21422 sys sys 6916 Oct 14 2011 dnudpserver.c
--rw-rw-r-- M 21422 sys sys 39487 Oct 14 2011 dnresolve.c
--rw-rw-r-- M 21422 sys sys 12571 Oct 14 2011 convM2DNS.c

hera% ls -lt /n/sources/plan9/sys/src/cmd/ndb
--rw-rw-r-- M 21436 glenda sys 34365 Mar 31 05:41 /n/sources/plan9/sys/src/cmd/ndb/cs.c
--rw-rw-r-- M 21436 glenda sys 40329 Mar 29 08:07 /n/sources/plan9/sys/src/cmd/ndb/dn.c
--rw-rw-r-- M 21436 glenda sys 20706 Mar 29 08:01 /n/sources/plan9/sys/src/cmd/ndb/dns.c
--rw-rw-r-- M 21436 glenda sys 6916 Oct 14 2011 /n/sources/plan9/sys/src/cmd/ndb/dnudpserver.c
--rw-rw-r-- M 21436 glenda sys 39487 Oct 14 2011 /n/sources/plan9/sys/src/cmd/ndb/dnresolve.c
--rw-rw-r-- M 21436 glenda sys 12571 Oct 14 2011 /n/sources/plan9/sys/src/cmd/ndb/convM2DNS.c
Post by c***@gmx.de
the old code is sometimes tempting, but wrong and dangerous.
I agree. And I wonder why dns runs as bootes (sever owner).

Kenji Arisawa
Post by c***@gmx.de
nothing wrong with diffing the changes and see if theres a clue, but
to solve this one really needs to find the underlying cause no matter
what. changes can just hide bugs or make them more or less likely to
appear. can anyone provide at least a stacktrace or process snapshot
of the crashed dns processes? from that you try to build a theory of
what might be going wrong by thinking really really hard... (the
thinking should be directly proportional to the time it takes to
reproduce the bug) and then you work on how to prove that theory.
just changing stuff without knowing what exactly was the problem with
the old code is sometimes tempting, but wrong and dangerous.
--
cinap
erik quanstrom
2012-08-21 18:32:03 UTC
Permalink
Post by c***@gmx.de
nothing wrong with diffing the changes and see if theres a clue, but
to solve this one really needs to find the underlying cause no matter
what. changes can just hide bugs or make them more or less likely to
appear. can anyone provide at least a stacktrace or process snapshot
of the crashed dns processes? from that you try to build a theory of
what might be going wrong by thinking really really hard... (the
thinking should be directly proportional to the time it takes to
reproduce the bug) and then you work on how to prove that theory.
just changing stuff without knowing what exactly was the problem with
the old code is sometimes tempting, but wrong and dangerous.
very good point. in the past, much of the trouble has been that the
rr records get smashed and you crash much later on. this makes debugging
from a crash frustrating.

but as it happily turns out, i keep all the snaps of all my broken dns processes.
i've had ~35 since january, when i last looked at dns. as it turns out
all of the crashes were due to an off-by-one in dnresolv:/^serveraddrs
the final loop in the function should be limited by Maxdest-1, since
Maxdest is an index not a count.

with this fixed, it will be interesting to see if we see better behavior
or not. my big concern right now is occassionally corrupted results we
see. the crashes and whatnot are relatively easy to clean up after.

if anyone else is interested in my compendium of dnssnaps, i'd be happy
to send them along. there's only a gb of them! :-).

- erik
Charles Forsyth
2012-08-21 20:05:37 UTC
Permalink
Doh! I can't tell you how often I've looked at that very line of code,
but evidently my eyes are shallow.
Post by erik quanstrom
all of the crashes were due to an off-by-one in dnresolv:/^serveraddrs
the final loop in the function should be limited by Maxdest-1, since
Maxdest is an index not a count.
erik quanstrom
2012-08-21 20:41:40 UTC
Permalink
Post by Charles Forsyth
Doh! I can't tell you how often I've looked at that very line of code,
but evidently my eyes are shallow.
i missed it last time, too.

- erik
c***@gmx.de
2012-08-21 22:44:44 UTC
Permalink
excellent! :)

--
cinap
c***@gmx.de
2012-08-22 10:32:42 UTC
Permalink
wait, Maxdest is a count, see:

dest = emalloc(Maxdest * sizeof *dest); /* dest can't be on stack */

code like:

Dest *curdest; /* pointer to next to fill */

p = dp->dest;
...
if (qp->ndest > qp->curdest - p) {

and

for(p = qp->dest; p < qp->curdest; p++)

indicates that dp->curdest is an end pointer. so it should be perfectly valid
for it to point at &dp->dest[Maxdest].

the check at the top of serveraddrs() should really be:

if(nd >= Maxdest) /* dest array is full? */
return Maxdest;

serveraddr() really returns a count. which is the same as an
end pointer index.

the result check of that serveraddr() call should really be:

if (j < 0 || j > Maxdest) {
dnslog("serveraddrs() result %d out of range", j);
abort();
}
qp->curdest = &qp->dest[j];

and the destck(dp->curdest); should be removed.

--- a/sys/src/cmd/ndb/dnresolve.c Wed Aug 22 00:11:42 2012 +0200
+++ b/sys/src/cmd/ndb/dnresolve.c Wed Aug 22 12:28:34 2012 +0200
@@ -832,7 +832,7 @@
Dest *cur;

if(nd >= Maxdest) /* dest array is full? */
- return Maxdest - 1;
+ return Maxdest;

/*
* look for a server whose address we already know.
@@ -1080,13 +1080,12 @@
*/
if (qp->ndest > qp->curdest - p) {
j = serveraddrs(qp, qp->curdest - p, depth);
- if (j < 0 || j >= Maxdest) {
+ if (j < 0 || j > Maxdest) {
dnslog("serveraddrs() result %d out of range", j);
abort();
}
qp->curdest = &qp->dest[j];
}
- destck(qp->curdest);

/* no servers, punt */
if (qp->ndest == 0)

--
cinap
Kenji Arisawa
2012-08-22 12:09:34 UTC
Permalink
Hello,
Post by c***@gmx.de
if (j < 0 || j > Maxdest) {
dnslog("serveraddrs() result %d out of range", j);
abort();
}
qp->curdest = &qp->dest[j];
what happens if j == Maxdest ? note that j is index.

I rather notice the foolowing code.

/* use any addresses that we found */
for(trp = arp; trp && nd < Maxdest; trp = trp->next){
cur = &qp->dest[nd];
parseip(cur->a, trp->ip->name);
/*
* straddling servers can reject all nameservers if they are all
* inside, so be sure to list at least one outside ns at
* the end of the ns list in /lib/ndb for `dom='.
*/
if (ipisbm(cur->a) ||
cfg.straddle && !insideaddr(qp->dp->name) && insidens(cur->a))
continue;
cur->nx = 0;
cur->s = trp->owner;
cur->code = Rtimeout;
nd++;
}
lock(&dnlock);
rrfreelist(arp);
unlock(&dnlock);
return nd;

returned value may be Maxdest.
This code is in function serveraddrs(), and the function must return index. (must not be Maxdest)

Kenji Arisawa
c***@gmx.de
2012-08-22 12:35:30 UTC
Permalink
Post by c***@gmx.de
qp->curdest = &qp->dest[j];
this just takes the address, it doesnt actually dereference anything. you
could write this as qp->curdest = qp->dest + j.

that destck(qp->curdest); call and the j >= Maxdest check after the
serveraddrs() call seem like a confused debug attempt to me. or is here
any reason why serveraddrs() should *not* fill the whole array up
to the last element and stop at Maxdest-2?

look how qp->curdest is used in the code. it is a pointer pointing *past*
the last filled entry (or represents the next entry to be filled). if the
whole array is filled, it should point at qp->dest + Maxdest (and serveraddrs
even checked exactly for that).

except that bogus destck(), dp->curdest is never dereferenced in the code
and just used in loops as the end marker with the p < dp->curdest or
p != dp->curdest condition.

--
cinap
erik quanstrom
2012-08-22 13:05:20 UTC
Permalink
Post by c***@gmx.de
Post by c***@gmx.de
qp->curdest = &qp->dest[j];
this just takes the address, it doesnt actually dereference anything. you
could write this as qp->curdest = qp->dest + j.
that destck(qp->curdest); call and the j >= Maxdest check after the
serveraddrs() call seem like a confused debug attempt to me. or is here
any reason why serveraddrs() should *not* fill the whole array up
to the last element and stop at Maxdest-2?
it will stop at setting qp->curdest = &qp->dest[Maxdest-1],
since the maximum return value is the loop limit + 1.

since qb->curdest is indirected, for example by setdestoutns, this must be
a valid pointer to remain safe.

and further, after failing to get an answer from 24 different servers, it's unlikely
that the 25th will trying 24 different servers for the same address is already
i think it should be smaller. but that's just me. :-).

i think we can all agree that this code is hard to read.

- erik
erik quanstrom
2012-08-22 13:08:34 UTC
Permalink
Post by erik quanstrom
and further, after failing to get an answer from 24 different servers, it's unlikely
that the 25th will trying 24 different servers for the same address is already
i think it should be smaller. but that's just me. :-).
good grief. what that was supposed to say was "the 25th is unlikely to be helpful
if the first 24 fail. 24 already seems excessive."

- erik
Charles Forsyth
2012-08-22 13:11:20 UTC
Permalink
for(n = 0; n < Maxdest; n++, qp->curdest++)
if (setdestoutns(qp->curdest, n) < 0)
Post by erik quanstrom
since qb->curdest is indirected, for example by setdestoutns, this must be
a valid pointer to remain safe.
only when the pointer is valid. I think cinap is right.
Charles Forsyth
2012-08-22 13:21:40 UTC
Permalink
that's why cinap removed that line in his diff.
you can argue that this is a mistake, but i'm sure there are others.
erik quanstrom
2012-08-22 13:18:23 UTC
Permalink
Post by Charles Forsyth
for(n = 0; n < Maxdest; n++, qp->curdest++)
if (setdestoutns(qp->curdest, n) < 0)
Post by erik quanstrom
since qb->curdest is indirected, for example by setdestoutns, this must be
a valid pointer to remain safe.
only when the pointer is valid. I think cinap is right.
i don't think so. destck will still indirect it:

if (qp->ndest > qp->curdest - p) {
[...]
qp->curdest = &qp->dest[j];
destsum(qp, j);
}
Post by Charles Forsyth
Post by erik quanstrom
destck(qp->curdest);
the definition if destck:

static void
destck(Dest *p)
{
assert(p);
assert(p->magic == Destmagic);
}

you can argue that this is a mistake, but i'm sure there are others.
so imho, this is not worth fixing. dns needs replacing.

- erik
c***@gmx.de
2012-08-22 13:31:18 UTC
Permalink
i just said it like 2 times already that i think this destck() and the
result serveraddrs() check is bogus and is the source of all this
confustion.

the "curdest" is handled as an end pointer in the rest of the code.

its where you start adding stuff, and after writing it, you increment
it to the next element. so it always points past the last valid element.
this is just idiomatic c code.

serveraddrs() should take a start index (current count) and return a
count. and at the beginning check if the start index is already the
maximum size of the dest array. this makes the most sense and is not
hard to understand.

--
cinap
erik quanstrom
2012-08-22 14:22:45 UTC
Permalink
Post by c***@gmx.de
i just said it like 2 times already that i think this destck() and the
result serveraddrs() check is bogus and is the source of all this
confustion.
the "curdest" is handled as an end pointer in the rest of the code.
doesn't this code from procansw also treat qp->curdest as something
other than an end pointer?

if(p != qp->curdest)
p->code = Rserver;

since, if qp->curdest is an end pointer, then p!=qp->curdest ≡ 1.

- erik
Charles Forsyth
2012-08-22 14:41:29 UTC
Permalink
No, see the call to procansw
since, if qp->curdest is an end pointer, then p!=qp->curdest ≡ 1.
c***@gmx.de
2012-08-22 14:47:47 UTC
Permalink
this is interesting. the p != qp->curdest check would just
support my point because it effectively checks if p is valid.
if p would be at qp->curdest, it would be past the last valid
entry and hence invalid so its not written.

but theres another thing, look in queryns() how p comes to be:

/* find responder */
// dnslog("queryns got reply from %I", srcip);
for(p = qp->dest; p < qp->curdest; p++)
if(memcmp(p->a, srcip, sizeof p->a) == 0)
break;

for(np = qp->dest; np < qp->curdest; np++)
if(np->s == p->s) <-- oops, p might be qp->curdest here
p->nx = Maxtrans; <-- fuck!

rv = procansw(qp, &m, srcip, depth, p);

i think we also need to check p != qp->curdest before that
2nd for loop or it would trash the entry at curdist.

@@ -1439,9 +1438,10 @@
break;

/* remove all addrs of responding server from list */
- for(np = qp->dest; np < qp->curdest; np++)
- if(np->s == p->s)
- p->nx = Maxtrans;
+ if(p != qp->curdest)
+ for(np = qp->dest; np < qp->curdest; np++)
+ if(np->s == p->s)
+ p->nx = Maxtrans;

/* free or incorporate RRs in m */
rv = procansw(qp, &m, srcip, depth, p);

--
cinap
erik quanstrom
2012-08-22 15:26:35 UTC
Permalink
Post by c***@gmx.de
this is interesting. the p != qp->curdest check would just
support my point because it effectively checks if p is valid.
if p would be at qp->curdest, it would be past the last valid
entry and hence invalid so its not written.
if that's true, then this could be some of the most convoluted
code we've got.

- erik
erik quanstrom
2012-08-22 15:33:28 UTC
Permalink
here's my copy. it probablly won't work if *ncpu>1 on the standard
system because vmap's no up to the task.

i didn't re-verify that all the vmap() wiggles are necessary with
nix; they are with the pae kernel.

- erik
erik quanstrom
2012-08-22 15:35:40 UTC
Permalink
Post by erik quanstrom
here's my copy. it probablly won't work if *ncpu>1 on the standard
system because vmap's no up to the task.
i didn't re-verify that all the vmap() wiggles are necessary with
nix; they are with the pae kernel.
- erik
sorry, fat fingered the message #. wrong list.

so far nothing's gone right today!

- erik
Charles Forsyth
2012-08-22 16:05:00 UTC
Permalink
I sent this to cinap earlier, who concurred.
Given the comment
/* remove all addrs of responding server from list */
for(np = qp->dest; np < qp->curdest; np++)
if(np->s == p->s)
p->nx = Maxtrans;
I wonder whether np->nx = Maxtrans was meant.
Lucio De Re
2012-08-22 13:23:54 UTC
Permalink
Post by c***@gmx.de
or is here
any reason why serveraddrs() should *not* fill the whole array up
to the last element and stop at Maxdest-2?
I'm just guessing here, but Kenji pointed to the need to add a server
address in the case of a straddling server. Would that perhaps be a
factor here?

++L
Kenji Arisawa
2012-08-25 00:10:26 UTC
Permalink
Hello cinap,

I got a one. I hope this a helpful.


ar% cat broken/1345779846.41356
name=dns
/proc/41356/text:386 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/386
acid: abort()+0x0 /sys/src/libc/9sys/abort.c:6
ppanic(p=0x3975c,fmt=0x394ec)+0x146 /sys/src/libc/port/malloc.c:166
pv=0x3e820
msg=0x3f804
v=0xdfffc800
n=0x2c
D2B(p=0x3975c,v=0x497f8)+0x5a /sys/src/libc/port/pool.c:968
a=0x497f0
poolfreel(v=0x497f8,p=0x3975c)+0x20 /sys/src/libc/port/pool.c:1192
ab=0x3e820
poolfree(p=0x3975c,v=0x497f8)+0x41 /sys/src/libc/port/pool.c:1327
free(v=0x49800)+0x23 /sys/src/libc/port/malloc.c:250
mydnsquery(qp=0x88cf0,udppkt=0xc76f0,len=0x2a,medium=0x1)+0x185 /sys/src/cmd/ndb/dnresolve.c:1032
rv=0xc
domain=0x49800
net=0x74656e2f
conndir=0x74656e2f
nci=0x52b59
belen=0x6e2f000f
xmitquery(qp=0x88cf0,depth=0x1,medium=0x1,inns=0x1,obuf=0xc76f0,len=0x2a)+0x227 /sys/src/cmd/ndb/dnresolve.c:1114
p=0xc7950
j=0x1
n=0x0
buf=0x1b59c4c3
tcpquery(qp=0x88cf0,waitms=0x63f,obuf=0xc76f0,ibuf=0xa7530,depth=0x1,inns=0x1,len=0x2a,req=0x1d85,mp=0xdfffc9b4)+0xea /sys/src/cmd/ndb/dnresolve.c:1353
rv=0x0
endms=0x56ba1ef1
queryns(qp=0x88cf0,obuf=0xc76f0,depth=0x1,inns=0x1,waitms=0x63f,ibuf=0xa7530)+0x4d3 /sys/src/cmd/ndb/dnresolve.c:1428
req=0xa9961d85
len=0x2a
dest=0xc7950
p=0xc7c30
ndest=0x1
endms=0x56ba1dcc
replywaits=0x0
buf=0x9dfa996
m=0x1d85
srcip=0xdfffca18
rv=0x9dfa996
udpquery(mntpt=0x3f0e0,qp=0x88cf0,patient=0x0,depth=0x1,inns=0x1)+0x1b7 /sys/src/cmd/ndb/dnresolve.c:1578
ibuf=0xa7530
obuf=0xc76f0
fd=0xb
msg=0x6faa
pcntprob=0x3c
reqtm=0x1f40
wait=0x63f
rv=0x87710
netquery(depth=0x1,qp=0x88cf0)+0x2b5 /sys/src/cmd/ndb/dnresolve.c:1660
rv=0x0
dp=0x6d460
qlp=0x6d4fc
lock=0x1
buf=0x3975c
triedin=0x0
inname=0x1
netqueryns(qp=0x88cf0,nsrp=0x876b0,depth=0x1)+0x1e /sys/src/cmd/ndb/dnresolve.c:338
rv=0x88ce8
issuequery(class=0x1,qp=0x88cf0,depth=0x0,name=0xdfffce13,recurse=0x0)+0x50 /sys/src/cmd/ndb/dnresolve.c:359
nsrp=0x876b0
cp=0x88cf0
dbnsrp=0x8558
rp=0x0
dnresolve1(name=0xdfffce13,type=0xf,class=0x1,req=0xdfffcdd8,depth=0x0,recurse=0x0)+0x25c /sys/src/cmd/ndb/dnresolve.c:505
dp=0x6d460
rp=0x0
qp=0x88cf0
dnresolve(status=0xdfffcce0,depth=0x0,rooted=0x0,name=0xdfffce13,class=0x1,type=0xf,req=0xdfffcdd8,cn=0x0,recurse=0x0)+0xa8 /sys/src/cmd/ndb/dnresolve.c:198
procname=0x9cb50
rp=0x0
drp=0x71a98
nrp=0x9cb40
nname=0x48
dp=0xdfffcca8
loops=0x9cb90
lookupqueryold(p=0xdfffce13,mf=0xbac50,req=0xdfffcdd8,rooted=0x0,job=0xba810,errbuf=0xdfffcd0c,wantsav=0x0)+0x70 /sys/src/cmd/ndb/dns.c:864
status=0x0
rp=0x9cb48
rwrite(job=0xba810,mf=0xbac50,req=0xdfffcdd8)+0x2be /sys/src/cmd/ndb/dns.c:838
err=0x0
cnt=0x1b
send=0x0
errbuf=0x0
atype=0xdfffce2c
io()+0x39e /sys/src/cmd/ndb/dns.c:532
req=0x1
mdata=0x32
n=0x32
job=0xba810
mf=0xbac50
main(argv=0xdfffefb0,argc=0x0)+0x32c /sys/src/cmd/ndb/dns.c:267
ext=0x0
_argc=0x72
_args=0xdfffefc7
servefile=0x642f7323
dir=0x0
kid=0x0
_main+0x31 /sys/src/libc/386/main9.s:16
acid:
echo kill > /proc/41356/ctl
ar%

Kenji Arisawa
Post by c***@gmx.de
nothing wrong with diffing the changes and see if theres a clue, but
to solve this one really needs to find the underlying cause no matter
what. changes can just hide bugs or make them more or less likely to
appear. can anyone provide at least a stacktrace or process snapshot
of the crashed dns processes? from that you try to build a theory of
what might be going wrong by thinking really really hard... (the
thinking should be directly proportional to the time it takes to
reproduce the bug) and then you work on how to prove that theory.
just changing stuff without knowing what exactly was the problem with
the old code is sometimes tempting, but wrong and dangerous.
--
cinap
c***@gmx.de
2012-08-25 10:54:14 UTC
Permalink
always make a process snapshot as the kernel might discard
your broken process once it runs low on memory so you have
time to debug:

snap 41356 >/tmp/dns.snap

char *domain strings alloc header seems to have been corrupted
(or just freed by accident?).

the string just gets allocated and freed in mydnsquery() so its
unlikely a bug there. someone else has corrupted its alloc header?

it looks more like corruption as we dont hand this pointer out to
someone else but netmkaddr().

look at the raw data, often one can get a clue by what it got
overridden with and try to figure out what the previous block
before our block that got corrupted was. the pool allocator keeps
the callerpc's of who allocated the block so you can use that
to figure out what it is, or look at the contents.

// dump the memory arround our corrupted block
dump(0x497f8 - 0x10, 0x100, "X")

maybe our block didnt got overridden but really freed with
a call to free but with the wrong pointer? check the alloc
magic!

// check the contents, should be an ip address string
dump(0x49800, 1, "s")

run acid with -lpool -lleak and run blockdump() if its
corrupted block, it might just stop at the block before
our one and will print the allocpc's and give some
diagnostics.

i can try this if you provide process snapshot file.

--
cinap
Kenji Arisawa
2012-08-25 12:37:05 UTC
Permalink
Hello cinap,

broken dns triggers Fauth problem, so I have rebooted.
I will get snapshot at next crash.

Kenji Arisawa
Post by c***@gmx.de
always make a process snapshot as the kernel might discard
your broken process once it runs low on memory so you have
snap 41356 >/tmp/dns.snap
char *domain strings alloc header seems to have been corrupted
(or just freed by accident?).
the string just gets allocated and freed in mydnsquery() so its
unlikely a bug there. someone else has corrupted its alloc header?
it looks more like corruption as we dont hand this pointer out to
someone else but netmkaddr().
look at the raw data, often one can get a clue by what it got
overridden with and try to figure out what the previous block
before our block that got corrupted was. the pool allocator keeps
the callerpc's of who allocated the block so you can use that
to figure out what it is, or look at the contents.
// dump the memory arround our corrupted block
dump(0x497f8 - 0x10, 0x100, "X")
maybe our block didnt got overridden but really freed with
a call to free but with the wrong pointer? check the alloc
magic!
// check the contents, should be an ip address string
dump(0x49800, 1, "s")
run acid with -lpool -lleak and run blockdump() if its
corrupted block, it might just stop at the block before
our one and will print the allocpc's and give some
diagnostics.
i can try this if you provide process snapshot file.
--
cinap
c***@gmx.de
2012-08-25 13:22:36 UTC
Permalink
did somebody notice that in dnresolve.c, netqueryns() frees the nsrp
rr list?

but in issuequery() it calls netqueryns() in a loop with the same
nsrp pointer!

--
cinap
c***@gmx.de
2012-08-25 13:38:43 UTC
Permalink
no, sorry. missed that nsrp = nil, carry on...

--
cinap
Charles Forsyth
2012-08-25 13:41:47 UTC
Permalink
In my copy, which I'm sure I haven't changed, it gets a new nsrp from
randomize(rrlookup...
each time, or uses dbnsrp, which is randomize(dblookup..., so both are new
and can be freed.
Post by c***@gmx.de
but in issuequery() it calls netqueryns() in a loop with the same
nsrp pointer!
arisawa
2012-08-27 22:44:28 UTC
Permalink
Hello,

I got a broken dns snapshot.

You can download from:
http://plan9.aichi-u.ac.jp/dns.snap.gz

Kenji Arisawa
c***@gmx.de
2012-08-28 01:45:35 UTC
Permalink
very good. thanks.

one wired thing is that the string pointer (0xfb900) it
tried to free (char *domain) points in the middle of the
querylck array of a allocated DN.

thats not a valid alloc block indeed.

there migh'v been a block there, but it got accidently freed
and then the space reused for that DN, or the pointer itself
got corrupted (only possible from our current process as its
stored on the stack which is private to our proc).

theres a block (0xfb9a0) after it that satisfies the requirement
of being the real thing (alloc callerpc is right after the smprint(),
still valid ip address string).

the char *domain pointer is stored on the stack at 0x74(SP).
char conndir[40] starts at 0x4c(SP). 0x4c+40 = 0x74 so if
dial overflows conndir (off by one error?) it could indeed
trash that pointer overriding the lsb of char *domain resulting
in bogus 0xfb900 address instead of 0xfb9a0.

acid: dump(0xdfffc8b4, 44/4, "X")
0xdfffc8b4: 0x74656e2f <- char conndir[40]
0xdfffc8b8: 0x7063742f
0xdfffc8bc: 0x0000392f
0xdfffc8c0: 0x00000000
0xdfffc8c4: 0x00000000
0xdfffc8c8: 0x00000000
0xdfffc8cc: 0x00000000
0xdfffc8d0: 0x00000000
0xdfffc8d4: 0x00000000
0xdfffc8d8: 0x00000000
0xdfffc8dc: 0x000fb900 <- char *domain

/sys/include/libc.h:480: #define NETPATHLEN 40

from sources /sys/src/libc/9sys/dial.c

static int
fillinds(DS *ds, Dest *dp)
{
Conn *conn;

if (dp->winner < 0)
return -1;
conn = &dp->conn[dp->winner];
if (ds->cfdp)
*ds->cfdp = conn->cfd;
if (ds->dir) {
strncpy(ds->dir, conn->dir, NETPATHLEN);
ds->dir[NETPATHLEN] = '\0'; <---- fuck!
}
return conn->dfd;
}

this bug was introduced with the new parallel dial implementation.
the old sequential dial doesnt have this bug so 9front systems are
not affected.

someone make a patch.

--
cinap
erik quanstrom
2012-08-28 01:57:54 UTC
Permalink
if (ds->dir) {
Post by c***@gmx.de
strncpy(ds->dir, conn->dir, NETPATHLEN);
ds->dir[NETPATHLEN] = '\0';
i think this is okay, since the definition is

char dir[NETPATHLEN+1];

- erik
c***@gmx.de
2012-08-28 02:08:04 UTC
Permalink
no.

just look at all the call sites for announce() and dial().

ghost drivers! ghost drivers everywhere!

--
cinap
erik quanstrom
2012-08-28 04:03:52 UTC
Permalink
Post by c***@gmx.de
no.
just look at all the call sites for announce() and dial().
ah, you're right about dial. i misread that. i incorrectly considered
the Conn and not the DS. both dial and announce could use a parameter
declaring the size of the buffer.

- erik
Skip Tavakkolian
2012-09-08 17:37:02 UTC
Permalink
I'm seeing dns panics about once a day. It was built from sources that
were updated about a week ago from sources. it's a double free error.
any ideas?

cpue% acid 7196
/proc/7196/text:386 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/386
acid: stk()
abort()+0x0 /sys/src/libc/9sys/abort.c:6
ppanic(p=0x3975c,fmt=0x394ec)+0x146 /sys/src/libc/port/malloc.c:166
D2B(p=0x3975c,v=0xa71f8)+0x5a /sys/src/libc/port/pool.c:968
poolfreel(v=0xa71f8,p=0x3975c)+0x20 /sys/src/libc/port/pool.c:1192
poolfree(p=0x3975c,v=0xa71f8)+0x41 /sys/src/libc/port/pool.c:1327
free(v=0xa7200)+0x23 /sys/src/libc/port/malloc.c:250
mydnsquery(qp=0x1b48e0,udppkt=0x1906c0,len=0x24,medium=0x1)+0x185
/sys/src/cmd/ndb/dnresolve.c:1032
xmitquery(qp=0x1b48e0,depth=0x1,medium=0x1,inns=0x1,obuf=0x1906c0,len=0x24)+0x227
/sys/src/cmd/ndb/dnresolve.c:1114
tcpquery(qp=0x1b48e0,waitms=0x9e4,obuf=0x1906c0,ibuf=0x21db20,depth=0x1,inns=0x1,len=0x24,req=0x4486,mp=0xdfffe36c)+0xea
/sys/src/cmd/ndb/dnresolve.c:1353
queryns(qp=0x1b48e0,obuf=0x1906c0,depth=0x1,inns=0x1,waitms=0x9e4,ibuf=0x21db20)+0x4d3
/sys/src/cmd/ndb/dnresolve.c:1428
udpquery(mntpt=0x3f0e0,qp=0x1b48e0,patient=0x0,depth=0x1,inns=0x1)+0x1b7
/sys/src/cmd/ndb/dnresolve.c:1578
netquery(depth=0x1,qp=0x1b48e0)+0x2b5 /sys/src/cmd/ndb/dnresolve.c:1660
netqueryns(qp=0x1b48e0,nsrp=0x21da60,depth=0x1)+0x1e
/sys/src/cmd/ndb/dnresolve.c:338
issuequery(class=0x1,qp=0x1b48e0,depth=0x0,name=0x18ee00,recurse=0x0)+0x1e9
/sys/src/cmd/ndb/dnresolve.c:413
dnresolve1(name=0x18ee00,type=0x1,class=0x1,req=0xdfffe740,depth=0x0,recurse=0x0)+0x25c
/sys/src/cmd/ndb/dnresolve.c:505
dnresolve(status=0x0,depth=0x0,rooted=0x1,name=0x18ee00,class=0x1,type=0x1,req=0xdfffe740,cn=0xdfffe774,recurse=0x0)+0xa8
/sys/src/cmd/ndb/dnresolve.c:198
doextquery(mp=0xdfffe760,req=0xdfffe740,recurse=0x0)+0x64
/sys/src/cmd/ndb/dnserver.c:186
dnserver(repp=0xdfffe760,reqp=0xdfffe788,rcode=0x0,srcip=0xdfffe7b0,req=0xdfffe740)+0x1fa
/sys/src/cmd/ndb/dnserver.c:83
dnudpserver(mntpt=0x3f0e0)+0x58b /sys/src/cmd/ndb/dnudpserver.c:255
main(argv=0xdfffefb0,argc=0x0)+0x319 /sys/src/cmd/ndb/dns.c:264
_main+0x31 /sys/src/libc/386/main9.s:16
Post by erik quanstrom
Post by c***@gmx.de
no.
just look at all the call sites for announce() and dial().
ah, you're right about dial. i misread that. i incorrectly considered
the Conn and not the DS. both dial and announce could use a parameter
declaring the size of the buffer.
- erik
c***@gmx.de
2012-09-08 18:02:13 UTC
Permalink
look at:

free(v=0xa7200)+0x23 /sys/src/libc/port/malloc.c:250

notice the lsb is 00 exactly like in the case i solved here:

http://9fans.net/archive/2012/08/249

i hate to repeat myself all the time. geoff needs to fix dial().

--
cinap
c***@gmx.de
2012-09-08 18:18:08 UTC
Permalink
look at this:

free(v=0xa7200)+0x23 /sys/src/libc/port/malloc.c:250

thats exactly what happend here:

http://9fans.net/archive/2012/08/249

its not a double free. dial() corrupts your stack. geoff needs
to fix fial().

--
cinap
Skip Tavakkolian
2012-09-08 21:53:35 UTC
Permalink
I there a patch? it's not obvious from the archive posting what
should be changed.

-Skip
Post by Skip Tavakkolian
free(v=0xa7200)+0x23 /sys/src/libc/port/malloc.c:250
http://9fans.net/archive/2012/08/249
its not a double free. dial() corrupts your stack. geoff needs
to fix fial().
--
cinap
c***@gmx.de
2012-09-08 22:27:04 UTC
Permalink
its pretty simple, just look where i made the <--- arrow in
my original post.

a patch would might look like this:

static int
fillinds(DS *ds, Dest *dp)
{
Conn *conn;
if (dp->winner < 0)
return -1;
conn = &dp->conn[dp->winner];
if (dp->cfdp)
*ds->cfdp = conn->cfd;
if (ds->dir) {
- strncpy(ds->dir, conn->dir, NETPATHLEN);
- ds->dir[NETPATHLEN] = '\0';
+ strncpy(ds->dir, conn->dir, NETPATHLEN-1);
+ ds->dir[NETPATHLEN-1] = '\0';
}
return conn->dfd;
}

to be clear, everyone seems to get confused with conn->dir vs ds->dir.
conn->dir has NETPATHLEN+1 capacity (why? makes no sense..). theres no
overflow at conn->dir. but ds->dir is a pointer to the connection dir
string passed by the caller of dial(). this buffer is just 40 (NETPATHLEN)
bytes long (thats its required minimum size), so doing:

ds->dir[NETPATHLEN] = '\0';

will write beyond it.

theres no patch yet. geoff is notified of the issue.

--
cinap
Charles Forsyth
2012-09-09 02:09:21 UTC
Permalink
it makes perfect sense: if the length of the path is NETPATHLEN, length of
a string usually referring to
non-zero bytes, you need to allow for the terminating zero byte.
unfortunately, as you describe, it's a bit pointless because the dnsresolve
callers rely on the manual page
("The path name is guaranteed to be less than 40 bytes long") and use an
explicit constant 40
(others use something random, or know to use the undocumented NETPATHLEN).
confusion!
Post by c***@gmx.de
conn->dir has NETPATHLEN+1 capacity (why? makes no sense..).
c***@gmx.de
2012-09-09 02:37:09 UTC
Permalink
ah, so NETPATHLEN was introduced later and is really refering to
the number of non null characters? and there was never an official
named constant for the size of the netdir buffers for dial/annouce?
thats indeed confusing.

--
cinap
Charles Forsyth
2012-09-09 03:01:36 UTC
Permalink
No, I didn't explain it well.
NETPATHLEN isn't named in the manual pages, so ndb/dns doesn't use it.
On the other hand, dial does, but because it's ...LEN, it's an easy mistake
to think of it as the length of a string, which wouldn't include the
terminating zero byte,
instead of the length of the underlying array.
Post by c***@gmx.de
ah, so NETPATHLEN was introduced later and is really refering to
the number of non null characters? and there was never an official
named constant for the size of the netdir buffers for dial/annouce?
thats indeed confusing.
--
cinap
erik quanstrom
2012-09-09 04:26:25 UTC
Permalink
No, I didn't explain it well. NETPATHLEN isn't named in the manual
pages, so ndb/dns doesn't use it. On the other hand, dial does, but
because it's ...LEN, it's an easy mistake to think of it as the length
of a string, which wouldn't include the terminating zero byte, instead
of the length of the underlying array.
indeed, a fair number of recent dns issues have been this sort of confusion.
there is something in ndb/dns (at least for me) that induces confusion.

- erik
Skip Tavakkolian
2012-09-09 05:28:33 UTC
Permalink
Thanks all.

it's interesting that announce.c doesn't have the same problem:

cpue% grep -n NETPATHLEN /sys/src/libc/*/*.[ch]
/sys/src/libc/9sys/announce.c:73: strncpy(dir, buf, NETPATHLEN);
/sys/src/libc/9sys/announce.c:74: dir[NETPATHLEN-1] = 0;
/sys/src/libc/9sys/announce.c:124: strncpy(newdir, buf, NETPATHLEN);
/sys/src/libc/9sys/announce.c:125: newdir[NETPATHLEN-1] = 0;
/sys/src/libc/9sys/dial.c:48: char dir[NETPATHLEN+1];
/sys/src/libc/9sys/dial.c:229: strncpy(ds->dir, conn->dir, NETPATHLEN);
/sys/src/libc/9sys/dcpuial.c:230: ds->dir[NETPATHLEN] = '\0';
/sys/src/libc/9sys/dial.c:459: snprint(conn->dir, NETPATHLEN,
"%s/%s", cname, name);
Post by erik quanstrom
No, I didn't explain it well. NETPATHLEN isn't named in the manual
pages, so ndb/dns doesn't use it. On the other hand, dial does, but
because it's ...LEN, it's an easy mistake to think of it as the length
of a string, which wouldn't include the terminating zero byte, instead
of the length of the underlying array.
indeed, a fair number of recent dns issues have been this sort of confusion.
there is something in ndb/dns (at least for me) that induces confusion.
- erik
c***@gmx.de
2012-08-25 21:56:47 UTC
Permalink
looking for bugs, that mydnsquery() function uses netmkaddr(),
which stores the resulting dialstring in a global static buffer.
the tcplock in the query makes sure theres just one tcp dial
going on *per query* but there could be multiple queries in
parallel i think which might override that buffer and corrupt
it.

its probably harmless (dial might try to connect to the wrong
machine) and not the cause of your crash tho.

--
cinap
c***@gmx.de
2012-08-26 13:16:06 UTC
Permalink
anyone look at rrcopy() in dn.c, it seems it assumes that structures
like rr->soa, rr->cert, rr->key, rr->sig and rr->null where embedded
into the RR struct but this is not true. (porting bug?)

when you look at rralloc(), you can see that these sub structures are
indeed malloced so rrcopy needs to copy these too (like it does for the
Tsrv case).

this would explain the double frees one sees indeed.

--
cinap
Charles Forsyth
2012-08-26 13:36:12 UTC
Permalink
It does. In fact, it's the Tsrv case that looks wrong (it leaks space
already allocated by rralloc).

nrp = rralloc(rp->type);
...
case Tsoa:
soa = nrp->soa; --- save space allocated by rralloc
*nrp = *rp; --- destroy pointer by copying onto it
nrp->soa = soa; --- restore pointer
*nrp->soa = *rp->soa; ---- copy substructure.
...
Post by c***@gmx.de
when you look at rralloc(), you can see that these sub structures are
indeed malloced so rrcopy needs to copy these too (like it does for the
Tsrv case).
Benjamin Huntsman
2012-08-21 05:08:28 UTC
Permalink
Has anyone else been seeing their Plan 9 dns servers work for a little while and then stop responding?
I had that, and other problems with it left and right. I probably had to restart DNS a few times per week. I was able to get it to support a local Windows Active Directory domain though, but it was painful, so I switched to BIND on FreeBSD. I'd love to switch back someday though... the Plan 9-based DNS server was much simpler to deal with from a config standpoint...

-Ben
c***@gmx.de
2012-08-25 13:44:17 UTC
Permalink
yes, missed that.

--
cinap
c***@gmx.de
2012-08-26 14:48:45 UTC
Permalink
9fans stoped sending me mail again, i read your response
on the 9fans archive.

you are indeed correct, i missed that it used rralloc() wich
allocates all the structures. so false alarm from me again,
but at least you spoted the memory leak.

another thing, the DN* rr->sig->signer doesnt seem to get marked
in dnchangenever() and dnageall(), which would cause it to get
accidently freed unless its referenced somewhere else no?

--
cinap
Continue reading on narkive:
Loading...