Discussion:
[9fans] devmnt crash, fix.
(too old to reply)
erik quanstrom
2012-08-22 18:53:00 UTC
Permalink
i'm probablly wrong again, but we had a crash with this back trace.
the process doing the i/o was given a note. ...

src(0xe010a6b9); // dumpstack+0x10
src(0xe0112653); // panic+0x112
src(0xe010a8d7); // fault386+0x17d
src(0xe0109dc2); // trap+0x15d
src(0xe010066f); // forkret
//passing interrupt frame; last pc found at sp=0xf4ea5d94
src(0xe01babbd); // lockloop+0x34
src(0xe01bacf6); // lock+0x103
src(0xe01aeed8); // wakeup+0x18
src(0xe0111a91); // mountmux+0xf1
src(0xe01115b6); // mountio+0x21d
src(0xe011127a); // mountrpc+0x27
src(0xe011116f); // mntrdwr+0xfb
src(0xe0110faa); // mntread+0x126
src(0xe01b90e3); // sysexec+0x172
src(0xe010abbb); // syscall+0x238
src(0xe0100cb5); // _syscallintr+0x18
...

i believe the unlock(m) needs to be moved to just before the return
because mntflushfree -> mntqrm can run after we drop the lock and
before the wakeup.

- erik

----

void
mountmux(Mnt *m, Mntrpc *r)
{
Mntrpc **l, *q;

lock(m);
l = &m->queue;
for(q = *l; q; q = q->list) {
/* look for a reply to a message */
if(q->request.tag == r->reply.tag) {
*l = q->list;
if(q != r) {
/*
* Completed someone else.
* Trade pointers to receive buffer.
*/
q->reply = r->reply;
q->b = r->b;
r->b = nil;
}
q->done = 1;
unlock(m);
if(mntstats != nil)
(*mntstats)(q->request.type,
m->c, q->stime,
q->reqlen + r->replen);
if(q != r)
mountmux+0xf1 wakeup(&q->r);
return;
}
l = &q->list;
}
unlock(m);
print("unexpected reply tag %ud; type %d\n", r->reply.tag, r->reply.type);
}
c***@gmx.de
2012-08-22 19:12:12 UTC
Permalink
ohhh! i see :)

--
cinap
c***@gmx.de
2012-08-22 19:09:38 UTC
Permalink
the mntrpc m->queue is protected by the mnt lock. when
mountmux() unlocks m, it has already removed the rpc from
the queue so how would mntqrm() get to the rpc in that case?

any other cases where the rpc is taken somewhere and then
the mnt lock is (maybe just temporarily) released and then
used again?

--
cinap
c***@gmx.de
2012-08-22 19:53:23 UTC
Permalink
i think this is not so mutch mntflushfree() / mntqrm(), but when we
complete a request for someone else (q != r) and we release m,
the other proc could indeed be woken up already freeing the rpc
while we are are trying to wakeup on that "done with" rpc?

--
cinap
erik quanstrom
2012-08-22 22:26:51 UTC
Permalink
Post by c***@gmx.de
i think this is not so mutch mntflushfree() / mntqrm(), but when we
complete a request for someone else (q != r) and we release m,
the other proc could indeed be woken up already freeing the rpc
while we are are trying to wakeup on that "done with" rpc?
that's what i ment.

- erik
c***@gmx.de
2012-08-22 23:30:24 UTC
Permalink
i think we'r seeing exactly what russ described on 9fans here:

http://9fans.net/archive/2011/02/358

after we set q->done = 1; (the unlock of m probably doesnt even
matter) it might be possible for mountio()'s sleep() call to return
immidiately and return, freeing the rpc before mountmux()
on another proc/cpu even call wakeup() and potentialy hitting freed
memory.

if this theory is true (anyone seeing what prevents this from
happening?) one trick could be to use a Rendez not embedded in the
rpc structure. If we make the Rendez into a Rendez* which we
take a local copy of before we set q->done = 1; in mountmux(),
and in mountio assign r->r = &up->sleep; (or maybe even dedicate
a new Rendez in the proc structure?) then its safe to call
wakeup even if the rpc got already freed. This might produce
spurious wakeups on that rendez but thats not a big problem
(for devmnt's case at least) as we always check rpc->done and
repeat the sleep on spurious wakeup. Doing interlocked refcounting
to synchronize the wakeup seems like overkill and would me mutch
more complicated i think.

suggestions?

--
cinap
erik quanstrom
2012-08-23 01:53:47 UTC
Permalink
Post by c***@gmx.de
http://9fans.net/archive/2011/02/358
after we set q->done = 1; (the unlock of m probably doesnt even
matter) it might be possible for mountio()'s sleep() call to return
immidiately and return, freeing the rpc before mountmux()
on another proc/cpu even call wakeup() and potentialy hitting freed
memory.
devaoe (9atom version) deals with a similar problem. see strategy().

- erik
c***@gmx.de
2012-08-23 09:47:14 UTC
Permalink
can you provide the source file, dont have current 9atom iso handy
and the stuff i found on the net doesnt seem to cut it. however,
i found http://www.quanstro.net/plan9/sleepwake.html which at the
end gives a solution to the problem:

void
dowake(Io *i)
{
i->cond = 1;
wakeup(i);
i->cond = 2; /* atomic */
/* hands off i now */
}
void
doio(void)
{
Io *i;

dispatch(i = malloc(sizeof *i));
while(waserror())
/* eat note */;
sleep(i, iodone, i);
poperror();
while(i->cond != 2)
/* rendez not done */;
free(i);
}

that would be relatively easy to apply to devmnt and doesnt have the
problem of spurious wakeups with abusing up->sleep as rendez.

--
cinap
erik quanstrom
2012-08-23 13:27:06 UTC
Permalink
Post by c***@gmx.de
can you provide the source file, dont have current 9atom iso handy
here's the full kernel source: http://ftp.quanstro.net/other/kernel.mkfs.bz2
Post by c***@gmx.de
and the stuff i found on the net doesnt seem to cut it. however,
i found http://www.quanstro.net/plan9/sleepwake.html which at the
this was actually distilled from the solution that devaoe uses. the
main simplification is in the case of a note. which is key in this case,
since it would be very annoying for devmnt to ignore notes.
Post by c***@gmx.de
that would be relatively easy to apply to devmnt and doesnt have the
problem of spurious wakeups with abusing up->sleep as rendez.
any time you use sleep/wakeup, you're going to have to deal with
spurious wakeups. it does not matter which rendezvous structure
you use. it doesn't matter if it's shared or not. it's perfectly valid
for sleep to return for no reason at all!

- erik
c***@gmx.de
2012-08-23 15:02:02 UTC
Permalink
why? are you counting notes as spurious wakeups? i would not count
them as such because sleep will not return in that case but do a
error jump out of it instead no? what other spurious wakeup sources
are there?

--
cinap
erik quanstrom
2012-08-23 15:48:05 UTC
Permalink
Indeed. One way to catch sleep() errors is to change wakeup() to wake *everyone* up.
i was thinking on this and it occurred to me ....

to make the point by absurdity, consider this approximation
of sleep and wakeup without notes.

void
sleep(Rendez*, void (*)(void*), void*)
{
}

void
wakeup(Rendez*)
{
}

these have the property that you may wake without the condition
but you may not miss a wakeup. it also doesn't matter if the condition
happens while sleep is executing. and there is no hazard between
the two. so i think this impemenation does satisfy the all
the requirements!

i didn't say it was efficient. :-)

i think what cinap is thinking is, how could devmnt's particular
use of sleep/wakeup with this particular implementation of sleep/wakeup
evade handling spurious wakeups. if that's what the thought is,
i would resist it. it may be that someone thinks of a very efficient
way of re-implementing sleep/wakeup that may have slightly different
properties. as long as the client and sleep/wakeup play by the rules,
this should not be a problem. (it's possible to do plan 9-style sleep
wakeup within, say, the windows kernel.)

- erik
Bakul Shah
2012-08-23 15:39:02 UTC
Permalink
Post by erik quanstrom
any time you use sleep/wakeup, you're going to have to deal with
spurious wakeups. it does not matter which rendezvous structure
you use. it doesn't matter if it's shared or not. it's perfectly valid
for sleep to return for no reason at all!
Indeed. One way to catch sleep() errors is to change wakeup() to wake *everyone* up.
c***@gmx.de
2012-08-23 15:53:24 UTC
Permalink
pwait() in port/proc.c would crash in that case if you wake it up
with up->waitq == nil.

or was this ment as a joke?

--
cinap
erik quanstrom
2012-08-23 16:06:56 UTC
Permalink
Post by c***@gmx.de
pwait() in port/proc.c would crash in that case if you wake it up
with up->waitq == nil.
how would that happen? with the approperiate locks held (waitqr),
up->waitq == nil is tested.
Post by c***@gmx.de
or was this ment as a joke?
no joke.

- erik
c***@gmx.de
2012-08-23 16:18:58 UTC
Permalink
pwait():

...

sleep(&up->waitr, haswaitq, up);

lock(&up->exl);
wq = up->waitq;
up->waitq = wq->next; <-- wq == nil, boom! its gone!
up->nwait--;
unlock(&up->exl);


if sleep returns or is spuriously woken up even tho up->waitq == nil.

--
cinap
erik quanstrom
2012-08-23 16:24:59 UTC
Permalink
Post by c***@gmx.de
sleep(&up->waitr, haswaitq, up);
lock(&up->exl);
wq = up->waitq;
up->waitq = wq->next; <-- wq == nil, boom! its gone!
up->nwait--;
unlock(&up->exl);
if sleep returns or is spuriously woken up even tho up->waitq == nil.
ah, right. looks like a bug.

- erik
Charles Forsyth
2012-08-23 16:47:10 UTC
Permalink
No. It's not true that all sleeps might return spuriously (it might have
been true in Unix, I can't remember).
It's not true in Plan 9.
Post by erik quanstrom
Post by c***@gmx.de
sleep(&up->waitr, haswaitq, up);
lock(&up->exl);
wq = up->waitq;
up->waitq = wq->next; <-- wq == nil, boom! its gone!
up->nwait--;
unlock(&up->exl);
if sleep returns or is spuriously woken up even tho up->waitq == nil.
ah, right. looks like a bug.
- erik
c***@gmx.de
2012-08-23 17:01:03 UTC
Permalink
yes, this is what i would'v expected. so who made up that rule
that sleep might spuriously wakeup for no reason or that kernel
code should be written as if sleep/wakeup being troll implementations?

--
cinap
Bakul Shah
2012-08-23 17:53:25 UTC
Permalink
Post by Charles Forsyth
No. It's not true that all sleeps might return spuriously (it might have
been true in Unix, I can't remember).
It's not true in Plan 9.
It is not that sleep might return spuriously (that is, the
wakeup condition is still false) but that it may have *become*
false once again due to an action by another process before
the sleeper has had a chance to run. To handle this one would
use a pattern like this

while (!wakeup_condition(&foo))
sleep(&foo);

I haven't looked at this in ages but this used to be the case
in unix. So here awakening everyone *spuriously* is /safe/
(but not efficient).

doc/sleep.ps says
We assume that a particular call to sleep corresponds to a
particular call to wakeup, that is, at most one process is
asleep waiting for a particular event.

But it is not clear to me that another process can't get in to
change the condition between the time wakeup() is called and
the sleeper actually runs -- the above condition is not
violated as the new process won't be sleeping. But if yet
another process tries to access the same condition, it will be
blocked on getting the rendezvous structure lock?
erik quanstrom
2012-08-23 18:03:13 UTC
Permalink
Post by Bakul Shah
the sleeper actually runs -- the above condition is not
violated as the new process won't be sleeping. But if yet
another process tries to access the same condition, it will be
blocked on getting the rendezvous structure lock?
in plan 9 sleep, only 1 sleeper is allowed.

- erik

erik quanstrom
2012-08-23 17:09:52 UTC
Permalink
Post by Charles Forsyth
No. It's not true that all sleeps might return spuriously (it might have
been true in Unix, I can't remember).
It's not true in Plan 9.
is this in /sys/doc/sleep.ps? i think i'm missing it.

- erik
Loading...