c***@gmx.de
2012-02-25 03:29:33 UTC
discovered odd behaviour on an mp system. it was running rchttpd and
werc and after like 3 or 7 days of load, broken sed and grep processes
appeared in the process table. inspecting the processes with acid
yields a strange picture. the processes crashed (or aborted themselves)
before any data was read from stdin, just after allocating some memory
(grep uses sbrk() directly, whereas sed uses pool malloc). in the case
of grep, the global bloc seemed to have been reset to a past value,
and sed's mainmem structure was also inconsistent with reality.
while trying to put the pieces together, something interesting came up.
duppage() is called by fixfault() for COW, with a locked, image backed,
non shared page. it makes a new copy for the image cache, and then
removes the original page from the image cache.
to do this, it has to allocate a new page from the page allocator,
temporarily unlocking the page. what we observe is that when duppage()
reacquires the page lock, the page's refcount sometimes is >1, meaning
another processor just grabbed that page out of the image cache.
(tested this with a print() and it triggered multiple times right after boot)
fixfault still assumes the page to be non shared and inserts it into
the process pagetable.
a change that rechecks the refcount after calling duppage() in
fixfault() and does a copy like for the ref > 1 case seems to have
made the problem go away. (system has been running for 8 days now)
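
to make the race and the recheck concrete, here is a small single threaded
model in plain C. it is not the kernel code: the Page fields, the racer hook,
grabpage(), cowfault() and the recheck flag are all made up for illustration,
and the window where the real duppage() drops the page lock is simulated by
calling racer at that point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { PGSZ = 16 };

typedef struct Page Page;
struct Page {
	int ref;		/* number of users of this page */
	int cached;		/* nonzero while the page is in the image cache */
	char data[PGSZ];
};

/* hypothetical hook: what another processor does while the lock is dropped */
void (*racer)(Page*);

/* the other processor finds the page in the image cache and takes a reference */
void
grabpage(Page *p)
{
	if(p->cached)
		p->ref++;
}

/* model of duppage(): hand the cache entry over to a fresh copy */
void
duppage(Page *p, Page *copy)
{
	assert(p->cached);
	/* the page lock is dropped here to allocate the copy; model that window */
	if(racer != NULL)
		racer(p);
	memcpy(copy->data, p->data, PGSZ);
	copy->ref = 0;
	copy->cached = 1;
	p->cached = 0;
}

/* model of the COW write path; returns the page that gets mapped writable */
Page*
cowfault(Page *p, Page *cachecopy, Page *private, int recheck)
{
	int shared;

	shared = p->ref > 1;
	if(!shared && p->cached){
		duppage(p, cachecopy);
		if(recheck)
			shared = p->ref > 1;	/* look at the refcount again */
	}
	if(shared){
		/* same as the ordinary ref > 1 case: map a private copy */
		memcpy(private->data, p->data, PGSZ);
		private->ref = 1;
		private->cached = 0;
		p->ref--;
		return private;
	}
	return p;
}
```

with racer set to grabpage and recheck off, cowfault() hands back the original
page although its ref is now 2, which is the situation described above. with
recheck on, the faulting process gets its own private copy instead.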
can anyone with an mp system confirm this?
--
cinap