Discussion:
[9fans] slow performance
pedro henrique antunes de oliveira
2007-03-31 16:42:31 UTC
Permalink
Well, I don't know if this is a Plan 9 question in itself.

I've written a program using a data structure, and the program takes about 30
seconds to finish the task. Some time later I compiled the same program
under another operating system (on the same computer), and there the program
finishes the task in 3 seconds.


Does anyone know what it could be?
Lorenzo Fernando Bivens de la Fuente
2007-03-31 16:50:55 UTC
Permalink
Post by pedro henrique antunes de oliveira
I've written a program using a data structure, and the program takes about 30
seconds to finish the task. Some time later I compiled the same program
under another operating system (on the same computer), and there the program
finishes the task in 3 seconds.
Are you running plan 9 native or on qemu/vmware/bla bla..?
What language are you using?
What kind of data structure?

Mmmm.... I haven't experienced any performance problems myself... Au
contraire, Plan 9 has gifted me with excellent performance...

Cheers!
Charles Forsyth
2007-03-31 16:53:05 UTC
Permalink
a few hints as to what it does and how it goes about that
always come in useful during performance analysis.
before that, {man 1 prof} would probably be helpful.
pedro henrique antunes de oliveira
2007-03-31 16:58:31 UTC
Permalink
Lorenzo, I'm not using a virtual machine; Plan 9 is installed on my computer
alongside the other operating system.
I'm using C and the data structure is a binary search tree. I think that I
don't need to explain what a binary search tree does.
I'll read prof(1).
W B Hacker
2007-03-31 16:56:27 UTC
Permalink
Post by pedro henrique antunes de oliveira
Well, I don't know if this is a Plan 9 question in itself.
I've written a program using a data structure, and the program takes about 30
seconds to finish the task. Some time later I compiled the same program
under another operating system (on the same computer), and there the program
finishes the task in 3 seconds.
Does anyone know what it could be?
No definitive answer from me.

But a guess: ISTR seeing that the Plan9 kernel 'lacked a scheduler'.
That can be inconsequential for some situations, very important for others.

That may be stale or even just plain wrong info.

In any case, I'd like to know more myself about the relative effectiveness of the
'native' Plan9 & its kernel vs other 'real', not 'virtual' OS'en.

Being able to 'plumb' and network inherently 'better' is not all that useful if
a less-elegant OS does better with the BFBI API set.

Any benchmarks (yeah they lie, but ....) around?

Bill
Uriel
2007-03-31 17:13:00 UTC
Permalink
Post by W B Hacker
But a guess: ISTR seeing that the Plan9 kernel 'lacked a scheduler'.
That can be inconsequential for some situations, very important for others.
Could you please explain this? I'm still baffled as to what you mean.

As far as my limited knowledge goes, Plan 9 has had an SMP aware
scheduler since ancient times[1](I think before 1st Ed), and in more
recent times a real time scheduler[2] has been added.

So I'm puzzled as to how Plan 9 could 'lack a scheduler'.

I also would recommend at least taking a look at the performance
section of http://plan9.bell-labs.com/sys/doc/9.html

uriel

[1] http://plan9.bell-labs.com/sys/doc/sleep.html
[2] http://purl.org/utwente/fid/1149
Charles Forsyth
2007-03-31 17:15:35 UTC
Permalink
Post by W B Hacker
But a guess: ISTR seeing that the Plan9 kernel 'lacked a scheduler'.
see man 3 proc and /sys/src/9/port/proc.c
amongst other things
Armando Camarero
2007-03-31 18:50:57 UTC
Permalink
Could it run faster on other operating systems because of the compiler's
optimizations of your program? Maybe 8c performs fewer optimizations and
the code runs slower because of that.

Armando.
C H Forsyth
2007-03-31 21:06:45 UTC
Permalink
a factor of 10 on a binary tree needs some investigation.
i'd be surprised if it's a compiler optimisation (given that it's
recursive, and thus full of function calls, which might i suppose
be tail recursive, but we don't know yet either way).
pedro henrique antunes de oliveira
2007-03-31 23:26:30 UTC
Permalink
Well, thank you Bill and Forsyth.

I'm posting the source code here:
http://www.freewebtown.com/phao/t.c (it is very small)

and the `time PROGRAM' output:

(under 'other' OS)
% cc t.c
% time a.out
2.86user 0.09system 0:02.97elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (80major+792minor)pagefaults 0swaps
%


(under Plan9)
% 8c -FV t.c
% 8l -o t t.8
% time t
37.69u 0.00s 44.43r t

I hope this helps. But two people have answered that it could be compiler optimization,
and if that is it, then it is all OK.
erik quanstrom
2007-04-01 01:36:03 UTC
Permalink
when you compiled this on plan 9, you must have
either replaced stdlib.h with something else or used
ape. i substituted libbio and got almost the same performance
you did on linux. i used my home cpu server, an amd64 revF 3800+.

cpu% 8c -FVw t.c
warning: t.c:64 set and not used: t
cpu% 8l -o t t.8
cpu% time t >/tmp/xyzw
2.08u 0.05s 2.36r t

i think the reason for the difference in performance is that somehow
your translation of print was calling write(2) for each integer printed.
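
A minimal sketch of the effect described above, written in portable C rather than Plan 9 C so that it runs anywhere. It is not the code from t.c: Out, sink, putint_buffered, putint_unbuffered and count_writes are names invented for illustration, and sink only counts calls where a real program would call write(2). The point is just that one call per integer costs as many calls as there are integers, while a buffer collapses them into a handful.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { BUFSZ = 8192 };	/* assumed buffer size, chosen for the example */

typedef struct Out Out;
struct Out {
	char buf[BUFSZ];
	size_t n;	/* bytes currently waiting in buf */
	long nwrites;	/* how many times the sink was called */
	long nbytes;	/* total bytes handed to the sink */
};

/* Stand-in for write(2): it only counts what it is given. */
static void
sink(Out *o, const char *p, size_t len)
{
	(void)p;
	o->nwrites++;
	o->nbytes += (long)len;
}

/* Hand any pending bytes to the sink in a single call. */
static void
outflush(Out *o)
{
	if(o->n > 0){
		sink(o, o->buf, o->n);
		o->n = 0;
	}
}

/* Unbuffered: every integer becomes its own sink call. */
static void
putint_unbuffered(Out *o, int v)
{
	char tmp[32];
	int len = snprintf(tmp, sizeof tmp, "%d\n", v);

	assert(len > 0 && (size_t)len < sizeof tmp);
	sink(o, tmp, (size_t)len);
}

/* Buffered: append to buf, calling the sink only when buf would overflow. */
static void
putint_buffered(Out *o, int v)
{
	char tmp[32];
	int len = snprintf(tmp, sizeof tmp, "%d\n", v);

	assert(len > 0 && (size_t)len < sizeof tmp);
	if(o->n + (size_t)len > (size_t)BUFSZ)
		outflush(o);
	memcpy(o->buf + o->n, tmp, (size_t)len);
	o->n += (size_t)len;
}

/* Print 0..n-1 one way or the other and report how many sink calls it took. */
long
count_writes(int n, int buffered)
{
	Out o;
	int i;

	o.n = 0;
	o.nwrites = 0;
	o.nbytes = 0;
	for(i = 0; i < n; i++){
		if(buffered)
			putint_buffered(&o, i);
		else
			putint_unbuffered(&o, i);
	}
	outflush(&o);
	return o.nwrites;
}
```

Calling count_writes(1000, 0) reports 1000 sink calls, while count_writes(1000, 1) reports a single one, because the 3890 bytes of output fit in one buffer. On Plan 9 the buffered role would be played by libbio (Bprint), as mentioned above.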

assuming i understand the point of this program -- sorting a bunch of
integers, i rewrote this program using an array and qsort, which should
be more memory efficient and faster. here's what i get

cpu% time ./u >/tmp/xyz
0.08u 0.01s 0.22r ./u
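
The rewritten program itself was not posted, so the following is only a guess at its general shape: the integers sit in a plain array and are sorted with the standard qsort. The names intcmp and sortints are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort; avoids x - y, which can overflow. */
static int
intcmp(const void *a, const void *b)
{
	int x = *(const int*)a;
	int y = *(const int*)b;

	return (x > y) - (x < y);
}

/* Sort n integers in place, in ascending order. */
void
sortints(int *v, size_t n)
{
	assert(v != NULL || n == 0);
	if(n > 1)
		qsort(v, n, sizeof v[0], intcmp);
}
```

Compared with a tree, the array needs no per-element pointers and no per-element allocation, which is presumably where the memory and speed advantage comes from.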

- erik
Charles Forsyth
2007-04-01 09:24:59 UTC
Permalink
Post by pedro henrique antunes de oliveira
But two people have answered that it could be compiler optimization.
no: i said `it requires investigation'. several others seemed to be
`speculating ahead of the data'. in fact, without asking for data of any sort.
no: must be those compilers. indeed, given the little data we had
(binary trees and thus [probably] recursive calls)
i was hinting that it was UNLIKELY to be compiler related,
and the factor of 10 was likely to be significant, but i still wanted the data.
Post by pedro henrique antunes de oliveira
So, Plan 9 wasn't made for, hmm, heavy computing tasks, like 2 million calls of recursive functions
working with some complex data structure (not my data structure, hehe), or something like that?
back to work...
pedro henrique antunes de oliveira
2007-04-01 09:55:06 UTC
Permalink
heheh.. sorry about those 2 #include lines.
It was because I'm on the 'other' OS and I was making changes to make the
code identical to the
Plan 9 code; I forgot to change the #include lines.

OK, qsort is faster, but I'm studying that kind of
data structure... I've read in
9.intro about libbio and that buffering output is better, but when I
tested the program with time,
I commented out the line that calls pbst.
pedro henrique antunes de oliveira
2007-04-01 09:59:12 UTC
Permalink
Sorry for the 'flood'. My processor is an Intel Pentium 4, 2.4 GHz, cache size 1 MB.
W B Hacker
2007-03-31 19:28:00 UTC
Permalink
Post by Uriel
Post by W B Hacker
But a guess: ISTR seeing that the Plan9 kernel 'lacked a scheduler'.
That can be inconsequential for some situations, very important for
others.
Could you please explain this? I'm still baffled as to what you mean.
No need. As said - stale info. You've just clarified it, thanks!

But .. there IS a plethora of stale info and - perhaps worse - broken links -
all over what Google finds as the Plan9 'universe'.

Sometimes hard to ascertain the date of the pages, as well.
Post by Uriel
As far as my limited knowledge goes, Plan 9 has had an SMP aware
scheduler since ancient times[1](I think before 1st Ed), and in more
recent times a real time scheduler[2] has been added.
ACK. I've saved a couple of .pdf presentations discussing how it *could be*, but
it was not clear to me that it *had been*.
Post by Uriel
So I'm puzzled as to how Plan 9 could 'lack a scheduler'.
So am I. Though my present interest is not so much lack of a scheduler, as
curiosity as to how it prioritizes (e.g. equivalents to 'nice', runlevels, et al).
Post by Uriel
I also would recommend at least taking a look at the performance
section of http://plan9.bell-labs.com/sys/doc/9.html
uriel
[1] http://plan9.bell-labs.com/sys/doc/sleep.html
[2] http://purl.org/utwente/fid/1149
All great stuff - and outlining what the benefits of Plan9 were/are/should be.

But dated. Been a while since a 100 MHz Irix was top dog, if ever was.

How much, if any, of what Plan9 is/was has been overtaken by events? And for
better? Or for worse?

For example - Alef's parallelizing features are noted, extolled, and:

"Although it is possible to write parallel programs in C, Alef is the parallel
language of choice."

Which raises the question (in my own mind at least) as to how much and how well
this has been preserved and extended in 'C' with Alef having left the building
with Elvis.

And what penalty comes with the benefit of communication by text stream vs
binary? And via a fs call (whether in cache/RAM or not), vs
closer-to-the-CPU-core. Registers, even.

'Universality' and 'portability' are perhaps not such a big deal when very few
processor families are supported.

Granted - Plan9 does not, in most instances, attempt to do things in the same
way - or even do them *at all* - that a 'big iron' OS, a *BSD or Linux might do,
so head-to-head comparisons would certainly not be as easy as point and click
(or ..mouse chord...).

And I *have* seen some impressive figures mentioned for time to boot a large
grid from a cold start vs other OS'en.

But where can I/we find 'evidence' - current evidence - that all this is more
than a theoretical exercise?

A place where Plan9 holds the high ground in real-world use, so to speak.

Thanks for the patience...

Bill
erik quanstrom
2007-03-31 23:22:05 UTC
Permalink
Post by W B Hacker
But dated. Been a while since a 100 MHz Irix was top dog, if ever was.
you belie your youth. ;-) 1990 was a long time ago. i'm not sure what
"overtaken by events" means. one question that is easier to answer is: does
plan 9 scale with today's processors and networks?

here's some observations.

as slow as 100MHz seems today, that's 1/30th of the speed of
modern processors. improvements in networking have been even
more dramatic. from the table at the end of /sys/doc/net/net.ps:

test        throughput   latency
            MB/s         ms
pipes       8.15         0.255
IL/ether    1.02         1.42
URP/dk      0.22         1.75
cyclone     3.2          0.375

in 1990 there was only 10Mbit ethernet, so ~1MB/s was the speed limit
on the wire. today we have 10Gbit/s ethernet, a wire speed of 1250MB/s.
10Mbit ethernet is 1/1000th as fast.

without setting up a test harness, it's hard to get comparable numbers.
but for "cat bigfile > /dev/null" from our fileserver to the main cpu server
over il/1Gbit ether (i82563) i get

IL/gbe      45.8         0.054   (standard frames)

i ran AoE on the same hardware while it was in testing and got
basically wire-speed.

AoE/gbe     112          0.054   (9000byte frames)
AoE/2xgbe   220          0.054   "

the cyclone is the dual-fiber connection between the fileserver and the
main cpu server. it seems quite slow (a modern SATA drive can
easily do 65MB/s in the outer zones) until you realize that on a
contemporaneous pc, you could get ~0.5 MB/s from the hard drive.
Post by W B Hacker
"Although it is possible to write parallel programs in C, Alef is the parallel
language of choice."
Which raises the question (in my own mind at least) as to how much and how well
this has been preserved and extended in 'C' with Alef having left the building
with Elvis.
the thread library provides the same csp primitives that alef did.
Post by W B Hacker
And what penalty comes with the benefit of communication by text stream vs
binary?
not every interface is text-based. /dev/bintime is an example of a binary
interface. the decision to use mostly text-based communications is a real
benefit to plan 9. if you've ever been a couple of rounds with netlink
sockets, ioctl or other unix interfaces, you know what i mean.

the other great thing is that out-of-band information is generally handled
by a separate control file. the plan 9 uart interface has separate ctl, status and data
files.
Post by W B Hacker
And via a fs call (whether in cache/RAM or not), vs
closer-to-the-CPU-core. Registers, even.
not sure what you're getting at here.
Post by W B Hacker
'Universality' and 'portability' are perhaps not such a big deal when very few
processor families are supported.
it's a huge deal as soon as two architectures are supported.
Post by W B Hacker
And I *have* seen some impressive figures mentioned for time to boot a large
grid from a cold start vs other OS'en.
But where can I/we find 'evidence' - current evidence - that all this is more
than a theoretical exercise?
A place where Plan9 holds the high ground in real-world use, so to speak.
if you're looking for an "os for supercomputers", plan 9 might not be your thing.
if you're looking for an "os for programmers", plan 9 might just be your thing.

- erik
pedro henrique antunes de oliveira
2007-03-31 23:35:49 UTC
Permalink
"if you're looking for an "os for supercomputers", plan 9 might not be your
thing.
if you're looking for an "os for programmers", plan 9 might just be your
thing.

- erik"

I never thought of it this way, that an OS can be made for supercomputing use
and so on.
Well, in fact I am not very accustomed to computers yet.

And this could be an answer to my problem.

So, Plan 9 wasn't made for, hmm, heavy computing tasks, like 2 million
calls of recursive functions
working with some complex data structure (not my data structure, hehe), or
something like that?
erik quanstrom
2007-04-01 01:41:59 UTC
Permalink
Post by pedro henrique antunes de oliveira
So, Plan 9 wasn't made for, hmm, heavy computing tasks, like 2 million
calls of recursive functions
working with some complex data structure (not my data structure, hehe), or
something like that?
rather, when there was a conflict between squeezing every last cycle
out of the machine or simplicity, the designers of plan 9 generally
opted for simplicity. when there was a conflict between generality
and performance, the bias was toward generality.

the beauty of plan 9 is the sum of these well-thought-out choices.
as you point out, these choices are always worth revisiting.
they seem to be working well for the time being.

btw, plan 9 is perfectly well-suited to your examples.

- erik
erik quanstrom
2007-04-01 12:13:59 UTC
Permalink
Post by pedro henrique antunes de oliveira
OK, qsort is faster, but I'm studying that kind of
data structure... I've read in
9.intro about libbio and that buffering output is better, but when I
tested the program with time,
I commented out the line that calls pbst.
if i understand you correctly, there must be something wrong
with your test setup. i got about the same runtime
for your original program on linux and plan 9.

- erik
pedro henrique antunes de oliveira
2007-04-01 12:38:32 UTC
Permalink
Sorry, but what do you mean by "test setup"?


Taking advantage of this topic: actions like creating new windows, `cat` something,
`ls` something, well, drawing things on the screen (when I start rio, windows
open and so on), all those actions are slower than here in the other OS (I
think there is no problem in saying which one it is, maybe it can help:
Slackware 10.1).


Well, I want to make it clear that I'm not comparing two OSes and saying that
one is better or something like that; I'm only using it as a reference to say
"well, I'm getting slow performance here, what's wrong?".
Uriel
2007-04-01 14:42:39 UTC
Permalink
That sounds like a video driver issue, depends on what card and what
driver you are using. For me rio on ancient hardware is much faster
than any X window manager on modern hardware.

I have not been able to tell any difference on how fast rio is since
it was first released almost a decade ago, all GUI operations have
always been instantaneous, you can't get much faster than that.

If you find Plan 9 slow for anything, even on ten-year-old hardware,
then there is clearly something wrong with your setup: either your HD
is not doing DMA, you are using VESA and your card doesn't like it, or
something like that.

uriel - who runs a 600MHz CPU server and an even slower thinkpad
terminal and still can compile kernels in seconds.
C H Forsyth
2007-04-01 15:14:03 UTC
Permalink
Post by pedro henrique antunes de oliveira
Taking advantage of this topic: actions like creating new windows, `cat` something,
`ls` something, well, drawing things on the screen (when I start rio, windows
open and so on), all those actions are slower than here in the other OS (I
think there is no problem in saying which one it is, maybe it can help: Slackware 10.1).
on a machine as fast as yours, it's usually the video card interface that slows it down.
can you
cat '#v/vgactl'
and
echo $vgasize - $monitor
pedro henrique antunes de oliveira
2007-04-01 17:36:29 UTC
Permalink
% cat '#v/vgactl'
type vesa
size 1024x768x32 x8r8g8b8
blank time 30 idle 0 state on
hwaccel on
hwblank off
panning off
addr p 0xd0000000 v 0xe0000000 size 0x4000000
% echo $vgasize - $monitor
1024x768x32 - vesa

My video card is a GeForce FX 5200, with 128 MB of memory.
About the driver... I have never installed a driver for it; if there is one,
the system installed it for me. And how can I see whether my HD is doing DMA or
not?
erik quanstrom
2007-04-01 18:12:01 UTC
Permalink
you are using vesa mode because monitor=vesa. you need to set
your monitor. information on doing that is in vgadb(6). once your
monitor information is set properly, the correct driver should be used.

to check for dma, "cat /dev/sdC0/ctl". you will see dma x dmactl y.
x describes the dma capabilities of the drive, y describes which dma
capabilities are being used. so if x is nonzero but y is zero then
"echo -n dma on > /dev/sdC0/ctl" will turn dma on. this can be
done automatically through plan9.ini. "man plan9.ini" for more info.

- erik
ron minnich
2007-04-01 18:26:48 UTC
Permalink
Post by pedro henrique antunes de oliveira
Well, I don't know if this is a Plan 9 question in itself.
I've written a program using a data structure, and the program takes about 30
seconds to finish the task. Some time later I compiled the same program
under another operating system (on the same computer), and there the program
finishes the task in 3 seconds.
Does anyone know what it could be?
you need to restart. You need to learn how to profile that program,
and see where things are going. This whole thread is not going to help
you until you do that.

As for plan 9 and supercomputing, I can say you're in for an
interesting surprise in 2 months. That is all.

thanks

ron
pedro henrique antunes de oliveira
2007-04-01 19:12:08 UTC
Permalink
About the profile information: I compiled and linked the program this
way

% 8c -FV tds.c   // this is the source I posted, with the print call commented out
% 8l -p tds.8
% 8.out          // it outputs an error
1 Prof errors
% prof prof.392 8.out
http://phpfi.com/222234   // the output of the prof command


Well, the fact is that I don't know how it can help.
g***@plan9.bell-labs.com
2007-04-01 19:27:11 UTC
Permalink
Before running 8.out, increase $profsize to eliminate the `Prof error':

profsize=40000

I profiled this program and, after using Bprint for output, found that
allocating one node at a time was using much of the CPU time;
allocating a few thousand nodes at a time helps. I also had to fix
some bugs; compile it with

8c -FTVw tds.c

to see them. The next biggest cost was tail recursion to find the
correct tree node for insertion; converting the tail recursion to
explicit iteration reduced the run time noticeably. Given the large
number of nodes and the trivial nature of the operations on them, even
small coding changes can produce measurable improvements. Final times
are:

; time 8.out >/dev/null
3.84u 0.01s 3.90r 8.out
; cat /dev/cputype
PentiumII/Xeon 333
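
The modified source was not posted, so the following is only a sketch of the two changes described above, in portable C: nodes are handed out from chunks of a few thousand instead of one malloc per node, and insertion walks the tree in a loop instead of recursing. Node, CHUNK, newnode, insert and inorder are hypothetical names, the chunk size is an assumption, and chunks are never freed here.

```c
#include <assert.h>
#include <stdlib.h>

enum { CHUNK = 4096 };	/* nodes per malloc call; an assumed value */

typedef struct Node Node;
struct Node {
	int key;
	Node *left;
	Node *right;
};

static Node *pool;	/* chunk currently being handed out */
static int nfree;	/* unused nodes left in that chunk */

/* Take the next node from the chunk, allocating a new chunk when empty. */
static Node*
newnode(int key)
{
	Node *n;

	if(nfree == 0){
		pool = malloc(CHUNK * sizeof *pool);
		if(pool == NULL)
			return NULL;
		nfree = CHUNK;
	}
	n = &pool[--nfree];
	n->key = key;
	n->left = NULL;
	n->right = NULL;
	return n;
}

/*
 * Iterative insertion: p always points at the link that will hold the
 * new node, so no recursion is needed. Equal keys go to the right.
 * Returns 0 on success, -1 if memory ran out.
 */
int
insert(Node **root, int key)
{
	Node **p;

	assert(root != NULL);
	for(p = root; *p != NULL; )
		p = key < (*p)->key ? &(*p)->left : &(*p)->right;
	*p = newnode(key);
	return *p == NULL ? -1 : 0;
}

/* Copy the keys into out[] in ascending order; returns the next free index. */
int
inorder(const Node *n, int *out, int i)
{
	if(n == NULL)
		return i;
	i = inorder(n->left, out, i);
	out[i++] = n->key;
	return inorder(n->right, out, i);
}
```

With chunked allocation, two million insertions need a few hundred malloc calls rather than two million. The price is that individual nodes can no longer be passed to free.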
pedro henrique antunes de oliveira
2007-04-01 19:58:32 UTC
Permalink
Do you think that gcc is MALLOCing (hehe) all the space with one call instead of
following the program order (malloc one node at a time)?

cat /dev/cputype prints P4 2401. I was going to ask about that, whether there
is some file with my CPU info.
Charles Forsyth
2007-04-01 20:09:04 UTC
Permalink
Post by pedro henrique antunes de oliveira
Do you think that gcc is MALLOCing (hehe) all the space with one call instead of
following the program order (malloc one node at a time)?
you've misunderstood. we'd already discovered that all the extra time compared
to Linux was caused by having changed the program to do single writes via print
instead of buffered writes via printf (by the way, if you send
the output to /dev/null, the effect is much less, but even so).

geoff then showed how to make the program run even faster on either system,
assuming that was your main aim. he also showed that there were some bugs in
the original code (eg not returning a value from bstini,
which only happens to work because both compilers happen to leave the result
of malloc in the right place to be taken as a returned value by the caller of
bstini).
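
To illustrate the kind of bug meant here: the real signature of bstini is not shown in this thread, so the version below is hypothetical, as are the Bst type and bstini_demo. A non-void function that falls off its end without a return statement gives the caller an undefined value; it can appear to work when the result of the last call happens to sit where the return value is expected. The fix is the explicit return.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Bst Bst;
struct Bst {
	int key;
	Bst *left;
	Bst *right;
};

/*
 * Hypothetical reconstruction. The buggy shape would stop after filling
 * in the fields, leaving the return value to chance.
 */
Bst*
bstini(int key)
{
	Bst *t;

	t = malloc(sizeof *t);
	if(t == NULL)
		return NULL;
	t->key = key;
	t->left = NULL;
	t->right = NULL;
	return t;	/* the statement the original reportedly lacked */
}

/* Usage: create one node, check its fields, release it. Returns 1 if all is well. */
int
bstini_demo(void)
{
	Bst *t = bstini(7);
	int ok;

	assert(t != NULL);
	ok = t->key == 7 && t->left == NULL && t->right == NULL;
	free(t);
	return ok;
}
```

Compiling with warnings enabled, as with the 8c flags given earlier in the thread, is the usual way to have this class of mistake reported.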
pedro henrique antunes de oliveira
2007-04-01 20:12:03 UTC
Permalink
OK, but when I test the program with 'time program' I comment out the print
line, so the program was only sorting the things into the data structure,
not printing them.

Do I still need to consider the print function?
Charles Forsyth
2007-04-01 20:26:59 UTC
Permalink
then there is some confusion, because when i run your program
as-is (except for the bug fix) on my Plan 9 system, i get roughly the 2+ seconds you
mentioned as the time for the Linux system, and that also compares
with the time i get when i run the same code on a Linux system. i can only get 20+ seconds
on Plan 9 IF i have the program do the unbuffered print AND redirect to a file system across the
network (if i do the unbuffered print and send the output to /dev/null, it's still about 2 seconds).
ISHWAR RATTAN
2007-04-01 21:14:02 UTC
Permalink
Post by pedro henrique antunes de oliveira
OK, but when I test the program with 'time program' I comment out the print
line, so the program was only sorting the things into the data structure,
not printing them.
Do I still need to consider the print function?
I think that you have proved that you should stick with 'this other OS'.

Hope that helps.
-ishwar
pedro henrique antunes de oliveira
2007-04-02 01:53:30 UTC
Permalink
Post by ISHWAR RATTAN
Hope that helps.
It doesn't.
