Discussion: [9fans] usb slowness
Tristan
2012-03-04 01:56:39 UTC
for a while now i've known that usb/ether runs slow for me. i thought it
was something to do with the specific situation or hardware, and that,
say, usb/disk would be alright. it isn't. and not just on the olpc. now,
my machines are fairly slow as machines go... but 2.5MB/s reads are a far
cry from 22MB/s (with the olpc on linux).

it turns out that _strayintr is taking up nearly all the time.

cpu% kprof /n/9fat/9cpcpuf /dev/kpdata
total: 24620 in kernel text: 23890 outside kernel text:730
RTZERO f0100000 PGSIZE 4Kb
ms % sym
21840 91.3 _strayintr
290 1.2 memmove
290 1.2 iunlock
180 0.7 sched
160 0.6 memset
150 0.6 sleep
...

my understanding is that _strayintr is glue between the interrupt vector
table and trap. and presumably includes the time spent in trap.

a short review of usbehci.c didn't provide any insight, so i guess i'm
asking: is this expected behavior? and what can i do?

enjoy,
tristan

--
All original matter is hereby placed immediately under the public domain.
erik quanstrom
2012-03-05 15:46:01 UTC
> cpu% kprof /n/9fat/9cpcpuf /dev/kpdata
> total: 24620 in kernel text: 23890 outside kernel text:730
> RTZERO f0100000 PGSIZE 4Kb
> ms % sym
> 21840 91.3 _strayintr
> 290 1.2 memmove
> 290 1.2 iunlock
> 180 0.7 sched
> 160 0.6 memset
> 150 0.6 sleep

if you could temporarily run a 9atom kernel
(http://ftp.quanstro.net/other/9pc*.gz), there
are a few things you can do to look at the
problem:

1. /dev/irqalloc (http://ftp.quanstro.net/magic/man2html/3/arch)
has two additional fields detailing the irq allocation, and time
spent in each interrupt. unless things are really hosed, this should
tell you what's at fault. if things are really quite screwed up, you
may be inducing spurious interrupts. in that case,

2. /dev/mpirq (http://ftp.quanstro.net/magic/man2html/3/apic)
should detail enough of the bios-reported interrupts to tell if
something is going sideways in interrupt allocation.

unfortunately, i would think that the most likely reason
for this is that usbether/devusb are not tamping down the
interrupt properly.

i've seen this with usb serial.

- erik
Richard Miller
2012-03-05 16:05:48 UTC
> if you could temporarily run a 9atom kernel

With the normal kernel, you can use acid or db to look at the intrtimes
array to see the spread of cpu times used in each interrupt vector entry.
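
For reference, the array is declared in the kernel along these lines
(in /sys/src/9/pc/trap.c; details from memory, so check the source):

/* per-vector histogram of time spent handling each interrupt */
enum { Ntimevec = 20 };		/* elapsed-time buckets per vector */
ulong intrtimes[256][Ntimevec];	/* [interrupt vector][time bucket] */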

See http://9fans.net/archive/2003/11/520
erik quanstrom
2012-03-05 16:14:05 UTC
On Mon Mar 5 11:06:43 EST 2012, ***@hamnavoe.com wrote:
> > if you could temporarily run a 9atom kernel
>
> With the normal kernel, you can use acid or db to look at the intrtimes
> array to see the spread of cpu times used in each interrupt vector entry.
>
> See http://9fans.net/archive/2003/11/520

i didn't find the original histogram very useful. it is not scaled
by the speed of the processor nor the type of interrupt. the slots
in the histogram are appropriate for 8259 interrupts on a 2000-era
processor (100µs increments). almost all the interrupts i see are
< 20µs. the 82598 on my relatively slow cpu server averages 2.5µs
per interrupt.

it's also a pain to print out without adding a file in addition to /dev/irqalloc.
i found the absolute count and absolute # of cpu cycles (to avoid
predivision sloppiness) to be a lot more useful for the general case.

a histogram sure could be useful, but i think i'd want to tailor it to
a specific problem. and i haven't had one yet.

- erik
Richard Miller
2012-03-05 16:20:28 UTC
> a histogram sure could be useful, but i think i'd want to tailor it to
> a specific problem. and i haven't had one yet.

The problem in hand is to check the hypothesis that usb disk slowness
is due to "nearly all the time" being spent in interrupt handling.
A quick glance at intrtimes should suffice for that.
Tristan
2012-03-06 13:19:22 UTC
> if you could temporarily run a 9atom kernel

cpu% dd -if /dev/sdU0.0/data -of /dev/null -count 1024 -bs 65536

/dev/irqalloc yields
trap  irq  delta count  delta cycles  type   name
  35    3        27648     510854000  i8259  usbehci

which doesn't seem extraordinary (510854000/27648 ≈ 18500 cycles per
interrupt). and nothing else is huge either.

mpirq doesn't appear to exist in #P

the whole transfer takes about 8% longer on 9atom.

> > if i force polling, it's a little faster: 0.01u 0.45s 17.70r
> That's interesting - it shouldn't make a difference unless there's
> something wrong with the controller or the driver. What did you do to
> force polling?

added 'ctrl->poll.must = 1' to init in usbehci.c
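
in context, roughly this (a sketch from memory; the driver spells it
'ctlr', and the rest of init is elided):

/* usbehci.c, sketch: force the polling kproc on at init time,
 * so transfers complete even if interrupts are late or lost */
static void
init(Hci *hp)
{
	Ctlr *ctlr;

	ctlr = hp->aux;
	/* ... existing init code ... */
	ctlr->poll.must = 1;	/* poll instead of relying on the interrupt */
}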

> Yes, I think you're right - system cputime will only be incremented for
> a running process. But you can watch the last column of /dev/sysstat
> (or use 'stats -I') to see the percentage of time spent in interrupt
> handlers.

interrupt time is quite low/very low... (olpc / intel)

> How about number of interrupts? Erik's theory was that you were
> getting too many of these.

interrupt count is very low/moderate... (olpc / intel)

my working theory is that i'm getting all the interrupts, but not soon
enough... the fascinating piece is that the olpc and the pc with intel
ehc take just about the same amount of time.

and a different usb flash device doesn't change it (which was expected as
usb/ether seems to suffer also).

enjoy,
tristan

--
All original matter is hereby placed immediately under the public domain.
Tristan
2012-03-06 15:36:02 UTC
> my working theory is that i'm getting all the interrupts, but not soon
> enough... the fascinating piece is that the olpc and the pc with intel
> ehc take just about the same amount of time.

indeed, if i set the interrupt threshold control to 1 micro-frame (they
default to 8), i get a 10-11 sec dd, i.e. about half the time. and ping
time to my switch is halved. and i can get about twice the throughput
from the file server, which is still kinda slow.
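
the change itself is one line in the controller setup (a sketch;
opio->cmd is usbehci.c's name for the mmio'd USBCMD register, and the
bit positions are from the ehci spec, so double-check your source):

/* interrupt threshold control is USBCMD bits 23:16, in µframes */
opio->cmd = (opio->cmd & ~(0xff<<16)) | (1<<16);	/* 1 µframe = 125µs */
coherence();		/* make sure the mmio write is ordered */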

(hget yields about 6Mbit over a 100Mbit link)

tristan

--
All original matter is hereby placed immediately under the public domain.
Richard Miller
2012-03-06 16:10:58 UTC
> indeed, if i set the interrupt threshold control to 1 micro-frame (they
> default to 8), i get a 10-11 sec dd, i.e. about half the time.

That seems a good change to make. Particularly for usb mass storage,
which for some reason emulates a scsi transaction with three usb
transfers and therefore three interrupts per r/w operation (send
command, recv/send data, recv status). To get anything like maximum
throughput, we'd also have to queue all three transfers at once so
that the usb controller could schedule them in adjacent microframes
(or maybe even the same one?). Sending them separately and
synchronously from the user-space driver will probably take long
enough to miss one or more microframes.
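
Schematically, a single disk read amounts to this (a sketch of the
bulk-only transport, not the actual usb/disk source; epin/epout are
the bulk endpoint fds and the helper name is hypothetical):

#include <u.h>
#include <libc.h>

/* bulk-only transport: three bulk transfers, hence up to three
 * interrupts, per scsi operation */
long
umsread(int epin, int epout, uchar *cbw, uchar *buf, long n)
{
	uchar csw[13];

	if(write(epout, cbw, 31) != 31)		/* command stage: 31-byte CBW */
		return -1;
	n = read(epin, buf, n);			/* data stage */
	if(read(epin, csw, 13) != 13)		/* status stage: 13-byte CSW */
		return -1;
	return n;
}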

¡Hola! Nemo, should the interrupt threshold be set to 1 microframe
(125 μs) to get better latency for all devices? I can't see any good
reason for wanting to slow interrupts deliberately.
erik quanstrom
2012-03-06 18:55:49 UTC
On Tue Mar 6 10:37:06 EST 2012, 9p-***@imu.li wrote:
> > my working theory is that i'm getting all the interrupts, but not soon
> > enough... the fascinating piece is that the olpc and the pc with intel
> > ehc take just about the same amount of time.
>
> indeed, if i set the interrupt threshold control to 1 micro-frame (they
> default to 8), i get a 10-11 sec dd, i.e. about half the time. and ping
> time to my switch is halved. and i can get about twice the throughput
> from the file server, which is still kinda slow.
>
> (hget yields about 6Mbit over a 100Mbit link)

it sounds like you have it all figured out. but just in case there is a spurious
interrupt causing trouble, you can check with this rather embarrassing acid
script. sorry, i can't get acid to recognize intrcountx as a vlong, hence
the stepping by pairs of ulongs.

acid -l acid -k 1 /386/9pccpu
acid: i=0; while i<256*2 do {if(intrcountx[i]>0) then {print(i/2\D, "\t", intrcountx[i]);} i=i+2;}
7 143
14 44355
50 99090099
65 19938
73 37571
145 9814
153 1
161 42
169 7

- erik
Tristan
2012-03-06 22:22:21 UTC
i tried another device a bit ago. openrd (kw). with and without setting
the interrupt delay to 1 microframe. i get just about the same times (a
little higher). does anybody get less than 10 sec for

dd -if /dev/sdU0.0/data -of /dev/null -count 1024 -bs 65536 ?

or even less than 20 sec?

that is, is this a problem with all my available hardware, or all of our
software?

tristan

--
All original matter is hereby placed immediately under the public domain.
Tristan
2012-03-07 05:50:37 UTC
a little reading later.

the usb mass storage bulk-only spec stipulates that only one request may
be active at a time.

wince.


so, bigger buffer? yup, 32k reads double the speed again. 64k reads
aren't quite right, something clips its tail. now i'm up to ~11 MB/s.

except that it crashes the pc and olpc kernels. kw does fine though (for
now!). changing devmnt seems unlikely to be very safe without checking
every fileserver, in case one happily assumes 8k (plus headers) is
enough for anything.

anybody messed with devmnt's Msgsize successfully?

enjoy,
tristan

--
All original matter is hereby placed immediately under the public domain.
Tristan
2012-03-07 05:53:45 UTC
> anybody messed with devmnt's Msgsize successfully?

make that devmnt's MAXRPC, in usbfs.h it's Msgsize.
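
for reference, the two limits look something like this (paths and the
8k values from memory, so check your tree):

/* /sys/src/9/port/devmnt.c: caps msize for kernel 9p mounts */
#define MAXRPC	(IOHDRSZ+8192)

/* /sys/src/cmd/usb/lib/usbfs.h: the usb file servers' matching limit */
enum { Msgsize = 8*1024 };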

enjoy,
tristan

--
All original matter is hereby placed immediately under the public domain.
Charles Forsyth
2012-03-07 09:31:25 UTC
fversion allows you to change msize without changing devmnt.c,
supposedly
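
For example (a sketch; fversion(2) negotiates msize on the open 9P
channel before the mount, and the 64k figure is illustrative):

#include <u.h>
#include <libc.h>
#include <fcall.h>

/* negotiate a larger msize on fd, then mount it */
void
bigmount(int fd, char *mtpt)
{
	char vers[32];

	strcpy(vers, "9P2000");
	if(fversion(fd, IOHDRSZ+65536, vers, sizeof vers) < 0)
		sysfatal("fversion: %r");
	if(mount(fd, -1, mtpt, MREPL, "") < 0)
		sysfatal("mount: %r");
}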

On 7 March 2012 05:53, Tristan <9p-***@imu.li> wrote:
>> anybody messed with devmnt's Msgsize successfully?
>
> make that devmnt's MAXRPC, in usbfs.h it's Msgsize.
>
> enjoy,
> tristan
>
> --
> All original matter is hereby placed immediately under the public domain.
>
Tristan
2012-03-10 03:13:12 UTC
every so often, using a mouse/tablet with the olpc, i get this message:

ohci: bug: donehead changed before ack

but not such that i can reproduce it on demand.

the archives suggest that erik experienced this.

disabling the ohc interrupts during the interrupt routine (the ohci spec
says to) appears to help, but i haven't been running that for very long.
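
concretely, something like this around the handler body (a sketch;
the register and constant names follow usbohci.c from memory, and
processdonequeue is a hypothetical stand-in for the real work):

/* mask ohci interrupts for the duration of the handler,
 * as the ohci spec suggests */
static void
interrupt(Ureg*, void *arg)
{
	Ctlr *ctlr;
	ulong sts;

	ctlr = arg;
	ilock(ctlr);
	ctlr->ohci->intrdisable = Mie;	/* master interrupt enable off */
	sts = ctlr->ohci->intrsts;
	if(sts & Wdh)
		processdonequeue(ctlr);	/* hypothetical helper */
	ctlr->ohci->intrsts = sts;	/* ack what we handled */
	ctlr->ohci->intrenable = Mie;	/* back on */
	iunlock(ctlr);
}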

anyone know anything about this?

thanks,
tristan

--
All original matter is hereby placed immediately under the public domain.
erik quanstrom
2012-03-10 12:33:54 UTC
> disabling the ohc interrupts during the interrupt routine (the ohci spec
> says to) appears to help, but i haven't been running that for very long.

so i guess the question is whether ohci's interrupt is edge- or
level-triggered. if level, interrupts are going to be masked by the
level and that wouldn't be your problem. i suppose it could also be
that shared usb hardware has more than one interrupt enabled?

- erik
Tristan
2012-03-10 15:14:32 UTC
> > disabling the ohc interrupts during the interrupt routine (the ohci spec
> > says to) appears to help, but i haven't been running that for very long.
indeed, an hour or so later, it started acting up.

the devices keep working if i jump back to the beginning of the handler
when donehead changes... nose wrinkle.

> so i guess the question is whether ohci's interrupt is edge- or
> level-triggered. if level, interrupts are going to be masked by the
> level and that wouldn't be your problem.
and it doesn't appear to be the problem.

my understanding is that donehead shouldn't change, because the hcd has
no way of knowing that the driver has read anything.

> i suppose it could also be that shared usb hardware has more than one
> interrupt enabled?
what? do you mean multiple devices on the interrupt or different types of
interrupts?

so far (and this is only an hour or so of use) donehead only changes from
nil to a real pointer, which leaves very little time in the routine.
cycles says [300,386].

which of course means that it isn't changing; it's just an interrupt
without status&Wdh but with donehead set. checking for Wdh is running
perfectly well now.
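
the check, expanded (a fragment that slots into the handler sketched
in my earlier message; the low-bit masking is per the ohci spec, names
from memory):

ulong td;

/* hcca->donehead is only valid when the controller set Wdh */
if(sts & Wdh){
	td = ctlr->hcca->donehead & ~0xf;	/* tds are 16-byte aligned */
	ctlr->hcca->donehead = 0;
	/* ... walk the done list starting at td ... */
}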

thank you,
tristan

--
All original matter is hereby placed immediately under the public domain.
Richard Miller
2012-03-07 15:27:31 UTC
> does anybody get less than 10 sec for
>
> dd -if /dev/sdU0.0/data -of /dev/null -count 1024 -bs 65536 ?
>
> or even less than 20 sec?

10.4 sec on a sheevaplug (with 1 microframe interrupt threshold).

By the way, everyone has been very tactful not to point out my
silly mistake in saying:

> ... You can use time(1) while
> reading usb disk to see if cpu is really the bottleneck, or if
> in fact your machine is spending most of its time waiting for
> a usb interrupt.

Of course most of the user+sys cpu time involved in reading from usb
disk will be charged to the usb/disk process (normally a subprocess of
usbd), not to the user command doing the reading.
Richard Miller
2012-03-05 15:56:37 UTC
> it turns out that _strayintr is taking up nearly all the time.

Or maybe it isn't. Taking up nearly all the cpu time isn't the
same as taking up nearly all the time. You can use time(1) while
reading usb disk to see if cpu is really the bottleneck, or if
in fact your machine is spending most of its time waiting for
a usb interrupt. If the machine is mostly idle, profiling clock
ticks should usually find it sitting at the HLT instruction which
happens to be a couple of bytes before _strayintr.
Tristan
2012-03-06 00:01:42 UTC
> You can use time(1) while reading usb disk to see if cpu is really the
> bottleneck, or if in fact your machine is spending most of its time
> waiting for a usb interrupt. If the machine is mostly idle, profiling
> clock ticks should usually find it sitting at the HLT instruction which
> happens to be a couple of bytes before _strayintr.

time suggests that it is waiting for interrupts: 0.03u 0.75s 25.42r
if i force polling, it's a little faster: 0.01u 0.45s 17.70r
(dd -if /dev/sdU0.0/data -of /dev/null -bs 65536 -count 1024)

hold on a second: time wouldn't show interrupt-handling time, correct?

load remains fairly low during both, but system calls and context
switches are high.

time to read up on ehci...

thank you,
tristan

--
All original matter is hereby placed immediately under the public domain.
Richard Miller
2012-03-06 10:39:51 UTC
> if i force polling, it's a little faster: 0.01u 0.45s 17.70r

That's interesting - it shouldn't make a difference unless there's something
wrong with the controller or the driver. What did you do to force polling?

> hold on a second: time wouldn't show interrupt-handling time, correct?

Yes, I think you're right - system cputime will only be incremented for
a running process. But you can watch the last column of /dev/sysstat
(or use 'stats -I') to see the percentage of time spent in interrupt handlers.

> load remains fairly low during both, but system calls and context
> switches are high.

How about number of interrupts? Erik's theory was that you were getting
too many of these.