Discussion: [9fans] fossil pb: a clue?
t***@polynum.com
2012-01-13 11:30:26 UTC
I'm still trying to find out why my fossil data is twice the size of the
files served.

Since this is almost exactly twice the size (minus the MB of kerTeX), I
wonder whether fossil still has a plan9.iso recorded from the installation,
not visible when mounting main.

Since I did the installation locally, without a CDROM (documentation
explaining how is under review), the bzip2 archive was uncompressed into the
fossil area by the installation scripts. But was the iso removed after the
data was copied, and /n/dist (or whatever) unmounted? And if it has not
been removed, what is its path under fossil?

As a brute-force check, I tried grep(1) on /dev/sdC0/fossil. It found
nothing that looked like a "plan9.iso" filename entry, but the same
matching data was printed twice...

Is there a way to print the list of pathnames fossil has registered (an
absolute ls/tree of everything fossil holds), to find out whether the
iso is registered "somewhere" without showing up?
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
t***@polynum.com
2012-01-13 13:41:51 UTC
Summary of the previous episodes: after having reinstalled Plan9, with
almost only the vanilla distribution, du(1) announces 325 MB, while
fossil uses twice that much space to store it.

I suspected, since it was clearly almost precisely twice (minus some tmp
files and kerTeX), that the problem was that the plan9.iso was still there,
at least in fossil, without showing.

On the console, I tried in turn the places where it could be (looking at
the scripts in pc/inst/), and found:

stat /active/dist/plan9.iso plan9.iso glenda sys 664 289990656

But the surprise is that it was _not_ hidden. It _was_ there under /dist
... but apparently not added to the summary made by du(1)? Does du(1)
"know" that some directories are mount points taking (normally) no real
space, and skip them? Because this would mean one can add whatever files
one likes in there and fill fossil with du(1) ignoring it all...

On a side note, the output of du(1) is not accurate with the "-h" flag:

term% du -sh /
347.8285G /

I have megabytes, not gigabytes.
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
erik quanstrom
2012-01-13 13:59:32 UTC
Post by t***@polynum.com
But the surprise is that it was _not_ hidden. It _was_ here under /dist
... but apparently not added to the summary made by du(1)? Does du(1)
"know" that some dir are mount point taking (normally) no real space,
and skipping them? Because this means one can add whatever files in
there and fill fossil with du(1) ignoring all...
term% du -sh /
347.8285G /
I have megabytes, not gigabytes.
i think your / has mounts and binds that are confusing du. you
need to remount your root file system someplace free of mounts
or binds on top, e.g.:

; mount /srv/boot /n/boot; cd /n/boot
; du -s .>[2=]
4049232 .
; du -sh .>[2=]
3.861649G .
; hoc
4049232 / 1024 / 1024
3.86164855957

- erik
t***@polynum.com
2012-01-13 14:08:34 UTC
Post by erik quanstrom
Post by t***@polynum.com
But the surprise is that it was _not_ hidden. It _was_ here under /dist
... but apparently not added to the summary made by du(1)? Does du(1)
"know" that some dir are mount point taking (normally) no real space,
and skipping them? Because this means one can add whatever files in
there and fill fossil with du(1) ignoring all...
term% du -sh /
347.8285G /
I have megabytes, not gigabytes.
i think your / has mounts and binds that are confusing du. you
need to remount your root file system someplace free of mounts
; mount /srv/boot /n/boot; cd /n/boot
; du -s .>[2=]
4049232 .
; du -sh .>[2=]
3.861649G .
; hoc
4049232 / 1024 / 1024
3.86164855957
Do you mean only the 347.8285G, or also the /dist/plan9.iso that was
not seen? Because, for the gigabytes, it is just a formatting error,
since without the option it correctly reports 350 MB. Since your test is
with gigabytes, the G suffix is correct in your case. But it may simply be
(I didn't look at the source) that for megabytes it prints a G suffix too...
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
erik quanstrom
2012-01-13 14:47:25 UTC
Post by t***@polynum.com
Post by erik quanstrom
i think your / has mounts and binds that are confusing du. you
need to remount your root file system someplace free of mounts
; mount /srv/boot /n/boot; cd /n/boot
; du -s .>[2=]
4049232 .
; du -sh .>[2=]
3.861649G .
; hoc
4049232 / 1024 / 1024
3.86164855957
Do you spot only the 347.8285G? or altogether the /dist/plan9.iso that
was not seen? Because, for the gigabytes, it is just a format error,
since without the option it reports correctly 350Mb. Since your test is
with gigabytes, the G suffix is correct. But it may be simply (I didn't
look at the source) that for Megabytes, it prints a G suffix too...
please try what i suggested. i've shown that du -h works properly
on my system.

- erik
t***@polynum.com
2012-01-13 16:01:42 UTC
Post by erik quanstrom
please try what i suggested. i've shown that du -h works properly
on my system.
Indeed, remounting yields the correct MB result.

What I missed is that the result of du(1) is not in bytes but in
_kilobytes_ (since I knew there were approx. 300 MB, I assumed it was bytes).

So this means that the plan9.iso, being "only" 250 MB, had a small impact
on a printed result wrongly multiplied by 1000 or so. So the file was
not hidden; it is the whole du(1) count that is wrong.

The mounts in my profile are the vanilla ones (the only customizations
are for the network, the mouse, the keyboard). I do not play with the
namespace.

Do you have an idea where to look for the offending
instructions? /boot(8)?
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
erik quanstrom
2012-01-13 16:16:11 UTC
Post by t***@polynum.com
The mounts in my profile are the vanilla ones (the only customizations
are for the network, the mouse, the keyboard). I do not play with the
namespace.
Have you an idea where to look to find what are the offending
instructions? /boot(8)?
they're not offending! not all files in a typical plan 9 namespace
make sense to du. for example, if you're running rio, it doesn't make
sense to add that to the file total. also, if you have a disk in
/dev/sdXX/data, that file will be added to the total, as will any
partitions of that disk, etc.
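for example (disk and partition names illustrative), the sd device files
alone can add about twice the size of the physical disk to the total, since
the data file spans the whole disk and each partition spans part of it again:

; ls -l /dev/sdC0	# data, plan9, 9fat, fossil, ... each listed with its full size
; du -s /dev/sdC0	# so the sum is roughly twice the size of the physical disk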

- erik
t***@polynum.com
2012-01-13 16:34:44 UTC
Post by erik quanstrom
they're not offending! not all files in a typical plan 9 namespace
make sense to du. for example, if you're runing rio, it doesn't make
sense to add that to the file total. also, if you have a disk in
/dev/sdXX/data, that file will be added to the total, as will any
partitions of that disk, etc.
This means that du(1) is fine for listing, but as far as sizes go,
"du -s" does not make a lot of sense?

On a side note: when using mount(8) without arguments on a typical Unix,
one can see what is mounted where. Is there some way to find the
"organization" of the namespace on Plan9? (What is mount'ed and what is
bind'ed?)
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
David du Colombier
2012-01-13 16:42:21 UTC
Post by t***@polynum.com
On a side note. When using mount(8) without arguments on a typical
Unix, one can see what is mounted where. Is there some way to find the
"organization" of the namespace on Plan9? (What is mount'ed and what
is bind'ed?
ns(1)
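A minimal sketch of its use (the pid is illustrative):

term% ns	# print the bind and mount operations that built the current namespace
term% ns 1	# the namespace of process 1, read from /proc/1/ns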
--
David du Colombier
Vivien MOREAU
2012-01-13 16:44:36 UTC
Post by t***@polynum.com
On a side note. When using mount(8) without arguments on a typical
Unix, one can see what is mounted where. Is there some way to find the
"organization" of the namespace on Plan9? (What is mount'ed and what
is bind'ed?
Sure... ns(1) :-)
--
Vivien
t***@polynum.com
2012-01-13 16:50:56 UTC
Post by Vivien MOREAU
Post by t***@polynum.com
On a side note. When using mount(8) without arguments on a typical
Unix, one can see what is mounted where. Is there some way to find the
"organization" of the namespace on Plan9? (What is mount'ed and what
is bind'ed?
Sure... ns(1) :-)
Missed this one... Thanks!
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
David du Colombier
2012-01-13 16:17:34 UTC
Post by t***@polynum.com
The mounts in my profile are the vanilla ones (the only customizations
are for the network, the mouse, the keyboard). I do not play with the
namespace.
Have you an idea where to look to find what are the offending
instructions? /boot(8)?
It's probably simply because /root is a recursive bind.

See /lib/namespace.
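A quick way to look at it (from memory, the relevant lines mount #s/boot
on /root and then bind /root back over /):

term% grep root /lib/namespace	# the namespace rules applied at boot
term% ns | grep root		# the same thing in the live namespace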
--
David du Colombier
t***@polynum.com
2012-01-13 16:41:01 UTC
Post by David du Colombier
Post by t***@polynum.com
The mounts in my profile are the vanilla ones (the only customizations
are for the network, the mouse, the keyboard). I do not play with the
namespace.
Have you an idea where to look to find what are the offending
instructions? /boot(8)?
It's probably simply because /root is a recursive bind.
See /lib/namespace.
Yes, but reading "Getting Dot-Dot right" by Rob Pike, I thought that the
solution was to have, underneath, one unique pathname for a file. Date(1)
can format UTC: whatever the user presentation, underneath there is only
the UTC. The namespace is a way to manage nicknames, or the presentation
of data; to manage different views of the "real" thing, but underneath
there is a unique pathname; a pathname finally resolved to something
(no infinite recursion).

So am I wrong?
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
Charles Forsyth
2012-01-13 16:50:34 UTC
The name space can contain loops. Du and a few others try to detect that,
using qids, to avoid being annoying, but the loops are there.
Open (and chdir etc.) do indeed record the name used to open the file, and
that helps resolve the ".." problem (now done slightly differently from the
paper, I think, but I'm not certain), but that name won't have loops because
it's a finite string interpreted from left to right.

You can easily build a looped space to test it:
mkdir /tmp/y
mkdir /tmp/y/z
bind /tmp /tmp/y/z
# have fun
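and to clean up afterwards (undo the bind before removing the directories):
unmount /tmp /tmp/y/z
rm /tmp/y/z /tmp/y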
Post by t***@polynum.com
So I'm wrong?
t***@polynum.com
2012-01-13 17:05:56 UTC
Post by Charles Forsyth
[...]
mkdir /tmp/y
mkdir /tmp/y/z
bind /tmp /tmp/y/z
# have fun
Since it is by erring that one learns, I learned a lot today!
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
t***@polynum.com
2012-01-13 17:02:19 UTC
Post by t***@polynum.com
Post by David du Colombier
See /lib/namespace.
Yes, but reading "Getting Dot-Dot right" by Rob Pike, I thought that the
solution was to have, underneath, one uniq pathname for a file. Date(1)
can format UTC; whatever the user presentation, underneath there is only
the UTC. Namespace is a way to manage the nicknames, or the presentation
of data; to manage different views of the "real" thing, but underneath
there is an uniq pathname; a pathname finally resolved to something
(no infinite recursion).
Answering myself: du(1) -s makes the sum of every entry it has printed.
If entries are repeated (because of multiple binds), they appear in
the sum.

So du(1) does what it says: the sum.

I had never thought that, perhaps, under Unices, du(1) with hard links
would produce the same misleading result...
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
Charles Forsyth
2012-01-13 17:11:51 UTC
It was a long time ago, but I think some versions of du used dev/ino to
avoid counting the same file twice.
Post by t***@polynum.com
I never thought that perhaps, under Unices, du(1) with hard links will
produce the same misleading result...
Nicolas Bercher
2012-01-13 17:24:39 UTC
Post by t***@polynum.com
I never thought that perhaps, under Unices, du(1) with hard links will
produce the same misleading result...
And fortunately, Unix 'du' handles this correctly!
(the -l option toggles whether hard-linked files are counted more than
once)
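A tiny illustration on a GNU/Linux box (file names and sizes hypothetical):

$ dd if=/dev/zero of=a bs=1M count=10   # a 10 MB file
$ ln a b                                # a second name for the same inode
$ du -sh .                              # shared blocks counted once: about 10M
$ du -shl .                             # -l counts every link: about 20M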

Nicolas
t***@polynum.com
2012-01-13 17:44:40 UTC
Post by Nicolas Bercher
Post by t***@polynum.com
I never thought that perhaps, under Unices, du(1) with hard links will
produce the same misleading result...
And fortunately, Unices 'du' handles this correctly!
(-l option toggles the counting of hardlinked files several times or
not)
But one could argue that in this case the only sensible behaviour would be
for du(1), with some flag, to produce the grand "true" total without a
detailed listing.

Because, if that is not the case, either another instance of an already
seen hard link is not printed in the listing (but this is arbitrary),
or the sum is incorrect (in the sense that it is not the sum of the sizes
displayed) ;)

The devil is in the details.
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
erik quanstrom
2012-01-13 17:37:17 UTC
Post by t***@polynum.com
the UTC. Namespace is a way to manage the nicknames, or the presentation
of data; to manage different views of the "real" thing, but underneath
there is an uniq pathname; a pathname finally resolved to something
(no infinite recursion).
what "real" thing? from the perspective of a user program,
neglecting #, every file access is through the namespace.

within a file server, each file has a unique identifier, the qid (as charles
mentioned). but between (instances of) file servers, qids are not unique.
in general, the problem i think is hard. but fortunately most reasonable
questions of unique files can be answered straightforwardly.
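you can look at qids directly with ls(1); a small sketch (paths illustrative):

; ls -ql /dist/plan9.iso	# -q prints the qid (path version type) in front of the usual listing
; ls -ql /dev/time		# a qid from a kernel device; numerically it may collide with one from fossil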

- erik
t***@polynum.com
2012-01-13 17:58:01 UTC
Post by erik quanstrom
[...]
each file server has a unique path name, called the qid (as charles
mentioned). but between (instances of) file servers, qids are not unique.
But a "fully qualified qid", I mean, at this very moment, for the
kernel, a resource is some qid served by some server. So (srv,qid) is
an uniq identifier, even if only in a local context.

We are a lot to call ourselves: "I" or "me". But in the context, this is
an uniq identifier (because all other mees are not me!).

For this, IP has found an elegant solution. There are identifiers that
are only local. As long as there is no interconnexion, the identifiers
are not absolutely uniq, but relatively uniq. And this is sufficient.

But I realize that the problem is hard. And that, all in all, the correct
information is available from the file servers; and that, where the
namespace is concerned, we potentially have access to huge resources, so
by the nature of interconnections the answer is fuzzy.

I will not trade the distributed nature of Plan9, the namespace,
the everything-is-a-file, etc. for the ability to have du(1)
tell me "accurately" what is stored here and only here (since I have
other means to know, with the console).

But this was obviously not clear to me until now!
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
erik quanstrom
2012-01-13 18:14:27 UTC
Post by t***@polynum.com
Post by erik quanstrom
each file server has a unique path name, called the qid (as charles
mentioned). but between (instances of) file servers, qids are not unique.
/dev/pid has a different size depending on what your pid happens to be. so
i think your statement is still too strong.
Post by t***@polynum.com
I will not exchange the distributed nature of Plan9; and the namespace;
and the everything is a file etc. against the ability to have du(1)
telling me "acurately" what is stored here and only here (since I have
other means to know with the console).
taking the original problem — "how much disk space am i using", i think
you're in a better position than a unix user would be. you can always
mount the fileserver serving the on-disk files from / someplace unique
and get an accurate count from there. you're right that that's not a
general solution. but then again, you have a specific question to which
there's an easy (if specific) answer.

- erik
Yaroslav
2012-01-13 21:00:31 UTC
please note that the sum du returns may be bigger than the actual
storage used anyway - think deduping and compression done at venti
level.
Charles Forsyth
2012-01-13 22:14:53 UTC
that's very true. i rely on that quite a bit, generating new copies
frequently, expecting that it won't consume much more space.
Post by Yaroslav
think deduping and compression done at venti
erik quanstrom
2012-01-13 21:02:22 UTC
Post by Yaroslav
please note that the sum du returns may be bigger than the actual
storage used anyway - think deduping and compression done at venti
level.
not everyone has a venti.

- erik
erik quanstrom
2012-01-13 22:17:44 UTC
that's very true. i rely on that quite a bit, generating new copies
frequently, expecting that it won't consume much more space.
an extra copy or 100 of the distribution
will be <1% of a new hard drive, even with
no de-dup.

- erik
Aram Hăvărneanu
2012-01-13 23:10:55 UTC
Post by erik quanstrom
an extra copy or 100 of the distribution
will be <1% of a new hard drive, even with
no de-dup.
Sure, but there's other data than that. I do music, as a hobby. A
project for an electronic track can be 20GB because everything I use
is "statically linked" into it. Doing it this way has all the
advantages that static linking has for binaries.

When your tracks are 20GB each but 90% of the data is shared, and you keep
full history for your tracks, dedup becomes invaluable.
--
Aram Hăvărneanu
Francisco J Ballesteros
2012-01-13 23:14:25 UTC
but if you insert extra music in front of your track, dedup in venti won't help.
or would it?
Post by erik quanstrom
an extra copy or 100 of the distribution
will be <1% of a new hard drive, even with
no de-dup.
Sure, but there's other data than that. I do music, as a hobby.  A
project for an electronic track can have 20GB because everything I use
is "statically linked" into it.  Doing it this way has all the
advantages static linking for binaries has.
When your tracks have 20GB but 90% the data is shared, and you keep
full history for your track, dedup becomes invaluable.
--
Aram Hăvărneanu
Aram Hăvărneanu
2012-01-13 23:23:40 UTC
Post by Francisco J Ballesteros
but if you insert extra music in front of your track dedup in venti won't help.
or would it?
It wouldn't. In practice it seems that the software usually appends, probably
for performance reasons, so for me it has worked absolutely
great so far.
--
Aram Hăvărneanu
Bakul Shah
2012-01-14 00:30:32 UTC
Post by Francisco J Ballesteros
but if you insert extra music in front of your track dedup in venti won't help.
or would it?
No. Venti operates at block level.

You are better off using an SCM like mercurial (though commits
are likely to be slow). In case you were wondering, the
mercurial repo format does seem to be `dedup' friendly as new
data is appended at the end.

$ du -sh .hg
100M .hg
$ ls -l .hg/store/data/foo.d
-rw-r--r-- 1 xxxxx xxxxx 104857643 Jan 13 16:13 .hg/store/data/foo.d
$ cp .hg/store/data/foo.d xxx # save a copy of repo data for foo
$ echo 1 | cat - foo > bar && mv bar foo # prepend a couple of bytes to foo
$ hg commit -m'test4'
$ ls -l .hg/store/data/foo.d
-rw-r--r-- 1 xxxxx xxxxx 104857657 Jan 13 16:16 .hg/store/data/foo.d
$ cmp xxx .hg/store/data/foo.d # compare old repo data with new
cmp: EOF on xxx
$ du -sh .hg
100M .hg
dexen deVries
2012-01-14 01:01:50 UTC
Post by Bakul Shah
Post by Francisco J Ballesteros
but if you insert extra music in front of your track dedup in venti won't
help. or would it?
No. Venti operates at block level.
there are two ways around it:

0)
use of a rolling checksum enables decent block-level deduplication on files
that are modified in the middle; some info:
http://svana.org/kleptog/rgzip.html
http://blog.kodekabuki.com/post/11135148692/rsync-internals

in short, a rolling checksum is used to find reasonable restart points; for
us, block boundaries. it probably could be overlaid on Venti;
rollingchecksumfs anybody?

1)
Git uses a diff-based format for long-term compacted storage, plus some gzip
compression. i don't know the specifics, but IIRC it's pretty much standard diff.

it's fairly CPU- and memory-intensive on larger (10...120MB in my case) text
files, but produces a beautiful result:

i have a cronjob take dumps of a dozen MySQL databases, each some 10...120MB of
SQL (textual). each daily dump collection is committed into Git; the overall
daily collection size grew from some 10MB two years ago to about 410MB today;
over two years, about 700 commits.

each dump differs slightly in content from yesterday's and the changes are
scattered all over the files; it would not deduplicate too well at the block level.

yet the Git storage, after compaction (which takes a few minutes on a slow
desktop), totals about 200MB, all commits included. yep: less storage
taken by two years' worth of Git history than by one daily dump collection.

perhaps Git's current diff format would not handle binary files very well, but
there are binary diffs available out there.
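a minimal way to reproduce the effect (database and file names hypothetical):

$ git init dumps && cd dumps
$ mysqldump mydb > mydb.sql		# day 1 dump of a hypothetical database
$ git add mydb.sql && git commit -m 'day 1'
$ mysqldump mydb > mydb.sql		# day 2: mostly identical text, small scattered changes
$ git commit -am 'day 2'
$ git gc --aggressive			# repack loose objects into delta-compressed packfiles
$ du -sh .git				# far less than two full copies of the dump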
--
dexen deVries
The Fast drives out the Slow even if the Fast is Wrong.
William Kahan in
http://www.cs.berkeley.edu/~wkahan/Stnfrd50.pdf
erik quanstrom
2012-01-14 13:27:58 UTC
given the fact that most disks are very large, and most people's non-media
storage requirements are very small, why is the compelling.
shoot. ready. aim. i meant, "why is this compelling?". sorry.

- erik
erik quanstrom
2012-01-14 13:26:52 UTC
Post by dexen deVries
0)
use of rolling-checksum enables decent block-level deduplication on files that
http://svana.org/kleptog/rgzip.html
http://blog.kodekabuki.com/post/11135148692/rsync-internals
in short, a rolling checksum is used to find reasonable restart points; for
us, block boundaries. probably could be overlayed over Venti;
rollingchecksumfs anybody?
1)
Git uses diff-based format for long-term compacted storage, plus some gzip
compression. i don't know specifics, but IIRC it's pretty much starndard diff.
it's fairly CPU- and memory-intensive on larger (10...120MB in my case) text
i have a cronjob take dump of a dozen MySQL databases; each some 10...120MB of
SQL (textual). each daily dump collection is committed into Git; the overall
daily collection size grew from some 10MB two years ago to about 410MB today;
over two years about 700 commits.
each dump differ slightly in content from yesterday's and the changes are
scattered all over the files; it would not de-duplicate block-level too well.
yet the Git storage, after compaction (which takes a few minutes on a slow
desktop), totals about 200MB, all the commits included. yep; less storage
taken by two years' worth of Git storage than by one daily dump.
given the fact that most disks are very large, and most people's non-media
storage requirements are very small, why is the compelling.

from what i've seen, people have the following requirements for storage:
1. speed
2. speed
3. speed
4. large caches.

- erik
hiro
2012-01-14 15:00:07 UTC
venti is too big, buy bigger disks and forget venti.
Charles Forsyth
2012-01-14 15:06:29 UTC
Although drives are larger now, even SSDs, there is great satisfaction in
being able to make copies of large trees arbitrarily, without having to
worry about them adding any more than just the changed files to the
write-once stored set.
I do this fairly often during testing.
Post by hiro
venti is too big, buy bigger disks and forget venti.
erik quanstrom
2012-01-14 15:29:21 UTC
Post by Charles Forsyth
Although drives are larger now, even SSDs, there is great satisfaction in
being able to make copies of large trees arbitrarily, without having to
worry about them adding any more than just the changed files to the
write-once stored set.
I do this fairly often during testing.
(as an aside, one assumes changed files + the directory tree, as the
a/mtimes are changed.)

such satisfaction is not deniable, but is it a good tradeoff?

my /sys/src is 191mb. i can make a complete copy, and never delete
it, every day for the next 2739 years (give or take :)) and not run out
of disk space. (most of the copies i make are deleted before they are
committed to the worm.)

i think it would be fair to argue that source and executables are
negligible users of storage. media files, which are already compressed,
tend to dominate.

the tradeoff for this compression is a large amount of memory,
fragmentation, and cpu usage. that is to say, storage latency.

so i wonder if we're not spending all our resources trying to optimize
only a few percent of our storage needs.

- erik
Aram Hăvărneanu
2012-01-14 16:16:36 UTC
Post by erik quanstrom
i think it would be fair to argue that source and executables are
negligeable users of storage.  media files, which are already compressed,
tend to dominate.
What about virtual machine images?
Post by erik quanstrom
the tradeoff for this compression is a large amount of memory,
fragmentation, and cpu usage.  that is to say, storage latency.
I have 24GB RAM. My primary laptops have 8GB RAM. I have all this RAM
not because of dedup but because I do memory-intensive tasks, like
running virtual machines. I believe this is true for many users.

I'm of a completely different opinion regarding fragmentation. On
SSDs, it's a non-issue. Historically, one of the hardest things to do
right in a filesystem was minimizing fragmentation. Today you don't
have to do it, so there's less complexity to manage in the file system.
Even if you still have rotating rust to store the bulk of the data, a
small SSD cache in front of it renders fragmentation irrelevant.

My CPU can SHA-1 hash orders of magnitude faster than it can read from
disk, and that's using only generic instructions; plus, it's sitting
idle anyway.
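A rough way to see this for yourself on Linux (file name and sizes
hypothetical):

$ dd if=/dev/zero of=big bs=1M count=1024		# 1 GiB of test data
$ time sha1sum big					# hashing rate, mostly CPU (the file is likely cached)
$ time dd if=big of=/dev/null bs=1M iflag=direct	# read rate from the device, bypassing the page cache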
Post by erik quanstrom
so i wonder if we're not spending all our resources trying to optimize
only a few percent of our storage needs.
Dedup is not a panacea, but it's certainly useful for many workloads.
--
Aram Hăvărneanu
erik quanstrom
2012-01-14 16:32:30 UTC
Post by Aram Hăvărneanu
What about virtual machine images?
Post by erik quanstrom
the tradeoff for this compression is a large amount of memory,
fragmentation, and cpu usage.  that is to say, storage latency.
I have 24GB RAM. My primary laptops have 8GB RAM. I have all this RAM
not because of dedup but because I do memory intensive tasks, like
running virtual machines. I believe this is true for many users.
russ posted some notes on how much memory and disk bandwidth are
required to write at a constant b/w of X mb/s to venti. venti requires
enormous resources to do this.

also, 24gb isn't really much storage. that's 1000 vm images/disk, assuming
that you store the regions with all zeros.

one thing to note is that we're silently comparing block (ish) storage (venti)
to file systems. this isn't really a useful comparison. i don't know of many
folks who store big disk images on file systems.

we have some customers who do do this, and they use the vsx to clone
a base vm image. there's no de-dup, but only the change extents get
stored.
Post by Aram Hăvărneanu
I'm of a completely different opinion regarding fragmentation. On
SSDs, it's a non issue.
that's not correct. a very good ssd will do only about 10,000 r/w random
iops. (certainly they show better numbers for the easy case of compressible
100% write workloads.) that's less than 40mb/s. on the other hand, a good
ssd will do about 10x that if reading sequentially.
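the arithmetic, at 4k per transfer (hoc, as earlier in the thread):

; hoc
10000 * 4096 / (1024*1024)
39.0625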
Post by Aram Hăvărneanu
My CPU can SHA-1 hash orders of magnitude faster than it can read from
disk, and that's using only generic instructions, plus, it's sitting
idle anyway.
it's not clear to me that the sha-1 hash in venti has any real bearing on
venti's end performance. do you have any data or references for this?

- erik
Aram Hăvărneanu
2012-01-14 18:01:03 UTC
Post by erik quanstrom
russ posted some notes how how much memory and disk bandwidth are
required to write at a constant b/w of Xmb/s to venti.  venti requires
enormous resources to perform this capability.
Maybe; I was talking generally about the concept of
content-addressable storage, not venti in particular. I believe it's
possible to do CAS without a major performance hit; look at ZFS, for
example.
Post by erik quanstrom
one thing to note is that we're silently comparing block (ish) storage (venti)
to file systems.  this isn't really a useful comparison.  i don't know of many
folks who store big disk images on file systems.
But many want to back up these images somewhere, and venti makes a
good candidate.

In my experience, a machine serving iSCSI or AoE to VMs running on
different machines is pretty common, and iSCSI or AoE is often done in
software, sometimes using big files on a local file system. I don't
know any other way to do it in Linux; if you export block storage
directly, you lose a lot of flexibility.

On Solaris, ZFS takes a different approach: you can ask ZFS to give
you a virtual LUN, bypassing the VFS completely.
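For reference, a rough sketch of that Solaris path (pool and volume names
hypothetical):

$ zfs create -V 20g tank/vm0		# a 20 GB zvol; no POSIX layer involved
$ ls /dev/zvol/rdsk/tank/vm0		# raw device node to export over iSCSI or hand to a VM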
Post by erik quanstrom
Post by Aram Hăvărneanu
I'm of a completely different opinion regarding fragmentation. On
SSDs, it's a non issue.
that's not correct.  a very good ssd will do only about 10,000 r/w random
iops.  (certainly they show better numbers for the easy case of compressable
100% write work loads.)  that's less than 40mb/s.  on the other hand, a good ssd will do
about 10x, if eading sequentially.
Sure, but 1,000 iops gives you only a 10% performance hit. With
rotating rust, 10 iops give you the same 10% hit: two orders of
magnitude difference. In my experience, even if you ignore the
fragmentation issue completely, your files will be less than 100 times
more fragmented compared with a traditional filesystem, so your system
overall will be less affected by fragmentation.
--
Aram Hăvărneanu
erik quanstrom
2012-01-14 20:43:18 UTC
Post by Aram Hăvărneanu
Maybe, I was talking generally about the concept of
content-addressable storage, not venti in particular. I believe it's
possible to do CAS without a major performance hit, look at ZFS, for
example.
do you have any reference to ZFS being content-addressed storage?
Post by Aram Hăvărneanu
Post by erik quanstrom
Post by Aram Hăvărneanu
I'm of a completely different opinion regarding fragmentation. On
SSDs, it's a non issue.
that's not correct.  a very good ssd will do only about 10,000 r/w random
iops.  (certainly they show better numbers for the easy case of compressable
100% write work loads.)  that's less than 40mb/s.  on the other hand, a good ssd will do
about 10x, if eading sequentially.
Sure, but 1,000 iops gives you only a 10% performance hit. With
rotating rust 10 iops give you the same 10% hit, two orders of
magnitude difference. In my experience, even if you are ignoring the
fragmentation issue completely, your files will be less than 100 times
more fragmented compared with a traditional filesystem so your system
overall will be less affected by fragmentation.
your claim was, random access is free on ssds. and i don't see how these
numbers bolster your claim at all.

- erik
Aram Hăvărneanu
2012-01-14 21:39:38 UTC
Post by erik quanstrom
do you have any reference to ZFS being content-addressed storage?
It's not purely content-addressed storage, but it implements
deduplication in the same way venti does:
http://blogs.oracle.com/bonwick/entry/zfs_dedup (the blog post offers
only a high-level overview; you have to dig into the code to see the
implementation).
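For the record, enabling it is a per-dataset switch (pool and dataset names
hypothetical):

$ zfs set dedup=on tank/data		# new blocks are hashed (sha256) and deduplicated on write
$ zpool get dedupratio tank		# how much the pool is actually saving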
Post by erik quanstrom
your claim was, random access is free on ssds.  and i don't see how these
numbers bolster your claim at all.
Maybe my wording wasn't the best, so I'll try again. SSDs can do
100 times more iops than HDDs for a given performance hit ratio. In my
experience, ignoring fragmentation completely leads to average
fragmentation being significantly less than 100 times worse compared
to a traditional filesystem that tries hard to avoid it. In practice,
I've found that a fragmented filesystem on an SSD performs at worst 10%
behind the non-fragmented best-case scenario. I'd trade 10%
performance for significantly simpler code anytime.

The paragraph above ignores caching. I've run ZFS for many years
without an SSD and I haven't noticed the fragmentation, because of very
aggressive caching (see the ARC algorithm).
--
Aram Hăvărneanu
erik quanstrom
2012-01-14 21:54:12 UTC
Post by Aram Hăvărneanu
Post by erik quanstrom
do you have any reference to ZFS being content-addressed storage?
It's not purely content-addressed storage, but it implements
http://blogs.oracle.com/bonwick/entry/zfs_dedup (the blog post offers
only a high level overview, you have to dig into the code to see the
implementation).
content addressed means given the content, you can generate the address.
this is NOT true of zfs at all.
Post by Aram Hăvărneanu
Post by erik quanstrom
your claim was, random access is free on ssds.  and i don't see how these
numbers bolster your claim at all.
Maybe my wording wasn't the best, so I'll try again. SSDs can do
100times more iops than HDDs for a given performance hit ratio. In my
experience, ignoring fragmentation completely leads to average
fragmentation being significantly less than 100 times worse compared
to a traditional filesystem that tries hard to avoid it. In practice,
I've found that a fragmented filesystem on a SSD performs at worst 10%
behind the non-fragmented best case scenario. I'd trade 10%
performance for significantly simpler code anytime.
The phrase above is ignoring caching. I've ran ZFS for many years
without a SSD and I haven't noticed the fragmentation because of very
aggressive caching (see the ARC algorithm).
you keep changing the subject. your original claim was that random
access is not slower than sequential access for ssds. you haven't backed
this argument up. the relative performance of ssds vs hard drives
and caching are completely irrelevant.

- erik
Aram Hăvărneanu
2012-01-14 22:11:48 UTC
Post by erik quanstrom
content addressed means given the content, you can generate the address.
this is NOT true of zfs at all.
How come? With venti, the address is the SHA-1 hash; with ZFS, you get
to choose the hash, but it can still be a hash.
Post by erik quanstrom
you keep changing the subject.  your original claim was that random
access is not slower than sequential access for ssds.  you haven't backed
this argument up.  the relative performance of ssds vs hard drives
and caching are completely irrelevant.
My original claim was that fragmentation is a non-issue if you have
SSDs. I still claim this, and I expanded on the context in my previous
post. Of course random I/O is slower than sequential I/O, SSD or
not, but in practice filesystem fragmentation causes an amount of
random I/O much smaller than what an SSD can handle, so throughput in
the fragmented case is close to the throughput in the sequential case.

I don't think that caching is completely irrelevant. If I have to
choose between a complex scheme that avoids fragmentation and a simple
caching scheme that renders it irrelevant for a particular workload,
I'll choose the caching scheme because it's simpler.
--
Aram Hăvărneanu
erik quanstrom
2012-01-14 22:42:12 UTC
Post by Aram Hăvărneanu
Post by erik quanstrom
content addressed means given the content, you can generate the address.
this is NOT true of zfs at all.
How come? With venti, the address is the SHA-1 hash, with ZFS, you get
to chose the hash, but it can still be a hash.
because in zfs the hash is not used as an address (lba).
Post by Aram Hăvărneanu
My original claim was that fragmentation is a non issue if you have
SSDs. I still claim this and I expanded on the context in my previous
post. Of course that random I/O is slower than sequential I/O, SSD or
not, but in practice, filesystem fragmentation causes an amount or
random I/O much less than what a SSD can handle, so throughput in the
fragmented case is close to the throughput in the sequential case.
I don't think that caching is completely irrelevant. If I have to
chose between a complex scheme that avoids fragmentation and a simple
caching scheme that renders it irrelevant for a particular workload,
I'll chose the caching scheme because it's simpler.
by all means, show us the numbers. personally, i believe the mfgrs are not
lying when they say that random i/o yields 1/10th the performance (at best)
of sequential i/o.

since you assert that it is ssds that make random i/o a non issue, and
not caching, logically caching is not relevant to your point.

- erik

ps. you're presenting a false choice between caching and fragmentation.
case in point, ken's fs doesn't fragment as much as venti (and does not
increase fragmentation over time) and yet it caches.
Aram Hăvărneanu
2012-01-14 23:03:11 UTC
Post by erik quanstrom
Post by Aram Hăvărneanu
How come? With venti, the address is the SHA-1 hash, with ZFS, you get
to chose the hash, but it can still be a hash.
because in zfs the hash is not used as an address (lba).
But by this definition neither is venti. In venti, the hash is
translated to an lba by the index cache. In ZFS, the hash is translated
to an lba by the DDT (dedup table). Both the Venti index cache and the
DDT can be regenerated from the data if they become corrupted.
Post by erik quanstrom
by all means, show us the numbers.  personally, i believe the mfgrs are not
lying when they say that random i/o yields 1/10th the performance (at best)
of sequential i/o.
I will, tomorrow.
Post by erik quanstrom
since you assert that it is ssds that make random i/o a non issue, and
not caching, logically caching is not relevant to your point.
I've been trying to claim two things at the same time; the two
things are unrelated. Both SSDs and caching alleviate fragmentation
issues, in different ways.
Post by erik quanstrom
ps.  you're presenting a false choice between caching and fragmentation.
case in point, ken's fs doesn't fragment as much as venti (and does not
increase fragmentation over time) and yet it caches.
If that's the impression I gave, I'm sorry for the misunderstanding,
but that's not what I meant. Of course there's no choice to be made
between caching and fragmentation. Every filesystem caches.
--
Aram Hăvărneanu
Bakul Shah
2012-01-14 23:32:09 UTC
Post by erik quanstrom
Post by Aram Hăvărneanu
Post by erik quanstrom
content addressed means given the content, you can generate the address.
this is NOT true of zfs at all.
How come? With venti, the address is the SHA-1 hash, with ZFS, you get
to chose the hash, but it can still be a hash.
because in zfs the hash is not used as an address (lba).
True.
Post by erik quanstrom
Post by Aram Hăvărneanu
My original claim was that fragmentation is a non issue if you have
SSDs. I still claim this and I expanded on the context in my previous
post. Of course that random I/O is slower than sequential I/O, SSD or
not, but in practice, filesystem fragmentation causes an amount or
random I/O much less than what a SSD can handle, so throughput in the
fragmented case is close to the throughput in the sequential case.
I don't think that caching is completely irrelevant. If I have to
chose between a complex scheme that avoids fragmentation and a simple
caching scheme that renders it irrelevant for a particular workload,
I'll chose the caching scheme because it's simpler.
by all means, show us the numbers. personally, i believe the mfgrs are not
lying when they say that random i/o yields 1/10th the performance (at best)
of sequential i/o.
Intel 320 300GB SSD numbers (for example):
seq read: 270MBps
rnd read: 39.5Kiops == 158MBps @ 4KB
seq write: 205MBps
rnd write: 23.0Kiops == 92MBps @ 4KB
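The iops-to-MBps conversions check out if 4KB is taken as 4000 bytes; with
hoc (or any calculator):

$ hoc
39500 * 4000 / 1e6
158
23000 * 4000 / 1e6
92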

SSDs don't have to contend with seek times, but you have to pay
the erase cost (which cannot be hidden in "background GC" when
you are going full tilt).

For venti you'd pick 8k at least, so the write throughput will
be higher than 92MBps (but not double). IIRC ZFS picks a much
larger block size, so it suffers less here.

You have to check the numbers using the blocksizes relevant to
you.
Aram Hăvărneanu
2012-01-14 23:45:39 UTC
Post by erik quanstrom
Post by Aram Hăvărneanu
How come? With venti, the address is the SHA-1 hash, with ZFS, you get
to chose the hash, but it can still be a hash.
because in zfs the hash is not used as an address (lba).
True.
As I said, it isn't in Venti either. The hash is translated to an lba by the
index, a table that can be recreated from the data if it's missing or
corrupt. ZFS also uses an index table, called the DDT. It has the
same properties as Venti's index: it can be recreated by reading the
data.
seq read:  270MBps
seq write: 205MBps
Those are just random I/O stats, you are not interpreting them to see
what the penalty would be for some chosen fragmentation/block size.
I'm writing that interpretation tomorrow.
IIRC ZFS picks a much
larger block size so it suffers less here.
ZFS will try to use big blocks if it can, maximum 128kB right now, but
I don't see how this is relevant if you read random 4K logical blocks.

Please notice that both Venti and ZFS will write a big file
sequentially on the disk, that's (partially) what the index table/DDT
is for. In this case, the hash of sequential blocks within a file will
simply map (via the index) to sequential disk addresses.
--
Aram Hăvărneanu
erik quanstrom
2012-01-15 13:12:21 UTC
Post by Aram Hăvărneanu
Post by erik quanstrom
Post by Aram Hăvărneanu
How come? With venti, the address is the SHA-1 hash, with ZFS, you get
to chose the hash, but it can still be a hash.
because in zfs the hash is not used as an address (lba).
True.
As I said, neither in Venti. The hash is translated to a lba by the
index, a table that can be recreated from the data if it's missing or
if it's corrupt. ZFS also uses an index table called DDT. It also has
the same properties as Venti's index, it can be created by reading the
data.
you've confused the internal implementation
with the public programming interface.

venti IS content-addressed. this is because, in the programming interface,
one addresses data by its hash. venti could internally store its bits
in the holes in swiss cheese and it would still be content addressed,
not cheese addressed.

on the other hand, zfs is not content addressed. this is because, as
an iscsi target, zfs will be addressing data by block offset rather
than hash. zfs could store its bits in venti, and it still would NOT be
content addressed nor would it be venti-addressed.

- erik
Aram Hăvărneanu
2012-01-15 14:07:47 UTC
Post by erik quanstrom
you've confused the internal implementation
with the public programming interface.
The properties we're discussing here (dedup, fragmentation,
performance) are an artifact of the implementation, not of the
interface. Fossil+Venti would perform the same if Venti exported its
interface only to Fossil, and not to the whole world.

In ZFS terms, Fossil is akin to ZPL (ZFS POSIX layer), it implements
filesystem semantics over another layer, DMU for ZFS and Venti for
Plan9.

Please note that ZFS exports more than the filesystem interface:
there's also iSCSI (as you noticed), zfs send/recv (remarkably similar
to venti/write and venti/read, even in their use of the standard
input/output interface), and even bits of the DMU are exported. Hell,
even the concrete VFS implementation is exported (though only in kernel
mode), or else you could not write layered filesystems (not as nice and
simple as Plan9 bind(2), but akin to STREAMS filters at the filesystem
layer). You know I've worked on writing the SMB/CIFS in-kernel filesystem
that sits on top of ZFS, right? Well, when you're doing this you have to be
careful around this content-addressed thing. I've used the interface
you claim doesn't exist. It's not public? Why is this relevant to
the properties of the system?
Post by erik quanstrom
on the other hand, zfs is not content addressed.  this is because, as
an iscsi target, zfs will be addressing data by block offset rather
than hash.  zfs could store its bits in venti, and it still would NOT be
content addressed nor would it be venti-addressed.
That's not fair at all; I could just as well claim that venti is not
content addressed because you use hierarchical names to address data in a
vac archive. Vac is just a layer over venti, like ZPL and iSCSI are layers
over the DMU.

In this case the properties of the system depend on the properties of
the underlying layer, not of the interface, and this layer is content
addressed, even by your definition.
--
Aram Hăvărneanu
erik quanstrom
2012-01-15 14:25:09 UTC
Post by Aram Hăvărneanu
sits on top of ZFS, right? Well, when you're doing this you have to be
careful around this content-addressed thing. I've used the interface
you claim it doesn't exist. It's not public? Why is this relevant to
well then, please provide a pointer if this is a public interface.

the reason this is relevant is that we don't call ssds
fancy wafl-addressed storage (assuming that's how they do it),
because that's not the interface one gets. we don't call
raid appliances raid-addressed storage (assuming they're using
raid), because that's not the interface presented.

- erik
Charles Forsyth
2012-01-15 14:39:16 UTC
Post by Aram Hăvărneanu
I've used the interface
you claim it doesn't exist. It's not public? ...
well then, please provide a pointer if this is a public interface.
I think he was saying you might not know about it because it isn't public,
although he's used it. "It's not public?" read with rising intonation?
Charles Forsyth
2012-01-14 18:39:05 UTC
That only affects the directories, which are tiny, not the files.
Post by erik quanstrom
(as an aside, one assumes changed files + directory tree as the
a/mtimes are changed.)
c***@gmx.de
2012-01-13 23:24:56 UTC
dedubstep!

--
cinap
erik quanstrom
2012-01-14 13:07:56 UTC
Post by Aram Hăvărneanu
Post by Francisco J Ballesteros
but if you insert extra music in front of your track dedup in venti won't help.
or would it?
It wouldn't. In practice it seems that it usually appends, probably
for performance reasons, so for me it had worked so far absolutely
great.
ken's file server will work the same way for appends. you won't get a new
copy of the whole file in the worm, just the additional blocks + a copy of
the last partial block + a new copy of some metadata.

- erik
erik quanstrom
2012-01-13 13:30:36 UTC
have you already done something like
du -a | sort -nr | sed 20q
in the main tree? (it may make sense to remount /srv/boot someplace
else to avoid devices, etc.)

- erik
David du Colombier
2012-01-13 13:59:52 UTC
How have you deleted the plan9.iso file?

If you have used rm(1) or fossilcons(4) remove, the blocks
should be properly unallocated from Fossil. But if you used
fossilcons(4) clri for example, you have to manually reclaim
the abandoned storage with clre and bfree, with the help of
fossil/flchk or fossilcons(4) check.

See the following illustration:

# our current empty fossil
main: df
main: 40,960 used + 1,071,710,208 free = 1,071,751,168 (0% used)

# we copy a file on fossil, then remove it properly
% cp /386/9pcf /n/fossil
main: df
main: 3,661,824 used + 1,068,089,344 free = 1,071,751,168 (0% used)
main: remove /active/9pcf
main: df
main: 57,344 used + 1,071,693,824 free = 1,071,751,168 (0% used)

# we copy a file on fossil, then remove it with clri
% cp /386/9pcf /n/fossil
main: df
main: 3,661,824 used + 1,068,089,344 free = 1,071,751,168 (0% used)
main: check
checking epoch 1...
check: visited 1/130829 blocks (0%)
fsys blocks: total=130829 used=447(0.3%) free=130382(99.7%) lost=0(0.0%)
fsck: 0 clri, 0 clre, 0 clrp, 0 bclose
main: clri /active/9pcf
main: df
main: 3,661,824 used + 1,068,089,344 free = 1,071,751,168 (0% used)
main: check
checking epoch 1...
check: visited 1/130829 blocks (0%)
fsys blocks: total=130829 used=447(0.3%) free=130382(99.7%) lost=0(0.0%)
error: non referenced entry in source /active[0]
fsck: 0 clri, 1 clre, 0 clrp, 0 bclose

# we identify the abandoned storage and reclaim it with bfree
term% fossil/flchk -f fossil.img | sed -n 's/^# //p'
clre 0x5 0
term% fossil/flchk -f fossil.img | sed -n 's/^# bclose (.*) .*/bfree \1/p'
bfree 0x7
bfree 0x8
[...]
main: clre 0x5 0
block 0x5 0 40
000000001FF420002900000000000000003691B900000000000000000000000071875E60000001A1
main: bfree 0x7
label 0x7 0 1 1 4294967295 0x71875e60
main: bfree 0x8
label 0x8 1 1 1 4294967295 0x71875e60
[...]
main: check
checking epoch 1...
check: visited 1/130829 blocks (0%)
fsys blocks: total=130829 used=7(0.0%) free=130822(100.0%) lost=0(0.0%)
fsck: 0 clri, 0 clre, 0 clrp, 0 bclose
main: df
main: 3,661,824 used + 1,068,089,344 free = 1,071,751,168 (0% used)

Note that it doesn't update c->fl->nused, reported by df.
--
David du Colombier
t***@polynum.com
2012-01-13 14:11:50 UTC
Post by David du Colombier
How have you deleted the plan9.iso file?
In user space I used rm(1), and then, on the console, check, since it is
said to reclaim the space. And it did.

I now have half the previous size, and this matches the real data here.
Post by David du Colombier
If you have used rm(1) or fossilcons(4) remove, the blocks
should be properly unallocated from Fossil. But if you used
fossilcons(4) clri for example, you have to manually reclaim
the abandoned storage with clre and bfree, with the help of
fossil/flchk or fossilcons(4) check.
[...]
Thanks for the clarifications.

What puzzles me for now is the du(1) hole...
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C