Added the project to http://projects.openmoko.org. Currently coding everything in Python. fgau is helping me with the Python gtkhtml2 binding.
what about compressing the data? probably wouldn't allow you to search, but hey. --Minime 16:21, 16 July 2007 (CEST)
- (JoSch) I mentioned compression as an option in the article. Every article would have to be compressed on its own, because seeking a few kilobytes in a compressed file of several GB every time would be overkill. A title search could then be done by filename, but I think a title list file is a better idea.
Compress batches of 10 files or so at a time.
- Compress the data in small portions - say 100K compressed - that can be decompressed in under a second. You probably want to use some sort of page-sorting compressor, so that pages in one batch are similar and compress a bit better.
- It sounds logical that electronics-based articles grouped together will compress better than a random mix (of course - in reality)...
- Then store a search-keyword database into this data.
- Works well.
- I use 'wwwoffle'  to search my browsed web-pages.
- --Speedevil 17:10, 16 July 2007 (CEST)
- Thanks for your ideas - I will consider them!
- --JoSch 17:57, 16 July 2007 (CEST)
Bzip is almost certainly a bad idea. It's really quite slow on this class of hardware. On another topic - it may be worthwhile to have one 'core' encyclopedia, which contains entries like "Germany", "Paris", "1928", combined with a daily or weekly download of 'topical' pages such as "Steve Irwin" or "Paris Hilton". This results in much better hit rates for most users.
On compression: the 5000 most popular pages amount to 393M of uncompressed text. Compressing the whole lot as a solid block with gzip -9 results in 88M; gzip -1 gives 101M. Gzipping each article individually with -9 gives 94M, and gzipping them in batches of 10 gives 88M again.
If the stats supplied are accurate, then this would cover some 80% of a month's searches. Perhaps another 500M might take this to 90%+ --Speedevil 00:14, 18 July 2007 (CEST)
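A rough way to repeat the size comparison above on another article set is sketched here. It assumes Python 3 and that the articles are already loaded as a list of byte strings; the function names (compressed_size, compare_strategies) are made up for this illustration, and the numbers will depend entirely on the input.

```python
import bz2
import gzip


def compressed_size(articles, batch_size, compress):
    """Total size in bytes after compressing the articles in batches.

    articles   -- list of byte strings, one per article (assumed input format)
    batch_size -- how many articles are concatenated before compressing
    compress   -- function taking bytes and returning compressed bytes
    """
    total = 0
    for start in range(0, len(articles), batch_size):
        batch = b"".join(articles[start:start + batch_size])
        total += len(compress(batch))
    return total


def compare_strategies(articles):
    """Return a dict mapping a strategy label to its total compressed size."""
    everything = max(len(articles), 1)  # one batch holding all articles

    def gzip9(data):
        return gzip.compress(data, compresslevel=9)

    def gzip1(data):
        return gzip.compress(data, compresslevel=1)

    def bzip2(data):
        return bz2.compress(data, 9)

    return {
        "gzip -9, solid block": compressed_size(articles, everything, gzip9),
        "gzip -1, solid block": compressed_size(articles, everything, gzip1),
        "gzip -9, one file per article": compressed_size(articles, 1, gzip9),
        "gzip -9, batches of 10": compressed_size(articles, 10, gzip9),
        "bzip2, batches of 10": compressed_size(articles, 10, bzip2),
    }
```

Calling compare_strategies(articles) and printing the dict gives one total per strategy, which can be compared with the figures quoted in this thread.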
For me as a programmer it doesn't matter whether I access the bzip2 or the gzip library to uncompress. Maybe I should not take bzip as the archive format, but I will do some real-life benchmarking when my neo1973 arrives. You did excellent research, Speedevil! It shows us that there is not much difference in archive size between compressing all files together and compressing only ten together. But it also shows that not much space would be wasted if every file were compressed separately.
I think in every case one needs an index file of all available articles. It's useful for displaying article search results very quickly (it will be faster than a file system search) and can manage links to different article names. If I pack, say, 10 files together, it will also be needed to store where each article is located.
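One possible shape for such an index is sketched below, assuming Python 3. Nothing here is the project's actual format: the file names (index.json, batchNNNNN.json.gz), the JSON layout and all function names are hypothetical and only illustrate how a title index can point into batches of 10 and also carry alternative article names.

```python
import gzip
import json
import os


def build_archive(articles, out_dir, batch_size=10, redirects=None):
    """Write gzip-compressed batches plus an index file into out_dir.

    articles  -- dict mapping title to article text
    redirects -- optional dict mapping an alternative name to a real title
    Returns the index: title -> [batch file name, position inside the batch].
    """
    os.makedirs(out_dir, exist_ok=True)
    titles = sorted(articles)
    index = {}
    for batch_no, start in enumerate(range(0, len(titles), batch_size)):
        batch_titles = titles[start:start + batch_size]
        batch_name = "batch%05d.json.gz" % batch_no
        payload = json.dumps([articles[t] for t in batch_titles])
        with open(os.path.join(out_dir, batch_name), "wb") as f:
            f.write(gzip.compress(payload.encode("utf-8")))
        for pos, title in enumerate(batch_titles):
            index[title] = [batch_name, pos]
    # Alternative names simply reuse the location of the real article.
    for alias, target in (redirects or {}).items():
        if target in index:
            index[alias] = index[target]
    with open(os.path.join(out_dir, "index.json"), "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index


def load_index(out_dir):
    """Read the index back; it is small compared to the article data."""
    with open(os.path.join(out_dir, "index.json"), encoding="utf-8") as f:
        return json.load(f)


def search_titles(index, prefix):
    """Case-insensitive title prefix search that never touches the batches."""
    prefix = prefix.lower()
    return sorted(t for t in index if t.lower().startswith(prefix))


def read_article(out_dir, index, title):
    """Decompress only the one batch that contains the requested article."""
    batch_name, pos = index[title]
    with open(os.path.join(out_dir, batch_name), "rb") as f:
        texts = json.loads(gzip.decompress(f.read()).decode("utf-8"))
    return texts[pos]
```

Title searches only consult the index, so a lookup costs one index read plus the decompression of a single batch.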
It would be no problem to pack several versions of wikipedia and the user can pick the file he has space for. I added this to the feature section.
I think I'm just too much of a poweruser, because it'd make me embarrassed to say "hey! I can browse wikipedia on my mobile!!! ... but the article about isaac asimov you just asked for is not there..." so I for myself will definitely buy a 4GB microSD card if necessary. So thanks for your ideas - it'd really never have come to my mind that anyone could be content with a stripped-down wikipedia - no offense! :-) --JoSch 00:45, 18 July 2007 (CEST)
Consider what you can potentially do with it. Any email, web-page, SMS, RSS feed, ... can have live links for terms that are cached in the local wikipedia. (downloading 100K+ of text in an article, especially when it may be over a slow link- ...). => erm... please specify how this feature will work - I'm not sure what you mean --JoSch 07:21, 18 July 2007 (CEST) Oh - and using bzip2 in batches of 10 results in 66M. :) --Speedevil 01:33, 18 July 2007 (CEST)
Again we need some more benchmarking on a real Neo...
With gzip: larger archives - faster extraction
With bz2: smaller archives - slower extraction
-> TODO: find out how much the speed differs
With packing 10 files together: smaller overall size - slower article decompression
With packing every file on its own: larger overall size - faster article decompression
-> TODO: find out how much the speed differs
I would gladly do this but I still need a real Neo! Why can't there be someone shipping it from a European location? And why do I have to have a credit card for buying it? Waiting for the next group buy in #email@example.com --JoSch 16:17, 18 July 2007 (CEST)
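Until someone can run it on a real Neo, the two TODOs above could be measured with something like the following sketch. It assumes Python 3 (for time.perf_counter) and a sample of article text held in memory as bytes; the function names are invented for this example, and timings from a desktop machine say little about the phone itself.

```python
import bz2
import gzip
import time


def time_decompress(decompress, blob, repeats=20):
    """Best-of-N wall-clock time in seconds for decompressing blob once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(sample, repeats=20):
    """Compare archive size and extraction time of gzip and bz2 on sample."""
    codecs = (
        ("gzip", gzip.compress, gzip.decompress),
        ("bz2", bz2.compress, bz2.decompress),
    )
    results = {}
    for name, compress, decompress in codecs:
        blob = compress(sample)
        results[name] = {
            "compressed_bytes": len(blob),
            "seconds": time_decompress(decompress, blob, repeats),
        }
    return results
```

Running benchmark once on a single article and once on ten articles joined together would cover both the gzip versus bz2 question and the single file versus batch of 10 question.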
Type of dump
I'd be interested in what made you choose the HTML dump over parsing the database dump; the original data should be more dense once parsed and easier to parse (no need to translate category names, for example). --chrysn 21:21, 29 October 2007 (CET)
Thanks for asking! You can read my reasons here: http://projects.openmoko.org/plugins/wiki/index.php?FAQ&id=49&type=g
In a nutshell: wikitext is not a static standard - it will change over time - and it is much more difficult to parse (tables...) than XHTML. So I let the MediaWiki software do the parsing to XHTML for me and then clean up its output. But you are right - some time in the future I will look deeper into the MediaWiki code and offer patches so that the MediaWiki dumper outputs optimized HTML right away.