Wishlist/Wikipedia Mirror

From Openmoko

(Difference between revisions)
Jump to: navigation, search
(Added the Mokopedia Way)
(Why we need SDHC support)
Line 40: Line 40:
 
If I compress everything with bzip2 it shrinks to 19% of it's former size but this will not help for an offline version since seeking for a special file in such a big archive will cost an enormous amount of computing power. So the only options are to either pack every file separately or to pack a few files together in chunks of several kB. But this will reduce our compression ratio enormously (~30% compression with bz2) but is unavoidable for fast access.
 
If I compress everything with bzip2 it shrinks to 19% of it's former size but this will not help for an offline version since seeking for a special file in such a big archive will cost an enormous amount of computing power. So the only options are to either pack every file separately or to pack a few files together in chunks of several kB. But this will reduce our compression ratio enormously (~30% compression with bz2) but is unavoidable for fast access.
  
Conclusion: our only hope is to hope fpr SDHC support. I verified that the (second largest) german wiki will fit on a 2GB non-SDHC SD-Card with every file compressed separately. With my optimizations the english wikipedia will need 2.8GB so we need 4GB SDHC media for that.
+
Conclusion: our only hope is to hope for SDHC support. I verified that the (second largest) german wiki will fit on a 2GB non-SDHC SD-Card with every file compressed separately. With my optimizations the english wikipedia will need 2.8GB so we need 4GB SDHC media for that.

Revision as of 23:02, 26 September 2007

Wishes warning! This article or section documents one or more OpenMoko Wish List items, the features described here may or may not be implemented in the future.

Contents

A local copy of wikipedia that you can browse anywhere without incurring bandwidth costs. It's already been done for the iPod (see encyclopodia), but I think we can do better on the Neo. For starters, it has a much nicer screen, and a real UI toolkit to play with, and can run multiple apps simultaneously etc.

--HEx 07:22, 5 May 2007 (CEST)

Size

This is the real problem. A recent bzipped article-only dump of enwiki comes in at 2.1Gb. Encyclopodia's download seems to be about 1.7Gb. It seems likely that even with some optimization, a full copy of wikipedia is going to be over 1Gb. Thus it would seem that a complete copy is probably going to be impractical for most people (assuming they want to store other things on their Neo too), so it would be worth finding out what bits can be discarded with relatively little loss.

On the compression front, lzma would seem to be the way to go. Clearly the input needs to be divided into blocks for random access, but the choice of which articles go into which block I think could affect the compression ratio greatly.

[1] has some interesting benchmarks for wikipedia compression. I still think lzma would be best as it does quite well while not using much CPU and memory (plus it's already in OpenEmbedded).

To save space, we could provide smaller sections of the main Wikipedia (eg Music, Science, Literature) and download anything else on demand. It needs to be clear what's local and remote, however, as data charges can be expensive.

Format

This is a tough one. Despite the fact that MediaWiki's markup format is notoriously underspecified and hard to interpret, it's also widespread and human-readable and space-efficient, and, while a custom markup format could possibly be more space-efficient and easier to parse, the world really doesn't need any more markup formats. Thus I'm tending towards just rendering MediaWiki markup directly on the Neo, which would have the plus of being able to view the source if and when rendering goes horribly awry.

Of the possibilities at the list of parsers, flexbisonparse looks the most promising.

HTML or XML as an intermediate step would be a possibility, and there's sure to be a web browser available, but just dumping the output to a GtkTextView would probably be simpler and more efficient.

Other E-book software

Since there's bound to be other, similar software being developed or ported, it would make sense to see if this could usefully be part of a larger project. (The alternative, "do one thing, do it well", also has some appeal though.)

The Mokopedia Way

This section is about my Mokopedia project.

Why it isn't using Wikitext

Wikitext is horribly difficult to parse - just look at how tables are defined in wikitext and what thousands of possibilities you have! It would be overkill to write another wikitext Interpreter. Xhtml instead is simply parsed and can also be embedded easily in the gtkhtml widget. Another reason: if the wikimedia software changes anything on its wikitext 'specification' the parser has to be changed. Its useless to use something in a program, that is no standart and is likely to change without notice. Wikitext is more lightweight? right.

Why we need SDHC support

Size is only a problem for the english wikipedia. I managed to run the entire german wikipedia UNCOMPRESSED on my Nokia E70 Symbian phone. It had to be uncompressed because the articles were viewed with the Nokia browser from a local Raccoon install with a little Python script to search the articles. For more infos about my optimizations on the mediawiki html dump to get this running see [2] (German) If I compress everything with bzip2 it shrinks to 19% of it's former size but this will not help for an offline version since seeking for a special file in such a big archive will cost an enormous amount of computing power. So the only options are to either pack every file separately or to pack a few files together in chunks of several kB. But this will reduce our compression ratio enormously (~30% compression with bz2) but is unavoidable for fast access.

Conclusion: our only hope is to hope for SDHC support. I verified that the (second largest) german wiki will fit on a 2GB non-SDHC SD-Card with every file compressed separately. With my optimizations the english wikipedia will need 2.8GB so we need 4GB SDHC media for that.

Personal tools
Wishes warning! This article or section documents one or more OpenMoko Wish List items, the features described here may or may not be implemented in the future.

Contents

A local copy of wikipedia that you can browse anywhere without incurring bandwidth costs. It's already been done for the iPod (see encyclopodia), but I think we can do better on the Neo. For starters, it has a much nicer screen, and a real UI toolkit to play with, and can run multiple apps simultaneously etc.

--HEx 07:22, 5 May 2007 (CEST)

Size

This is the real problem. A recent bzipped article-only dump of enwiki comes in at 2.1Gb. Encyclopodia's download seems to be about 1.7Gb. It seems likely that even with some optimization, a full copy of wikipedia is going to be over 1Gb. Thus it would seem that a complete copy is probably going to be impractical for most people (assuming they want to store other things on their Neo too), so it would be worth finding out what bits can be discarded with relatively little loss.

On the compression front, lzma would seem to be the way to go. Clearly the input needs to be divided into blocks for random access, but the choice of which articles go into which block I think could affect the compression ratio greatly.

[1] has some interesting benchmarks for wikipedia compression. I still think lzma would be best as it does quite well while not using much CPU and memory (plus it's already in OpenEmbedded).

To save space, we could provide smaller sections of the main Wikipedia (eg Music, Science, Literature) and download anything else on demand. It needs to be clear what's local and remote, however, as data charges can be expensive.

Format

This is a tough one. Despite the fact that MediaWiki's markup format is notoriously underspecified and hard to interpret, it's also widespread and human-readable and space-efficient, and, while a custom markup format could possibly be more space-efficient and easier to parse, the world really doesn't need any more markup formats. Thus I'm tending towards just rendering MediaWiki markup directly on the Neo, which would have the plus of being able to view the source if and when rendering goes horribly awry.

Of the possibilities at the list of parsers, flexbisonparse looks the most promising.

HTML or XML as an intermediate step would be a possibility, and there's sure to be a web browser available, but just dumping the output to a GtkTextView would probably be simpler and more efficient.

Other E-book software

Since there's bound to be other, similar software being developed or ported, it would make sense to see if this could usefully be part of a larger project. (The alternative, "do one thing, do it well", also has some appeal though.)

The Mokopedia Way

This section is about my Mokopedia project.

Why it isn't using Wikitext

Wikitext is horribly difficult to parse - just look at how tables are defined in wikitext and what thousands of possibilities you have! It would be overkill to write another wikitext Interpreter. Xhtml instead is simply parsed and can also be embedded easily in the gtkhtml widget. Another reason: if the wikimedia software changes anything on its wikitext 'specification' the parser has to be changed. Its useless to use something in a program, that is no standart and is likely to change without notice. Wikitext is more lightweight? right.

Why we need SDHC support

Size is only a problem for the english wikipedia. I managed to run the entire german wikipedia UNCOMPRESSED on my Nokia E70 Symbian phone. It had to be uncompressed because the articles were viewed with the Nokia browser from a local Raccoon install with a little Python script to search the articles. For more infos about my optimizations on the mediawiki html dump to get this running see [2] (German) If I compress everything with bzip2 it shrinks to 19% of it's former size but this will not help for an offline version since seeking for a special file in such a big archive will cost an enormous amount of computing power. So the only options are to either pack every file separately or to pack a few files together in chunks of several kB. But this will reduce our compression ratio enormously (~30% compression with bz2) but is unavoidable for fast access.

Conclusion: our only hope is to hope for SDHC support. I verified that the (second largest) german wiki will fit on a 2GB non-SDHC SD-Card with every file compressed separately. With my optimizations the english wikipedia will need 2.8GB so we need 4GB SDHC media for that.