Wishlist/Wikipedia Mirror

From Openmoko

Revision as of 13:27, 31 August 2008 by DolfjeBot1 (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search
Wishes warning! This article or section documents one or more OpenMoko Wish List items, the features described here may or may not be implemented in the future.

Contents

A local copy of wikipedia that you can browse anywhere without incurring bandwidth costs. It's already been done for the iPod (see encyclopodia), but I think we can do better on the Neo. For starters, it has a much nicer screen, and a real UI toolkit to play with, and can run multiple apps simultaneously etc.

--HEx 07:22, 5 May 2007 (CEST)

Size

This is the real problem. A recent bzipped article-only dump of enwiki comes in at 2.1Gb. Encyclopodia's download seems to be about 1.7Gb. It seems likely that even with some optimization, a full copy of wikipedia is going to be over 1Gb. Thus it would seem that a complete copy is probably going to be impractical for most people (assuming they want to store other things on their Neo too), so it would be worth finding out what bits can be discarded with relatively little loss.

On the compression front, lzma would seem to be the way to go. Clearly the input needs to be divided into blocks for random access, but the choice of which articles go into which block I think could affect the compression ratio greatly. Maybe if each category is a folder and each letter (article initial) of a category is an archive.

[1] has some interesting benchmarks for wikipedia compression. I still think lzma would be best as it does quite well while not using much CPU and memory (plus it's already in OpenEmbedded).

To save space, we could provide smaller sections of the main Wikipedia (eg Music, Science, Literature) and download anything else on demand. It needs to be clear what's local and remote, however, as data charges can be expensive.

From the other side, we could simply count on 8 Gb Micro SD card. Such card could store not just the article texts but maybe even images, if they are resized in advance to the small size, suitable for Openmoko screen.

Integration

It would be interesting to integrate Wikipedia with the GPS device, allowing a query about the articles, describing objects that are located nearby. There are many articles about cities and other objects with known locations in Wikipedia, and some uniform coordinate representation system seems also present.

Format

This is a tough one. Despite the fact that MediaWiki's markup format is notoriously underspecified and hard to interpret, it's also widespread and human-readable and space-efficient, and, while a custom markup format could possibly be more space-efficient and easier to parse, the world really doesn't need any more markup formats. Thus I'm tending towards just rendering MediaWiki markup directly on the Neo, which would have the plus of being able to view the source if and when rendering goes horribly awry.

Of the possibilities at the list of parsers, flexbisonparse looks the most promising.

HTML or XML as an intermediate step would be a possibility, and there's sure to be a web browser available, but just dumping the output to a GtkTextView would probably be simpler and more efficient.

Other E-book software

Since there's bound to be other, similar software being developed or ported, it would make sense to see if this could usefully be part of a larger project. (The alternative, "do one thing, do it well", also has some appeal though.)

The Mokopedia Way

This section is about my Mokopedia project.

Why it isn't using Wikitext

Wikitext is horribly difficult to parse - just look at how tables are defined in wikitext and what thousands of possibilities you have! It would be overkill to write another wikitext Interpreter. Xhtml instead is simply parsed and can also be embedded easily in the gtkhtml widget. Another reason: if the wikimedia software changes anything on its wikitext 'specification' the parser has to be changed. Its useless to use something in a program, that is no standart and is likely to change without notice. Wikitext is more lightweight? right.

Why we need SDHC support

Size is only a problem for the english wikipedia. I managed to run the entire german wikipedia UNCOMPRESSED on my Nokia E70 Symbian phone. It had to be uncompressed because the articles were viewed with the Nokia browser from a local Raccoon install with a little Python script to search the articles. For more infos about my optimizations on the mediawiki html dump to get this running see [2] (German) If I compress everything with bzip2 it shrinks to 19% of it's former size but this will not help for an offline version since seeking for a special file in such a big archive will cost an enormous amount of computing power. So the only options are to either pack every file separately or to pack a few files together in chunks of several kB. But this will reduce our compression ratio enormously (~30% compression with bz2) but is unavoidable for fast access.

Conclusion: our only hope is to hope for SDHC support. I verified that the (second largest) german wiki will fit on a 2GB non-SDHC SD-Card with every file compressed separately. With my optimizations the english wikipedia will need 2.8GB so we need 4GB SDHC media for that.

On the Booting_from_SD page there seems to be a way to make the Openmoko use SDHC cards. Seems like the kernel supports it already so if uboot loads the kernel from NAND using a SDHC card will work. The way of archieving this is still ugly (console work involved) - but if it does work it seems to be doable for the time.

Personal tools
Wishes warning! This article or section documents one or more OpenMoko Wish List items, the features described here may or may not be implemented in the future.

Contents

A local copy of wikipedia that you can browse anywhere without incurring bandwidth costs. It's already been done for the iPod (see encyclopodia), but I think we can do better on the Neo. For starters, it has a much nicer screen, and a real UI toolkit to play with, and can run multiple apps simultaneously etc.

--HEx 07:22, 5 May 2007 (CEST)

Size

This is the real problem. A recent bzipped article-only dump of enwiki comes in at 2.1Gb. Encyclopodia's download seems to be about 1.7Gb. It seems likely that even with some optimization, a full copy of wikipedia is going to be over 1Gb. Thus it would seem that a complete copy is probably going to be impractical for most people (assuming they want to store other things on their Neo too), so it would be worth finding out what bits can be discarded with relatively little loss.

On the compression front, lzma would seem to be the way to go. Clearly the input needs to be divided into blocks for random access, but the choice of which articles go into which block I think could affect the compression ratio greatly. Maybe if each category is a folder and each letter (article initial) of a category is an archive.

[1] has some interesting benchmarks for wikipedia compression. I still think lzma would be best as it does quite well while not using much CPU and memory (plus it's already in OpenEmbedded).

To save space, we could provide smaller sections of the main Wikipedia (eg Music, Science, Literature) and download anything else on demand. It needs to be clear what's local and remote, however, as data charges can be expensive.

From the other side, we could simply count on 8 Gb Micro SD card. Such card could store not just the article texts but maybe even images, if they are resized in advance to the small size, suitable for Openmoko screen.

Integration

It would be interesting to integrate Wikipedia with the GPS device, allowing a query about the articles, describing objects that are located nearby. There are many articles about cities and other objects with known locations in Wikipedia, and some uniform coordinate representation system seems also present.

Format

This is a tough one. Despite the fact that MediaWiki's markup format is notoriously underspecified and hard to interpret, it's also widespread and human-readable and space-efficient, and, while a custom markup format could possibly be more space-efficient and easier to parse, the world really doesn't need any more markup formats. Thus I'm tending towards just rendering MediaWiki markup directly on the Neo, which would have the plus of being able to view the source if and when rendering goes horribly awry.

Of the possibilities at the list of parsers, flexbisonparse looks the most promising.

HTML or XML as an intermediate step would be a possibility, and there's sure to be a web browser available, but just dumping the output to a GtkTextView would probably be simpler and more efficient.

Other E-book software

Since there's bound to be other, similar software being developed or ported, it would make sense to see if this could usefully be part of a larger project. (The alternative, "do one thing, do it well", also has some appeal though.)

The Mokopedia Way

This section is about my Mokopedia project.

Why it isn't using Wikitext

Wikitext is horribly difficult to parse - just look at how tables are defined in wikitext and what thousands of possibilities you have! It would be overkill to write another wikitext Interpreter. Xhtml instead is simply parsed and can also be embedded easily in the gtkhtml widget. Another reason: if the wikimedia software changes anything on its wikitext 'specification' the parser has to be changed. Its useless to use something in a program, that is no standart and is likely to change without notice. Wikitext is more lightweight? right.

Why we need SDHC support

Size is only a problem for the english wikipedia. I managed to run the entire german wikipedia UNCOMPRESSED on my Nokia E70 Symbian phone. It had to be uncompressed because the articles were viewed with the Nokia browser from a local Raccoon install with a little Python script to search the articles. For more infos about my optimizations on the mediawiki html dump to get this running see [2] (German) If I compress everything with bzip2 it shrinks to 19% of it's former size but this will not help for an offline version since seeking for a special file in such a big archive will cost an enormous amount of computing power. So the only options are to either pack every file separately or to pack a few files together in chunks of several kB. But this will reduce our compression ratio enormously (~30% compression with bz2) but is unavoidable for fast access.

Conclusion: our only hope is to hope for SDHC support. I verified that the (second largest) german wiki will fit on a 2GB non-SDHC SD-Card with every file compressed separately. With my optimizations the english wikipedia will need 2.8GB so we need 4GB SDHC media for that.

On the Booting_from_SD page there seems to be a way to make the Openmoko use SDHC cards. Seems like the kernel supports it already so if uboot loads the kernel from NAND using a SDHC card will work. The way of archieving this is still ugly (console work involved) - but if it does work it seems to be doable for the time.