Wishlist/Wikipedia Mirror

From Openmoko

Revision as of 08:47, 17 July 2007 by Glenn (Talk | contribs)

Jump to: navigation, search
Wishes warning! This article or section documents one or more OpenMoko Wish List items, the features described here may or may not be implemented in the future.

A local copy of wikipedia that you can browse anywhere without incurring bandwidth costs. It's already been done for the iPod (see encyclopodia), but I think we can do better on the Neo. For starters, it has a much nicer screen, and a real UI toolkit to play with, and can run multiple apps simultaneously etc.

--HEx 07:22, 5 May 2007 (CEST)

Size

This is the real problem. A recent bzipped article-only dump of enwiki comes in at 2.1Gb. Encyclopodia's download seems to be about 1.7Gb. It seems likely that even with some optimization, a full copy of wikipedia is going to be over 1Gb. Thus it would seem that a complete copy is probably going to be impractical for most people (assuming they want to store other things on their Neo too), so it would be worth finding out what bits can be discarded with relatively little loss.

On the compression front, lzma would seem to be the way to go. Clearly the input needs to be divided into blocks for random access, but the choice of which articles go into which block I think could affect the compression ratio greatly.

[1] has some interesting benchmarks for wikipedia compression. I still think lzma would be best as it does quite well while not using much CPU and memory (plus it's already in OpenEmbedded).

To save space, we could provide smaller sections of the main Wikipedia (eg Music, Science, Literature) and download anything else on demand. It needs to be clear what's local and remote, however, as data charges can be expensive.

Format

This is a tough one. Despite the fact that MediaWiki's markup format is notoriously underspecified and hard to interpret, it's also widespread and human-readable and space-efficient, and, while a custom markup format could possibly be more space-efficient and easier to parse, the world really doesn't need any more markup formats. Thus I'm tending towards just rendering MediaWiki markup directly on the Neo, which would have the plus of being able to view the source if and when rendering goes horribly awry.

Of the possibilities at the list of parsers, flexbisonparse looks the most promising.

HTML or XML as an intermediate step would be a possibility, and there's sure to be a web browser available, but just dumping the output to a GtkTextView would probably be simpler and more efficient.

Other E-book software

Since there's bound to be other, similar software being developed or ported, it would make sense to see if this could usefully be part of a larger project. (The alternative, "do one thing, do it well", also has some appeal though.)

Personal tools
Wishes warning! This article or section documents one or more OpenMoko Wish List items, the features described here may or may not be implemented in the future.

A local copy of wikipedia that you can browse anywhere without incurring bandwidth costs. It's already been done for the iPod (see encyclopodia), but I think we can do better on the Neo. For starters, it has a much nicer screen, and a real UI toolkit to play with, and can run multiple apps simultaneously etc.

--HEx 07:22, 5 May 2007 (CEST)

Size

This is the real problem. A recent bzipped article-only dump of enwiki comes in at 2.1Gb. Encyclopodia's download seems to be about 1.7Gb. It seems likely that even with some optimization, a full copy of wikipedia is going to be over 1Gb. Thus it would seem that a complete copy is probably going to be impractical for most people (assuming they want to store other things on their Neo too), so it would be worth finding out what bits can be discarded with relatively little loss.

On the compression front, lzma would seem to be the way to go. Clearly the input needs to be divided into blocks for random access, but the choice of which articles go into which block I think could affect the compression ratio greatly.

[1] has some interesting benchmarks for wikipedia compression. I still think lzma would be best as it does quite well while not using much CPU and memory (plus it's already in OpenEmbedded).

To save space, we could provide smaller sections of the main Wikipedia (eg Music, Science, Literature) and download anything else on demand. It needs to be clear what's local and remote, however, as data charges can be expensive.

Format

This is a tough one. Despite the fact that MediaWiki's markup format is notoriously underspecified and hard to interpret, it's also widespread and human-readable and space-efficient, and, while a custom markup format could possibly be more space-efficient and easier to parse, the world really doesn't need any more markup formats. Thus I'm tending towards just rendering MediaWiki markup directly on the Neo, which would have the plus of being able to view the source if and when rendering goes horribly awry.

Of the possibilities at the list of parsers, flexbisonparse looks the most promising.

HTML or XML as an intermediate step would be a possibility, and there's sure to be a web browser available, but just dumping the output to a GtkTextView would probably be simpler and more efficient.

Other E-book software

Since there's bound to be other, similar software being developed or ported, it would make sense to see if this could usefully be part of a larger project. (The alternative, "do one thing, do it well", also has some appeal though.)