This site is closed. You can no longer order any product here. Read why.
For several years since 2003 Wikipedia for TomeRaider was the only product on any handheld device that allowed you to browse the unabridged Wikipedia offline.
I keep these pages for historical reference only.
Complete Wikipedia Encyclopedia on your handheld or notebook
With TomeRaider 3 a stable and efficient version with images has finally become possible.
Proporta invested quite a lot in this project, making many fundamental changes to their compiler specifically for Wikipedia.
You can also choose to download ready-made text-only files from the Wikipedia server.
You can order a ready-made Wikipedia on DVD, even with 250,000+ images.
Many steps are similar to the tried and tested procedure used for generating a Wikipedia file in TomeRaider 2 format.
You can choose to include images, in case you want to put the TomeRaider database on your notebook,
or when your PDA features a huge memory card. See also section images below.
You will need a fast PC, and a lot of patience even then!
My script does much more than it used to do (especially if you choose to include images), hence runs longer.
Downloading an x Gb zip file containing over 250,000 images (optional) takes time.
TomeRaider 3 beats its predecessor in browsing and search speed. But importing data takes an incredibly long time.
For the English Wikipedia this step easily takes a whole night and the day after. During that time your PC will be less responsive to other tasks.
Step by step instructions follow below (I tried to explain everything to the smallest detail, hence a long story).
Some other MediaWiki databases can be converted with this script as well. Wikiquote is one that comes to mind.
In case you wonder: I have no idea if a version of TomeRaider 3 for the EPOC/Symbian/Psion will be released soon or at all. Ask Mat Ripley, director of Proporta.
Notes for Palm owners
The Palm file size problem is now a thing of the past. No multi-volume Wikipedia for Palm users any more, just one huge file, like on other platforms.
TR for Palm still has a limited implementation of html, e.g. it does not yet support html tables.
TR for Palm does not yet support external fonts (it might be done with open source FreeType library, but I have no commitment on this)
My script now leaves all html-encoded characters (like &Omega; or &#1234;) intact, instead of replacing the non-ASCII range with asterisks.
TomeRaider 3 introduces image support. So now you have to make a choice: do you want your TomeRaider Wikipedia with these images included or not.
Including images makes the offline browsing experience on your notebook or PDA/handheld come very close to surfing the Wikipedia online.
Obviously the online version will always be more up to date, but then again the TomeRaider version will serve its content faster
than the occasionally overworked Wikipedia servers may do, and offers better search facilities.
Caveat: you will need a huge memory card for your PDA/handheld.
Even though my script resizes most images, the English Wikipedia with 250,000 images (Oct 2005)
still easily measures 2 Gb or more (depending on runtime options, see below).
Fortunately memory cards become more affordable all the time. May 2005: you can buy a 4 Gb compact flash card
for much less than I paid for a 1 Gb card in January 2003.
My script never resizes bmp/gif/png images with a compression ratio of 1:10 or better. Many of these will be maps or diagrams with texts that would become unreadable after resizing.
Since these images already did compress well, not much would be gained anyway.
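To make the 1:10 rule concrete, here is a minimal sketch in Python (the script itself is Perl, and this is not its code). It assumes the ratio compares the file size with the uncompressed pixel data at 3 bytes per pixel; that definition and the function names are assumptions for the example.

```python
def compression_ratio(file_bytes: int, width: int, height: int,
                      bytes_per_pixel: int = 3) -> float:
    """Return uncompressed size divided by file size (10.0 means 1:10)."""
    if file_bytes <= 0:
        raise ValueError("file_bytes must be positive")
    raw_bytes = width * height * bytes_per_pixel  # assumed definition of 'uncompressed'
    return raw_bytes / file_bytes


def keep_original_size(file_name: str, file_bytes: int, width: int, height: int) -> bool:
    """True for bmp/gif/png files that already compress 1:10 or better."""
    extension = file_name.lower().rsplit(".", 1)[-1]
    if extension not in {"bmp", "gif", "png"}:
        return False
    return compression_ratio(file_bytes, width, height) >= 10.0
```

An 800x600 png map of 20,000 bytes would be left alone under this rule, while a jpg photo of the same dimensions would always be a candidate for resizing.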
The final TomeRaider file size can vary considerably, depending on which runtime parameters you specify.
You can specify the maximum image size (default is max 240 pixels width/height).
All images larger than that will be resized.
You can make an exception on this for png and gif images with a compression ratio better than 1:10.
Many of these will be maps and diagrams where texts would become unreadable after resizing.
You can specify jpg compression ratio, thus influencing image size in bytes (not pixels) and image quality.
(more compression -> smaller size, less quality).
You can discard part of the images to further shrink file size, by specifying the maximum amount (in Mb) of images to keep.
Largest images will be removed first.
For details see below at 7. Run the perl script / Image parameters
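The two size controls above (maximum pixel size, maximum total Mb) can be sketched as follows. This is a Python illustration only, not the Perl code of the script; the function names are made up, and treating 1 Mb as 1024 x 1024 bytes is an assumption.

```python
def fit_within(width: int, height: int, max_px: int = 240):
    """Scale (width, height) down so the longest side is max_px; never enlarge."""
    longest = max(width, height)
    if longest <= max_px:
        return width, height
    scale = max_px / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def keep_within_budget(sizes: dict, max_mb: int) -> dict:
    """sizes maps image name -> bytes after resizing.

    Removes the largest images first until the total fits in max_mb.
    """
    budget = max_mb * 1024 * 1024  # assumption: 1 Mb = 1024 * 1024 bytes
    kept = dict(sizes)
    total = sum(kept.values())
    for name in sorted(kept, key=kept.get, reverse=True):
        if total <= budget:
            break
        total -= kept.pop(name)
    return kept
```

For example, a 480x240 landscape image would come out as 240x120, and a 300x600 portrait image as 120x240.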
In order to speed up your browsing experience TomeRaider will generate thumbnails for images larger than a certain size
(Windows/Pocket PC version 300x300 pixels, Palm version 130x130 pixels).
When an article is opened this thumbnail will be shown. You can click on the thumbnail to zoom and pan the original image.
Obviously this influences file size too. The penalty on Palm platform is larger than on the (P)PC platform, as most images will exceed 130x130 pixels.
Customization of the script
Maybe you want only certain images included in the TR file, let us say only for sports or science articles, and thus economize on file size?
My script provides a hook, a call to a function 'SelectImage' in a language specific file WikiToTomeImagesXX.pl (where XX = language code),
where your custom filter and resize criteria can be coded.
The function is called for each image, with the following parameters:
the image file name
the image display size specified in the article (if any, mostly for thumbnails)
the article title
the full article text
a list of all categories to which this article belongs
The function returns three values:
a boolean, whether to include this image at all
the image display size to be used
an optional extra category tag for this article, e.g. for debugging
You might use this for instance to include only png and gif images, which covers most maps and diagrams,
or to exclude certain categories (remember the category feature is relatively new in Wikipedia, hence not all pages are categorized yet).
If you want complete control perl skills would come in handy :) But for those of you without perl skills I created a project page on Wikipedia,
where category lists and code snippets for image selection and/or resizing could be exchanged. Everyone can contribute!
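To show the shape of such a filter, here is a sketch of the hook's contract in Python. The real hook is a Perl function in WikiToTomeImagesXX.pl, so treat this only as an illustration of the logic: the five inputs and three return values follow the description above, while the category name and the tag text are hypothetical.

```python
from typing import List, Optional, Tuple

WANTED_EXTENSIONS = (".png", ".gif")   # covers most maps and diagrams
EXCLUDED_CATEGORIES = {"Sports"}       # hypothetical category name


def select_image(file_name: str, display_size: Optional[int], title: str,
                 article_text: str, categories: List[str]
                 ) -> Tuple[bool, Optional[int], str]:
    """Return (include this image?, display size to use, optional extra category tag)."""
    if not file_name.lower().endswith(WANTED_EXTENSIONS):
        return False, display_size, ""
    if EXCLUDED_CATEGORIES.intersection(categories):
        # The tag is only meant as a debugging aid.
        return False, display_size, "image skipped by category"
    return True, display_size, ""
```

Because not all pages are categorized yet, a filter that excludes by category will let images from uncategorized articles through; a filter that includes by category will drop them.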
Before you start: decide if you want to include images. See section Images above.
Several steps below refer to language codes. The following languages are supported: de=German, en=English, eo=Esperanto, es=Spanish, fr=French, nl=Dutch, pl=Polish.
You'll need TomeRaider for Windows to generate the .tr3 file.
There is no TomeRaider version for Mac OS or Linux.
Obviously if you want to browse the result on your handheld you will also need a corresponding TomeRaider license for that platform.
Note 1: For TomeRaider 2 the free evaluation version of TomeRaider 2 for Windows sufficed to produce a .tr file for use on any supported platform.
I hope Proporta continues this wise policy.
Locate the file for the language of your choice (de/en/eo/fr/nl/pl), use link 'pages_articles.xml.bz2'
Download into the work directory you created in step 3.
Unzip file with bunzip2.exe
(download Windows version from this site):
in a DOS box run "bunzip2 xxx" where xxx is the zipped file
Rename the unzipped dump file: the perl script expects the file name 'pages_articles_xx.xml', where xx is the language code.
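If you prefer not to use bunzip2.exe, the unzip and rename steps can be combined. The following is a minimal sketch in Python, not part of my script; the function name is made up, and writing the language code in lower case is an assumption.

```python
import bz2
import shutil
from pathlib import Path


def unpack_dump(archive: Path, work_dir: Path, lang: str) -> Path:
    """Decompress a .bz2 dump into work_dir as pages_articles_<lang>.xml."""
    target = work_dir / f"pages_articles_{lang.lower()}.xml"
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        # Copy in chunks, so a dump of several Gb never has to fit in memory.
        shutil.copyfileobj(src, dst)
    return target
```

A call could look like unpack_dump(Path(r"C:\Wikipedia\pages_articles.xml.bz2"), Path(r"C:\Wikipedia"), "en").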
Download the Wikipedia images (optional)
A Preferred method: download the container file(s)
Note: the container files are currently not available for download. Skip to plan B below.
This method saves you a week or more (English Wikipedia) compared to alternate method B.
In your work directory create a subdirectory 'Images_xx' where xx is the language code, e.g. Images_EN for images that belong to the English Wikipedia.
Download the image container file(s)
Again, you'll find the image dump file(s) at the Wikipedia download site.
Locate the file for the language of your choice (de/en/eo/fr/nl/pl), use link 'upload'
If applicable reassemble the original container file from its .aa/.ab/etc parts:
in a DOS box run "copy xxx /b + yyy /b somename.tar", where xxx and yyy are the partial files
(obviously extend the command if you have downloaded three files or more)
Untar the file with tar.exe
(download Windows version from this site to the 'Images_xx' subdirectory):
in a DOS box run "tar -x -f xxx", where xxx is the tar file you just downloaded or assembled
You will see 16 new subdirectories created, each with yet another 16 subdirectories, many or all containing images.
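The reassemble and untar steps of method A can also be done without copy and tar.exe. Below is a rough sketch in Python under the assumption that the parts are plain byte-wise splits of one tar file; the function names are made up for the example.

```python
import shutil
import tarfile
from pathlib import Path


def join_parts(parts, target: Path) -> Path:
    """Concatenate split files (.aa, .ab, ...) in the given order, like 'copy /b'."""
    with open(target, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)
    return target


def untar(tar_path: Path, images_dir: Path) -> int:
    """Extract the container into images_dir; return the number of files now in it."""
    with tarfile.open(tar_path) as tar:
        tar.extractall(images_dir)
    return sum(1 for p in images_dir.rglob("*") if p.is_file())
```

Keep the tar file itself outside the 'Images_xx' directory if you use the returned count, otherwise it is counted as well.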
B Fallback method: let my script do the work
If you have trouble downloading these files you can delegate the image gathering process to my perl script. All files will then be downloaded separately.
(Keep the images for subsequent script runs. Only new images will then be downloaded next time to complete the collection.)
Be warned: downloading all 250,000 images (count Oct 2005) in this way will take one to several weeks (depending on the image server load).
Wikipedia limits download rate per user for obviously good reasons. Any faster download might be aborted any moment, and result in your IP being blocked.
Unzip into the work directory you created in step 3.
In case you want to run from the original sources: download and install a perl interpreter. I use ActivePerl 5.8 (free).
Do not use version 5.6.1 ; it is buggy (memory leaks).
If you want to make changes to the perl script: also download and install an IDE (Integrated Development Environment).
I use Open Perl IDE for Windows (open source), which I find quite satisfactory.
Download image manipulation tool
nconvert.exe (optional, needed when you want images included)
Download into the work directory you created in step 3. Nconvert.exe will be called by my script to resize and further compress each image.
About Nconvert: Written by Pierre-Emmanuel Gougelet.
Nconvert is an impressive command line tool for image conversion and manipulation.
It supports more than 400 graphics formats!
For an impressive list of options type at a DOS box: 'nconvert -help | more' or 'nconvert -help > help.txt'
If you are looking for a well designed, free and capable image viewer and editor, try XnView, from the same author. The text files that come with XnView also relate to Nconvert.
XnView & NConvert are provided as Freeware for private non-commercial, educational use and non-profit organization.
More at Gougelet's site
Run the perl script
If you want to run my script on a regular basis, as many people seem to do, make sure you check for updates on my site on each run, especially
if you include images, since there is a minute chance that a new image has been spotted that makes TomeRaider import fail, see below.
Decide which script parameters you will want to specify in the next step:
Language code (mandatory).
Specify DE (German), EN (English), EO (Esperanto), ES (Spanish), FR (French),
NL (Dutch) or PL (Polish).
Target platform (mandatory)
Specify PALM or PPC (for Pocket PC/Windows PC)
Image parameters (optional)
Add this option when you want images included.
The following subparameters are supported: (do not add spaces between them!)
[:nnn] Replace nnn by a pixel size. E.g. IMG:240 means:
include images, and resize them to max 240 pixels width (landscape images) or max 240 pixels height (portrait images).
[:JPGxxx] You can specify the compression ratio for jpg images. Replace xxx by a number between 10 and 100.
Where 10 means extremely compressed, 100 means not compressed at all. In reality any value below 25 will be unsatisfactory,
any value above 50 will do little to improve image quality, but hugely blow up file size.
[:MAPyyy] You can specify that gif and png images with a compression ratio better than 1:10 should not be resized,
if they do not exceed a certain file size.
Most of these are probably maps and diagrams with texts that would become unreadable after resizing.
Replace yyy by the maximum original image size in bytes for which this exception applies.
E.g. :MAP50000 specifies that png and gif images with an original size less or equal to 50,000 bytes
and a compression ratio better than 1:10 should not be resized.
[:MAXzzz] You can limit the total size of images (after resizing) that should be added to the file. Replace zzz by the total image size in Mb.
Largest images (after resizing) will be discarded first. E.g. :MAX1100 specifies
that the total amount of resized images should not exceed 1100 Mb.
Note: TomeRaider will still add thumbnails, so the final file size is difficult to predict. Establish it by trial and error; see also option :REUSE below.
[:REUSE] You can specify that a new run should completely skip the image resize phase
and reuse all images from the previous run.
This will only work if you did not change the size parameter [:nnn] and the .csv file from the previous run is still available
(and of course the folder with resized images).
Its main purpose is to greatly speed up the file generation process (e.g. from 12 to 3 hours) when you want to rerun with a different :MAXzzz option.
Options :JPGxxx and :MAPyyy will now be ignored. If you want to change these values too, you need to rerun without :REUSE and also to manually remove the folder with resized images first!
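To summarize how the IMG subparameters combine into one option, here is a sketch of a parser in Python. It is an illustration, not the parsing code of my Perl script: the function name and dictionary keys are made up, and leaving JPG, MAP and MAX unset when they are omitted is an assumption (only the default size of 240 pixels is documented above).

```python
import re


def parse_img_option(option: str) -> dict:
    """Split e.g. 'IMG:240:JPG40:MAP50000:MAX1100:REUSE' into its subparameters."""
    parts = option.upper().split(":")
    if parts[0] != "IMG":
        raise ValueError("not an IMG option: " + option)
    settings = {"size": 240, "jpg": None, "map": None, "max_mb": None, "reuse": False}
    for part in parts[1:]:
        match = re.fullmatch(r"(JPG|MAP|MAX)(\d+)", part)
        if part.isdigit():
            settings["size"] = int(part)          # [:nnn] maximum width/height in pixels
        elif part == "REUSE":
            settings["reuse"] = True              # [:REUSE] skip the resize phase
        elif match:
            key = {"JPG": "jpg", "MAP": "map", "MAX": "max_mb"}[match.group(1)]
            settings[key] = int(match.group(2))
        else:
            raise ValueError("unknown subparameter: " + part)
    if settings["jpg"] is not None and not 10 <= settings["jpg"] <= 100:
        raise ValueError("JPG value must be between 10 and 100")
    return settings
```

Note that the option is one unbroken string: a space anywhere inside it would make the DOS box pass it on as two separate parameters.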
When you do not specify parameter IMG or IMG:number,
in other words when you choose not to include images,
a placeholder text will be presented in the TomeRaider file for each removed image (de: 'Bild', en: 'image', nl: 'afb.' ...)
You can optionally specify a small image file (fully qualified path) that should be presented
instead of the placeholder text, e.g. NOIMG:"C:\WikiPedia\IconNoImage.png" (don't forget the colon and double quotes)
If you smartly choose to present a one pixel blank image, remaining image captions will look weirdly out of place, your choice.
Work directory (may be needed)
This only applies when you want images included. If in step 3 you created another work directory than C:\Wikipedia you will have to tell the script about it.
Specify the full path, as follows e.g. PATH:"D:\Wikipedia\TomeRaider" (don't forget the colon and double quotes)
Article selection (useful for quick try of image file generation with subset of articles)
With this parameter you can select only a subset of articles, based on their titles. Specify any character string for the filter.
You'll need the double quotes when the string contains spaces. One wildcard character is available: * (asterisk) allows any character any number of times in this place.
Title comparison will be case insensitive.
It may be a good idea to try this whole procedure with parameter ONLY:q* included.
This way you'll be able to see some results within half an hour or less.
Then rerun without this parameter with more confidence that a day waiting will be worthwhile.
ONLY:"aa*" means include only articles that start with 'aa', 'Aa', etc
ONLY:"*aa" means include only articles that end with 'aa', 'Aa', etc
ONLY:"*aa*" means include only articles that have 'aa', 'Aa' etc anywhere in the title
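The wildcard rule (the asterisk stands for any characters, any number of times, at the place where it appears; comparison is case insensitive) can be sketched like this. It is a Python illustration with a made-up function name, not the matching code of my Perl script.

```python
import re


def title_matches(pattern: str, title: str) -> bool:
    """Case-insensitive match where '*' stands for any run of characters."""
    # Escape everything except the asterisks, then let each asterisk match anything.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, title, flags=re.IGNORECASE | re.DOTALL) is not None
```

With this reading, title_matches("q*", "Quark") is true because the asterisk follows the fixed text, so the title has to start with it.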
Special cases for the English Wikipedia
ONLY:lists means only articles with a title that starts with 'List(s) of ' will be selected.
ONLY:nolists means all articles with a title that starts with 'List(s) of ' will be omitted.
On the October 2005 text-only English edition option 'ONLY:nolists' saved about 40 Mb,
which helped to keep the file just within 1 Gb.
Most of these lists were obsolete anyway as they predated the much more efficient category system.
Character encoding (Polish Palm users only)
For the Polish Wikipedia and the Palm platform only: add WIN or ISO to signal
which character encoding should be used.
In Polish there are 9 letters that are not included in standard ASCII set under ISO-8859-1.
'WIN' will convert the 18 non ASCII Polish letters to Windows 1250 (also called Windows-EE) standard.
'ISO' will do the same to ISO-8859-2 standard.
For further info please contact Piotr Ziemowit Danowski at xxx.@tramp.travel.pl (nospam: read xxx as danowski)
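The difference between the two choices can be inspected with a few lines of Python. This sketch is not part of my script; it relies on Python's standard codecs cp1250 (Windows 1250) and iso8859_2, and the function names are made up.

```python
POLISH_LETTERS = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"  # 9 letters, 18 counting both cases


def encode_polish(text: str, scheme: str) -> bytes:
    """Encode text for the WIN (Windows 1250) or ISO (ISO-8859-2) option."""
    codec = {"WIN": "cp1250", "ISO": "iso8859_2"}[scheme.upper()]
    return text.encode(codec, errors="replace")


def differing_letters():
    """Polish letters whose byte value differs between the two encodings."""
    return [ch for ch in POLISH_LETTERS
            if encode_polish(ch, "WIN") != encode_polish(ch, "ISO")]


print(differing_letters())
```

Only the letters returned by differing_letters() will look wrong when the device expects the other encoding; the rest share the same byte value in both.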
Fix for rendering of right-to-left text on Pocket PC
Some languages are written from right to left. The Pocket PC html engine does not support this.
The script tries to fix this and reverses appropriate unicode characters sequences (currently only Hebrew).
As a result these texts are now the wrong way around in the TomeRaider for Windows version. This is unfortunate, but since TomeRaider files are mostly used on handhelds,
I gave that priority. If you generate the file specifically for Windows you can suppress this reordering feature:
Specify NOFIXRTL to suppress reversing right-to-left text.
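The idea behind the fix can be sketched as follows. This is a simplified Python illustration, not the code of my Perl script: it assumes that a run is a sequence of characters from the Unicode Hebrew block (U+0590 to U+05FF), optionally joined by single spaces, and the names are made up.

```python
import re

# One or more Hebrew words, separated by single spaces (assumed definition of a run).
HEBREW_RUN = re.compile(r"[\u0590-\u05FF]+(?: [\u0590-\u05FF]+)*")


def reverse_hebrew_runs(text: str, fix_rtl: bool = True) -> str:
    """Reverse each Hebrew run in place; fix_rtl=False mimics NOFIXRTL."""
    if not fix_rtl:
        return text
    return HEBREW_RUN.sub(lambda m: m.group(0)[::-1], text)
```

A plain character reversal like this would misplace vowel points and other combining marks, so it should be read as the general idea only.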
Example parameter sets
EN PALM IMG
DE PPC IMG:240 PATH:"D:\Wikipedia"
NL PPC NOIMG:"C:\WikiPedia\IconNoImage.png"
PL PALM ISO
Start a DOS box and run the script as follows:
If you downloaded the executable run "WikiToTome parameters"
or If you downloaded the perl sources run "perl WikiToTome.pl parameters"
For your info: the text only conversion of the English Wikipedia runs for 115 minutes on my 2.8 GHz Win XP machine. My machine has 1 Gb physical memory. With less memory the job may run considerably longer, as more file swapping may occur.
Convert the script output into a TomeRaider database
The perl script produced file "WP_xx_yy_zz.txt" where xx is language code, yy is platform and zz is TXT (text only) or IMG (text + images).
This is the file you want to import into TomeRaider 3.
The newest TomeRaider for Windows (3.1.12 or higher) uses a different image conversion library.
Only from that version the dreaded abend, when TR3 encountered a corrupt image, is a thing of the past.
The newest build also compiles 3 or 4 times faster than before.
TomeRaider bug fixing and optimization will continue for a while, so please check on each run.
In TomeRaider 3 click menu File, entry Import. Select the input file.
Switch to tab 'options', then check the following:
Produce log file
Automatically sort unsorted files
Switch back to tab 'import' and click button 'import'.
In my TomeRaider 2 guide I said on this point: "Take a coffee break".
In TomeRaider 3 this has to be "Go shopping, visit your family or take a hike in the mountains."
You will see hundreds of warnings scroll by. Every line mentioned is a complete article.
No need to tell me about these. I've seen them. 100% elimination is impossible, as Wikipedia does not validate html input itself.
I correct tens of thousands of user errors in my scripts, so a few hundred errors in half as many pages means that well over
99.9% of the articles are free from warnings. Besides, most html errors that remain have little or no impact on the article content, just some minor layout effects.
TomeRaider may now complain that some syntax errors were found and abort compilation.
This is incorrect behavior: TomeRaider should only issue warnings and always continue processing.
Note: Proporta has fixed this, it should not happen any more.
Once TomeRaider reports "Processing..." this phase is complete.
TomeRaider will now go silent for a very long time. You will only see a revolving slash in the lower right corner.
Do not be confused, it does not mean TomeRaider has stalled.
Windows Task Manager will report only a low percentage of CPU usage, but there will be disk activity.
Depending on which Wikipedia you are processing and on the speed of your PC, this phase may last from several hours to almost a full day.
What if TomeRaider blows up? If you chose to include images there is a very minor but real chance
that TomeRaider stumbles over a corrupt image and breaks down, which is a pity as you will have to schedule yet another outdoor activity.
I saw it happen consistently with one image in the English and one in the German Wikipedia.
These two are now exempt from processing, but for every new image that is uploaded to Wikipedia there is a chance of 1:50,000 that it will
have the same disastrous effect on TomeRaider and cause it to exit with a runtime error R6025, 'pure virtual function call'.
If this happens to you:
Look at the bottom of the TomeRaider log file named 'log_WP_xx_yyyy' (where xx = language code and yyyy = platform).
The last image mentioned is the culprit.
Please mail me right away. I will add the image name to an exclusion list, and report it on my site.
Rather than deleting the image (which would be a temporary solution, as my script would download it again on the next run) put any other image with the same name in its place.
You may have to do this in two directories. My script stores resized images in a separate directory tree, named Image_xx_yyyy (where xx = language code, yyyy is max size).
This keeps the original images intact in case you want to run again with a different resize factor.
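Finding the last image mentioned in a long log file can be automated. The sketch below is in Python and makes a loud assumption: the layout of the TomeRaider log is not documented here, so it simply looks for the last token that ends in a common image extension. The function name is made up, and image names that contain spaces would be cut short.

```python
import re
from typing import Optional

# Assumption: an image is mentioned by a file name ending in one of these extensions.
IMAGE_NAME = re.compile(r"[^\s\"'<>|/\\]+\.(?:jpe?g|png|gif|bmp)\b", re.IGNORECASE)


def last_image_in_log(log_text: str) -> Optional[str]:
    """Return the last image file name found in the log text, or None."""
    matches = IMAGE_NAME.findall(log_text)
    return matches[-1] if matches else None
```

Read the log file into a string first and pass that string to the function; check the result against the log by eye before you replace any image.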
May 2005: Proporta uses another graphics library now, which should eliminate this problem.
On completion TomeRaider will report conversion statistics.
The resulting TomeRaider file should now be ready for inspection.
Download the file to your handheld device.
TomeRaider 2 for Pocket PC was a bit picky about the location. I then suggested you put it in a top-level folder. Haven't checked yet if this is still needed.
PPC only: install extra fonts on handheld (optional)
This section applies to the (Pocket) PC version only.
Native Pocket PC support for math symbols and foreign characters is limited.
Unrecognized characters will be displayed as a small box. The following procedure overcomes this limitation.
Code2000.ttf (3.1 MB) is shareware ($5).
Download here; look for 'demo' font.
Copy Code2000.ttf to \Windows\Fonts or to the memory card (see below)
After registration you will get an improved version of the font.
Arial Unicode MS (24 MB!) has been developed by Microsoft.
This font used to be freely available on the web, but is now shipped only with Office 2000, FrontPage 2000, Office XP and Publisher 2002.
It contains TrueType hints, and is therefore better suited for rendering text at small point sizes.
If you want to put the font on your memory card:
Create folder \Fonts on the memory card and copy 'Code2000.ttf' or 'Arial Unicode MS.ttf' to this folder.
As a result the Pocket PC will be able to find fonts in this new folder.
You need to rerun this tiny app after a soft reset (or put it in your Startup folder).
For more about FontStorage.exe see this discussion at Brighthand. You will also find a zipped version there.
Set TomeRaider presentation options
Of course this is a matter of taste, but in my opinion some settings suit Wikipedia best.
TomeRaider 3 for Windows:
Click menu Edit / Preferences, go to tab Contents
Select 'Category information' / 'Display on top'
Uncheck 'Display cat info as tree', 'Cat names collapsible' and 'Cat info initially expanded'
Uncheck 'Show read progress' (hides a progress bar above the articles which is really only for ebooks).
The English compilation fails on articles that discuss nuclides (physics, elementary particles).
Apparently a template used on those articles (one of many) makes the script misbehave and after a long delay fail.
Someone who wants to debug this, and who knows perl, could try to add code that skips certain templates (by name, names to be taken from template list that is shown for given articles in edit mode in online Wikipedia) to see which one caused the problem.
For example: is PhysicsParticle the culprit? See bug report.
From a bug report:
The script executes very slowly – Approximately 3 days, before it crashes my 2.0GHz XP machine (1GB physical memory).
In particular, it seems to have difficulty with articles on Atomic particles, pausing for several hours on each of:
Atom, Boson, Electron, Neutron, Neutrino, Photon, Proton, etc etc
Usually the log contains an entry along the lines of
"Title ‘Photon’ : Template ‘PhysicsParticle’ occurs more than 325 times (optimise possible?)"
It finally crashed with a windows system error after running for more than 3 days. The last entry in the DOS window was Pion. This line had been there for around 4 hours.
Image support in TR is still buggy: even after a successful compilation many images do not show up.
May 2005: Fixed in TomeRaider 3.1.12
TomeRaider 3 import may fail on corrupt image
May 2005: Fixed in Tomeraider 3.1.12
TomeRaider may abort compilation after html syntax errors were found
This is incorrect behavior. When invalid html is encountered, TomeRaider should only issue warnings and continue processing.
Some user mistakes are too fancy for my auto correct code, so TomeRaider should become a bit more lenient.
May 2005: Fixed in Tomeraider 3.1.12
HTML characters in image names
HTML chars in image link are not yet converted to Unicode before retrieving the image.
As a result of this some images will not be found in the cache or online.
This affects mostly Esperanto and Polish Wikipedia I guess.
Fixed in my script 3.0.e
Incomplete Unicode support
Not a bug really, but a design limitation.
Palm: like in TomeRaider 2 there is no support for external fonts.
PPC: TomeRaider does not display non ASCII characters properly in the article index and in category lists.
Full Unicode support has been pledged by Proporta for the future.
Wrong image is sometimes shown in PPC (and Palm?) version
I received several reports about this. Have not seen it myself yet.
When you click on the faulty image the proper image is shown in the zoom/pan box.
This proves it has nothing to do with my script.
I will report this to Proporta.
Index lookup sometimes jumps to wrong entry in some languages.
This mainly affects languages where accented characters are very common (like German and Polish).
You can reach every article through hyperlinks.
Has been reported to Proporta several times.
Sep 2005: Fixed in Tomeraider 3.1.?
Too much white space below images in PPC (and Palm?) version
This is a minor nuisance. Has been reported to Proporta.
May 1, 2008: Script version 4.2.a released.
Lots of updates to adapt to new Wikimedia syntax. Extra tools for [pre/post]processing. No support. Use as you like.
Many thanks to Norbert Jaspers from Holgerlingen, Germany for greatly extending and improving the tool for generating png images from embedded math statements (<math>...</math>): WikiToTomeToolGenerateMathImages.pl.
Nov 20, 2005: Script version 3.2.b released.
Script can now process newest xml dump format.
June 18, 2005: Script version 3.1.c released.
Added Spanish language
May 20, 2005: Script version 3.1.b released.
New runtime options for generating a version with images.
Support for new TomeRaider compile option that optimizes file size per platform.
Improved presentation of images (layout changes).
March 14, 2005: Script version 3.0.j released.
The Wikipedia dump format changed in a small but significant way. This script version recognizes dumps both in old and new format.
February 9, 2005: Script version 3.0.i released.
Bug fix: again one article in English Wikipedia was not parsed properly (different reason than before of course). As a result the TomeRaider compilation failed.
January 14, 2005: Script version 3.0.g released.
Bug fix: one article in English Wikipedia was not parsed properly. As a result the TomeRaider compilation failed.
Added parameter nofixrtl
December 18, 2004: Script version 3.0.f released.
Some wikitables were presented in raw format - Fixed / Categories are back / Some layout changes based on Wikipedia monobook stylesheet
(especially for German Wikipedia: link tables and side boxes are now similar to online version).
December 7, 2004: Script version 3.0.e released.
Image processing optimized /
Most trivial redirects removed /
Non-ASCII characters in image names properly processed /
Missing English Main Page restored /
Less invalid link warnings
November 19, 2004: Public beta test has started.
November 10, 2004: Script abend fixed.
October 30, 2004: Pre beta test phase has started.
Errors and Feedback
If you spot any articles that are not rendered properly, please let me know at firstname.lastname@example.org (nospam: read 'xyz' as 'erikzachte').
In many cases it may help if you illustrate the problem with a screenshot.
This is especially true if you test on a Palm device, as I haven't one myself.
Also please tell me if these instructions are incomplete or unclear in any way.
No need to tell me about the html syntax warnings, I've seen those. Again, 100% elimination is impossible, as Wikipedia does not validate html input itself.
I correct or eliminate 99.9% of the many thousands of html errors in my scripts. Only a few remain. Of course if you get an error that halts TomeRaider that is another matter.
Have fun with Wikipedia on your notebook or PDA!
Oh, by the way...
Since 2003 I spent 1000+ hours to build and maintain these scripts, essentially emulating the online Wikipedia parser, built by many, all by myself.