How to rescue content from a MediaWiki wiki.
The goal was to rescue all of the content from an old MediaWiki wiki into a format which could be re-hosted with another content management system.
This was a failed project; it proved impossible to automate, so I ended up doing it by hand.
See also:
Using official tools ∞
- Some kind of built-in export tool should exist, but as far as I understand it's out of date. Sigh.
- Dumping the database and restoring it was a harrowing experience because of the annoyance of getting the LAMP-type hosting package to work. I gave up.
— However, it is still a valid notion to pry things out of the database in some way.
- A complete disaster.
- XML dump, like [[Special:Export/Welcome]]
— I have no clue how that could be useful for what I want.
spidering ∞
Note that one of the big problems with spidering is that it does not get the wiki source, which might have metadata or other work that could be useful: for example, HTML comments or the specifics of MediaWiki-syntax tables. Maybe just begin with [[Special:Allpages]].
- [[Special:LonelyPages]]
— First, make sure that all pages are linked to, so they'd actually be found and downloaded.
— Or specifically spider that page as well.
- wget: a disaster; it can't rewrite pages properly, always giving links to /
- curl and Curlmirror looked hopeful
manually? ∞
get a list of all pages ∞
- Dump the database.
— From the database, get a list of all pages from one of the tables.
- Process [[Special:Allpages]]
— Visit http://localhost/wiki/Special:AllPages and discern the number of namespaces (&namespace=0, etc.) .. I have 15
— For every namespace, discern the number of "x to y" ranges
— For every range, visit 'x'
—- Collect every link on that page
- Visit every collected link.
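The walk above could be sketched roughly like this. The base URL, the namespace count, and the regex-based link extraction are all my assumptions; it also ignores the "x to y" pagination (&from=), so it only sees the first listing page of each namespace.

```python
# Rough sketch of walking Special:AllPages -- hypothetical base URL and
# namespace range; adjust for the wiki in question.
import re
import urllib.request

BASE = "http://localhost/wiki"   # assumed wiki root
NAMESPACES = range(16)           # "I have 15" -> namespaces 0..15

def extract_page_links(html):
    """Pull /wiki/Title hrefs out of a listing page, skipping Special: pages."""
    links = re.findall(r'href="(/wiki/[^"#?]+)"', html)
    return sorted({l for l in links if not l.startswith("/wiki/Special:")})

def all_pages(namespace):
    """Fetch the AllPages listing for one namespace and return its page links."""
    url = f"{BASE}/index.php?title=Special:AllPages&namespace={namespace}"
    with urllib.request.urlopen(url) as resp:
        return extract_page_links(resp.read().decode("utf-8", "replace"))

if __name__ == "__main__":
    for ns in NAMESPACES:
        for page in all_pages(ns):
            print(page)
```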
spider the list of pages ∞
A self-made spider program might be able to follow links and download the pages. But what do I attempt to get?
- Raw wikitext
— e.g. http://localhost/wiki/index.php?action=raw&title=Main_Page
— But the file is always called "index.php", so output it properly.
— Would have to translate the markup (could be hard? but could be a precursor to my own markup language; see Compiled Website)
- HTML version
— e.g.
— Would have to strip a lot of it (sidebar, header, footer, more?)
- Print version
— Would still have to strip a lot of it, I think (what parts?)
- Cached HTML version
— I noticed that there are cached versions of the pages!
— e.g. /opt/bitnami/apps/wiki/htdocs/images/cache/9/9f/Apple.html
— But is it up-to-date? How can I tell? How can I force an update (visit the page, &action=purge, some script, delete the cache and visit it)?
other ∞
- Alternative parsers: https://www.mediawiki.org/wiki/Alternative_parsers
— Looks like a hopeless list
- XML dump, like http://example.com/wiki/Special:Export/Welcome
— How do I dump everything?
— What do I do with the XML files afterwards? =/
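One partial answer to "what do I do with the XML files": the Special:Export output is ordinary XML with page/title/revision/text elements, so the standard library can pull the wikitext out of it. The exact namespace URI varies by MediaWiki version, so this sketch strips namespaces rather than hard-coding one.

```python
# Extract (title, wikitext) pairs from a Special:Export XML dump.
import xml.etree.ElementTree as ET

def pages_from_export(xml_text):
    """Yield (title, wikitext) pairs from an export dump string."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        el.tag = el.tag.split("}")[-1]   # drop the version-specific namespace
    for page in root.iter("page"):
        title = page.findtext("title", "")
        text = page.findtext("revision/text", "")
        yield title, text
```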
- Database dump and manually do something with it?
— Yeah, right.
- Made for MediaWiki, but works for all offline web viewing.
- Wiki on a stick

