How to rescue content from a MediaWiki wiki.
The goal was to rescue all of the content from an old MediaWiki wiki into a format which could be re-hosted with another content management system.
This was a failed project; it proved impossible to automate, so I ended up doing it by hand.
See also:
Using official tools ∞
- Some kind of built-in export tool should exist, but as far as I understand it's out of date. Sigh.
- Dumping the database and restoring it was a harrowing experience because of the annoyance of getting the LAMP-type hosting package to work. I gave up.
— However, it is still a valid notion to pry things out of the database in some way.
- A complete disaster.
- XML dump, like [[Special:Export/Welcome]]
— I have no clue how that could be useful for what I want.
spidering ∞
Note that one of the big problems with spidering is that it does not get the wiki source, which might have metadata or other work that could be useful: for example, HTML comments or the specifics of MediaWiki-syntax tables. Maybe just begin with [[Special:Allpages]].
- [[Special:LonelyPages]]
— First, make sure that all pages are linked to, so they'd actually be found and downloaded.
— Or specifically spider that page as well.
- wget: a disaster; it can't rewrite pages properly, always giving links to /
- curl and Curlmirror looked hopeful
manually? ∞
get a list of all pages ∞
- Dump the database.
— From the database, get a list of all pages from one of the tables.
- Process [[Special:Allpages]]
— Visit http://localhost/wiki/Special:AllPages and discern the number of namespaces (&namespace=0, etc.) .. I have 15
— For every namespace, discern the number of "x to y" ranges
— For every range, visit 'x'
—- Collect every link on that page
- Visit every collected link.
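The walk above could be sketched roughly like this. The base URL, the namespace count, and the regex-based link extraction are all my assumptions; it also ignores the "x to y" pagination (&from=), so it only sees the first listing page of each namespace.

```python
# Rough sketch of walking Special:AllPages -- hypothetical base URL and
# namespace range; adjust for the wiki in question.
import re
import urllib.request

BASE = "http://localhost/wiki"   # assumed wiki root
NAMESPACES = range(16)           # "I have 15" -> namespaces 0..15

def extract_page_links(html):
    """Pull /wiki/Title hrefs out of a listing page, skipping Special: pages."""
    links = re.findall(r'href="(/wiki/[^"#?]+)"', html)
    return sorted({l for l in links if not l.startswith("/wiki/Special:")})

def all_pages(namespace):
    """Fetch the AllPages listing for one namespace and return its page links."""
    url = f"{BASE}/index.php?title=Special:AllPages&namespace={namespace}"
    with urllib.request.urlopen(url) as resp:
        return extract_page_links(resp.read().decode("utf-8", "replace"))

if __name__ == "__main__":
    for ns in NAMESPACES:
        for page in all_pages(ns):
            print(page)
```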
spider the list of pages ∞
A self-made spider program might be able to follow links and download the pages. But what do I attempt to get?
- Raw wikitext
— e.g. http://localhost/wiki/index.php?action=raw&title=Main_Page
— But the file is always called "index.php", so output it properly.
— Would have to translate the markup (could be hard? but could be a precursor to my own markup language; see Compiled Website)
- HTML version
— e.g.
— Would have to strip a lot of it (sidebar, header, footer, more?)
- Print version
— Would still have to strip a lot of it, I think (what parts?)
- Cached HTML version
— I noticed that there are cached versions of the pages!
— e.g. /opt/bitnami/apps/wiki/htdocs/images/cache/9/9f/Apple.html
— But is it up-to-date? How can I tell? How can I force an update (visit the page, &action=purge, some script, delete the cache and visit it)?
other ∞
- Alternative parsers: https://www.mediawiki.org/wiki/Alternative_parsers
— Looks like a hopeless list
- XML dump, like http://example.com/wiki/Special:Export/Welcome
— How do I dump everything?
— What do I do with the XML files afterwards? =/
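One partial answer to "what do I do with the XML files": the Special:Export output is ordinary XML with page/title/revision/text elements, so the standard library can pull the wikitext out of it. The exact namespace URI varies by MediaWiki version, so this sketch strips namespaces rather than hard-coding one.

```python
# Extract (title, wikitext) pairs from a Special:Export XML dump.
import xml.etree.ElementTree as ET

def pages_from_export(xml_text):
    """Yield (title, wikitext) pairs from an export dump string."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        el.tag = el.tag.split("}")[-1]   # drop the version-specific namespace
    for page in root.iter("page"):
        title = page.findtext("title", "")
        text = page.findtext("revision/text", "")
        yield title, text
```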
- Database dump and manually do something with it?
— Yeah, right.
- Made for MediaWiki, but works for all offline web viewing.
- Wiki on a stick

