Solr for MediaWiki
How to create a Solr schema for MediaWiki: trying some different ways to use Solr for MediaWiki search. See also the SolrStore extension: http://www.mediawiki.org/wiki/Extension:SolrStore
MediaWiki Database Schema
Start from the page table. The following query gets the most recent revision of every article from the database.
SELECT p.page_id AS "Page ID",
       p.page_title AS "Page Title",
       r.rev_text_id AS "Revision ID",
       t.old_id AS "Text ID"
FROM wikidb.page p
INNER JOIN wikidb.revision r ON p.page_latest = r.rev_id
INNER JOIN wikidb.text t ON r.rev_text_id = t.old_id;
MediaWiki Namespaces
The column page_namespace in the mw_page table indicates the type of a page. A MediaWiki page can be a regular page, a file, a talk page, a user page, a help page, a category, etc. Details about MediaWiki namespaces can be found on Manual:Namespace.
The default namespaces are defined in the file includes/Namespace.php.
The Solr data importer will handle different namespaces differently. A regular page will be indexed with the HTMLStripTransformer. A file should use the TikaEntityProcessor.
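A rough sketch of the regular-page half of this split in data-config.xml; the connection settings and entity name are placeholder assumptions, and namespace 0 is the main (article) namespace:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wikidb" user="wiki" password="wiki"/>
  <document>
    <!-- Regular pages (namespace 0): index the wikitext, stripping any embedded HTML -->
    <entity name="page" transformer="HTMLStripTransformer"
            query="select p.page_id, convert(t.old_text using utf8) as page_text
                   from mw_page p
                   inner join mw_revision r on p.page_latest = r.rev_id
                   inner join mw_text t on r.rev_text_id = t.old_id
                   where p.page_namespace = 0">
      <field column="page_text" stripHTML="true"/>
    </entity>
    <!-- Files (namespace 6) need the TikaEntityProcessor plus the full
         filesystem path of each file; that sketch is in the hash path
         section further down. -->
  </document>
</dataConfig>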
Solr Schema
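No schema worked out yet. As a starting point, a minimal set of fields matching the queries below might look like this in schema.xml (the field names are mine, and the types are the stock ones from the Solr example schema):

<fields>
  <field name="page_id"       type="string" indexed="true" stored="true" required="true"/>
  <field name="page_title"    type="text"   indexed="true" stored="true"/>
  <field name="page_text"     type="text"   indexed="true" stored="true"/>
  <field name="page_modified" type="date"   indexed="true" stored="true"/>
  <field name="page_author"   type="string" indexed="true" stored="true" multiValued="true"/>
</fields>
<!-- uniqueKey sits next to <fields> inside <schema> -->
<uniqueKey>page_id</uniqueKey>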
Solr Data Importer Configuration
See the Solr wiki page for the DataImportHandler.
- DataSource
- Transformer
A Transformer is loaded per entity. It is used to customize a field through a template in the field element.
- EntityProcessor
An EntityProcessor is designed to process a whole entity. The default processor is the SqlEntityProcessor. An entity processor generates a set of fields from the data source.
- TikaEntityProcessor
This is used to index binary files.
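A skeleton of how these pieces fit together in data-config.xml; the names are illustrative, and the SqlEntityProcessor would be used even if the processor attribute were omitted:

<dataConfig>
  <!-- DataSource: where the rows come from -->
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wikidb" user="wiki" password="wiki"/>
  <document>
    <!-- EntityProcessor: turns one entity into a set of fields -->
    <entity name="page" processor="SqlEntityProcessor"
            transformer="TemplateTransformer"
            query="select page_id, page_title from mw_page">
      <!-- Transformer: customizes a field through a template in the field element -->
      <field column="id" template="wiki-${page.page_id}"/>
    </entity>
  </document>
</dataConfig>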
Query Scratch Pad
Saving some sample queries here.
A very simple query to import all pages in MediaWiki. The query extracts the columns page_id, page_title, and page_text. MediaWiki saves most of these fields as binary, so we need to convert them to text.
select p.page_id as page_id,
       convert(p.page_title using utf8) as page_title,
       convert(t.old_text using utf8) as page_text
from mw_page p
inner join mw_revision r on p.page_latest = r.rev_id
inner join mw_text t on r.rev_text_id = t.old_id
All pages under the User: namespace (page_namespace = 2):
select p.page_id, p.page_title
from mw_page p
inner join mw_revision r on p.page_latest = r.rev_id
inner join mw_text t on r.rev_text_id = t.old_id
where p.page_namespace = 2
How MediaWiki Gets the Full Filesystem Path to a File
The code is in the file includes/filerepo/File.php. There are two parts to a full file path: the zone path and the hash path. Both are calculated in a file repo class. By default, MediaWiki uses a local filesystem repo, which is defined in the file includes/filerepo/FSRepo.php. This class is constructed in includes/Setup.php. The zone path is the base directory of a filesystem repo. The hash path is derived from the MD5 hash of the file name: the first character and the first two characters become two nested directory levels (e.g. a/ab/).
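Since MySQL has an md5() function, the hash path can be computed in the import query itself and fed to the TikaEntityProcessor. A sketch, assuming the upload directory is /var/www/wiki/images (really $wgUploadDirectory) and that page_title already stores underscores instead of spaces:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wikidb" user="wiki" password="wiki"/>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- zone path + hash path + file name, e.g. /var/www/wiki/images/a/ab/Foo.pdf -->
    <entity name="files" query="select p.page_id,
                   concat('/var/www/wiki/images/',
                          substring(md5(p.page_title), 1, 1), '/',
                          substring(md5(p.page_title), 1, 2), '/',
                          p.page_title) as file_path
                   from mw_page p where p.page_namespace = 6">
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.file_path}" format="text"/>
    </entity>
  </document>
</dataConfig>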
Last Modified Date
The column rev_timestamp in the mw_revision table stores the modification date of each revision. Its type is binary, so we need to perform the following conversion during data import:
convert(r.rev_timestamp, datetime) as page_modified
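Alternatively, DIH's DateFormatTransformer can parse the raw 14-character timestamp (yyyyMMddHHmmss) once it has been read as a string. A fragment for inside <document>, with assumed field names:

<entity name="page" transformer="DateFormatTransformer"
        query="select p.page_id, convert(r.rev_timestamp using utf8) as rev_timestamp
               from mw_page p
               inner join mw_revision r on p.page_latest = r.rev_id">
  <!-- parses e.g. 20120131235959 into a Solr date -->
  <field column="page_modified" sourceColName="rev_timestamp" dateTimeFormat="yyyyMMddHHmmss"/>
</entity>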
All Authors
Each page revision has an author. There are two columns with author information: rev_user and rev_user_text. The following query gives us the distinct author account names for a wiki page. Again, the convert function returns the author names as strings.
select distinct convert(rev_user_text using utf8) from mw_revision where rev_page = ${file.page_id}
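In data-config.xml this query can run as a child entity of the page entity; each row the child returns adds one value to a multiValued page_author field. The entity names here are assumptions, and the ${...} variable must match the parent entity's name:

<entity name="page" query="select p.page_id from mw_page p">
  <entity name="author"
          query="select distinct convert(rev_user_text using utf8) as page_author
                 from mw_revision where rev_page = ${page.page_id}"/>
</entity>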
Description Challenge
Try to extract a small portion of the page content to serve as the description of a wiki page.
The original idea is to use the Rhino JavaScript engine built into Java to strip the wiki markup from the wiki content text. The Wiky js lib is used as the main reference.
Using the XPathEntityProcessor. Haven't tried this yet ...
Eventually, I stuck with the JavaScript transformer. Basically, I used Rhino JavaScript to rewrite the Wiky lib.
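A minimal sketch of what that script transformer might look like inside data-config.xml. The stripping rules below are a crude stand-in for the rewritten Wiky lib, and page_text/page_description are assumed field names:

<dataConfig>
  <script><![CDATA[
    function makeDescription(row) {
      var text = row.get('page_text');
      if (text != null) {
        text = String(text)
          .replace(/\{\{[^}]*\}\}/g, '')                      // drop templates {{...}}
          .replace(/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/g, '$1')   // [[target|label]] -> label
          .replace(/'{2,}/g, '')                              // bold/italic quote runs
          .replace(/^[=\s]+|[=\s]+$/g, '');                   // heading markers, outer space
        // keep the first ~200 characters as the description
        row.put('page_description', text.substring(0, Math.min(200, text.length)));
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="page" transformer="script:makeDescription"
            query="select p.page_id, convert(t.old_text using utf8) as page_text
                   from mw_page p
                   inner join mw_revision r on p.page_latest = r.rev_id
                   inner join mw_text t on r.rev_text_id = t.old_id"/>
  </document>
</dataConfig>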