Solr for MediaWiki
How to create a Solr schema for MediaWiki: trying some different ways to use Solr for MediaWiki search. See also the SolrStore extension: http://www.mediawiki.org/wiki/Extension:SolrStore
MediaWiki Database Schema
Start from the page table. The following query gets the most recent revision of every article from the database.
SELECT p.page_id AS "Page ID",
       p.page_title AS "Page Title",
       r.rev_text_id AS "Revision ID",
       t.old_id AS "Text ID"
FROM wikidb.page p
INNER JOIN wikidb.revision r ON p.page_latest = r.rev_id
INNER JOIN wikidb.text t ON r.rev_text_id = t.old_id;
MediaWiki Namespaces
The column page_namespace in the mw_page table indicates the type of a page. A MediaWiki page can be a regular page, a file, a talk page, a user page, a help page, a category, etc. Details about MediaWiki namespaces can be found on Manual:Namespace.
The default namespaces are defined in the file includes/Namespace.php.
The Solr data importer will handle different namespaces differently. A regular page will be indexed with the HTMLStripTransformer. A file should use the TikaEntityProcessor.
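A rough sketch of the regular-page half of this split in data-config.xml; the connection settings and entity name are placeholder assumptions, and namespace 0 is the main (article) namespace:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wikidb" user="wiki" password="wiki"/>
  <document>
    <!-- Regular pages (namespace 0): index the wikitext, stripping any embedded HTML -->
    <entity name="page" transformer="HTMLStripTransformer"
            query="select p.page_id, convert(t.old_text using utf8) as page_text
                   from mw_page p
                   inner join mw_revision r on p.page_latest = r.rev_id
                   inner join mw_text t on r.rev_text_id = t.old_id
                   where p.page_namespace = 0">
      <field column="page_text" stripHTML="true"/>
    </entity>
    <!-- Files (namespace 6) need the TikaEntityProcessor plus the full
         filesystem path of each file; that sketch is in the hash path
         section further down. -->
  </document>
</dataConfig>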
Solr Schema
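No schema worked out yet. As a starting point, a minimal set of fields matching the queries below might look like this in schema.xml (the field names are mine, and the types are the stock ones from the Solr example schema):

<fields>
  <field name="page_id"       type="string" indexed="true" stored="true" required="true"/>
  <field name="page_title"    type="text"   indexed="true" stored="true"/>
  <field name="page_text"     type="text"   indexed="true" stored="true"/>
  <field name="page_modified" type="date"   indexed="true" stored="true"/>
  <field name="page_author"   type="string" indexed="true" stored="true" multiValued="true"/>
</fields>
<!-- uniqueKey sits next to <fields> inside <schema> -->
<uniqueKey>page_id</uniqueKey>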
Solr Data Importer Configuration
See the Solr wiki page for the DataImportHandler.
- DataSource
- Transformer
A Transformer is loaded per entity. It is used to customize a field through a template in the field element.
- EntityProcessor
An EntityProcessor is designed to process a whole entity. The default processor is the SqlEntityProcessor. An entity processor generates a set of fields from the data source.
- TikaEntityProcessor
This is used to index binary files.
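A skeleton of how these pieces fit together in data-config.xml; the names are illustrative, and the SqlEntityProcessor would be used even if the processor attribute were omitted:

<dataConfig>
  <!-- DataSource: where the rows come from -->
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wikidb" user="wiki" password="wiki"/>
  <document>
    <!-- EntityProcessor: turns one entity into a set of fields -->
    <entity name="page" processor="SqlEntityProcessor"
            transformer="TemplateTransformer"
            query="select page_id, page_title from mw_page">
      <!-- Transformer: customizes a field through a template in the field element -->
      <field column="id" template="wiki-${page.page_id}"/>
    </entity>
  </document>
</dataConfig>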
Query Scratch Pad
Saving some sample queries here.
A very simple query to import all pages in MediaWiki. The query extracts the columns page_id, page_title, and page_text. MediaWiki saves most of these fields as binary, so we need to convert them to text.
select p.page_id as page_id,
       convert(p.page_title using utf8) as page_title,
       convert(t.old_text using utf8) as page_text
from mw_page p
inner join mw_revision r on p.page_latest = r.rev_id
inner join mw_text t on r.rev_text_id = t.old_id
All pages under the User: namespace (page_namespace = 2):
select p.page_id, p.page_title
from mw_page p
inner join mw_revision r on p.page_latest = r.rev_id
inner join mw_text t on r.rev_text_id = t.old_id
where p.page_namespace = 2
How MediaWiki Gets the Full Filesystem Path to a File
The code is in the file includes/filerepo/File.php. There are two parts to a full file path: the zone path and the hash path. Both are calculated in a file repo class. By default, MediaWiki uses a local filesystem repo, which is defined in the file includes/filerepo/FSRepo.php. This class is constructed in includes/Setup.php. The zone path is the base directory of a filesystem repo. The hash path is derived from the MD5 hash of the file name: the first character and the first two characters become two nested directory levels (e.g. a/ab/).
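Since MySQL has an md5() function, the hash path can be computed in the import query itself and fed to the TikaEntityProcessor. A sketch, assuming the upload directory is /var/www/wiki/images (really $wgUploadDirectory) and that page_title already stores underscores instead of spaces:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wikidb" user="wiki" password="wiki"/>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- zone path + hash path + file name, e.g. /var/www/wiki/images/a/ab/Foo.pdf -->
    <entity name="files" query="select p.page_id,
                   concat('/var/www/wiki/images/',
                          substring(md5(p.page_title), 1, 1), '/',
                          substring(md5(p.page_title), 1, 2), '/',
                          p.page_title) as file_path
                   from mw_page p where p.page_namespace = 6">
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.file_path}" format="text"/>
    </entity>
  </document>
</dataConfig>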
Last Modified Date
The column rev_timestamp in the mw_revision table stores the modification date of each revision. Its type is binary, so we need to perform the following conversion during data import:
convert(r.rev_timestamp, datetime) as page_modified
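Alternatively, DIH's DateFormatTransformer can parse the raw 14-character timestamp (yyyyMMddHHmmss) once it has been read as a string. A fragment for inside <document>, with assumed field names:

<entity name="page" transformer="DateFormatTransformer"
        query="select p.page_id, convert(r.rev_timestamp using utf8) as rev_timestamp
               from mw_page p
               inner join mw_revision r on p.page_latest = r.rev_id">
  <!-- parses e.g. 20120131235959 into a Solr date -->
  <field column="page_modified" sourceColName="rev_timestamp" dateTimeFormat="yyyyMMddHHmmss"/>
</entity>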
All Authors
Each page revision has an author. There are two columns with author information: rev_user and rev_user_text. The following query gives us the distinct author account names for a wiki page. Again, the convert function returns the author names as strings.
select distinct convert(rev_user_text using utf8) from mw_revision where rev_page = ${file.page_id}
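In data-config.xml this query can run as a child entity of the page entity; each row the child returns adds one value to a multiValued page_author field. The entity names here are assumptions, and the ${...} variable must match the parent entity's name:

<entity name="page" query="select p.page_id from mw_page p">
  <entity name="author"
          query="select distinct convert(rev_user_text using utf8) as page_author
                 from mw_revision where rev_page = ${page.page_id}"/>
</entity>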
Description Challenge
Try to extract a small portion of the page content to serve as the description of a wiki page.
The original idea is to use the Rhino JavaScript engine built into Java to strip the wiki markup from the wiki content text. The Wiky js lib is used as the main reference.
Using the XPathEntityProcessor. Haven't tried this yet ...
Eventually, I stuck with the JavaScript transformer. Basically, I used Rhino JavaScript to rewrite the Wiky lib.
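A minimal sketch of what that script transformer might look like inside data-config.xml. The stripping rules below are a crude stand-in for the rewritten Wiky lib, and page_text/page_description are assumed field names:

<dataConfig>
  <script><![CDATA[
    function makeDescription(row) {
      var text = row.get('page_text');
      if (text != null) {
        text = String(text)
          .replace(/\{\{[^}]*\}\}/g, '')                      // drop templates {{...}}
          .replace(/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/g, '$1')   // [[target|label]] -> label
          .replace(/'{2,}/g, '')                              // bold/italic quote runs
          .replace(/^[=\s]+|[=\s]+$/g, '');                   // heading markers, outer space
        // keep the first ~200 characters as the description
        row.put('page_description', text.substring(0, Math.min(200, text.length)));
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="page" transformer="script:makeDescription"
            query="select p.page_id, convert(t.old_text using utf8) as page_text
                   from mw_page p
                   inner join mw_revision r on p.page_latest = r.rev_id
                   inner join mw_text t on r.rev_text_id = t.old_id"/>
  </document>
</dataConfig>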