Solr for WordPress
Using Solr for Wordpress.
WordPress Database Schema
WordPress post short URL pattern:
http://WP_BLOGS.domain/WP_BLOGS.path/?p=WP_XXX_POSTS.ID
select blog.domain as domain, blog.path as path, post.id as postid, blog.blog_id as blogid, post.post_author, post.post_title, post.post_excerpt, post.post_content, post.post_type from wp_blogs blog, wp_886_posts post where blog.blog_id = 886 and post.post_status='publish'
Post Category and Tag
Both category and tag are saved in table wp_xx_terms. We will use the name column as the keyword. The table wp_xx_term_taxonomy will tell this term is a category for tag. The table wp_xx_term_relationships will associate a post/page to a term.
select term.name from wp_886_terms term, wp_886_term_taxonomy taxonomy, wp_886_term_relationships relat where relat.object_id = postid and relat.term_taxonomy_id = taxonomy.term_taxonomy_id and taxonomy.term_id = term.term_id
Post Comments
All about table wp_xx_comments
select comment_content as comment from wp_4_comments where comment_post_id = postid and comment_approved = 1
Media files in WordPress Post
All media files are saved in table wp_xxx_posts under post_type: attachment. The status of the attachement post will always be '''inherit''', which means it depends on the status of its parent post.
Database Importer Configuration
<entity name="post4" processor="SqlEntityProcessor" query="select blog.domain as domain, blog.path as path, post.id as postid, blog.blog_id as blogid, post.post_author as postAuthor, post.post_title as postTitle, post.post_excerpt as postExcerpt, post.post_content as postContent, post.post_type from wp_blogs blog, wp_4_posts post where blog.blog_id = 4 and post.post_status='publish'" transformer="TemplateTransformer" > <!-- column is the database table column name, name is the solr schema field name. --> <field column="id" name="id" template="${post4.blogid}-${post4.postid}"/> <field column="postTitle" name="title" /> <field column="postExcerpt" name="description" /> <field column="postContent" name="content" /> <entity name="post4_authors" processor="SqlEntityProcessor" query="select display_name from wp_users where id = ${post4.postAuthor}" > <field column="display_name" name="authors" /> </entity> <field column="url" name="url" template="http://${post4.domain}${post4.path}?p=${post4.postid}" /> <!-- trying to get the keywords --> <entity name="post4_cats" processor="SqlEntityProcessor" query="select term.name as keyword from wp_4_terms term, wp_4_term_taxonomy taxonomy, wp_4_term_relationships relat where relat.object_id = ${post4.postid} and relat.term_taxonomy_id = taxonomy.term_taxonomy_id and taxonomy.term_id = term.term_id" > <field column="keyword" name="keywords"/> </entity> </entity>
A Little Script to Automate Things
mysql_connect($server, $username, $password) or die("Could not connect: " . mysql_error()); mysql_select_db($database); $blogIds = mysql_evaluate_array('select blog_id from wp_blogs'); echo "<pre>"; for ($i = 0; $i < count($blogIds); $i++) { echo <<<EOT <entity name="post$blogIds[$i]" processor="SqlEntityProcessor" query="select blog.domain as domain, blog.path as path, post.id as postid, blog.blog_id as blogid, post.post_author, post.post_title, post.post_excerpt, post.post_content, post.post_type from wp_blogs blog, wp_$blogIds[$i]_posts post where blog.blog_id = $blogIds[$i] and post.post_status='publish'" transformer="TemplateTransformer" > <field column="id" name="id" template="\${post$blogIds[$i].blogid}-\${post$blogIds[$i].postid}"/> <field column="post_title" name="title" /> <field column="post_excerpt" name="description" /> <field column="post_content" name="content" /> <field column="url" name="url" template="http://\${post$blogIds[$i].domain}\${post$blogIds[$i].path}?p=\${post$blogIds[$i].postid}" /> </entity> EOT; echo "\n"; } echo "</pre>";
Script Transformer for Description
The idea candidate for description of a WordPress psot will be the excerpt field of a post. However, the exerpt field has not always been filled. So here comes the following script:
// parse the post content and extract the description for a post. function processPost(row) { orgContent = new String(row.get('postContent')); // strip out all HTML tags. content = orgContent.replace(/<[^>]*>/g, ""); row.put('postContent', content); // decide the excerpt as the description. excerpt = new String(row.get('postExcerpt')); if (excerpt.trim().length < 1) { if (content.length > 255) { excerpt = content.substring(0, 255); } else { excerpt = content; } row.put('postExcerpt', excerpt); } return row; }