====== Install and Configure Sphinx ======

These instructions apply to a SUSE Linux installation of CATS where the CATS package has been installed in the directory /srv/www/htdocs/cats. If your installation differs, you will need to adjust the paths referenced in this documentation accordingly.

===== Basic Requirements =====

  - Installed and operational CATS system
  - Functional cron
  - LSB-compatible init.d for run control

===== What it Does =====

The Sphinx package consists of two primary parts: the indexer, which creates the search indexes and is run periodically to rebuild them, and the searchd daemon, which handles queries from the sphinxapi.php library.

The CATS integration design calls for a primary index, cats, that is rebuilt once per day via cron.daily. This daily rebuild picks up all candidates/resumes in the database and completely reindexes the text resume, key skills, and the candidate's first and last names. It also resets the sph_counter to a high water mark.

A second delta index, catsdelta, handles additions to the database during the business day. A cron script rebuilds only this delta index, based on the high water mark set at the prior run of the primary index. It is envisioned that this script would run every 20 or 30 minutes during the business day to keep the delta index up to date with recent additions to the database.

===== Installation Instructions =====

==== Download Sphinx ====

  * [[http://sphinxsearch.com/downloads.html|Download the Sphinx tarball]]
  * Configure, make, and make install the tarball according to the [[http://sphinxsearch.com/doc.html#installing|installation documentation]] at the [[http://sphinxsearch.com/|sphinxsearch.com]] site.
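
The build step follows the usual configure/make pattern. The following is only a minimal sketch: the tarball file name and the assumption that it unpacks into a directory of the same name are guesses based on the 0.9.7-rc2 version used here, and ''build_sphinx'' is a hypothetical helper, not part of Sphinx. The installation documentation remains the authority on configure options.

```shell
#!/bin/sh
# Sketch of the configure / make / make install sequence.
# TARBALL is a hypothetical file name; use the name of the file you downloaded.
TARBALL=${TARBALL:-sphinx-0.9.7-rc2.tar.gz}

build_sphinx() {
    tarball=$1
    if [ ! -f "$tarball" ]; then
        echo "tarball not found: $tarball" >&2
        return 1
    fi
    # Assumption: the tarball unpacks into a directory named after itself.
    srcdir=$(basename "$tarball" .tar.gz)
    # Run in a subshell so the caller's working directory is left unchanged.
    (
        tar xzf "$tarball" &&
        cd "$srcdir" &&
        ./configure &&
        make &&
        make install
    )
}
```

Run ''build_sphinx "$TARBALL"'' as root from the directory holding the download; the function stops at the first step that fails.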
On the SUSE box, the install step placed Sphinx in the following locations:

  /usr/local/bin/indexer
  /usr/local/bin/searchd
  /usr/local/man/man8/searchd.8.gz
  /usr/local/etc/sphinx.conf.dist
  /usr/share/doc/packages/sphinx-0.9.7-rc2
  /usr/share/doc/packages/sphinx-0.9.7-rc2/COPYING
  /usr/share/doc/packages/sphinx-0.9.7-rc2/doc
  /usr/share/doc/packages/sphinx-0.9.7-rc2/doc/mk.cmd
  /usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.css
  /usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.html
  /usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.txt
  /usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.xml
  /usr/share/doc/packages/sphinx-0.9.7-rc2/INSTALL

==== Copy sphinxapi.php ====

  * Copy the api/sphinxapi.php file to /srv/www/htdocs/cats/lib/sphinxapi.php

==== Create Indexer Cron Script ====

  * Create the following cron script in /etc/cron.daily/indexer to run the indexer on a daily basis.

  #!/bin/sh
  /usr/local/bin/indexer --all --rotate --config /srv/www/htdocs/cats/modules/search/sphinx.conf

  * chown root:root /etc/cron.daily/indexer
  * chmod 700 /etc/cron.daily/indexer

==== Create Searchd init.d Script ====

  * Create an /etc/init.d/searchd script for the searchd daemon. The example below works well for a SUSE installation; you may need to alter it for non-SUSE distributions.

  #! /bin/sh
  # Copyright (c) 1995-2004 SUSE Linux AG, Nuernberg, Germany.
  # All rights reserved.
  #
  # Author: Kurt Garloff
  # Please send feedback to http://www.suse.de/feedback/
  #
  # /etc/init.d/searchd
  #   and its symbolic link
  # /(usr/)sbin/rcsearchd
  #
  # This program is free software; you can redistribute it and/or modify
  # it under the terms of the GNU General Public License as published by
  # the Free Software Foundation; either version 2 of the License, or
  # (at your option) any later version.
  #
  # This program is distributed in the hope that it will be useful,
  # but WITHOUT ANY WARRANTY; without even the implied warranty of
  # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
  # See the GNU General Public License for more details.
  #
  # You should have received a copy of the GNU General Public License
  # along with this program; if not, write to the Free Software
  # Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
  #
  #
  ### BEGIN INIT INFO
  # Provides:          searchd for sphinx
  # Required-Start:    $syslog $remote_fs mysql
  # Should-Start:      $time ypbind sendmail
  # Required-Stop:     $syslog $remote_fs
  # Should-Stop:       $time ypbind sendmail
  # Default-Start:     3 5
  # Default-Stop:      0 1 2 6
  # Short-Description: searchd daemon for sphinx search
  # Description:       Starts the Sphinx searchd daemon
  ### END INIT INFO
  #
  # Check for missing binaries (stale symlinks should not happen)
  # Note: Special treatment of stop for LSB conformance
  LOGFILE=/var/log/searchd.log
  SEARCHD=/usr/local/bin/searchd
  test -x $SEARCHD || { echo "$SEARCHD not installed";
          if [ "$1" = "stop" ]; then exit 0;
          else exit 5; fi; }
  # Source LSB init functions
  . /etc/rc.status
  # Reset status of this service
  rc_reset
  case "$1" in
      start)
          echo -n "Starting $SEARCHD "
          ## Start daemon with startproc(8). If this fails
          ## the return value is set appropriately by startproc.
          startproc -l $LOGFILE $SEARCHD --config /srv/www/htdocs/cats/modules/search/sphinx.conf
          # Remember status and be verbose
          rc_status -v
          ;;
      stop)
          echo -n "Shutting down $SEARCHD "
          ## Stop daemon with killproc(8) and if this fails
          ## killproc sets the return value according to LSB.
          killproc -TERM $SEARCHD
          # Remember status and be verbose
          rc_status -v
          ;;
      try-restart|condrestart)
          ## Do a restart only if the service was active before.
          ## Note: try-restart is now part of LSB (as of 1.9).
          ## RH has a similar command named condrestart.
          if test "$1" = "condrestart"; then
              echo "${attn} Use try-restart ${done}(LSB)${attn} rather than condrestart ${warn}(RH)${norm}"
          fi
          $0 status
          if test $? = 0; then
              $0 restart
          else
              rc_reset        # Not running is not a failure.
          fi
          # Remember status and be quiet
          rc_status
          ;;
      restart)
          ## Stop the service and regardless of whether it was
          ## running or not, start it again.
          $0 stop
          $0 start
          # Remember status and be quiet
          rc_status
          ;;
      force-reload)
          ## Signal the daemon to reload its config. Most daemons
          ## do this on signal 1 (SIGHUP).
          ## If it does not support it, restart.
          echo -n "Reload service $SEARCHD "
          ## if it supports it:
          killproc -HUP $SEARCHD
          rc_status -v
          ## Otherwise:
          #$0 try-restart
          #rc_status
          ;;
      reload)
          ## Like force-reload, but if daemon does not support
          ## signaling, do nothing (!)
          # If it supports signaling:
          echo -n "Reload service $SEARCHD "
          killproc -HUP $SEARCHD
          rc_status -v
          ## Otherwise if it does not support reload:
          #rc_failed 3
          #rc_status -v
          ;;
      status)
          echo -n "Checking for service $SEARCHD "
          ## Check status with checkproc(8), if process is running
          ## checkproc will return with exit status 0.
          # Return value is slightly different for the status command:
          # 0 - service up and running
          # 1 - service dead, but /var/run/ pid file exists
          # 2 - service dead, but /var/lock/ lock file exists
          # 3 - service not running (unused)
          # 4 - service status unknown :-(
          # 5--199 reserved (5--99 LSB, 100--149 distro, 150--199 appl.)
          # NOTE: checkproc returns LSB compliant status values.
          checkproc $SEARCHD
          # NOTE: rc_status knows that we called this init script with
          # "status" option and adapts its messages accordingly.
          rc_status -v
          ;;
      *)
          echo "Usage: $0 {start|stop|status|try-restart|restart|force-reload|reload|probe}"
          exit 1
          ;;
  esac
  rc_exit

  * Save the searchd init.d script to /etc/init.d/searchd, make it executable, then install it as a service using insserv searchd.
  * Create a run control softlink for the init.d script:

  ln -s /etc/init.d/searchd /usr/sbin/rcsearchd

==== Create sphinx.conf ====

  * Create the sphinx.conf configuration file and save it to /srv/www/htdocs/cats/modules/search. You will need to create the 'search' directory since it doesn't exist yet.
  * Be sure to specify your correct MySQL user name and password on the sql_user and sql_pass configuration lines where indicated. Also be sure to create the index file directory you specify under the index path if it doesn't exist (/srv/www/htdocs/cats/modules/search/index).

  #
  # sphinx configuration file for CATS
  #

  #############################################################################
  ## data source definition
  #############################################################################

  source catsdb
  {
      type = mysql
      strip_html = 0
      index_html_attrs =

      # some straightforward parameters for 'mysql' source type
      # (set sql_user and sql_pass to your MySQL user name and password)
      sql_host = localhost
      sql_user =
      sql_pass =
      sql_db = cats
      sql_port = 3306 # optional, default is 3306

      sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(attachment_id) from attachment
      sql_query = \
          SELECT attachment_id, data_item_id, UNIX_TIMESTAMP(attachment.date_created) AS date_added, title, text, \
          last_name, first_name, notes, key_skills \
          FROM attachment left join candidate on data_item_id = candidate_id \
          where resume = 1 and attachment.site_id = 1 and data_item_type = 100 \
          and attachment_id <= (SELECT max_doc_id from sph_counter where counter_id = 1)

      sql_group_column = data_item_id
      sql_date_column = date_added
      sql_query_post =
      sql_query_info = SELECT * FROM attachment WHERE attachment_id=$id
  }

  source delta : catsdb
  {
      sql_query_pre =
      sql_query = \
          SELECT attachment_id, data_item_id, UNIX_TIMESTAMP(attachment.date_created) AS date_added, title, text, \
          last_name, first_name, notes, key_skills \
          FROM attachment left join candidate on data_item_id = candidate_id \
          where resume = 1 and attachment.site_id = 1 and data_item_type = 100 \
          and attachment_id > (SELECT max_doc_id from sph_counter where counter_id = 1)
  }

  #############################################################################
  ## index definition
  #############################################################################

  index cats
  {
      source = catsdb

      # this is path and index file name without extension
      #
      # indexer will append different
      # extensions to this path to
      # generate names for both permanent and temporary index files
      #
      # .tmp* files are temporary and can be safely removed
      # if indexer fails to remove them automatically
      #
      # .sp* files are fulltext index data files. specifically,
      # .spa contains attribute values attached to each document id
      # .spd contains doclists and hitlists
      # .sph contains index header (schema and other settings)
      # .spi contains wordlists
      #
      # MUST be defined
      path = /srv/www/htdocs/cats/modules/search/index/cats
      docinfo = extern
      morphology = none
      stopwords =
      min_word_len = 1
      charset_type = sbcs
  }

  index catsdelta : cats
  {
      source = delta
      path = /srv/www/htdocs/cats/modules/search/index/cats_delta
  }

  #############################################################################
  ## indexer settings
  #############################################################################

  indexer
  {
      mem_limit = 32M
  }

  #############################################################################
  ## searchd settings
  #############################################################################

  searchd
  {
      address = 127.0.0.1
      port = 3312
      log = /var/log/searchd.log
      query_log = /var/log/query.log
      read_timeout = 5
      max_children = 30
      pid_file = /var/run/searchd.pid
      # default is 1000 (just like with Google)
      max_matches = 1000
  }

  # --eof--

==== Add sph_counter to CATS database ====

  * Create the sph_counter table in the CATS database.

  # in MySQL
  use cats
  CREATE TABLE sph_counter
  (
      counter_id INTEGER PRIMARY KEY NOT NULL,
      max_doc_id INTEGER NOT NULL
  );

==== Try creating your index ====

  * Index your CATS database by running the indexer from the command line.

  helphand:~ # /usr/local/bin/indexer --all --config /srv/www/htdocs/cats/modules/search/sphinx.conf
  Sphinx 0.9.7-RC2
  Copyright (c) 2001-2006, Andrew Aksyonoff
  using config file '/srv/www/htdocs/cats/modules/search/sphinx.conf'...
  indexing index 'cats'...
  collected 4668 docs, 20.7 MB
  sorted 2.1 Mhits, 100.0% done
  total 4668 docs, 20663522 bytes
  total 3.324 sec, 6216481.50 bytes/sec, 1404.34 docs/sec
  helphand:~ #

==== Start the Searchd Daemon ====

  * Assuming the indexer ran without errors, your install is in good shape, so start the searchd daemon.

  helphand:~ # rcsearchd start

==== Test Search from Commandline ====

  * Now test the search from the command line by searching for a resume keyword.

  helphand:~ # search --config /srv/www/htdocs/cats/modules/search/sphinx.conf controller
  [lots of returned stuff snipped for brevity]
  date_created=2006-10-11 13:38:19
  date_modified=2006-10-11 13:38:19
  words:
  1. 'controller': 402 documents, 776 hits
  helphand:~ #

==== Setup Cron for Regular Updates ====

  * Assuming the search returned the expected results, you are almost finished with the Sphinx install. You only need to add a crontab entry that runs the indexer periodically throughout the business day, so that candidate resumes added to the database during the day are indexed. Create the following file in /etc/cron.d/cats

  # use /bin/sh to run commands, no matter what /etc/passwd says
  SHELL=/bin/sh
  # mail any output to `root', no matter whose crontab this is
  MAILTO=root
  PATH=/usr/local/bin
  #
  # Business Days, Business Hours
  20,50 7-17 * * Mon,Tue,Wed,Thu,Fri root $PATH/indexer --rotate --config /srv/www/htdocs/cats/modules/search/sphinx.conf catsdelta >>/dev/null

  * chmod 600 /etc/cron.d/cats
  * You're done!
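
As a final sanity check, it can help to confirm that every file and directory created in the steps above is actually in place. The sketch below is not part of Sphinx or CATS: ''check_install'' and ''cats_sphinx_paths'' are hypothetical helper names, and the path list assumes the default /srv/www/htdocs/cats layout used throughout this page (override ''CATS_ROOT'' if yours differs).

```shell
#!/bin/sh
# Hypothetical helpers for verifying the install; adjust paths as needed.
CATS_ROOT=${CATS_ROOT:-/srv/www/htdocs/cats}

# Print the paths this guide creates, one per line.
cats_sphinx_paths() {
    cat <<EOF
/usr/local/bin/indexer
/usr/local/bin/searchd
$CATS_ROOT/lib/sphinxapi.php
$CATS_ROOT/modules/search/sphinx.conf
$CATS_ROOT/modules/search/index
/etc/cron.daily/indexer
/etc/init.d/searchd
/etc/cron.d/cats
EOF
}

# Report each given path as ok or MISSING; return 1 if any is missing.
check_install() {
    missing=0
    for path in "$@"; do
        if [ -e "$path" ]; then
            echo "ok       $path"
        else
            echo "MISSING  $path"
            missing=1
        fi
    done
    return $missing
}
```

Call it as ''check_install $(cats_sphinx_paths)''; a non-zero exit status means at least one step above still needs attention. The check only tests for existence, so it does not replace the indexer and search tests.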