This article is a sort of series I'm writing with my experiences of learning the EC2 cloud while learning Sphinx search. The more I learn about Sphinx and use it, the more impressed I am. And also the more possibilities that I see. I believe this is going to solve A LOT of problems for us.
Initially the reason for going to Sphinx was because MyISAM tables were not doing it for us anymore and we need to convert to InnoDB, which of course does not support fulltext searches. Being pretty tapped out on our single server as its 12GB of RAM and already planning for a future migration to the Amazon Cloud anyway... I decided to launch this on EC2 as my first step in that direction.
I have been pouring over the Sphinx documentation, articles, and forum posts over the last couple of days. And now using it. I am amazed. And while laying in bed after getting it running last night, I gained some additional inspiration of how to leverage it in some non-traditional, non-fulltext ways. But I'll leave that for another day...
So the first thing to do is copy the distribution sample configuration file.
cp /usr/local/etc/sphinx-min.conf /usr/local/etc/sphinx.conf
And edit it...
nano /usr/local/etc/sphinx.conf
Below is a basic sample subset of what I am using...
source articles
{
type = mysql
sql_host = YOUR_IP
sql_user = YOUR_MYSQL_USER
sql_pass = YOUR_MYSQL_PASSWORD
sql_db = YOUR_MYSQL_DATABASE
sql_port = 3306 # optional, default is 3306
# Here is the actual query from the database, the ID has to be first
# The $start and $end variables come from sql_query_range and sql_range_step below
sql_query = SELECT id, title, body, type_id, date_published FROM articles
WHERE id BETWEEN $start AND $end
# Attributes.... these will not be in the fulltext index, but can be filtered on
sql_attr_uint = type_id
sql_attr_timestamp = date_published
# This will allow us to only index 1000 rows at a time and
# and it will wait two seconds between each query
sql_query_range = SELECT MIN(id), MAX(id) FROM articles
sql_range_step = 1000
sql_ranged_throttle = 2000
}
index articles_index
{
source = articles
# storing the source on the EBS volume... IMPORTANT!
path = /ebs/data/articles
docinfo = extern
charset_type = sbcs
min_stemming_len = 4
min_word_length = 2
html_strip = 1
enable_star = 1
}
indexer
{
# Default is 32M, you may want to increase more depending on your dataset
mem_limit = 128M
}
searchd
{
# Note this is the new default port for Sphinx
listen = 9312
# Log and query log on EBS volume also
log = /ebs/log/searchd.log
query_log = /ebs/log/query.log
read_timeout = 5
max_children = 30
pid_file = /ebs/log/searchd.pid
max_matches = 1000
seamless_rotate = 1
preopen_indexes = 0
unlink_old = 1
}
Now it's time to actually index for the first time. This is the amazing part because it is so very very fast! And because of the steps of 1000 and throttling that we put in above, we can run it on the live database and it shouldn't have a huge impact on performance. To manually run it at the command line do this...
/usr/local/bin/indexer articles_index
Of course after you really get this all configured, you'll want to add this to a cron job. And there are all kinds of more complexities to think of like how often you want it to re-index and if you need to set up things like delta tables so that it only queries and re-indexes a smaller set of data.
One very important thing to keep in mind when setting these up and making your plans... whenever indexer runs it is RE-INDEXING the entire dataset. It is NOT just appending new data. You do have the ability to merge datasets and search multiple indexes at once. So the most common solution for large datasets that are frequently updated would be to have a main index with archived data and then another index that will run frequently maybe on just that day's data. And then once a day a cron job can merge the two indexes... and in the mean time your queries will just search both.
So now that we've done the first indexing, let's search for something at the command line...
/usr/local/bin/search obama
So if you run that query, you should see all of the articles that referenced obama. Notice that it will return the primary key... not the entire row data. So on your application side you will need to take those primary keys and look up the full data you need (a very fast query). Otherwise, you could store more meta data like the article title in another column as an attribute.
Finally, we need to implement this on our application. My language of choice is PHP. I had all sorts of problems with "autoconf" trying to compile the Sphinx code into PHP until I finally gave up on the chore after an hour and a half of getting no where but frustrated. I found it's not even necessary though for most people. You can just include the API class.
You can download sphinxapi.php here from Google Code.
Here is some very basic PHP code to get and process the results. It has absolutely no debugging in it, so I'll leave that to you to figure out the rest. It will search for whatever is in the ?q= argument to the URL. Be sure to do a print_r of $results to see what is inside of your returned array.
include(SPHINX_API_PATH);
$sc = new SphinxClient();
$sc->SetServer(SPHINX_IP, 9312);
$sc->SetArrayResult(true);
$results = $sc->Query($_GET['q'], 'articles_index');
if ($results) {
// Process results
if (is_array($results) && count($results) > 0) {
foreach ($results['matches'] as $r) {
$ids[] = $r['id'];
}
}
}
Well that's all for now. I will post more as I progress and optimize it more. Hopefully this helps someone, as I didn't find this information consolidated in one place as I was setting this up. Enjoy!
Previous Posts in Series:
Day 2: Installing Sphinx on EC2 and EBS
Day 1: Launching an AMI and Mounting an EBS Volume
Learn more about Sphinx from the source:
Sphinx Fulltext Search Homepage