If you are using Drupal 6 or an earlier version of this popular open source content management system, commonly installed on Linux servers running Apache and MySQL, to manage your content, you may have a problem. It is the same even if you managed to install Drupal in a much less common environment such as Windows, or are running it on a PostgreSQL database. The problem I am talking about lurks inside Drupal's core search module. Supposedly, it was fixed starting with Drupal 7, but I didn't check this personally.

Drupal's optional core search module is the one that makes internal site searches possible by parsing the site's nodes by the keywords that they contain. The module also scores the keywords it collects by the criteria set up in the module's admin interface, indexes them and stores results in the four MySQL database search tables: search_dataset, search_index, search_node_links and search_total. Even though this search module is optional, it is enabled by default at installation and most Drupal sites keep it that way.

That’s all good, except that this search module indexes both published and unpublished nodes. Now, what’s the problem with that? Well, we live in the world full of spammers and the way many good Drupal site administrators deal with all the unwanted “content” that spammers generate is not by deleting spam entirely, but by unpublishing it. In many other setups, all content must be approved by admins explicitly and that which is not, remains unpublished.

Now imagine a very likely scenario of a site with a thousand published real content articles and a hundred times as many unpublished spam articles. Also presume that each unit of spam is two to three times the length of real published content – somehow spammers always have that much more to say! All that unpublished spam grows the four Drupal MySQL search tables and it grows them a lot.

In our case, the largest of these MySQL database search tables, search_index alone grew to over 20 million keyword records and took up almost a gigabyte of database storage space. For a site with relatively little real content, this was an unjustifiably high load, which caused the MySQL database server to timeout on a nearly daily basis often taking the co-resident application server down with it. Something had to be done.

To solve the problem, I slightly hacked the node_update_index() function to restrict search indexing to only the published nodes. I then truncated all the four MySQL search tables and rebuilt the indexes. When all was done, the MySQL search_index table was only 35 thousand keywords long and took just over two megabytes of database space.

If you are in a similar situation, you may appreciate following my three easy steps to save your server from crashing all the time:


  1. Find the node_update_index() function inside the node.module file located in the modules/node directory of your Drupal installation. Change the WHERE clause of the SELECT statement used in this function from WHERE d.sid IS NULL OR d.reindex <> 0 to WHERE (d.sid IS NULL OR d.reindex <> 0) AND n.status=1. Remember to again apply this patch every time you upgrade Drupal 6 core.

  2. Empty the four Drupal MySQL search tables by issuing TRUNCATE TABLE search_dataset; TRUNCATE TABLE search_index; TRUNCATE TABLE search_node_links; TRUNCATE TABLE search_total; commands from MySQL command line or by using phpMyAdmin to do it.

  3. To rebuild the search indexes, go to your Drupal website’s /admin/settings/search url and click on the “Re-index site” button. Drupal will take it from here, or you can expedite the rebuilding of your site’s search indexes by running cron manually a few times until you see the “100% of the site has been indexed. There are 0 items left to index.” message above the “Re-index site” button. To run cron manually, go to your Drupal website’s /admin/reports/status/run-cron url.

Good luck!

MySQL logo

Why would you ever want to leave spam unpublished? With appropriately set user permissions around node creation, plus the proper configuration of one or more modules (e.g. spambot, honeypot, spamicide, mollom) spam should be prevented from entering the site at all. It isn't just fat search tables you should be worried about.

Drupal's core search module is no performance poster-child even when running smoothly. In my experience though, sites that allow unchecked content creation (including comments) are more likely to break from the strain of being hammered by automated spam submissions than from legitimate users attempting to use search on a bloated index.

Maintaining hacked versions of core or contrib modules is rarely a winning strategy; you might not mind the burden of remembering about it and manually patching Drupal each time you update, but one day someone else might be maintaining the site and they'll have no idea about your modifications.

You are absolutely right, there are other ways to deal with Spam, especially if you are doing a brand new Drupal install. However, this is great a fix for an older Drupal installation that was allowed to get out of hand in the first place. It is not necessarily a best practice for a new Drupal installation, but why would you do a brand new install of such an old version of Drupal now anyway?!

Post new comment

The content of this field is kept private and will not be shown publicly.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <img> <p> <br> <tr> <td> <table>
  • Lines and paragraphs break automatically.

More information about formatting options

CAPTCHA
This question is for testing whether you are a human visitor and to prevent automated spam submissions.
Image CAPTCHA
Enter the characters shown in the image.

User login

Poll

Are you currently bullish or bearish on the economy and the stock market?:

Email Center

Username:

@StockValues.Org

Password:


New Users
Lost Password
I Need Help





Who's online

There are currently 0 users and 6 guests online.