Koha ILS

Losing Weight in Koha

The holidays are approaching, and it’s time to think about losing some weight. I’m talking about Koha, and specifically your Koha database. I’m going to talk briefly about why it’s important to keep your Koha database as lean and mean as possible.

Currently Koha uses the MySQL database server. In addition, most of the Koha tables are stored using the InnoDB storage engine. (Storage engines are simply how MySQL actually stores the data in the tables). InnoDB is particularly good for transaction processing, and it offers good automatic crash recovery. However, it also has a profound affect on memory usage, and therefore, performance.

InnoDB tables rely heavily on RAM. MySQL will try to cache as much of the table data and indexes in memory as it can, and operates a lot more efficiently when it can load the entire table into RAM. If InnoDB tables cannot be stored in RAM, MySQL has to do a lot more reading from and writing to disk, which is really slow. There’s a MySQL startup setting called ‘innodb_buffer_pool_size’, which tells MySQL how much RAM it can use to cache data. On a dedicated MySQL server, you can set this as high as 80% of available memory, but of course, you have to set it much lower on a shared server (Koha and MySQL on the same machine). Generally we try to set to 60% on a shared system.

So how does all this relate to Koha database size? Well, databases grow over time, especially library databases. Not only are sites constantly adding new bib and item records, but the indexes grow, the statistics and logs grow, etc., etc. A database that may fit “comfortably” in memory on new server may start to find itself feeling a bit . . . squeezed over time, like trying to wear those old size 30 pants.

An overgrown database can lead to dual problems that can affect performance. One, the database can grow larger than the innodb_buffer_pool_size setting. When that happens it has to do more swapping of data to disk, which can slow things down. Secondly, if you keep increasing the buffer pool, you run the risk of taking too much memory away from other applications, e.g. Apache and perl, and now the system is less able to handle spikes in traffic.

So clearly there is good motivation for pruning down your overall database size. Now, I’m not recommending you go out and engage in a massive weeding project. Certain tables *ought* to grow, like biblio and biblioitems; that means your collection is growing, and that’s a good thing. The statistics table is also going to grow, and that’s good too; it means your system is being used. But there are several tables in Koha that need a good policy of regular pruning. The primary candidates for this are:

  • action_logs
  • message_queue
  • import_records

Let’s take a quick look at each of these, and how to clean them.

Action_logs: This is the table that stores certain transactions on the system. If any of the “*Log” system preferences are enabled (e.g. AuthoritiesLog, BorrowersLog, IssueLog, etc.), then that data gets written to this table. This can make the table get very large over time. Unless you use this data for reports, you may not need to keep it around for very long. Fortunately there’s an easy way to clean it out. Koha comes with a cronjob called “cleanup_database.pl”. This script does exactly what the name implies, and can take several optional parameters. One of them is the “–logs DAYS” parameter, which will delete all records from action_logs older than DAYS. This setting defaults to 180 days, if no DAYS is given.

Message_queue: This is the table where notices to the borrower are stored. For example, all overdue notices, advance notices, holds, etc., are stored here. When you look at a patron record, and look at the history of notices sent, you’re looking at the message_queue. This table too can get very large over time. The cleanup_database cronjob I just mentioned will also clean out the message_queue by using the “–mail DAYS” parameter. If you don’t specify the number of days, it will default to 30 days, but you may want to keep notices around a little longer than that.

Import_records: This is the big one, and is one of the worst contributors to “database bloat”. This holds the full record information of items that have been staged in MARC record batches. If you do a lot of this, you can see how this table can get really large. In fact, on a number of our sites, we have noticed import_records tables that are *larger* than the biblioitems tables. (That means these sites have more data staged than is actually in the collection!)

Cleanup_database can also purge the import_records, by using the “–import DAYS” parameter, but this is kind of a brute force approach. We recommend that you clean out your staged batches manually, as soon as you are done with them, keeping the import file on your local PCs as a backup. You clean out staged batches using the “Staged MARC Record Managment” tool (More > Tools > Catalog > Staged MARC Record Management). And of course using cleanup_database is always an option, if you’d rather go that route.

There’s also a hidden side to import_records. When you search another catalog using Z39.50 from the staff client, every record that gets returned is also stored in import_records. These are the “hidden” records in import_records. They don’t show up in the Staged MARC Record Management tool, and there’s no good way to delete them from inside the staff client. Cleanup_database comes to the rescue again. A recent update to Koha added the “–z3950” option to cleanup_database, which will purge all records coming from Z39.50 sources. If you’re not doing so, it’s a very good idea to use this new option on your system, because those Z39.50 records can really fill up the database.

So there you have it. Those a just a few of the tables that can contribute to database bloat, and cleanup_database is your friend for helping keep them in check. Keeping these tables pruned can not only help with overall system performance, but may also help decrease the length of time for backups and upgrades. For example, the recent upgrade to 3.16.x included the addition of a few indexes to action_logs. On one test site we upgraded, the action_logs table was over 1G in size. The upgrade took over seven hours (nearly all of which was updating action_logs), and the size of the table increased from 1G to nearly 3G!

So let’s keep it lean.

Read more by Larry Baerveldt

Tags systems tutorial