Thursday, December 5, 2013

Evergreen Update this Weekend

We have scheduled our next Evergreen update for the night of Sunday, December 8th, provided that no deal-breaking issues are discovered in the new release between now and then.

Among the features you’ll see following the update are:

  1. A mobile-friendly catalog
  2. The addition of an author, title, subject and series Browse search mode
  3. Larger fonts in the OPAC (by default) and labeled and repositioned search fields
  4. The addition of a “My List Preferences” tab in Account Preferences, where you can set the number of lists and the number of list items that display on each page. Individual lists will also display page numbers and can be navigated by page
  5. When a barcode is pasted into a search field, any extra white space will be automatically trimmed
  6. Additional memory leaks tied to receipt printing have been fixed, which should improve staff client performance
  7. In circulation, lost items will now be included as part of the lump-sum tally of items out and will display with other checked out items, rather than below
  8. The ability to print a receipt for a single selected item has been added to both the Items Out and the Lost, Claims Returned, Long Overdue and Has Unpaid Billings interfaces

There’s still time to look at the new version before the update. We currently have it set up on the training server where you can put it through its paces.

If you currently have the staff client for the training server installed on your computer, you will be prompted to update your client when you attempt to log in.

Parallel Metabib Reingest in Evergreen 2.4 and Later

The Evergreen 2.4 to 2.5 upgrade process is going to require a reingest of your bibliographic records so that new features, such as the browse search, will work properly. Traditional methods of reingesting records using a SQL script are slow, since the search indexing for each bibliographic record is updated in turn. They also require that you tinker with global flags in the database.
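For reference, the traditional approach is usually some variant of the sketch below. The internal flag name and table are from memory and may differ on your system, so treat them as assumptions rather than a recipe, and check with your database administrator first:

    -- Rough sketch of the traditional, one-record-at-a-time reingest.
    -- The flag and table names here are assumptions; verify them in your own database.
    UPDATE config.internal_flag SET enabled = TRUE
     WHERE name = 'ingest.reingest.force_on_same_marc';

    -- Touching each record fires its ingest triggers, reindexing records serially.
    UPDATE biblio.record_entry SET id = id WHERE NOT deleted;

    UPDATE config.internal_flag SET enabled = FALSE
     WHERE name = 'ingest.reingest.force_on_same_marc';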

To remedy some of these issues, 2.4 modified the database function metabib.reingest_metabib_field_entries to accept three boolean flags in addition to the id of the bibliographic record that needs reingesting. These flags indicate which of the metabib indexes you'd like to skip for the given bib record: facet, browse, and search, in that order. (Just to make it perfectly clear: setting a flag to TRUE causes that index reingest to be skipped and not run. This logic is the opposite of what you might typically expect, so it bears repeating.) The flags all default to FALSE, so if you want to reingest everything you can still call the function the old way. However, using these flags to reingest only what needs to be reingested can save you some time, and it also lets us write a program that does a complete bibliographic reingest in parallel. The latter was rather difficult to achieve prior to 2.4.
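As a quick sketch of how those flags look in practice (the record id 123 is just an arbitrary example):

    -- Reingest everything for bib 123: all three skip flags default to FALSE.
    SELECT metabib.reingest_metabib_field_entries(123);

    -- Rebuild only the search index for bib 123 by skipping facet and browse.
    -- Argument order is (record id, skip facet, skip browse, skip search).
    SELECT metabib.reingest_metabib_field_entries(123, TRUE, TRUE, FALSE);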

The main advantage of the 2.4 version of metabib.reingest_metabib_field_entries over a SQL script that simply updates your bibliographic records is the fine-grained control the skip flags give you over the ingest process. Updating a bibliographic record causes all of the reingest methods to run on that record. In normal operation, that is exactly what you want: if a MARC record is edited, you want the changes to show up in all of the indexes. During a planned reingest of all of your records, however, such as during an upgrade or after adding a custom metabib field, you may want more control over which indexes get updated. Updating only the facet, browse, or search index when necessary saves time when indexing all of your records at once. If you have added a new configuration for a facet, search, or browse metabib field, you can ask your database administrator to run a simple SQL script that reingests all of your bibs using metabib.reingest_metabib_field_entries with the appropriate flags. Even restricted to a single index, reingesting every record this way will still take several hours. If you want to update all of your indexes for all of your bibliographic records in one go, you will definitely want to do it in parallel, and the flags are what make that possible.
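For that single-index case, such a DBA-run script might look something like this sketch. It assumes you have added a new facet field and therefore only need the facet index rebuilt; enumerating non-deleted bibs from biblio.record_entry is the usual approach, but verify the details before running anything like this against production:

    -- Hedged sketch: rebuild only the facet index for every non-deleted bib record.
    -- Skip flags are (facet, browse, search), so browse and search are skipped here.
    SELECT metabib.reingest_metabib_field_entries(id, FALSE, TRUE, TRUE)
      FROM biblio.record_entry
     WHERE NOT deleted;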

Before you can run the reingest in parallel, you need to know a little about how the different ingest routines work. The facet, browse, and search ingests can all happen at the same time; that is, they can run in parallel with each other. The browse ingest, however, cannot run in parallel with other browse ingests: you run the risk of database conflicts if different processes do browse updates at the same time. This means you have to partition the work so that the facet and search ingests run in parallel while the browse ingest runs sequentially over each record. You can still run the browse ingest while the parallel facet and search ingests are running. If that sounds a bit complicated, never fear. I have written pingest.pl, a smallish Perl program that will do all of that for you.
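In SQL terms, what pingest.pl does boils down to roughly the following; the batch boundary is purely illustrative, and the script distributes many such batches across its worker processes:

    -- Each parallel worker handles one batch of record ids, running only the
    -- facet and search ingests (browse is skipped via the middle flag).
    SELECT metabib.reingest_metabib_field_entries(id, FALSE, TRUE, FALSE)
      FROM biblio.record_entry
     WHERE NOT deleted AND id BETWEEN 1 AND 10000;  -- illustrative batch

    -- A single sequential pass handles the browse ingest for all records,
    -- skipping facet and search.
    SELECT metabib.reingest_metabib_field_entries(id, TRUE, FALSE, TRUE)
      FROM biblio.record_entry
     WHERE NOT deleted;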

While you may have great success with just downloading the program, copying it to one of your Evergreen servers, and running it without knowing how it does what it does, you should probably understand a few things about how it works and the assumptions that it makes before you attempt to run it.

First, it assumes that you have set the PGHOST, PGPORT, PGDATABASE, PGUSER, and PGPASSWORD environment variables as described here. If you don't have those set, you will need either to set them or to modify the three lines that contain DBI->connect('DBI:Pg') so that the program can find your database.

Second, the program will run 8 parallel processes by default and will use batches of 10,000 records each when ingesting for the facet and search indexes. These values work in my environment; you may want different numbers depending on the capabilities of your database server and the number of records in your database. They are set by the constants MAXCHILD and BATCHSIZE defined near the top of the file. MAXCHILD controls how many processes are used for the parallel ingest, and BATCHSIZE controls how many records are processed by each of the parallel processes. The browse ingest, which runs sequentially over all records as a single batch, also counts against the limit set by MAXCHILD. Because the browse ingest operates more or less sequentially, it is the main limit on how long the total reingest takes: with any reasonable number of processes and batch size, the combined facet and search ingests will likely finish several hours before the browse ingest does. As a general rule of thumb, set MAXCHILD to one half the number of cores or threads (if hyper-threading is enabled) on your database server, and set BATCHSIZE to approximately one one-hundredth of the number of your bibliographic records. There is room to fudge here, and if you're doing this during an upgrade, you could just go ahead and use all of the cores on your database server. Experiment and find numbers that work for you; you may discover that larger batches work just fine in your situation.

Finally, the program itself spends most of its time waiting on the database, so it uses very few resources on the computer where it runs. If you run it from a server or workstation other than your database server, you generally should not have to worry about how many CPU cores that machine has. The database server's resources and utilization are your main concerns.

Here at MVLC, we use this script quite frequently when updating our development and training servers, as well as during upgrades when necessary. We hope you also find it useful. We know that there are ways it could be improved, such as turning the maximum child and batch size constants into command line parameters. If you make any modifications that would be useful to others, we would be happy to incorporate them.

Tuesday, May 14, 2013

Strike That

Well, I've just been informed that I'm a liar, or misspoke if you prefer.

I didn't know when I made the previous post that we are actually still using 1,000 as our MaxRequestsPerChild setting. I guess I misunderstood what I was being told last week, and I certainly did not bother to check our configuration before making the previous blog post.

I apologize for any confusion this may have caused. However, the general observation that you may have to experiment with your Apache settings before you find the right combination for your server and its usage patterns still stands. You can't always expect the defaults to work right out of the box.

More Apache Fun

Here's another update on what MVLC is doing with our Apache configuration for Evergreen.

Last week, with MaxMemFree set to 16 and MaxRequestsPerChild at 1,000, we found that we were getting more texts from our monitoring software about high load on our Evergreen server. We thought this might be due to more frequent turnover among the Apache child processes, so we adjusted MaxRequestsPerChild back up to 10,000. Overnight monitoring showed, however, that this made the situation worse, or at least put us back where we were before trying all of these changes.

In the end, we have set MaxRequestsPerChild to 5,000 while leaving MaxMemFree at 16. We've been running this configuration for several days, including over the weekend, and things seem to have really settled down on our server. You may have to experiment with the settings to find something that works for you if you think you are having this issue.

Tuesday, May 7, 2013

Update on the Apache Situation

Thomas just told me that he changed another Apache configuration variable that seems to have helped things. He set the MaxMemFree directive to 16 in our mpm_prefork configuration section. This setting also limits how much free memory an individual Apache process can hold on to before releasing it back to the operating system.

I thought we'd share this in case anyone else is bumping their heads against memory issues with Apache.

Saturday, May 4, 2013

What has been going on.

TL;DR: We've had trouble with the memory consumption of Apache processes on our Evergreen server since we did our latest update on April 14, 2013. Along the way to figuring this out we've had a few minor detours and fixed another bug. Our breakthrough came when one of us realized that the longer Apache processes run the more memory they were using. We have made changes to our Apache configuration as a mitigation strategy. Basically, we have lowered our MaxRequestsPerChild from 10,000 down to 1,000. This appears to have helped, but only time will tell.

Read on for the gory details....

Thursday, February 14, 2013

Another Backstage Authority Update

Looks like we haven't had much to say here at MVLC since October, but that's because we've been too busy doing lots of things to get around to making blog entries.

I just wanted to take a minute to let everyone know that the software for managing authority updates with Backstage Library Works got a little code update today. A command line option was added that allows you to just download the new files from Backstage's server. This is good if you want to download the files before the weekend and wait until next week to process them, or if you just don't feel like loading the authority and bibliographic updates right away.

As always, the code is here:

http://git.mvlcstaff.org/?p=jason/backstage.git;a=summary