open-ils.storage.actor.org_unit.closed_date.overlap can hang and gives inconsistent results

Bug #1901191 reported by Jason Stephenson
20
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Evergreen
Confirmed
High
Unassigned

Bug Description

Evergreen Versions: 3.2.10+ (Tested on 3.2.10 and Master)
OpenSRF Version: N/A
PostgreSQL Version: Tested with PostgreSQL 9.6 and Postgresql 12 (likely N/A)
O/S Version(s): Tested on Ubuntu 16.04 and 18.04 (likely N/A)

open-ils.storage.actor.org_unit.closed_date.overlap can get stuck in a loop with certain data in the hours of operation and org unit closed dates tables. The client will timeout and return no data, but the storage drone keeps running in one of the loops inside the storage function, new_org_closed_overlap. The drone will continue running queries in the database for days on end until the drone is killed. While I have not observed this being a performance issue on our database server, it could be an issue for a site with a smaller database server. I have observed dozens of these queries running at the same time.

This function will also return inconsistent start and end times for certain overlaps using the same data. That is, two dates where I would expect to see the same start and end date, return different start and end dates. I do not think this is intentional.

I am attaching a SQL script that can be loaded on the Concerto dataset that will reliably reproduce these issues. After loading the data, you can expose the issue with srfsh.

To recreate the loop issue, run the following in srfsh:

request open-ils.storage open-ils.storage.actor.org_unit.closed_date.overlap 8, "2020-10-16T16:30:43-0400"

To recreate the inconsistent start and end dates, the following two commands will suffice:

request open-ils.storage open-ils.storage.actor.org_unit.closed_date.overlap 8, "2020-11-26T16:30:43-0500"

request open-ils.storage open-ils.storage.actor.org_unit.closed_date.overlap 8, "2020-11-27T16:30:43-0500"

You can obtain similar results with this data using other dates as well.

Revision history for this message
Jason Stephenson (jstephenson) wrote :
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Here is a SQL query that you can use to find the queries run by the storage drone that is looping. You will want to scan the output for look ups on the actor.hours_of_operation and actor.org_unit_closed tables. Note that you will have to run the SQL several times to find any/all processes. These queries are short-lived and there is a decent chance that you will miss some if you only run the query once or twice. If you run this on a database with a lot going on, i.e. a production database, then I would suggest running it several times before executing any of the following steps to make sure that you are getting processes stuck in a loop.

You can find the process ID of the storage drone(s) that is/are looping over the queries by copying the client port column value, logging into the machine with the client IP address and running the following command:

lsof -iTCP | grep {client port}

Where {client port} is the port number reported by the query.

The second column of the output from lsof is the drone's process ID (PID). You can terminate the drone with the following command:

kill {PID}

Where {PID} is, of course, the numeric process ID reported by lsof.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I have started a branch for this bug: https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dyrcona/lp1901191-closed-dates-overlap-loop

So far, it adds test data to reproduce the bug in the Concerto dataset.

Lynn Floyd (lfloyd)
tags: added: closeddates sampledata
Changed in evergreen:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Galen Charlton (gmc) wrote :

Noting for anybody looking into this that despite the name, the routine in /OpenILS/Application/Storage/Publisher/actor.pm that's getting called is org_closed_overlap(), _not_ new_org_closed_overlap(). The latter is very old experimental code from 2009 that if anything, does worse with Jason's test case.

Revision history for this message
Terran McCanna (tmccanna) wrote :

Is this the same underlying bug? https://bugs.launchpad.net/evergreen/+bug/1869728

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.