Evergreen

Internal Server Errors with "large" bookbags in TPAC

Bug #1060953 reported by Jason Stephenson on 2012-10-03

This bug affects 2 people

	Status	Importance	Assigned to	Milestone
Evergreen	Fix Released	High	Unassigned	Evergreen 2.4.0-alpha1
2.2	Fix Released	High	Unassigned	Evergreen 2.2.3
2.3	Fix Released	High	Unassigned	Evergreen 2.3.1

Bug Description

We have been getting reports of frequent internal server errors with TPAC both with and without the staff client.

Here's an example of what I got this morning:

[Wed Oct 03 09:23:57 2012] [error] [client 134.241.121.11] egweb: Context Loader
error: Exception: OpenSRF::DomainObject::oilsMethodException 2012-10-03T09:23:57 OpenILS::WWW::EGWeb /usr/local/share/perl/5.14.2/OpenILS/WWW/EGWeb.pm:102 <500> No active transaction to roll back\n

And we found this in the logs

[Tue Oct 02 10:19:01 2012] [error] [client 71.174.57.242] egweb: Context Loader error: uri_escape: Unmatched [ in regex; marked by <-- HERE in m/([ <-- HERE ])/ at (eval 1710470) line 1.\n at /usr/local/share/perl/5.10.1/OpenILS/WWW/EGCatLoader/Account.pm line 1813

for last night.

So, looks like different things causing it in different contexts.

Tags:

Revision history for this message

Lebbeous Fogle-Weekley (lebbeous) wrote on 2012-10-03:

This may or may not be pertinent, but it looks suspicious at least that the code referred to in the two error messages is installed at two different paths.

If that should prove not to be pertinent, could uri_escape() dying leave a cstore session in such a state as to cause the EGWeb error later (despite the fact that in this example, the EGWeb error comes first). Would a bigger sample of log messages actually show the EGWeb errors following the uri_escape() errors later in general?

And if uri_escape() chokes for some input, we should determine what and protect against that.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-03:

The different paths are from different servers. The first is my development machine. The second is from production. I apologize for not making that clear in my report.

Revision history for this message

Lebbeous Fogle-Weekley (lebbeous) wrote on 2012-10-03:

New idea about the error from uri_escape(), at least:

uri_escape() takes an optional second parameter, which is treated as a range and wrapped in brackets to build a regex. That parameter is not meant to hold tainted user input, but CGI::param() returns a list of scalars if its parameter names a field that appears more than once in the request's query string. So basically when we access the TPAC with certain URIs with repeated fields in certain cases, I suspect we're triggering this (unsure whether this relates to the other EGWeb/cstore error or not, and as yet unable to reproduce).

Jason, with regards to the uri_escape() error, would you be able to find a URI in your access logs that corresponds to this error, and confirm whether accessing the TPAC by that specific URI causes a reproducible error?

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-03:

[Wed Oct 03 10:46:03 2012] [error] [client 134.241.121.11] egweb: Context Loader error: uri_escape: Unmatched [ in regex; marked by <-- HERE in m/([ <-- HERE ])/ at (eval 14634) line 1.\n at /usr/local/share/perl/5.14.2/OpenILS/WWW/EGCatLoader/Account.pm line 1816\n

A new one that corresponds to:

134.241.121.11 - - [03/Oct/2012:10:46:03 -0400] "GET /eg/opac/myopac/list/update?bool=and;bool=and;bool=and;qtype=author;qtype=title;qtype=author;contains=contains;contains=contains;contains=contains;query=ted%20hartley;query=;query=;locg=1;pubdate=is;_adv=1;anchor=%231256818%3Fbool%3Dand%3Bbool%3Dand%3Bbool%3Dand%3Bqtype%3Dauthor%3Bqtype%3Dtitle%3Bqtype%3Dauthor%3Bcontains%3Dcontains%3Bcontains%3Dcontains%3Bcontains%3Dcontains%3Bquery%3Dted%2520hartley%3Bquery%3D%3Bquery%3D%3Blocg%3D1%3Bpubdate%3Dis%3B_adv%3D1;record=1256818;action=add_rec;list=71942 HTTP/1.1" 500 1009 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20120713

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-06:

When these

occur while loading a patron's bookbag/my list, they are reproducible. The error always occurs for the same list of the same patron. I do not know exactly what causes it, but commit 4cbc94386a2bb62cc3b8c4edb43bc29876909d82 of http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dyrcona/lp1060953 seems to fix it for these lists.

Revision history for this message

Bill Erickson (berick) wrote on 2012-10-08:

Jason, the transactions are necessary for ensuring the TPAC reads the latest data from the master DB when replication is in use. Otherwise, a bookbag page may load with old data from a replicated DB. (Similar to .authoritative calls). Removing the transactions will lead to other problems. I'm guessing the cstore transaction is dying prior to the final error and hopefully there's something useful in the logs to indicate why...

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-10:

While opening one of these lists that always causes an Internal Server Error and checking the logs, I see this:

open-ils.cstore 2012-10-10 10:54:01 [INFO:2917:osrf_application.c:1040:] CALL: open-ils.cstore open-ils.cstore.json_query {"from":["unapi.bre",410637,"marcxml","record","{mra}","MVLC",0,"acn=>5,acp=>5",null,null,null]}
open-ils.cstore 2012-10-10 10:54:01 [INFO:2917:osrf_app_session.c:1017:] [open-ils.cstore] sent 4809 bytes of data to <email address hidden>/client_at_jasondev.mvlcstaff.org_2910
open-ils.cstore 2012-10-10 10:54:01 [INFO:2917:osrf_stack.c:163:] Message processing duration 0.004189
[2012-10-10 10:54:01] /usr/sbin/apache2 [INFO:2910:CStoreEditor.pm:113:] editor[1|65845] rolling back db session
[2012-10-10 10:54:01] /usr/sbin/apache2 [INFO:2910:CStoreEditor.pm:113:] editor[1|65845] request en-US open-ils.cstore.transaction.rollback []
open-ils.cstore 2012-10-10 10:54:01 [INFO:2775:osrf_app_session.c:1017:] [open-ils.cstore] sent 196 bytes of data to <email address hidden>/client_at_jasondev.mvlcstaff.org_2910
open-ils.cstore 2012-10-10 10:54:01 [WARN:2775:osrf_application.c:839:] Returning method exception with message: An unknown server error occurred

It looks like we're in get_records_and_facets, retrieving data when a CStoreEditor decides to roll back a transaction for no apparent reason. Perhaps it times out?

I have no idea at this point what causes this. The patron does have a deleted bib in their book bag, but removing that makes no difference.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-10:

Commenting out the explicit $e->rollback at the bottom of Account->load_myopac_bookbags seems to also improve things. However, I am again not sure if that is a proper fix or not.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-10:

Removing the rollback doesn't seem to cause problems for patrons whose lists work without the code changes.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-10:

#10

Still trying to figure this out.

It looks like open-ils.cstore rolls back a transaction blindly when something happens in get_records_and_facets. What that something is, I have no idea because that information isn't logged.

Since open-ils.cstore is rolling back the CStoreEditor's transaction, the CStoreEditor editor isn't notified. One way to check this is to change the $e->rollback line in load_myopac_bookbags to

$e->rollback if ($e->{xact});

CStoreEditor still thinks it has a transaction and tries to rollback thus causing the double rollback situation that leads to an internal server error.

I don't know why open-ils.cstore.json_query.atomic is trying a rollback. A quick look over the C code didn't give me any pointer. Maybe a fresh look in the morning will help.

Revision history for this message

Bill Erickson (berick) wrote on 2012-10-11:

#11

Such a rollback suggests the query is failing or timing out. Do you have access to the postgres logs? Perhaps there's an error (or an open-ils.cstore error in osrfsys log).

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-11:

#12

Bill,

What I see in osrfsys.log is posted above. It is retrieving facets, then cstore does a roll back, then cstoreeditor tries a rollback and the transaction is actually gone because cstore already rolled back.

I'll check the postgres logs for errors.

I suspect this may be a case of bug 856811. I'll check the titles, etc., of the bibs in this patron's bag for #.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-11:

#13

No postgres errors and no # apparent in any of the facets.

The patron has 101 items in their cbrebi. I'll explore the code to see if that might have something to do with it.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-11:

#14

At this point, I'm stuck. I got nothing. Increasing the logging didn't make the situation any more clear. I still don't know why the transaction is rolled back for some patron's list and not others. The number of items on a list does seem to be a factor.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-12:

#15

Relevant logs Edit (3.4 MiB, application/octet-stream)

After banging my head against this for 3 days, I'm giving up.

What I see in the logs (attached) looks like everything is successful, up until the final call to $e->rollback in OpenILS::WWW::EGCatLoader::Account->load_myopac_bookbags().

There are no errors in PostgreSQL logs nor the stderr logs to indicate a problem.

This only happens with "large" bookbags. On my development server, bookbags with 99 or more entries cause it to happen every time. A bookbag with 22 entries works, as do those with less. I have not attempted enough experiments to determine if there is any particular threshold, but I imagine that would vary by server and load.

All I can surmise is that something in the normal shutdown of the search and cstore processes used to gather record and facet information is causing a rollback or a commit on the open transaction. However, if that were the case, then I'd expect this error with every bookbag, and not just large ones. At any rate, this rollback or commit is apparently not logged any where.

So I am throwing my latest logs showing successes and failures up here for anyone to look at who may have a better idea of what is going on.

Jason

Ben Shum (bshum) on 2012-10-12

Changed in evergreen:
status:	New → Confirmed
importance:	Undecided → Medium

Revision history for this message

Ben Shum (bshum) wrote on 2012-10-12:

#16

Confirmed for us. Nobody had reported issues with large bookbags, but I tested it on my own initiative and it is in fact broken as described.

Upping to HIGH importance given that it could be a big problem.

Changed in evergreen:
importance:	Medium → High

Jason Stephenson (jstephenson) on 2012-10-12

summary:

- Random Internal Server Errors with TPAC
+ Internal Server Errors with "large" bookbags in TPAC

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-12:

#17

I have changed the bug summary to reflect the more serious problem of inaccessible bookbags.

The uri_escape issues should be moved to a different bug if they pop up again. I caused one of these myself on my development server with some bad code changes. However, we have seen it rarely in the logs on production where those changes never made it.

Revision history for this message

Bill Erickson (berick) wrote on 2012-10-18:

#18

I have reproduced. Investigating.

Revision history for this message

Bill Erickson (berick) wrote on 2012-10-18:

#19

I've pushed a patch to working/user/berick/lp1060953-tpac-large-bbag-timeout

It's strikingly similar to Jason's original patch, the main difference being that we still start out with a transaction, but we close it out just prior to fetching the bookbag records, which takes place outside of the transaction and takes a long time (resulting in under-the-covers xact timeout).

FWIW, researching this, it also occurred to me why we're not seeing long-running calls in the logs: we're fetching bookbag entries with a series of small, quick requests (via multisession) instead of one large request. They just happen to add up to taking a little too long in total.

tags:

added: opac pullrequest

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-18:

#20

Thanks, Bill! This works for me on all of the problem barcodes that I had from before, and continues on the non problem accounts.

Signed off and pushed to master, rel_2_3 and rel_2_2!

Changed in evergreen:
status:	Confirmed → Fix Committed

Revision history for this message

Dan Scott (denials) wrote on 2012-11-02:

#22

Robert: many people won't see your comment as this bug is marked "Fix committed" - or they'll see it in their mail, but not be able to find it in the bug reports because we often filter out "Fix committed" bugs from searches.

Rather than posting a question on a bug in that state (asking the dev team to put their limited resources into possibly going on a wild goose chase for you ), could you possibly try out a new build of Evergreen master with your deleted bib scenario and post a new bug if it is in fact a reproducible problem? That would be really helpful, and would help us track open problems with much more ease.

Revision history for this message

Ben Shum (bshum) wrote on 2012-11-02:

#23

Re: Robert

I've got plenty of bookbags in my sample patrons that include deleted bibs, but the bookbags as displayed in TPAC doesn't show me the deleted bibs in the list. I actually reported that problem in bug 976054 several months back when we first went live with TPAC.

My assumption is that the internal server errors are completely unrelated to this particular bug situation and the deleted bibs are incidental.

I would echo Dan's suggestion that commenting on bugs that have already been committed/released tend to cause confusion and people should start new bugs once confirming continuing bad behavior on more current Evergreen releases or master itself.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-11-03:

#24

Robert & Dan,

In my experience, it has nothing to do with deleted bib records. I deliberately checked patrons with deleted bibs in their bookbags, and removed them from the bookbag. The problem was simply a function of the number of bibs in a patron's bookbag. The behavior did not change with or without deleted bibs.

That said, if anyone sees an issue with bookbags and deleted bibs, I'd welcome a new bug report.

Jason Stephenson (jstephenson) on 2012-11-14