Bug #561894 “Make sure it's possible for Ops to restart fastcgi ...” : Bugs : Open Library

Revision history for this message

George (george-archive) wrote on 2010-04-12:

#1

current-nagios-list.png (223.8 KiB, image/png)

Changed in openlibrary:
assignee:	nobody → Anand Chitipothu (anandology)
milestone:	none → upstream-to-www
importance:	Undecided → Critical

George (george-archive) wrote on 2010-04-22:

#2

Edward:
- document work search SOLR for Ops team
- Add monitoring for both instances of Upstream SOLRs
- look to move SOLR update process off Edward's dev box onto the SOLR production box (*07)

George (george-archive) wrote on 2010-04-23:

#3

NAGIOS
0 = all good; 1 = warning; 2 = critical error
- commonly just look for strings within webpages, and expect them within a certain timeframe

Current NAGIOS setup?
- http://nagios2.us.archive.org/control/nagios-status.php?hostgroup=24.openlibrary&style=detail&hoststatustypes=15
- Ralf: "It works, but I don't know what to do in the middle of the night. I just restarted everything and it seemed to fix it."
- There are some memory leaks in the program. Restarting the fastcgis should fix those.
- current timeframes (3 seconds: warning; 5 seconds: critical; 10 seconds: give up)
- coverstore already monitored; no change required
- all services in upstart script; all restarts call that
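
The string-and-timeframe style of check described above could be sketched roughly as follows. This is a minimal illustration in Python, not the check Ops actually runs: the function names (`classify`, `check_url`) are hypothetical, the 3/5/10 second values are taken from the notes, and treating a missing string as critical is an assumption.

```python
import time
import urllib.request

# Nagios plugin exit codes, as listed in the notes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(body, elapsed, expected, warn_after=3.0, crit_after=5.0):
    """Map a fetched page and its response time (seconds) to an exit code."""
    if expected not in body:
        # Assumption: a page without the expected string counts as critical.
        return CRITICAL
    if elapsed >= crit_after:
        return CRITICAL
    if elapsed >= warn_after:
        return WARNING
    return OK


def check_url(url, expected, give_up_after=10.0):
    """Fetch url, giving up after give_up_after seconds, and classify it.

    Note: urlopen's timeout applies per socket operation, so this only
    approximates a hard 10 second limit on the whole request.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=give_up_after) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers connection errors, HTTP errors and timeouts.
        return CRITICAL
    return classify(body, time.monotonic() - start, expected)
```

A wrapper script would call `check_url(...)` and pass the result to `sys.exit()` so NAGIOS can read the status.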

SOLR
- live updates to one SOLR
- 4GB upstream SOLR memory (untested under production load)
- check SOLR restart stuff into the OL repo at olsystem/event.d/solr-upstream

NEED
- To check that Upstream SOLR is online (individually) by hitting it directly (not through the website). Send a URL request:
- A test for each of the indexes (x4)
  - http://ia331507.us.archive.org:8983/solr/works/select?wt=json&q=city
  - http://ia331507.us.archive.org:8984/solr/authors/select?wt=json&q=mark
  - http://ia331507.us.archive.org:8984/solr/subjects/select?wt=json&q=city
  - http://ia331507.us.archive.org:8984/solr/editions/select?wt=json&q=mark
    - searches will contain this if working: "response":{"numFound":
    - if it's not working, you'll get responses in HTML, not JSON
- Stick to one port for SOLR - 8983 (the default)
- Update http://home.us.archive.org/cgi-bin/twiki/viewauth/OpenLibrary/WebHome#Ops
- Ralf needs to know when to start what service - we mostly only restart fastcgis; rarely go beyond that
- Should we monitor each process on each node? (Ralf to investigate)
    - or, CRON job every 5 mins; NAGIOS can ping
    - or, could run /usr/lib/nagios/plugins/check_procs -h against benchmarks (benchmarks TBD)
    - main memory concern is the OL software. Perhaps we should watch that specifically 
    - or, look for OUT OF MEMORY in standard error in the log (SOLR)
        - If SOLR runs out of memory, you'll see:
            - SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
- to check memory use for fastcgis - run on ia331532 and ia331533 nodes
    - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a openlibrary-server
        - right now this checks all fastcgis; might be able to isolate specific processes
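
The two SOLR checks in the list above (a direct query per index, and scanning the log for out-of-memory errors) might look something like this in Python. It is only a sketch: `classify_solr_body` and `log_has_oom` are hypothetical names, and it parses the JSON rather than searching for the literal string `"response":{"numFound":`, which is a slightly stricter variant of what the notes describe.

```python
import json

# Nagios plugin exit codes used here.
OK, CRITICAL = 0, 2


def classify_solr_body(body):
    """Return OK if body is a SOLR JSON reply containing response.numFound.

    A broken SOLR answers with HTML, which fails to parse as JSON.
    """
    try:
        data = json.loads(body)
    except ValueError:
        return CRITICAL
    if not isinstance(data, dict):
        return CRITICAL
    response = data.get("response")
    if isinstance(response, dict) and "numFound" in response:
        return OK
    return CRITICAL


def log_has_oom(lines):
    """Return True if any log line reports a Java OutOfMemoryError."""
    return any("java.lang.OutOfMemoryError" in line for line in lines)
```

The body passed to `classify_solr_body` would come from fetching one of the four select URLs listed above; `log_has_oom` accepts any iterable of lines, such as an open log file.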

UPSTREAM NOW
- shared DB with production (database; infobase server; http interface)
- different URL structures

TO LAUNCH
- set up testing server on May 3; run tests through that
- migration: 
    - need to remove the URL adapter by adding new versions for all pages
        - restart all the memcache servers 
    - move all templates to "a regular place"
    - restart server to make sure Upstream plugin is loading
    - Profit!

TO DO
- Ralf will put these checks into NAGIOS; then we'll review
- ia331532 and ia331533
    - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7031'
    - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7030'
- Anand: benchmarks for memory usage - under 2GB is fine; between 2GB and 3GB is a warning; over 3GB is critical
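
For illustration, the memory thresholds passed to check_procs above can be expressed as a small Python function. This is a sketch of the threshold logic only, not a reimplementation of check_procs; the names are hypothetical, and it assumes VSZ is reported in kilobytes, so 2000000 and 3000000 correspond to roughly 2GB and 3GB.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_vsz(vsz_kb, warn_kb=2_000_000, crit_kb=3_000_000):
    """Classify one process's virtual memory size (assumed to be in KB)."""
    if vsz_kb > crit_kb:
        return CRITICAL
    if vsz_kb > warn_kb:
        return WARNING
    return OK


def worst_status(vsz_values_kb):
    """Return the most severe status across several fastcgi processes."""
    return max((classify_vsz(v) for v in vsz_values_kb), default=OK)
```

For example, `worst_status([1_500_000, 2_400_000])` returns the warning code because one process is above the 2GB mark.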

George (george-archive) wrote on 2010-04-23:

#4

6PM PST is preferred for calls with Anand if needed.

George (george-archive) wrote on 2010-04-24:

#5

Can we check that other Ops peeps can log on to OL machines? Like, Sam?

Anand Chitipothu (anandology) wrote on 2010-04-25: Re: [Bug 561894] Re: Make sure it's possible for Ops to restart fastcgi processes through NAGIOS

#6

On 24-Apr-10, at 5:51 AM, George wrote:

> Can we check that other Ops peeps can log on to OL machines? Like,
> Sam?

Yes, ops people have login to all nodes in the cluster.

George (george-archive) on 2010-04-26

Changed in openlibrary:
status:	New → In Progress

George (george-archive) wrote on 2010-05-27:

#7

Ralf - let us know what you still need from us to restart fastcgis.

Changed in openlibrary:
assignee:	Anand Chitipothu (anandology) → Ralf Muehlen (launchpad-muehlen)
importance:	Critical → High

George (george-archive) on 2010-06-04

Changed in openlibrary:
milestone:	upstream-to-www → stability

Ralf Muehlen (launchpad-muehlen) wrote on 2010-06-21:

#8

I updated nagios to include restart links for newer services, and dropped the old upstream services. The only services that cannot currently be restarted are the Search Engines on ia331508 and 09. If someone provides a restart script, I can make the nagios links.

Changed in openlibrary:
status:	In Progress → Fix Released

Ralf Muehlen (launchpad-muehlen) wrote on 2010-06-21:

#9

Search Engines now also have restart links.

Open Library

Make sure it's possible for Ops to restart fastcgi processes through NAGIOS
