NAGIOS
Plugin exit codes: 0 = all good; 1 = warning; 2 = critical error
- checks commonly just look for a string within a webpage, and expect it within a certain timeframe
Current NAGIOS setup?
- http://nagios2.us.archive.org/control/nagios-status.php?hostgroup=24.openlibrary&style=detail&hoststatustypes=15
- Ralf: "It works, but I don't know what to do in the middle of the night. I just restarted everything and it seemed to fix it."
- There are some memory leaks in the program. Restarting the fastcgis should fix those.
- current timeframes (3 seconds: warning; 5 seconds: critical; 10 seconds: give up)
- coverstore already monitored; no change required
- all services in upstart script; all restarts call that
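A minimal sketch of the kind of check described above: fetch a page, look for an expected string, and map the elapsed time onto the NAGIOS states using the current timeframes (3 seconds warning, 5 seconds critical, 10 seconds give up). The function names and the choice of Python are assumptions for illustration; the existing checks may well use a stock plugin instead.

```python
import time
import urllib.request

# NAGIOS plugin exit codes, as listed at the top of these notes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(found, elapsed, warn=3.0, crit=5.0):
    """Map 'was the string found' and the elapsed seconds to a NAGIOS state."""
    if not found or elapsed >= crit:
        return CRITICAL
    if elapsed >= warn:
        return WARNING
    return OK


def check_page(url, expected, timeout=10.0):
    """Fetch url and look for expected; give up after roughly timeout seconds.

    Note: urlopen's timeout applies to each socket operation, so it is only
    an approximation of a hard 10 second cutoff.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers connection failures, HTTP errors and socket timeouts.
        return CRITICAL
    return classify(expected in body, time.monotonic() - start)
```

A wrapper script would call check_page with the page URL and the expected string, then exit with the returned code so NAGIOS can read the state.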
SOLR
- live updates to one SOLR
- 4GB upstream SOLR memory (untested under production load)
- check the SOLR restart stuff into the OL repo at olsystem/event.d/solr-upstream
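Since the 4GB SOLR memory setting is untested under production load, one option is to watch the SOLR standard-error log for an out-of-memory line. The sketch below assumes the log is a plain text file whose path is passed in by the caller (the real path is not given in these notes), and the function names are hypothetical.

```python
# Substring that appears in the SEVERE line SOLR logs when it runs out of memory.
OOM_MARKER = "java.lang.OutOfMemoryError"


def find_oom_lines(lines):
    """Return every line (without trailing newline) that mentions an OOM error."""
    return [line.rstrip("\n") for line in lines if OOM_MARKER in line]


def check_log(path):
    """Return (state, hits): state 2 (critical) if any OOM line exists, else 0."""
    with open(path, errors="replace") as log:
        hits = find_oom_lines(log)
    return (2 if hits else 0), hits
```

Scanning the whole file means an old error keeps alerting until the log is rotated; a real check would probably remember where it last stopped reading.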
NEED
- A check that Upstream SOLR is online (each index individually), hitting it directly rather than through the website. Send a URL request:
- A test for each of the indexes (x4)
- http://ia331507.us.archive.org:8983/solr/works/select?wt=json&q=city
- http://ia331507.us.archive.org:8984/solr/authors/select?wt=json&q=mark
- http://ia331507.us.archive.org:8984/solr/subjects/select?wt=json&q=city
- http://ia331507.us.archive.org:8984/solr/editions/select?wt=json&q=mark
- searches will contain this if working: "response":{"numFound":
- if it's not working, you'll get responses in HTML, not JSON
- Stick to one port for SOLR - 8983 (the default)
- Update http://home.us.archive.org/cgi-bin/twiki/viewauth/OpenLibrary/WebHome#Ops
- Ralf needs to know when to start what service - we mostly only restart fastcgis; rarely go beyond that
- Should we monitor each process on each node? (Ralf to investigate)
- or, CRON job every 5 mins; NAGIOS can ping
- or, could run /usr/lib/nagios/plugins/check_procs -h against benchmarks (benchmarks TBD)
- main memory concern is the OL software. Perhaps we should watch that specifically
- or, look for OUT OF MEMORY in standard error in the log (SOLR)
- If SOLR runs out of memory, you'll see:
- SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
- to check memory use for fastcgis - run on ia311532 and ia311533 nodes
- /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a openlibrary-server
- right now this checks all fastcgis; might be able to isolate specific processes
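The direct SOLR test listed above (one request per index, JSON containing numFound when healthy, HTML when broken) could be sketched as follows. This is an assumption-laden illustration rather than the agreed check: the function names are hypothetical, and it parses the JSON instead of searching for the literal "response":{"numFound": string so that whitespace differences do not matter.

```python
import json
import urllib.request

OK, CRITICAL = 0, 2  # NAGIOS plugin exit codes used here


def classify_solr_body(body):
    """Return (state, message) for a raw SOLR response body."""
    try:
        data = json.loads(body)
    except ValueError:
        # A failing SOLR answers with an HTML error page, which is not JSON.
        return CRITICAL, "response is not JSON"
    response = data.get("response") if isinstance(data, dict) else None
    if isinstance(response, dict) and "numFound" in response:
        return OK, "numFound=%s" % response["numFound"]
    return CRITICAL, "JSON has no response.numFound"


def check_solr(url, timeout=10.0):
    """Query one SOLR index directly and classify the answer."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # Connection failures, HTTP errors and socket timeouts land here.
        return CRITICAL, "request failed: %s" % exc
    return classify_solr_body(body)


# Example call, using one of the test URLs from the list above:
# check_solr("http://ia331507.us.archive.org:8983/solr/works/select?wt=json&q=city")
```

Running it once per index URL gives the four separate tests, so a single broken index shows up on its own.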
UPSTREAM NOW
- shared DB with production (database; infobase server; http interface)
- different URL structures
TO LAUNCH
- set up testing server on May 3; run tests through that
- migration:
- need to remove the URL adapter by adding new versions for all pages
- restart all the memcache servers
- move all templates to "a regular place"
- restart server to make sure Upstream plugin is loading
- Profit!
TO DO
- Ralf will put these checks into NAGIOS; then we'll review
- ia331532 and ia331533
- /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7031'
- /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7030'
- Anand: benchmarks for memory usage - under 2GB is fine; between 2GB and 3GB is a warning; over 3GB is critical
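The benchmark above lines up with the -w 2000000 -c 3000000 arguments in the check_procs commands, assuming VSZ is reported in kilobytes. A small sketch of the same classification, with a hypothetical function name, in case the thresholds need to be applied outside check_procs:

```python
# Thresholds in kilobytes, matching -w 2000000 -c 3000000 in the commands above.
WARN_KB = 2000000
CRIT_KB = 3000000


def classify_vsz(vsz_kb, warn_kb=WARN_KB, crit_kb=CRIT_KB):
    """Return the NAGIOS state for a process's virtual memory size in KB."""
    if vsz_kb > crit_kb:
        return 2  # critical: over roughly 3GB
    if vsz_kb > warn_kb:
        return 1  # warning: between roughly 2GB and 3GB
    return 0  # fine: under roughly 2GB
```

For example, a fastcgi process at 2500000 KB would be reported as a warning.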