[00:35:53] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:39:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:05] <Tamzin>	 might not be an operations thing, but as early warning system: Lua errors popping up randomly on enwiki rn
[01:43:44] <NoSQL>	 Example of ^^^, https://i.imgur.com/BlioHB0.png / https://i.imgur.com/OlR4aFv.png
[01:45:03] <Tamzin>	 No obvious cause in https://en.wikipedia.org/w/index.php?target=Template%3AAdmin&limit=500&days=7&enhanced=1&title=Special:RecentChangesLinked&urlversion=2
[01:49:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:43] <Krinkle>	 !log repool mw1340
[02:32:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:39:31] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:47:37] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:50:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:01:03] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:11] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:35] <icinga-wm>	 PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2005 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:47:17] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service,mediawiki_job_db_lag_stats_reporter.service,mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:35] <icinga-wm>	 PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:47:59] <icinga-wm>	 PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: etcdmirror-conftool-eqiad-wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:19] <icinga-wm>	 PROBLEM - Etcd replication lag #page on conf2005 is CRITICAL: connect to address 10.192.32.52 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Etcd
[04:48:33] <icinga-wm>	 PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:51] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:57:57] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[04:59:19] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[05:00:05] <icinga-wm>	 RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:58] <cwhite>	 o/
[05:01:13] <cwhite>	 the cert is expired on conf1006
[05:01:56] <mutante>	 I am here now
[05:02:07] <mutante>	 half the time it took me to get on IRC..hrm
[05:04:16] <mutante>	 cwhite: ok, I acked it (not resolved, but acked)
[05:04:53] <mutante>	 alert was for conf2005 though, not 1006
[05:05:21] <cwhite>	 mutante: conf2005 mirror service failed due to certificate expired
[05:05:55] <cwhite>	 CRITICAL: Generic error: Connection to etcd failed due to MaxRetryError('HTTPSConnectionPool(host=\'conf1006.eqiad.wmnet\', port=4001): Max retries exceeded ... SSLError(SSLError("bad handshake: Error([(\'SSL routines\', \'tls_process_server_certificate\', \'certificate verify failed\')],)",),))',)
[05:06:01] <icinga-wm>	 RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:06] <mutante>	 cwhite: looking for docs on wikitech or a previous ticket
[05:06:49] <mutante>	 it sounds like we just cant sync changes but it's not like something is down
[05:11:48] <mutante>	 https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Consistency
[05:12:57] <icinga-wm>	 PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:55] <icinga-wm>	 PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:02] <mutante>	 looking if it's made with cergen
[05:17:11] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service andrew bogott whatever this is, it doesnt need to be fixed on a weekend. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:17:11] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service andrew bogott whatever this is, it doesnt need to be fixed on a weekend. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:20] <dcaro_away>	 hey, me here too
[05:18:37] <dcaro>	 labweb started crashing also
[05:19:13] <mutante>	 currently we were looking at the cert and found one for codfw made with cergen, but the expired one is eqiad
[05:19:19] <mutante>	 now reading https://phabricator.wikimedia.org/T302153
[05:19:41] <mutante>	 cwhite: that ticket confirms it "I think this is the old etcd certificate we used to use for etcd in codfw; since we've moved to etcd v3 we're using a new cert created with cergen:"
[05:19:46] <dcaro>	 👍 let me know if I can help
[05:19:50] <mutante>	 so codfw was already renewed 
[05:27:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:28:13] <andrewbogott>	 mutante/cwhite, anything I can do to help?
[05:30:22] <mutante>	 andrewbogott: the cert in eqiad is expired but in codfw it's not. we could renew it but additionally eqiad is not using cergen yet
[05:30:52] <mutante>	 we are currently figuring out how to switch it
[05:31:25] <mutante>	 trying to make a new cert for eqiad with cergen
[05:32:01] <andrewbogott>	 ok :)
[05:32:18] <andrewbogott>	 It is very late here so probably I will tune out if you already have a plan of action
[05:37:01] <icinga-wm>	 RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:39:52] <dcaro>	 hmm, ^that is not true right now, still failing wikitech_run_jobs.service
[05:40:32] <dcaro>	 I tihnk it's just seeing it between the restart and the fail and flapping
[05:40:59] <mutante>	 I am still editing yaml to create a new cert with cergen
[05:41:11] <mutante>	 1001 through 1003 dont exist anymore
[05:41:21] <mutante>	 but we need 1004, 1005 and 1006, both short name and FQDN
[05:41:28] <mutante>	 and etcd.eqiad.wmnet
[05:41:36] <cwhite>	 +1
[05:42:10] <mutante>	 following https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate  and adding eqiad next to existing codfw
[05:42:20] <mutante>	 need to revoke old cert from puppetmaster
[05:42:20] <cwhite>	 LGTM
[05:43:04] <mutante>	  sudo puppet cert list --all | grep etcd
[05:43:19] <mutante>	 on puppetmaster1001 shows etcd.eqiad.wmnet is expired, so "puppet cert clean"ing that
[05:43:57] <icinga-wm>	 PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:44:23] <mutante>	 !log puppetmaster1001 - sudo puppet cert clean etcd.eqiad.wmnet (expired)
[05:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:45] <mutante>	 don't need to do the step to delete files because that is codfw, and that is not expired
[05:46:09] <mutante>	 so leaving secret/secrets/certificates/etcd-v3.codfw.wmnet alone 
[05:48:24] <cwhite>	 The same certificate is expired in codfw as well, but maybe it isn't paging because codfw isn't the primary and being mirrored.
[05:50:26] <mutante>	 sudo cergen -c 'etcd*' --generate --base-path=/srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d   did not create it 
[05:50:41] <mutante>	 so if this was renewing the existing cert then I would have to delete the files
[05:50:50] <mutante>	 since you say that is also expired.. ok..doing that
[05:51:19] <icinga-wm>	 RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[05:51:30] <cwhite>	 mutante: nevermind, 
[05:51:34] <mutante>	 cwhite:  "etcd-v3.codfw.wmnet" seems ok
[05:51:48] <cwhite>	 mutante: it appears there's an old abandoned file
[05:51:56] <cwhite>	 the one nginx using is valid
[05:52:02] <mutante>	 now why does it not pick up the etcd one I added to the yaml
[05:54:12] <mutante>	 NOTE: If you are regenerating a Puppet signed certificate, you must first remove the certificate from the Puppet CA. puppet cert clean <common_name> should do it. 
[05:58:20] <mutante>	 ok, I am doing it a different way, in a new .yaml file just for eqiad
[06:02:07] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:07:03] <icinga-wm>	 RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:10:28] <mutante>	 we now have a new "etcd-v3.eqiad.wmnet" cert
[06:11:36] <mutante>	 now gotta add it to public repo
[06:13:57] <icinga-wm>	 PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:47] <wikibugs>	 (03PS1) 10Dzahn: add new certificate for etcd-v3.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/787884
[06:19:32] <wikibugs>	 (03CR) 10Dzahn: "openssl x509 -noout -text -in etcd-v3.eqiad.wmnet.crt  | grep DNS" [puppet] - 10https://gerrit.wikimedia.org/r/787884 (owner: 10Dzahn)
[06:19:58] <wikibugs>	 (03PS2) 10Dzahn: add new certificate for etcd-v3.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/787884 (https://phabricator.wikimedia.org/T302153)
[06:20:21] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/787884 (https://phabricator.wikimedia.org/T302153) (owner: 10Dzahn)
[06:20:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add new certificate for etcd-v3.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/787884 (https://phabricator.wikimedia.org/T302153) (owner: 10Dzahn)
[06:22:25] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:01] <wikibugs>	 (03PS1) 10Cwhite: hiera: tlsproxy: use new etcd-v3 certificate [puppet] - 10https://gerrit.wikimedia.org/r/787885 (https://phabricator.wikimedia.org/T302153)
[06:24:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "yes, makes sense. codfw has it, common does not yet." [puppet] - 10https://gerrit.wikimedia.org/r/787885 (https://phabricator.wikimedia.org/T302153) (owner: 10Cwhite)
[06:26:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] hiera: tlsproxy: use new etcd-v3 certificate [puppet] - 10https://gerrit.wikimedia.org/r/787885 (https://phabricator.wikimedia.org/T302153) (owner: 10Cwhite)
[06:27:29] <mutante>	 when I puppet merged there were issues ..syncing conftool data.. not surprising..due to the cert
[06:32:01] <icinga-wm>	 RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:31] <icinga-wm>	 RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:47] <cwhite>	 !log restart etcdmirror-conftool.eqiad.wmnet on conf2005
[06:37:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:38:01] <icinga-wm>	 RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:12] <icinga-wm>	 RECOVERY - Etcd replication lag #page on conf2005 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Etcd
[06:38:17] <mutante>	 dcaro: May 01 06:38:02 labweb1001 systemd[1]: wikitech_run_jobs.service: Succeeded.
[06:38:27] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[06:38:49] <cwhite>	 there's the recovery
[06:38:59] <icinga-wm>	 RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2005 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:38:59] <cwhite>	 thanks, mutante!
[06:39:26] <mutante>	 checking Icinga
[06:39:28] <mutante>	 all looks good
[06:39:30] <mutante>	 thanks, cwhite 
[06:39:36] <mutante>	 pheew..man :)
[06:39:49] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[06:42:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:46:27] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:26] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (10Dzahn) 06:42 < mutante> etcd in codfw had already been converted to use cergen and etcd-v3 certs but eqiad had not 06:43 < mutante> eventually cwhite and myself figured that out and...
[07:01:17] <icinga-wm>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2022-05-04 06:59:52 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[07:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:14:36] <dcaro>	 hatnks both!
[07:41:29] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:15] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:55] <RhinosF1>	 dcaro: is toolserver.org supposed to renew auto? If Monday is a day off for WMCS then that will expire before next working hours
[07:50:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:52:33] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:58:42] <taavi>	 RhinosF1: why would monday be a day off?
[07:58:56] <RhinosF1>	 taavi: May Day
[07:59:11] <RhinosF1>	 It is in the UK, no idea about every country
[07:59:13] <taavi>	 that's on sunday
[08:00:35] <RhinosF1>	 taavi: UK carries forward to next working day
[08:00:47] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:56] <RhinosF1>	 So we get a week day off if a bank holiday falls on a weekend
[08:02:12] <taavi>	 that's unfair, we don't have that :(
[08:02:57] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:09] <RhinosF1>	 taavi: that's stupid. What's the point in a
[08:03:20] <RhinosF1>	 Holiday on a weekend
[08:03:24] <RhinosF1>	 You're already off
[08:07:49] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:11:45] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:09] <p858snake>	 because most public holidays are about remembrance of a significant event, not a day off work
[08:16:05] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:01] <RhinosF1>	 p858snake: the jubilee is and we get an extra day so we can have even stupider street parties
[08:22:37] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:07] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:01] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:51:19] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:45] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:33] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:15:57] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:37] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:17:51] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:43] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:51] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:05] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:57:41] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:59] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 106 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:09:15] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:57] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:18:49] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:37:15] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:43:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:50:45] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:07] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:57:39] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:11:45] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:05] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:21] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:57] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:56:49] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:20] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Fuzzy) Are there any updates on the issue? I'm playing around the limit by removing functionality from some templates, but I cannot dodge the lim...
[12:00:13] <wikibugs>	 (03PS6) 10Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465)
[12:00:19] <wikibugs>	 (03PS7) 10Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465)
[12:03:19] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:55] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:10:11] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:12:31] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:01] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[12:22:37] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:23:11] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:26:37] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01613 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[12:27:13] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:31:15] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:15] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:07] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:55] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:56:03] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:02:55] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:13] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:49] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:20:19] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:22:31] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:03] <wikibugs>	 (03PS1) 10Majavah: keyholder: Collect Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/787911
[13:24:13] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:27:59] <wikibugs>	 (03PS2) 10Majavah: keyholder: Collect Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/787911
[13:29:19] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:15] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:45] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:24] <cwhite>	 !log reload nginx on conf1004/1005 to pick up cert changes
[13:51:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:49] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:52:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:55:29] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:03:18] <jinxer-wm>	 (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:08:18] <jinxer-wm>	 (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:21:33] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:48:27] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:45] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:01:56] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:15:03] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:54:11] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:01:55] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:12:45] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:40:23] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:33:07] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:41] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:16:19] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[18:18:29] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[18:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:54:27] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:56] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:19:57] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:50:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:06:53] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:37:41] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:44:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[20:44:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[20:44:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[20:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[20:44:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T306560)', diff saved to https://phabricator.wikimedia.org/P27164 and previous config saved to /var/cache/conftool/dbconfig/20220501-204427-ladsgroup.json
[20:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:31] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[20:46:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T306560)', diff saved to https://phabricator.wikimedia.org/P27165 and previous config saved to /var/cache/conftool/dbconfig/20220501-204640-ladsgroup.json
[20:46:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[20:51:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[20:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[20:52:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[20:52:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27166 and previous config saved to /var/cache/conftool/dbconfig/20220501-210145-ladsgroup.json
[21:01:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[21:09:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[21:09:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:51] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27167 and previous config saved to /var/cache/conftool/dbconfig/20220501-211650-ladsgroup.json
[21:16:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T306560)', diff saved to https://phabricator.wikimedia.org/P27168 and previous config saved to /var/cache/conftool/dbconfig/20220501-213155-ladsgroup.json
[21:31:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[21:31:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[21:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:00] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[21:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27169 and previous config saved to /var/cache/conftool/dbconfig/20220501-213203-ladsgroup.json
[21:32:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27170 and previous config saved to /var/cache/conftool/dbconfig/20220501-213415-ladsgroup.json
[21:34:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[21:34:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[21:34:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[21:37:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[21:37:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27171 and previous config saved to /var/cache/conftool/dbconfig/20220501-213750-ladsgroup.json
[21:37:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:55] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[21:38:51] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:41:11] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:49:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27172 and previous config saved to /var/cache/conftool/dbconfig/20220501-214920-ladsgroup.json
[21:49:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27173 and previous config saved to /var/cache/conftool/dbconfig/20220501-215405-ladsgroup.json
[21:54:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:10] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[22:04:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27174 and previous config saved to /var/cache/conftool/dbconfig/20220501-220425-ladsgroup.json
[22:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27175 and previous config saved to /var/cache/conftool/dbconfig/20220501-220910-ladsgroup.json
[22:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:17] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:19:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27176 and previous config saved to /var/cache/conftool/dbconfig/20220501-221930-ladsgroup.json
[22:19:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[22:19:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[22:19:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27177 and previous config saved to /var/cache/conftool/dbconfig/20220501-221938-ladsgroup.json
[22:19:42] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[22:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27178 and previous config saved to /var/cache/conftool/dbconfig/20220501-222152-ladsgroup.json
[22:21:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27179 and previous config saved to /var/cache/conftool/dbconfig/20220501-222415-ladsgroup.json
[22:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:29] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:36:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27180 and previous config saved to /var/cache/conftool/dbconfig/20220501-223657-ladsgroup.json
[22:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:39:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27181 and previous config saved to /var/cache/conftool/dbconfig/20220501-223920-ladsgroup.json
[22:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:25] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[22:52:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27182 and previous config saved to /var/cache/conftool/dbconfig/20220501-225202-ladsgroup.json
[22:52:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[22:56:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[22:56:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27183 and previous config saved to /var/cache/conftool/dbconfig/20220501-225626-ladsgroup.json
[22:56:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:30] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[23:01:55] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:56] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:07:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27184 and previous config saved to /var/cache/conftool/dbconfig/20220501-230707-ladsgroup.json
[23:07:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[23:07:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[23:07:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:13] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[23:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T306560)', diff saved to https://phabricator.wikimedia.org/P27185 and previous config saved to /var/cache/conftool/dbconfig/20220501-230715-ladsgroup.json
[23:07:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T306560)', diff saved to https://phabricator.wikimedia.org/P27186 and previous config saved to /var/cache/conftool/dbconfig/20220501-230928-ladsgroup.json
[23:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27187 and previous config saved to /var/cache/conftool/dbconfig/20220501-231227-ladsgroup.json
[23:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:31] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[23:14:53] <wikibugs>	 (03PS1) 10Ssingh: certspotter: remove rate-limiting CT log [puppet] - 10https://gerrit.wikimedia.org/r/787937
[23:15:32] <sukhe>	 ^ this is spamming the sre-traffic list and hence the Sunday fix
[23:19:08] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/pcc-worker1003/35015/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/787937 (owner: 10Ssingh)
[23:19:11] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] certspotter: remove rate-limiting CT log [puppet] - 10https://gerrit.wikimedia.org/r/787937 (owner: 10Ssingh)
[23:22:37] <icinga-wm>	 PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:24:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27188 and previous config saved to /var/cache/conftool/dbconfig/20220501-232433-ladsgroup.json
[23:24:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27189 and previous config saved to /var/cache/conftool/dbconfig/20220501-232732-ladsgroup.json
[23:27:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27190 and previous config saved to /var/cache/conftool/dbconfig/20220501-233938-ladsgroup.json
[23:39:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:53] <wikibugs>	 (03PS1) 10Brian Wolff: Update my blog url [puppet] - 10https://gerrit.wikimedia.org/r/787938
[23:42:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27191 and previous config saved to /var/cache/conftool/dbconfig/20220501-234237-ladsgroup.json
[23:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:15] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:11] <wikibugs>	 (03PS2) 10Brian Wolff: Planet: Update my (bawolff) blog url [puppet] - 10https://gerrit.wikimedia.org/r/787938
[23:46:27] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:50:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:54:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T306560)', diff saved to https://phabricator.wikimedia.org/P27192 and previous config saved to /var/cache/conftool/dbconfig/20220501-235443-ladsgroup.json
[23:54:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[23:54:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[23:54:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:48] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[23:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[23:55:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[23:55:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[23:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[23:55:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[23:55:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[23:55:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[23:55:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[23:55:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T306560)', diff saved to https://phabricator.wikimedia.org/P27193 and previous config saved to /var/cache/conftool/dbconfig/20220501-235549-ladsgroup.json
[23:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T306560)', diff saved to https://phabricator.wikimedia.org/P27194 and previous config saved to /var/cache/conftool/dbconfig/20220501-235700-ladsgroup.json
[23:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27195 and previous config saved to /var/cache/conftool/dbconfig/20220501-235742-ladsgroup.json
[23:57:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:46] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565