[00:35:53] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:39:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:05] might not be an operations thing, but as early warning system: Lua errors popping up randomly on enwiki rn [01:43:44] Example of ^^^, https://i.imgur.com/BlioHB0.png / https://i.imgur.com/OlR4aFv.png [01:45:03] No obvious cause in https://en.wikipedia.org/w/index.php?target=Template%3AAdmin&limit=500&days=7&enhanced=1&title=Special:RecentChangesLinked&urlversion=2 [01:49:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:43] !log repool mw1340 [02:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:39:31] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:47:37] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:01:03] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:11] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:46:35] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2005 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:47:17] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service,mediawiki_job_db_lag_stats_reporter.service,mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:47:35] PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:47:45] (JobUnavailable) firing: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:47:59] 
PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: etcdmirror-conftool-eqiad-wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:48:19] PROBLEM - Etcd replication lag #page on conf2005 is CRITICAL: connect to address 10.192.32.52 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Etcd [04:48:33] PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:48:51] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:57:57] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [04:59:19] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [05:00:05] RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:00:58] o/ [05:01:13] the cert is expired on conf1006 [05:01:56] I am here now [05:02:07] half the time it took me to get on IRC..hrm [05:04:16] cwhite: ok, I acked it (not resolved, but acked) [05:04:53] alert was for conf2005 though, not 1006 [05:05:21] mutante: conf2005 mirror service failed due to certificate expired [05:05:55] CRITICAL: Generic error: Connection to etcd failed due to MaxRetryError('HTTPSConnectionPool(host=\'conf1006.eqiad.wmnet\', port=4001): Max retries exceeded ... 
SSLError(SSLError("bad handshake: Error([(\'SSL routines\', \'tls_process_server_certificate\', \'certificate verify failed\')],)",),))',) [05:06:01] RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:06] cwhite: looking for docs on wikitech or a previous ticket [05:06:49] it sounds like we just cant sync changes but it's not like something is down [05:11:48] https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Consistency [05:12:57] PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:13:55] PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:02] looking if it's made with cergen [05:17:11] ACKNOWLEDGEMENT - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service andrew bogott whatever this is, it doesnt need to be fixed on a weekend. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:17:11] ACKNOWLEDGEMENT - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service andrew bogott whatever this is, it doesnt need to be fixed on a weekend. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:20] hey, me here too [05:18:37] labweb started crashing also [05:19:13] currently we were looking at the cert and found one for codfw made with cergen, but the expired one is eqiad [05:19:19] now reading https://phabricator.wikimedia.org/T302153 [05:19:41] cwhite: that ticket confirms it "I think this is the old etcd certificate we used to use for etcd in codfw; since we've moved to etcd v3 we're using a new cert created with cergen:" [05:19:46] 👍 let me know if I can help [05:19:50] so codfw was already renewed [05:27:45] (JobUnavailable) firing: (2) Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:28:13] mutante/cwhite, anything I can do to help? [05:30:22] andrewbogott: the cert in eqiad is expired but in codfw it's not. 
we could renew it but additionally eqiad is not using cergen yet [05:30:52] we are currently figuring out how to switch it [05:31:25] trying to make a new cert for eqiad with cergen [05:32:01] ok :) [05:32:18] It is very late here so probably I will tune out if you already have a plan of action [05:37:01] RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:39:52] hmm, ^that is not true right now, still failing wikitech_run_jobs.service [05:40:32] I think it's just seeing it between the restart and the fail and flapping [05:40:59] I am still editing yaml to create a new cert with cergen [05:41:11] 1001 through 1003 don't exist anymore [05:41:21] but we need 1004, 1005 and 1006, both short name and FQDN [05:41:28] and etcd.eqiad.wmnet [05:41:36] +1 [05:42:10] following https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate and adding eqiad next to existing codfw [05:42:20] need to revoke old cert from puppetmaster [05:42:20] LGTM [05:43:04] sudo puppet cert list --all | grep etcd [05:43:19] on puppetmaster1001 shows etcd.eqiad.wmnet is expired, so "puppet cert clean"ing that [05:43:57] PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:44:23] !log puppetmaster1001 - sudo puppet cert clean etcd.eqiad.wmnet (expired) [05:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:45] don't need to do the step to delete files because that is codfw, and that is not expired [05:46:09] so leaving secret/secrets/certificates/etcd-v3.codfw.wmnet alone [05:48:24] The same certificate is expired in codfw as well, but maybe it isn't paging because codfw isn't the primary and being mirrored. 
[05:50:26] sudo cergen -c 'etcd*' --generate --base-path=/srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d did not create it [05:50:41] so if this was renewing the existing cert then I would have to delete the files [05:50:50] since you say that is also expired.. ok..doing that [05:51:19] RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [05:51:30] mutante: nevermind, [05:51:34] cwhite: "etcd-v3.codfw.wmnet" seems ok [05:51:48] mutante: it appears there's an old abandoned file [05:51:56] the one nginx using is valid [05:52:02] now why does it not pick up the etcd one I added to the yaml [05:54:12] NOTE: If you are regenerating a Puppet signed certificate, you must first remove the certificate from the Puppet CA. puppet cert clean should do it. [05:58:20] ok, I am doing it a different way, in a new .yaml file just for eqiad [06:02:07] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:07:03] RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:10:28] we now have a new "etcd-v3.eqiad.wmnet" cert [06:11:36] now gotta add it to public repo [06:13:57] PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:18:47] (PS1) Dzahn: add new certificate for etcd-v3.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/787884 [06:19:32] (CR) Dzahn: "openssl x509 -noout -text -in etcd-v3.eqiad.wmnet.crt | grep DNS" [puppet] - https://gerrit.wikimedia.org/r/787884 (owner: Dzahn) [06:19:58] (PS2) Dzahn: add new certificate 
for etcd-v3.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/787884 (https://phabricator.wikimedia.org/T302153) [06:20:21] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/787884 (https://phabricator.wikimedia.org/T302153) (owner: Dzahn) [06:20:29] (CR) Dzahn: [C: +2] add new certificate for etcd-v3.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/787884 (https://phabricator.wikimedia.org/T302153) (owner: Dzahn) [06:22:25] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:01] (PS1) Cwhite: hiera: tlsproxy: use new etcd-v3 certificate [puppet] - https://gerrit.wikimedia.org/r/787885 (https://phabricator.wikimedia.org/T302153) [06:24:39] (CR) Dzahn: [C: +1] "yes, makes sense. codfw has it, common does not yet." [puppet] - https://gerrit.wikimedia.org/r/787885 (https://phabricator.wikimedia.org/T302153) (owner: Cwhite) [06:26:20] (CR) Dzahn: [C: +2] hiera: tlsproxy: use new etcd-v3 certificate [puppet] - https://gerrit.wikimedia.org/r/787885 (https://phabricator.wikimedia.org/T302153) (owner: Cwhite) [06:27:29] when I puppet merged there were issues ..syncing conftool data.. 
not surprising..due to the cert [06:32:01] RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:37:31] RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:37:47] !log restart etcdmirror-conftool.eqiad.wmnet on conf2005 [06:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:38:01] RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:12] RECOVERY - Etcd replication lag #page on conf2005 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Etcd [06:38:17] dcaro: May 01 06:38:02 labweb1001 systemd[1]: wikitech_run_jobs.service: Succeeded. [06:38:27] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [06:38:49] there's the recovery [06:38:59] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2005 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:38:59] thanks, mutante! 
[06:39:26] checking Icinga [06:39:28] all looks good [06:39:30] thanks, cwhite [06:39:36] pheew..man :) [06:39:49] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [06:42:45] (JobUnavailable) firing: (2) Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:46:27] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:26] SRE, serviceops, Patch-For-Review: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (Dzahn) 06:42 < mutante> etcd in codfw had already been converted to use cergen and etcd-v3 certs but eqiad had not 06:43 < mutante> eventually cwhite and myself figured that out and... [07:01:17] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2022-05-04 06:59:52 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:14:36] thanks both! 
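Editor's note: the etcd outage earlier in this log was caused by an expired certificate, and the toolserver.org alert reports a certificate with two days of validity left. As an illustration only, and not the check that Icinga or the operators actually run, the remaining lifetime can be computed from a certificate's notAfter string with Python's standard library. The function names `days_until_expiry` and `is_expired` are hypothetical, and the sample timestamp is the toolserver.org expiry from the alert, rewritten in the notAfter text form that OpenSSL prints.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter time.

    `not_after` uses the OpenSSL text form, e.g. "May  4 06:59:52 2022 GMT".
    `now` must be timezone-aware. The result is negative once expired.
    """
    # ssl.cert_time_to_seconds parses the notAfter form into epoch seconds (UTC).
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / 86400


def is_expired(not_after: str, now: datetime) -> bool:
    """True when the notAfter time is at or before `now`."""
    return days_until_expiry(not_after, now) <= 0


if __name__ == "__main__":
    # Expiry taken from the toolserver.org alert in this log.
    sample = "May  4 06:59:52 2022 GMT"
    remaining = days_until_expiry(sample, datetime.now(timezone.utc))
    print(f"days remaining: {remaining:.1f}")
```

A real notAfter value can be read from a certificate file with `openssl x509 -noout -enddate -in <file>`, which prints it after a `notAfter=` prefix that would need to be stripped before calling the function.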
[07:41:29] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:43:15] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:43:55] dcaro: is toolserver.org supposed to renew auto? If Monday is a day off for WMCS then that will expire before next working hours [07:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:52:33] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:58:42] RhinosF1: why would monday be a day off? [07:58:56] taavi: May Day [07:59:11] It is in the UK, no idea about every country [07:59:13] that's on sunday [08:00:35] taavi: UK carries forward to next working day [08:00:47] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:56] So we get a week day off if a bank holiday falls on a weekend [08:02:12] that's unfair, we don't have that :( [08:02:57] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:09] taavi: that's stupid. 
What's the point in a [08:03:20] Holiday on a weekend [08:03:24] You're already off [08:07:49] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:45] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:09] because most public holidays are about remembrance of a significant event, not a day off work [08:16:05] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:01] p858snake: the jubilee is and we get an extra day so we can have even stupider street parties [08:22:37] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:07] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:01] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:51:19] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:45] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:33] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[09:15:57] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:37] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:17:51] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:43] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:51] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:05] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:41] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:59] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 106 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:09:15] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:57] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:49] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:37:15] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:43:00] (JobUnavailable) firing: Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:50:45] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:07] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:57:39] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:11:45] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:21] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:57] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:56:49] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:20] SRE, Wikimedia-Site-requests, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (Fuzzy) Are there any updates on the issue? I'm playing around the limit by removing functionality from some templates, but I cannot dodge the lim... 
[12:00:13] (PS6) Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465) [12:00:19] (PS7) Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465) [12:03:19] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:55] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:11] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:31] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:01] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [12:22:37] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:23:11] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:26:37] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01613 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [12:27:13] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:31:15] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:15] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:07] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:55] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:03] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:55] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:13] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - 
degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:49] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:19] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:31] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:03] (PS1) Majavah: keyholder: Collect Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/787911 [13:24:13] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:27:59] (PS2) Majavah: keyholder: Collect Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/787911 [13:29:19] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:15] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:45] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:24] !log reload nginx on conf1004/1005 to pick up cert changes [13:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:49] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[13:52:45] (JobUnavailable) resolved: Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:55:29] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:33] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:48:27] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:45] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:01:56] 
(NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:15:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:54:11] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:01:55] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:12:45] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:23] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:07] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:41] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:19] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) 
https://wikitech.wikimedia.org/wiki/Phabricator [18:18:29] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:54:27] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:19:57] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:06:53] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:37:41] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:44:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [20:44:18] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [20:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T306560)', diff saved to https://phabricator.wikimedia.org/P27164 and previous config saved to /var/cache/conftool/dbconfig/20220501-204427-ladsgroup.json [20:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:31] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:46:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T306560)', diff saved to https://phabricator.wikimedia.org/P27165 and previous config saved to /var/cache/conftool/dbconfig/20220501-204640-ladsgroup.json [20:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [20:51:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [20:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:41] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [20:52:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [20:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27166 and previous config saved to /var/cache/conftool/dbconfig/20220501-210145-ladsgroup.json [21:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [21:09:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [21:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27167 and previous config saved to /var/cache/conftool/dbconfig/20220501-211650-ladsgroup.json [21:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T306560)', diff saved to https://phabricator.wikimedia.org/P27168 and 
previous config saved to /var/cache/conftool/dbconfig/20220501-213155-ladsgroup.json [21:31:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [21:31:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [21:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:00] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [21:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27169 and previous config saved to /var/cache/conftool/dbconfig/20220501-213203-ladsgroup.json [21:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27170 and previous config saved to /var/cache/conftool/dbconfig/20220501-213415-ladsgroup.json [21:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [21:34:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [21:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance 
[21:37:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [21:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27171 and previous config saved to /var/cache/conftool/dbconfig/20220501-213750-ladsgroup.json [21:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:38:51] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:41:11] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27172 and previous config saved to /var/cache/conftool/dbconfig/20220501-214920-ladsgroup.json [21:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27173 and previous config saved to /var/cache/conftool/dbconfig/20220501-215405-ladsgroup.json [21:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:10] T298565: Fix mismatching 
field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:04:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27174 and previous config saved to /var/cache/conftool/dbconfig/20220501-220425-ladsgroup.json [22:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27175 and previous config saved to /var/cache/conftool/dbconfig/20220501-220910-ladsgroup.json [22:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:17] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:19:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27176 and previous config saved to /var/cache/conftool/dbconfig/20220501-221930-ladsgroup.json [22:19:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [22:19:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [22:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27177 and previous config saved to /var/cache/conftool/dbconfig/20220501-221938-ladsgroup.json [22:19:42] T306560: Fix nullability of 
img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27178 and previous config saved to /var/cache/conftool/dbconfig/20220501-222152-ladsgroup.json [22:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27179 and previous config saved to /var/cache/conftool/dbconfig/20220501-222415-ladsgroup.json [22:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:29] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27180 and previous config saved to /var/cache/conftool/dbconfig/20220501-223657-ladsgroup.json [22:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:39:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27181 
and previous config saved to /var/cache/conftool/dbconfig/20220501-223920-ladsgroup.json [22:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:52:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27182 and previous config saved to /var/cache/conftool/dbconfig/20220501-225202-ladsgroup.json [22:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [22:56:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [22:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27183 and previous config saved to /var/cache/conftool/dbconfig/20220501-225626-ladsgroup.json [22:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:30] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:01:55] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:07:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27184 and previous config saved to /var/cache/conftool/dbconfig/20220501-230707-ladsgroup.json [23:07:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [23:07:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [23:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:13] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [23:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T306560)', diff saved to https://phabricator.wikimedia.org/P27185 and previous config saved to /var/cache/conftool/dbconfig/20220501-230715-ladsgroup.json [23:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T306560)', diff saved to https://phabricator.wikimedia.org/P27186 and previous config saved to /var/cache/conftool/dbconfig/20220501-230928-ladsgroup.json [23:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27187 and previous config saved to /var/cache/conftool/dbconfig/20220501-231227-ladsgroup.json [23:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:31] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:14:53] (PS1) Ssingh: certspotter: remove rate-limiting CT log [puppet] - https://gerrit.wikimedia.org/r/787937 [23:15:32] ^ this is spamming the sre-traffic list and hence the Sunday fix [23:19:08] (CR) Ssingh: "https://puppet-compiler.wmflabs.org/pcc-worker1003/35015/alert1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/787937 (owner: Ssingh) [23:19:11] (CR) Ssingh: [C: +2] certspotter: remove rate-limiting CT log [puppet] - https://gerrit.wikimedia.org/r/787937 (owner: Ssingh) [23:22:37] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:24:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27188 and previous config saved to /var/cache/conftool/dbconfig/20220501-232433-ladsgroup.json [23:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27189 and previous config saved to /var/cache/conftool/dbconfig/20220501-232732-ladsgroup.json [23:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:38] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27190 and previous config saved to /var/cache/conftool/dbconfig/20220501-233938-ladsgroup.json [23:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:53] (PS1) Brian Wolff: Update my blog url [puppet] - https://gerrit.wikimedia.org/r/787938 [23:42:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27191 and previous config saved to /var/cache/conftool/dbconfig/20220501-234237-ladsgroup.json [23:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:15] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:11] (PS2) Brian Wolff: Planet: Update my (bawolff) blog url [puppet] - https://gerrit.wikimedia.org/r/787938 [23:46:27] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:54:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T306560)', diff saved to https://phabricator.wikimedia.org/P27192 and previous config saved to /var/cache/conftool/dbconfig/20220501-235443-ladsgroup.json [23:54:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [23:54:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [23:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:48] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [23:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [23:55:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [23:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [23:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [23:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:55:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [23:55:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [23:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T306560)', diff saved to https://phabricator.wikimedia.org/P27193 and previous config saved to /var/cache/conftool/dbconfig/20220501-235549-ladsgroup.json [23:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T306560)', diff saved to https://phabricator.wikimedia.org/P27194 and previous config saved to /var/cache/conftool/dbconfig/20220501-235700-ladsgroup.json [23:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27195 and previous config saved to /var/cache/conftool/dbconfig/20220501-235742-ladsgroup.json [23:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565