[00:02:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:06:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:06:37] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:07:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:11:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:16:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:16:37] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:17:07] (ProbeDown) firing: Service 
videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:21:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:22:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:26:22] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:31:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:39:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/924127 [00:39:29] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/924127 (owner: 10TrainBranchBot) [00:41:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:41:37] 
(ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:46:22] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:49:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:54:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:55:39] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/924127 (owner: 10TrainBranchBot) [00:56:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:01:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:03:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T337705 
(10phaultfinder) [01:41:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:42:26] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [01:43:56] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [01:46:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:49:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:50:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:51:08] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:52:32] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:53:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49994 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:56:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.917 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.11 [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924128 (https://phabricator.wikimedia.org/T337525) [02:08:03] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.11 [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924128 (https://phabricator.wikimedia.org/T337525) (owner: 10TrainBranchBot) [02:23:19] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.11 [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924128 (https://phabricator.wikimedia.org/T337525) (owner: 10TrainBranchBot) [02:26:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:29:12] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-ipmi-exporter.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:31:32] (JobUnavailable) resolved: (2) Reduced availability for job 
sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:22] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924176 (https://phabricator.wikimedia.org/T337525) [03:01:24] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924176 (https://phabricator.wikimedia.org/T337525) (owner: 10TrainBranchBot) [03:02:07] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924176 (https://phabricator.wikimedia.org/T337525) (owner: 10TrainBranchBot) [03:02:39] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.11 refs T337525 [03:02:44] T337525: 1.41.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T337525 [03:35:50] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:41:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:52:33] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.11 refs T337525 (duration: 49m 54s) [03:52:38] T337525: 1.41.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T337525 [03:53:49] (03PS3) 10KartikMistry: Undeploy Special:Contribute from unsupported skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923527 (https://phabricator.wikimedia.org/T337366) [03:54:46] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.9 (duration: 02m 10s) [04:10:41] * kart_ updating cxserver.. 
[04:11:19] (03PS3) 10KartikMistry: Update cxserver to 2023-05-29-112644-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923920 (https://phabricator.wikimedia.org/T337657) [04:14:10] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-05-29-112644-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923920 (https://phabricator.wikimedia.org/T337657) (owner: 10KartikMistry) [04:15:09] (03Merged) 10jenkins-bot: Update cxserver to 2023-05-29-112644-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923920 (https://phabricator.wikimedia.org/T337657) (owner: 10KartikMistry) [04:20:57] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [04:21:24] PROBLEM - dump of db_inventory in codfw on backupmon1001 is CRITICAL: Last dump for db_inventory at codfw (db2185) taken on 2023-05-30 03:55:35 is 109 KiB, but the previous one was 93 KiB, a change of +16.6 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:21:26] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [04:24:03] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [04:24:37] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [04:26:38] PROBLEM - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: Last dump for db_inventory at eqiad (db1215) taken on 2023-05-30 04:02:03 is 109 KiB, but the previous one was 93 KiB, a change of +17.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:27:32] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [04:28:08] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [04:28:30] !log Updated cxserver to 2023-05-29-112644-production (T337657) [04:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:28:34] T337657: Shutdown OpusMT service - 
https://phabricator.wikimedia.org/T337657 [04:31:42] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:01:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:08:08] PROBLEM - MariaDB Replica IO: s3 on clouddb1017 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested to start from GTID 171966471-171966471-66240, which is not in the masters binlog https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:17:58] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 62597 [05:22:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 62597 [05:24:27] (03PS1) 10Muehlenhoff: Remove access for xihua [puppet] - 10https://gerrit.wikimedia.org/r/924339 [05:25:11] (03CR) 10CI reject: [V: 04-1] Remove access for xihua [puppet] - 10https://gerrit.wikimedia.org/r/924339 (owner: 10Muehlenhoff) [05:25:35] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Hxi-ctr out of all services on: 1255 hosts [05:26:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Hxi-ctr out of all services on: 1255 hosts [05:27:27] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Hxi-ctr out of all services on: 784 hosts [05:28:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout 
(exit_code=0) Logging Hxi-ctr out of all services on: 784 hosts [05:29:18] (03PS2) 10Muehlenhoff: Remove access for xihua [puppet] - 10https://gerrit.wikimedia.org/r/924339 [05:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:33:03] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for xihua [puppet] - 10https://gerrit.wikimedia.org/r/924339 (owner: 10Muehlenhoff) [05:36:12] (03PS1) 10Muehlenhoff: Remove access for nray [puppet] - 10https://gerrit.wikimedia.org/r/924340 [05:39:03] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for nray [puppet] - 10https://gerrit.wikimedia.org/r/924340 (owner: 10Muehlenhoff) [05:40:18] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Nray out of all services on: 784 hosts [05:40:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nray out of all services on: 784 hosts [05:40:50] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Nray out of all services on: 1255 hosts [05:41:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nray out of all services on: 1255 hosts [05:42:59] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 62597 [05:43:01] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'configure' for AS: 62597 [05:59:00] (03PS1) 10Marostegui: Revert "db2110: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/924163 [05:59:12] (03CR) 10CI reject: [V: 04-1] Revert "db2110: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/924163 (owner: 10Marostegui) [05:59:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48623 and previous config saved to 
/var/cache/conftool/dbconfig/20230530-055913-root.json [05:59:32] (03Abandoned) 10Marostegui: Revert "db2110: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/924163 (owner: 10Marostegui) [05:59:54] (03PS3) 10Muehlenhoff: debmonitor::server: Add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/922145 [05:59:57] (03PS1) 10Marostegui: db2110: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924341 [06:02:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922145 (owner: 10Muehlenhoff) [06:04:55] (03CR) 10Marostegui: [C: 03+2] db2110: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924341 (owner: 10Marostegui) [06:14:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48624 and previous config saved to /var/cache/conftool/dbconfig/20230530-061417-root.json [06:16:29] (03PS1) 10Vgutierrez: service: Disable monitors for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/924342 (https://phabricator.wikimedia.org/T337446) [06:20:11] (03CR) 10Marostegui: [C: 03+1] service: Disable monitors for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/924342 (https://phabricator.wikimedia.org/T337446) (owner: 10Vgutierrez) [06:20:49] marostegui: Wmflib::Service::Lvs sets the monitors key as mandatory [06:21:04] so it won't work [06:21:21] vgutierrez: Did arturo or dcaro come back to you yesterday?
[06:21:39] nope [06:21:59] I am almost done with the current transfer, but there will be more coming [06:29:12] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-ipmi-exporter.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:29:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48625 and previous config saved to /var/cache/conftool/dbconfig/20230530-062922-root.json [06:33:58] (03PS1) 10Jelto: miscweb: set ipv4 and port for 15 and annual blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/924345 (https://phabricator.wikimedia.org/T300171) [06:34:18] vgutierrez: Until this is fixed I guess I will do the transfer in a different way so only one of the backends will be done instead of two [06:38:51] (03CR) 10Jelto: [C: 03+2] miscweb: set ipv4 and port for 15 and annual blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/924345 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [06:40:54] marostegui: it's a weird scenario from pybal's point of view. Two servers defined as backend servers for 16 services [06:41:25] so 32 monitors attempting to reconnect as fast as possible at the same time [06:44:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48628 and previous config saved to /var/cache/conftool/dbconfig/20230530-064427-root.json [06:45:38] (03PS2) 10Vgutierrez: service: Disable monitors for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/924342 (https://phabricator.wikimedia.org/T337446) [06:47:55] vgutierrez ^ would that allow me to do what I did yesterday? 
(so stopping all backends) that'd help to recover things faster [06:48:05] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [06:48:06] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [06:48:15] (03CR) 10Muehlenhoff: [C: 03+2] debmonitor::server: Add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/922145 (owner: 10Muehlenhoff) [06:49:44] marostegui: via puppet isn't feasible.. all our puppetization expects that services get some kind of monitoring/healthchecking [06:50:06] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [06:50:44] considering it isn't harming traffic I'd just ignore/ack the lvs alerts [06:50:57] just let us (traffic) know when you start/finish please [06:51:02] and sorry for the inconvenience [06:51:07] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [06:51:07] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:51:07] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [06:51:10] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [06:52:45] (03CR) 10Ladsgroup: [C: 03+2] tables_to_check: drop revision_comment_temp [software] - 10https://gerrit.wikimedia.org/r/924122 (https://phabricator.wikimedia.org/T215466) (owner: 10Zabe) [06:53:18] (03Merged) 10jenkins-bot: tables_to_check: drop revision_comment_temp [software] - 10https://gerrit.wikimedia.org/r/924122 (https://phabricator.wikimedia.org/T215466) (owner: 10Zabe) [06:57:45] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [06:58:47] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [06:58:48] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [06:59:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48629 and previous config saved to /var/cache/conftool/dbconfig/20230530-065932-root.json [07:00:05] Amir1, Urbanecm, and taavi: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T0700) [07:00:05] Func and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:14] * kart_ is here [07:00:20] o/ [07:00:52] kart_: you can self-serve, right? [07:01:04] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:01:48] Amir1: sure [07:01:54] Yes [07:01:55] once done, ping me to do Func's patch [07:02:00] OK! 
[07:02:08] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:02:08] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:02:08] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [07:02:11] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [07:02:11] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=2) for new host testvm2006.codfw.wmnet [07:02:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923527 (https://phabricator.wikimedia.org/T337366) (owner: 10KartikMistry) [07:03:59] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [07:04:00] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [07:04:01] (03Merged) 10jenkins-bot: Undeploy Special:Contribute from unsupported skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923527 (https://phabricator.wikimedia.org/T337366) (owner: 10KartikMistry) [07:04:47] !log kartik@deploy1002 Started scap: Backport for [[gerrit:923527|Undeploy Special:Contribute from unsupported skins (T337366)]] [07:04:52] T337366: Tabs on the "Contribute" page not showing for some skins - https://phabricator.wikimedia.org/T337366 [07:05:49] (03CR) 10Vgutierrez: [C: 03+2] SRE: Add a new cookbook that allows to run puppet configuration while restarting Varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [07:06:07] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM 
testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:06:09] Not that it matters for any of them deploys as none are beta only but beta scap is broken [07:06:26] !log kartik@deploy1002 kartik: Backport for [[gerrit:923527|Undeploy Special:Contribute from unsupported skins (T337366)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [07:07:10] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:07:10] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:07:10] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [07:07:14] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [07:07:16] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [07:09:14] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:10:19] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:10:19] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:10:19] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [07:10:22] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [07:10:29] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2006.codfw.wmnet [07:10:47] !log 
slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [07:10:48] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [07:11:42] RhinosF1: noted. Thanks! [07:12:56] (03PS2) 10KartikMistry: testwiki: Enable Section Translation for 9 Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924050 (https://phabricator.wikimedia.org/T337290) [07:14:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48630 and previous config saved to /var/cache/conftool/dbconfig/20230530-071436-root.json [07:15:20] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Marostegui) 05Open→03Resolved The host is repooled. Thanks for your help! [07:16:00] (03CR) 10Filippo Giunchedi: "FWIW cadvisor can run no problem on VMs, sorry for the breakage though!" [puppet] - 10https://gerrit.wikimedia.org/r/924106 (https://phabricator.wikimedia.org/T108027) (owner: 10Andrew Bogott) [07:16:16] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:16:33] !log update bookworm installer to rc4 T330495 [07:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:37] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [07:16:37] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:923527|Undeploy Special:Contribute from unsupported skins (T337366)]] (duration: 11m 49s) [07:16:41] T337366: Tabs on the "Contribute" page not showing for some skins - https://phabricator.wikimedia.org/T337366 [07:17:22] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM 
testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:17:22] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:17:22] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [07:17:26] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [07:18:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924050 (https://phabricator.wikimedia.org/T337290) (owner: 10KartikMistry) [07:19:07] (03Merged) 10jenkins-bot: testwiki: Enable Section Translation for 9 Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924050 (https://phabricator.wikimedia.org/T337290) (owner: 10KartikMistry) [07:19:36] !log kartik@deploy1002 Started scap: Backport for [[gerrit:924050|testwiki: Enable Section Translation for 9 Wikipedia (T337290)]] [07:19:41] T337290: Enable MinT, Content and Section Translation for 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337290 [07:21:08] !log kartik@deploy1002 kartik: Backport for [[gerrit:924050|testwiki: Enable Section Translation for 9 Wikipedia (T337290)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:23:01] I added a bunch of commits that were originally scheduled for the UTC afternoon window. Please ping me when the original commits are done. 
[07:27:23] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:28:29] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:28:30] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [07:29:15] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:924050|testwiki: Enable Section Translation for 9 Wikipedia (T337290)]] (duration: 09m 38s) [07:29:20] T337290: Enable MinT, Content and Section Translation for 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337290 [07:29:20] Amir1: I'm done with my 2 config deployments. [07:29:34] cool [07:29:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48632 and previous config saved to /var/cache/conftool/dbconfig/20230530-072941-root.json [07:29:54] (03CR) 10Ladsgroup: [C: 03+1] "LGTM but someone from data engineering should do the deployment." 
[puppet] - 10https://gerrit.wikimedia.org/r/923545 (https://phabricator.wikimedia.org/T275246) (owner: 10Zabe) [07:30:29] (03CR) 10Ladsgroup: [C: 03+2] Revert "Rename wgPageContentLanguage to wgPageViewLanguage" partially [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924086 (https://phabricator.wikimedia.org/T337634) (owner: 10Func) [07:30:30] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:30:50] !log move LDAP permissions for hghani from cn=nda to cn=wmf T322145 [07:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:06] (03PS1) 10Gergő Tisza: Section images: Accept more recommendation types [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924356 [07:31:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924086 (https://phabricator.wikimedia.org/T337634) (owner: 10Func) [07:31:34] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:31:34] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:31:34] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [07:31:37] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [07:31:37] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=2) for new host testvm2006.codfw.wmnet [07:32:30] (03CR) 10Gergő Tisza: [C: 03+2] Section images: Accept more recommendation types [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/924356 (owner: 10Gergő Tisza) [07:38:40] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [07:38:42] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [07:40:35] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:41:43] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:41:43] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:41:43] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [07:41:46] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [07:42:08] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [07:44:03] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:44:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48633 and previous config saved to /var/cache/conftool/dbconfig/20230530-074445-root.json [07:45:08] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [07:45:08] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:45:08] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on 
all recursors [07:45:11] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [07:45:11] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2006.codfw.wmnet [07:46:25] (03Merged) 10jenkins-bot: Revert "Rename wgPageContentLanguage to wgPageViewLanguage" partially [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924086 (https://phabricator.wikimedia.org/T337634) (owner: 10Func) [07:46:49] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:924086|Revert "Rename wgPageContentLanguage to wgPageViewLanguage" partially (T337634)]] [07:46:53] T337634: Sorting broken for anonymous users - TypeError: Cannot read properties of undefined (reading 'type') / TypeError: Language ID should be string or object. / TypeError: undefined is not an object (evaluating 'cachedParsers[sortList[i][0]].type') / TypeError: locale value must be a string or object - https://phabricator.wikimedia.org/T337634 [07:48:13] !log ladsgroup@deploy1002 func and ladsgroup: Backport for [[gerrit:924086|Revert "Rename wgPageContentLanguage to wgPageViewLanguage" partially (T337634)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:48:24] Func: it's live in mwdebug [07:48:29] testing [07:49:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetdb2003.codfw.wmnet with OS bookworm [07:49:32] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetdb2003.codfw.wmnet with OS bookworm [07:50:28] Amir1: Good to go [07:50:37] awesome [07:51:35] (03Merged) 10jenkins-bot: Section images: Accept more recommendation types [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/924356 (owner: 10Gergő Tisza) [07:52:21] (03CR) 10Ladsgroup: Switch VisualEditor to not use RESTbase on small and medium wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [07:56:06] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:924086|Revert "Rename wgPageContentLanguage to wgPageViewLanguage" partially (T337634)]] (duration: 09m 17s) [07:56:11] T337634: Sorting broken for anonymous users - TypeError: Cannot read properties of undefined (reading 'type') / TypeError: Language ID should be string or object. / TypeError: undefined is not an object (evaluating 'cachedParsers[sortList[i][0]].type') / TypeError: locale value must be a string or object - https://phabricator.wikimedia.org/T337634 [07:56:32] (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:56:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:57:53] Amir1: all done? [07:57:59] PROBLEM - SSH on wdqs2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:58:05] yup [07:58:07] sorry [07:58:34] thanks! 
I'll backport a few more things [07:59:59] (03PS5) 10D3r1ck01: Switch VisualEditor to not use RESTbase on small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [08:00:09] !log tgr@deploy1002 Started scap: Backport for [[gerrit:924356|Section images: Accept more recommendation types]] [08:00:53] (03CR) 10D3r1ck01: [C: 03+2] Switch VisualEditor to not use RESTbase on small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [08:01:33] !log tgr@deploy1002 tgr: Backport for [[gerrit:924356|Section images: Accept more recommendation types]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:01:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:01:50] (03Merged) 10jenkins-bot: Switch VisualEditor to not use RESTbase on small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [08:03:02] (03PS1) 10Ladsgroup: Revert "Switch VisualEditor to not use RESTbase on small and medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924357 [08:03:10] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove 'global' instance references [puppet] - 10https://gerrit.wikimedia.org/r/921350 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [08:03:16] (03CR) 10Gergő Tisza: [C: 03+2] Improve logging of invalid image recommendation kinds [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923643 (owner: 10Gergő Tisza) [08:03:20] 
(03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/output/921350/41393/" [puppet] - 10https://gerrit.wikimedia.org/r/921350 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [08:03:20] (03PS3) 10Filippo Giunchedi: prometheus: remove 'global' instance references [puppet] - 10https://gerrit.wikimedia.org/r/921350 (https://phabricator.wikimedia.org/T288196) [08:04:05] (03CR) 10Ladsgroup: [C: 03+2] "Please don't +2 patches in config like that. Follow https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924357 (owner: 10Ladsgroup) [08:04:31] tgr_: Someone +2'ed a config patch without the intention of deploying. I'm reverting it. [08:04:53] (03Merged) 10jenkins-bot: Revert "Switch VisualEditor to not use RESTbase on small and medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924357 (owner: 10Ladsgroup) [08:05:30] thanks, didn't notice that. [08:06:10] Amir1, tgr_: It was me, sorry wrong button. I hit rebase then hit +2 mistakenly. Sorry. Thanks Amir1 for the revert. 
[08:08:01] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:924356|Section images: Accept more recommendation types]] (duration: 07m 51s) [08:08:13] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetdb2003.codfw.wmnet with reason: host reimage [08:09:58] (03PS1) 10D3r1ck01: Revert "Revert "Switch VisualEditor to not use RESTbase on small and medium wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924358 [08:10:57] (03Abandoned) 10Jelto: service::catalog add miscweb 15 and annual to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [08:11:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/923620 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [08:11:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetdb2003.codfw.wmnet with reason: host reimage [08:12:48] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [08:12:49] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [08:14:12] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-ipmi-exporter.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:14:45] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:15:26] (03Abandoned) 10Vgutierrez: service: Disable monitors for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/924342 (https://phabricator.wikimedia.org/T337446) (owner: 10Vgutierrez) [08:15:49] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox 
hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:15:49] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:15:50] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [08:15:53] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [08:19:06] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10kostajh) [08:19:55] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wmf_auto_restart_prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:04] !log disable puppet on P:kubernetes::node (apart from staging-codfw) for https://gerrit.wikimedia.org/r/c/operations/puppet/+/909687 [08:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:52] (03Merged) 10jenkins-bot: Improve logging of invalid image recommendation kinds [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923643 (owner: 10Gergő Tisza) [08:21:23] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Make kubernetes::clusters the central place for k8s config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [08:21:32] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:21:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs2021:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:37] (03PS1) 10Gergő Tisza: Improve handling of missing image recommendation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924361 [08:25:11] (03PS6) 10JMeybohm: Remove profile::kubernetes::deployment_server from role::releases [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) [08:25:50] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [08:26:27] (03CR) 10JMeybohm: [C: 03+2] Remove profile::kubernetes::deployment_server from role::releases [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [08:26:32] (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:27:18] tgr_: could you ping me when you are done deploying? 
[08:27:30] will do [08:27:39] !log re-enable puppet on P:kubernetes::node for https://gerrit.wikimedia.org/r/c/operations/puppet/+/909687 [08:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:52] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:28:56] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:28:57] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:28:57] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [08:28:57] !log tgr@deploy1002 Started scap: Backport for [[gerrit:923643|Improve logging of invalid image recommendation kinds]] [08:28:59] (03CR) 10Gergő Tisza: [C: 03+2] Section images: Do not treat unexpected kinds as production errors [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923644 (owner: 10Gergő Tisza) [08:29:00] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [08:29:00] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2006.codfw.wmnet [08:29:35] zabe: do you want to backport T337599, or some other thing? [08:29:35] T337599: Running a "get edits" check on user with no edits gives fatal exception - https://phabricator.wikimedia.org/T337599 [08:30:10] urbanecm: that and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/922492 [08:30:23] okay. in that case, no need for me to queue :). thanks! 
[08:30:24] !log tgr@deploy1002 tgr: Backport for [[gerrit:923643|Improve logging of invalid image recommendation kinds]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:30:29] (03PS16) 10JMeybohm: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [08:30:31] (03CR) 10Jbond: [C: 03+2] firewall: drop block_abuse_nets parameter [puppet] - 10https://gerrit.wikimedia.org/r/923620 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [08:31:50] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [08:31:52] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [08:32:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/924059 (owner: 10Volans) [08:33:48] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:34:33] 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Extend router ACLs to block 4194/tcp on LVSes - https://phabricator.wikimedia.org/T337689 (10ayounsi) a:03fgiunchedi What I pushed is an extra safeguard, but a more viable fix is to have the daemon listen on the host's primary IP (like all the other simila... 
[08:34:53] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:34:53] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:34:53] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [08:34:56] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [08:35:18] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:36:19] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:36:20] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [08:38:21] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:39:22] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:39:22] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:39:22] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [08:39:26] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [08:39:26] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=2) for new host 
testvm2006.codfw.wmnet [08:39:28] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:923643|Improve logging of invalid image recommendation kinds]] (duration: 10m 30s) [08:39:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:40:11] XioNoX: thanks for the update re: T337689 that's indeed a better fix, I'll look into that [08:40:11] T337689: Extend router ACLs to block 4194/tcp on LVSes - https://phabricator.wikimedia.org/T337689 [08:40:33] no pb! let me know if I can help [08:41:11] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [08:41:12] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [08:41:42] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:43:05] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:43:14] (03PS1) 10Volans: .gitmessage: add Hosts: line [puppet] - 10https://gerrit.wikimedia.org/r/924438 [08:43:42] (03PS2) 10Jbond: ganeti. add GanetiRAPI.nodes and GanetiRAPI.groups [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 [08:43:44] (03CR) 10Jbond: ganeti. 
add GanetiRAPI.nodes and GanetiRAPI.groups (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 (owner: 10Jbond) [08:43:48] (03CR) 10Volans: [C: 03+2] spicerack: add test-cookbook script [puppet] - 10https://gerrit.wikimedia.org/r/924059 (owner: 10Volans) [08:44:05] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:44:05] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:44:05] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [08:44:08] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [08:44:11] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [08:44:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:45:12] (03CR) 10Hoo man: [C: 03+1] install_console: restrict options used [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [08:45:45] (03CR) 10Jbond: "thanks all" [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [08:45:47] (03CR) 10Jbond: [C: 03+2] install_console: restrict options used [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [08:48:01] (03PS1) 10Slyngshede: WMF signup message, stray " [software/bitu] - 10https://gerrit.wikimedia.org/r/924439 [08:48:42] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:49:42] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:49:42] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:49:42] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [08:49:45] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [08:49:52] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2006.codfw.wmnet [08:50:33] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [08:50:34] (03CR) 10Volans: [C: 03+1] "LGTM, thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 (owner: 10Jbond) [08:50:35] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [08:50:36] (03CR) 10Muehlenhoff: ganeti. 
add GanetiRAPI.nodes and GanetiRAPI.groups (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 (owner: 10Jbond) [08:51:09] (03Merged) 10jenkins-bot: Section images: Do not treat unexpected kinds as production errors [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923644 (owner: 10Gergő Tisza) [08:51:45] (03CR) 10Gergő Tisza: [C: 03+2] Improve handling of missing image recommendation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924361 (owner: 10Gergő Tisza) [08:51:49] !log tgr@deploy1002 Started scap: Backport for [[gerrit:923644|Section images: Do not treat unexpected kinds as production errors]] [08:52:36] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:53:14] !log tgr@deploy1002 tgr: Backport for [[gerrit:923644|Section images: Do not treat unexpected kinds as production errors]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:53:42] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [08:53:42] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:53:42] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [08:53:45] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [08:54:10] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [08:54:49] (03PS3) 10Jbond: ganeti. add GanetiRAPI.nodes and GanetiRAPI.groups [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 [08:55:02] (03CR) 10Jbond: ganeti. 
add GanetiRAPI.nodes and GanetiRAPI.groups (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 (owner: 10Jbond) [08:55:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 (owner: 10Jbond) [08:59:01] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:00:03] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:00:04] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:00:04] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [09:00:07] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [09:00:07] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2006.codfw.wmnet [09:01:09] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:32] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:02:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:04:55] (03PS4) 
10Jbond: ganeti. add GanetiRAPI.nodes and GanetiRAPI.groups [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 [09:05:00] (03CR) 10Jbond: [C: 03+2] ganeti. add GanetiRAPI.nodes and GanetiRAPI.groups [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 (owner: 10Jbond) [09:05:32] (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:06:12] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:923644|Section images: Do not treat unexpected kinds as production errors]] (duration: 14m 22s) [09:09:20] (03Merged) 10jenkins-bot: ganeti. add GanetiRAPI.nodes and GanetiRAPI.groups [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 (owner: 10Jbond) [09:11:12] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [09:11:18] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [09:11:55] (03PS1) 10Fabfur: cache::upload: Add hieradata to switch HTTPS redirection from Varnish to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) [09:12:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:12] (03Merged) 10jenkins-bot: Improve handling of missing image recommendation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924361 (owner: 10Gergő Tisza) [09:13:28] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" 
[09:14:17] (03Abandoned) 10Clément Goubert: testwikidatawiki: Fix missing mobile redir to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [09:14:35] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:14:35] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:14:35] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [09:14:38] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [09:15:25] !log tgr@deploy1002 Started scap: Backport for [[gerrit:924361|Improve handling of missing image recommendation]] [09:16:05] (03PS8) 10Jelto: Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [09:17:16] !log tgr@deploy1002 tgr: Backport for [[gerrit:924361|Improve handling of missing image recommendation]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [09:19:15] (03CR) 10Clément Goubert: [C: 03+1] trafficserver: also match mobile domains in mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924080 (owner: 10Giuseppe Lavagetto) [09:19:25] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:19:50] !log run aborrero@cumin1001:~ 2s 98 $ sudo cumin "P{R:Profile::Mariadb::Section = 's7'} and P{P:wmcs::db::wikireplicas::mariadb_multiinstance}" "/usr/local/sbin/maintain-meta_p --all-databases --bootstrap" [09:19:50] (T337446) [09:19:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:54] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 [09:20:32] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:20:45] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host puppetboard1003.eqiad.wmnet with OS bookworm [09:21:56] (03PS3) 10Jelto: gitlab: sync all configured providers [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [09:22:14] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host puppetboard2003.codfw.wmnet with OS bookworm [09:24:03] (03PS5) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) [09:24:22] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:924361|Improve handling of missing image recommendation]] (duration: 08m 57s) [09:24:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:24:40] (03PS2) 10Fabfur: cache::upload: Add hieradata to switch HTTPS redirection from Varnish to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) [09:25:24] zabe: done [09:25:33] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:25:37] sorry for the delay, GrowthExperiments CI isn't very snappy [09:26:19] (03CR) 10Zabe: [C: 03+2] Check for null when using 
::getCheckUserHelperFieldset [extensions/CheckUser] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923635 (https://phabricator.wikimedia.org/T337599) (owner: 10Zabe) [09:26:33] (03PS4) 10Zabe: Start reading from rev_comment_id in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922492 (https://phabricator.wikimedia.org/T299954) [09:26:44] yup [09:27:46] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:27:49] (03CR) 10Zabe: [C: 03+2] Start reading from rev_comment_id in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922492 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [09:28:39] (03Merged) 10jenkins-bot: Start reading from rev_comment_id in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922492 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [09:29:23] !log zabe@deploy1002 Started scap: Backport for [[gerrit:922492|Start reading from rev_comment_id in test wikis (T299954)]] [09:29:28] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [09:29:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:29:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetdb2003.codfw.wmnet with OS bookworm [09:29:50] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host puppetdb2003.codfw.wmnet with OS bookworm completed: - 
puppetd... [09:29:55] (03PS1) 10Muehlenhoff: ganeti: Pass memory size in megabytes [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/924445 (https://phabricator.wikimedia.org/T230712) [09:30:27] (03PS1) 10Jbond: sre.ganeti.makvm: update the default memory to 1.5 [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 [09:30:39] (03PS1) 10Muehlenhoff: sre.ganeti.makevm: Bump default to 1.5G [cookbooks] - 10https://gerrit.wikimedia.org/r/924467 (https://phabricator.wikimedia.org/T230712) [09:30:47] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:30:48] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [09:30:51] !log zabe@deploy1002 zabe: Backport for [[gerrit:922492|Start reading from rev_comment_id in test wikis (T299954)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [09:31:16] (03CR) 10Clément Goubert: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/923591 (owner: 10Clément Goubert) [09:31:43] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetboard1003.eqiad.wmnet with reason: host reimage [09:32:42] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:33:07] (03CR) 10CI reject: [V: 04-1] sre.ganeti.makvm: update the default memory to 1.5 [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 (owner: 10Jbond) [09:33:31] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:47] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:33:47] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:33:47] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [09:33:50] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [09:33:50] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=2) for new host testvm2006.codfw.wmnet [09:33:56] (03CR) 10CI reject: [V: 04-1] ganeti: Pass memory size in megabytes [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/924445 (https://phabricator.wikimedia.org/T230712) (owner: 10Muehlenhoff) [09:34:03] (03PS2) 10Jbond: sre.ganeti.makvm: update the default memory to 1.5 [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 [09:34:51] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetboard1003.eqiad.wmnet with 
reason: host reimage [09:36:25] (03CR) 10CI reject: [V: 04-1] sre.ganeti.makvm: update the default memory to 1.5 [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 (owner: 10Jbond) [09:37:11] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:922492|Start reading from rev_comment_id in test wikis (T299954)]] (duration: 07m 48s) [09:37:16] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [09:37:54] (03PS8) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) [09:38:51] (03PS9) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) [09:39:24] (03PS2) 10Muehlenhoff: ganeti: Pass memory size in megabytes [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/924445 (https://phabricator.wikimedia.org/T230712) [09:40:13] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [09:40:14] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [09:41:22] (03Merged) 10jenkins-bot: Check for null when using ::getCheckUserHelperFieldset [extensions/CheckUser] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923635 (https://phabricator.wikimedia.org/T337599) (owner: 10Zabe) [09:42:10] !log zabe@deploy1002 Started scap: Backport for [[gerrit:923635|Check for null when using ::getCheckUserHelperFieldset (T337599)]] [09:42:14] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:42:15] T337599: Running a "get edits" check on user with no edits gives fatal exception - https://phabricator.wikimedia.org/T337599 [09:42:20] (03CR) 10Volans: [C: 04-2] "This is against the debian branch... 
it should be against master" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/924445 (https://phabricator.wikimedia.org/T230712) (owner: 10Muehlenhoff) [09:42:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:42:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetdb1003.eqiad.wmnet with OS bookworm [09:43:05] 10SRE, 10Infrastructure-Foundations: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bookworm [09:43:15] (03PS1) 10Zabe: Start reading from rev_comment_id in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924469 (https://phabricator.wikimedia.org/T299954) [09:43:36] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:43:36] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:43:36] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [09:43:39] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [09:43:43] !log zabe@deploy1002 zabe: Backport for [[gerrit:923635|Check for null when using ::getCheckUserHelperFieldset (T337599)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [09:45:28] (03PS1) 10Marostegui: wiki-replicas.sql: Add heartbeat_p [puppet] - 10https://gerrit.wikimedia.org/r/924471 (https://phabricator.wikimedia.org/T337446) [09:45:34] 
Amir1: ^ [09:46:10] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetboard2003.codfw.wmnet with reason: host reimage [09:46:19] (03CR) 10Ladsgroup: [C: 03+1] "Do we need to add meta_p too?" [puppet] - 10https://gerrit.wikimedia.org/r/924471 (https://phabricator.wikimedia.org/T337446) (owner: 10Marostegui) [09:46:23] (03CR) 10Marostegui: [C: 03+2] wiki-replicas.sql: Add heartbeat_p [puppet] - 10https://gerrit.wikimedia.org/r/924471 (https://phabricator.wikimedia.org/T337446) (owner: 10Marostegui) [09:46:26] thanks [09:46:35] Amir1: yep, I will note it just in case we need more [09:46:51] awesome. Thanks [09:47:28] (03PS1) 10Muehlenhoff: ganeti: Pass memory size in megabytes [software/spicerack] - 10https://gerrit.wikimedia.org/r/924472 (https://phabricator.wikimedia.org/T230712) [09:47:40] (03PS1) 10Marostegui: wiki-replicas.sql: Add meta_p GRANT [puppet] - 10https://gerrit.wikimedia.org/r/924473 (https://phabricator.wikimedia.org/T337446) [09:48:02] (03CR) 10Marostegui: [C: 04-2] "Do not merge yet in case we find other grants that are needed" [puppet] - 10https://gerrit.wikimedia.org/r/924473 (https://phabricator.wikimedia.org/T337446) (owner: 10Marostegui) [09:49:05] (03CR) 10Zabe: [C: 03+2] Start reading from rev_comment_id in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924469 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [09:49:37] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetboard2003.codfw.wmnet with reason: host reimage [09:50:03] (03Merged) 10jenkins-bot: Start reading from rev_comment_id in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924469 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [09:51:46] (03PS3) 10Jbond: sre.ganeti.makvm: update the default memory to 1.5 [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 [09:52:03] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:923635|Check for null 
when using ::getCheckUserHelperFieldset (T337599)]] (duration: 09m 52s) [09:52:08] T337599: Running a "get edits" check on user with no edits gives fatal exception - https://phabricator.wikimedia.org/T337599 [09:52:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:36] !log zabe@deploy1002 Started scap: Backport for [[gerrit:924469|Start reading from rev_comment_id in group0 wikis (T299954)]] [09:52:40] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [09:52:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:54:48] !log zabe@deploy1002 zabe: Backport for [[gerrit:924469|Start reading from rev_comment_id in group0 wikis (T299954)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [09:55:17] (03CR) 10Volans: [C: 03+1] "LGTM, thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/924472 (https://phabricator.wikimedia.org/T230712) (owner: 10Muehlenhoff) [09:55:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetdb1003.eqiad.wmnet with reason: host reimage [09:55:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:57:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:52] PROBLEM - MariaDB read only s2 on clouddb1018 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [09:57:55] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:58:06] PROBLEM - Check systemd state on clouddb1018 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetdb1003.eqiad.wmnet with reason: host reimage [09:58:58] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - slyngshede@cumin1001" [09:59:11] !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS bookworm [09:59:18] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host testvm2006.codfw.wmnet with OS bookworm [10:00:06] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1000) [10:00:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. 
We need to merge/deploy https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/924472 first" [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 (owner: 10Jbond) [10:00:48] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:924469|Start reading from rev_comment_id in group0 wikis (T299954)]] (duration: 08m 12s) [10:00:53] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [10:00:54] * zabe done [10:01:22] PROBLEM - mysqld processes on clouddb1018 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:01:48] PROBLEM - MariaDB Replica SQL: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:04:08] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Pass memory size in megabytes [software/spicerack] - 10https://gerrit.wikimedia.org/r/924472 (https://phabricator.wikimedia.org/T230712) (owner: 10Muehlenhoff) [10:07:25] (03CR) 10Jbond: ganeti: Pass memory size in megabytes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/924472 (https://phabricator.wikimedia.org/T230712) (owner: 10Muehlenhoff) [10:08:30] (03CR) 10Volans: [C: 03+1] ganeti: Pass memory size in megabytes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/924472 (https://phabricator.wikimedia.org/T230712) (owner: 10Muehlenhoff) [10:10:07] (03PS4) 10Majavah: ferm::service: allow passing array of hosts [puppet] - 10https://gerrit.wikimedia.org/r/919300 [10:10:28] (03PS1) 10Jbond: ganeti: update definition of add to accept float or int [software/spicerack] - 10https://gerrit.wikimedia.org/r/924477 (https://phabricator.wikimedia.org/T230712) [10:10:30] (03CR) 10CI reject: [V: 04-1] ferm::service: allow passing array of hosts [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [10:10:38] (03CR) 10Jbond: ganeti: 
Pass memory size in megabytes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/924472 (https://phabricator.wikimedia.org/T230712) (owner: 10Muehlenhoff) [10:10:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:10] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetboard1003.eqiad.wmnet with OS bookworm [10:11:19] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetboard2003.codfw.wmnet with OS bookworm [10:11:29] (03PS5) 10Majavah: ferm::service: allow passing array of hosts [puppet] - 10https://gerrit.wikimedia.org/r/919300 [10:12:19] (03CR) 10Majavah: ferm::service: allow passing array of hosts (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [10:12:31] (03CR) 10Volans: [C: 03+1] "LGTM, nit for tests inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/924477 (https://phabricator.wikimedia.org/T230712) (owner: 10Jbond) [10:13:49] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41394/console" [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [10:15:51] (03PS2) 10Jbond: ganeti: update definition of add to accept float or int [software/spicerack] - 10https://gerrit.wikimedia.org/r/924477 (https://phabricator.wikimedia.org/T230712) [10:15:54] (03CR) 10Jbond: ganeti: update definition of add to accept float or int (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/924477 (https://phabricator.wikimedia.org/T230712) (owner: 10Jbond) [10:16:49] (03CR) 10Volans: "LGTM, nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 (owner: 10Jbond) [10:17:00] (03CR) 10Jgiannelos: "Just 
a heads up, this requires the container to already exist in swift." [deployment-charts] - 10https://gerrit.wikimedia.org/r/924112 (https://phabricator.wikimedia.org/T333318) (owner: 10Effie Mouzeli) [10:17:28] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/924477 (https://phabricator.wikimedia.org/T230712) (owner: 10Jbond) [10:18:10] (03CR) 10Jgiannelos: [C: 03+1] "Also we need to depool codfw before applying this change." [deployment-charts] - 10https://gerrit.wikimedia.org/r/924112 (https://phabricator.wikimedia.org/T333318) (owner: 10Effie Mouzeli) [10:18:25] (03PS4) 10Jbond: sre.ganeti.makvm: update the default memory to 1.5 [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 [10:18:48] (03CR) 10Jbond: sre.ganeti.makvm: update the default memory to 1.5 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 (owner: 10Jbond) [10:19:12] (03PS1) 10Matthias Mullie: Fix maxJobs default [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924454 [10:19:22] (03PS1) 10Matthias Mullie: Fix maxJobs default [extensions/ImageSuggestions] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924455 [10:21:18] (03CR) 10CI reject: [V: 04-1] sre.ganeti.makvm: update the default memory to 1.5 [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 (owner: 10Jbond) [10:21:44] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:25:22] (03CR) 10Kosta Harlan: [C: 03+1] "Thanks for fixing this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924079 (owner: 10Gergő Tisza) [10:28:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:05] (03PS5) 10Jbond: sre.ganeti.makvm: update the default memory to 1.5 [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 [10:29:24] (03CR) 10Jbond: [C: 03+2] ganeti: update definition of add to accept float or int [software/spicerack] - 10https://gerrit.wikimedia.org/r/924477 (https://phabricator.wikimedia.org/T230712) (owner: 10Jbond) [10:29:32] (03Abandoned) 10Volans: ganeti: Pass memory size in megabytes [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/924445 (https://phabricator.wikimedia.org/T230712) (owner: 10Muehlenhoff) [10:31:44] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:33] (03CR) 10Muehlenhoff: "Looks good, one nit and one thought/proposal inline" [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:33:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:33:19] (03PS3) 10Hnowlan: rest-gateway: add citoid support [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) [10:33:28] (03CR) 10Hnowlan: rest-gateway: add citoid support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [10:33:52] (03Merged) 10jenkins-bot: ganeti: update definition of add to accept float or int [software/spicerack] - 10https://gerrit.wikimedia.org/r/924477 (https://phabricator.wikimedia.org/T230712) (owner: 10Jbond) [10:37:58] RECOVERY - SSH on wdqs2021 is OK: SSH OK - 
OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:40:40] (03PS3) 10Volans: dhcp: reword some exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 [10:40:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:40:44] (03CR) 10Volans: "replies and questions inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 (owner: 10Volans) [10:41:00] RECOVERY - MariaDB Replica Lag: s1 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:41:11] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:41:16] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:42:42] PROBLEM - SSH on wdqs2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:42:45] (03PS2) 10Volans: Add Python 3.11 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/922489 (owner: 10Ayounsi) [10:44:58] (03CR) 10CI reject: [V: 04-1] dhcp: reword some exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 (owner: 10Volans) [10:45:01] (03CR) 10Effie Mouzeli: [C: 03+2] tegola: Switch swift container to tegola-swift-codfw-v003 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/924112 (https://phabricator.wikimedia.org/T333318) (owner: 10Effie Mouzeli) [10:45:05] (03CR) 10Marostegui: [C: 03+2] wiki-replicas.sql: Add meta_p GRANT [puppet] - 10https://gerrit.wikimedia.org/r/924473 (https://phabricator.wikimedia.org/T337446) (owner: 10Marostegui) [10:45:22] (03CR) 10Effie Mouzeli: tegola: Switch swift container to tegola-swift-codfw-v003 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/924112 (https://phabricator.wikimedia.org/T333318) (owner: 10Effie Mouzeli) [10:45:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:46:57] (03PS4) 10Volans: dhcp: reword some exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 [10:50:24] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [10:50:49] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [10:53:09] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [10:53:52] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [10:56:46] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:56:49] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:57:12] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:57:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:44] (03PS1) 10Arturo Borrero Gonzalez: lvs: remove wikireplicas S3 definition [puppet] - 10https://gerrit.wikimedia.org/r/924481 (https://phabricator.wikimedia.org/T337721) [10:58:02] (03PS1) 10Jbond: build_envoy_deb: update to work with bookworm [puppet] - 10https://gerrit.wikimedia.org/r/924482 [11:00:03] 
!log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:02:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:02:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/924482 (owner: 10Jbond) [11:04:11] (03PS1) 10Filippo Giunchedi: cadvisor: listen on main ip address only [puppet] - 10https://gerrit.wikimedia.org/r/924483 (https://phabricator.wikimedia.org/T337689) [11:04:30] (03CR) 10Ladsgroup: "Adding Brandon as he reviewed If49d66b64c1 so might know if this can cause issues and Valentine who was involved with the pybal's page yes" [puppet] - 10https://gerrit.wikimedia.org/r/924481 (https://phabricator.wikimedia.org/T337721) (owner: 10Arturo Borrero Gonzalez) [11:04:31] RECOVERY - Check systemd state on wdqs2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 (owner: 10Jbond) [11:05:52] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41396/console" [puppet] - 10https://gerrit.wikimedia.org/r/924483 (https://phabricator.wikimedia.org/T337689) (owner: 10Filippo Giunchedi) [11:07:20] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2006.codfw.wmnet with OS bookworm [11:07:20] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2006.codfw.wmnet [11:07:25] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host testvm2006.codfw.wmnet with OS bookworm completed: - testvm2... [11:08:09] RECOVERY - SSH on wdqs2021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:08:34] (03CR) 10Jbond: [C: 03+1] "lgtm minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/924483 (https://phabricator.wikimedia.org/T337689) (owner: 10Filippo Giunchedi) [11:08:37] (03PS7) 10Slyngshede: sre.ganeti.makevm call reimage after VM creation [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) [11:08:44] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:10:27] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:11:07] (03CR) 10Volans: [C: 03+2] Add Python 3.11 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/922489 (owner: 10Ayounsi) [11:11:11] (03Abandoned) 10Muehlenhoff: sre.ganeti.makevm: Bump default to 1.5G [cookbooks] - 10https://gerrit.wikimedia.org/r/924467 (https://phabricator.wikimedia.org/T230712) (owner: 10Muehlenhoff) [11:11:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [11:11:45] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:12:39] (03CR) 10CI reject: [V: 04-1] sre.ganeti.makevm call reimage after VM creation [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) 
(owner: 10Slyngshede) [11:12:49] (03CR) 10Hashar: [C: 03+2] "Long week-end is gone unblocking deployment :]" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923688 (https://phabricator.wikimedia.org/T331651) (owner: 10Hashar) [11:13:33] (03Merged) 10jenkins-bot: wm-checks-api: add support for DUCT [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923688 (https://phabricator.wikimedia.org/T331651) (owner: 10Hashar) [11:13:33] (03CR) 10Jelto: [C: 03+2] Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [11:13:41] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:14:02] !log hashar@deploy1002 Started deploy [gerrit/gerrit@6deabc9]: wm-checks-api: add support for DUCT - T331651 [11:14:07] T331651: [wm-checks-api] support kindrobot - https://phabricator.wikimedia.org/T331651 [11:14:10] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@6deabc9]: wm-checks-api: add support for DUCT - T331651 (duration: 00m 08s) [11:15:13] (03PS8) 10Slyngshede: sre.ganeti.makevm call reimage after VM creation [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) [11:17:31] (03Merged) 10jenkins-bot: Add Python 3.11 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/922489 (owner: 10Ayounsi) [11:21:24] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [11:21:27] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [11:34:45] PROBLEM - puppetboard.wikimedia.org tls expiry on puppetboard1003 is CRITICAL: connect to address 
10.64.32.38 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:34:51] PROBLEM - Check that envoy is running on puppetboard1003 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:35:05] PROBLEM - Check systemd state on puppetboard1003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetboard.service,wmf_auto_restart_uwsgi-puppetboard.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:11] PROBLEM - puppetboard.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: connect to address 10.64.32.38 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:35:57] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: connect to address localhost and port 8001: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [11:41:14] (03CR) 10Slyngshede: "Parameters were configured wrong, revealed in testing." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [11:41:23] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [11:41:26] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [11:42:54] (03PS1) 10Daimona Eaytoy: prod: Remove $wgCampaignEventsEnableMultipleOrganizers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924488 (https://phabricator.wikimedia.org/T334088) [11:45:32] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on puppetboard2003.codfw.wmnet,puppetboard1003.eqiad.wmnet with reason: building_systems [11:45:45] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on puppetboard2003.codfw.wmnet,puppetboard1003.eqiad.wmnet with reason: building_systems [11:45:47] (03PS3) 10Daimona Eaytoy: beta: Remove $wgCampaignEventsEnableMultipleOrganizers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909401 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [11:46:01] !log slyngshede@cumin1001 START - Cookbook sre.hosts.decommission for hosts testvm2006.codfw.wmnet [11:46:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [11:46:58] (03PS2) 10Daimona Eaytoy: prod: Remove $wgCampaignEventsEnableMultipleOrganizers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924488 (https://phabricator.wikimedia.org/T334088) [11:47:36] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [11:50:00] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [11:50:13] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for moved cloudcontrol2005-dev - 
cmooney@cumin1001" [11:51:04] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [11:51:08] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for moved cloudcontrol2005-dev - cmooney@cumin1001" [11:51:08] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:51:10] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) @jbond I tested https://gerrit.wikimedia.org/r/916509 on the GitLab hosts but the change is noop and no new oauth provider is av... [11:51:13] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:51:14] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2006.codfw.wmnet [11:51:19] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by slyngshede@cumin1001 for hosts: `testvm2006.codfw.wmnet` - testvm2006.codfw.wmnet (**PASS**)... [11:51:44] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 (owner: 10Jbond) [11:57:09] (03CR) 10Ayounsi: dhcp: reword some exception messages (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 (owner: 10Volans) [11:59:42] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2005-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336564 (10cmooney) >>! In T336564#8884008, @Jhancock.wm wrote: > @cmooney I moved the patch to switch cloudsw1-b1-codfw, port ge-1/0/13, but I can't get the netbox... 
[12:08:19] (03CR) 10Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [12:13:27] (03PS1) 10Ayounsi: Spicerack: add some colors [software/spicerack] - 10https://gerrit.wikimedia.org/r/924493 [12:13:29] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) @papaul when you are back can you advise on the status of these? They all appear as connected on asw-b1-codfw... [12:13:35] (03PS2) 10Filippo Giunchedi: cadvisor: listen on main ip address only [puppet] - 10https://gerrit.wikimedia.org/r/924483 (https://phabricator.wikimedia.org/T337689) [12:13:40] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/924483 (https://phabricator.wikimedia.org/T337689) (owner: 10Filippo Giunchedi) [12:14:18] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2005-dev.codfw.wmnet with OS bullseye [12:14:26] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2005-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2005-dev.codfw.wmnet with OS bullseye [12:15:23] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41398/console" [puppet] - 10https://gerrit.wikimedia.org/r/924483 (https://phabricator.wikimedia.org/T337689) (owner: 10Filippo Giunchedi) [12:16:37] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2005-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336564 (10aborrero) >>! 
In T336564#8888280, @cmooney wrote: > > @aborrero you should be good to do the reimage on this now. I've reserved [[ https://netbox.wikime... [12:18:04] (03CR) 10CI reject: [V: 04-1] Spicerack: add some colors [software/spicerack] - 10https://gerrit.wikimedia.org/r/924493 (owner: 10Ayounsi) [12:21:09] ACKNOWLEDGEMENT - dump of db_inventory in codfw on backupmon1001 is CRITICAL: Last dump for db_inventory at codfw (db2185) taken on 2023-05-30 03:55:35 is 109 KiB, but the previous one was 93 KiB, a change of +16.6 % Jcrespo expected https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:21:09] ACKNOWLEDGEMENT - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: Last dump for db_inventory at eqiad (db1215) taken on 2023-05-30 04:02:03 is 109 KiB, but the previous one was 93 KiB, a change of +17.1 % Jcrespo expected https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:22:33] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] cadvisor: listen on main ip address only [puppet] - 10https://gerrit.wikimedia.org/r/924483 (https://phabricator.wikimedia.org/T337689) (owner: 10Filippo Giunchedi) [12:24:43] (03PS1) 10Jbond: install_console: provide a default for $2 [puppet] - 10https://gerrit.wikimedia.org/r/924497 [12:25:41] (03PS2) 10Jbond: install_console: provide a default for $2 [puppet] - 10https://gerrit.wikimedia.org/r/924497 (https://phabricator.wikimedia.org/T117348) [12:25:42] (SystemdUnitFailed) firing: cadvisor.service Failed on elastic1058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:25:44] (03CR) 10Jelto: "one question in-line regarding blackbox checks." 
[puppet] - 10https://gerrit.wikimedia.org/r/923652 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [12:25:52] (03CR) 10Jbond: [V: 03+2 C: 03+2] install_console: provide a default for $2 [puppet] - 10https://gerrit.wikimedia.org/r/924497 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [12:26:32] PROBLEM - Check systemd state on db2146 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:36] PROBLEM - Check systemd state on durum3002 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:38] PROBLEM - Check systemd state on cp2027 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:08] PROBLEM - Check systemd state on cp2042 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:10] PROBLEM - Check systemd state on parse1008 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:16] PROBLEM - Check systemd state on cp3056 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:18] PROBLEM - Check systemd state on cp6012 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:18] PROBLEM - Check systemd state on mw2375 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:18] PROBLEM - Check systemd state on mc-gp1003 is CRITICAL: CRITICAL - degraded: The following 
units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:26] PROBLEM - Check systemd state on mw1476 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:28] PROBLEM - Check systemd state on dumpsdata1004 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:32] PROBLEM - Check systemd state on cp6007 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:32] PROBLEM - Check systemd state on cp3051 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:32] PROBLEM - Check systemd state on ncredir6001 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:32] PROBLEM - Check systemd state on cp2030 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:36] PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:42] PROBLEM - Check systemd state on mw2298 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:47] (03PS2) 10Bartosz Dziewoński: Hide 'editnotice-notext' message in VE (and mobile apps) [extensions/VisualEditor] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924159 (https://phabricator.wikimedia.org/T337633) [12:27:52] PROBLEM - Check systemd state on mw1460 is CRITICAL: CRITICAL - 
degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:52] PROBLEM - Check systemd state on db1207 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:55] oh crap, sorry that's me [12:27:56] PROBLEM - Check systemd state on doh6002 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:56] PROBLEM - Check systemd state on parse2016 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:57] (03PS2) 10Bartosz Dziewoński: ve.ui.MWGalleryDialog: Fix showing the search panel [extensions/VisualEditor] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924160 (https://phabricator.wikimedia.org/T337638) [12:28:02] PROBLEM - Check systemd state on druid1007 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:04] PROBLEM - Check systemd state on cp4047 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:06] godog: flag provided but not defined: -listen [12:28:12] PROBLEM - Check systemd state on elastic2073 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:12] PROBLEM - Check systemd state on mw1357 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:12] PROBLEM - Check systemd state on ganeti6001 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:18] thank you volans [12:28:22] PROBLEM - Check systemd state on parse2005 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:23] I'll revert [12:28:26] PROBLEM - Check systemd state on mw1356 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:30] PROBLEM - Check systemd state on lvs1017 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:32] PROBLEM - Check systemd state on sessionstore2003 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:35] (03PS1) 10Bartosz Dziewoński: Hide 'editnotice-notext' message in VE (and mobile apps) [extensions/VisualEditor] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924456 (https://phabricator.wikimedia.org/T337633) [12:28:38] (03PS1) 10Filippo Giunchedi: Revert "cadvisor: listen on main ip address only" [puppet] - 10https://gerrit.wikimedia.org/r/924457 [12:28:40] PROBLEM - Check systemd state on mw1418 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:42] PROBLEM - Check systemd state on mw1370 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:42] PROBLEM - Check systemd state on mc-wf1001 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:48] (03PS1) 10Bartosz Dziewoński: ve.ui.MWGalleryDialog: Fix showing the search panel 
[extensions/VisualEditor] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924458 (https://phabricator.wikimedia.org/T337638) [12:28:50] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Revert "cadvisor: listen on main ip address only" [puppet] - 10https://gerrit.wikimedia.org/r/924457 (owner: 10Filippo Giunchedi) [12:28:52] PROBLEM - Check systemd state on mw2421 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:56] PROBLEM - Check systemd state on ganeti2019 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:58] PROBLEM - Check systemd state on install6002 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:58] PROBLEM - Check systemd state on mw2300 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:04] PROBLEM - Check systemd state on mw1466 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:13] s/listen/listen_ip/ I think :) [12:29:32] PROBLEM - Check systemd state on mw1393 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:34] PROBLEM - Check systemd state on snapshot1015 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:35] yes [12:29:36] PROBLEM - Check systemd state on mw2354 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:36] PROBLEM - Check 
systemd state on parse2013 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:38] PROBLEM - Check systemd state on mw2410 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:40] PROBLEM - Check systemd state on lvs6002 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:44] !log disabling puppet where cadvisor is present [12:29:46] PROBLEM - Check systemd state on db2113 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:52] PROBLEM - Check systemd state on logstash1025 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:56] PROBLEM - Check systemd state on ms-be1074 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:56] PROBLEM - Check systemd state on cp3058 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:00] PROBLEM - Check systemd state on mw2355 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:00] PROBLEM - Check systemd state on dse-k8s-worker1006 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:00] PROBLEM - Check systemd state on mw1477 is CRITICAL: CRITICAL - degraded: 
The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:00] PROBLEM - Check systemd state on lvs3006 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:00] PROBLEM - Check systemd state on prometheus4002 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:00] PROBLEM - Check systemd state on mw1491 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:02] PROBLEM - Check systemd state on mw1407 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:06] PROBLEM - Check systemd state on an-airflow1003 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:07] thank you volans [12:30:10] PROBLEM - Check systemd state on cp3054 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:12] PROBLEM - Check systemd state on ganeti5006 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:12] PROBLEM - Check systemd state on cp4046 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:12] PROBLEM - Check systemd state on ganeti6002 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:14] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - 
up: 93, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:30:20] PROBLEM - Check systemd state on ores1007 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:30] PROBLEM - Check systemd state on db2162 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:34] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:42] PROBLEM - Check systemd state on cp3063 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:42] (SystemdUnitFailed) firing: (2) cadvisor.service Failed on elastic1058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:30:48] PROBLEM - Check systemd state on cp4048 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:54] PROBLEM - Check systemd state on xhgui2001 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:56] PROBLEM - Check systemd state on mw1440 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:00] PROBLEM - Check systemd state on mw2292 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:18] PROBLEM - Check systemd state 
on gerrit1003 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:18] PROBLEM - Check systemd state on parse1016 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:24] PROBLEM - Check systemd state on cp4039 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:30] PROBLEM - Check systemd state on mw2383 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:30] PROBLEM - Check systemd state on mw2371 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:32] PROBLEM - Check systemd state on cp4052 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:36] PROBLEM - Check systemd state on mw1371 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:36] PROBLEM - Check systemd state on mw1444 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:38] PROBLEM - Check systemd state on mw1479 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:38] PROBLEM - Check systemd state on netflow4002 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:40] PROBLEM - Check systemd state on mw2443 is 
CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:44] PROBLEM - Check systemd state on poolcounter2003 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:46] PROBLEM - Check systemd state on mw2450 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:48] PROBLEM - Check systemd state on cp6011 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:50] PROBLEM - Check systemd state on mw2310 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:50] PROBLEM - Check systemd state on mw2367 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:56] PROBLEM - Check systemd state on mw1461 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:58] PROBLEM - Check systemd state on mw1424 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:00] PROBLEM - Check systemd state on mw2264 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:04] PROBLEM - Check systemd state on mw1488 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:24] PROBLEM - Check systemd state on mw2444 is CRITICAL: CRITICAL - 
degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:26] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:34] PROBLEM - Check systemd state on mw2260 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:34] PROBLEM - Check systemd state on mw1404 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:42] (03PS1) 10Muehlenhoff: Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 [12:32:50] PROBLEM - Check systemd state on cp6004 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:06] PROBLEM - Check systemd state on mw2403 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:12] (03PS2) 10Muehlenhoff: Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 [12:33:18] RECOVERY - puppetboard.wikimedia.org requires authentication on puppetboard1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 1.047 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:33:28] (03PS1) 10Filippo Giunchedi: cadvisor: listen on main ip address only [puppet] - 10https://gerrit.wikimedia.org/r/924500 (https://phabricator.wikimedia.org/T337689) [12:33:40] RECOVERY - Check that envoy is running on puppetboard1003 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [12:34:08] PROBLEM 
- Check systemd state on mw2400 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:50] PROBLEM - Check systemd state on mw1426 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:54] PROBLEM - Check systemd state on mw2413 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:50] (03CR) 10CI reject: [V: 04-1] Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (owner: 10Muehlenhoff) [12:37:08] (03CR) 10Ayounsi: [C: 03+1] cadvisor: listen on main ip address only [puppet] - 10https://gerrit.wikimedia.org/r/924500 (https://phabricator.wikimedia.org/T337689) (owner: 10Filippo Giunchedi) [12:37:15] (03CR) 10Filippo Giunchedi: [C: 03+2] cadvisor: listen on main ip address only [puppet] - 10https://gerrit.wikimedia.org/r/924500 (https://phabricator.wikimedia.org/T337689) (owner: 10Filippo Giunchedi) [12:37:44] PROBLEM - Check systemd state on mw1442 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:24] PROBLEM - Check systemd state on ms-be1068 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:26] PROBLEM - Check systemd state on mw1373 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:28] PROBLEM - Check systemd state on mw2388 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:50] PROBLEM - Check systemd state on install5002 
is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:52] PROBLEM - Check systemd state on mw1464 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:52] PROBLEM - Check systemd state on mw1457 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:12] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:39:14] RECOVERY - Check systemd state on lvs1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:14] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:39:20] PROBLEM - Check systemd state on cp4041 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:11] (03PS3) 10Muehlenhoff: Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 [12:40:30] PROBLEM - Check systemd state on cp3064 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:28] PROBLEM - Check systemd state on db2158 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:34] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:38] PROBLEM - Check systemd state on mw2321 is CRITICAL: CRITICAL - 
degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:39] (03CR) 10CI reject: [V: 04-1] Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (owner: 10Muehlenhoff) [12:43:20] PROBLEM - Check systemd state on mw2438 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:43:20] PROBLEM - Check systemd state on parse2001 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:18] PROBLEM - Check systemd state on cp6014 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:28] PROBLEM - Check systemd state on mw1359 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:30] PROBLEM - Check systemd state on mw2272 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:34] PROBLEM - Check systemd state on parse1009 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:48] RECOVERY - Check systemd state on mw2438 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:51] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon I think we can resolve this now; the remove-ghost-objects cookbook has helped, and recent `rclone` runs have successfully completed. 
[12:46:16] PROBLEM - Check systemd state on ganeti1019 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:22] PROBLEM - Check systemd state on mw2359 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:55] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [12:46:58] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon I think we're now at the point where we can commit to our `rclone`-based replacement. [12:48:00] PROBLEM - Check systemd state on cp4044 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2009.codfw.wmnet with OS bullseye [12:48:06] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2009.codfw.wmnet with OS bullseye [12:48:13] (03Restored) 10BBlack: service: Disable monitors for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/924342 (https://phabricator.wikimedia.org/T337446) (owner: 10Vgutierrez) [12:48:16] PROBLEM - Check systemd state on mw2366 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:24] RECOVERY - Check systemd state on an-airflow1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:36] PROBLEM - Check systemd state on cp6015 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:44] (03PS4) 10Muehlenhoff: Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 [12:48:46] RECOVERY - Check systemd state on cp2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:01] (03PS3) 10BBlack: service: Disable monitors for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/924342 (https://phabricator.wikimedia.org/T337446) (owner: 10Vgutierrez) [12:49:04] PROBLEM - Check systemd state on db1178 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:18] PROBLEM - Check systemd state on mw2426 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:50] godog: still failing? 
^^^ [12:50:22] PROBLEM - Check systemd state on cp5020 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:50:36] PROBLEM - Check systemd state on mw2363 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:50:48] volans: I think that's the race between the check and the puppet runs (ongoing) [12:51:02] PROBLEM - Check systemd state on mw2259 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:03] ok [12:51:27] (03CR) 10CI reject: [V: 04-1] Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (owner: 10Muehlenhoff) [12:51:55] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:51:58] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:51:58] RECOVERY - Check systemd state on mw2426 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:22] PROBLEM - Check systemd state on mw1395 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:25] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2005-dev: enable puppet role [puppet] - 10https://gerrit.wikimedia.org/r/924504 (https://phabricator.wikimedia.org/T336564) [12:52:34] RECOVERY - Check systemd state on cp2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:34] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41405/console" [puppet] - 10https://gerrit.wikimedia.org/r/924342 (https://phabricator.wikimedia.org/T337446) (owner: 10Vgutierrez) [12:52:54] PROBLEM - Check systemd state on mw2405 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2005-dev: enable puppet role [puppet] - 10https://gerrit.wikimedia.org/r/924504 (https://phabricator.wikimedia.org/T336564) (owner: 10Arturo Borrero Gonzalez) [12:53:48] PROBLEM - Check systemd state on mw1414 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:48] RECOVERY - MariaDB read only s2 on clouddb1018 is OK: Version 10.4.22-MariaDB, Uptime 5s, read_only: True, event_scheduler: False, 13.63 QPS, connection latency: 0.004024s, query latency: 0.002028s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:53:58] PROBLEM - Check systemd state on cp1090 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:12] PROBLEM - Check systemd state on lvs6003 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:14] PROBLEM - Check systemd state on mw1364 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:16] RECOVERY - Check systemd state on mw1357 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:16] PROBLEM - Check systemd state on mw2335 is CRITICAL: CRITICAL - degraded: The following units 
failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:22] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/922543 (owner: 10PipelineBot) [12:54:28] RECOVERY - mysqld processes on clouddb1018 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:54:32] jouncebot: next [12:54:32] In 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1300) [12:54:32] In 0 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1300) [12:54:39] (03PS2) 10Ayounsi: Spicerack: add some colors [software/spicerack] - 10https://gerrit.wikimedia.org/r/924493 [12:54:46] RECOVERY - MariaDB Replica SQL: s2 on clouddb1018 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:48] is the deployment proceeding as scheduled? or are we having some outage? 
[12:54:51] (03PS4) 10DDesouza: Deploy Research Incentive survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917863 (https://phabricator.wikimedia.org/T336092) [12:55:02] RECOVERY - Check systemd state on elastic2073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:06] RECOVERY - Check systemd state on ganeti6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:06] RECOVERY - Check systemd state on mw1460 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:10] RECOVERY - Check systemd state on mw2355 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:14] RECOVERY - Check systemd state on mw2403 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:16] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/922543 (owner: 10PipelineBot) [12:55:18] there's a lot of patches scheduled for this one btw. 
sorry about that [12:55:22] RECOVERY - Check systemd state on mw2375 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:24] RECOVERY - Check systemd state on cp3056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:25] MatmaRex: please go ahead, just monitoring spam [12:55:26] RECOVERY - Check systemd state on mw1476 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:30] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:36] RECOVERY - Check systemd state on mw1466 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:48] PROBLEM - Check systemd state on cp1089 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:50] RECOVERY - Check systemd state on mw1393 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:56] RECOVERY - Check systemd state on mw1371 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:04] RECOVERY - Check systemd state on mw1356 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:12] RECOVERY - Check systemd state on lvs6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:16] RECOVERY - Check systemd state on mw2363 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[12:56:18] (03PS5) 10Muehlenhoff: Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 [12:56:23] (03CR) 10BBlack: [C: 03+2] service: Disable monitors for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/924342 (https://phabricator.wikimedia.org/T337446) (owner: 10Vgutierrez) [12:56:26] RECOVERY - Check systemd state on mw1418 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:30] RECOVERY - Check systemd state on mw1370 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:30] RECOVERY - puppetboard.wikimedia.org tls expiry on puppetboard1003 is OK: OK - Certificate puppetboard.discovery.wmnet will expire on Tue 27 Jun 2023 09:36:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:56:34] RECOVERY - Check systemd state on mw1373 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:36] RECOVERY - Check systemd state on mw1477 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:40] RECOVERY - Check systemd state on mw1407 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:40] RECOVERY - Check systemd state on cp3058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:40] RECOVERY - Check systemd state on mw2388 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:44] RECOVERY - Check systemd state on mw1395 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:46] (03CR) 10Jbond: [C: 03+2] 
build_envoy_deb: update to work with bookworm [puppet] - 10https://gerrit.wikimedia.org/r/924482 (owner: 10Jbond) [12:56:50] RECOVERY - Check systemd state on mw2421 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:56] RECOVERY - Check systemd state on ganeti2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:06] RECOVERY - Check systemd state on install6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:08] RECOVERY - Check systemd state on mw1442 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:08] RECOVERY - Check systemd state on lvs6003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:12] RECOVERY - Check systemd state on mw2335 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:14] RECOVERY - Check systemd state on mw2405 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:28] RECOVERY - Check systemd state on mw1479 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:32] RECOVERY - Check systemd state on mw2354 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:34] RECOVERY - Check systemd state on mw2410 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:38] PROBLEM - Check systemd state on mw2376 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:40] RECOVERY - Check systemd state on mw1426 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:48] RECOVERY - Check systemd state on mw2413 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:58] RECOVERY - Check systemd state on logstash1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:58] RECOVERY - Check systemd state on mw1440 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:04] PROBLEM - Check systemd state on zookeeper-test1002 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:14] RECOVERY - Check systemd state on mw2400 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:20] RECOVERY - Check systemd state on lvs3006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:36] (03CR) 10CI reject: [V: 04-1] Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (owner: 10Muehlenhoff) [12:58:38] RECOVERY - Check systemd state on ganeti6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:40] RECOVERY - Check systemd state on mw1364 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:42] RECOVERY - Check systemd state on ganeti5006 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:46] RECOVERY - Check systemd state on cp4039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:52] RECOVERY - Check systemd state on mw2383 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:16] PROBLEM - Check systemd state on cp6016 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:24] RECOVERY - Check systemd state on mw2366 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:28] RECOVERY - Check systemd state on mw1414 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:29] !lvs1020: restart pybal to test disabling wikireplicas monitoring [12:59:30] PROBLEM - Check systemd state on mw2285 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:34] PROBLEM - Check systemd state on parse1011 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:46] RECOVERY - Check systemd state on gerrit1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:46] bblack: let me know if the test goes well and I can shutdown more instances [12:59:53] bblack: more wikireplicas instances, that is [12:59:56] RECOVERY - Check systemd state on mw1464 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:56] RECOVERY - Check systemd state on mw1457 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:00] RECOVERY - Check systemd state on mw2444 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1300). [13:00:06] RECOVERY - Check systemd state on install5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:06] tgr, matthiasmullie, Daimona, HouseOfM, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1300) [13:00:08] RECOVERY - Check systemd state on mw2371 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:08] RECOVERY - Check systemd state on ganeti1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:09] o/ my 2 patches don't need mwdebug/testing; they only affect a not currently running maint script that I will execute later. 
[13:00:12] RECOVERY - Check systemd state on mw1404 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:25] o/ [13:00:28] RECOVERY - Check systemd state on mw1359 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:30] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage [13:00:42] (SystemdUnitFailed) resolved: (2) cadvisor.service Failed on elastic1058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:00:44] (Unable to deploy today, meeting D:) [13:00:55] o/ my patch is a noop in production [13:01:12] RECOVERY - Check systemd state on mw2321 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:42] RECOVERY - Check systemd state on cp3063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:49] (03CR) 10Andrew Bogott: "I am yet again waiting 10 minutes for cumin runs that would take seconds with this patch." [software/cumin] - 10https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773) (owner: 10Andrew Bogott) [13:03:44] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage [13:04:56] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:06:09] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:06:21] Any deployer around? 
[13:06:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2009.codfw.wmnet with reason: host reimage [13:06:28] RECOVERY - Check systemd state on mw2359 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:34] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:07:20] (I'll be out of this meeting in ~30mins, but there's a lot to do, so ideally another deployer could pick this up) [13:07:40] In the interest of time, I can go ahead and self-deploy mine [13:07:41] (03PS1) 10Elukey: varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 [13:07:43] (03PS1) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 [13:08:00] RECOVERY - Check systemd state on cp4041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:10] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10lmata) [13:08:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924454 (owner: 10Matthias Mullie) [13:08:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [extensions/ImageSuggestions] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924455 (owner: 10Matthias Mullie) [13:08:27] starting mine [13:08:30] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:08:30] (03CR) 10CI reject: [V: 04-1] varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 (owner: 10Elukey) [13:08:38] (03PS6) 10Muehlenhoff: Add a 
cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 [13:08:53] tgr_: can you self-deploy? mine will take some time to pass CI - your config patch should be quick I suppose? [13:08:53] !log lvs1018: restart pybal for wikireplicas monitoring removal [13:08:56] (03CR) 10Volans: Openstack backend: make use of all_tenants nova api flag (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773) (owner: 10Andrew Bogott) [13:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:03] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:09:05] will do [13:09:19] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:09:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2009.codfw.wmnet with reason: host reimage [13:09:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924079 (owner: 10Gergő Tisza) [13:09:48] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:09:57] marostegui: it seems to be functioning as intended (no checks on wikireplicas) [13:10:37] (03Merged) 10jenkins-bot: GrowthExperiments: Re-add $wgGERestbaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924079 (owner: 10Gergő Tisza) [13:10:45] (NodeTextfileStale) resolved: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:10:58] (03CR) 10CI reject: [V: 04-1] Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (owner: 10Muehlenhoff) [13:11:06] !log tgr@deploy1002 
Started scap: Backport for [[gerrit:924079|GrowthExperiments: Re-add $wgGERestbaseUrl]] [13:11:13] !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host mwlog2002.codfw.wmnet with OS bullseye [13:11:26] (03PS2) 10Elukey: varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 [13:11:28] (03PS2) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 [13:11:46] RECOVERY - Check systemd state on mw1444 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:05] !log tgr@deploy1002 tgr: Backport for [[gerrit:924079|GrowthExperiments: Re-add $wgGERestbaseUrl]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:13:08] (03PS3) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 [13:13:32] RECOVERY - Check systemd state on mw2376 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:35] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41407/console" [puppet] - 10https://gerrit.wikimedia.org/r/924506 (owner: 10Elukey) [13:14:49] (03PS7) 10Muehlenhoff: Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 [13:15:39] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41408/console" [puppet] - 10https://gerrit.wikimedia.org/r/924507 (owner: 10Elukey) [13:17:10] RECOVERY - Check systemd state on mw2367 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:27] (03CR) 10CI reject: [V: 04-1] Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 
10https://gerrit.wikimedia.org/r/924498 (owner: 10Muehlenhoff) [13:17:52] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] WMF signup message, stray " [software/bitu] - 10https://gerrit.wikimedia.org/r/924439 (owner: 10Slyngshede) [13:18:12] (03PS4) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 [13:18:19] tgr_: matthiasmullie: would either of you be able to deploy my patches as well afterwards? [13:19:31] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41409/console" [puppet] - 10https://gerrit.wikimedia.org/r/924507 (owner: 10Elukey) [13:19:32] MatmaRex: sorry, I can't today (watching my 1yr old, too much of a distraction to deal with unforeseen circumstances) [13:19:56] bblack: excellent thanks [13:20:16] :) [13:20:32] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:924079|GrowthExperiments: Re-add $wgGERestbaseUrl]] (duration: 09m 26s) [13:20:42] (03PS5) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 [13:20:48] RECOVERY - Check systemd state on mw1424 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:49] but sounded like TheresNoTime may be available by the time tgr_ & me are done [13:20:50] matthiasmullie: done [13:20:56] tgr_: rgr thanks [13:21:07] MatmaRex: I can deploy the rest once matthiasmullie is finished [13:21:41] thanks [13:21:46] well, whatever gets in before the end of the hour, I have a meeting afterwards [13:21:51] (03PS1) 10BBlack: wikireplicas: restore pybal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/924508 (https://phabricator.wikimedia.org/T337446) [13:21:57] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41410/console" [puppet] - 10https://gerrit.wikimedia.org/r/924507 (owner: 10Elukey) [13:22:36] 
RECOVERY - Check systemd state on mw2310 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:36] RECOVERY - MariaDB Replica Lag: s3 on clouddb1013 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:23:42] RECOVERY - MariaDB Replica IO: s3 on clouddb1013 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:24:06] (03Merged) 10jenkins-bot: Fix maxJobs default [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924454 (owner: 10Matthias Mullie) [13:24:20] RECOVERY - Check systemd state on clouddb1018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:28] RECOVERY - Check systemd state on mw2443 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:16] (03Merged) 10jenkins-bot: Fix maxJobs default [extensions/ImageSuggestions] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924455 (owner: 10Matthias Mullie) [13:25:48] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:924454|Fix maxJobs default]], [[gerrit:924455|Fix maxJobs default]] [13:26:36] (03PS1) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 [13:27:16] !log mlitn@deploy1002 mlitn: Backport for [[gerrit:924454|Fix maxJobs default]], [[gerrit:924455|Fix maxJobs default]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:27:48] RECOVERY - Check systemd state on cp3064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:01] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41411/console" [puppet] - 10https://gerrit.wikimedia.org/r/924509 (owner: 10Elukey) [13:28:06] RECOVERY - Check systemd state on mw2450 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:41] (03PS8) 10Muehlenhoff: Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 [13:29:08] tgr_: syncing mine; you might want to start +2 those other patches already? [13:29:35] (03PS1) 10ArielGlenn: fix up sample nfs share mount command in docs for dupms nfs share testing [puppet] - 10https://gerrit.wikimedia.org/r/924510 (https://phabricator.wikimedia.org/T325232) [13:29:47] (03CR) 10Gergő Tisza: [C: 03+2] editpage: Change the order of hooks slightly for FlaggedRevs [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924158 (https://phabricator.wikimedia.org/T337637) (owner: 10Bartosz Dziewoński) [13:31:01] (03CR) 10CI reject: [V: 04-1] Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (owner: 10Muehlenhoff) [13:32:22] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:27] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:924454|Fix maxJobs default]], [[gerrit:924455|Fix maxJobs default]] (duration: 07m 39s) [13:33:38] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:43] tgr_: I'm done; the floor is all yours [13:33:50] thx [13:34:07] (03CR) 10Gergő Tisza: [C: 03+2] beta: Remove $wgCampaignEventsEnableMultipleOrganizers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909401 
(https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [13:34:54] (03Merged) 10jenkins-bot: beta: Remove $wgCampaignEventsEnableMultipleOrganizers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909401 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [13:35:14] (03PS1) 10KartikMistry: Enable Content and Section Translation for 9 Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924511 (https://phabricator.wikimedia.org/T337290) [13:37:26] (03CR) 10Vgutierrez: [C: 03+1] cache::upload: Add hieradata to switch HTTPS redirection from Varnish to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:38:27] Daimona: do you want to test the bet change or should I go on with the prod one? [13:38:55] s/bet/beta/ [13:39:02] tgr_: thanks, I think you can go ahead with prod! [13:39:09] (03PS9) 10Muehlenhoff: Add a cookbook to drain a Ganeti node (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 [13:39:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924488 (https://phabricator.wikimedia.org/T334088) (owner: 10Daimona Eaytoy) [13:40:27] (03Merged) 10jenkins-bot: prod: Remove $wgCampaignEventsEnableMultipleOrganizers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924488 (https://phabricator.wikimedia.org/T334088) (owner: 10Daimona Eaytoy) [13:40:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:40:53] !log tgr@deploy1002 Started scap: Backport for [[gerrit:924488|prod: Remove $wgCampaignEventsEnableMultipleOrganizers (T334088)]] [13:40:58] T334088: Enable the multiple organizers feature in production - https://phabricator.wikimedia.org/T334088 [13:41:04] there will be another 
shower of recoveries btw [13:41:07] all harmless [13:42:19] (03PS1) 10MVernon: hira: disable swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/924516 (https://phabricator.wikimedia.org/T279637) [13:42:27] !log tgr@deploy1002 tgr and daimona: Backport for [[gerrit:924488|prod: Remove $wgCampaignEventsEnableMultipleOrganizers (T334088)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:42:44] Daimona: ^^ [13:42:48] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924516 (https://phabricator.wikimedia.org/T279637) (owner: 10MVernon) [13:43:12] (03PS1) 10Muehlenhoff: Setup debmonitor2003 as bookworm debmonitor VM [puppet] - 10https://gerrit.wikimedia.org/r/924517 (https://phabricator.wikimedia.org/T241049) [13:44:01] Thanks! HouseOfM: we can now test that the multiple organizers feature is still appearing in production wikis (testwiki, test2wiki, meta, officewiki) [13:44:12] RECOVERY - Check systemd state on cp2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:24] Sweet, thanks @tgr [13:44:32] RECOVERY - Check systemd state on cp6004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:44] RECOVERY - Check systemd state on cp6007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:49] You can choose any mwdebug server from the dropdown [13:45:10] RECOVERY - Check systemd state on cp6014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:32] RECOVERY - Check systemd state on cp6011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:50] (03PS10) 10Muehlenhoff: Add a cookbook to 
drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) [13:45:56] (03Merged) 10jenkins-bot: editpage: Change the order of hooks slightly for FlaggedRevs [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924158 (https://phabricator.wikimedia.org/T337637) (owner: 10Bartosz Dziewoński) [13:46:00] RECOVERY - Check systemd state on cp6015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:22] RECOVERY - Check systemd state on cp6016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:42] RECOVERY - Check systemd state on cp1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:49] (03CR) 10Jcrespo: [C: 03+1] hira: disable swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/924516 (https://phabricator.wikimedia.org/T279637) (owner: 10MVernon) [13:47:04] RECOVERY - Check systemd state on parse1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:24] Looking good to me [13:47:26] (03CR) 10MVernon: [C: 03+2] hira: disable swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/924516 (https://phabricator.wikimedia.org/T279637) (owner: 10MVernon) [13:47:28] RECOVERY - Check systemd state on cp1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:06] RECOVERY - Check systemd state on cp3051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:10] RECOVERY - Check systemd state on cp5020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:28] RECOVERY - Check systemd state on 
ores1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:24] @Daimona, all good [13:49:28] RECOVERY - Check systemd state on cp4044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:32] RECOVERY - Check systemd state on cp3054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:48] Cool, thanks. @tgr_ you can proceed [13:49:50] RECOVERY - Check systemd state on db2113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:04] RECOVERY - Check systemd state on cp4047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:04] RECOVERY - Check systemd state on db2162 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:30] RECOVERY - Check systemd state on db2146 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:36] RECOVERY - Check systemd state on db1178 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2009.codfw.wmnet with OS bullseye [13:50:46] RECOVERY - Check systemd state on cp4052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:46] 10SRE-swift-storage, 10Patch-For-Review: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2009.codfw.wmnet 
with OS bullseye completed: - ms-fe2009... [13:50:54] RECOVERY - Check systemd state on cp4046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:54] RECOVERY - Check systemd state on db2158 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:16] RECOVERY - Check systemd state on cp4048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:46] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:54] RECOVERY - Check systemd state on db1207 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:06] RECOVERY - Check systemd state on doh6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:22] RECOVERY - Check systemd state on druid1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:24] RECOVERY - Check systemd state on dumpsdata1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:40] RECOVERY - Check systemd state on durum3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:00] RECOVERY - Check systemd state on dse-k8s-worker1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:53] (03CR) 10Hokwelum: [C: 03+1] "looks good :-)" [puppet] - 10https://gerrit.wikimedia.org/r/924510 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [13:54:22] RECOVERY - Check 
systemd state on ncredir6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:14] (03PS1) 10Jelto: gitlab: use production idp for gitlab hosts [puppet] - 10https://gerrit.wikimedia.org/r/924525 (https://phabricator.wikimedia.org/T320390) [13:55:16] RECOVERY - Check systemd state on mw2259 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:20] RECOVERY - Check systemd state on mc-gp1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:26] RECOVERY - Check systemd state on ms-be1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:28] RECOVERY - Check systemd state on mc-wf1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:32] RECOVERY - Check systemd state on parse2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:44] RECOVERY - Check systemd state on parse2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:48] RECOVERY - Check systemd state on sessionstore2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:56] RECOVERY - Check systemd state on parse1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:58] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:58] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: 
service=swift-fe,name=ms-fe2009.codfw.wmnet [13:56:06] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2009.codfw.wmnet [13:56:08] RECOVERY - Check systemd state on mw1491 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:16] RECOVERY - Check systemd state on mw2298 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:32] RECOVERY - Check systemd state on mw2300 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:34] RECOVERY - Check systemd state on parse2013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:40] RECOVERY - Check systemd state on cp6012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:06] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:924488|prod: Remove $wgCampaignEventsEnableMultipleOrganizers (T334088)]] (duration: 16m 13s) [13:57:10] RECOVERY - Check systemd state on snapshot1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:11] T334088: Enable the multiple organizers feature in production - https://phabricator.wikimedia.org/T334088 [13:57:22] RECOVERY - Check systemd state on mw2272 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:25] (03CR) 10Jelto: "not sure if that makes sense, but I noticed wmcloud idp is configured as the default for all gitlab hosts in I737a9da73911f1f6f7084d909db2" [puppet] - 10https://gerrit.wikimedia.org/r/924525 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [13:57:29] Daimona: deployed [13:57:44] RECOVERY - Check systemd 
state on mw2264 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:44] RECOVERY - Check systemd state on parse2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:52] RECOVERY - Check systemd state on prometheus4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:54] RECOVERY - Check systemd state on mw2260 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:02] Thanks! [13:58:18] RECOVERY - Check systemd state on mw2285 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:21] !log tgr@deploy1002 Started scap: Backport for [[gerrit:924158|editpage: Change the order of hooks slightly for FlaggedRevs (T337637)]] [13:58:26] T337637: Duplicated edit notice about pending changes - https://phabricator.wikimedia.org/T337637 [13:58:30] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:36] RECOVERY - Check systemd state on parse1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:53] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41412/console" [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [13:58:56] RECOVERY - Check systemd state on zookeeper-test1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:02] (03PS6) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 
10https://gerrit.wikimedia.org/r/924507 [13:59:04] (03PS2) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 [13:59:10] RECOVERY - Check systemd state on netflow4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:10] RECOVERY - Check systemd state on mw2292 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:20] RECOVERY - Check systemd state on xhgui2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:22] RECOVERY - Check systemd state on poolcounter2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:57] !log tgr@deploy1002 tgr and matmarex: Backport for [[gerrit:924158|editpage: Change the order of hooks slightly for FlaggedRevs (T337637)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:00:18] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41413/console" [puppet] - 10https://gerrit.wikimedia.org/r/924509 (owner: 10Elukey) [14:00:22] MatmaRex: ^ [14:00:22] RECOVERY - Check systemd state on parse1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:24] RECOVERY - Check systemd state on ms-be1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:26] RECOVERY - Check systemd state on mw1461 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:28] looking [14:00:40] RECOVERY - Check systemd state on mw1488 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:41] tgr_: looks good [14:02:21] i assume you're leaving after this one syncs? [14:04:37] I can continue, I don't need to do anything during the meeting [14:05:11] (I confused it with a different meeting) [14:05:43] (03CR) 10ArielGlenn: [C: 03+2] fix up sample nfs share mount command in docs for dupms nfs share testing [puppet] - 10https://gerrit.wikimedia.org/r/924510 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [14:05:50] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:14] oh, i guess we're in the same meeting then, heh [14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:36] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:924158|editpage: Change the order of hooks slightly for FlaggedRevs (T337637)]] (duration: 08m 14s) [14:06:36] !log installing libwebp security updates [14:06:40] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:06:41] T337637: Duplicated edit notice about pending changes - https://phabricator.wikimedia.org/T337637 [14:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:46] (03CR) 10ArielGlenn: [C: 04-2] "prevent merge while the parent change gets reviewed and deployed" [puppet] - 10https://gerrit.wikimedia.org/r/924510 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [14:06:47] thanks [14:07:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/924159 (https://phabricator.wikimedia.org/T337633) (owner: 10Bartosz Dziewoński) [14:08:42] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [14:08:44] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:13:57] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:13:59] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:43] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetdb1003.eqiad.wmnet with OS bookworm [14:16:47] 10SRE, 10Infrastructure-Foundations: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bookworm executed with errors: - puppetdb1003 (**FA... 
[14:16:54] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mwlog2002.codfw.wmnet with OS bullseye [14:17:17] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: adjust openstack.codfw1dev FQDN [dns] - 10https://gerrit.wikimedia.org/r/924526 (https://phabricator.wikimedia.org/T336564) [14:17:19] (03CR) 10Hokwelum: [C: 03+1] "Checks out!" [puppet] - 10https://gerrit.wikimedia.org/r/923289 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [14:17:32] (03PS17) 10JMeybohm: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [14:17:34] (03PS5) 10JMeybohm: profile::imagecatalog migrate from user token to client cert [puppet] - 10https://gerrit.wikimedia.org/r/912842 (https://phabricator.wikimedia.org/T325268) [14:17:36] (03PS10) 10JMeybohm: prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - 10https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) [14:17:38] (03PS10) 10JMeybohm: prometheus::k8s switch staging-codfw to client cert auth [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) [14:18:12] (03CR) 10ArielGlenn: [C: 03+2] Dumps: move the nfs share test conf to the right location [puppet] - 10https://gerrit.wikimedia.org/r/923289 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [14:19:52] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41416/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:21:00] (03PS1) 10Fabfur: run-puppet-restart-varnish: fix _custom_action signature [cookbooks] - 10https://gerrit.wikimedia.org/r/924527 (https://phabricator.wikimedia.org/T323557) [14:21:20] (03CR) 10ArielGlenn: [C: 03+2] fix up sample nfs share mount command in docs for dupms nfs 
share testing [puppet] - 10https://gerrit.wikimedia.org/r/924510 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [14:22:05] so CI is taking forever today, i guess? if we want to deploy the rest, i suggest +2-ing them all and deploying them together [14:24:02] (03CR) 10CI reject: [V: 04-1] run-puppet-restart-varnish: fix _custom_action signature [cookbooks] - 10https://gerrit.wikimedia.org/r/924527 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:24:16] (03CR) 10CDanis: [C: 03+1] "LGTM, will merge today" [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [14:25:02] (03CR) 10CDanis: [C: 03+1] "thanks! LGTM, will merge today" [puppet] - 10https://gerrit.wikimedia.org/r/923448 (https://phabricator.wikimedia.org/T337317) (owner: 10Jameel Kaisar) [14:25:39] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Ignore comment just me thinking out loud." [dns] - 10https://gerrit.wikimedia.org/r/924526 (https://phabricator.wikimedia.org/T336564) (owner: 10Arturo Borrero Gonzalez) [14:26:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: adjust openstack.codfw1dev FQDN (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/924526 (https://phabricator.wikimedia.org/T336564) (owner: 10Arturo Borrero Gonzalez) [14:27:27] (03Merged) 10jenkins-bot: Hide 'editnotice-notext' message in VE (and mobile apps) [extensions/VisualEditor] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924159 (https://phabricator.wikimedia.org/T337633) (owner: 10Bartosz Dziewoński) [14:27:57] !log tgr@deploy1002 Started scap: Backport for [[gerrit:924159|Hide 'editnotice-notext' message in VE (and mobile apps) (T337633)]] [14:28:02] T337633: Empty message 'editnotice-notext' is visible as an edit notice in VisualEditor and mobile apps - https://phabricator.wikimedia.org/T337633 [14:28:17] (03CR) 10Volans: run-puppet-restart-varnish: fix _custom_action signature (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/924527 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:28:23] (03CR) 10JMeybohm: "Please comment out/disable the egress zookeeper stuff for now (with a link in comment to the phab task) to make it clear that we don't use" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [14:28:43] !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host mwlog2002.codfw.wmnet with OS bullseye [14:29:33] !log tgr@deploy1002 matmarex and tgr: Backport for [[gerrit:924159|Hide 'editnotice-notext' message in VE (and mobile apps) (T337633)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [14:29:55] MatmaRex: ^ [14:29:56] (03PS2) 10Fabfur: run-puppet-restart-varnish: fix _custom_action signature [cookbooks] - 10https://gerrit.wikimedia.org/r/924527 (https://phabricator.wikimedia.org/T323557) [14:29:57] tgr_: looks good [14:30:11] tgr_: so CI is taking forever today, i guess? 
if we want to deploy the rest, i suggest +2-ing them all and deploying them together [14:30:26] yeah, slow day [14:30:37] (03CR) 10Fabfur: run-puppet-restart-varnish: fix _custom_action signature (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/924527 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:30:38] the wmf.11 backports don't really need additional testing [14:31:30] (03PS1) 10Bking: rdf-streaming-updater: Enable new flink version in CODFW [deployment-charts] - 10https://gerrit.wikimedia.org/r/924528 (https://phabricator.wikimedia.org/T334244) [14:32:18] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: Enable new flink version in CODFW [deployment-charts] - 10https://gerrit.wikimedia.org/r/924528 (https://phabricator.wikimedia.org/T334244) (owner: 10Bking) [14:33:00] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/924527 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:33:02] (03CR) 10Vgutierrez: [C: 03+1] run-puppet-restart-varnish: fix _custom_action signature [cookbooks] - 10https://gerrit.wikimedia.org/r/924527 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:34:08] (03CR) 10Fabfur: [C: 03+2] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/924527 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:34:31] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: Enable new flink version in CODFW [deployment-charts] - 10https://gerrit.wikimedia.org/r/924528 (https://phabricator.wikimedia.org/T334244) (owner: 10Bking) [14:35:02] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:14] (03Merged) 10jenkins-bot: rdf-streaming-updater: Enable new flink version in CODFW [deployment-charts] - 10https://gerrit.wikimedia.org/r/924528 (https://phabricator.wikimedia.org/T334244) (owner: 10Bking) 
[14:35:59] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:924159|Hide 'editnotice-notext' message in VE (and mobile apps) (T337633)]] (duration: 08m 01s) [14:36:04] T337633: Empty message 'editnotice-notext' is visible as an edit notice in VisualEditor and mobile apps - https://phabricator.wikimedia.org/T337633 [14:36:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:36:42] (03Merged) 10jenkins-bot: run-puppet-restart-varnish: fix _custom_action signature [cookbooks] - 10https://gerrit.wikimedia.org/r/924527 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:37:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924160 (https://phabricator.wikimedia.org/T337638) (owner: 10Bartosz Dziewoński) [14:37:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924456 (https://phabricator.wikimedia.org/T337633) (owner: 10Bartosz Dziewoński) [14:37:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924458 (https://phabricator.wikimedia.org/T337638) (owner: 10Bartosz Dziewoński) [14:37:35] (03CR) 10Jbond: [C: 04-1] "nice ideas but wont work as expected currently" [software/spicerack] - 10https://gerrit.wikimedia.org/r/924493 (owner: 10Ayounsi) [14:38:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[14:41:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:41:39] 10SRE, 10Release-Engineering-Team, 10Security-Team, 10Wikimedia-GitHub, and 2 others: Add github.com/wikimedia as an SCM for Semgrep Cloud - https://phabricator.wikimedia.org/T337561 (10sbassett) >>! In T337561#8883334, @Dzahn wrote: > Let's keep access requests on tickets and not in ad-hoc chats. Yep, th... [14:41:49] (03CR) 10Jbond: "lgtm, minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/924506 (owner: 10Elukey) [14:43:02] (03PS8) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) [14:43:13] (03CR) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [14:44:32] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: make it aware of cloudcontrol2005-dev [puppet] - 10https://gerrit.wikimedia.org/r/924533 (https://phabricator.wikimedia.org/T336564) [14:45:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: make it aware of cloudcontrol2005-dev [puppet] - 10https://gerrit.wikimedia.org/r/924533 (https://phabricator.wikimedia.org/T336564) (owner: 10Arturo Borrero Gonzalez) [14:46:33] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwlog2002.codfw.wmnet with reason: host reimage [14:49:45] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwlog2002.codfw.wmnet with reason: host reimage [14:50:31] !log installing texlive-bin security updates [14:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:49] (03PS1) 
10Herron: reuse-lvm-root-4dev: add grub-installer/bootdev config [puppet] - 10https://gerrit.wikimedia.org/r/924535 (https://phabricator.wikimedia.org/T333614) [14:52:26] (03CR) 10Herron: [C: 03+2] "self-merging since this was live tested and fixed mwlog2002 reimage (see bug)" [puppet] - 10https://gerrit.wikimedia.org/r/924535 (https://phabricator.wikimedia.org/T333614) (owner: 10Herron) [14:56:16] (03PS3) 10Fabfur: cache::upload: Add hieradata to switch HTTPS redirection from Varnish to HAProxy only on host c2042.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) [14:56:32] (03PS1) 10Kimberly Sarabia: Turn on A/B Test Hebrew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924536 (https://phabricator.wikimedia.org/T336969) [14:56:38] (03CR) 10CI reject: [V: 04-1] cache::upload: Add hieradata to switch HTTPS redirection from Varnish to HAProxy only on host c2042.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:57:44] (03PS4) 10Fabfur: cache::upload: Add hieradata to switch HTTPS redirection from Varnish to HAProxy only on host cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) [14:58:09] (03CR) 10CI reject: [V: 04-1] cache::upload: Add hieradata to switch HTTPS redirection from Varnish to HAProxy only on host cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:58:38] (03PS5) 10Fabfur: cache::upload: Switch HTTPS redirection from Varnish to HAProxy only on cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) [14:58:51] 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10elukey) @Jclark-ctr sorryyyy didn't see the ping :( Lemme know if you have time in these days or next week, thanks a lot! 
The caveat is that we'd need to move 2 GPUs from a dse-k8s-worker node, not from H... [14:59:31] (03Merged) 10jenkins-bot: ve.ui.MWGalleryDialog: Fix showing the search panel [extensions/VisualEditor] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924160 (https://phabricator.wikimedia.org/T337638) (owner: 10Bartosz Dziewoński) [14:59:33] (03Merged) 10jenkins-bot: Hide 'editnotice-notext' message in VE (and mobile apps) [extensions/VisualEditor] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924456 (https://phabricator.wikimedia.org/T337633) (owner: 10Bartosz Dziewoński) [14:59:37] (03Merged) 10jenkins-bot: ve.ui.MWGalleryDialog: Fix showing the search panel [extensions/VisualEditor] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924458 (https://phabricator.wikimedia.org/T337638) (owner: 10Bartosz Dziewoński) [15:00:08] !log tgr@deploy1002 Started scap: Backport for [[gerrit:924160|ve.ui.MWGalleryDialog: Fix showing the search panel (T337638)]], [[gerrit:924456|Hide 'editnotice-notext' message in VE (and mobile apps) (T337633)]], [[gerrit:924458|ve.ui.MWGalleryDialog: Fix showing the search panel (T337638)]] [15:00:20] T337638: Gallery creation not functional in VisualEditor - https://phabricator.wikimedia.org/T337638 [15:00:21] T337633: Empty message 'editnotice-notext' is visible as an edit notice in VisualEditor and mobile apps - https://phabricator.wikimedia.org/T337633 [15:00:39] (03CR) 10Alexandros Kosiaris: [C: 03+1] "This chan" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:02:04] !log tgr@deploy1002 tgr and matmarex: Backport for [[gerrit:924160|ve.ui.MWGalleryDialog: Fix showing the search panel (T337638)]], [[gerrit:924456|Hide 'editnotice-notext' message in VE (and mobile apps) (T337633)]], [[gerrit:924458|ve.ui.MWGalleryDialog: Fix showing the search panel (T337638)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, 
mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [15:02:18] MatmaRex: ^ last one [15:02:22] 10SRE, 10Znuny, 10serviceops-collab: Puppet template for /etc/clamav/clamd.conf needs to be updated - https://phabricator.wikimedia.org/T330129 (10Arnoldokoth) 05Open→03Resolved [15:02:36] wmf.10 gallery backport looks good [15:03:13] wmf.11 looks good too [15:03:34] (03CR) 10Jbond: "see inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [15:03:34] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [15:03:40] tgr_: all good [15:05:10] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [15:06:11] (03PS3) 10Elukey: varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 [15:06:13] (03PS7) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 [15:06:15] (03PS3) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 [15:07:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [15:08:10] (03PS4) 10Elukey: varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 [15:08:12] (03PS8) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 [15:08:14] (03PS4) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 [15:08:17] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:924160|ve.ui.MWGalleryDialog: Fix showing the search panel (T337638)]], [[gerrit:924456|Hide 
'editnotice-notext' message in VE (and mobile apps) (T337633)]], [[gerrit:924458|ve.ui.MWGalleryDialog: Fix showing the search panel (T337638)]] (duration: 08m 08s) [15:08:23] T337638: Gallery creation not functional in VisualEditor - https://phabricator.wikimedia.org/T337638 [15:08:23] T337633: Empty message 'editnotice-notext' is visible as an edit notice in VisualEditor and mobile apps - https://phabricator.wikimedia.org/T337633 [15:09:07] (03PS5) 10Elukey: varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 [15:09:09] (03PS9) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 [15:09:11] (03PS5) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 [15:09:34] deployed, logs look good. [15:09:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [15:09:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:09:59] (03CR) 10Elukey: varnishkafka: add catch all systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924506 (owner: 10Elukey) [15:10:00] !log UTC evening deploys done [15:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:37] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41419/console" [puppet] - 10https://gerrit.wikimedia.org/r/924506 
(owner: 10Elukey) [15:13:08] (03PS1) 10Hokwelum: Rename nfs_settings dir to nfs_testing and move nfs test files into nfs test dir [puppet] - 10https://gerrit.wikimedia.org/r/924542 [15:14:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] .gitmessage: add Hosts: line [puppet] - 10https://gerrit.wikimedia.org/r/924438 (owner: 10Volans) [15:14:30] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002" [15:14:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [15:14:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:15:35] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002" [15:15:36] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2005-dev.codfw.wmnet with OS bullseye [15:15:43] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2005-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2005-dev.codfw.wmnet with OS bullseye complete... 
[15:16:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/924506 (owner: 10Elukey) [15:19:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:20:04] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:20:51] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [15:21:09] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [15:22:02] (thanks tgr_) [15:23:06] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:19] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [15:24:42] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:28:29] (03PS1) 10Elukey: ml-services: add autoscaling capabilities to revert risk la [deployment-charts] - 10https://gerrit.wikimedia.org/r/924544 [15:28:31] (03PS1) 10Elukey: 
services: raise auth-users rate limit for Lift Wing in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/924545 [15:28:42] 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Extend router ACLs to block 4194/tcp on LVSes - https://phabricator.wikimedia.org/T337689 (10fgiunchedi) 05Open→03Resolved >>! In T337689#8887689, @ayounsi wrote: > What I pushed is an extra safeguard, but a more viable fix is to have the daemon listen o... [15:28:44] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) [15:29:09] (03CR) 10CI reject: [V: 04-1] ml-services: add autoscaling capabilities to revert risk la [deployment-charts] - 10https://gerrit.wikimedia.org/r/924544 (owner: 10Elukey) [15:29:19] (03CR) 10CI reject: [V: 04-1] services: raise auth-users rate limit for Lift Wing in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/924545 (owner: 10Elukey) [15:32:12] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:45] (03CR) 10JMeybohm: [C: 03+1] flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [15:35:52] (03PS2) 10Elukey: ml-services: add autoscaling capabilities to revert risk la [deployment-charts] - 10https://gerrit.wikimedia.org/r/924544 [15:35:54] (03PS2) 10Elukey: services: raise auth-users rate limit for Lift Wing in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/924545 [15:36:48] (03CR) 10Ottomata: [C: 03+2] flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [15:36:54] 
(03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/924494 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [15:37:32] (03PS6) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) [15:37:48] (03PS10) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) [15:38:19] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [15:38:29] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [15:39:08] (03Merged) 10jenkins-bot: flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [15:40:06] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10akosiaris) @Trizek-WMF, should we resolve this? 
[15:40:14] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "👌" [deployment-charts] - 10https://gerrit.wikimedia.org/r/924545 (owner: 10Elukey) [15:40:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:42:42] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: add autoscaling capabilities to revert risk la [deployment-charts] - 10https://gerrit.wikimedia.org/r/924544 (owner: 10Elukey) [15:43:53] (03CR) 10AikoChou: [C: 03+1] ml-services: add autoscaling capabilities to revert risk la [deployment-charts] - 10https://gerrit.wikimedia.org/r/924544 (owner: 10Elukey) [15:44:05] (03CR) 10AikoChou: [C: 03+1] services: raise auth-users rate limit for Lift Wing in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/924545 (owner: 10Elukey) [15:45:57] (03CR) 10ArielGlenn: [C: 03+2] Rename nfs_settings dir to nfs_testing and move nfs test files into nfs test dir [puppet] - 10https://gerrit.wikimedia.org/r/924542 (owner: 10Hokwelum) [15:46:09] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) 05In progress→03Resolved [15:46:12] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Trizek-WMF) [15:49:33] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:49:37] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:51:34] !log otto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:51:41] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[15:51:55] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwlog2002.codfw.wmnet with OS bullseye [15:52:30] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:53:14] !log otto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:54:04] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:54:24] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:54:31] !log otto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:54:39] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:55:00] (03CR) 10Klausman: [C: 03+1] services: raise auth-users rate limit for Lift Wing in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/924545 (owner: 10Elukey) [15:55:12] !log otto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:55:26] (03CR) 10Klausman: [C: 03+1] ml-services: add autoscaling capabilities to revert risk la [deployment-charts] - 10https://gerrit.wikimedia.org/r/924544 (owner: 10Elukey) [15:55:34] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [15:56:00] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:56:58] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[15:56:58] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Docker [15:57:08] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:57:15] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:58:04] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:58:10] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:58:17] (03PS1) 10JMeybohm: Revert: Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/924550 [15:58:39] (03CR) 10CI reject: [V: 04-1] Revert: Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/924550 (owner: 10JMeybohm) [15:58:41] (03PS2) 10Urbanecm: [Growth] Enable user impact refresh on 10 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924053 (https://phabricator.wikimedia.org/T336203) [15:58:46] jouncebot: nowandnext [15:58:46] No deployments scheduled for the next 0 hour(s) and 1 minute(s) [15:58:46] In 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1600) [15:59:05] (03PS2) 10JMeybohm: Revert: Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/924550 [15:59:27] (03CR) 10CI reject: [V: 04-1] Revert: Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/924550 (owner: 10JMeybohm) [16:00:02] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:00:06] jbond and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1600) [16:00:06] No Gerrit patches in the queue for this window AFAICS. 
[16:00:52] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:04:29] urbanecm: nothing planned for the puppet window today, all yours if you need it :) [16:05:10] thanks! [16:05:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924053 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [16:06:58] (03Merged) 10jenkins-bot: [Growth] Enable user impact refresh on 10 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924053 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [16:07:26] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:924053|[Growth] Enable user impact refresh on 10 more wikis (T336203)]] [16:07:31] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [16:14:34] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:924053|[Growth] Enable user impact refresh on 10 more wikis (T336203)]] (duration: 07m 08s) [16:14:39] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [16:15:28] rzl: would it be possible to start the growthexperiments-userImpactUpdateRecentlyRegistered and growthexperiments-userImpactUpdateRecentlyEdited jobs at mwmaint1002 before the timer kicks in? if it's too problematic, i can run it in a tmux too. [16:15:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:17:15] urbanecm: sure thing - ready for it now? [16:17:26] yes. 
[16:19:19] !log rzl@mwmaint1002:~$ sudo systemctl start mediawiki_job_growthexperiments-userImpactUpdateRecentlyRegistered [16:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:00] (03PS1) 10Herron: mwlog: add remove_python2_on_bullseye exemption [puppet] - 10https://gerrit.wikimedia.org/r/924555 (https://phabricator.wikimedia.org/T333614) [16:20:14] !log rzl@mwmaint1002:~$ sudo systemctl start mediawiki_job_growthexperiments-userImpactUpdateRecentlyEdited [16:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:21:35] thanks rzl! [16:21:52] no worries! RecentlyEdited is still running, up to the Ts now [16:22:05] there we go, both done [16:22:07] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41422/console" [puppet] - 10https://gerrit.wikimedia.org/r/924555 (https://phabricator.wikimedia.org/T333614) (owner: 10Herron) [16:23:34] (03PS18) 10Cwhite: prometheus: generate swagger targets from service catalog [puppet] - 10https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620) [16:24:29] (03CR) 10Herron: [V: 03+1 C: 03+2] "self-merging to complete the upgrade to bullseye. 
we should revisit this for a longer term fix" [puppet] - 10https://gerrit.wikimedia.org/r/924555 (https://phabricator.wikimedia.org/T333614) (owner: 10Herron) [16:32:32] (03PS1) 10Herron: udp2log: dont use python symlink [puppet] - 10https://gerrit.wikimedia.org/r/924557 (https://phabricator.wikimedia.org/T333614) [16:33:15] (03PS2) 10Herron: udp2log: dont use python symlink [puppet] - 10https://gerrit.wikimedia.org/r/924557 (https://phabricator.wikimedia.org/T333614) [16:33:38] (03PS1) 10Hokwelum: create mount point dir [puppet] - 10https://gerrit.wikimedia.org/r/924558 [16:34:48] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41423/console" [puppet] - 10https://gerrit.wikimedia.org/r/924557 (https://phabricator.wikimedia.org/T333614) (owner: 10Herron) [16:35:33] (03CR) 10Herron: [V: 03+1 C: 03+2] "self-merging to complete host reimage" [puppet] - 10https://gerrit.wikimedia.org/r/924557 (https://phabricator.wikimedia.org/T333614) (owner: 10Herron) [16:36:02] (03PS2) 10Hokwelum: create mount point dir [puppet] - 10https://gerrit.wikimedia.org/r/924558 (https://phabricator.wikimedia.org/T325232) [16:38:10] (03PS2) 10Urbanecm: [Growth] Enable new Impact for 10 additional wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924060 (https://phabricator.wikimedia.org/T336203) [16:38:59] (03PS19) 10Cwhite: prometheus: generate swagger targets from service catalog [puppet] - 10https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620) [16:41:32] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/924135 [16:46:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10KCVelaga_WMF) [16:51:30] (03PS1) 10BCornwall: pybal: Switch codfw LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/924559 
(https://phabricator.wikimedia.org/T263797) [16:51:48] jouncebot: now [16:51:48] For the next 0 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1600) [16:51:52] jouncebot: nowandnext [16:51:52] For the next 0 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1600) [16:51:52] In 0 hour(s) and 8 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1700) [16:53:48] (03PS1) 10Ssingh: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/924561 (https://phabricator.wikimedia.org/T263797) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1700) [17:03:58] (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/916914/41424/" [puppet] - 10https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [17:10:10] (03PS6) 10Ilias Sarantopoulos: ORES: add model versions configuration and thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) [17:12:38] (03PS1) 10Zabe: Start reading from rev_comment_id in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924564 (https://phabricator.wikimedia.org/T299954) [17:17:22] (03PS3) 10Cwhite: team-sre: add openapi/swagger alerts [alerts] - 10https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) [17:19:55] (03CR) 10Cwhite: team-sre: add openapi/swagger alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [17:24:57] 10SRE, 10Release-Engineering-Team, 10Security-Team, 10Wikimedia-GitHub, and 3 others: Add github.com/wikimedia as an SCM for Semgrep Cloud - 
https://phabricator.wikimedia.org/T337561 (10sbassett) [17:26:09] (03PS2) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/924135 (https://phabricator.wikimedia.org/T337464) (owner: 10PipelineBot) [17:33:33] (03PS3) 10Hokwelum: create mount point dir for dumps test nfs share [puppet] - 10https://gerrit.wikimedia.org/r/924558 (https://phabricator.wikimedia.org/T325232) [17:38:36] (03PS4) 10Hokwelum: create mount point dir for dumps test nfs share [puppet] - 10https://gerrit.wikimedia.org/r/924558 (https://phabricator.wikimedia.org/T325232) [17:42:20] (03CR) 10ArielGlenn: [C: 03+2] create mount point dir for dumps test nfs share [puppet] - 10https://gerrit.wikimedia.org/r/924558 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [17:45:29] !log re-enabling puppet on contint2001 [17:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:06] 10ops-drmrs, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus6001 - https://phabricator.wikimedia.org/T335588 (10RobH) 05Open→03Resolved So VMs don't need/warrant a hardware decom ticket, resolving. [17:55:08] 10SRE, 10ops-ulsfo, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus4001 - https://phabricator.wikimedia.org/T335585 (10RobH) 05Open→03Resolved So VMs don't need/warrant a hardware decom ticket, resolving. [17:56:50] (03PS3) 10Urbanecm: [Growth] Enable new Impact for 10 additional wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924060 (https://phabricator.wikimedia.org/T336203) [17:58:48] (03PS2) 10Dzahn: releases: Ensure rsync jobs get removed on the non-active machine [puppet] - 10https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [17:59:08] (03CR) 10Dzahn: "nitpick: please start commit message with the module name, so like "releases: " in this case." 
[puppet] - 10https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [18:00:05] dduvall and ^demon: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1800) [18:01:48] (03PS1) 10Fabfur: run-puppet-restart-varnish: Add dry_run support to check function [cookbooks] - 10https://gerrit.wikimedia.org/r/924590 (https://phabricator.wikimedia.org/T323557) [18:04:08] (03CR) 10CI reject: [V: 04-1] run-puppet-restart-varnish: Add dry_run support to check function [cookbooks] - 10https://gerrit.wikimedia.org/r/924590 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [18:06:41] (03CR) 10Dzahn: [C: 04-1] "This is flipped around from what it should be (and how it was on doc hosts, which is why this is confusing). What is supposed to happen he" [puppet] - 10https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [18:07:44] (03PS20) 10Eevans: cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [18:08:13] (03CR) 10CI reject: [V: 04-1] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [18:10:09] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/921244/41426/" [puppet] - 10https://gerrit.wikimedia.org/r/921244 (owner: 10EoghanGaffney) [18:10:59] (03CR) 10Dzahn: [C: 03+2] "should have linked to T336168.. oh well." 
[puppet] - 10https://gerrit.wikimedia.org/r/921244 (owner: 10EoghanGaffney) [18:11:37] (03PS2) 10Fabfur: run-puppet-restart-varnish: Add dry_run support to check function [cookbooks] - 10https://gerrit.wikimedia.org/r/924590 (https://phabricator.wikimedia.org/T323557) [18:13:48] (03CR) 10Dzahn: [C: 03+2] "this was no-op on all hosts except doc1003, there it changed IPs to host names in ferm rules. ferm reloaded just fine. no issues." [puppet] - 10https://gerrit.wikimedia.org/r/921244 (owner: 10EoghanGaffney) [18:14:04] (03CR) 10CI reject: [V: 04-1] run-puppet-restart-varnish: Add dry_run support to check function [cookbooks] - 10https://gerrit.wikimedia.org/r/924590 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [18:18:33] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924591 (https://phabricator.wikimedia.org/T337525) [18:18:35] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924591 (https://phabricator.wikimedia.org/T337525) (owner: 10TrainBranchBot) [18:19:21] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924591 (https://phabricator.wikimedia.org/T337525) (owner: 10TrainBranchBot) [18:23:51] (03PS3) 10Fabfur: run-puppet-restart-varnish: Add dry_run support to check function [cookbooks] - 10https://gerrit.wikimedia.org/r/924590 (https://phabricator.wikimedia.org/T323557) [18:27:15] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.11 refs T337525 [18:27:20] T337525: 1.41.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T337525 [18:30:12] (03CR) 10JHathaway: puppetmaster: add new function to check for local files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond) [18:30:28] (03PS1) 10Hokwelum: README updated with info on 
how to create the mount point subdir [puppet] - 10https://gerrit.wikimedia.org/r/924592 [18:31:32] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/924517 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [18:36:26] (03CR) 10Majavah: puppetmaster: add new function to check for local files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond) [18:40:20] (03CR) 10Dzahn: "for some reason I can access the Internet without going through PROXY and it's not obvious in ferm rules why.. it is in iptables -L though" [puppet] - 10https://gerrit.wikimedia.org/r/902513 (owner: 10Dzahn) [18:41:14] RECOVERY - MariaDB Replica Lag: s3 on clouddb1017 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:41:48] RECOVERY - MariaDB Replica IO: s3 on clouddb1017 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:42:09] (03PS2) 10Hokwelum: README updated with info on how to create the mount point subdir [puppet] - 10https://gerrit.wikimedia.org/r/924592 [18:43:25] (03PS3) 10Hokwelum: README updated with info on how to create the mount point subdir [puppet] - 10https://gerrit.wikimedia.org/r/924592 (https://phabricator.wikimedia.org/T325232) [18:43:30] RECOVERY - MariaDB Replica IO: s1 on clouddb1013 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:52:18] (03PS4) 10Hokwelum: README updated with info on how to create the mount point subdir [puppet] - 10https://gerrit.wikimedia.org/r/924592 (https://phabricator.wikimedia.org/T325232) [18:57:03] (03CR) 10JHathaway: puppetmaster: add new function to check for local files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond) 
[18:57:05] (03PS1) 10BBlack: [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [18:57:27] (03CR) 10CI reject: [V: 04-1] [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [18:57:35] (03CR) 10Jdrewniak: [C: 03+1] Turn on A/B Test Hebrew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924536 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [18:59:28] (03PS5) 10Hokwelum: README updated with info on how to create the dumps test nfs mount point subdir [puppet] - 10https://gerrit.wikimedia.org/r/924592 (https://phabricator.wikimedia.org/T325232) [19:02:06] (03CR) 10ArielGlenn: [C: 03+2] README updated with info on how to create the dumps test nfs mount point subdir [puppet] - 10https://gerrit.wikimedia.org/r/924592 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [19:02:16] (03PS2) 10BBlack: [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [19:05:13] (03CR) 10CI reject: [V: 04-1] [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [19:06:14] (03PS3) 10BBlack: [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [19:09:10] (03CR) 10CI reject: [V: 04-1] [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [19:10:46] (03CR) 10Brennen Bearnes: gitlab: sync all configured providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [19:11:16] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 
[19:11:53] (03CR) 10Brennen Bearnes: [C: 03+1] "Typos aside, +1 for idea. Seems fine." [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [19:12:50] !log [WDQS Deploy] Deploying version 0.3.124 [19:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:08] RECOVERY - MariaDB Replica Lag: s1 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 51.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:14:48] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:16:22] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:18:22] inflatador, ryankemper: issue with the WDQS deployment? Need help? ^^ [19:19:20] gehel looking into it now [19:21:52] (03PS1) 10Reedy: Revert "Temporarily disable UCoC link from non tech wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) [19:21:59] (03CR) 10CI reject: [V: 04-1] Revert "Temporarily disable UCoC link from non tech wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [19:22:17] (03CR) 10Reedy: [C: 04-2] "Needs rebase..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [19:24:24] (03PS2) 10Reedy: Revert "Temporarily disable UCoC link from non tech wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) [19:24:28] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2021.* [19:24:36] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:25:48] PROBLEM - Check systemd state on wdqs2009 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:20] RECOVERY - SSH on wdqs2009 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:26:25] (03CR) 10Reedy: "https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/698076 needs to land first too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [19:26:28] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.207 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:27:20] RECOVERY - Check systemd state on wdqs2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:52] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 16m 36s) [19:27:54] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [19:28:03] !log bking@deploy1002 Started 
deploy [wdqs/wdqs@dff41b7]: 0.3.124 [19:28:12] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:28:58] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 54s) [19:30:50] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.220 second response time https://wikitech.wikimedia.org/wiki/Docker [19:32:31] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [19:33:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:34:09] (03PS2) 10Samtar: Turn on A/B Test Hebrew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924536 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [19:35:28] Yeah WDQS looks fine now. Fixed a host that was pooled=false instead of inactive, and rebooted wdqs2009 which was ssh unresponsive. 
[19:35:37] (03PS4) 10BBlack: [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [19:35:59] (03CR) 10CI reject: [V: 04-1] [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [19:36:28] (03CR) 10Muehlenhoff: Setup debmonitor2003 as bookworm debmonitor VM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924517 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [19:36:33] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 04m 02s) [19:38:07] (03PS5) 10BBlack: [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [19:41:08] (03PS6) 10BBlack: [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [19:43:08] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [19:44:25] (03PS7) 10BBlack: [WIP] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [19:48:39] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@cd667c2]: Deplot Iceberg version of referrer_daily on analytics Airflow instance. T335305. [19:48:46] T335305: Migrate referrer_daily to Iceberg - https://phabricator.wikimedia.org/T335305 [19:48:49] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@cd667c2]: Deplot Iceberg version of referrer_daily on analytics Airflow instance. T335305. 
(duration: 00m 09s) [19:49:12] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:49:54] (03PS1) 10Ladsgroup: Add WANCache to ParserOutputPageProperties::finalize [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924568 (https://phabricator.wikimedia.org/T336698) [19:51:04] (03PS1) 10Ladsgroup: Add WANCache to ParserOutputPageProperties::finalize [extensions/CirrusSearch] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924569 (https://phabricator.wikimedia.org/T336698) [19:51:28] (03PS8) 10BBlack: pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [19:51:38] jouncebot: nowandnext [19:51:38] For the next 0 hour(s) and 8 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T1800) [19:51:38] In 0 hour(s) and 8 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T2000) [19:53:46] (03CR) 10BBlack: [C: 03+1] "This looks right to me: https://puppet-compiler.wmflabs.org/output/924593/41437/lvs4010.ulsfo.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [20:00:07] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T2000). nyaa~ [20:00:07] kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] * TheresNoTime can deploy! [20:00:21] hello. 
ty [20:00:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924536 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [20:00:57] (03PS1) 10BBlack: [WIP] safe-service-restart: use failover i13n [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) [20:01:26] (03Merged) 10jenkins-bot: Turn on A/B Test Hebrew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924536 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [20:01:55] !log samtar@deploy1002 Started scap: Backport for [[gerrit:924536|Turn on A/B Test Hebrew (T336969)]] [20:02:01] T336969: [Zebra AB test] Fix the mixing of global and user IDs for AB Test Enrollment Bucketing - https://phabricator.wikimedia.org/T336969 [20:02:28] (03CR) 10CI reject: [V: 04-1] [WIP] safe-service-restart: use failover i13n [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [20:03:35] !log samtar@deploy1002 ksarabia and samtar: Backport for [[gerrit:924536|Turn on A/B Test Hebrew (T336969)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:03:36] (03PS2) 10BBlack: [WIP] safe-service-restart: use failover i13n [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) [20:03:38] kimberly_sarabia: live on mwdebug, can you test? 
[20:03:45] sure one moment [20:03:55] TheresNoTime: please let me know once you're done [20:04:04] Amir1: will do [20:05:05] TheresNoTime: LGTM [20:05:10] syncing [20:09:08] (03PS9) 10BBlack: pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [20:09:10] (03PS3) 10BBlack: safe-service-restart: use failover i13n [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) [20:10:42] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:924536|Turn on A/B Test Hebrew (T336969)]] (duration: 08m 46s) [20:10:44] kimberly_sarabia: live in prod :) [20:10:47] T336969: [Zebra AB test] Fix the mixing of global and user IDs for AB Test Enrollment Bucketing - https://phabricator.wikimedia.org/T336969 [20:10:55] Amir1: all yours [20:11:19] awesome [20:11:19] TheresNoTime: great. tysm [20:11:27] (03CR) 10Ladsgroup: [C: 03+2] Add WANCache to ParserOutputPageProperties::finalize [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924568 (https://phabricator.wikimedia.org/T336698) (owner: 10Ladsgroup) [20:12:11] !log bking@wdqs2009 depool wdqs2009 until it catches up with lag [20:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:47] (03PS1) 10Jforrester: linker: Check for null parser in Linker::makeThumbLink2 [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924570 (https://phabricator.wikimedia.org/T337794) [20:23:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924568 (https://phabricator.wikimedia.org/T336698) (owner: 10Ladsgroup) [20:23:44] 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574 (10bking) [20:24:00] 10SRE-OnFire, 
10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 2 others: Update WDQS Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (10bking) [20:29:55] (03CR) 10Urbanecm: [C: 03+1] "can be deployed now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924060 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [20:30:05] (03Merged) 10jenkins-bot: Add WANCache to ParserOutputPageProperties::finalize [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924568 (https://phabricator.wikimedia.org/T336698) (owner: 10Ladsgroup) [20:30:31] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:924568|Add WANCache to ParserOutputPageProperties::finalize (T336698)]] [20:30:37] T336698: Reduce the load of CirrusSearch update jobs on MW jobrunners - https://phabricator.wikimedia.org/T336698 [20:32:00] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:924568|Add WANCache to ParserOutputPageProperties::finalize (T336698)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:33:42] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [20:34:07] (03PS9) 10Jsn.sherman: beta: log additional click events on Special:Diff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) [20:34:56] (03PS10) 10Jsn.sherman: beta: log additional click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) [20:35:36] (03Abandoned) 10Jsn.sherman: Log additional click events on Special:MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899725 (https://phabricator.wikimedia.org/T326216) (owner: 10Jsn.sherman) [20:37:19] (03CR) 10Ladsgroup: [C: 03+2] Add WANCache to 
ParserOutputPageProperties::finalize [extensions/CirrusSearch] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924569 (https://phabricator.wikimedia.org/T336698) (owner: 10Ladsgroup) [20:39:59] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:924568|Add WANCache to ParserOutputPageProperties::finalize (T336698)]] (duration: 09m 27s) [20:40:04] T336698: Reduce the load of CirrusSearch update jobs on MW jobrunners - https://phabricator.wikimedia.org/T336698 [20:40:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924569 (https://phabricator.wikimedia.org/T336698) (owner: 10Ladsgroup) [20:49:49] (03CR) 10Eevans: [C: 03+2] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [20:51:03] (03CR) 10Eevans: [V: 03+2 C: 03+2] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [20:56:36] (03Merged) 10jenkins-bot: Add WANCache to ParserOutputPageProperties::finalize [extensions/CirrusSearch] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924569 (https://phabricator.wikimedia.org/T336698) (owner: 10Ladsgroup) [20:57:01] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:924569|Add WANCache to ParserOutputPageProperties::finalize (T336698)]] [20:57:07] T336698: Reduce the load of CirrusSearch update jobs on MW jobrunners - https://phabricator.wikimedia.org/T336698 [20:58:36] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:924569|Add WANCache to ParserOutputPageProperties::finalize (T336698)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:09:46] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:924569|Add 
WANCache to ParserOutputPageProperties::finalize (T336698)]] (duration: 12m 44s) [21:15:49] jouncebot: nowandnext [21:15:50] No deployments scheduled for the next 8 hour(s) and 44 minute(s) [21:15:50] In 8 hour(s) and 44 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T0600) [21:27:42] Amir1: have you finished deploying? I may backport 924570 for T337794 [21:27:43] T337794: Error: Call to a member function getOutput() on null - https://phabricator.wikimedia.org/T337794 [21:28:08] TheresNoTime: I am [21:28:10] have fun [21:28:26] :) [21:29:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924570 (https://phabricator.wikimedia.org/T337794) (owner: 10Jforrester) [21:48:15] (03Merged) 10jenkins-bot: linker: Check for null parser in Linker::makeThumbLink2 [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924570 (https://phabricator.wikimedia.org/T337794) (owner: 10Jforrester) [21:48:45] !log samtar@deploy1002 Started scap: Backport for [[gerrit:924570|linker: Check for null parser in Linker::makeThumbLink2 (T337794)]] [21:48:51] T337794: Error: Call to a member function getOutput() on null - https://phabricator.wikimedia.org/T337794 [21:50:22] !log samtar@deploy1002 jforrester and samtar: Backport for [[gerrit:924570|linker: Check for null parser in Linker::makeThumbLink2 (T337794)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:50:26] * TheresNoTime testing [21:50:38] 10SRE-OnFire, 10Discovery-Search, 10Sustainability: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater - https://phabricator.wikimedia.org/T337801 (10bking) [21:51:06] * TheresNoTime syncing [21:51:49] 10SRE-OnFire, 10Discovery-Search, 10Sustainability: WDQS: Document procedure for switching between 
Kubernetes and Yarn Streaming Updater - https://phabricator.wikimedia.org/T337801 (10bking) [21:56:22] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:56:34] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:924570|linker: Check for null parser in Linker::makeThumbLink2 (T337794)]] (duration: 07m 48s) [21:56:39] T337794: Error: Call to a member function getOutput() on null - https://phabricator.wikimedia.org/T337794 [21:57:48] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [22:48:35] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10Dzahn) Is there a place where we can read about this project and the general plan around it? [22:58:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:07:55] (03PS2) 10Cwhite: hiera: disable security plugin on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/912391 (https://phabricator.wikimedia.org/T333732) [23:12:39] (03PS1) 10Dzahn: planet: restrict firewall source range for port 443 to envoy [puppet] - 10https://gerrit.wikimedia.org/r/924604 [23:18:33] (03CR) 10Dzahn: [C: 04-1] "maybe just flip the "bool2str('present', 'absent')" around and call it "$ensure_not_on_active"? or !$ensure_on_active ?" 
[puppet] - https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: EoghanGaffney) [23:27:23] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/924604/41440/" [puppet] - https://gerrit.wikimedia.org/r/924604 (owner: Dzahn) [23:27:37] (PS2) Dzahn: planet: restrict firewall source range for port 443 to envoy [puppet] - https://gerrit.wikimedia.org/r/924604 [23:28:30] jouncebot: nowandnext [23:28:30] No deployments scheduled for the next 6 hour(s) and 31 minute(s) [23:28:30] In 6 hour(s) and 31 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T0600) [23:28:38] (PS2) Zabe: Start reading from rev_comment_id in group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/924564 (https://phabricator.wikimedia.org/T299954) [23:28:40] (CR) Dzahn: "Haven't checked yet but seems like this might affect a bunch of other hosts too." [puppet] - https://gerrit.wikimedia.org/r/924604 (owner: Dzahn) [23:28:44] (CR) Zabe: [C: +2] Start reading from rev_comment_id in group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/924564 (https://phabricator.wikimedia.org/T299954) (owner: Zabe) [23:29:31] (Merged) jenkins-bot: Start reading from rev_comment_id in group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/924564 (https://phabricator.wikimedia.org/T299954) (owner: Zabe) [23:30:01] !log zabe@deploy1002 Started scap: Backport for [[gerrit:924564|Start reading from rev_comment_id in group1 wikis (T299954)]] [23:30:10] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [23:31:32] !log zabe@deploy1002 zabe: Backport for [[gerrit:924564|Start reading from rev_comment_id in group1 wikis (T299954)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [23:31:38] (PS1) Zabe: Start
reading from rev_comment_id everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/924605 (https://phabricator.wikimedia.org/T299954) [23:38:02] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:924564|Start reading from rev_comment_id in group1 wikis (T299954)]] (duration: 08m 00s) [23:38:07] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [23:49:12] (PS1) Dzahn: gerrit/bacula: adjust Gerrit file paths to be backed up [puppet] - https://gerrit.wikimedia.org/r/924608 (https://phabricator.wikimedia.org/T336427) [23:50:03] (PS2) Dzahn: gerrit/bacula: adjust Gerrit file paths to be backed up [puppet] - https://gerrit.wikimedia.org/r/924608 (https://phabricator.wikimedia.org/T336427) [23:54:05] (CR) Dzahn: "/var/lib/gerrit2 which is not currently backed up contains all the .h2 databases like:" [puppet] - https://gerrit.wikimedia.org/r/924608 (https://phabricator.wikimedia.org/T336427) (owner: Dzahn)