[00:04:13] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1019784 (owner: 10TrainBranchBot) [00:13:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 871.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:15:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.084s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:21:15] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720987 (10Papaul) [00:23:42] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9721003 (10Papaul) @Jhancock.wm anything else left to be done on this task? 
[00:23:44] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720988 (10Papaul) 05Open→03Resolved Complete [00:35:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 810.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:37:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 864.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:38:12] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9721040 (10ssingh) Thanks for the task @RobH! As in the previous runs, please feel free to leave these for Traffic: ` Update the operations/puppet repo - this should include updates to preseed.ya... 
[00:41:55] (03CR) 10Dzahn: [C:03+1] phabricator: Switch certificate generation to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [00:42:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 828.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:45:38] (03CR) 10Dzahn: [C:03+1] "puppet won't delete the old certs on the host, so a revert would only be editing the envoy config to point back to the old cert location a" [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [02:02:02] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020427 (https://phabricator.wikimedia.org/T219903) [02:02:38] (03CR) 10Ryan Kemper: [C:03+1] site.pp: move elastic2088 back into production [puppet] - 10https://gerrit.wikimedia.org/r/1020375 (https://phabricator.wikimedia.org/T361525) (owner: 10Bking) [02:02:39] (03CR) 10Ryan Kemper: [C:03+2] site.pp: move elastic2088 back into production [puppet] - 10https://gerrit.wikimedia.org/r/1020375 (https://phabricator.wikimedia.org/T361525) (owner: 10Bking) [02:04:40] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020427 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [02:05:15] (03CR) 10DDesouza: [V:03+2 C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020427 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [02:05:53] (03Merged) 10jenkins-bot: 
miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020427 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [02:06:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:12:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:31:31] (Traffic bill over quota) firing: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [02:36:31] (Traffic bill over quota) firing: (3) Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [02:38:29] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 839.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:42:50] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [02:43:06] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply 
[02:43:07] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [02:43:29] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:33] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [02:43:34] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [02:43:57] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [02:46:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 826.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:48:31] !log T361525 Trying to powercycle `elastic2088` thru mgmt port (host not responding to ssh) [02:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:36] T361525: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525 [02:51:31] (Traffic bill over quota) firing: (3) Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [02:52:11] (03PS1) 10Ryan Kemper: Revert "site.pp: move elastic2088 back into production" [puppet] - 10https://gerrit.wikimedia.org/r/1020238 [02:52:33] (03PS2) 10Ryan Kemper: Revert "site.pp: move elastic2088 back into production" [puppet] - 10https://gerrit.wikimedia.org/r/1020238 (https://phabricator.wikimedia.org/T361525) [02:53:41] 10ops-eqiad, 06SRE: Inbound interface errors - https://phabricator.wikimedia.org/T362366#9721140 (10phaultfinder) [02:54:04] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T352010)', diff saved to https://phabricator.wikimedia.org/P60695 and previous config saved to /var/cache/conftool/dbconfig/20240417-025403-ladsgroup.json [02:54:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [02:54:46] (03CR) 10Ryan Kemper: [C:03+2] Revert "site.pp: move elastic2088 back into production" [puppet] - 10https://gerrit.wikimedia.org/r/1020238 (https://phabricator.wikimedia.org/T361525) (owner: 10Ryan Kemper) [02:56:31] (Traffic bill over quota) resolved: (2) Alert for device cr2-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [02:58:29] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 800.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:06:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 800.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:06:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:09:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P60696 and previous config saved to /var/cache/conftool/dbconfig/20240417-030911-ladsgroup.json [03:23:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 917.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:24:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P60697 and previous config saved to /var/cache/conftool/dbconfig/20240417-032418-ladsgroup.json [03:33:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 802.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:39:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T352010)', diff saved to https://phabricator.wikimedia.org/P60698 and previous config saved to /var/cache/conftool/dbconfig/20240417-033926-ladsgroup.json [03:39:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [03:39:32] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:39:41] !log 
ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [03:39:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P60699 and previous config saved to /var/cache/conftool/dbconfig/20240417-033948-ladsgroup.json [03:43:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 924.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:53:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 806.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:00:17] (03PS6) 10JHathaway: Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) [04:03:11] (03CR) 10CI reject: [V:04-1] Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [04:38:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance [04:39:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance [04:40:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P60700 and previous config saved to 
/var/cache/conftool/dbconfig/20240417-044015-ladsgroup.json [04:40:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [04:44:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [04:45:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [04:45:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T361627)', diff saved to https://phabricator.wikimedia.org/P60701 and previous config saved to /var/cache/conftool/dbconfig/20240417-044517-marostegui.json [04:45:22] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [04:50:59] 10ops-codfw, 06SRE, 06DBA, 10decommission-hardware, 13Patch-For-Review: decommission db2100.codfw.wmnet - https://phabricator.wikimedia.org/T361584#9721220 (10Marostegui) This host wasn't removed from zarcillo - I have done so. 
[04:51:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T361627)', diff saved to https://phabricator.wikimedia.org/P60702 and previous config saved to /var/cache/conftool/dbconfig/20240417-045130-marostegui.json [04:51:36] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [04:53:50] (03PS1) 10Marostegui: db2182: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1020466 [04:53:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2182', diff saved to https://phabricator.wikimedia.org/P60703 and previous config saved to /var/cache/conftool/dbconfig/20240417-045353-root.json [04:54:47] (03CR) 10Marostegui: [C:03+2] db2182: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1020466 (owner: 10Marostegui) [04:55:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2182.codfw.wmnet with OS bookworm [04:55:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P60704 and previous config saved to /var/cache/conftool/dbconfig/20240417-045522-ladsgroup.json [04:59:58] !log dbmaint Upgrade s7 codfw to Bookworm and MariaDB 10.6 T362745 [05:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:20] T362745: Upgrade s7 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362745 [05:05:51] !log Rename machine_vision tables on db1249 eqiad dbmaint s4 T362229 [05:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:56] T362229: Drop MachineVision tables from beta and production - https://phabricator.wikimedia.org/T362229 [05:06:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P60705 and previous 
config saved to /var/cache/conftool/dbconfig/20240417-050638-marostegui.json [05:10:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P60706 and previous config saved to /var/cache/conftool/dbconfig/20240417-051029-ladsgroup.json [05:12:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2182.codfw.wmnet with reason: host reimage [05:15:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2182.codfw.wmnet with reason: host reimage [05:21:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P60707 and previous config saved to /var/cache/conftool/dbconfig/20240417-052145-marostegui.json [05:22:22] (03PS1) 10Marostegui: Revert "db2182: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1020239 [05:25:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P60708 and previous config saved to /var/cache/conftool/dbconfig/20240417-052537-ladsgroup.json [05:25:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [05:25:42] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:25:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [05:26:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T352010)', diff saved to https://phabricator.wikimedia.org/P60709 and previous config saved to /var/cache/conftool/dbconfig/20240417-052600-ladsgroup.json [05:31:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 1%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P60710 and previous config saved to /var/cache/conftool/dbconfig/20240417-053131-root.json [05:31:39] (03CR) 10Marostegui: [C:03+2] Revert "db2182: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1020239 (owner: 10Marostegui) [05:33:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 957.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:34:15] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9721277 (10Papaul) @ssingh After 2 days working on this issue, I finally got to the bottom of the problem. After many reboots on cp11... 
[05:35:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2182.codfw.wmnet with OS bookworm [05:36:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T361627)', diff saved to https://phabricator.wikimedia.org/P60711 and previous config saved to /var/cache/conftool/dbconfig/20240417-053653-marostegui.json [05:36:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [05:36:58] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [05:37:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [05:37:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T361627)', diff saved to https://phabricator.wikimedia.org/P60712 and previous config saved to /var/cache/conftool/dbconfig/20240417-053716-marostegui.json [05:43:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 809ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:43:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T361627)', diff saved to https://phabricator.wikimedia.org/P60713 and previous config saved to /var/cache/conftool/dbconfig/20240417-054333-marostegui.json [05:43:39] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [05:46:38] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60714 and previous config saved to /var/cache/conftool/dbconfig/20240417-054637-root.json [05:56:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:58:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P60715 and previous config saved to /var/cache/conftool/dbconfig/20240417-055841-marostegui.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T0600) [06:01:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60716 and previous config saved to /var/cache/conftool/dbconfig/20240417-060143-root.json [06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:12:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:13:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P60717 and previous config saved to /var/cache/conftool/dbconfig/20240417-061349-marostegui.json [06:16:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60718 and previous config saved to /var/cache/conftool/dbconfig/20240417-061649-root.json [06:25:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 820.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:27:15] (03Abandoned) 10Hashar: Increase default thumbnail display size from 220px to 300px [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154408 (owner: 10Jforrester) [06:27:29] (03Abandoned) 10Hashar: Match 'editcontentmodel' permission with 'move' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309066 (https://phabricator.wikimedia.org/T85847) (owner: 10Legoktm) [06:28:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T361627)', diff saved to https://phabricator.wikimedia.org/P60719 and previous config saved to /var/cache/conftool/dbconfig/20240417-062856-marostegui.json [06:28:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [06:29:03] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively 
on WMF wikis - https://phabricator.wikimedia.org/T361627 [06:29:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [06:29:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T361627)', diff saved to https://phabricator.wikimedia.org/P60720 and previous config saved to /var/cache/conftool/dbconfig/20240417-062918-marostegui.json [06:30:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 807.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:31:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60721 and previous config saved to /var/cache/conftool/dbconfig/20240417-063155-root.json [06:34:46] (03PS2) 10Hashar: logging: pluralize $wmgDefaultMonologHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019267 (https://phabricator.wikimedia.org/T238838) [06:34:46] (03PS3) 10Hashar: logging: always register udp2log handlers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019253 (https://phabricator.wikimedia.org/T228838) [06:35:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T361627)', diff saved to https://phabricator.wikimedia.org/P60722 and previous config saved to /var/cache/conftool/dbconfig/20240417-063537-marostegui.json [06:35:45] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [06:40:46] (03PS3) 10Anzx: mlwiki: create draft namespace [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) [06:47:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60723 and previous config saved to /var/cache/conftool/dbconfig/20240417-064700-root.json [06:50:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P60724 and previous config saved to /var/cache/conftool/dbconfig/20240417-065044-marostegui.json [06:53:29] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:05] Amir1 and Urbanecm: #bothumor Q: Why did functions stop calling each other? A: They had arguments. Rise for UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T0700). [07:00:05] anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:02:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60725 and previous config saved to /var/cache/conftool/dbconfig/20240417-070206-root.json [07:05:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P60726 and previous config saved to /var/cache/conftool/dbconfig/20240417-070552-marostegui.json [07:05:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [07:09:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1019887 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [07:10:30] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete restbase discovery cert [puppet] - 10https://gerrit.wikimedia.org/r/1020258 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [07:15:42] (03PS1) 10Muehlenhoff: Remove obsolete stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1020624 (https://phabricator.wikimedia.org/T360636) [07:18:27] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1020624 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [07:21:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T361627)', diff saved to https://phabricator.wikimedia.org/P60727 and previous config saved to /var/cache/conftool/dbconfig/20240417-072059-marostegui.json [07:21:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [07:21:05] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [07:21:15] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [07:21:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2150', diff saved to https://phabricator.wikimedia.org/P60728 and previous config saved to /var/cache/conftool/dbconfig/20240417-072115-root.json [07:21:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T361627)', diff saved to https://phabricator.wikimedia.org/P60729 and previous config saved to /var/cache/conftool/dbconfig/20240417-072122-marostegui.json [07:21:48] (03PS1) 10Marostegui: db2150: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1020625 [07:22:22] (03CR) 10Marostegui: [C:03+2] db2150: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1020625 (owner: 10Marostegui) [07:22:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2150.codfw.wmnet with OS bookworm [07:26:24] !log restart db1240 database for mariadb upgrade [07:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:27] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2214.codfw.wmnet [07:27:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T361627)', diff saved to https://phabricator.wikimedia.org/P60730 and previous config saved to /var/cache/conftool/dbconfig/20240417-072733-marostegui.json [07:27:38] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [07:30:00] (03PS1) 10Jcrespo: mariadbd: Upgrade mariadb package on db1240 from 10.4 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1020685 (https://phabricator.wikimedia.org/T360751) [07:31:13] (03CR) 10Jcrespo: [C:03+2] mariadbd: Upgrade mariadb package on db1240 from 10.4 to 10.6 [puppet] - 
10https://gerrit.wikimedia.org/r/1020685 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [07:32:30] (03PS1) 10Muehlenhoff: Switch db2214 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020691 (https://phabricator.wikimedia.org/T349619) [07:33:35] (03CR) 10Muehlenhoff: [C:03+2] Switch db2214 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020691 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:35:56] (03PS1) 10Fabfur: benthos: added some labels [puppet] - 10https://gerrit.wikimedia.org/r/1020692 (https://phabricator.wikimedia.org/T358109) [07:37:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2214.codfw.wmnet [07:38:21] (03PS1) 10Jcrespo: mariadb: Upgrade mariadb package on db1216 from 10.4 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1020693 (https://phabricator.wikimedia.org/T360751) [07:38:53] !log restart db1216 database for mariadb upgrade [07:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:13] !log analytics/refinery deploy begin (added source jars 0.2.35) [07:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2150.codfw.wmnet with reason: host reimage [07:39:54] !log aqu@deploy1002 Started deploy [analytics/refinery@c4e197f]: Regular analytics weekly train [analytics/refinery@c4e197fa] [07:40:30] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade mariadb package on db1216 from 10.4 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1020693 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [07:40:48] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1173.eqiad.wmnet [07:41:50] (03PS1) 10Muehlenhoff: Switch db1173 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020694 (https://phabricator.wikimedia.org/T349619) [07:42:41] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P60731 and previous config saved to /var/cache/conftool/dbconfig/20240417-074241-marostegui.json [07:42:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2150.codfw.wmnet with reason: host reimage [07:45:45] (03CR) 10Muehlenhoff: [C:03+2] Switch db1173 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020694 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:45:48] some ulsfo routers went unreachable [07:46:40] (03PS1) 10Fabfur: benthos: fix check for possible empty values [puppet] - 10https://gerrit.wikimedia.org/r/1020695 (https://phabricator.wikimedia.org/T358109) [07:49:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1173.eqiad.wmnet [07:54:06] (03PS1) 10Marostegui: Revert "db2150: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1020243 [07:55:36] (03CR) 10Filippo Giunchedi: benthos: added some labels (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1020692 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [07:57:40] (03CR) 10Filippo Giunchedi: [C:03+1] benthos: fix check for possible empty values [puppet] - 10https://gerrit.wikimedia.org/r/1020695 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [07:57:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P60732 and previous config saved to /var/cache/conftool/dbconfig/20240417-075748-marostegui.json [07:58:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60733 and previous config saved to /var/cache/conftool/dbconfig/20240417-075850-root.json [07:58:57] (03CR) 10Marostegui: [C:03+2] Revert "db2150: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/1020243 (owner: 10Marostegui) [08:00:47] !log jayme@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubestage2002.codfw.wmnet [08:03:10] (03PS2) 10Fabfur: benthos: added some labels [puppet] - 10https://gerrit.wikimedia.org/r/1020692 (https://phabricator.wikimedia.org/T358109) [08:03:23] (03CR) 10Fabfur: benthos: added some labels (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1020692 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [08:03:36] (03PS1) 10Jcrespo: site.pp: Reorder backup sources by server name and update comments [puppet] - 10https://gerrit.wikimedia.org/r/1020697 (https://phabricator.wikimedia.org/T360751) [08:03:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2150.codfw.wmnet with OS bookworm [08:05:34] (03CR) 10Filippo Giunchedi: [C:03+1] benthos: added some labels [puppet] - 10https://gerrit.wikimedia.org/r/1020692 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [08:07:51] !log aqu@deploy1002 Finished deploy [analytics/refinery@c4e197f]: Regular analytics weekly train [analytics/refinery@c4e197fa] (duration: 27m 57s) [08:10:17] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2002.codfw.wmnet [08:12:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T361627)', diff saved to https://phabricator.wikimedia.org/P60734 and previous config saved to /var/cache/conftool/dbconfig/20240417-081256-marostegui.json [08:13:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance [08:13:01] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:13:03] !log aqu@deploy1002 Started deploy [analytics/refinery@c4e197f] (thin): Regular 
analytics weekly train THIN [analytics/refinery@c4e197fa] [08:13:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance [08:13:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:13:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:13:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T361627)', diff saved to https://phabricator.wikimedia.org/P60735 and previous config saved to /var/cache/conftool/dbconfig/20240417-081326-marostegui.json [08:13:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60736 and previous config saved to /var/cache/conftool/dbconfig/20240417-081356-root.json [08:15:34] (03CR) 10Jcrespo: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1020697/1964/" [puppet] - 10https://gerrit.wikimedia.org/r/1020697 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [08:16:42] !log aqu@deploy1002 Finished deploy [analytics/refinery@c4e197f] (thin): Regular analytics weekly train THIN [analytics/refinery@c4e197fa] (duration: 03m 39s) [08:19:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T361627)', diff saved to https://phabricator.wikimedia.org/P60737 and previous config saved to /var/cache/conftool/dbconfig/20240417-081953-marostegui.json [08:19:59] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:23:57] (03PS1) 10JMeybohm: kubernetes::node: Ensure apparmor profiles are loaded automatically [puppet] - 
10https://gerrit.wikimedia.org/r/1020700 (https://phabricator.wikimedia.org/T326785) [08:24:18] !log aqu@deploy1002 Started deploy [analytics/refinery@c4e197f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c4e197fa] [08:24:28] (03PS1) 10Kevin Bazira: ml-services: add logo-detection isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020706 (https://phabricator.wikimedia.org/T362749) [08:25:36] (03PS1) 10JMeybohm: wikifunction: Move apparmor annotation to pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020701 (https://phabricator.wikimedia.org/T326785) [08:26:41] !log aqu@deploy1002 Finished deploy [analytics/refinery@c4e197f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c4e197fa] (duration: 02m 23s) [08:27:00] (03CR) 10CI reject: [V:04-1] kubernetes::node: Ensure apparmor profiles are loaded automatically [puppet] - 10https://gerrit.wikimedia.org/r/1020700 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [08:27:46] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 18): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1965/c" [puppet] - 10https://gerrit.wikimedia.org/r/1020700 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [08:29:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60738 and previous config saved to /var/cache/conftool/dbconfig/20240417-082901-root.json [08:31:41] (03PS1) 10SimmeD: Updated wmf-config/InitialiseSettings.php by adding single space expected between "//" and comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020707 [08:33:20] (03CR) 10Fabfur: [C:03+2] benthos: added some labels [puppet] - 10https://gerrit.wikimedia.org/r/1020692 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [08:35:01] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P60739 and previous config saved to /var/cache/conftool/dbconfig/20240417-083501-marostegui.json [08:36:07] (03CR) 10Hashar: logging: always register udp2log handlers (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019253 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [08:37:29] jouncebot: owandnext [08:37:40] jouncebot: nowandnext [08:37:40] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [08:37:40] In 1 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T1000) [08:38:12] (03PS2) 10JMeybohm: kubernetes::node: Ensure apparmor profiles are loaded automatically [puppet] - 10https://gerrit.wikimedia.org/r/1020700 (https://phabricator.wikimedia.org/T326785) [08:38:19] (03CR) 10Hashar: [C:03+2] logging: pluralize $wmgDefaultMonologHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019267 (https://phabricator.wikimedia.org/T238838) (owner: 10Hashar) [08:39:09] (03Merged) 10jenkins-bot: logging: pluralize $wmgDefaultMonologHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019267 (https://phabricator.wikimedia.org/T238838) (owner: 10Hashar) [08:39:18] (03CR) 10Fabfur: [C:03+2] benthos: fix check for possible empty values [puppet] - 10https://gerrit.wikimedia.org/r/1020695 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [08:39:24] (03Abandoned) 10SimmeD: Updated wmf-config/InitialiseSettings.php by adding single space expected between "//" and comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020707 (owner: 10SimmeD) [08:39:51] (03PS3) 10JMeybohm: kubernetes::node: Ensure apparmor profiles are loaded automatically [puppet] - 10https://gerrit.wikimedia.org/r/1020700 (https://phabricator.wikimedia.org/T326785) [08:40:49] !log Deployed refinery using scap, then deployed onto hdfs [08:40:52] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:17] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1019267|logging: pluralize $wmgDefaultMonologHandler (T238838)]] [08:41:21] T238838: Disabling old AWB versions - https://phabricator.wikimedia.org/T238838 [08:41:44] hmm [08:41:46] wrong bug [08:42:57] (03CR) 10Hashar: [C:03+2] logging: pluralize $wmgDefaultMonologHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019267 (https://phabricator.wikimedia.org/T238838) (owner: 10Hashar) [08:44:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60741 and previous config saved to /var/cache/conftool/dbconfig/20240417-084407-root.json [08:44:13] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 14 NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/1020700 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [08:44:26] !log hashar@deploy1002 hashar: Backport for [[gerrit:1019267|logging: pluralize $wmgDefaultMonologHandler (T238838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:44:32] !log hashar@deploy1002 hashar: Continuing with sync [08:46:40] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9721680 (10MoritzMuehlenhoff) [08:50:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P60742 and previous config saved to /var/cache/conftool/dbconfig/20240417-085009-marostegui.json [08:55:01] (03PS3) 10Effie Mouzeli: mediawiki deployments: use mcrouter daemonset for both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020251 (https://phabricator.wikimedia.org/T346690) 
[08:57:54] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1019267|logging: pluralize $wmgDefaultMonologHandler (T238838)]] (duration: 16m 37s) [08:57:59] T238838: Disabling old AWB versions - https://phabricator.wikimedia.org/T238838 [08:58:21] (03CR) 10Clément Goubert: [C:03+1] mediawiki deployments: use mcrouter daemonset for both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020251 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [08:58:27] jouncebot: now [08:58:27] No deployments scheduled for the next 1 hour(s) and 1 minute(s) [08:59:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60743 and previous config saved to /var/cache/conftool/dbconfig/20240417-085912-root.json [09:01:23] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki deployments: use mcrouter daemonset for both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020251 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [09:02:48] (03Merged) 10jenkins-bot: mediawiki deployments: use mcrouter daemonset for both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020251 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [09:03:17] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [09:05:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T361627)', diff saved to https://phabricator.wikimedia.org/P60744 and previous config saved to /var/cache/conftool/dbconfig/20240417-090516-marostegui.json [09:05:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance [09:05:22] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:05:29] !log 
jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [09:05:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance [09:05:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T361627)', diff saved to https://phabricator.wikimedia.org/P60745 and previous config saved to /var/cache/conftool/dbconfig/20240417-090539-marostegui.json [09:08:42] !log jiji@deploy1002 Started scap: Switch mediawiki in eqiad to use node-local mcrouter ds - T346690 [09:08:47] T346690: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 [09:08:52] 06SRE, 06Infrastructure-Foundations, 10Mail, 10MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9721761 (10Xover) And now I just got a resend of a different email to a different user, originally sent on April 11. That’s something like two out of... [09:12:02] (03PS2) 10Msz2001: Only people who belong to 'editor' or 'sysop' groups will be able to publish translations directly. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020729 (https://phabricator.wikimedia.org/T362756) [09:12:06] effie: since you have pushed your change we get roughly 800 messages per minute stating "Duplicate get(): "{key}" fetched {count} times" from `objectcache` https://logstash.wikimedia.org/goto/d17686d0fd57c0a2c94dfc5348991efa [09:12:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T361627)', diff saved to https://phabricator.wikimedia.org/P60746 and previous config saved to /var/cache/conftool/dbconfig/20240417-091203-marostegui.json [09:12:12] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:12:42] that is for fetches of keys such as `wikidatawiki:MWSession:......` and I have no idea what it means [09:12:43] hashar: do we get the same on codfw ? [09:13:15] (03PS3) 10Msz2001: [plwiki] Limit Content Translation publishing to mainspace for non-editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020729 (https://phabricator.wikimedia.org/T362756) [09:13:27] (03CR) 10Hashar: [C:03+2] "I have confirmed we still have logs :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019267 (https://phabricator.wikimedia.org/T238838) (owner: 10Hashar) [09:13:33] hashar: lets give it a little time to see if it will stop [09:14:07] and we got a similar bump under the `session` channel https://logstash.wikimedia.org/goto/5335cc7943b94d988c7b4631c2e05e7e [09:14:17] I haven't looked which message exactly [09:14:17] it has slowed down already [09:14:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60747 and previous config saved to /var/cache/conftool/dbconfig/20240417-091418-root.json [09:15:21] Session "{session}": Metadata merge failed: {exception} [09:15:37] 
https://logstash.wikimedia.org/goto/745c3f29242f28c23867f6fc7d267b9e [09:16:07] those are exception being thrown though they get logged at warning level [09:17:33] they are not errors though [09:17:49] I have no idea about the impacts [09:17:59] I just found out the elevated logging as I was looking for something else [09:18:01] obviously they are related to the change, no doubt [09:18:15] (PHPFPMTooBusy) firing: (3) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 30.2% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:19:02] (03CR) 10Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [09:19:03] now that is an actual problem [09:20:10] (03PS1) 10Jcrespo: dbbackups: Setup dbprov1005 as new host to send s3 and s5 backups [puppet] - 10https://gerrit.wikimedia.org/r/1020750 (https://phabricator.wikimedia.org/T362509) [09:21:15] (MediaWikiLatencyExceeded) firing: (4) p75 latency high: eqiad mw-api-ext (k8s) 5.255s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:21:55] (MaxConntrack) firing: Max conntrack at 93.03% on kubernetes1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:21:57] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:22:34] (ProbeDown) firing: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:23:15] (PHPFPMTooBusy) firing: (3) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 17.66% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:23:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:23:54] effie: is this you? [09:24:27] jayme: yes, I am trying to understand why, let's go to sre [09:24:30] -sre [09:25:27] the authentication metrics are showing users are Login and the Central Login went down https://grafana.wikimedia.org/d/000000004/authentication-metrics?orgId=1 [09:25:30] so I guess something is broken [09:25:33] oh, I didn't realize it was a deploy [09:25:47] I will report on status page [09:25:55] and that more or less aligns with the elevated rates of logs in `session` and `objectcache` [09:25:59] just got "Request from 87.96.230.208 via cp3067.esams.wmnet, ATS/9.1.4 [09:25:59] Error: 502, Broken pipe at 2024-04-17 09:24:38 GMT" [09:26:15] jynus: this is me, on -sre [09:26:15] (MediaWikiLatencyExceeded) firing: (4) p75 latency high: eqiad mw-api-ext (k8s) 6.79s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:26:25] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:26:40] AzaTht: 
https://www.wikimediastatus.net/ [09:26:55] (MaxConntrack) resolved: Max conntrack at 100% on kubernetes1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:26:57] (ProbeDown) firing: (8) Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:27:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P60748 and previous config saved to /var/cache/conftool/dbconfig/20240417-092714-marostegui.json [09:27:34] (ProbeDown) firing: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:27:40] jynus: it was "All systems operational" when I looked right before posted ツ [09:27:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:27:53] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:28:15] (MediaWikiMemcachedHighErrorRate) firing: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:28:15] (PHPFPMTooBusy) firing: (5) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.67% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:28:51] (SwaggerProbeHasFailures) firing: (6) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:29:14] (03CR) 10Klausman: [C:03+1] ml-services: fix indentation in mistral model resources and increase memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018646 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [09:29:19] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1019726 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [09:29:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60749 and previous config saved to /var/cache/conftool/dbconfig/20240417-092923-root.json [09:29:36] (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [09:30:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:31:03] !log jiji@deploy1002 scap failed: KeyError 'production' (duration: 22m 
21s) [09:31:15] (MediaWikiLatencyExceeded) firing: (5) p75 latency high: eqiad mw-api-ext (k8s) 1.702s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:31:57] (ProbeDown) firing: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:32:34] (ProbeDown) firing: (16) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:32:53] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:33:15] (MediaWikiMemcachedHighErrorRate) firing: (2) MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:33:22] (03PS1) 10Arnaudb: mariadb: removes underscore on striker database name [puppet] - 10https://gerrit.wikimedia.org/r/1020709 (https://phabricator.wikimedia.org/T360149) [09:33:43] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [09:33:51] (SwaggerProbeHasFailures) firing: (7) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:34:36] (GatewayBackendErrorsHigh) firing: (2) rest-gateway: 
elevated 5xx errors from wikifeeds_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [09:34:51] (ATSBackendErrorsHigh) firing: (3) ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:35:00] (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes::node: Ensure apparmor profiles are loaded automatically [puppet] - 10https://gerrit.wikimedia.org/r/1020700 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [09:35:15] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:36:15] (MediaWikiLatencyExceeded) firing: (5) p75 latency high: eqiad mw-api-ext (k8s) 807ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:36:57] (ProbeDown) firing: (10) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:37:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:37:34] (ProbeDown) firing: (14) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:38:15] (PHPFPMTooBusy) firing: (6) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 25.46% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:38:51] (SwaggerProbeHasFailures) firing: (7) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:39:51] (ATSBackendErrorsHigh) firing: (9) ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:41:15] (MediaWikiLatencyExceeded) firing: (4) p75 latency high: eqiad mw-api-ext (k8s) 1.765s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:41:25] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:49] (03CR) 10Klausman: ml-services: add logo-detection isvc to experimental namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020706 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira) [09:41:57] (ProbeDown) firing: (10) Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:42:15] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - kube-mw-parsoid - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:42:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P60750 and previous config saved to /var/cache/conftool/dbconfig/20240417-094223-marostegui.json [09:42:34] (ProbeDown) firing: (15) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:43:15] (PHPFPMTooBusy) firing: (6) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 17.76% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:43:51] (SwaggerProbeHasFailures) firing: (8) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:44:12] !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-web-ro,name=eqiad [09:44:20] !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-api-int-ro,name=eqiad [09:44:29] !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-api-ext-ro,name=eqiad [09:44:51] (ATSBackendErrorsHigh) firing: (12) ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:46:15] (MediaWikiLatencyExceeded) firing: (4) p75 latency high: eqiad mw-api-ext (k8s) 807ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:46:57] (ProbeDown) firing: (9) Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:47:15] (MediaWikiHighErrorRate) resolved: (6) Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:47:34] (ProbeDown) firing: (11) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:47:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:48:15] (PHPFPMTooBusy) firing: (6) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 12.93% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:48:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:48:51] (SwaggerProbeHasFailures) firing: (10) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:49:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:49:51] (ATSBackendErrorsHigh) firing: (13) ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:51:15] (MediaWikiLatencyExceeded) resolved: (3) p75 latency high: eqiad mw-api-ext (k8s) 807ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:51:56] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:51:58] (ProbeDown) resolved: (7) Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:52:34] (ProbeDown) resolved: (7) Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:53:15] (MediaWikiMemcachedHighErrorRate) resolved: (2) MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:53:15] (PHPFPMTooBusy) resolved: (5) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 14.83% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy 
[09:53:43] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [09:53:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:53:51] (SwaggerProbeHasFailures) resolved: (5) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:54:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:54:36] (GatewayBackendErrorsHigh) firing: (2) rest-gateway: elevated 5xx errors from wikifeeds_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [09:54:51] (ATSBackendErrorsHigh) resolved: (9) ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:56:56] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:57:31] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1223 (T361627)', diff saved to https://phabricator.wikimedia.org/P60753 and previous config saved to /var/cache/conftool/dbconfig/20240417-095731-marostegui.json [09:57:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [09:57:37] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:57:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T1000) [10:02:57] (03CR) 10Muehlenhoff: [C:03+2] Only install Go from backports on bullseye-based stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1019726 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [10:04:37] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9721866 (10MoritzMuehlenhoff) [10:06:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:06:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:06:40] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: fix indentation in mistral model resources and increase memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018646 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [10:07:30] fyi, replag for tools is still increasing it seems ? 
[10:07:51] (03Merged) 10jenkins-bot: ml-services: fix indentation in mistral model resources and increase memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018646 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [10:08:14] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync [10:08:21] (03PS1) 10Clément Goubert: admin_ng: Bump coredns replicas to 6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020765 (https://phabricator.wikimedia.org/T346690) [10:08:36] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync [10:08:47] (03CR) 10Hnowlan: [C:03+1] admin_ng: Bump coredns replicas to 6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020765 (https://phabricator.wikimedia.org/T346690) (owner: 10Clément Goubert) [10:09:32] (03PS1) 10David Caro: dynamicproxy: disable response buffering [puppet] - 10https://gerrit.wikimedia.org/r/1020767 [10:11:46] (03CR) 10Clément Goubert: [C:03+2] admin_ng: Bump coredns replicas to 6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020765 (https://phabricator.wikimedia.org/T346690) (owner: 10Clément Goubert) [10:12:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:12:59] (03PS1) 10Effie Mouzeli: mediawiki-common: add a dot to the mcrouter url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020768 (https://phabricator.wikimedia.org/T346690) [10:13:56] (03CR) 10Alexandros Kosiaris: [C:03+1] mediawiki-common: add a dot to the mcrouter url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020768 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:14:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: 
Maintenance [10:14:36] (GatewayBackendErrorsHigh) resolved: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [10:14:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:14:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2105 (T361627)', diff saved to https://phabricator.wikimedia.org/P60755 and previous config saved to /var/cache/conftool/dbconfig/20240417-101446-marostegui.json [10:14:53] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:14:59] (03Merged) 10jenkins-bot: admin_ng: Bump coredns replicas to 6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020765 (https://phabricator.wikimedia.org/T346690) (owner: 10Clément Goubert) [10:15:06] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki-common: add a dot to the mcrouter url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020768 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:16:20] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1020767 (owner: 10David Caro) [10:16:27] (03Merged) 10jenkins-bot: mediawiki-common: add a dot to the mcrouter url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020768 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:17:29] (03PS1) 10Btullis: Disable CustomVariables and CustomPiwikJs on new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1020769 (https://phabricator.wikimedia.org/T349397) [10:18:28] (03PS2) 10David Caro: dynamicproxy: disable response buffering to files [puppet] - 10https://gerrit.wikimedia.org/r/1020767 [10:18:41] (03PS3) 10David Caro: dynamicproxy: disable response buffering to files [puppet] - 10https://gerrit.wikimedia.org/r/1020767 [10:18:42] (03PS2) 10Btullis: Disable CustomVariables and CustomPiwikJs on new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1020769 (https://phabricator.wikimedia.org/T349397) [10:19:32] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "also LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1020767 (owner: 10David Caro) [10:20:00] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1967/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020769 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [10:20:30] (03CR) 10Slavina Stefanova: dynamicproxy: disable response buffering to files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1020767 (owner: 10David Caro) [10:22:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2028.codfw.wmnet [10:23:46] (03CR) 10David Caro: dynamicproxy: disable response buffering to files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1020767 (owner: 10David Caro) [10:23:49] (03PS4) 10David Caro: dynamicproxy: disable response buffering to files [puppet] - 10https://gerrit.wikimedia.org/r/1020767 [10:23:56] (03PS1) 10Muehlenhoff: Switch es2028 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020772 (https://phabricator.wikimedia.org/T349619) [10:25:04] (03PS1) 10Effie Mouzeli: mediawiki-common: use mcrouter ds only on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020774 (https://phabricator.wikimedia.org/T346690) [10:25:28] (03CR) 10David Caro: [C:03+2] dynamicproxy: disable response buffering to files [puppet] - 10https://gerrit.wikimedia.org/r/1020767 (owner: 10David Caro) [10:26:25] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:27:35] (03PS3) 10Btullis: Disable CustomVariables and CustomPiwikJs on new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1020769 (https://phabricator.wikimedia.org/T349397) [10:27:38] (03CR) 10Alexandros Kosiaris: [C:03+1] 
mediawiki-common: use mcrouter ds only on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020774 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:28:38] (03CR) 10Muehlenhoff: [C:03+2] Switch es2028 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020772 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:29:18] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1020266 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [10:29:36] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1968/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020769 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [10:29:45] (03PS2) 10Effie Mouzeli: mediawiki-common: use mcrouter ds only on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020774 (https://phabricator.wikimedia.org/T346690) [10:30:09] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki-common: use mcrouter ds only on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020774 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:31:35] (03Merged) 10jenkins-bot: mediawiki-common: use mcrouter ds only on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020774 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:33:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2028.codfw.wmnet [10:33:14] (03CR) 10David Caro: [C:03+2] "Forgot to add the task: https://phabricator.wikimedia.org/T354116" [puppet] - 10https://gerrit.wikimedia.org/r/1020767 (owner: 10David Caro) [10:33:52] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [10:34:12] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1027.eqiad.wmnet [10:34:13] !log akosiaris@deploy1002 
helmfile [eqiad] START helmfile.d/admin 'apply'. [10:34:16] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:34:29] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:34:51] !log apply the coredns patches for bumping instances from 4 to 6. They are noop, I am applying them to update helm's state. [10:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T361627)', diff saved to https://phabricator.wikimedia.org/P60756 and previous config saved to /var/cache/conftool/dbconfig/20240417-103455-marostegui.json [10:34:58] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:35:00] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:35:04] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[10:35:05] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:35:09] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [10:35:13] (03PS1) 10Muehlenhoff: Switch es1027 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020775 (https://phabricator.wikimedia.org/T349619) [10:36:03] !log jiji@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [10:36:03] !log jiji@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [10:36:25] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:36:51] !log jiji@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [10:37:29] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1969/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020769 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [10:37:51] (03CR) 10Muehlenhoff: [C:03+2] Switch es1027 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020775 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:37:54] !log jiji@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [10:38:21] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [10:38:30] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:39:01] (03PS2) 10Kevin Bazira: ml-services: add logo-detection isvc to experimental 
namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020706 (https://phabricator.wikimedia.org/T362749) [10:40:10] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [10:41:03] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:41:17] (03PS4) 10Btullis: Disable CustomVariables and CustomPiwikJs on new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1020769 (https://phabricator.wikimedia.org/T349397) [10:41:24] (03PS3) 10Kevin Bazira: ml-services: add logo-detection isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020706 (https://phabricator.wikimedia.org/T362749) [10:41:54] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:42:07] (03CR) 10Kevin Bazira: ml-services: add logo-detection isvc to experimental namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020706 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira) [10:42:08] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [10:42:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1027.eqiad.wmnet [10:42:32] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [10:42:34] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1020769 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [10:44:56] !log jiji@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-int-ro,name=eqiad [10:45:03] (03PS1) 10Clément Goubert: admin_ng: Bump coredns memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020778 (https://phabricator.wikimedia.org/T346690) [10:45:33] (03CR) 10Alexandros Kosiaris: [C:03+1] admin_ng: Bump 
coredns memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020778 (https://phabricator.wikimedia.org/T346690) (owner: 10Clément Goubert) [10:45:59] !log pool eqiad back for mw-web-ro, mw-api-int-ro and mw-api-ext-ro [10:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:24] !log jiji@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-ro,name=eqiad [10:49:05] (03CR) 10Clément Goubert: [C:03+2] admin_ng: Bump coredns memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020778 (https://phabricator.wikimedia.org/T346690) (owner: 10Clément Goubert) [10:50:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P60757 and previous config saved to /var/cache/conftool/dbconfig/20240417-105002-marostegui.json [10:51:32] (03PS1) 10Btullis: Install a matomo plugin on the new host [puppet] - 10https://gerrit.wikimedia.org/r/1020780 (https://phabricator.wikimedia.org/T349397) [10:52:40] (03Merged) 10jenkins-bot: admin_ng: Bump coredns memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020778 (https://phabricator.wikimedia.org/T346690) (owner: 10Clément Goubert) [10:52:56] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1020780 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [10:53:04] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[10:53:27] (03CR) 10Btullis: [V:03+1 C:03+2] Disable CustomVariables and CustomPiwikJs on new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1020769 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [10:53:49] !log jiji@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-web-ro,name=eqiad [10:53:56] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:54:05] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T1100). [11:00:27] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9722005 (10Ladsgroup) >>! In T360029#9658042, @CDanis wrote: > Just to make sure I understand, the request here is an easy-to-automate way of... [11:04:30] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[11:05:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P60758 and previous config saved to /var/cache/conftool/dbconfig/20240417-110510-marostegui.json [11:06:10] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2030.codfw.wmnet [11:07:05] (03PS1) 10Muehlenhoff: Switch es2030 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020788 (https://phabricator.wikimedia.org/T349619) [11:07:40] (03PS1) 10Alexandros Kosiaris: coredns: Switch podAntiAffinity rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020789 [11:10:31] (03CR) 10CI reject: [V:04-1] coredns: Switch podAntiAffinity rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020789 (owner: 10Alexandros Kosiaris) [11:11:08] (03CR) 10Btullis: [V:03+1 C:03+2] Install a matomo plugin on the new host [puppet] - 10https://gerrit.wikimedia.org/r/1020780 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [11:11:11] jouncebot: now [11:11:11] For the next 0 hour(s) and 48 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T1100) [11:12:56] (03CR) 10Muehlenhoff: [C:03+2] Switch es2030 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020788 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:13:18] (03PS2) 10Alexandros Kosiaris: coredns: Switch podAntiAffinity rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020789 [11:13:25] !log jiji@deploy1002 Started scap: NoOp [11:17:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2030.codfw.wmnet [11:19:59] (03CR) 10JMeybohm: [C:03+1] coredns: Switch podAntiAffinity rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020789 (owner: 10Alexandros Kosiaris) [11:20:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T361627)', diff saved to 
https://phabricator.wikimedia.org/P60759 and previous config saved to /var/cache/conftool/dbconfig/20240417-112017-marostegui.json [11:20:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [11:20:23] (03CR) 10Alexandros Kosiaris: [C:03+2] coredns: Switch podAntiAffinity rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020789 (owner: 10Alexandros Kosiaris) [11:20:33] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:20:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [11:20:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2109 (T361627)', diff saved to https://phabricator.wikimedia.org/P60760 and previous config saved to /var/cache/conftool/dbconfig/20240417-112040-marostegui.json [11:20:50] (03CR) 10Ayounsi: [C:03+1] Netbox custom script to add additional IPv4 addresses to host [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1017064 (https://phabricator.wikimedia.org/T358096) (owner: 10Cathal Mooney) [11:22:12] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1032.eqiad.wmnet [11:23:03] !log jiji@deploy1002 Finished scap: NoOp (duration: 09m 38s) [11:23:09] (03Merged) 10jenkins-bot: coredns: Switch podAntiAffinity rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020789 (owner: 10Alexandros Kosiaris) [11:23:12] (03PS1) 10Muehlenhoff: Switch es1032 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020795 (https://phabricator.wikimedia.org/T349619) [11:23:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:23:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:23:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:24:11] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:24:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T352010)', diff saved to https://phabricator.wikimedia.org/P60761 and previous config saved to /var/cache/conftool/dbconfig/20240417-112418-ladsgroup.json [11:24:25] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:24:25] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:25:39] (03CR) 10Muehlenhoff: [C:03+2] Switch es1032 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020795 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:27:28] (03PS1) 10Btullis: Swith matomo/piwik to the new host [puppet] - 10https://gerrit.wikimedia.org/r/1020798 (https://phabricator.wikimedia.org/T351552) [11:28:16] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1972/console" [puppet] - 10https://gerrit.wikimedia.org/r/1020798 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [11:29:21] (03PS2) 10Btullis: Swith matomo/piwik to the new host [puppet] - 10https://gerrit.wikimedia.org/r/1020798 (https://phabricator.wikimedia.org/T351552) [11:29:34] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:29:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1032.eqiad.wmnet [11:30:27] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[11:30:41] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:33:23] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2032.codfw.wmnet [11:33:40] (03PS3) 10Muehlenhoff: Move cloudcephosd2001-dev to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1017248 (https://phabricator.wikimedia.org/T361913) [11:35:30] (03PS1) 10Muehlenhoff: Switch es2032 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020799 (https://phabricator.wikimedia.org/T349619) [11:36:09] !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [11:38:57] (03CR) 10Elukey: {echo,session}store (staging): use wmf-ca-certificates.crt (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020356 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [11:42:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T361627)', diff saved to https://phabricator.wikimedia.org/P60762 and previous config saved to /var/cache/conftool/dbconfig/20240417-114201-marostegui.json [11:42:08] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:42:23] (03CR) 10Muehlenhoff: [C:03+2] Switch es2032 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020799 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:44:39] !log depool ncredir2001 [11:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2032.codfw.wmnet [11:47:27] (03CR) 10Jcrespo: "So it is my understanding that usernames with underscores require escaping (\_), otherwise they grant the rights to any user containing an" [puppet] - 10https://gerrit.wikimedia.org/r/1020709 (https://phabricator.wikimedia.org/T360149) (owner: 
10Arnaudb) [11:48:47] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:48:51] (03CR) 10Muehlenhoff: [C:03+2] alertmanager: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1020198 (owner: 10Muehlenhoff) [11:52:27] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9722192 (10MoritzMuehlenhoff) [11:53:11] stevemunene: ^ I've seen the datahub release alert flying by a couple of times now - is that expected or is there something wrong with the deployment? [11:55:17] hi jayme That is from some ongoing work [11:56:00] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772 (10cmooney) 03NEW p:05Triage→03Medium [11:56:24] stevemunene: so you're fixing the deployment? 
[11:57:00] !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [11:57:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P60763 and previous config saved to /var/cache/conftool/dbconfig/20240417-115709-marostegui.json [11:57:46] jayme: Was in the middle of an upgrade and yes, sorry I was a bit unclear [11:57:57] ack, okay [11:58:47] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:00:03] (03CR) 10JMeybohm: [V:03+1 C:03+2] kubernetes::node: Ensure apparmor profiles are loaded automatically [puppet] - 10https://gerrit.wikimedia.org/r/1020700 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [12:01:25] (03PS1) 10Slyngshede: Initial documentation for the Bitu API. [software/bitu] - 10https://gerrit.wikimedia.org/r/1020802 [12:02:12] (03PS1) 10JMeybohm: kubernetes::node: Remove apparmor cleanup code [puppet] - 10https://gerrit.wikimedia.org/r/1020803 (https://phabricator.wikimedia.org/T326785) [12:03:47] (03CR) 10JMeybohm: [C:03+1] "Cory/James: Feel free to merge and deploy when you see fit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020701 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [12:04:29] (03PS2) 10Slyngshede: Initial documentation for the Bitu API. [software/bitu] - 10https://gerrit.wikimedia.org/r/1020802 [12:05:43] (03CR) 10JMeybohm: "ocf. 
no merge before ~13.00 UTC" [puppet] - 10https://gerrit.wikimedia.org/r/1020803 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [12:06:33] !log upgrading PHP on mediawiki baremetal canaries servers T362511 [12:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:31] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1020798 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [12:07:45] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:07:56] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1020803 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [12:08:20] WidespreadPuppetFailure is me [12:11:35] (03CR) 10Btullis: Swith matomo/piwik to the new host [puppet] - 10https://gerrit.wikimedia.org/r/1020798 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [12:12:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P60765 and previous config saved to /var/cache/conftool/dbconfig/20240417-121218-marostegui.json [12:12:53] !log repool ncredir2001 [12:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:45] (03PS4) 10Kevin Bazira: ml-services: add logo-detection isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020706 (https://phabricator.wikimedia.org/T362749) [12:16:28] (03PS2) 10JMeybohm: 
kubernetes::node: Remove apparmor cleanup code [puppet] - 10https://gerrit.wikimedia.org/r/1020803 (https://phabricator.wikimedia.org/T326785) [12:16:28] (03PS1) 10JMeybohm: apparmor::profile: Don't try to define /etc/apparmor.d resource [puppet] - 10https://gerrit.wikimedia.org/r/1020805 (https://phabricator.wikimedia.org/T326785) [12:19:41] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1975/c" [puppet] - 10https://gerrit.wikimedia.org/r/1020805 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [12:20:01] (03CR) 10JMeybohm: [V:03+1 C:03+2] apparmor::profile: Don't try to define /etc/apparmor.d resource [puppet] - 10https://gerrit.wikimedia.org/r/1020805 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [12:21:34] thanks for the heads up, jayme! [12:21:49] (03PS1) 10Marostegui: db2120: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1020806 [12:21:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2120', diff saved to https://phabricator.wikimedia.org/P60766 and previous config saved to /var/cache/conftool/dbconfig/20240417-122150-root.json [12:22:17] (03CR) 10Marostegui: [C:03+2] db2120: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1020806 (owner: 10Marostegui) [12:23:12] jynus: sure - code is fixed, alert should go away in a bit [12:25:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2120.codfw.wmnet with OS bookworm [12:27:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T361627)', diff saved to https://phabricator.wikimedia.org/P60767 and previous config saved to /var/cache/conftool/dbconfig/20240417-122725-marostegui.json [12:27:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [12:27:31] T361627: Create cuc_agent_id, 
cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [12:27:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [12:27:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2127 (T361627)', diff saved to https://phabricator.wikimedia.org/P60768 and previous config saved to /var/cache/conftool/dbconfig/20240417-122748-marostegui.json [12:28:06] (03PS1) 10Muehlenhoff: Remove parsoid-canary Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1020807 (https://phabricator.wikimedia.org/T359387) [12:29:59] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:31:01] (03PS1) 10Elukey: knative-serving: move net_istio configs to a dict [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020808 (https://phabricator.wikimedia.org/T353622) [12:32:32] (03CR) 10Muehlenhoff: [C:03+2] Remove parsoid-canary Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1020807 (https://phabricator.wikimedia.org/T359387) (owner: 10Muehlenhoff) [12:32:57] (03CR) 10Cathal Mooney: [C:03+1] Puppet: add magru (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [12:40:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2120.codfw.wmnet with reason: host reimage [12:41:20] (03PS1) 10Marostegui: Revert "db2120: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1020730 [12:44:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2120.codfw.wmnet with reason: host reimage [12:45:08] (03CR) 10Elukey: "The diff is a little mixed since probably the current order from the list/array is not the same compared to the one that the dict creates." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1020808 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [12:46:20] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2026.codfw.wmnet [12:47:08] (03PS1) 10Muehlenhoff: Switch es2026 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020819 (https://phabricator.wikimedia.org/T349619) [12:47:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017248 (https://phabricator.wikimedia.org/T361913) (owner: 10Muehlenhoff) [12:47:35] (03CR) 10Jforrester: "Thanks, will do in an hour's time in our window." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020701 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [12:47:45] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:47:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T361627)', diff saved to https://phabricator.wikimedia.org/P60769 and previous config saved to /var/cache/conftool/dbconfig/20240417-124756-marostegui.json [12:48:02] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [12:49:56] (03CR) 10Muehlenhoff: [C:03+2] Switch es2026 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020819 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:50:36] (03PS1) 10Ssingh: geo-maps: add drmrs LVS map [dns] - 10https://gerrit.wikimedia.org/r/1020823 [12:52:45] (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:54:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2026.codfw.wmnet [12:57:13] (03CR) 10Ssingh: [C:03+2] Puppet: add magru [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T1300). [13:00:05] anzx and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] hello [13:00:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60770 and previous config saved to /var/cache/conftool/dbconfig/20240417-130027-root.json [13:00:35] (03CR) 10Marostegui: [C:03+2] Revert "db2120: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1020730 (owner: 10Marostegui) [13:00:45] o/ [13:01:10] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2031.codfw.wmnet [13:01:56] (03PS1) 10Muehlenhoff: Switch es2031 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020824 (https://phabricator.wikimedia.org/T349619) [13:02:03] Lucas_WMDE: does the patch for T362653 look OK to you? 
[13:02:03] T362653: Create Draft Namespace in Malayalam Wikipedia - https://phabricator.wikimedia.org/T362653 [13:03:00] I think so, yeah [13:03:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P60771 and previous config saved to /var/cache/conftool/dbconfig/20240417-130303-marostegui.json [13:03:10] we just need to not forget to run namespaceDupes ^^ [13:03:27] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722316 (10ssingh) @Papaul: Thanks for the update! Looks promising indeed and to actually close this, we should downgrade another host i... [13:03:32] (03CR) 10Muehlenhoff: [C:03+2] Switch es2031 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020824 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:04:17] anzx: are you around? [13:04:17] what’s the current status of production after the incident earlier (T362766)? is it okay to deploy normal changes? 
[13:04:17] T362766: 2024-04-17 mw-* went down in eqiad - https://phabricator.wikimedia.org/T362766 [13:05:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2120.codfw.wmnet with OS bookworm [13:06:00] kostajh: I am around [13:06:03] (pinging jynus as the IC but would also be happy for anyone else to respond ^^) [13:06:25] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:26] (03PS1) 10Ssingh: geo-maps: add magru to geo maps [dns] - 10https://gerrit.wikimedia.org/r/1020825 (https://phabricator.wikimedia.org/T346722) [13:06:41] (03CR) 10Kosta Harlan: "Comparing with Ibe5548a6759e794c125a81a59d87bde0134da825, do we want to set noindex/nofollow, and enable VisualEditor for the Draft namesp" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [13:06:41] yes, status is resolved, no blockers (there may be some followups, but ok with normal operations) [13:06:47] okay, thanks! [13:07:01] anzx: hi :) I left a comment on your patch [13:07:03] kostajh: do you want to do the deployments or should I start with mlwiki? [13:07:07] ah, ok [13:08:04] Lucas_WMDE: if the mlwiki patch looks good to you, please start [13:08:11] otherwise, my 3 patches can be synced together [13:08:22] I think you raised a valid point so I’ll let anzx reply to that ^^ [13:08:25] * Lucas_WMDE looks at your changes [13:08:43] (03CR) 10Ssingh: [C:03+2] "Task for this patch is https://phabricator.wikimedia.org/T346722." 
[puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [13:09:08] (03PS2) 10Kosta Harlan: beta: Disable wgWikimediaEventsIPoidUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015295 (https://phabricator.wikimedia.org/T354597) [13:09:14] (03PS2) 10Kosta Harlan: WikimediaEvents: Set IPoid URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015296 (https://phabricator.wikimedia.org/T354597) [13:09:17] (03PS2) 10Kosta Harlan: EventStreamConfig: Register ip_reputation/score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015299 (https://phabricator.wikimedia.org/T354597) [13:09:26] (rebased to get some new CI builds after the old ones were gone) [13:10:23] thx [13:10:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2031.codfw.wmnet [13:11:06] (03CR) 10Lucas Werkmeister (WMDE): "VE probably makes sense, good point. I think noindex/nofollow is already set (line 4786); if I understand correctly, the `wmgExemptFromUse" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [13:11:25] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:11:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2033.codfw.wmnet [13:12:02] kostajh: I wonder why diffConfig detects no change in the first change (Disable …IPoidUrl) [13:12:06] topranks: this build2001 one doesn't seem to be related to us [13:12:25] (03PS1) 10Muehlenhoff: Switch es2033 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020826 (https://phabricator.wikimedia.org/T349619) [13:12:36] 10ops-codfw, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Degraded RAID on elastic2088 - 
https://phabricator.wikimedia.org/T361525#9722367 (10Jhancock.wm) [13:12:42] ohh, it’s only set (outside of beta) in the following change? [13:13:08] hm, but I’m not sure if overriding CS.php in IS-labs.php works like that [13:13:15] sukhe: yeah that seems wide of the changes we're making alright [13:13:33] 10ops-codfw, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9722368 (10bking) a:05bking→03None [13:13:34] (03PS1) 10Ssingh: magru: add geo-resources and update wikimedia.org zone [dns] - 10https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) [13:13:40] Lucas_WMDE: the goal is to unset the variable in beta and enable in production [13:13:45] Maybe I did it wrong [13:14:08] (03CR) 10Muehlenhoff: [C:03+2] Switch es2033 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020826 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:14:14] 10ops-codfw, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9722374 (10Jhancock.wm) @RKemper I am going to check it out and get back in touch with dell. These are the same errors we were getting before the card was r... 
[13:14:18] kostajh: Lucas_WMDE I think since they didn't ask for visual editor , should I update patch to enable it [13:14:21] (03CR) 10CI reject: [V:04-1] magru: add geo-resources and update wikimedia.org zone [dns] - 10https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:15:17] anzx: https://ml.wikipedia.org/wiki/Special:Tags looks like VE is used a lot on that wiki, so IMHO it would be fine to just guess that they’ll be fine with enabling it [13:15:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60772 and previous config saved to /var/cache/conftool/dbconfig/20240417-131533-root.json [13:15:47] Lucas_WMDE: I will update patch [13:15:49] (03PS2) 10Ssingh: magru: add geo-resources and update wikimedia.org zone [dns] - 10https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) [13:16:25] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:16:43] (03CR) 10CI reject: [V:04-1] magru: add geo-resources and update wikimedia.org zone [dns] - 10https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:17:04] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] "I don’t think this will work – I857cefbd4a sets the IPoid URL in `CommonSettings.php`, and according to [the comment near the top of `Init" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015295 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [13:17:07] (03PS1) 10Ladsgroup: Setting dummy password for cumin dedicated mysql user [labs/private] - 10https://gerrit.wikimedia.org/r/1020828 [13:17:57] (03CR) 10Ssingh: "Failure is expected since we don't have the 
data center name magru yet. Will merge after I93fe45bed44583c86680b5595c481181e048282b" [dns] - 10https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:17:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2033.codfw.wmnet [13:18:08] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1026.eqiad.wmnet [13:18:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P60773 and previous config saved to /var/cache/conftool/dbconfig/20240417-131811-marostegui.json [13:20:41] (03PS2) 10Ssingh: realm: fix consistency for site IPs [puppet] - 10https://gerrit.wikimedia.org/r/1019843 [13:20:55] Lucas_WMDE: I guess I should use CommonSettings-Labs.php to set the URL to null [13:21:09] yeah, I guess that would also work [13:21:29] I was wondering if CS.php could just check if the variable was already set via isset(), but then I remembered that isset() is false for null values [13:21:40] so that wouldn’t quite work out, annoyingly [13:22:00] (I think you could do that in one change btw, set the URL for production and unset it for beta) [13:23:16] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp1115.eqiad.wmnet [13:23:17] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1115.eqiad.wmnet [13:23:18] (03PS1) 10Ladsgroup: mariadb: Set up dedicated cumin user [puppet] - 10https://gerrit.wikimedia.org/r/1020830 [13:23:31] (03PS1) 10Muehlenhoff: Switch es1026 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020831 (https://phabricator.wikimedia.org/T349619) [13:23:38] kostajh: alternatively… do you even need to do anything? 
it looks like $wmgLocalServices['ipoid'] might be null on beta anyways [13:23:54] (checked in `mwscript shell testwiki` on deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud) [13:23:56] yeah just reached that conclusion :) [13:24:02] labs.php sets it to null [13:24:05] but I don’t know where $wmgLocalServices is set otherwise [13:24:05] ah ok [13:24:06] I'll update the patches [13:24:12] yeah then that’s probably enough ^^ [13:24:28] (03CR) 10Muehlenhoff: [C:03+2] Switch es1026 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020831 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:24:40] could add a like // can be null, e.g. on Beta [13:24:42] maybe ^^ [13:24:44] *a comment like [13:25:14] (03CR) 10Cathal Mooney: [C:03+1] "LGTM. Apart from eqiad and codfw these private LVS ranges don't seem to have any usage, so I'm not sure if we should keep them longer term" [dns] - 10https://gerrit.wikimedia.org/r/1020823 (owner: 10Ssingh) [13:25:23] (SystemdUnitFailed) firing: (2) ferm.service on mw1367:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:30] ^ this might be us, checking [13:25:42] (03CR) 10Ladsgroup: [C:03+1] Setting dummy password for cumin dedicated mysql user [labs/private] - 10https://gerrit.wikimedia.org/r/1020828 (owner: 10Ladsgroup) [13:26:01] (03PS4) 10Anzx: mlwiki: create draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) [13:26:10] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Setting dummy password for cumin dedicated mysql user [labs/private] - 10https://gerrit.wikimedia.org/r/1020828 (owner: 10Ladsgroup) [13:26:25] (SystemdUnitFailed) firing: (8) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:11] (03PS3) 10Kosta Harlan: WikimediaEvents: Set IPoid URL and enable ip_reputation/score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015295 (https://phabricator.wikimedia.org/T354597) [13:27:16] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] mlwiki: create draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [13:27:50] mw1367 should be recovering, we have seen this in the past where ferm doesn't reload and so needs a manual push [13:27:54] (03PS4) 10Kosta Harlan: WikimediaEvents: Set IPoid URL and enable ip_reputation/score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015295 (https://phabricator.wikimedia.org/T354597) [13:28:02] (03Abandoned) 10Kosta Harlan: WikimediaEvents: Set IPoid URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015296 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [13:28:07] (03Abandoned) 10Kosta Harlan: EventStreamConfig: Register ip_reputation/score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015299 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [13:28:18] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] "sorry, just one small copy+paste mistake and then this should be good to go ^^" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [13:28:27] (03CR) 10Cathal Mooney: [C:03+1] geo-maps: add magru to geo maps [dns] - 10https://gerrit.wikimedia.org/r/1020825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:28:41] Lucas_WMDE: ready for review [13:29:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1026.eqiad.wmnet [13:29:12] looking [13:29:24] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: 
Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9722407 (10Andrew) 05Open→03Resolved These are now in service and working fine. [13:29:26] oh, and another patch appeared on the wikitech page [13:29:48] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1033.eqiad.wmnet [13:30:23] (SystemdUnitFailed) firing: (9) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:30:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60774 and previous config saved to /var/cache/conftool/dbconfig/20240417-133040-root.json [13:30:47] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] WikimediaEvents: Set IPoid URL and enable ip_reputation/score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015295 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [13:31:12] kostajh: I’ll go ahead with your change then [13:31:16] and then hopefully anzx right afterwards [13:31:27] DreamRimmer: not sure we’ll have time for your change :/ [13:31:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015295 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [13:31:35] Lucas_WMDE: ty [13:32:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1115.eqiad.wmnet,service=(cdn|ats-be) [13:32:36] no worries, we can see next time [13:32:41] (03PS1) 10Muehlenhoff: Switch es1033 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020833 (https://phabricator.wikimedia.org/T349619) [13:32:59] (03Merged) 10jenkins-bot: WikimediaEvents: Set IPoid URL and enable ip_reputation/score [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1015295 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [13:33:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T361627)', diff saved to https://phabricator.wikimedia.org/P60775 and previous config saved to /var/cache/conftool/dbconfig/20240417-133318-marostegui.json [13:33:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [13:33:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [13:33:25] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:33:26] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1015295|WikimediaEvents: Set IPoid URL and enable ip_reputation/score (T354597)]] [13:33:38] T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597 [13:33:50] (SystemdUnitFailed) firing: (9) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:22] (03CR) 10Muehlenhoff: [C:03+2] Switch es1033 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020833 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:34:53] (03PS5) 10Anzx: mlwiki: create draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) [13:35:10] (03CR) 10Anzx: "i think enabl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [13:35:57] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] 
"good to go once the current deployment is done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [13:36:30] (03CR) 10Ssingh: [C:03+2] geo-maps: add drmrs LVS map [dns] - 10https://gerrit.wikimedia.org/r/1020823 (owner: 10Ssingh) [13:36:32] !log lucaswerkmeister-wmde@deploy1002 kharlan and lucaswerkmeister-wmde: Backport for [[gerrit:1015295|WikimediaEvents: Set IPoid URL and enable ip_reputation/score (T354597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:36:39] (03PS1) 10Bking: query_service: enable CPU performance governor for w[cd]qs [puppet] - 10https://gerrit.wikimedia.org/r/1020834 (https://phabricator.wikimedia.org/T336443) [13:36:43] kostajh: is the production part of the change testable? [13:36:48] !log running authdns-update for CR 1020823 [13:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:52] Lucas_WMDE: yes [13:36:59] okay, then please test :) [13:37:06] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1020834 (https://phabricator.wikimedia.org/T336443) (owner: 10Bking) [13:37:16] ok! I'll need a few minutes [13:37:37] (03CR) 10Eevans: {echo,session}store (staging): use wmf-ca-certificates.crt (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020356 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [13:37:41] Lucas_WMDE: which mwdebug backend to use? 
[13:37:42] (03PS3) 10Eevans: {echo,session}store (staging): use wmf-ca-certificates.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020356 (https://phabricator.wikimedia.org/T352647) [13:37:48] any of them [13:37:59] `scap backport` always syncs changes to all of them [13:38:09] (including “k8s-experimental” which will soon be renamed to be less experimental) [13:38:16] ok [13:39:28] (03PS1) 10Slyngshede: Keymanagement, fix parsing and display of FIDO/U2F keys [software/bitu] - 10https://gerrit.wikimedia.org/r/1020836 [13:39:43] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator 2024-04-04-132719 to 2024-04-17-125039 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020837 (https://phabricator.wikimedia.org/T302519) [13:40:14] (03CR) 10Fabfur: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1020825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:40:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1033.eqiad.wmnet [13:41:56] (03PS2) 10Jforrester: wikifunctions: Move apparmor annotation to pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020701 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [13:43:26] Lucas_WMDE: I don't see the events generated when using https://wikitech.wikimedia.org/wiki/Kafka#kafkacat but it's possible I'm doing something wrong there. 
[13:43:50] (SystemdUnitFailed) firing: (9) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:16] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9722479 (10MoritzMuehlenhoff) [13:44:28] hmm [13:44:31] Lucas_WMDE: hmm, I do see an error in logstash https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2024.04.17?id=29JL7I4BXQUFBRtCE_7H [13:44:51] “Event submitted for unregistered stream name "mediawiki.ip_reputation.score"” [13:45:06] yeah [13:45:07] (03CR) 10DCausse: [C:03+1] query_service: enable CPU performance governor for w[cd]qs [puppet] - 10https://gerrit.wikimedia.org/r/1020834 (https://phabricator.wikimedia.org/T336443) (owner: 10Bking) [13:45:16] I think I have the wrong name reference in WikimediaEvents, looking [13:45:21] (03CR) 10Ssingh: [C:03+2] realm: fix consistency for site IPs [puppet] - 10https://gerrit.wikimedia.org/r/1019843 (owner: 10Ssingh) [13:45:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60776 and previous config saved to /var/cache/conftool/dbconfig/20240417-134545-root.json [13:46:06] * Lucas_WMDE knows very little about event stream stuff [13:46:33] (03CR) 10Bking: [C:03+2] query_service: enable CPU performance governor for w[cd]qs [puppet] - 10https://gerrit.wikimedia.org/r/1020834 (https://phabricator.wikimedia.org/T336443) (owner: 10Bking) [13:47:41] not a huge fan of how https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Using_scap_backport and https://wikitech.wikimedia.org/wiki/Scap#Backport_Deployments point to each other saying “look over there for more details” [13:47:57] 
Lucas_WMDE: I'm also confused about what I am doing wrong here. [13:48:02] but I think it’s clear enough how to revert the change if necessary [13:48:46] kostajh: is it urgent to get this configuration deployed? otherwise I’d say revert now, understand later :/ [13:49:29] Lucas_WMDE: yeah let's revert it. Sorry for the trouble. [13:49:34] !log lucaswerkmeister-wmde@deploy1002 Sync cancelled. [13:49:59] (03PS1) 10TrainBranchBot: Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020839 [13:49:59] (03CR) 10TrainBranchBot: "lucaswerkmeister-wmde@deploy1002 created a revert of this change as I6a299ce2c67f81faa520a3366bd83657988f96f6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015295 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [13:50:05] kostajh: no problem at all [13:50:18] hopefully you’ll be able to figure out what’s wrong [13:50:23] (SystemdUnitFailed) firing: (8) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020839 (owner: 10TrainBranchBot) [13:51:24] (03Merged) 10jenkins-bot: Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020839 (owner: 10TrainBranchBot) [13:51:32] (03CR) 10Lucas Werkmeister (WMDE): "Details: It produced [one error in logstash](https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2024" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020839 (owner: 10TrainBranchBot) [13:51:53] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for 
[[gerrit:1020839|Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score"]] [13:51:58] (03PS2) 10Ssingh: geo-maps: add magru to geo maps [dns] - 10https://gerrit.wikimedia.org/r/1020825 (https://phabricator.wikimedia.org/T346722) [13:52:12] (03CR) 10Ssingh: "rebased for drmrs LVS change" [dns] - 10https://gerrit.wikimedia.org/r/1020825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:52:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [13:52:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [13:52:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T361627)', diff saved to https://phabricator.wikimedia.org/P60777 and previous config saved to /var/cache/conftool/dbconfig/20240417-135253-marostegui.json [13:52:58] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:53:02] (03PS1) 10Hashar: wm-zuul-status: filter based solely on change number [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1020840 (https://phabricator.wikimedia.org/T358253) [13:53:41] (03PS2) 10Slyngshede: Keymanagement, fix parsing and display of FIDO/U2F keys [software/bitu] - 10https://gerrit.wikimedia.org/r/1020836 [13:53:49] (03CR) 10Hashar: "It was a bit long to reach the task you filed back in February, but that is the implementation/fix :)" [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1020840 (https://phabricator.wikimedia.org/T358253) (owner: 10Hashar) [13:53:50] (SystemdUnitFailed) firing: (7) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:54:23] kostajh: maybe something like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/992631 was missing? [13:54:24] *looks closer* [13:55:06] !log lucaswerkmeister-wmde@deploy1002 trainbranchbot and lucaswerkmeister-wmde: Backport for [[gerrit:1020839|Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:55:13] !log lucaswerkmeister-wmde@deploy1002 trainbranchbot and lucaswerkmeister-wmde: Continuing with sync [13:55:18] (no need to test the revert I think) [13:56:03] (03PS3) 10Slyngshede: Keymanagement, fix parsing and display of FIDO/U2F keys [software/bitu] - 10https://gerrit.wikimedia.org/r/1020836 [13:56:46] (03PS3) 10Ssingh: geo-maps: add magru to geo maps [dns] - 10https://gerrit.wikimedia.org/r/1020825 (https://phabricator.wikimedia.org/T346722) [13:57:04] kostajh: although the streams next to the one you added (mediawiki.cirrussearch.page_rerender.v1, mediawiki.page-create) don’t show up in wgEventLoggingStreamNames either, so maybe that’s not the problem after all [13:57:33] (03PS4) 10Ssingh: geo-maps: add magru to geo maps [dns] - 10https://gerrit.wikimedia.org/r/1020825 (https://phabricator.wikimedia.org/T346722) [13:57:37] Yeah those were the ones I was referencing when writing my patch [13:58:50] (SystemdUnitFailed) firing: (7) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:58] jouncebot: next [13:58:59] In 0 hour(s) and 1 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T1400) [13:59:04] we’ll definitely run into that, sorry [14:00:05] Deploy window 
Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T1400) [14:00:10] :-( [14:00:14] I’m still deploying, sorry :( [14:00:26] I mean, my deploy tool is different from yours, so I /can/ deploy. [14:00:34] But it's probably better that I don't. :-) [14:00:38] heh [14:00:46] I mean, in theory what I’m deploying right now should be a 100% no-op [14:00:48] hello [14:00:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60778 and previous config saved to /var/cache/conftool/dbconfig/20240417-140051-root.json [14:00:56] `helmfile` vs. `scap`. [14:00:56] Hey domas, how's life? [14:00:57] since it’s a revert of a change that never made it beyond mwdebug [14:01:05] so now the production servers are getting the same code deployed again (in theory) [14:01:44] (I also wanted to deploy anzx’ namespace change but I guess that’s not happening, damn) [14:01:47] @James_F, same old same old! [14:01:55] (03CR) 10Ssingh: [C:03+2] geo-maps: add magru to geo maps [dns] - 10https://gerrit.wikimedia.org/r/1020825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:01:55] Lucas_WMDE: Ack. [14:02:07] !log running authdns-update for adding magru to geo-maps: T346722 [14:02:10] * Lucas_WMDE peeks at kubectl for progress info [14:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:12] T346722: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722 [14:03:10] 179/223 up-to-date, probably a few more minutes [14:03:22] James_F, became an IG food influencer nowadays, doing everything I can do to avoid working on AI :-D [14:03:50] domas: … isn't that just working /for/ AI, namely the feed algo? 
[14:03:50] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:16] (03PS3) 10Ssingh: magru: add geo-resources and update wikimedia.org zone [dns] - 10https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) [14:04:50] (03CR) 10Cory Massaro: [C:03+1] wikifunctions: Move apparmor annotation to pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020701 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:05:11] (03CR) 10CI reject: [V:04-1] magru: add geo-resources and update wikimedia.org zone [dns] - 10https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:05:23] true true [14:05:39] has wikipedia been replaced by LLM yet? [14:05:46] I saw there was some director-of-ML role ! [14:05:46] Always has been. 
[14:05:47] (03PS1) 10Cathal Mooney: Add new BGP group for cross-rack PyBal peerings at L3 POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1020843 (https://phabricator.wikimedia.org/T362772) [14:05:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T352010)', diff saved to https://phabricator.wikimedia.org/P60779 and previous config saved to /var/cache/conftool/dbconfig/20240417-140549-ladsgroup.json [14:05:53] !log depool ncredir2001 [14:06:01] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:20] (03CR) 10CI reject: [V:04-1] Add new BGP group for cross-rack PyBal peerings at L3 POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1020843 (https://phabricator.wikimedia.org/T362772) (owner: 10Cathal Mooney) [14:06:55] (03PS1) 10Cathal Mooney: Adjust LVS config in esams, drmrs to peer bit both ASWs [puppet] - 10https://gerrit.wikimedia.org/r/1020844 (https://phabricator.wikimedia.org/T362772) [14:07:15] (03CR) 10CI reject: [V:04-1] Adjust LVS config in esams, drmrs to peer bit both ASWs [puppet] - 10https://gerrit.wikimedia.org/r/1020844 (https://phabricator.wikimedia.org/T362772) (owner: 10Cathal Mooney) [14:07:26] (03PS3) 10Jforrester: wikifunctions: Move apparmor annotation to pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020701 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:07:28] (03CR) 10Jforrester: [C:03+2] wikifunctions: Move apparmor annotation to pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020701 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:07:34] * domas looks at shard groups in DBs... they did not change in 10+ years?!!??!? [14:07:58] (03CR) 10Ssingh: "10:04:47 error: CNAME 'measure-magru.wikimedia.org.' 
points to known same-zone NXDOMAIN 'upload-lb.magru.wikimedia.org.'" [dns] - 10https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:07:59] domas: Not much; s5 is more the default for new wikis than s3, but otherwise we've scaled hardware just about fast enough. [14:08:24] Plus some feature re-writing / endless DB query tuning to keep pace. [14:08:24] (03Merged) 10jenkins-bot: wikifunctions: Move apparmor annotation to pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020701 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:08:29] yea, looks like operation is at the level where everything is still done manually! nice jobs program tho! [14:08:42] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1020839|Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score"]] (duration: 16m 49s) [14:08:43] /o\ [14:08:47] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup::data_persistence [14:08:51] just morebot is now stashbot :( [14:08:53] Lucas_WMDE: All done? [14:08:55] morebots [14:09:02] yes, sorry, you’re good to go [14:09:05] Awesome. [14:09:14] was distracted for a moment [14:09:15] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:09:19] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:09:42] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:09:57] has mediawiki been implemented in wikifunctions yet? [14:10:08] No. :-) [14:10:14] you're not serious people [14:10:18] Next step is async content loading for WF into MW pages. [14:10:22] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:10:25] Which'll be fun. 
[14:10:48] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:11:56] domas: Hackathon is in Tallinn in a couple of weeks' time. You should come by! ;-) [14:13:08] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:13:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T361627)', diff saved to https://phabricator.wikimedia.org/P60780 and previous config saved to /var/cache/conftool/dbconfig/20240417-141314-marostegui.json [14:13:17] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:13:21] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:13:29] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Unfortunately there wasn’t enough time to deploy this today, but it should be okay to deploy at any time later." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [14:13:49] (03PS2) 10Cathal Mooney: Add new BGP group for cross-rack PyBal peerings at L3 POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1020843 (https://phabricator.wikimedia.org/T362772) [14:13:50] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:23] (03CR) 10CI reject: [V:04-1] Add new BGP group for cross-rack PyBal peerings at L3 POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1020843 (https://phabricator.wikimedia.org/T362772) (owner: 10Cathal Mooney) [14:14:36] domas: We now have fancy bot-maintained pages to tell us what bits of the DBs are likely to flake today: https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance ;-) [14:14:39] James_F, heh, weird location [14:14:55] (03CR) 10Ssingh: Reverse DNS changes for new Magru prefixes (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [14:15:04] hah [14:15:26] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:15:50] (03PS1) 10Muehlenhoff: Switch insetup::data_persistence to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020845 (https://phabricator.wikimedia.org/T349619) [14:15:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60781 and previous config saved to /var/cache/conftool/dbconfig/20240417-141557-root.json [14:16:10] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator 2024-04-04-132719 to 2024-04-17-125039 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020837 (https://phabricator.wikimedia.org/T302519) 
[14:16:13] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator 2024-04-04-132719 to 2024-04-17-125039 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020837 (https://phabricator.wikimedia.org/T302519) (owner: 10Jforrester) [14:16:29] jee this channel got spammy [14:16:45] (03PS5) 10Cathal Mooney: DNS zone changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [14:17:10] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator 2024-04-04-132719 to 2024-04-17-125039 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020837 (https://phabricator.wikimedia.org/T302519) (owner: 10Jforrester) [14:17:36] (03CR) 10CI reject: [V:04-1] DNS zone changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [14:18:24] (03PS6) 10Cathal Mooney: DNS zone changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [14:18:34] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:18:35] (03CR) 10Cathal Mooney: DNS zone changes for new Magru prefixes (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [14:18:46] (03PS7) 10Cathal Mooney: DNS zone changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [14:19:10] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:19:38] (03CR) 10CI reject: [V:04-1] DNS zone changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [14:19:40] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:20:36] !log depool cp1114.eqiad.wmnet for PXE boot 
testing issues and downgrade NIC firmware: T350179 [14:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:43] T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 [14:20:49] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1114.eqiad.wmnet,service=(cdn|ats-be) [14:20:49] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:20:56] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:20:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P60782 and previous config saved to /var/cache/conftool/dbconfig/20240417-142057-ladsgroup.json [14:21:10] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1114.eqiad.wmnet [14:21:13] (03PS1) 10Majavah: P:toolforge::bastion: add rsync [puppet] - 10https://gerrit.wikimedia.org/r/1020847 (https://phabricator.wikimedia.org/T362679) [14:22:00] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp1114.eqiad.wmnet [14:22:06] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:23:03] (Done with our deploy window if others need it.) [14:23:13] (03CR) 10Muehlenhoff: [C:03+2] Switch insetup::data_persistence to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020845 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:23:34] anzx: if you’re still around, it sounds like we could deploy the mlwiki draft namespace now? [14:23:48] (unless someone objects ^^) [14:23:51] (and thanks James_F!) [14:25:43] excited to see a database person as CTO ! 
[14:28:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P60783 and previous config saved to /var/cache/conftool/dbconfig/20240417-142823-marostegui.json [14:28:32] Lucas_WMDE: yeah I am available [14:28:50] alright! [14:29:21] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Deploying in a gap between other windows now (Wikifunctions didn’t need their full window)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [14:29:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [14:29:46] (03PS6) 10Anzx: mlwiki: create draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) [14:29:50] (03CR) 10TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [14:30:36] (03Merged) 10jenkins-bot: mlwiki: create draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020242 (https://phabricator.wikimedia.org/T362653) (owner: 10Anzx) [14:31:03] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1020242|mlwiki: create draft namespace (T362653)]] [14:31:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60784 and previous config saved to /var/cache/conftool/dbconfig/20240417-143103-root.json [14:31:16] T362653: Create Draft Namespace in Malayalam Wikipedia - https://phabricator.wikimedia.org/T362653 [14:33:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: 
insetup::data_persistence [14:34:05] !log lucaswerkmeister-wmde@deploy1002 anzx and lucaswerkmeister-wmde: Backport for [[gerrit:1020242|mlwiki: create draft namespace (T362653)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:34:09] anzx: please test :) [14:34:27] https://ml.wikipedia.org/wiki/Draft:XYZ redirects me to a URL with localized namespace name, that already sounds like a good sign [14:34:58] (03PS1) 10Muehlenhoff: Add explicit Hiera host entries for es1035-es1040,es2035-es2040 [puppet] - 10https://gerrit.wikimedia.org/r/1020849 (https://phabricator.wikimedia.org/T349619) [14:36:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P60785 and previous config saved to /var/cache/conftool/dbconfig/20240417-143606-ladsgroup.json [14:36:12] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1020847 (https://phabricator.wikimedia.org/T362679) (owner: 10Majavah) [14:37:23] (03CR) 10Majavah: [C:03+2] P:toolforge::bastion: add rsync [puppet] - 10https://gerrit.wikimedia.org/r/1020847 (https://phabricator.wikimedia.org/T362679) (owner: 10Majavah) [14:38:50] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:01] anzx: are you still there? 
[14:42:26] (03PS1) 10Muehlenhoff: Remove now obsolete Hiera host entries for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020850 (https://phabricator.wikimedia.org/T349619) [14:43:02] (03PS1) 10Ssingh: hiera: add magru installserver in dhcp.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1020851 (https://phabricator.wikimedia.org/T346722) [14:43:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P60786 and previous config saved to /var/cache/conftool/dbconfig/20240417-144330-marostegui.json [14:44:15] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:44:20] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9722701 (10MoritzMuehlenhoff) [14:44:38] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:44:45] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1976/console" [puppet] - 10https://gerrit.wikimedia.org/r/1020851 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:44:48] (03CR) 10Fabfur: [C:03+1] "ok" [puppet] - 10https://gerrit.wikimedia.org/r/1020851 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:44:56] anzx: ping [14:45:11] (03CR) 10Marostegui: [C:03+1] Add explicit Hiera host entries for es1035-es1040,es2035-es2040 [puppet] - 10https://gerrit.wikimedia.org/r/1020849 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:45:38] (03CR) 10Cathal Mooney: [C:03+1] hiera: add magru installserver in dhcp.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1020851 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:46:00] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: add magru installserver in dhcp.yaml [puppet] - 
10https://gerrit.wikimedia.org/r/1020851 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:46:24] (03CR) 10Muehlenhoff: [C:03+2] Add explicit Hiera host entries for es1035-es1040,es2035-es2040 [puppet] - 10https://gerrit.wikimedia.org/r/1020849 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:49:13] no sign of anzx :( [14:49:57] as far as I can tell the namespace is working, so I’ll just go ahead and deploy it anyway [14:50:01] !log lucaswerkmeister-wmde@deploy1002 anzx and lucaswerkmeister-wmde: Continuing with sync [14:51:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T352010)', diff saved to https://phabricator.wikimedia.org/P60787 and previous config saved to /var/cache/conftool/dbconfig/20240417-145113-ladsgroup.json [14:51:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [14:51:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:51:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [14:51:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T352010)', diff saved to https://phabricator.wikimedia.org/P60788 and previous config saved to /var/cache/conftool/dbconfig/20240417-145136-ladsgroup.json [14:51:55] hm, those were some *very* suspiciously fast helmfile runs [14:52:08] (03PS1) 10Clément Goubert: kubernetes: move 6 appservers from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1020852 (https://phabricator.wikimedia.org/T351074) [14:52:17] I’m not used to mw-web taking just 1 minute (eqiad) or 52 seconds (codfw) [14:52:35] hmm [14:53:05] `kube_env mw-web eqiad; kubectl get deployments` reports 223 somethings though, that’s the same number as earlier [14:53:15] (somethings = pods, I think? 
😅) [14:53:49] pods are being replaced as we speak [14:54:12] weird that helmfile returned before it was done replacing the pods [14:54:17] ohhhhh [14:54:19] no, I’m just an idiot [14:54:24] that was --selector name=canary :) [14:54:28] hahaha [14:54:30] the --selector name=main ones are ongoing now [14:54:32] :D [14:54:37] yeah, that's *a lot* faster [14:54:46] weird that :D [14:55:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 986.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:55:27] yeah now `kubectl get deployments` shows a lower number of up-to-date, as expected [14:55:37] jouncebot: next [14:55:38] In 2 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T1700) [14:55:41] yes yes parsoid, you're slow, it's ok [14:56:10] or is it though :/ [14:58:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T361627)', diff saved to https://phabricator.wikimedia.org/P60789 and previous config saved to /var/cache/conftool/dbconfig/20240417-145838-marostegui.json [14:58:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [14:58:43] It's serving more 400s than usual since 1443 [14:58:44] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:58:50] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [14:58:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:59:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:59:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T361627)', diff saved to https://phabricator.wikimedia.org/P60790 and previous config saved to /var/cache/conftool/dbconfig/20240417-145916-marostegui.json [14:59:39] latency is coming down, must have been a bunch of reparses [15:00:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 935.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:01:24] (CR) Ssingh: DNS zone changes for new Magru prefixes (3 comments) [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [15:01:38] Lucas_WMDE: sorry I didn't notice ping, was having lunch in that time, thanks for deploy [15:01:59] you can still test it now, just in case any follow-up fixes are necessary ^^ [15:02:06] Testing [15:02:22] ok, thanks! [15:02:39] and I’ll run namespaceDupes in a moment, apparently there are 0 pages but 82 links to fix [15:03:27] Lucas_WMDE: looks good [15:03:33] great, thanks! 
[15:03:46] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1020242|mlwiki: create draft namespace (T362653)]] (duration: 32m 43s) [15:03:59] T362653: Create Draft Namespace in Malayalam Wikipedia - https://phabricator.wikimedia.org/T362653 [15:04:21] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes mlwiki --fix # T362653: 0 pages to fix, 0 were resolvable; 82 links to fix, 82 were resolvable, 0 were deleted. [15:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:20] !log UTC afternoon backport+config window (belatedly) done [15:05:22] * Lucas_WMDE done [15:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:33] !log repool ncredir2001 [15:06:34] Lucas_WMDE: thank you [15:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:55] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1018256 (https://phabricator.wikimedia.org/T361066) (owner: Slyngshede) [15:07:58] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye [15:08:06] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722796 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1114.eqiad.wmnet with OS bul... 
[15:08:24] (PS8) Cathal Mooney: DNS zone changes for new Magru prefixes [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [15:09:19] (CR) CI reject: [V:-1] DNS zone changes for new Magru prefixes [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [15:09:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cp1115.eqiad.wmnet [15:12:51] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:13:13] (CR) Muehlenhoff: Initial documentation for the Bitu API. (8 comments) [software/bitu] - https://gerrit.wikimedia.org/r/1020802 (owner: Slyngshede) [15:13:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:13:36] (PS2) Ladsgroup: mariadb: Set up dedicated cumin user [puppet] - https://gerrit.wikimedia.org/r/1020830 [15:15:04] (CR) Hnowlan: [C:+1] kubernetes: move 6 appservers from codfw [puppet] - https://gerrit.wikimedia.org/r/1020852 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert) [15:16:39] (CR) CI reject: [V:-1] mariadb: Set up dedicated cumin user [puppet] - https://gerrit.wikimedia.org/r/1020830 (owner: Ladsgroup) [15:16:41] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1020830 (owner: Ladsgroup) [15:16:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [15:16:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [15:16:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2127 (T360332)', diff saved to https://phabricator.wikimedia.org/P60792 and previous config saved to /var/cache/conftool/dbconfig/20240417-151653-arnaudb.json [15:17:22] SRE, serviceops, Data Products (Data 
Products Sprint 12), Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9722869 (WDoranWMF) [15:17:27] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:17:28] (PS3) Ladsgroup: mariadb: Set up dedicated cumin user [puppet] - https://gerrit.wikimedia.org/r/1020830 [15:17:54] (CR) Fabfur: [C:+1] "I double checked the PTRs for ip6 and looks good now" [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [15:18:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2120 depool T358741', diff saved to https://phabricator.wikimedia.org/P60793 and previous config saved to /var/cache/conftool/dbconfig/20240417-151811-arnaudb.json [15:18:27] T358741: Decommission db2096-db2120 - https://phabricator.wikimedia.org/T358741 [15:18:40] (KubernetesRsyslogDown) firing: rsyslog on mw2412:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2412 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:20:09] (PS1) Arnaudb: mariadb: remove db2120 [puppet] - https://gerrit.wikimedia.org/r/1020716 (https://phabricator.wikimedia.org/T358741) [15:20:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T361627)', diff saved to https://phabricator.wikimedia.org/P60794 and previous config saved to /var/cache/conftool/dbconfig/20240417-152023-marostegui.json [15:20:38] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:21:02] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - 
https://phabricator.wikimedia.org/T350179#9722934 (ssingh) Open→Resolved @Papaul deserves a lot of love for fixing this persistent issue. The 21.x firmware (specifica... [15:22:39] (CR) Muehlenhoff: Keymanagement, fix parsing and display of FIDO/U2F keys (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1020836 (owner: Slyngshede) [15:23:13] (CR) Ssingh: [C:+1] "Looks good, thanks!" [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [15:25:49] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722986 (MoritzMuehlenhoff) >>! In T350179#9722934, @ssingh wrote: > @Papaul deserves a lot of love for fixing this persistent issue... [15:26:49] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722990 (MatthewVernon) +1 to thanks to Papaul for getting to the bottom of this! [15:27:09] SRE, Infrastructure-Foundations, Mail, MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9722988 (jhathaway) Given that this has reoccurred and from the emails you provided looks to be duplication on the application layer I think we need... [15:27:59] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage [15:28:03] ROB! [15:28:20] Can you please unplug all cables! 
[15:30:30] !log making magru IPs live in netbox and generating DNS records with cookbook T362421 [15:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:40] T362421: magru network setup - https://phabricator.wikimedia.org/T362421 [15:31:34] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:31:50] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage [15:32:05] (PS1) Jdlrobson: Upstream tablet infobox styles [extensions/WikimediaMessages] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1020736 (https://phabricator.wikimedia.org/T3603861) [15:32:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T360332)', diff saved to https://phabricator.wikimedia.org/P60795 and previous config saved to /var/cache/conftool/dbconfig/20240417-153238-arnaudb.json [15:32:40] (PS1) Jdlrobson: Upstream tablet infobox styles [extensions/WikimediaMessages] (wmf/1.43.0-wmf.1) - https://gerrit.wikimedia.org/r/1020737 (https://phabricator.wikimedia.org/T3603861) [15:32:43] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:32:53] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to Superset for aitolkyn - https://phabricator.wikimedia.org/T362533#9723032 (ssingh) Open→Resolved a:ssingh @Aitolkyn I am marking this as resolved but if that's not the case, please re-open it again thanks! 
[15:33:04] (CR) Marostegui: [C:+1] mariadb: remove db2120 [puppet] - https://gerrit.wikimedia.org/r/1020716 (https://phabricator.wikimedia.org/T358741) (owner: Arnaudb) [15:33:20] (CR) Arnaudb: [C:+2] mariadb: remove db2120 [puppet] - https://gerrit.wikimedia.org/r/1020716 (https://phabricator.wikimedia.org/T358741) (owner: Arnaudb) [15:33:44] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [15:34:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [15:34:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:34:50] (PS1) Jdlrobson: Enable WikimediaSkinStyles on English Wikipedia Vector 2022 skin [mediawiki-config] - https://gerrit.wikimedia.org/r/1020854 (https://phabricator.wikimedia.org/T362726) [15:35:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2120.codfw.wmnet [15:36:54] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1020830 (owner: Ladsgroup) [15:39:29] (CR) Fabfur: [C:+1] "ok for me" [dns] - https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) (owner: Ssingh) [15:40:39] !log merging patch and updating dns servers with new magru ranges T362421 [15:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:51] T362421: magru network setup - https://phabricator.wikimedia.org/T362421 [15:41:27] (PS9) Cathal Mooney: DNS zone changes for new Magru ranges [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [15:42:10] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [15:42:14] (CR) CI reject: [V:-1] DNS zone 
changes for new Magru ranges [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [15:44:41] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2120.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [15:45:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2120.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [15:45:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:45:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2120.codfw.wmnet [15:47:19] SRE, Machine-Learning-Team, MW-on-K8s, serviceops, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9723080 (elukey) Added some thoughts to T353622#9723070, I found out a big can of worms while testing staging :) The upgrade is more complex than... [15:50:48] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:51:38] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1114.eqiad.wmnet with OS bullseye [15:52:37] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [15:53:07] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9723209 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1114.eqiad.wmnet with OS bul... 
[15:53:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [15:53:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:59] (PS10) Cathal Mooney: DNS zone changes for new Magru prefixes [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [15:54:30] (CR) CI reject: [V:-1] DNS zone changes for new Magru prefixes [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [15:54:48] (PS1) Arnaudb: mariadb: removes db2119 [puppet] - https://gerrit.wikimedia.org/r/1020717 (https://phabricator.wikimedia.org/T362790) [15:56:43] (CR) Ladsgroup: "https://puppet-compiler.wmflabs.org/output/1020830/830/cumin1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1020830 (owner: Ladsgroup) [15:57:41] (CR) Marostegui: [C:+1] mariadb: removes db2119 [puppet] - https://gerrit.wikimedia.org/r/1020717 (https://phabricator.wikimedia.org/T362790) (owner: Arnaudb) [15:57:57] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:59:45] SRE, conftool, Data-Persistence, Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9723345 (CDanis) >>! In T360029#9722005, @Ladsgroup wrote: >>>! In T360029#9658042, @CDanis wrote: > Actually the idea is that dbctl should... 
[15:59:49] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [16:00:24] (CR) Btullis: [C:+2] Migrate geo-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014659 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [16:00:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [16:00:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:00] (CR) Arnaudb: [C:+2] mariadb: removes db2119 [puppet] - https://gerrit.wikimedia.org/r/1020717 (https://phabricator.wikimedia.org/T362790) (owner: Arnaudb) [16:01:22] (Merged) jenkins-bot: Migrate geo-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014659 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [16:02:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2119.codfw.wmnet [16:03:08] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [16:03:27] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [16:03:47] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [16:03:56] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:04:06] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [16:04:12] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [16:04:16] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1114.eqiad.wmnet,service=(cdn|ats-be) [16:04:39] !log btullis@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/geo-analytics: apply [16:04:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2119 depool T358741', diff saved to https://phabricator.wikimedia.org/P60796 and previous config saved to /var/cache/conftool/dbconfig/20240417-160443-arnaudb.json [16:04:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P60797 and previous config saved to /var/cache/conftool/dbconfig/20240417-160451-arnaudb.json [16:04:59] T358741: Decommission db2096-db2120 - https://phabricator.wikimedia.org/T358741 [16:05:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P60798 and previous config saved to /var/cache/conftool/dbconfig/20240417-160501-marostegui.json [16:05:57] !log cdanis@cumin1002 conftool action : set/host_ip=69.69.69.69; selector: name=db1211 [16:06:02] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [16:06:06] !log cdanis@cumin1002 conftool action : set/host_ip=10.64.16.8; selector: name=db1211 [16:07:15] !log cdanis@cumin1002 conftool action : set/host_ip=1.1.1.1; selector: name=db1211 [16:07:21] !log cdanis@cumin1002 conftool action : set/host_ip=10.64.16.8; selector: name=db1211 [16:08:03] SRE, conftool, Data-Persistence, Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9723385 (CDanis) @Marostegui As it turns out, plain old `confctl` can be used to do this already. You can for instance do `sh sudo confctl... 
[16:08:35] (CR) Cathal Mooney: "recheck" [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [16:08:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2412:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2412 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:08:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [16:08:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:08:44] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [16:09:39] !log above conftool actions had no impact on production, no dbctl config commit was performed. [16:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:08] (CR) Elukey: [C:+1] "The CI's diff is lovely, now the new ca file rendered is the one from values.yaml, namely the rootCa configured for the session store prod" [deployment-charts] - https://gerrit.wikimedia.org/r/1020356 (https://phabricator.wikimedia.org/T352647) (owner: Eevans) [16:10:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:10:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2119.codfw.wmnet [16:10:58] (CR) Ladsgroup: "According to my reporting, the new user should be everywhere with all the rights." 
[puppet] - https://gerrit.wikimedia.org/r/1020830 (owner: Ladsgroup) [16:12:21] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2119.codfw.wmnet - https://phabricator.wikimedia.org/T362790#9723389 (ABran-WMF) a:ABran-WMF→None [16:13:21] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:13:40] (KubernetesRsyslogDown) firing: rsyslog on mw2412:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2412 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:14:42] !log restarted rsyslog on mw2412 - T357616 [16:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:47] T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616 [16:17:41] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [16:18:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [16:18:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:18:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2412:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2412 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:19:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P60799 and previous config saved to /var/cache/conftool/dbconfig/20240417-161958-arnaudb.json [16:20:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P60800 and previous config saved to /var/cache/conftool/dbconfig/20240417-162008-marostegui.json [16:21:58] SRE, conftool, Data-Persistence, Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9723417 (Marostegui) That's awesome!! Then I guess the cookbook to orchestrate all this can be done? Do we need something else? [16:24:21] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:24:36] (PS7) Btullis: Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) [16:24:52] (PS7) JHathaway: Postfix profile [puppet] - https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) [16:24:52] (CR) Btullis: [C:+2] Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [16:25:10] (PS7) Btullis: Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) [16:25:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:27:08] (CR) Btullis: [V:+2 C:+2] Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [16:27:22] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:28:01] (CR) CI reject: [V:-1] Postfix profile [puppet] - https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: JHathaway) [16:28:03] (Merged) jenkins-bot: Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - 
https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [16:29:00] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [16:29:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [16:29:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:30:38] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:35:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T360332)', diff saved to https://phabricator.wikimedia.org/P60801 and previous config saved to /var/cache/conftool/dbconfig/20240417-163506-arnaudb.json [16:35:20] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [16:35:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T361627)', diff saved to https://phabricator.wikimedia.org/P60802 and previous config saved to /var/cache/conftool/dbconfig/20240417-163518-marostegui.json [16:35:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [16:35:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [16:35:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T361627)', diff saved to https://phabricator.wikimedia.org/P60803 and previous config saved to /var/cache/conftool/dbconfig/20240417-163532-marostegui.json [16:35:32] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - 
https://phabricator.wikimedia.org/T361627 [16:35:42] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [16:36:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding first entries for magru IPs - cmooney@cumin1002" [16:36:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:36:40] (PS11) Cathal Mooney: DNS zone changes for new Magru prefixes [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [16:37:35] (CR) CI reject: [V:-1] DNS zone changes for new Magru prefixes [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [16:38:31] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [16:38:50] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [16:39:00] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [16:39:19] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [16:39:32] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [16:39:48] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [16:40:50] (PS7) Btullis: Migrate page-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014662 (https://phabricator.wikimedia.org/T360531) [16:41:56] (CR) Btullis: [C:+2] Migrate page-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014662 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [16:42:55] (Merged) jenkins-bot: Migrate 
page-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014662 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [16:44:57] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply [16:45:17] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [16:45:22] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [16:45:41] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [16:46:41] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [16:47:07] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [16:50:35] (PS12) Cathal Mooney: DNS zone changes for new Magru prefixes [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [16:53:51] (PS13) Cathal Mooney: DNS zone changes for new Magru prefixes [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [16:55:34] (CR) Cathal Mooney: [C:+2] DNS zone changes for new Magru prefixes [dns] - https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [16:56:34] !log running authdns-update to make magru dns records live T362421 [16:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:39] T362421: magru network setup - https://phabricator.wikimedia.org/T362421 [16:56:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T361627)', diff saved to https://phabricator.wikimedia.org/P60804 and previous config saved to /var/cache/conftool/dbconfig/20240417-165647-marostegui.json [16:56:52] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - 
https://phabricator.wikimedia.org/T361627 [16:59:54] (PS4) Ssingh: magru: add geo-resources and update wikimedia.org zone [dns] - https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T1700) [17:01:29] SRE, conftool, Data-Persistence, Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9723602 (CDanis) I think you should be able to use the existing spicerack interface to confctl to do the `set/host_ip=...` action -- that sh... [17:01:38] ops-magru, DC-Ops, Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9723606 (RobH) >>! In T362729#9721040, @ssingh wrote: > Thanks for the task @RobH! As in the previous runs, please feel free to leave these for Traffic: > > ` > Update the operations/puppet rep... 
[17:08:14] (03PS8) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [17:09:40] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [17:11:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P60805 and previous config saved to /var/cache/conftool/dbconfig/20240417-171154-marostegui.json [17:14:06] (03CR) 10Ssingh: [C:03+2] magru: add geo-resources and update wikimedia.org zone [dns] - 10https://gerrit.wikimedia.org/r/1020827 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [17:14:27] !log running authdns-update for adding magru geo-resources/IPs: T346722 [17:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:32] T346722: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722 [17:18:13] (03PS1) 10Jforrester: wikifunctions: Configure prometheus endpoints on both services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020872 [17:20:16] (03CR) 10Jforrester: [C:04-1] "https://integration.wikimedia.org/ci/job/helm-lint/16896/console shows now diff (except for the chart version), so this is probably wrong?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1020872 (owner: 10Jforrester) [17:21:28] (03PS9) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [17:22:30] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-04-17-125039 to 2024-04-17-163312 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020874 [17:22:50] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [17:27:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P60807 and previous config saved to /var/cache/conftool/dbconfig/20240417-172702-marostegui.json [17:37:55] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9723736 (10Ladsgroup) >>! In T360029#9723602, @CDanis wrote: > I don't see why you couldn't do a simple `subprocess.run` to do a commit, proba... 
[17:42:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T361627)', diff saved to https://phabricator.wikimedia.org/P60808 and previous config saved to /var/cache/conftool/dbconfig/20240417-174210-marostegui.json [17:42:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [17:42:17] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:42:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [17:42:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T361627)', diff saved to https://phabricator.wikimedia.org/P60809 and previous config saved to /var/cache/conftool/dbconfig/20240417-174233-marostegui.json [17:44:44] SRE, LDAP-Access-Requests: Grant Access to 'wmf' ldap group for DErenrich to allow logstash access - https://phabricator.wikimedia.org/T362731#9723754 (ssingh) @NBaca-WMF: This needs your approval, thanks! 
[17:52:33] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for kgraessle - https://phabricator.wikimedia.org/T362812 (Kgraessle) NEW [17:54:31] (PS1) Ssingh: admin: add derenrich to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1020879 (https://phabricator.wikimedia.org/T362731) [17:57:28] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:57:35] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:59:32] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:59:37] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:00:04] dancy and hashar: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T1800) [18:00:58] o/ [18:01:27] (PS1) Ssingh: admin: add kgraessle to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1020881 (https://phabricator.wikimedia.org/T362812) [18:01:46] * dancy browses logspam [18:03:28] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for kgraessle - https://phabricator.wikimedia.org/T362812#9723831 (ssingh) @DMburugu: this requires your approval, thanks! 
[18:03:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T361627)', diff saved to https://phabricator.wikimedia.org/P60810 and previous config saved to /var/cache/conftool/dbconfig/20240417-180346-marostegui.json [18:03:50] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:03:54] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:11:29] (03PS1) 10Jdlrobson: Enable night mode in AMC for all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020883 (https://phabricator.wikimedia.org/T361555) [18:13:50] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:14:02] (03PS1) 10Jdlrobson: Enable limited width on all main pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020886 (https://phabricator.wikimedia.org/T357706) [18:18:20] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020889 (https://phabricator.wikimedia.org/T361395) [18:18:22] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020889 (https://phabricator.wikimedia.org/T361395) (owner: 10TrainBranchBot) [18:18:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P60812 and previous config saved to 
/var/cache/conftool/dbconfig/20240417-181854-marostegui.json [18:19:05] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020889 (https://phabricator.wikimedia.org/T361395) (owner: 10TrainBranchBot) [18:34:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P60813 and previous config saved to /var/cache/conftool/dbconfig/20240417-183401-marostegui.json [18:35:33] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.1 refs T361395 [18:35:40] T361395: 1.43.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T361395 [18:43:46] (03PS1) 10Cathal Mooney: Remove comment added in error [dns] - 10https://gerrit.wikimedia.org/r/1020901 (https://phabricator.wikimedia.org/T362421) [18:49:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T361627)', diff saved to https://phabricator.wikimedia.org/P60814 and previous config saved to /var/cache/conftool/dbconfig/20240417-184908-marostegui.json [18:49:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2194.codfw.wmnet with reason: Maintenance [18:49:16] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:49:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2194.codfw.wmnet with reason: Maintenance [18:49:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T361627)', diff saved to https://phabricator.wikimedia.org/P60815 and previous config saved to /var/cache/conftool/dbconfig/20240417-184931-marostegui.json [18:50:19] (03PS3) 10Cathal Mooney: Add new BGP group for cross-rack PyBal peerings at L3 POPs [homer/public] - 
10https://gerrit.wikimedia.org/r/1020843 (https://phabricator.wikimedia.org/T362772) [18:53:11] Train is blocked on https://phabricator.wikimedia.org/T362817 [18:54:54] (03PS2) 10Cathal Mooney: Adjust LVS config in esams, drmrs to peer bit both ASWs [puppet] - 10https://gerrit.wikimedia.org/r/1020844 (https://phabricator.wikimedia.org/T362772) [18:56:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2040:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2040 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:58:39] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#9724034 (10cmooney) I believe the two patches above, once merged, will add the required redundancy. Following option 1 above, creatin... [19:02:45] (03CR) 10Ssingh: [C:03+1] Remove comment added in error [dns] - 10https://gerrit.wikimedia.org/r/1020901 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [19:02:45] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#9724049 (10cmooney) Perhaps one option would be to ignore the puppet patch to change drmrs and esams for now - but merge the Homer one... 
[19:10:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T361627)', diff saved to https://phabricator.wikimedia.org/P60816 and previous config saved to /var/cache/conftool/dbconfig/20240417-191043-marostegui.json [19:10:49] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:11:40] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2040:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2040 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:12:03] SRE, Infrastructure-Foundations, Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9724065 (CDanis) I largely agree with Arzhel's assessment. At a cursory glance, Uruguay or Paraguay look ideal as first candidates.... 
[19:18:44] (03PS1) 10Majavah: P:toolforge::bastion: install locales-all [puppet] - 10https://gerrit.wikimedia.org/r/1020906 (https://phabricator.wikimedia.org/T362680) [19:19:47] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1020906 (https://phabricator.wikimedia.org/T362680) (owner: 10Majavah) [19:25:48] (03PS1) 10Jforrester: Revert "REST: Deprecate using "post" as the parameter source" [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020910 (https://phabricator.wikimedia.org/T362817) [19:25:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P60817 and previous config saved to /var/cache/conftool/dbconfig/20240417-192551-marostegui.json [19:40:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P60818 and previous config saved to /var/cache/conftool/dbconfig/20240417-194058-marostegui.json [19:43:38] (03PS1) 10Ebernhardson: cirrus: Update container image and increase metaspace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020913 [19:50:09] (03CR) 10CI reject: [V:04-1] Revert "REST: Deprecate using "post" as the parameter source" [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020910 (https://phabricator.wikimedia.org/T362817) (owner: 10Jforrester) [19:50:56] (03CR) 10Jforrester: "recheck" [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020910 (https://phabricator.wikimedia.org/T362817) (owner: 10Jforrester) [19:55:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2040:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2040 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:56:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T361627)', diff saved to https://phabricator.wikimedia.org/P60819 and previous config saved to /var/cache/conftool/dbconfig/20240417-195605-marostegui.json [19:56:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2209.codfw.wmnet with reason: Maintenance [19:56:11] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:56:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2209.codfw.wmnet with reason: Maintenance [19:56:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T361627)', diff saved to https://phabricator.wikimedia.org/P60820 and previous config saved to /var/cache/conftool/dbconfig/20240417-195628-marostegui.json [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T2000) [20:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:37] o/ [20:01:23] o/ [20:01:42] Jdlrobson: can the 2 pairs of patches go out together? config + backports? 
[20:03:19] (03CR) 10Clare Ming: [C:03+2] Upstream tablet infobox styles [extensions/WikimediaMessages] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1020736 (https://phabricator.wikimedia.org/T3603861) (owner: 10Jdlrobson) [20:03:22] (03CR) 10Clare Ming: [C:03+2] Upstream tablet infobox styles [extensions/WikimediaMessages] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020737 (https://phabricator.wikimedia.org/T3603861) (owner: 10Jdlrobson) [20:04:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020854 (https://phabricator.wikimedia.org/T362726) (owner: 10Jdlrobson) [20:04:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020883 (https://phabricator.wikimedia.org/T361555) (owner: 10Jdlrobson) [20:05:02] cjming: they can yet [20:05:04] *yes [20:05:40] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2040:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2040 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:06:10] (03Merged) 10jenkins-bot: Enable WikimediaSkinStyles on English Wikipedia Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020854 (https://phabricator.wikimedia.org/T362726) (owner: 10Jdlrobson) [20:06:12] (03Merged) 10jenkins-bot: Enable night mode in AMC for all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020883 (https://phabricator.wikimedia.org/T361555) (owner: 10Jdlrobson) [20:06:46] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1020854|Enable WikimediaSkinStyles on English Wikipedia Vector 2022 skin (T362726)]], [[gerrit:1020883|Enable night mode in AMC for all projects (T361555)]] [20:06:52] T362726: [config] Enable night mode styles on Vector 
2022 skin - https://phabricator.wikimedia.org/T362726 [20:06:53] T361555: [Config] Enable night mode for logged in AMC users on mobile for more projects and include template namespace - https://phabricator.wikimedia.org/T361555 [20:09:48] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:1020854|Enable WikimediaSkinStyles on English Wikipedia Vector 2022 skin (T362726)]], [[gerrit:1020883|Enable night mode in AMC for all projects (T361555)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:09:52] Jdlrobson: 1st 2 config patches on test servers [20:10:39] cjming: on it [20:11:21] cjming: and LGTM! [20:11:28] !log cjming@deploy1002 cjming and jdlrobson: Continuing with sync [20:17:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T361627)', diff saved to https://phabricator.wikimedia.org/P60821 and previous config saved to /var/cache/conftool/dbconfig/20240417-201733-marostegui.json [20:17:50] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [20:21:15] (03Merged) 10jenkins-bot: Upstream tablet infobox styles [extensions/WikimediaMessages] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1020736 (https://phabricator.wikimedia.org/T3603861) (owner: 10Jdlrobson) [20:23:12] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9724322 (10RobH) [20:24:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.303s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:24:49] (03Merged) 
10jenkins-bot: Upstream tablet infobox styles [extensions/WikimediaMessages] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020737 (https://phabricator.wikimedia.org/T3603861) (owner: 10Jdlrobson) [20:24:59] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1020854|Enable WikimediaSkinStyles on English Wikipedia Vector 2022 skin (T362726)]], [[gerrit:1020883|Enable night mode in AMC for all projects (T361555)]] (duration: 18m 13s) [20:25:08] T362726: [config] Enable night mode styles on Vector 2022 skin - https://phabricator.wikimedia.org/T362726 [20:25:08] T361555: [Config] Enable night mode for logged in AMC users on mobile for more projects and include template namespace - https://phabricator.wikimedia.org/T361555 [20:26:01] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1020736|Upstream tablet infobox styles (T3603861)]], [[gerrit:1020737|Upstream tablet infobox styles (T3603861)]] [20:29:02] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:1020736|Upstream tablet infobox styles (T3603861)]], [[gerrit:1020737|Upstream tablet infobox styles (T3603861)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:29:07] Jdlrobson: config patches should be live! backports on test servers [20:29:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.125s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:29:39] cjming: looking! 
:D [20:29:45] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 909.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:30:48] cjming: yep that's working! Please sync! [20:30:53] !log cjming@deploy1002 cjming and jdlrobson: Continuing with sync [20:32:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P60822 and previous config saved to /var/cache/conftool/dbconfig/20240417-203241-marostegui.json [20:35:29] thanks cjming - how come this went so much quicker today?! :) [20:36:28] yw! shipping in pairs helps lol -- and i +2'd the backports 1st thing [20:39:01] makes sense [20:39:45] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.023s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:40:29] (03CR) 10Eevans: [C:03+2] {echo,session}store (staging): use wmf-ca-certificates.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020356 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [20:41:30] (03Merged) 10jenkins-bot: {echo,session}store (staging): use wmf-ca-certificates.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020356 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [20:43:32] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1020736|Upstream tablet infobox styles (T3603861)]], [[gerrit:1020737|Upstream tablet infobox styles (T3603861)]] 
(duration: 17m 30s) [20:43:34] Jdlrobson: alrighty - backports should be live! [20:43:35] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [20:44:13] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [20:44:19] !log end of UTC late backport window [20:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:31] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply [20:44:45] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 959.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:44:55] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply [20:47:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P60823 and previous config saved to /var/cache/conftool/dbconfig/20240417-204748-marostegui.json [20:48:20] thanks cjming ! [20:49:00] ur welcome! 
[20:49:45] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 913.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:50:49] (03CR) 10Dzahn: [C:03+2] create ae.wikimedia.org for United Arab Emirates User Group [dns] - 10https://gerrit.wikimedia.org/r/1020311 (https://phabricator.wikimedia.org/T362529) (owner: 10Dzahn) [20:51:12] (03PS3) 10Dzahn: create ae.wikimedia.org for United Arab Emirates User Group [dns] - 10https://gerrit.wikimedia.org/r/1020311 (https://phabricator.wikimedia.org/T362529) [20:55:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 864.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T2100) [21:00:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 857.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:02:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T361627)', diff saved to https://phabricator.wikimedia.org/P60824 and previous config saved to 
/var/cache/conftool/dbconfig/20240417-210256-marostegui.json [21:03:02] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [21:06:29] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1020311 (https://phabricator.wikimedia.org/T362529) (owner: 10Dzahn) [21:09:47] !log DNS - created ae.wikimedia.org for United Arab Emirates User Group wiki - T362529 [21:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:52] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529 [21:15:40] (KubernetesRsyslogDown) firing: rsyslog on mw2413:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2413 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:15:52] (03PS1) 10Zabe: Add Apache configuration for ae.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020920 (https://phabricator.wikimedia.org/T362529) [21:20:16] (03CR) 10Dzahn: [C:03+1] Add Apache configuration for ae.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020920 (https://phabricator.wikimedia.org/T362529) (owner: 10Zabe) [21:20:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2413:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2413 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:26:33] (03CR) 10Dzahn: [C:03+1] "created in DNS today - user group confirmed - and that we are using country TLD here" [puppet] - 10https://gerrit.wikimedia.org/r/1020920 (https://phabricator.wikimedia.org/T362529) (owner: 10Zabe) [21:29:18] (03CR) 10Dzahn: [C:03+1] "follows I23cb7cd2911ff7" [puppet] - 
10https://gerrit.wikimedia.org/r/1020920 (https://phabricator.wikimedia.org/T362529) (owner: 10Zabe) [21:29:48] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1020087 (https://phabricator.wikimedia.org/T362421) (owner: 10Volans) [21:30:34] (03CR) 10Cathal Mooney: [C:04-1] "Acknowledged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020202 (owner: 10Ssingh) [21:31:34] (03PS2) 10Cathal Mooney: Remove comment added in error [dns] - 10https://gerrit.wikimedia.org/r/1020901 (https://phabricator.wikimedia.org/T362421) [21:32:47] (03CR) 10Cathal Mooney: [C:03+2] Remove comment added in error [dns] - 10https://gerrit.wikimedia.org/r/1020901 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [21:33:41] (03PS34) 10Ryan Kemper: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [21:35:13] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [21:35:40] (KubernetesRsyslogDown) firing: rsyslog on mw2414:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2414 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:39:57] (03PS35) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [21:40:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2414:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2414 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:41:04] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater 
[alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [21:41:40] (03CR) 10Zabe: [C:03+2] Revert "REST: Deprecate using "post" as the parameter source" [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020910 (https://phabricator.wikimedia.org/T362817) (owner: 10Jforrester) [21:42:38] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9724526 (10Jdlrobson) [21:46:42] (03CR) 10Thcipriani: [C:03+1] "🎉 no more strange symlink!" [puppet] - 10https://gerrit.wikimedia.org/r/1020321 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [21:47:23] (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: Disable /srv/mediawiki-staging/php symlink management [puppet] - 10https://gerrit.wikimedia.org/r/1020321 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [21:47:48] jouncebot: nowandnext [21:47:48] For the next 0 hour(s) and 12 minute(s): Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240417T2100) [21:47:48] In 8 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T0600) [21:47:49] In 8 hour(s) and 12 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T0600) [21:50:26] !log deploying scap config change (gerrit:1020321) - [cumin2002:~] $ sudo cumin -b 4 -s 40 'C:scap AND mw*' 'run-puppet-agent' T359643 [21:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:30] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [21:52:53] (03CR) 10Dzahn: [C:03+2] "running puppet on all mw* via cumin, slowly" [puppet] - 10https://gerrit.wikimedia.org/r/1020321 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) 
[21:55:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.017s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:56:49] (PS36) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [21:57:55] (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking) [22:00:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 870.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:02:09] (PS2) Muehlenhoff: Remove now obsolete Hiera host entries for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1020850 (https://phabricator.wikimedia.org/T349619) [22:02:28] (PS37) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [22:02:30] (CR) Jcrespo: [C:+1] Remove now obsolete Hiera host entries for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1020850 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [22:03:50] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:03:52] (Merged) jenkins-bot: Revert "REST: Deprecate using "post" as the parameter source" [core] (wmf/1.43.0-wmf.1) - https://gerrit.wikimedia.org/r/1020910 (https://phabricator.wikimedia.org/T362817) (owner: Jforrester) [22:06:31] mutante: could you ping me when it is okay to deploy? [22:07:19] (CR) RLazarus: [C:+1] "I haven't double-checked the policy or approvals or anything, but LGTM for the config change." [puppet] - https://gerrit.wikimedia.org/r/1020920 (https://phabricator.wikimedia.org/T362529) (owner: Zabe) [22:08:17] zabe: unless I abort the cumin run, it will take hours but also no deployments are scheduled until in 8 hours? [22:08:59] heh, I +2'ed https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1020910 like 30min ago [22:10:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 832.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:10:23] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (10) wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:10:23] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs2014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag 
[22:10:56] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 19 hosts with reason: T362508 [22:11:03] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508 [22:11:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 19 hosts with reason: T362508 [22:11:42] just being cautious, maybe it's not a problem.. but deploying while scap config is being changed seems like it could potentially be messy [22:11:49] I can speed it up though [22:13:50] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:14:24] mutante: puppet will have run everywhere after 30 min anyhow, right? :) [22:15:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 805.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:15:24] rzl: that's right, I just canceled cumin [22:15:36] zabe: it's ok in ~ 8 minutes [22:16:01] checks where it was actually applied [22:17:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.149s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:19:13] alright [22:21:14] (CR) Dzahn: [C:+1] "approval via https://phabricator.wikimedia.org/T362529#9713714 and 
https://meta.wikimedia.org/wiki/Affiliations_Committee/Resolutions/Reco" [puppet] - https://gerrit.wikimedia.org/r/1020920 (https://phabricator.wikimedia.org/T362529) (owner: Zabe) [22:21:16] zabe: it's ok right now. scap.cfg was already edited on 141/143 mw* with scap.. and now on all. go ahead. [22:22:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 934.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:24:32] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.36s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:24:51] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1020910|Revert "REST: Deprecate using "post" as the parameter source" (T362817)]] [22:25:01] T362817: PHP Deprecated: The "post" source is deprecated, use "body" instead [Called from MediaWiki\Rest\Validator\ParamValidatorCallbacks::getValue] - https://phabricator.wikimedia.org/T362817 [22:27:58] !log zabe@deploy1002 jforrester and zabe: Backport for [[gerrit:1020910|Revert "REST: Deprecate using "post" as the parameter source" (T362817)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:28:40] (KubernetesRsyslogDown) firing: rsyslog on mw2415:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2415 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown 
[22:29:08] !log zabe@deploy1002 jforrester and zabe: Continuing with sync [22:29:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.085s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:33:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2415:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2415 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:38:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.183s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:42:06] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1020910|Revert "REST: Deprecate using "post" as the parameter source" (T362817)]] (duration: 17m 14s) [22:42:15] T362817: PHP Deprecated: The "post" source is deprecated, use "body" instead [Called from MediaWiki\Rest\Validator\ParamValidatorCallbacks::getValue] - https://phabricator.wikimedia.org/T362817 [22:42:34] * zabe done [22:43:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 828.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded 
[22:45:17] zabe: nice confirmation that nothing was wrong with scap [22:48:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 925.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:52:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T352010)', diff saved to https://phabricator.wikimedia.org/P60825 and previous config saved to /var/cache/conftool/dbconfig/20240417-225206-ladsgroup.json [22:52:12] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:53:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 925.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:54:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 904.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:57:37] (PS1) Dzahn: ci: test data_rsync dest host change [puppet] - https://gerrit.wikimedia.org/r/1020949 [22:59:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 831.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:01:40] (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:03:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 932.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:07:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P60826 and previous config saved to /var/cache/conftool/dbconfig/20240417-230714-ladsgroup.json [23:14:01] !log rsyncing jenkins data from contint2002 to contint1002, pre-sync in preparation for migration next week - /srv/jenkins (291G) and much smaller zuul and jenkins data dirs T334517 [23:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:06] T334517: upgrade contint servers to bullseye - https://phabricator.wikimedia.org/T334517 [23:17:59] (Abandoned) Dzahn: ci: test data_rsync dest host change [puppet] - https://gerrit.wikimedia.org/r/1020949 (owner: Dzahn) [23:18:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 818.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:19:57] (PS2) Dzahn: create wikipedia-pl-sysop.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) [23:20:39] (PS3) Dzahn: create wikipedia-pl-sysop.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) [23:20:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T352010)', diff saved to https://phabricator.wikimedia.org/P60827 and previous config saved to /var/cache/conftool/dbconfig/20240417-232050-ladsgroup.json [23:20:56] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:22:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P60828 and previous config saved to /var/cache/conftool/dbconfig/20240417-232221-ladsgroup.json [23:22:55] !log sukhe@cp1114:~$ sudo -i haproxy-restart [23:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:00] (PS1) Dzahn: ci: disable zuul merger on contint2002 for migration [puppet] - https://gerrit.wikimedia.org/r/1020950 (https://phabricator.wikimedia.org/T334517) [23:27:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 848.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:31:38] (PS1) Dzahn: switch contint.wikimedia.org from contint2002 to contint1002 [dns] - 
https://gerrit.wikimedia.org/r/1020951 (https://phabricator.wikimedia.org/T334517) [23:31:55] (CR) Dzahn: [C:-2] "next week" [puppet] - https://gerrit.wikimedia.org/r/1020950 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn) [23:32:03] (CR) Dzahn: [C:-2] "next week" [dns] - https://gerrit.wikimedia.org/r/1020951 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn) [23:35:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P60829 and previous config saved to /var/cache/conftool/dbconfig/20240417-233557-ladsgroup.json [23:37:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T352010)', diff saved to https://phabricator.wikimedia.org/P60830 and previous config saved to /var/cache/conftool/dbconfig/20240417-233731-ladsgroup.json [23:37:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [23:37:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:37:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [23:38:13] (PS1) Dzahn: ci: switch contint manager_host from 2002 to 1002 [puppet] - https://gerrit.wikimedia.org/r/1020954 (https://phabricator.wikimedia.org/T334517) [23:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1020721 [23:38:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1020721 (owner: TrainBranchBot) [23:47:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 832.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:47:19] (PS1) Dzahn: ci: switch gearman_server IP from contint2002 to contint1002 [puppet] - https://gerrit.wikimedia.org/r/1020955 (https://phabricator.wikimedia.org/T334517) [23:47:23] !log amastilovic@deploy1002 Started deploy [airflow-dags/analytics@c9d6969]: (no justification provided) [23:48:00] !log amastilovic@deploy1002 Finished deploy [airflow-dags/analytics@c9d6969]: (no justification provided) (duration: 00m 37s) [23:51:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P60831 and previous config saved to /var/cache/conftool/dbconfig/20240417-235105-ladsgroup.json [23:52:36] (PS1) Dzahn: ci: switch source and destination server for data rsync [puppet] - https://gerrit.wikimedia.org/r/1020957 (https://phabricator.wikimedia.org/T334517) [23:54:38] (PS2) Dzahn: ci: switch gearman_server IP from contint2002 to contint1002 [puppet] - https://gerrit.wikimedia.org/r/1020955 (https://phabricator.wikimedia.org/T334517) [23:57:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 845.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:59:30] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1020721 (owner: TrainBranchBot)