[00:17:10] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:37:11] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1025.eqiad.wmnet with OS bullseye [00:38:49] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009386 [00:38:51] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009386 (owner: 10TrainBranchBot) [00:46:54] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [00:47:10] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:56:25] (03PS2) 10RLazarus: sre.switchdc.mediawiki: Stop maintenance scripts on Kubernetes [cookbooks] - 10https://gerrit.wikimedia.org/r/1008583 (https://phabricator.wikimedia.org/T359130) [01:01:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009386 (owner: 10TrainBranchBot) [01:02:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wdqs1025.mgmt.eqiad.wmnet with reboot policy FORCED [01:07:22] !log jclark@cumin1002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host wdqs1025.mgmt.eqiad.wmnet with reboot policy FORCED [01:07:27] (03PS3) 10RLazarus: sre.switchdc.mediawiki: Stop maintenance scripts on Kubernetes [cookbooks] - 10https://gerrit.wikimedia.org/r/1008583 (https://phabricator.wikimedia.org/T359130) [01:23:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [01:23:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [01:24:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [01:25:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [01:25:44] 06SRE, 10ops-codfw, 06DBA, 06DC-Ops: hw troubleshooting: not identified for db2117.codfw.wmnet - https://phabricator.wikimedia.org/T358846#9609786 (10Jhancock.wm) [01:26:23] 06SRE, 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2117.codfw.wmnet - https://phabricator.wikimedia.org/T359141#9609783 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [01:28:29] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:29:22] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1025.eqiad.wmnet with reason: host reimage [01:29:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:31:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T352010)', diff saved to https://phabricator.wikimedia.org/P58606 and previous config saved to /var/cache/conftool/dbconfig/20240307-013111-ladsgroup.json [01:31:17] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [01:31:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
wdqs1025.eqiad.wmnet with reason: host reimage [01:35:30] (03CR) 10Ssingh: [C: 03+1] Make auth NSID distinct from recdns on same host [puppet] - 10https://gerrit.wikimedia.org/r/1009316 (owner: 10BBlack) [01:35:31] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:36:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:38:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [01:38:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [01:46:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P58607 and previous config saved to /var/cache/conftool/dbconfig/20240307-014618-ladsgroup.json [01:46:45] (03PS1) 10Bking: site.pp: Move wdqs1025 into test role [puppet] - 10https://gerrit.wikimedia.org/r/1009371 (https://phabricator.wikimedia.org/T358727) [01:47:13] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [01:47:35] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [01:50:16] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [01:53:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [02:01:25] !log ladsgroup@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P58608 and previous config saved to /var/cache/conftool/dbconfig/20240307-020124-ladsgroup.json [02:08:04] (03PS1) 10RLazarus: mediawiki: Add mwscript labels to the job as well as the pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009373 (https://phabricator.wikimedia.org/T341553) [02:16:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T352010)', diff saved to https://phabricator.wikimedia.org/P58609 and previous config saved to /var/cache/conftool/dbconfig/20240307-021631-ladsgroup.json [02:16:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [02:16:36] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [02:16:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [02:16:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T352010)', diff saved to https://phabricator.wikimedia.org/P58610 and previous config saved to /var/cache/conftool/dbconfig/20240307-021652-ladsgroup.json [02:37:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:02:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:16:01] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:16:08] !log @deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [03:21:55] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:55] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:16:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:16:55] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:17:10] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:31:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:39:15] * kart_ starting deploying cxserver.. 
[04:39:22] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2024-03-05-082211-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008836 (https://phabricator.wikimedia.org/T353136) (owner: 10KartikMistry) [04:40:15] (03Merged) 10jenkins-bot: Update cxserver to 2024-03-05-082211-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008836 (https://phabricator.wikimedia.org/T353136) (owner: 10KartikMistry) [04:41:24] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [04:43:05] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [04:48:41] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [04:49:15] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [04:50:12] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [04:50:47] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [04:55:17] !log Updated cxserver to 2024-03-05-082211-production (T353136, T353259, T350773) [04:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:24] T353136: Set MinT as the default service in Content Translation for languages supported by IndicTrans2 - https://phabricator.wikimedia.org/T353136 [04:55:24] T353259: Set MinT as the default service in Content Translation for a set of languages based on their machine translation usage - https://phabricator.wikimedia.org/T353259 [04:55:25] T350773: Remove preq and use node fetch - https://phabricator.wikimedia.org/T350773 [05:11:55] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on 
cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:52:54] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:55:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T352010)', diff saved to https://phabricator.wikimedia.org/P58611 and previous config saved to /var/cache/conftool/dbconfig/20240307-055528-ladsgroup.json [05:55:35] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:10:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P58612 and previous config saved to /var/cache/conftool/dbconfig/20240307-061034-ladsgroup.json [06:22:33] <_joe_> !log updated php-luasandbox everywhere T353414 [06:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:37] T353414: Build and deploy LuaSandbox 4.1.2 - https://phabricator.wikimedia.org/T353414 [06:25:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P58613 and previous config saved to /var/cache/conftool/dbconfig/20240307-062541-ladsgroup.json [06:29:26] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9609948 (10Marostegui) [06:29:35] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9609949 (10Marostegui) Waiting for ssh 
out of band verification [06:33:56] 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9609955 (10Marostegui) p:05Triage→03Medium [06:34:52] 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9609958 (10Marostegui) @Himejijo are you sure that is your wikitech user name and your email? I cannot find anything for any of those two. [06:36:07] 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9609960 (10Marostegui) The user rkhan does exist but it is not associated to that email. Can you post the email it is associated to that user and if you really meant rkhan instead of Hime... [06:39:23] 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9609962 (10Marostegui) [06:40:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T352010)', diff saved to https://phabricator.wikimedia.org/P58614 and previous config saved to /var/cache/conftool/dbconfig/20240307-064050-ladsgroup.json [06:40:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [06:40:55] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:41:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [06:41:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T352010)', diff saved to https://phabricator.wikimedia.org/P58615 and previous config saved to /var/cache/conftool/dbconfig/20240307-064112-ladsgroup.json [06:47:51] (03PS1) 10Marostegui: installserver: Do not reimage es1039 [puppet] - 10https://gerrit.wikimedia.org/r/1009380 [06:52:58] RECOVERY - Backup 
freshness on backup1001 is OK: Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:53:04] (03Abandoned) 10Muehlenhoff: Revert "admin: temporarily revoke legoktm's ssh key" [puppet] - 10https://gerrit.wikimedia.org/r/1005637 (owner: 10Legoktm) [06:54:22] (03CR) 10Marostegui: [C: 03+2] installserver: Do not reimage es1039 [puppet] - 10https://gerrit.wikimedia.org/r/1009380 (owner: 10Marostegui) [06:55:54] (03CR) 10Muehlenhoff: [C: 03+2] Move the old apt servers to insetup::buster role [puppet] - 10https://gerrit.wikimedia.org/r/1009281 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T0700) [07:00:05] kormat, marostegui, Amir1, and arnaudb: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T0700). nyaa~ [07:01:22] 06SRE, 06Infrastructure-Foundations: Move RPKI hosts to Bookworm - https://phabricator.wikimedia.org/T359502 (10MoritzMuehlenhoff) 03NEW [07:01:31] 06SRE, 06Infrastructure-Foundations: Move RPKI hosts to Bookworm - https://phabricator.wikimedia.org/T359502#9609992 (10MoritzMuehlenhoff) p:05Triage→03Medium [07:02:27] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:06:17] !log revoke Kerberos host principals for apt1001/apt2001 T331613 [07:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:35] T331613: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613 [07:11:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009282 (https://phabricator.wikimedia.org/T331613) 
(owner: 10Muehlenhoff) [07:27:36] (03CR) 10Muehlenhoff: [C: 03+2] Move nginx/Puppet settings for new apt hosts to the role Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1009282 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [07:35:17] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9610033 (10MoritzMuehlenhoff) [07:37:47] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9610034 (10MoritzMuehlenhoff) [07:40:51] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529#9610044 (10MoritzMuehlenhoff) [07:47:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:05] Amir1 and Urbanecm: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:08:10] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:08:17] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:17:10] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:12] !log installing nftables bugfix updates from bullseye point release [08:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:47] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009387 [08:56:59] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009387 (owner: 10TrainBranchBot) [08:57:27] (03PS1) 10Muehlenhoff: Add repo sync for routinator packages on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009453 (https://phabricator.wikimedia.org/T359502) [08:58:14] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:59:43] (03PS1) 10Effie Mouzeli: mw-mcrouter: Lift namespace quotas for this namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009454 [08:59:51] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009387 (owner: 10TrainBranchBot) [09:00:07] jnuche and dduvall: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T0900). 
[09:00:56] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009387 (owner: 10TrainBranchBot) [09:01:15] morning, running the train in a few minutes [09:04:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:05:17] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:05:37] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009456 (https://phabricator.wikimedia.org/T354439) [09:05:39] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009456 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [09:06:35] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009456 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [09:11:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:12:10] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:39] PROBLEM - Check whether ferm is active by checking the default input chain on mw2311 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:13:41] PROBLEM - Check whether ferm is active by checking the default input chain on mw2351 is CRITICAL: ERROR ferm 
input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:13:55] (03PS1) 10Effie Mouzeli: mcrouter: Add priorityClassName option to the daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009458 [09:14:05] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2031 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:14:30] (03PS2) 10Effie Mouzeli: mcrouter: Add priorityClassName option to the daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009458 [09:15:35] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: enable log to benthos socket [puppet] - 10https://gerrit.wikimedia.org/r/1009293 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [09:16:29] (03PS1) 10Majavah: hieradata: update striker to 2024-03-07-091437-production [puppet] - 10https://gerrit.wikimedia.org/r/1009459 [09:17:48] (03CR) 10Majavah: [C: 03+2] hieradata: update striker to 2024-03-07-091437-production [puppet] - 10https://gerrit.wikimedia.org/r/1009459 (owner: 10Majavah) [09:18:01] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.21 refs T354439 [09:18:06] T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439 [09:20:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T352010)', diff saved to https://phabricator.wikimedia.org/P58616 and previous config saved to /var/cache/conftool/dbconfig/20240307-092029-ladsgroup.json [09:20:34] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:21:24] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9610198 (10MoritzMuehlenhoff) [09:22:14] (03CR) 10Muehlenhoff: [C: 03+2] Add repo 
sync for routinator packages on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009453 (https://phabricator.wikimedia.org/T359502) (owner: 10Muehlenhoff) [09:22:58] (03PS1) 10Majavah: Revert "hieradata: update striker to 2024-03-07-091437-production" [puppet] - 10https://gerrit.wikimedia.org/r/1009328 [09:24:11] (03CR) 10CI reject: [V: 04-1] Revert "hieradata: update striker to 2024-03-07-091437-production" [puppet] - 10https://gerrit.wikimedia.org/r/1009328 (owner: 10Majavah) [09:24:37] (03PS2) 10Majavah: Revert "hieradata: update striker to 2024-03-07-091437-production" [puppet] - 10https://gerrit.wikimedia.org/r/1009328 [09:26:35] (03CR) 10Majavah: [C: 03+2] Revert "hieradata: update striker to 2024-03-07-091437-production" [puppet] - 10https://gerrit.wikimedia.org/r/1009328 (owner: 10Majavah) [09:35:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P58618 and previous config saved to /var/cache/conftool/dbconfig/20240307-093536-ladsgroup.json [09:36:30] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:36:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:39:52] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:39:58] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:39:59] (03CR) 10Effie Mouzeli: "j" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009458 (owner: 10Effie Mouzeli) [09:40:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009388 [09:40:41] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009388 (owner: 10TrainBranchBot) [09:41:59] !log @deploy2002 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [09:42:06] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:43:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw2311 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:43:41] RECOVERY - Check whether ferm is active by checking the default input chain on mw2351 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:44:05] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2031 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:45:32] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:45:39] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:48:38] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:48:44] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:49:51] (03PS1) 10Mvolz: Update Zotero to 2024-02-29-135444-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009464 (https://phabricator.wikimedia.org/T308371) [09:50:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P58619 and previous config saved to /var/cache/conftool/dbconfig/20240307-095043-ladsgroup.json [09:51:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 25%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58620 and previous config saved to /var/cache/conftool/dbconfig/20240307-095108-arnaudb.json [09:51:43] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:51:49] !log @deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [09:52:28] !log root@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [09:52:28] !log root@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [09:52:35] !log root@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [09:52:35] !log root@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [09:53:27] (03CR) 10Btullis: [C: 03+2] Increase the frequency of the matomo indexing job to hourly [puppet] - 10https://gerrit.wikimedia.org/r/1009314 (https://phabricator.wikimedia.org/T319013) (owner: 10Btullis) [09:54:00] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:54:06] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:04:16] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009388 (owner: 10TrainBranchBot) [10:05:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T352010)', diff saved to https://phabricator.wikimedia.org/P58621 and previous config saved to /var/cache/conftool/dbconfig/20240307-100549-ladsgroup.json [10:05:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [10:05:56] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:06:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [10:06:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T352010)', diff saved to https://phabricator.wikimedia.org/P58622 and previous config saved to /var/cache/conftool/dbconfig/20240307-100611-ladsgroup.json [10:06:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 
'db2131 (re)pooling @ 50%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58623 and previous config saved to /var/cache/conftool/dbconfig/20240307-100620-arnaudb.json [10:08:55] jouncebot: nowandnext [10:08:55] For the next 0 hour(s) and 51 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T0900) [10:08:55] In 0 hour(s) and 51 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1100) [10:08:55] In 0 hour(s) and 51 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1100) [10:09:10] (03PS2) 10Samtar: [BETA CLUSTER] enable $wgCodeMirrorV6 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008937 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [10:09:23] (03CR) 10Muehlenhoff: [C: 03+2] thanos backend: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1004134 (owner: 10Muehlenhoff) [10:09:47] (03PS1) 10Effie Mouzeli: sre.switchdc.mediawiki: add helm env variables to commands [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) [10:10:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool to upgrade T358642', diff saved to https://phabricator.wikimedia.org/P58624 and previous config saved to /var/cache/conftool/dbconfig/20240307-101004-arnaudb.json [10:10:10] T358642: Upgrade x1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358642 [10:11:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1220.eqiad.wmnet with reason: T358642 [10:12:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1220.eqiad.wmnet with reason: T358642 [10:14:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1220.eqiad.wmnet with OS bookworm [10:15:39] (03CR) 10Volans: "question 
inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [10:16:10] (03CR) 10Klausman: [C: 03+1] slo_template: update SLO window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009299 (owner: 10Elukey) [10:17:38] (03PS1) 10Majavah: hieradata: update striker to 2024-03-07-101638-production [puppet] - 10https://gerrit.wikimedia.org/r/1009469 [10:18:56] (03CR) 10Majavah: [C: 03+2] hieradata: update striker to 2024-03-07-101638-production [puppet] - 10https://gerrit.wikimedia.org/r/1009469 (owner: 10Majavah) [10:21:01] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:21:08] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:21:17] (03PS1) 10Muehlenhoff: wikidough: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1009471 [10:21:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 75%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58625 and previous config saved to /var/cache/conftool/dbconfig/20240307-102125-arnaudb.json [10:21:34] !log updated spicerack on cumin[12]002 to v8.4.1 [10:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:56] !log restarting Jenkins CI to update plugins [10:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:32] (03CR) 10Clément Goubert: [C: 03+1] mw-mcrouter: Lift namespace quotas for this namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009454 (owner: 10Effie Mouzeli) [10:23:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008937 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [10:23:53] (03Merged) 10jenkins-bot: [BETA CLUSTER] enable $wgCodeMirrorV6 on all wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1008937 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [10:24:24] (03PS1) 10Sg912: SLO queries for AQS 2.0 geo analytics Bug:T358751 [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009472 [10:26:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1220.eqiad.wmnet with reason: host reimage [10:27:35] (03CR) 10Sg912: "Added some queries to create AQS 2.0 service geo analytics SLO dahboard" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009472 (owner: 10Sg912) [10:28:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1220.eqiad.wmnet with reason: host reimage [10:29:18] (03CR) 10JMeybohm: mw-mcrouter: Lift namespace quotas for this namespace (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009454 (owner: 10Effie Mouzeli) [10:29:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009471 (owner: 10Muehlenhoff) [10:30:21] (03PS2) 10Sg912: SLO queries for AQS 2.0 geo analytics Bug: T358751 [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009472 (https://phabricator.wikimedia.org/T358751) [10:30:21] jouncebot: nowandnext [10:30:21] For the next 0 hour(s) and 29 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T0900) [10:30:22] In 0 hour(s) and 29 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1100) [10:30:22] In 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1100) [10:30:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:58] (03PS3) 10Sg912: SLO queries for AQS 2.0 geo analytics Bug: T358751 Change-Id: I6f97fbfdb013787c1cfca590c6b089bdc2cb0198 [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009472 (https://phabricator.wikimedia.org/T358751) [10:33:06] (03PS2) 10Urbanecm: wikimaniawiki: Update logos to 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008830 (https://phabricator.wikimedia.org/T358379) [10:33:08] (03PS1) 10Muehlenhoff: dumps::generation::server::rsync_firewall: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1009479 [10:33:32] (03PS3) 10Urbanecm: wikimaniawiki: Update logos to 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008830 (https://phabricator.wikimedia.org/T358379) [10:33:41] (03CR) 10Urbanecm: wikimaniawiki: Update logos to 2024 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008830 (https://phabricator.wikimedia.org/T358379) (owner: 10Urbanecm) [10:34:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009479 (owner: 10Muehlenhoff) [10:36:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 100%: Post upgrade', diff saved to https://phabricator.wikimedia.org/P58626 and previous config saved to /var/cache/conftool/dbconfig/20240307-103630-arnaudb.json [10:37:54] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [10:44:16] rolling back train due to issue with parsoid [10:44:38] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009482 (https://phabricator.wikimedia.org/T354439) [10:44:40] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009482 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [10:45:04] !log @deploy2002 
helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:45:11] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:45:20] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009482 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [10:48:10] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:48:17] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:48:33] (03CR) 10JMeybohm: [C: 04-1] "[07.03.24 11:22] I'm not sure it makes sense to invent a priorityclass for this tbh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009458 (owner: 10Effie Mouzeli) [10:49:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1220.eqiad.wmnet with OS bookworm [10:52:53] (03PS1) 10Fabfur: benthos/haproxy: fix missing header parsing [puppet] - 10https://gerrit.wikimedia.org/r/1009485 (https://phabricator.wikimedia.org/T358109) [10:56:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 1%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58627 and previous config saved to /var/cache/conftool/dbconfig/20240307-105629-arnaudb.json [10:57:15] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008071 (owner: 10PipelineBot) [10:58:09] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008071 (owner: 10PipelineBot) [11:00:05] mvolz: That opportune time for a Services – Citoid / Zotero deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1100). 
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1100) [11:01:08] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1598/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009485 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [11:02:27] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:03:02] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [11:03:29] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:03:59] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:04:29] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:06:08] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:07:04] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:10:49] (03CR) 10Mvolz: [C: 03+2] Update Zotero to 2024-02-29-135444-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009464 (https://phabricator.wikimedia.org/T308371) (owner: 10Mvolz) [11:11:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 2%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58628 and previous config saved to /var/cache/conftool/dbconfig/20240307-111134-arnaudb.json [11:11:41] (03Merged) 10jenkins-bot: Update Zotero to 2024-02-29-135444-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009464 (https://phabricator.wikimedia.org/T308371) (owner: 10Mvolz) [11:12:00] (03PS4) 10Majavah: openstack: neutron: first attempt of 
installing ovs-agent [puppet] - 10https://gerrit.wikimedia.org/r/1008463 (https://phabricator.wikimedia.org/T326373) [11:12:29] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [11:12:56] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:13:10] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.21 refs T354439 [11:13:18] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:13:28] T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439 [11:13:43] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:15:17] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:15:45] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:16:55] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:16:58] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:18:01] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:18:01] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:18:15] 06SRE, 10SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9610607 (10Volans) But is this still task still valid? The alert hosts were migrated to bookworm this week and puppet is running fine there. 
[11:23:24] !log jnuche@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.21 refs T354439 (duration: 10m 13s) [11:23:28] T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439 [11:25:13] 06SRE, 06Infrastructure-Foundations, 06cloud-services-team: Track source of packages in reprepro - https://phabricator.wikimedia.org/T105385#9610637 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This task is quite old and these days we've established the scheme of using separate compone... [11:26:32] 06SRE, 06Infrastructure-Foundations, 10Packaging: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758#9610647 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff We have established that with the system of separate component we introduced a few y... [11:26:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 5%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58629 and previous config saved to /var/cache/conftool/dbconfig/20240307-112640-arnaudb.json [11:41:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver2001.codfw.wmnet [11:41:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 10%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58630 and previous config saved to /var/cache/conftool/dbconfig/20240307-114145-arnaudb.json [11:43:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2001.codfw.wmnet [11:47:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:48:01] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1007 is OK: OK ferm input 
default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:48:01] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:52:08] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9610748 (10MoritzMuehlenhoff) [11:56:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 15%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58631 and previous config saved to /var/cache/conftool/dbconfig/20240307-115650-arnaudb.json [11:59:42] (03PS2) 10Effie Mouzeli: mw-mcrouter: Lift namespace quotas for this namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009454 [11:59:46] (03CR) 10Effie Mouzeli: mw-mcrouter: Lift namespace quotas for this namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009454 (owner: 10Effie Mouzeli) [12:00:04] (03CR) 10Effie Mouzeli: mw-mcrouter: Lift namespace quotas for this namespace (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009454 (owner: 10Effie Mouzeli) [12:00:08] (03PS1) 10Muehlenhoff: Add a component/tomcat9 for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009499 (https://phabricator.wikimedia.org/T359333) [12:00:42] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add OVS codfw1dev test prefixes - taavi@cumin1002" [12:01:29] (03CR) 10Majavah: [C: 03+2] Add some new networks for WMCS OVS testing [puppet] - 10https://gerrit.wikimedia.org/r/1007901 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [12:02:59] (03PS1) 10Jaime Nuche: REST: allow lower-case method names [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009500 (https://phabricator.wikimedia.org/T359306) [12:03:44] (03CR) 10Majavah: [C: 03+2] openvswitch: use 
package resource for ordering [puppet] - 10https://gerrit.wikimedia.org/r/1009495 (owner: 10Majavah) [12:03:50] (03CR) 10Majavah: [C: 03+2] P:openstack: neutron: fix VLAN names on OVS test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1009496 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [12:04:12] (03PS1) 10Slyngshede: D:prometheus::class_config allow disabling of select hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) [12:04:30] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add OVS codfw1dev test prefixes - taavi@cumin1002" [12:08:39] (03CR) 10Effie Mouzeli: "I just used "high-priority" for the fixture, I was planning to use any of the ones we already have." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009458 (owner: 10Effie Mouzeli) [12:08:52] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1602/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [12:09:50] PROBLEM - Check whether ferm is active by checking the default input chain on mw2369 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:10:12] (03PS2) 10Slyngshede: D:prometheus::class_config allow disabling of select hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) [12:11:55] (SystemdUnitFailed) firing: (4) ferm.service on mw2369:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 20%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58632 and previous config saved to /var/cache/conftool/dbconfig/20240307-121155-arnaudb.json [12:13:27] (03PS2) 10Clément Goubert: mw-on-k8s: Add MediaWikiHTTPErrorRatio alert [alerts] - 10https://gerrit.wikimedia.org/r/1009493 [12:13:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1603/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [12:14:27] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: Lift namespace quotas for this namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009454 (owner: 10Effie Mouzeli) [12:15:24] (03PS1) 10Muehlenhoff: tomcat: When on bookworm, install from component [puppet] - 10https://gerrit.wikimedia.org/r/1009506 (https://phabricator.wikimedia.org/T359333) [12:16:45] (03CR) 10CI reject: [V: 04-1] tomcat: When on bookworm, install from component [puppet] - 10https://gerrit.wikimedia.org/r/1009506 (https://phabricator.wikimedia.org/T359333) (owner: 10Muehlenhoff) [12:16:55] (SystemdUnitFailed) firing: (6) ferm.service on mw1376:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:17:02] (03Merged) 10jenkins-bot: mw-mcrouter: Lift namespace quotas for this namespace [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1009454 (owner: 10Effie Mouzeli) [12:17:09] (03PS3) 10Clément Goubert: mw-on-k8s: Add MediaWikiHTTPErrorRatio alert [alerts] - 10https://gerrit.wikimedia.org/r/1009493 (https://phabricator.wikimedia.org/T359509) [12:17:10] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:46] (03PS1) 10Majavah: P:opesntack: nova: convert cloudvirt2001-dev to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1009511 [12:20:15] (03PS2) 10Majavah: P:opesntack: nova: convert cloudvirt2001-dev to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1009511 (https://phabricator.wikimedia.org/T358761) [12:21:55] (SystemdUnitFailed) firing: (7) ferm.service on mw1376:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:22:40] (03PS3) 10Majavah: P:opesntack: nova: convert cloudvirt2001-dev to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1009511 (https://phabricator.wikimedia.org/T358761) [12:23:52] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1009511 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [12:24:33] (03PS2) 10Muehlenhoff: tomcat: When on bookworm, install from component [puppet] - 10https://gerrit.wikimedia.org/r/1009506 (https://phabricator.wikimedia.org/T359333) [12:25:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:25:56] (03CR) 10CI reject: [V: 04-1] tomcat: When on bookworm, install from component [puppet] - 10https://gerrit.wikimedia.org/r/1009506 (https://phabricator.wikimedia.org/T359333) (owner: 10Muehlenhoff) [12:27:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 25%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58633 and previous config saved to /var/cache/conftool/dbconfig/20240307-122701-arnaudb.json [12:27:40] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet2007-dev is CRITICAL: CRITICAL: no netns defined? https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:28:33] jelto@cumin1002 jelto: The switchover backup on on gitlab2002 is complete [12:28:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw2428 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:29:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T352010)', diff saved to https://phabricator.wikimedia.org/P58634 and previous config saved to /var/cache/conftool/dbconfig/20240307-122949-ladsgroup.json [12:29:54] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:30:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:31:55] (SystemdUnitFailed) firing: (8) ferm.service on mw1376:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status 
- https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:33:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:34:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [12:36:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:36:37] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009499 (https://phabricator.wikimedia.org/T359333) (owner: 10Muehlenhoff) [12:36:55] (SystemdUnitFailed) firing: (4) ferm.service on mw2368:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:24] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:37:29] (03CR) 10JMeybohm: [C: 03+1] "Good point" [cookbooks] - 10https://gerrit.wikimedia.org/r/1008583 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [12:38:03] (03PS3) 10Slyngshede: D:prometheus::class_config allow disabling of select hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) [12:38:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:38:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:39:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:39:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:39:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:39:23] (03CR) 10CI reject: [V: 04-1] D:prometheus::class_config allow disabling of select hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [12:39:50] RECOVERY - Check whether ferm is active by checking the default input chain on mw2369 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:41:50] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1606/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [12:41:59] (03PS1) 10Jelto: gitlab: fix irc log for backup complete message [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 [12:42:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 30%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58635 and previous config saved to /var/cache/conftool/dbconfig/20240307-124206-arnaudb.json [12:44:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P58636 and previous config saved to /var/cache/conftool/dbconfig/20240307-124456-ladsgroup.json [12:50:49] (03CR) 10Slyngshede: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [12:54:51] (03PS4) 10Majavah: P:openstack: nova: convert cloudvirt2001-dev to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1009511 (https://phabricator.wikimedia.org/T358761) [12:56:03] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1009511 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [12:56:12] jouncebot: nowandnext [12:56:12] No deployments scheduled for the next 0 hour(s) and 3 minute(s) 
[12:56:12] In 0 hour(s) and 3 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1300) [12:57:11] (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Raise replicas for 60% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [12:57:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 40%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58637 and previous config saved to /var/cache/conftool/dbconfig/20240307-125711-arnaudb.json [12:57:37] (03CR) 10Slyngshede: [V: 03+1] "I'm not overly excited about this solution. Would it perhaps be better to define a new openldap::rw2 role, but named better." [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [12:58:11] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 60% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [12:58:46] RECOVERY - Check whether ferm is active by checking the default input chain on mw2428 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:58:49] (03CR) 10Majavah: D:prometheus::class_config allow disabling of select hosts. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [12:59:43] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [12:59:59] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [13:00:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P58638 and previous config saved to /var/cache/conftool/dbconfig/20240307-130002-ladsgroup.json [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1300) [13:00:07] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [13:00:24] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [13:00:46] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:01:05] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [13:01:11] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:01:35] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [13:01:51] !log trafficserver: move 60% of traffic to mw on k8s - T357508 [13:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:55] (SystemdUnitFailed) resolved: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:01:55] T357508: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508 [13:01:56] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 60% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1009269 
(https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [13:10:06] !log jiji@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:11:29] !log jiji@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:12:08] !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:12:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 50%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58639 and previous config saved to /var/cache/conftool/dbconfig/20240307-131216-arnaudb.json [13:12:42] !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:13:11] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:13:34] (03CR) 10Jgiannelos: [C: 03+1] REST: allow lower-case method names [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009500 (https://phabricator.wikimedia.org/T359306) (owner: 10Jaime Nuche) [13:14:26] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:14:41] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:15:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T352010)', diff saved to https://phabricator.wikimedia.org/P58640 and previous config saved to /var/cache/conftool/dbconfig/20240307-131509-ladsgroup.json [13:15:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance [13:15:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance [13:15:17] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[13:15:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T352010)', diff saved to https://phabricator.wikimedia.org/P58641 and previous config saved to /var/cache/conftool/dbconfig/20240307-131520-ladsgroup.json [13:15:32] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:22:35] (03CR) 10Kamila Součková: [C: 03+1] "LGTM, but I'd feel more comfortable if someone else looked too" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:24:06] 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, 10Release-Engineering-Team (Now this 🫠): Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402#9611194 (10Clement_Goubert) [13:24:22] 06SRE, 10MW-on-K8s, 06Traffic, 06serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9611195 (10Clement_Goubert) [13:24:50] 06SRE, 10MW-on-K8s, 06Traffic, 06serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9611196 (10Clement_Goubert) [13:25:10] 06SRE, 10Data Pipelines, 06Data-Engineering, 06Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9611197 (10Isaac) Very excited to see this gaining some traction (thanks @mpopov and @dr0ptp4kt)! Commenting on the analytics side of things (I don't know e... 
[13:25:58] (03PS1) 10Btullis: Switch AQS edit-analytics and editor-analytics to use the Feb snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009389 [13:26:25] 06SRE, 10MW-on-K8s, 06Release-Engineering-Team, 06Traffic, and 2 others: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508#9611192 (10Clement_Goubert) 05In progress→03Resolved [13:27:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2097.codfw.wmnet with reason: Maintenance [13:27:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2097.codfw.wmnet with reason: Maintenance [13:27:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 75%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58642 and previous config saved to /var/cache/conftool/dbconfig/20240307-132721-arnaudb.json [13:27:26] !log jnuche@deploy2002 Started deploy [zuul/deploy@efce3ee]: test deployment for new host [13:27:41] !log jnuche@deploy2002 Finished deploy [zuul/deploy@efce3ee]: test deployment for new host (duration: 00m 15s) [13:28:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: Maintenance [13:29:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: Maintenance [13:29:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2129.codfw.wmnet with reason: Maintenance [13:29:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2129.codfw.wmnet with reason: Maintenance [13:31:02] PROBLEM - Kafka broker TLS certificate validity on kafka-logging1003 is CRITICAL: SSL CRITICAL - Certificate kafka-logging1003.eqiad.wmnet valid until 2024-03-14 13:31:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate 
[13:32:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1183.eqiad.wmnet with reason: Maintenance [13:32:43] (03CR) 10Jforrester: [C: 03+1] "Good to deploy from my POV." [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009500 (https://phabricator.wikimedia.org/T359306) (owner: 10Jaime Nuche) [13:32:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1183.eqiad.wmnet with reason: Maintenance [13:33:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2113.codfw.wmnet with reason: Maintenance [13:33:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2113.codfw.wmnet with reason: Maintenance [13:33:32] (03CR) 10Joal: [C: 03+1] "LGTM! Thanks ben" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009389 (owner: 10Btullis) [13:34:08] (03PS5) 10Anzx: kowikisource: add NamespaceAliases for User and Usertalk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008989 (https://phabricator.wikimedia.org/T358508) [13:36:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1238.eqiad.wmnet with reason: Maintenance [13:36:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1238.eqiad.wmnet with reason: Maintenance [13:36:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2179.codfw.wmnet with reason: Maintenance [13:36:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2179.codfw.wmnet with reason: Maintenance [13:36:35] (03PS3) 10Anzx: itwikivoyage: update wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008988 (https://phabricator.wikimedia.org/T358456) [13:36:47] (03CR) 10Jaime Nuche: "Thank you Yiannis and James!" 
[core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009500 (https://phabricator.wikimedia.org/T359306) (owner: 10Jaime Nuche) [13:37:15] deploying backport to unblock train in a couple minutes [13:39:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jnuche@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009500 (https://phabricator.wikimedia.org/T359306) (owner: 10Jaime Nuche) [13:40:37] (03PS4) 10Slyngshede: D:prometheus::class_config allow disabling of select hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) [13:42:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 100%: Reimaging + upgrade done', diff saved to https://phabricator.wikimedia.org/P58643 and previous config saved to /var/cache/conftool/dbconfig/20240307-134226-arnaudb.json [13:44:44] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1608/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [13:45:20] (03CR) 10Elukey: [V: 03+2 C: 03+2] slo_template: update SLO window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009299 (owner: 10Elukey) [13:49:16] (03PS5) 10Slyngshede: P:prometheus::ops Remove new LDAP hosts from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) [13:49:55] (03CR) 10Slyngshede: P:prometheus::ops Remove new LDAP hosts from Prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [13:51:37] (03CR) 10Brouberol: [C: 03+1] "Looks legit :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009389 (owner: 10Btullis) [13:52:38] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1609/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [13:56:46] (03CR) 10Btullis: [C: 03+2] Switch AQS edit-analytics and editor-analytics to use the Feb snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009389 (owner: 10Btullis) [13:57:35] (03Merged) 10jenkins-bot: Switch AQS edit-analytics and editor-analytics to use the Feb snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009389 (owner: 10Btullis) [13:58:25] jnuche: if https://grafana.wikimedia.org/goto/KZMbFoASz?orgId=1 goes to 0 after deployment, that means the patch is good :p [13:58:53] (03Merged) 10jenkins-bot: REST: allow lower-case method names [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009500 (https://phabricator.wikimedia.org/T359306) (owner: 10Jaime Nuche) [13:58:55] ack, thx [13:59:41] !log jnuche@deploy2002 Started scap: Backport for [[gerrit:1009500|REST: allow lower-case method names]] [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1400). [14:00:05] anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:36] i can deploy today [14:00:39] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [14:00:42] except jnuche is apparently deploying sth now [14:00:46] so i'll wait for their green light :) [14:01:10] yeah, I'm backporting a blocker fix [14:01:17] !log jnuche@deploy2002 jnuche: Backport for [[gerrit:1009500|REST: allow lower-case method names]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:01:23] if all goes well, I'll roll forwards the train after [14:01:24] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [14:01:32] sorry for the disruption :( [14:01:48] !log jnuche@deploy2002 jnuche: Continuing with sync [14:01:49] jnuche: do you have an idea how long that'd take (aka how long i'll have from the window)? [14:02:53] urbanecm: I hope I'll be done in ~20m max, so you should still have plenty of time [14:02:59] sounds good [14:06:06] (03PS2) 10Anzx: knwiki: Add importupload userright to administrator usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009333 (https://phabricator.wikimedia.org/T359545) [14:06:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:06:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:07:26] PROBLEM - Check whether ferm is active by checking the default input chain on mw1356 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:07:44] (03PS3) 10Effie Mouzeli: [DNM] mcrouter: Add priorityClassName option to the daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009458 [14:08:20] (03PS1) 10Jforrester: Undeploy the 'similar-users' service, unused for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009538 
(https://phabricator.wikimedia.org/T345274) [14:08:24] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:08:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:10:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:11:09] !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [14:11:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1025.eqiad.wmnet with OS bullseye [14:11:22] !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:1009500|REST: allow lower-case method names]] (duration: 11m 40s) [14:11:22] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:11:36] (03PS1) 10Volans: DNS-related cookbooks: adapt for conftool state [cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) [14:11:41] (03CR) 10MVernon: "This does seem like a sensible refinement, though I'm not sure how useful it is as an alerting metric (should it maybe be a warning?)" [alerts] - 10https://gerrit.wikimedia.org/r/1008590 (owner: 10Tim Starling) [14:12:10] (03CR) 10Ssingh: [C: 03+1] "And thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1009471 (owner: 10Muehlenhoff) [14:12:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:34] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync hiera as instructed by failed reimage cookbook - bking@cumin2002 - T358727" [14:13:38] T358727: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727 [14:14:26] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync hiera as instructed by failed reimage cookbook - bking@cumin2002 - T358727" [14:14:58] urbanecm: the backport didn't work, I can't continue rolling the train [14:15:03] please go ahead with the scheduled backports [14:15:07] okay, will do [14:15:14] anzx: hi, are you around? [14:15:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [14:15:50] urbanecm: yes [14:15:53] let's go then! [14:15:55] jnuche: so the patch didn't do the trick? [14:16:11] nemo-yiannis: yeah, unfortunately [14:16:22] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:16:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [14:17:02] anzx: can you clarify why does https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1008988 remove NS 106?
[14:17:05] (03CR) 10Muehlenhoff: [C: 03+2] Add a component/tomcat9 for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009499 (https://phabricator.wikimedia.org/T359333) (owner: 10Muehlenhoff) [14:17:12] i do not see that agreed on the task so far, but i might be missing sth [14:17:32] urbanecm: that namespace was already removed [14:17:36] aha [14:17:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw1356 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:17:49] (03PS4) 10Urbanecm: itwikivoyage: update wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008988 (https://phabricator.wikimedia.org/T358456) (owner: 10Anzx) [14:18:05] (03CR) 10Urbanecm: [C: 03+2] itwikivoyage: update wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008988 (https://phabricator.wikimedia.org/T358456) (owner: 10Anzx) [14:18:21] (03PS6) 10Urbanecm: kowikisource: add NamespaceAliases for User and Usertalk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008989 (https://phabricator.wikimedia.org/T358508) (owner: 10Anzx) [14:18:30] (03CR) 10Urbanecm: [C: 03+2] kowikisource: add NamespaceAliases for User and Usertalk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008989 (https://phabricator.wikimedia.org/T358508) (owner: 10Anzx) [14:18:40] (03PS7) 10Urbanecm: kowikisource: add NamespaceAliases for User and Usertalk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008989 (https://phabricator.wikimedia.org/T358508) (owner: 10Anzx) [14:18:48] (03CR) 10Urbanecm: kowikisource: add NamespaceAliases for User and Usertalk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008989 (https://phabricator.wikimedia.org/T358508) (owner: 10Anzx) [14:18:56] (03CR) 10Urbanecm: [C: 03+2] kowikisource: add NamespaceAliases for User and Usertalk namespaces [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1008989 (https://phabricator.wikimedia.org/T358508) (owner: 10Anzx) [14:19:04] (03PS2) 10Bking: site.pp: Move wdqs1025 into test role [puppet] - 10https://gerrit.wikimedia.org/r/1009371 (https://phabricator.wikimedia.org/T358727) [14:19:20] (03CR) 10JMeybohm: [C: 04-1] "hmm..CI fails to render:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009373 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [14:19:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008988 (https://phabricator.wikimedia.org/T358456) (owner: 10Anzx) [14:20:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008989 (https://phabricator.wikimedia.org/T358508) (owner: 10Anzx) [14:20:10] wikibugs has some lag [14:20:20] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009371 (https://phabricator.wikimedia.org/T358727) (owner: 10Bking) [14:20:48] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [14:20:51] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:20:52] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1008988|itwikivoyage: update wgNamespacesToBeSearchedDefault (T358456)]], [[gerrit:1008989|kowikisource: add NamespaceAliases for User and Usertalk namespaces (T358508)]] [14:20:58] T358456: Italian Wikivoyage $wgNamespacesToBeSearchedDefault change request - https://phabricator.wikimedia.org/T358456 [14:20:58] T358508: Add namespace shortcuts in kowikisource - https://phabricator.wikimedia.org/T358508 [14:21:00] (03Merged) 10jenkins-bot: itwikivoyage: update wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008988 (https://phabricator.wikimedia.org/T358456) (owner: 
10Anzx) [14:21:08] (03Merged) 10jenkins-bot: kowikisource: add NamespaceAliases for User and Usertalk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008989 (https://phabricator.wikimedia.org/T358508) (owner: 10Anzx) [14:21:16] (03PS3) 10Urbanecm: knwiki: Add importupload userright to administrator usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009333 (https://phabricator.wikimedia.org/T359545) (owner: 10Anzx) [14:21:24] (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1040-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005596 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans) [14:21:32] (03CR) 10Urbanecm: [C: 03+2] knwiki: Add importupload userright to administrator usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009333 (https://phabricator.wikimedia.org/T359545) (owner: 10Anzx) [14:21:40] (03PS4) 10Urbanecm: wikimaniawiki: Update logos to 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008830 (https://phabricator.wikimedia.org/T358379) [14:21:48] (03CR) 10Urbanecm: [C: 03+2] wikimaniawiki: Update logos to 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008830 (https://phabricator.wikimedia.org/T358379) (owner: 10Urbanecm) [14:21:56] (03Merged) 10jenkins-bot: knwiki: Add importupload userright to administrator usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009333 (https://phabricator.wikimedia.org/T359545) (owner: 10Anzx) [14:22:04] (03Merged) 10jenkins-bot: wikimaniawiki: Update logos to 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008830 (https://phabricator.wikimedia.org/T358379) (owner: 10Urbanecm) [14:22:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:22:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.314 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:30] !log urbanecm@deploy2002 urbanecm and anzx: Backport for [[gerrit:1008988|itwikivoyage: update wgNamespacesToBeSearchedDefault (T358456)]], [[gerrit:1008989|kowikisource: add NamespaceAliases for User and Usertalk namespaces (T358508)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:22:42] anzx: your first two patches are at mwdebug. can you take a look and test? [14:22:49] urbanecm: checking [14:23:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [14:23:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [14:23:56] !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [14:24:05] (03PS1) 10Muehlenhoff: Don't run spec test on buster, but instead of Bullseye and Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009541 [14:24:17] !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [14:24:26] !log btullis@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [14:24:49] !log btullis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [14:24:51] urbanecm: both looks correct [14:24:57] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [14:25:02] awesome, proceeding [14:25:03] !log urbanecm@deploy2002 urbanecm and anzx: 
Continuing with sync [14:25:11] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [14:25:18] (03PS54) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [14:25:21] !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [14:25:40] !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [14:25:47] (03CR) 10Giuseppe Lavagetto: "LGTM but please either copy in a lvm.conf file, or make the sed regexes more secure by using anchors to ensure we only sub full lines." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:25:59] !log btullis@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [14:26:16] !log btullis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [14:28:02] (03CR) 10Ssingh: "Looks good! Just one question about the confd error message inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: 10Volans) [14:28:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:28:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:28:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2107.codfw.wmnet with reason: Maintenance [14:29:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2107.codfw.wmnet with reason: Maintenance [14:29:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 47.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:30:40] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: Maintenance [14:31:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: Maintenance [14:31:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2121.codfw.wmnet with reason: Maintenance [14:31:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2121.codfw.wmnet with reason: Maintenance [14:32:11] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on 
restbase1040.eqiad.wmnet with reason: Bootstrapping — T354560 [14:32:16] T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560 [14:32:25] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1040.eqiad.wmnet with reason: Bootstrapping — T354560 [14:33:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1167.eqiad.wmnet with reason: Maintenance [14:33:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: Maintenance [14:33:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:33:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:33:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T357189)', diff saved to https://phabricator.wikimedia.org/P58644 and previous config saved to /var/cache/conftool/dbconfig/20240307-143336-arnaudb.json [14:33:41] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:34:05] (03PS1) 10Clément Goubert: mw-parsoid: Add 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009543 [14:34:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 47.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:34:16] (03CR) 10Volans: DNS-related cookbooks: adapt for conftool state (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 
(https://phabricator.wikimedia.org/T347054) (owner: 10Volans) [14:34:27] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1008988|itwikivoyage: update wgNamespacesToBeSearchedDefault (T358456)]], [[gerrit:1008989|kowikisource: add NamespaceAliases for User and Usertalk namespaces (T358508)]] (duration: 13m 34s) [14:35:27] urbanecm: Any chance I can still slip into this deployment window with a patch? [14:35:32] T358456: Italian Wikivoyage $wgNamespacesToBeSearchedDefault change request - https://phabricator.wikimedia.org/T358456 [14:35:32] T358508: Add namespace shortcuts in kowikisource - https://phabricator.wikimedia.org/T358508 [14:36:17] (03PS1) 10DLynch: editcheckreferenceurl: Validate URL returned from Citoid, not input [extensions/Citoid] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009334 (https://phabricator.wikimedia.org/T359527) [14:38:51] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1008830|wikimaniawiki: Update logos to 2024 (T358379)]], [[gerrit:1009333|knwiki: Add importupload userright to administrator usergroup (T359545)]] [14:38:55] !log import tomcat9 9.0.43-2~deb11u9+wmf12u1 to apt.wikimedia.org T359333 [14:38:56] T358379: Change Wikimania wiki logo - https://phabricator.wikimedia.org/T358379 [14:38:56] T359545: Add fileimport userright to administrator usergroup on Kannada Wikipedia - https://phabricator.wikimedia.org/T359545 [14:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:00] T359333: Build Tomcat 9 for Bookworm - https://phabricator.wikimedia.org/T359333 [14:39:49] Kemayo: feel free to [14:40:03] I can do the deploy [14:40:08] (03CR) 10Ssingh: [C: 03+1] DNS-related cookbooks: adapt for conftool state (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: 10Volans) [14:40:22] !log urbanecm@deploy2002 urbanecm and anzx: Backport for [[gerrit:1008830|wikimaniawiki: Update logos to 2024
(T358379)]], [[gerrit:1009333|knwiki: Add importupload userright to administrator usergroup (T359545)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:40:31] Checking [14:40:32] anzx: can you check the second one please? [14:40:34] thanks [14:40:35] (03PS1) 10Jaime Nuche: REST: ignore request body on GET requests [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009544 (https://phabricator.wikimedia.org/T359509) [14:40:50] Amir1: Thanks! I've added it to the deployments page. [14:40:52] Amir1: fyi just finishing backports rn. i can ping you once done if needed? [14:40:58] urbanecm: sure [14:41:06] (03PS2) 10MVernon: Add new ceph container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) [14:41:17] urbanecm: looks good [14:41:28] Amir1: will do. if it's a mw backport, feel free to +2 now, i will need less than "one CI" of time :D [14:41:46] :D [14:41:49] !log urbanecm@deploy2002 urbanecm and anzx: Continuing with sync [14:42:02] (03CR) 10MVernon: "Gone with a full-line substitution."
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:42:33] (03CR) 10Ladsgroup: [C: 03+2] editcheckreferenceurl: Validate URL returned from Citoid, not input [extensions/Citoid] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009334 (https://phabricator.wikimedia.org/T359527) (owner: 10DLynch) [14:46:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [14:46:30] PROBLEM - Check whether ferm is active by checking the default input chain on mw1425 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:46:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [14:47:00] (03CR) 10Volans: DNS-related cookbooks: adapt for conftool state (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: 10Volans) [14:47:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:47:48] PROBLEM - Check whether ferm is active by checking the default input chain on mw2435 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:49:14] (03Merged) 10jenkins-bot: editcheckreferenceurl: Validate URL returned from Citoid, not input [extensions/Citoid] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009334 
(https://phabricator.wikimedia.org/T359527) (owner: 10DLynch) [14:51:05] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1008830|wikimaniawiki: Update logos to 2024 (T358379)]], [[gerrit:1009333|knwiki: Add importupload userright to administrator usergroup (T359545)]] (duration: 12m 14s) [14:51:10] T358379: Change Wikimania wiki logo - https://phabricator.wikimedia.org/T358379 [14:51:10] T359545: Add fileimport userright to administrator usergroup on Kannada Wikipedia - https://phabricator.wikimedia.org/T359545 [14:51:37] (03PS1) 10Elukey: role::ml_k8s::staging::worker: add Dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) [14:51:38] and done [14:51:38] urbanecm: Thankyou [14:51:40] Amir1: ^^ [14:51:41] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:1009334|editcheckreferenceurl: Validate URL returned from Citoid, not input (T359527)]] [14:51:45] T359527: Automatic reference to shortened URL "youtu.be" gives "unreliable site" warning, doesn't let you insert reference - https://phabricator.wikimedia.org/T359527 [14:51:50] thanks! 
[14:52:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:53:12] !log ladsgroup@deploy2002 kemayo and ladsgroup: Backport for [[gerrit:1009334|editcheckreferenceurl: Validate URL returned from Citoid, not input (T359527)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:53:25] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1610/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [14:53:38] Kemayo: it's live in mwdebug [14:53:55] One second to test [14:55:09] (03CR) 10Muehlenhoff: [C: 03+2] Don't run spec test on buster, but instead of Bullseye and Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009541 (owner: 10Muehlenhoff) [14:55:16] Amir1: Okay, looks good! 
[14:55:21] !log ladsgroup@deploy2002 kemayo and ladsgroup: Continuing with sync [14:55:30] awesome, pushing forward [14:56:25] (03CR) 10Muehlenhoff: [C: 03+2] wikidough: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1009471 (owner: 10Muehlenhoff) [14:56:33] !log repool cp4037 for very short time to process and collect logs from HAProxy/Benthos (T358109) [14:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:37] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109 [14:57:37] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov2005'] [14:57:38] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [14:57:38] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov2006'] [14:57:41] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9611785 (10Jhancock.wm) [14:57:58] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['dbprov2005'] [14:58:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['dbprov2006'] [14:58:15] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [14:58:16] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov2005'] [14:58:20] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov2006'] [14:58:55] (03CR) 10Jforrester: [C: 04-1] "Needs SRE/etc. changes first." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1009538 (https://phabricator.wikimedia.org/T345274) (owner: 10Jforrester) [15:00:38] (03PS3) 10Muehlenhoff: tomcat: When on bookworm, install from component [puppet] - 10https://gerrit.wikimedia.org/r/1009506 (https://phabricator.wikimedia.org/T359333) [15:02:16] (03CR) 10CI reject: [V: 04-1] tomcat: When on bookworm, install from component [puppet] - 10https://gerrit.wikimedia.org/r/1009506 (https://phabricator.wikimedia.org/T359333) (owner: 10Muehlenhoff) [15:02:27] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:21] (03CR) 10Elukey: [V: 03+1] "As far as I can see, the supernode is reachable from the ml-staging2001 node, and there is no special config on the supernode to accept an" [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [15:04:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbprov2005'] [15:04:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbprov2006'] [15:05:05] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:1009334|editcheckreferenceurl: Validate URL returned from Citoid, not input (T359527)]] (duration: 13m 23s) [15:05:10] Kemayo: done [15:05:13] T359527: Automatic reference to shortened URL "youtu.be" gives "unreliable site" warning, doesn't let you insert reference - https://phabricator.wikimedia.org/T359527 [15:05:32] Amir1: thanks! 
[15:05:47] ^_^ [15:08:36] (03CR) 10Ssingh: [C: 03+1] DNS-related cookbooks: adapt for conftool state (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: 10Volans) [15:10:32] (03PS1) 10Elukey: slo_definitions: remove prometheus label from ml-serve definitions [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009551 [15:11:21] (03PS1) 10Muehlenhoff: Also fix the tomcat10 spec test to run on Bullseye/Bookworm, not Buster [puppet] - 10https://gerrit.wikimedia.org/r/1009552 (https://phabricator.wikimedia.org/T359333) [15:12:20] duesen: I'm a bit confused, if the issue happens in restbase why the fix is in mw? [15:16:30] RECOVERY - Check whether ferm is active by checking the default input chain on mw1425 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:17:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw2435 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:18:02] (03PS2) 10Daniel Kinzler: REST: ignore request body on GET requests [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009544 (https://phabricator.wikimedia.org/T359509) (owner: 10Jaime Nuche) [15:18:32] (03CR) 10Muehlenhoff: [C: 03+2] Also fix the tomcat10 spec test to run on Bullseye/Bookworm, not Buster [puppet] - 10https://gerrit.wikimedia.org/r/1009552 (https://phabricator.wikimedia.org/T359333) (owner: 10Muehlenhoff) [15:21:20] (03PS20) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [15:21:46] (03CR) 10BPirkle: [C: 03+2] REST: ignore request body on GET requests [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009544 (https://phabricator.wikimedia.org/T359509) (owner: 10Jaime Nuche) [15:23:00] (03PS4) 10Muehlenhoff: tomcat: When 
on bookworm, install from component [puppet] - 10https://gerrit.wikimedia.org/r/1009506 (https://phabricator.wikimedia.org/T359333) [15:24:15] (03CR) 10CI reject: [V: 04-1] tomcat: When on bookworm, install from component [puppet] - 10https://gerrit.wikimedia.org/r/1009506 (https://phabricator.wikimedia.org/T359333) (owner: 10Muehlenhoff) [15:27:48] Amir1: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1009544 got +2'd [15:28:11] would you care too much if I backport it now? the change seems reasonable enough [15:30:15] Amir1: The issue got introduced when MW core enforced some rules around method and body. In our case GET shouldn't have a body. [15:30:59] (03PS5) 10Muehlenhoff: tomcat: When on bookworm, install from component [puppet] - 10https://gerrit.wikimedia.org/r/1009506 (https://phabricator.wikimedia.org/T359333) [15:32:00] 06SRE, 06Content-Transform-Team-WIP, 10MW-on-K8s, 06Traffic, and 3 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. 
Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#9611908 (10MSantos) [15:32:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T357189)', diff saved to https://phabricator.wikimedia.org/P58647 and previous config saved to /var/cache/conftool/dbconfig/20240307-153200-arnaudb.json [15:32:08] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:32:14] (03CR) 10CI reject: [V: 04-1] tomcat: When on bookworm, install from component [puppet] - 10https://gerrit.wikimedia.org/r/1009506 (https://phabricator.wikimedia.org/T359333) (owner: 10Muehlenhoff) [15:33:01] (03CR) 10Sergio Gimeno: [C: 04-1] Add account_conversion event stream (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (owner: 10Cyndywikime) [15:39:45] (03PS2) 10Effie Mouzeli: sre.switchdc.mediawiki: add helm env variables to commands [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) [15:40:42] (03Merged) 10jenkins-bot: REST: ignore request body on GET requests [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009544 (https://phabricator.wikimedia.org/T359509) (owner: 10Jaime Nuche) [15:41:44] (03CR) 10Effie Mouzeli: sre.switchdc.mediawiki: add helm env variables to commands (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [15:42:58] I'll backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1009544 in the next 5 minutes if there are no objections [15:44:01] (03CR) 10CI reject: [V: 04-1] sre.switchdc.mediawiki: add helm env variables to commands [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [15:45:59] (03PS3) 10Effie Mouzeli: sre.switchdc.mediawiki: add helm env variables to commands [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 
(https://phabricator.wikimedia.org/T359154) [15:47:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P58648 and previous config saved to /var/cache/conftool/dbconfig/20240307-154706-arnaudb.json [15:47:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:49] !log jnuche@deploy2002 Started scap: Backport for [[gerrit:1009544|REST: ignore request body on GET requests (T359509)]] [15:49:03] T359509: REST API calls suddenly all returning 400 - https://phabricator.wikimedia.org/T359509 [15:50:22] !log jnuche@deploy2002 jnuche: Backport for [[gerrit:1009544|REST: ignore request body on GET requests (T359509)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:50:35] !log jnuche@deploy2002 jnuche: Continuing with sync [15:54:28] PROBLEM - Check whether ferm is active by checking the default input chain on mw1379 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:54:52] PROBLEM - Check whether ferm is active by checking the default input chain on parse1010 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:56:58] (03PS1) 10Muehlenhoff: Remove Tomcat spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1009558 (https://phabricator.wikimedia.org/T359333) [15:58:49] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts etherpad2001.codfw.wmnet [15:59:44] (03CR) 10Klausman: [C: 03+1] slo_definitions: remove prometheus label from ml-serve definitions [grafana-grizzly] - 
10https://gerrit.wikimedia.org/r/1009551 (owner: 10Elukey) [15:59:56] !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:1009544|REST: ignore request body on GET requests (T359509)]] (duration: 11m 06s) [16:00:00] T359509: REST API calls suddenly all returning 400 - https://phabricator.wikimedia.org/T359509 [16:00:02] 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9611979 (10CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/train-dev/-/merge_requests/48 exp/files/php/scap.cfg: Set testservers_check_cmd_*... [16:01:15] (03CR) 10Ahmon Dancy: "Pinging to make sure this doesn't get lost." [puppet] - 10https://gerrit.wikimedia.org/r/1007953 (https://phabricator.wikimedia.org/T358887) (owner: 10Ssingh) [16:02:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P58649 and previous config saved to /var/cache/conftool/dbconfig/20240307-160213-arnaudb.json [16:02:20] (03CR) 10Elukey: "Keith: o/ after a discussion with Tobias we are wondering if anything changed on the prometheus label point of view. 
I see:" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009551 (owner: 10Elukey) [16:02:20] !log deleting etherpad2001 VM -replaced by etherpad2002 - T357159 [16:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:35] T357159: Site: 2 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 [16:02:42] (03CR) 10Klausman: [C: 03+1] role::ml_k8s::staging::worker: add Dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [16:02:53] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [16:04:09] (03CR) 10Volans: "LGTM, just a final typo ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [16:04:15] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:04:16] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts etherpad2001.codfw.wmnet [16:04:19] 06SRE, 06Infrastructure-Foundations, 06collaboration-services, 10vm-requests: Site: 2 VM %request for etherpad - https://phabricator.wikimedia.org/T357159#9611988 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: `etherpad2001.codfw.wmnet` - etherpad2001.codfw.w... 
[16:04:58] (03PS4) 10Effie Mouzeli: sre.switchdc.mediawiki: add helm env variables to commands [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) [16:05:19] (03CR) 10Effie Mouzeli: sre.switchdc.mediawiki: add helm env variables to commands (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [16:05:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9611989 (10Marostegui) [16:05:36] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [16:06:04] (03CR) 10Effie Mouzeli: [C: 03+2] sre.switchdc.mediawiki: add helm env variables to commands [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [16:06:09] backport seems to have removed the blocker [16:06:10] train deploy in a couple of mins [16:06:31] !log bouncing prometheus@k8s.service - T343529 [16:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:43] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529 [16:08:03] 06SRE, 10Continuous-Integration-Infrastructure, 06collaboration-services, 10vm-requests, 13Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9611992 (10Dzahn) zuul has now succesfully been deployed to this machine by @jnuche :) [16:09:00] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009559 (https://phabricator.wikimedia.org/T354439) [16:09:02] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009559 
(https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [16:09:22] (03CR) 10Kamila Součková: [C: 03+1] mw-parsoid: Add 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009543 (owner: 10Clément Goubert) [16:09:30] (03PS1) 10Marostegui: data.yaml: Add bdgreenlee to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1009561 (https://phabricator.wikimedia.org/T359417) [16:09:43] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009559 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [16:10:00] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: add helm env variables to commands [cookbooks] - 10https://gerrit.wikimedia.org/r/1009468 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [16:15:15] (03PS2) 10Marostegui: data.yaml: Add bdgreenlee to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1009561 (https://phabricator.wikimedia.org/T359417) [16:17:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T357189)', diff saved to https://phabricator.wikimedia.org/P58650 and previous config saved to /var/cache/conftool/dbconfig/20240307-161720-arnaudb.json [16:17:35] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:18:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1209.eqiad.wmnet with reason: Maintenance [16:18:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1209.eqiad.wmnet with reason: Maintenance [16:18:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2165.codfw.wmnet with reason: Maintenance [16:18:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2165.codfw.wmnet with reason: Maintenance [16:19:08] 10SRE-Access-Requests, 06Data-Platform-SRE: Add user 
fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561 (10gmodena) 03NEW [16:19:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:19:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:19:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2112.codfw.wmnet with reason: Maintenance [16:19:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2112.codfw.wmnet with reason: Maintenance [16:20:21] (03Abandoned) 10Hnowlan: thumbor: don't set x-forwarded-for at haproxy level [deployment-charts] - 10https://gerrit.wikimedia.org/r/931592 (https://phabricator.wikimedia.org/T339863) (owner: 10Hnowlan) [16:20:28] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.21 refs T354439 [16:20:39] T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439 [16:21:14] (03CR) 10Marostegui: data.yaml: Add bdgreenlee to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009561 (https://phabricator.wikimedia.org/T359417) (owner: 10Marostegui) [16:24:07] (03CR) 10Dzahn: "I am not sure which of the 3 options it is. 
There is analytics-privatedata-user with and without shell access and with and without kerbero" [puppet] - 10https://gerrit.wikimedia.org/r/1009561 (https://phabricator.wikimedia.org/T359417) (owner: 10Marostegui) [16:24:28] RECOVERY - Check whether ferm is active by checking the default input chain on mw1379 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:24:52] RECOVERY - Check whether ferm is active by checking the default input chain on parse1010 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:25:06] (03CR) 10Brouberol: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1009371 (https://phabricator.wikimedia.org/T358727) (owner: 10Bking) [16:25:36] (03CR) 10Dzahn: "uid and UID number match, has approval and key matches. that seems all good, though the expiration date isn't mentioned." [puppet] - 10https://gerrit.wikimedia.org/r/1009561 (https://phabricator.wikimedia.org/T359417) (owner: 10Marostegui) [16:25:56] (03CR) 10Marostegui: "That's a good point - let me ask for some reason I assumed it was with ssh access simply cause they provided it." [puppet] - 10https://gerrit.wikimedia.org/r/1009561 (https://phabricator.wikimedia.org/T359417) (owner: 10Marostegui) [16:26:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9612069 (10Marostegui) Which access you specifically need for analytics-private-users? https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analyt... 
[16:28:03] seeing more of these less compilation errors since deploy: https://phabricator.wikimedia.org/T359414 [16:29:52] !log T343529 ✔ cdanis@prometheus2005.codfw.wmnet ~ 🕦☕sudo systemctl restart thanos-sidecar@k8s.service [16:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:57] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529 [16:30:25] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:37:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T352010)', diff saved to https://phabricator.wikimedia.org/P58651 and previous config saved to /var/cache/conftool/dbconfig/20240307-163706-ladsgroup.json [16:37:10] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:37:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:38:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [16:38:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2006.codfw.wmnet with OS bullseye [16:38:28] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install 
dbprov200[56] - https://phabricator.wikimedia.org/T355355#9612112 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [16:38:31] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9612113 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host dbprov2006.codfw.wmnet with OS bullseye [16:40:21] jouncebot: nowandnext [16:40:21] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [16:40:22] In 0 hour(s) and 19 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1700) [16:43:20] !log dancy@deploy2002 Installing scap version "4.70.0" for 373 hosts [16:44:05] !log dancy@deploy2002 Installation of scap version "4.70.0" completed for 373 hosts [16:45:51] (03CR) 10Clément Goubert: [C: 03+2] mw-parsoid: Add 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009543 (owner: 10Clément Goubert) [16:46:42] (03Merged) 10jenkins-bot: mw-parsoid: Add 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009543 (owner: 10Clément Goubert) [16:47:26] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:47:45] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:47:49] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:48:11] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:52:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P58652 and previous config saved to /var/cache/conftool/dbconfig/20240307-165213-ladsgroup.json [16:53:02] (03CR) 10Ahmon Dancy: "Eric, when's a good time for us to sync up to test out deployment with this change?" 
[software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1007701 (https://phabricator.wikimedia.org/T357739) (owner: 10Ahmon Dancy) [16:54:23] (03CR) 10Bking: [C: 03+2] site.pp: Move wdqs1025 into test role [puppet] - 10https://gerrit.wikimedia.org/r/1009371 (https://phabricator.wikimedia.org/T358727) (owner: 10Bking) [16:57:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1009511 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [17:00:05] jhathaway and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:20] (03CR) 10Dzahn: [C: 03+1] scap.cfg.erb: Set testservers_check_cmd_* in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117) (owner: 10Ahmon Dancy) [17:06:14] (03CR) 10RLazarus: [C: 03+2] sre.switchdc.mediawiki: Stop maintenance scripts on Kubernetes [cookbooks] - 10https://gerrit.wikimedia.org/r/1008583 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [17:06:54] claime: nothing in the Puppet window, if you wanted it for anything [17:07:12] nah I'm good [17:07:21] I just wanted to bump mw-parsoid [17:07:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P58653 and previous config saved to /var/cache/conftool/dbconfig/20240307-170720-ladsgroup.json [17:07:22] oh never mind, I see you went already :) briefly thrown off by different usernames [17:07:23] thanks tho [17:07:26] 👍 [17:07:26] (03PS1) 10Marostegui: installserver: Do not reimage es1038 [puppet] - 10https://gerrit.wikimedia.org/r/1009573 [17:08:07] dancy: clear to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007956 ? [17:08:22] Yes please! 
[17:08:27] Nice [17:08:28] claime: thanks [17:08:32] (03CR) 10Clément Goubert: [C: 03+2] scap.cfg.erb: Set testservers_check_cmd_* in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117) (owner: 10Ahmon Dancy) [17:10:15] (03PS1) 10Bking: wdqs: make "monitoring_tier" var optional [puppet] - 10https://gerrit.wikimedia.org/r/1009574 (https://phabricator.wikimedia.org/T359062) [17:10:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009574 (https://phabricator.wikimedia.org/T359062) (owner: 10Bking) [17:11:33] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Stop maintenance scripts on Kubernetes [cookbooks] - 10https://gerrit.wikimedia.org/r/1008583 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [17:11:50] (03CR) 10Marostegui: [C: 03+2] installserver: Do not reimage es1038 [puppet] - 10https://gerrit.wikimedia.org/r/1009573 (owner: 10Marostegui) [17:11:58] (03PS1) 10Dzahn: planet: add prometheus apache exporter to role [puppet] - 10https://gerrit.wikimedia.org/r/1009575 (https://phabricator.wikimedia.org/T359556) [17:14:08] dancy: Merged and puppet run on deploy2002, you can test at your convenience :) [17:14:41] !log dancy@deploy2002 Started scap: testing T358117 [17:14:46] T358117: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117 [17:15:28] (03PS21) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [17:15:38] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9612404 (10Marostegui) a:03Fabfur [17:15:53] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9612399 (10Marostegui) @Fabfur as a SRE I assume you'd self 
serve? [17:19:09] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9612414 (10BTullis) We should still seek approval from one of [[https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L429-L433|t... [17:19:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9612420 (10bdgreenlee) Sounds like, at least for now, **no kerberos, no ssh**. Presumably if it turns out I do need either, I can request it then. Than... [17:19:55] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9612424 (10Marostegui) Yeah, I think the whole procedure needs to be followed (using the correct template for this ticket too) [17:22:13] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:22:26] (03PS3) 10Marostegui: data.yaml: Add bdgreenlee to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1009561 (https://phabricator.wikimedia.org/T359417) [17:22:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T352010)', diff saved to https://phabricator.wikimedia.org/P58654 and previous config saved to /var/cache/conftool/dbconfig/20240307-172227-ladsgroup.json [17:22:41] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:22:47] (03CR) 10Marostegui: "No ssh, no kerberos: https://phabricator.wikimedia.org/T359417#9612420" [puppet] - 10https://gerrit.wikimedia.org/r/1009561 (https://phabricator.wikimedia.org/T359417) (owner: 10Marostegui) [17:25:56] !log 
dancy@deploy2002 Finished scap: testing T358117 (duration: 11m 15s) [17:26:00] T358117: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117 [17:28:12] !log set aside WAL for prometheus@k8s in eqiad and restart - T354399 [17:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:16] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399 [17:30:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009574 (https://phabricator.wikimedia.org/T359062) (owner: 10Bking) [17:30:46] 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, 10Release-Engineering-Team (Now this 🫠): Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402#9612486 (10dancy) [17:31:26] 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, 10Release-Engineering-Team (Now this 🫠): Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9612484 (10dancy) 05In progress→03Resolved Changes deployed and tested. Resolving this task. [17:32:32] 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, 10Release-Engineering-Team (Now this 🫠): Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9612505 (10Clement_Goubert) @dancy Thanks a bunch! \o/ [17:34:33] 10SRE-swift-storage: 2024-2025 ms swift capacity - https://phabricator.wikimedia.org/T359077#9612509 (10MatthewVernon) Additionally, we are retiring the last 9 12x4 T nodes from eqiad and the last 6 12x4T nodes from codfw and replacing them with 24x8T units. __After__ that refresh but __ignoring__ the proposed... 
[17:37:35] (03CR) 10Dzahn: [C: 03+1] data.yaml: Add bdgreenlee to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1009561 (https://phabricator.wikimedia.org/T359417) (owner: 10Marostegui) [17:38:33] (03PS2) 10Bking: wdqs: make "monitoring_tier" var optional [puppet] - 10https://gerrit.wikimedia.org/r/1009574 (https://phabricator.wikimedia.org/T359062) [17:39:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009574 (https://phabricator.wikimedia.org/T359062) (owner: 10Bking) [17:40:32] (03CR) 10Marostegui: [C: 03+2] data.yaml: Add bdgreenlee to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1009561 (https://phabricator.wikimedia.org/T359417) (owner: 10Marostegui) [17:41:52] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9612555 (10Marostegui) 05Open→03Resolved a:03Marostegui I have merged and deployed the change. Please give it 30 minutes or so for the change to... [17:43:49] !log set aside WAL for prometheus@k8s in codfw and restart - T354399 [17:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:53] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399 [17:47:36] (03PS3) 10Bking: wdqs: make "monitoring_tier" var optional [puppet] - 10https://gerrit.wikimedia.org/r/1009574 (https://phabricator.wikimedia.org/T359062) [17:54:52] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009574 (https://phabricator.wikimedia.org/T359062) (owner: 10Bking) [17:57:13] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:00:04] bd808: I seem to be stuck in Groundhog week. 
Sigh. Time for (yet another) Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1800). [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1800) [18:10:29] 06SRE, 06Commons, 06Data-Persistence (work done), 10MediaWiki-extensions-WikibaseClient, and 7 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730#9612740 (10Michael) [18:15:40] nothing needing my deploy window today. [18:17:19] (03CR) 10Btullis: Add new ceph container image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [18:19:04] !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@15edf4a]: (no justification provided) [18:19:13] !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@15edf4a]: (no justification provided) (duration: 00m 08s) [18:21:48] (PuppetZeroResources) firing: Puppet has failed generate resources on wdqs1025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:22:18] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 60 days, 0:00:00 on wdqs[1022-1025].eqiad.wmnet with reason: T337013 [18:22:23] T337013: [Epic] Splitting the graph in WDQS - https://phabricator.wikimedia.org/T337013 [18:22:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 60 days, 0:00:00 on wdqs[1022-1025].eqiad.wmnet with reason: T337013 [18:35:31] (03PS22) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [18:39:51] (03PS3) 
10MVernon: Add new ceph container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) [18:40:00] (03CR) 10MVernon: Add new ceph container image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [18:40:40] (03PS4) 10MVernon: Add new ceph container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) [18:49:39] !log running a wikidata dump manually on snapshot1009 for partitions 25,27 [18:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] jnuche and dduvall: Your horoscope predicts another MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1900). [19:00:29] (03PS8) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [19:00:31] (03PS8) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) [19:00:34] (03PS1) 10Andrew Bogott: profile::puppetserver: include ssldir_on_srv arg [puppet] - 10https://gerrit.wikimedia.org/r/1009588 (https://phabricator.wikimedia.org/T276327) [19:04:22] (03PS2) 10Andrew Bogott: profile::puppetserver: include ssldir_on_srv arg [puppet] - 10https://gerrit.wikimedia.org/r/1009588 (https://phabricator.wikimedia.org/T276327) [19:04:24] (03PS9) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [19:04:27] (03PS9) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) 
[19:04:40] (03CR) 10CI reject: [V: 04-1] profile::puppetserver: include ssldir_on_srv arg [puppet] - 10https://gerrit.wikimedia.org/r/1009588 (https://phabricator.wikimedia.org/T276327) (owner: 10Andrew Bogott) [19:08:04] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009588 (https://phabricator.wikimedia.org/T276327) (owner: 10Andrew Bogott) [19:10:55] (03CR) 10Andrew Bogott: [C: 03+2] profile::puppetserver: include ssldir_on_srv arg [puppet] - 10https://gerrit.wikimedia.org/r/1009588 (https://phabricator.wikimedia.org/T276327) (owner: 10Andrew Bogott) [19:12:21] 06SRE, 06Commons, 06Data-Persistence (work done), 10MediaWiki-extensions-WikibaseClient, and 8 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730#9613061 (10Michael) [19:12:32] (03PS1) 10Jdlrobson: Fixes: Less_Exception_Compiler [extensions/SearchVue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009337 (https://phabricator.wikimedia.org/T359414) [19:33:05] (03CR) 10Dzahn: [C: 03+2] planet: add prometheus apache exporter to role [puppet] - 10https://gerrit.wikimedia.org/r/1009575 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [19:35:29] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009393 [19:45:21] (03PS4) 10Herron: SLO queries for AQS 2.0 geo analytics Bug: T358751 Change-Id: I6f97fbfdb013787c1cfca590c6b089bdc2cb0198 [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009472 (https://phabricator.wikimedia.org/T358751) (owner: 10Sg912) [19:47:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:40] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:10] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:39:24] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:47:38] (03CR) 10Eevans: [C: 03+2] Switch from git-fat to git-lfs [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1007701 (https://phabricator.wikimedia.org/T357739) (owner: 10Ahmon Dancy) [20:47:40] (03CR) 10Eevans: [V: 03+2 C: 03+2] Switch from git-fat to git-lfs [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1007701 (https://phabricator.wikimedia.org/T357739) (owner: 10Ahmon Dancy) [20:48:19] jnuche: fix for https://phabricator.wikimedia.org/T359414 got merged to master. Did you want to backport to deploy branch? 
[20:48:28] (https://gerrit.wikimedia.org/r/1009337) [20:48:45] cc @dduvall @thcipriani ^ [20:49:01] !log dancy@deploy2002 Started deploy [cassandra/logstash-logback-encoder@162f72f]: (no justification provided) [20:49:57] !log dancy@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@162f72f]: (no justification provided) (duration: 00m 56s) [20:50:20] !log dancy@deploy2002 Started deploy [cassandra/logstash-logback-encoder@c200e79]: (no justification provided) [20:50:55] !log dancy@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@c200e79]: (no justification provided) (duration: 00m 35s) [20:55:24] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:59:21] Jdlrobson: noting it's well after j.nuche's workday. skimming the ticket and glancing at logs, the volume does look like it warrants a backport. [20:59:52] jouncebot: nowandnext [20:59:52] For the next 0 hour(s) and 0 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T1900) [20:59:52] In 0 hour(s) and 0 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T2100) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240307T2100) [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:00:46] Jdlrobson: i can go ahead and get that out [21:00:55] (cc: dduvall) [21:02:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/SearchVue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009337 (https://phabricator.wikimedia.org/T359414) (owner: 10Jdlrobson) [21:03:22] Thanks brennen ! [21:04:21] Jdlrobson: looks like a few minutes yet on CI, but anything to test there once it does hit test boxen? [21:04:42] (03Merged) 10jenkins-bot: Fixes: Less_Exception_Compiler [extensions/SearchVue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009337 (https://phabricator.wikimedia.org/T359414) (owner: 10Jdlrobson) [21:04:56] !log brennen@deploy2002 Started scap: Backport for [[gerrit:1009337|Fixes: Less_Exception_Compiler (T359414 T357740)]] [21:05:03] T359414: Less_Exception_Compiler: error evaluating function `fade` The first argument to fade must be a color index: 110 in QuickViewTutorialPopup.vue on line 3, column 331|