[00:19:39] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:19:45] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:25:42] !log moscovium - systemctl start logrotate T360391 [00:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:47] T360391: SystemdUnitFailed - moscovium - logrotate - https://phabricator.wikimedia.org/T360391 [00:29:35] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:29:42] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:32:55] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:33:02] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:37:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1011443 [00:37:40] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1011443 (owner: 10TrainBranchBot) [00:52:23] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:52:30] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:00:46] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:00:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:00:58] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1011443 (owner: 10TrainBranchBot) [01:03:04] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:03:11] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: 
apply [01:03:49] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T360395 (10phaultfinder) 03NEW [01:07:14] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:07:21] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:52:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:57:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T0200) [02:07:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.23 [core] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1011444 (https://phabricator.wikimedia.org/T354441) [02:07:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.42.0-wmf.23 [core] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1011444 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [02:26:49] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.23 [core] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1011444 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [02:37:16] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:47:13] (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:57:16] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T0300) [03:03:21] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.20 (duration: 03m 18s) [03:04:44] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012462 (https://phabricator.wikimedia.org/T354441) [03:04:45] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012462 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [03:05:26] 10ops-codfw, 06SRE: 14Inbound interface errors - 14https://phabricator.wikimedia.org/T360395#9640823 (10Papaul) 05Open→03Resolved a:03Papaul [03:05:28] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012462 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [03:05:49] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.23 refs T354441 [03:06:01] T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441 [03:15:48] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply 
[03:15:55] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:03:08] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.23 refs T354441 (duration: 57m 18s) [04:03:21] T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441 [04:17:43] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:17:50] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:23:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:24:04] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:50:20] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:50:26] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:51:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:05:55] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:06:02] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:21:45] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:33:53] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:34:00] !log @deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [05:49:09] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:49:16] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:56:58] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:57:04] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:57:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T0600) [06:01:18] (03PS1) 10Marostegui: Revert "db1246: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1011459 [06:05:50] (03CR) 10Marostegui: [C:03+2] Revert "db1246: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1011459 (owner: 10Marostegui) [06:06:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 1%: After HW issues', diff saved to https://phabricator.wikimedia.org/P58810 and previous config saved to /var/cache/conftool/dbconfig/20240319-060620-root.json [06:07:51] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: 14hw troubleshooting: Unidentified for db1246.eqiad.wmnet - 14https://phabricator.wikimedia.org/T359940#9640895 (10Marostegui) 05Open→03Resolved 14Started to repool this host. 
[06:11:01] (03PS1) 10Marostegui: installserver: Do not reimage es1037 [puppet] - 10https://gerrit.wikimedia.org/r/1012466 [06:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:15:03] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage es1037 [puppet] - 10https://gerrit.wikimedia.org/r/1012466 (owner: 10Marostegui) [06:18:30] (03PS1) 10Marostegui: db1154: Migrate to mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1012467 (https://phabricator.wikimedia.org/T358638) [06:20:27] (03CR) 10Marostegui: [C:03+2] db1154: Migrate to mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1012467 (https://phabricator.wikimedia.org/T358638) (owner: 10Marostegui) [06:21:23] (03PS3) 10AOkoth: miscweb: add security-landing-page values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1011028 (https://phabricator.wikimedia.org/T350796) [06:21:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 5%: After HW issues', diff saved to https://phabricator.wikimedia.org/P58811 and previous config saved to /var/cache/conftool/dbconfig/20240319-062126-root.json [06:22:03] (03CR) 10AOkoth: miscweb: add security-landing-page values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1011028 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [06:36:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: After HW issues', diff saved to https://phabricator.wikimedia.org/P58812 and previous config saved to /var/cache/conftool/dbconfig/20240319-063632-root.json [06:41:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:47:13] 
(SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:47:24] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:47:30] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:51:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: After HW issues', diff saved to https://phabricator.wikimedia.org/P58813 and previous config saved to /var/cache/conftool/dbconfig/20240319-065137-root.json [06:57:31] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:06:01] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:06:08] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:06:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: After HW issues', diff saved to https://phabricator.wikimedia.org/P58814 and previous config saved to /var/cache/conftool/dbconfig/20240319-070643-root.json [07:08:11] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:08:17] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:10:14] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:10:21] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:17:38] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:17:45] !log @deploy2002 
helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:21:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: After HW issues', diff saved to https://phabricator.wikimedia.org/P58815 and previous config saved to /var/cache/conftool/dbconfig/20240319-072149-root.json [07:22:46] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:22:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:24:51] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:24:57] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:26:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:27:04] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:32:44] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:32:51] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:36:51] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:36:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: After HW issues', diff saved to https://phabricator.wikimedia.org/P58816 and previous config saved to /var/cache/conftool/dbconfig/20240319-073655-root.json [07:36:57] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:38:53] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:39:00] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:41:14] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:41:21] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply 
[07:43:35] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:43:42] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:45:39] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:45:46] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:48:10] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:48:17] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:50:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:50:27] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:52:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:56:19] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:56:26] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:00:04] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:00:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:02:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:09:43] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test1003.wikimedia.org with OS bookworm [08:14:26] jouncebot: NoSQL [08:14:29] jouncebot: now [08:14:30] For the next 0 hour(s) and 45 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T0800) [08:14:53] I am going to push those Wikitech block/unblock hooks for Gerrit [08:15:10] which I talked about with Bryan last thursday but postponed the deploy till this week [08:17:56] !log hashar@deploy2002 Started scap: Backport for [[gerrit:1011151|wikitech: fix handling of Gerrit status code (T307558)]] [08:20:33] !log hashar@deploy2002 hashar: Backport for [[gerrit:1011151|wikitech: fix handling of Gerrit status code (T307558)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:20:41] testing [08:21:37] ah no it is on wikitech :/ [08:21:41] !log hashar@deploy2002 hashar: Continuing with sync [08:22:39] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test1003.wikimedia.org with reason: host reimage [08:24:42] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test1003.wikimedia.org with reason: host reimage [08:33:27] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:1011151|wikitech: fix handling of Gerrit status code (T307558)]] (duration: 15m 31s) [08:42:25] !log slyngshede@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host idp-test1003.wikimedia.org with OS bookworm [08:43:32] !log hashar@deploy2002 Started scap: Backport for [[gerrit:1011171|wikitech: fix curl_exec a falsey value (T307558)]] [08:45:53] !log hashar@deploy2002 hashar: Backport for [[gerrit:1011171|wikitech: fix curl_exec a falsey value (T307558)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:46:17] !log hashar@deploy2002 hashar: Continuing with sync [08:58:08] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:1011171|wikitech: fix curl_exec a falsey value (T307558)]] (duration: 14m 36s) [09:14:32] (03PS18) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [09:16:12] (03CR) 10Giuseppe Lavagetto: [C:03+2] Revert "Rakefile: remove useless files from generated docs" [puppet] - 10https://gerrit.wikimedia.org/r/1010570 (https://phabricator.wikimedia.org/T358507) (owner: 10Hashar) [09:16:23] poor wikibugs [09:16:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1011151 (https://phabricator.wikimedia.org/T307558) (owner: 10Hashar) [09:16:52] (03Merged) 10jenkins-bot: wikitech: fix handling of Gerrit status code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1011151 (https://phabricator.wikimedia.org/T307558) (owner: 10Hashar) [09:18:59] !log hashar@deploy2002 Started scap: Backport for [[gerrit:1012607|wikitech: close parenthesis in log message (T307558)]] [09:20:34] (03CR) 10Fabfur: [C:03+2] benthos: added minor unit tests [puppet] - 10https://gerrit.wikimedia.org/r/1012453 (https://phabricator.wikimedia.org/T359626) (owner: 10Fabfur) [09:20:58] (03PS2) 10Hashar: wikitech: fix curl_exec a falsey value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1011171 (https://phabricator.wikimedia.org/T307558) [09:21:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1011171 (https://phabricator.wikimedia.org/T307558) (owner: 10Hashar) [09:21:21] !log hashar@deploy2002 hashar: Backport for [[gerrit:1012607|wikitech: close parenthesis in log message (T307558)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:21:30] (03Merged) 10jenkins-bot: wikitech: fix curl_exec a falsey value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1011171 (https://phabricator.wikimedia.org/T307558) (owner: 10Hashar) [09:22:42] (03CR) 10Jelto: [C:03+1] "lgtm now, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1011028 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [09:22:51] !log hashar@deploy2002 hashar: Continuing with sync [09:27:36] (03CR) 10JMeybohm: profile::prometheus::k8s: move istio metrics to a separate job (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [09:28:12] 10SRE-swift-storage, 06Commons, 06serviceops: Commons thumbnails are broken for certain large sizes of thumbnail images - https://phabricator.wikimedia.org/T358738#9641068 (10TheDJ) >>! In T358738#9640913, @seav wrote: > Is there a way to use your shell trick to determine which images would need purging? No... 
[09:28:38] (03CR) 10JMeybohm: [C:03+1] mediawiki: Add mwscript labels to the job as well as the pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009373 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [09:28:48] (03PS1) 10Hashar: wikitech: close parenthesis in log message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012607 (https://phabricator.wikimedia.org/T307558) [09:29:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012607 (https://phabricator.wikimedia.org/T307558) (owner: 10Hashar) [09:29:16] (03Merged) 10jenkins-bot: wikitech: close parenthesis in log message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012607 (https://phabricator.wikimedia.org/T307558) (owner: 10Hashar) [09:29:48] 06SRE, 10SRE-Access-Requests: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098#9641081 (10Fabfur) Contacted privately on separate channel to inform her about this issue [09:30:46] (03CR) 10Hashar: "I have mentioned it on the task T222209 (*Cleanup logging and curl use in wikitech post-block hooks*)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1011171 (https://phabricator.wikimedia.org/T307558) (owner: 10Hashar) [09:31:38] (03PS1) 10Majavah: dynamicproxy: support wildcards in non-shared domains [puppet] - 10https://gerrit.wikimedia.org/r/1012608 (https://phabricator.wikimedia.org/T360363) [09:32:56] (03CR) 10Alexandros Kosiaris: [C:03+2] Route /w/CREDITS and /w/COPYING to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/1012439 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [09:33:24] (03CR) 10JMeybohm: Add template rendering external services egress NetworkPolicy resources (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [09:34:19] 
!log hashar@deploy2002 Finished scap: Backport for [[gerrit:1012607|wikitech: close parenthesis in log message (T307558)]] (duration: 15m 19s) [09:34:43] 10SRE-swift-storage, 06Commons, 06serviceops: Commons thumbnails are broken for certain large sizes of thumbnail images - https://phabricator.wikimedia.org/T358738#9641119 (10akosiaris) >>! In T358738#9639319, @TheDJ wrote: > ping @akosiaris Ideas on why codfw is out of date and won't correct ? Is it out of... [09:35:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.526s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:40:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.47s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:45:45] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:45:52] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:46:45] (03PS1) 10Majavah: hieradata: update codfw1dev horizon to 2024-03-19-094112 [puppet] - 10https://gerrit.wikimedia.org/r/1012611 [09:48:10] (03CR) 10Majavah: [C:03+2] hieradata: update codfw1dev horizon to 2024-03-19-094112 [puppet] - 10https://gerrit.wikimedia.org/r/1012611 (owner: 10Majavah) [09:50:53] 10ops-codfw, 06SRE: 14Inbound interface errors - 14https://phabricator.wikimedia.org/T358417#9641197 (10jcrespo) 14This is all very ok to me- unless I complain about having bad bandwidth (and 
only did in very specific cases where it affected important operations and there was some network issue), I think... [09:51:58] (03PS1) 10Slyngshede: R:idp Add new node to IDP node set. [puppet] - 10https://gerrit.wikimedia.org/r/1012612 (https://phabricator.wikimedia.org/T357748) [09:53:28] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1656/co" [puppet] - 10https://gerrit.wikimedia.org/r/1012612 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [09:56:19] (03PS1) 10Majavah: hieradata: update horizon to 2024-03-19-094112 [puppet] - 10https://gerrit.wikimedia.org/r/1012613 (https://phabricator.wikimedia.org/T360363) [09:57:57] (03CR) 10Majavah: [C:03+2] dynamicproxy: support wildcards in non-shared domains [puppet] - 10https://gerrit.wikimedia.org/r/1012608 (https://phabricator.wikimedia.org/T360363) (owner: 10Majavah) [09:58:15] (03Abandoned) 10Hashar: P:gerrit: Add logoutd script for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [09:59:30] 06SRE, 10CAS-SSO, 10Gerrit, 06Infrastructure-Foundations, and 2 others: 14Add logout.d script for Gerrit - 14https://phabricator.wikimedia.org/T286905#9641230 (10hashar) 05Open→03Declined 14Users are blocked in Gerrit via wikitech Special:Block which had some recent fixes as part of T307558. 
[09:59:36] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: 14Cookbook for centralised logouts and session status queries - 14https://phabricator.wikimedia.org/T283242#9641233 (10hashar) [10:08:25] (SystemdUnitFailed) firing: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:06] (03CR) 10Majavah: [C:03+2] hieradata: update horizon to 2024-03-19-094112 [puppet] - 10https://gerrit.wikimedia.org/r/1012613 (https://phabricator.wikimedia.org/T360363) (owner: 10Majavah) [10:12:05] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9641276 (10MoritzMuehlenhoff) [10:12:41] (03CR) 10Muehlenhoff: [C:03+1] "That should solve all our issues :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1012612 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:13:05] (03CR) 10Slyngshede: [V:03+1 C:03+2] R:idp Add new node to IDP node set. 
[puppet] - 10https://gerrit.wikimedia.org/r/1012612 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:16:44] (03PS3) 10Majavah: P:toolforge: move webservice CLI to the CLI profile [puppet] - 10https://gerrit.wikimedia.org/r/1012390 (https://phabricator.wikimedia.org/T314664) [10:16:44] (03PS3) 10Majavah: P:toolforge::bastion: remove tekton component [puppet] - 10https://gerrit.wikimedia.org/r/1012391 [10:16:46] (03PS1) 10Majavah: P:toolforge: ensure new bastions have en_US.UTF-8 locale [puppet] - 10https://gerrit.wikimedia.org/r/1012615 (https://phabricator.wikimedia.org/T314665) [10:17:55] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1012615 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [10:19:36] (03CR) 10David Caro: [C:03+1] dynamicproxy: support wildcards in non-shared domains [puppet] - 10https://gerrit.wikimedia.org/r/1012608 (https://phabricator.wikimedia.org/T360363) (owner: 10Majavah) [10:21:59] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:22:29] (03CR) 10CI reject: [V:04-1] P:toolforge: ensure new bastions have en_US.UTF-8 locale [puppet] - 10https://gerrit.wikimedia.org/r/1012615 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [10:22:31] (03PS1) 10Giuseppe Lavagetto: scap::master: add rsync server for the k8s release repo [puppet] - 10https://gerrit.wikimedia.org/r/1012617 [10:22:35] (03PS1) 10Giuseppe Lavagetto: scap::master: add k8s support to scap-master-sync [puppet] - 10https://gerrit.wikimedia.org/r/1012618 [10:23:13] (03PS2) 10Majavah: P:toolforge: ensure new bastions have en_US.UTF-8 locale 
[puppet] - 10https://gerrit.wikimedia.org/r/1012615 (https://phabricator.wikimedia.org/T314665) [10:23:13] (03PS4) 10Majavah: P:toolforge: move webservice CLI to the CLI profile [puppet] - 10https://gerrit.wikimedia.org/r/1012390 (https://phabricator.wikimedia.org/T314664) [10:23:15] (03PS4) 10Majavah: P:toolforge::bastion: remove tekton component [puppet] - 10https://gerrit.wikimedia.org/r/1012391 [10:25:59] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations: Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412 (10MoritzMuehlenhoff) 03NEW [10:28:57] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9641317 (10MoritzMuehlenhoff) [10:29:33] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:29:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:32:59] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:33:06] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:33:26] (03PS1) 10Slyngshede: Switch idp-test to new Bookworm server. 
[dns] - 10https://gerrit.wikimedia.org/r/1012620 (https://phabricator.wikimedia.org/T357748) [10:33:49] 06SRE, 06collaboration-services: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413 (10MoritzMuehlenhoff) 03NEW [10:35:38] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:35:44] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:35:58] 06SRE, 10Wikimedia-Mailing-lists: Mailing list request for Igbo Wikimedians - https://phabricator.wikimedia.org/T360350#9641363 (10Ladsgroup) Due to https://meta.wikimedia.org/wiki/Mailing_lists/Standardization the name of the mailing list will be "wikimedia-igbo" would that be okay with you? [10:36:42] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1012620 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:37:00] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9641355 (10MoritzMuehlenhoff) [10:37:15] (03CR) 10David Caro: [C:03+1] P:toolforge: ensure new bastions have en_US.UTF-8 locale [puppet] - 10https://gerrit.wikimedia.org/r/1012615 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [10:39:52] (03CR) 10David Caro: [C:03+1] P:toolforge: move webservice CLI to the CLI profile [puppet] - 10https://gerrit.wikimedia.org/r/1012390 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [10:40:07] (03CR) 10David Caro: [C:03+1] "Yes!" [puppet] - 10https://gerrit.wikimedia.org/r/1012391 (owner: 10Majavah) [10:41:00] (03CR) 10David Caro: "Same as https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012391 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/975768 (owner: 10Majavah) [10:41:12] (03CR) 10David Caro: [C:03+1] aptrepo: drop tekton component [puppet] - 10https://gerrit.wikimedia.org/r/975769 (owner: 10Majavah) [10:45:29] (03CR) 10Majavah: [C:03+2] P:toolforge: ensure new bastions have en_US.UTF-8 locale [puppet] - 10https://gerrit.wikimedia.org/r/1012615 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [10:46:59] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:48:25] (SystemdUnitFailed) firing: (2) httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:46] (03CR) 10Majavah: [C:03+2] P:toolforge: move webservice CLI to the CLI profile [puppet] - 10https://gerrit.wikimedia.org/r/1012390 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [10:50:10] (03PS1) 10Fabfur: haproxy: avoid UA header truncation [puppet] - 10https://gerrit.wikimedia.org/r/1012624 (https://phabricator.wikimedia.org/T358109) [10:50:15] (03CR) 10Majavah: [C:03+2] P:toolforge::bastion: remove tekton component [puppet] - 10https://gerrit.wikimedia.org/r/1012391 (owner: 10Majavah) [10:50:37] (03Abandoned) 10Majavah: P:toolforge: uninstall tekton component [puppet] - 10https://gerrit.wikimedia.org/r/975768 (owner: 10Majavah) [10:51:06] 06SRE, 10observability: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 (10MoritzMuehlenhoff) 03NEW [10:51:59] (03PS2) 10Majavah: aptrepo: drop tekton component [puppet] - 10https://gerrit.wikimedia.org/r/975769 [10:52:45] 06SRE, 10Wikimedia-Mailing-lists: Mailing list request for 
Igbo Wikimedians - https://phabricator.wikimedia.org/T360350#9641392 (10OtuNwachinemere) >>! In T360350#9641363, @Ladsgroup wrote: > Due to https://meta.wikimedia.org/wiki/Mailing_lists/Standardization the name of the mailing list will be "wikimedia-ig... [10:53:44] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:53:49] (03PS2) 10Fabfur: haproxy: avoid UA header truncation [puppet] - 10https://gerrit.wikimedia.org/r/1012624 (https://phabricator.wikimedia.org/T358109) [10:53:50] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:57:31] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1100) [11:01:39] (03PS1) 10Hashar: Revert "Route /w/CREDITS and /w/COPYING to /w/static.php" [puppet] - 10https://gerrit.wikimedia.org/r/1011461 (https://phabricator.wikimedia.org/T359643) [11:01:45] _joe_: taavi ^ [11:02:24] and akosiaris [11:02:29] I don't know why that got merged [11:02:31] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:02:38] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:03:43] 06SRE, 10Wikimedia-Mailing-lists: 14Mailing list request for Igbo Wikimedians - 14https://phabricator.wikimedia.org/T360350#9641440 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup 14{{done}} https://lists.wikimedia.org/postorius/lists/wikimedia-igbo.lists.wikimedia.org/ I made it a public mailing lis... 
[11:05:47] well I am reverting it [11:06:10] err [11:06:14] that is on puppet bah [11:06:15] :) [11:06:29] (03PS1) 10Fabfur: benthos: change Benthos prometheus port to avoid conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1012625 (https://phabricator.wikimedia.org/T358109) [11:07:48] <_joe_> hashar: I think we should instead merge the mediawiki-config patch [11:07:54] <_joe_> and move forward [11:07:56] nop [11:08:19] I am not going to merge / deploy a code I am not familiar with at all and risk spending the time dealing with it [11:08:25] but you can do it :) [11:08:42] I feel like it is easier to rollback the change that should not have been deployed and is causing whatever issue happens [11:10:31] (also I have hungry kids incoming home :D ) [11:10:53] We'll deal with it [11:10:59] Not the kids [11:11:12] <_joe_> lol [11:11:28] <_joe_> I'd rather deal with hungry kids than with mediawiki-config tbf [11:11:46] It's a toss up for me [11:11:55] But I have more XP with mw-cfg [11:12:09] Which tells you how much experience I have with hungry kids x) [11:12:18] * hashar throws kids at Clément [11:12:24] WHY [11:12:28] anyway [11:12:33] it is easier to rollback [11:12:59] It isn't, and I'd like to understand why a patch with a cross-repo dep managed to get merged without its dep being merged [11:13:07] instead of having X unrelated people trying to deploy a fixup to the scary `/w/static.php` [11:13:17] It's not a fixup [11:13:19] (03CR) 10Filippo Giunchedi: [C:03+1] benthos: change Benthos prometheus port to avoid conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1012625 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [11:13:22] it got merged cause puppet.git is an outlawer [11:13:24] It was the up-dep [11:13:39] folks V+2 and Submit a change there which get Gerrit to merge it [11:13:55] (03CR) 10Fabfur: [C:03+2] benthos: change Benthos prometheus port to avoid conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1012625 
(https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [11:14:03] <_joe_> well that patch has a +1 by Timo btw [11:14:08] while the `Depends-On` field in the commit message is only recognized by Zuul/CI (Gerrit does not know anything about that header) [11:14:17] Ah, TIL [11:14:23] <_joe_> jouncebot: nowandnext [11:14:24] For the next 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1100) [11:14:24] In 0 hour(s) and 45 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1200) [11:14:30] I thought gerrit recognized it and would block merging [11:14:32] <_joe_> ook, it's our spot [11:14:55] so essentially Puppet.git is managed like the rest of the industry/Github does it: the wrong way :] [11:14:58] (I am kidding) [11:15:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012427 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [11:15:46] * hashar hears noises, arms himself with carrots, potatoes and sausages [11:15:51] (03Merged) 10jenkins-bot: static.php: Handle COPYING and CREDITS files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012427 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [11:16:16] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:1012427|static.php: Handle COPYING and CREDITS files (T359643)]] [11:16:22] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [11:18:49] !log oblivian@deploy2002 dancy and oblivian: Backport for [[gerrit:1012427|static.php: Handle COPYING and CREDITS files (T359643)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:19:56] !log oblivian@deploy2002 dancy and oblivian: Continuing with sync [11:20:09] 06SRE, 06Infrastructure-Foundations, 
10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9641488 (10MoritzMuehlenhoff) [11:21:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:21:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:25:20] (03PS1) 10Clément Goubert: mediawiki: Route /w/CREDITS and /w/COPYING to /w/static.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012627 (https://phabricator.wikimedia.org/T359643) [11:26:25] (03Abandoned) 10Hashar: Revert "Route /w/CREDITS and /w/COPYING to /w/static.php" [puppet] - 10https://gerrit.wikimedia.org/r/1011461 (https://phabricator.wikimedia.org/T359643) (owner: 10Hashar) [11:26:33] (03PS1) 10Fabfur: benthos: provide fqdn as hostname to backward compatibility [puppet] - 10https://gerrit.wikimedia.org/r/1012628 (https://phabricator.wikimedia.org/T358109) [11:27:16] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:27:29] (03CR) 10Fabfur: [C:03+2] haproxy: avoid UA header truncation [puppet] - 10https://gerrit.wikimedia.org/r/1012624 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [11:27:31] <_joe_> fabfur: ^^ is that you? [11:27:53] <_joe_> the benthos job being unavailable [11:28:25] it's me [11:28:30] <_joe_> ack [11:28:56] but don't know why, maybe because I changed the port and this should reflect ? 
[11:30:20] yes that's right, things will converge at the next puppet run on prometheus [11:30:22] <_joe_> no idea, but that looks like something that prometheus should know [11:30:31] <_joe_> ah heh, that :) [11:30:51] ack, luckily I needed to do this just once [11:31:05] sorry for the alert [11:31:20] <_joe_> no biggie [11:31:34] (03PS1) 10Majavah: hieradata: use cfssl for cloudweb in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1012629 (https://phabricator.wikimedia.org/T317463) [11:31:40] (03CR) 10Alexandros Kosiaris: [C:03+1] mediawiki: Route /w/CREDITS and /w/COPYING to /w/static.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012627 (https://phabricator.wikimedia.org/T359643) (owner: 10Clément Goubert) [11:31:46] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:1012427|static.php: Handle COPYING and CREDITS files (T359643)]] (duration: 15m 29s) [11:31:58] yeah no worries, kinda unavoidable unless you run puppet on cp first and then immediately on prometheus ulsfo afterwards, but yeah no biggie as _joe_ said [11:32:13] (03PS2) 10Majavah: hieradata: use cfssl for cloudweb in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1012629 (https://phabricator.wikimedia.org/T357750) [11:32:16] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:32:23] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Route /w/CREDITS and /w/COPYING to /w/static.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012627 (https://phabricator.wikimedia.org/T359643) (owner: 10Clément Goubert) [11:33:04] <_joe_> claime: deployment done [11:33:21] ack, merging mw-on-k8s patch, will do a scap deployment of just that [11:33:25] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:33:31] 
!log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:33:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1659/co" [puppet] - 10https://gerrit.wikimedia.org/r/1012629 (https://phabricator.wikimedia.org/T357750) (owner: 10Majavah) [11:34:02] (03Merged) 10jenkins-bot: mediawiki: Route /w/CREDITS and /w/COPYING to /w/static.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012627 (https://phabricator.wikimedia.org/T359643) (owner: 10Clément Goubert) [11:35:27] !log cgoubert@deploy2002 Started scap: mediawiki: Route /w/CREDITS and /w/COPYING to /w/static.php - gerrit:1012627 - T359643 [11:35:48] <_joe_> claime: lmk when testservers are synced, so I can verify [11:35:51] !log cgoubert@deploy2002 Finished scap: mediawiki: Route /w/CREDITS and /w/COPYING to /w/static.php - gerrit:1012627 - T359643 (duration: 00m 23s) [11:36:01] <_joe_> uh [11:36:03] Hmm that was a bit quick [11:36:14] <_joe_> that looks like the chart wasn't picked up [11:36:14] I bet chartmuseum hadn't updated yet [11:36:18] <_joe_> you were too quick [11:36:22] tsk [11:36:24] story of my life [11:36:24] <_joe_> but gj scap [11:37:20] _joe_: that deployment will not stop on testservers I think [11:37:28] k8s-only implies force iirc [11:37:36] <_joe_> uhm well [11:37:44] I can do a manual on mw-debug [11:37:55] <_joe_> nah it's safe tbh [11:38:03] !log cgoubert@deploy2002 Started scap: mediawiki: Route /w/CREDITS and /w/COPYING to /w/static.php - gerrit:1012627 - T359643 [11:38:23] Yeah that's more like it [11:38:25] :D [11:38:30] <_joe_> but maybe we need to revisit it [11:38:39] <_joe_> the implying of --force [11:38:58] Sure, we have talked about it a bit already, I don't think it's too complicated [11:39:04] I have a couple tasks to write for scap [11:39:08] <_joe_> anyways, there's a lot of work to do on our side in determining how to 
perform infra-level changes [11:39:39] _joe_: should be up on mwdebug and canary releases [11:39:48] (you can test while it's deploying the rest [11:39:49] ) [11:39:56] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:40:05] Working on my end [11:40:09] <_joe_> works yes [11:40:10] (03PS2) 10Fabfur: benthos: provide fqdn as hostname to backward compatibility [puppet] - 10https://gerrit.wikimedia.org/r/1012628 (https://phabricator.wikimedia.org/T358109) [11:40:13] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:40:28] cgoubert@cumin2002:~$ curl -k -H 'Host: en.wikipedia.org' https://mwdebug.discovery.wmnet:4444/w/COPYING -I 2> /dev/null | grep x-powered [11:40:30] x-powered-by: PHP/7.4.33 [11:43:25] (SystemdUnitFailed) firing: (2) httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:02] !log cgoubert@deploy2002 Finished scap: mediawiki: Route /w/CREDITS and /w/COPYING to /w/static.php - gerrit:1012627 - T359643 (duration: 06m 59s) [11:45:15] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [11:48:25] (SystemdUnitFailed) resolved: (2) httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:51:52] claime: so kids arrive by regional train at 19:07 :) [11:52:03] have a good lunch everyone! [11:58:46] (03CR) 10Slyngshede: [C:03+2] Switch idp-test to new Bookworm server. 
[dns] - 10https://gerrit.wikimedia.org/r/1012620 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [11:59:30] (03CR) 10Clément Goubert: [C:03+1] scap::master: add rsync server for the k8s release repo [puppet] - 10https://gerrit.wikimedia.org/r/1012617 (owner: 10Giuseppe Lavagetto) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1200) [12:00:48] 06SRE, 10Wikimedia-Mailing-lists: 14Mailing list request for Igbo Wikimedians - 14https://phabricator.wikimedia.org/T360350#9641639 (10OtuNwachinemere) 14>>! In T360350#9641440, @Ladsgroup wrote: > {{done}} > https://lists.wikimedia.org/postorius/lists/wikimedia-igbo.lists.wikimedia.org/ > > I made it a... [12:02:40] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:05] !log Switch idp-test to upgraded Bookworm host [12:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:56] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1012628 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [12:12:40] (KubernetesRsyslogDown) firing: rsyslog on mw1363:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1363 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:17:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1363:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1363 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:22:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 40.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:48:12] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:48:29] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:48:35] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [12:49:00] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [12:49:21] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [12:51:06] 06SRE, 10SRE-swift-storage: outdated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#9641658 (10Wargo) What we know at the moment? What component is responsible for this? What was debugged? Any releated Tasks? 
[12:53:26] (03PS1) 10Clément Goubert: mw-parsoid: increase replicas to 155 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012639 [12:54:06] (03CR) 10Clément Goubert: [C:03+2] mw-parsoid: increase replicas to 155 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012639 (owner: 10Clément Goubert) [12:54:15] (03CR) 10Cathal Mooney: [C:03+2] Fix error when removing an interface's bridge membership (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009789 (https://phabricator.wikimedia.org/T359629) (owner: 10Cathal Mooney) [12:54:23] (03Merged) 10jenkins-bot: mw-parsoid: increase replicas to 155 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012639 (owner: 10Clément Goubert) [12:54:31] (03Merged) 10jenkins-bot: Fix error when removing an interface's bridge membership [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009789 (https://phabricator.wikimedia.org/T359629) (owner: 10Cathal Mooney) [12:54:40] (KubernetesRsyslogDown) firing: rsyslog on mw1363:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1363 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:56:25] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idp Use Tomcat9 build for Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009709 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [12:56:33] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:prometheus::ops Remove new LDAP hosts from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [12:59:33] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox-canary [12:59:37] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [12:59:40] (KubernetesRsyslogDown) resolved: rsyslog on 
mw1363:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1363 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:59:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1300) [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:06:50] !log manually adding 20 replicas to mw-parsoid to help with big reparse [13:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:11] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [13:07:29] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [13:07:33] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:07:48] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:08:24] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1024.eqiad.wmnet with reason: Decommissioning — T354561 [13:08:28] T354561: Decommission restbase10[19-27] - https://phabricator.wikimedia.org/T354561 [13:08:38] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1024.eqiad.wmnet with reason: Decommissioning — T354561 [13:15:22] (03PS1) 10Effie Mouzeli: traffic: Completely depool codfw from user traffic (switchover #1) [dns] - 10https://gerrit.wikimedia.org/r/1012645 (https://phabricator.wikimedia.org/T357547) [13:17:16] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:19:44] !log Restarting CI Jenkins [13:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:16] (JobUnavailable) resolved: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:23:12] (03CR) 10Filippo Giunchedi: "LGTM, modulo tests to be changed to make CI happy" [alerts] - 10https://gerrit.wikimedia.org/r/1010347 (owner: 10Tim Starling) [13:26:07] (03CR) 10Filippo Giunchedi: "LGTM modulo what Janis mentioned" [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:30:14] (03PS1) 10Clément Goubert: Add File:Claus_-_Conkle to blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012670 (https://phabricator.wikimedia.org/T353876) [13:30:15] (03PS3) 10Klausman: admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) [13:31:54] (03CR) 10Filippo Giunchedi: [C:03+1] profile::thanos: Add latency histogram buckets back for Istio [puppet] - 10https://gerrit.wikimedia.org/r/1011146 (https://phabricator.wikimedia.org/T359879) (owner: 10Klausman) [13:32:28] (03CR) 10Klausman: [C:03+2] profile::thanos: Add latency histogram buckets back for Istio [puppet] - 10https://gerrit.wikimedia.org/r/1011146 (https://phabricator.wikimedia.org/T359879) (owner: 10Klausman) [13:34:57] (03CR) 10Ssingh: [C:03+1] traffic: Completely depool codfw from user traffic (switchover #1) [dns] - 10https://gerrit.wikimedia.org/r/1012645 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [13:35:00] 06SRE, 10SRE-swift-storage: outdated DjVu 
file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#9641884 (10Wargo) [13:35:37] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Purge attempts for pages of files with large number of thumbnails fails on Commons - https://phabricator.wikimedia.org/T214759#9641886 (10Wargo) [13:35:46] 10SRE-swift-storage, 10Thumbor: Outdated thumbnails for djvu file on Commons cannot be purged and do not update - https://phabricator.wikimedia.org/T206190#9641887 (10Wargo) [13:36:42] (03CR) 10Giuseppe Lavagetto: [C:03+1] Add File:Claus_-_Conkle to blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012670 (https://phabricator.wikimedia.org/T353876) (owner: 10Clément Goubert) [13:36:50] (03PS2) 10Elukey: profile::prometheus::k8s: move istio metrics to a separate job [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) [13:36:58] (03CR) 10Elukey: profile::prometheus::k8s: move istio metrics to a separate job (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:37:06] 10ops-esams, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430 (10RobH) 03NEW [13:37:18] 10ops-esams, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9641914 (10RobH) [13:37:56] (03CR) 10Clément Goubert: [C:03+2] Add File:Claus_-_Conkle to blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012670 (https://phabricator.wikimedia.org/T353876) (owner: 10Clément Goubert) [13:38:23] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [13:38:42] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [13:38:53] (03CR) 10Kamila Součková: [C:03+2] sre.switchdc.mediawiki: update descriptions [cookbooks] - 10https://gerrit.wikimedia.org/r/1009854 (https://phabricator.wikimedia.org/T357547) (owner: 
10Kamila Součková) [13:39:01] 10ops-esams, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9641925 (10RobH) [13:39:13] (03Merged) 10jenkins-bot: Add File:Claus_-_Conkle to blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012670 (https://phabricator.wikimedia.org/T353876) (owner: 10Clément Goubert) [13:41:37] !log Deploying changeprop and changeprop-jobqueue - T353876 [13:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:43] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [13:41:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1011091 (owner: 10Arturo Borrero Gonzalez) [13:42:04] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [13:42:11] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [13:42:21] (03CR) 10Elukey: "The change is missing the correspondent Virtual Service and Destination Rule IIUC, it should be a similar use case as api-ro.discovery.wmn" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:42:29] (03CR) 10BBlack: [C:03+1] "Looks legit to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1009261 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [13:42:32] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [13:42:37] (03CR) 10Klausman: [C:03+1] profile::prometheus::k8s: move istio metrics to a separate job [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:43:18] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [13:43:23] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: update descriptions [cookbooks] - 10https://gerrit.wikimedia.org/r/1009854 (https://phabricator.wikimedia.org/T357547) (owner: 10Kamila Součková) [13:43:49] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [13:44:45] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [13:45:11] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [13:47:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 48.14% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:50:51] !log kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file mediawiki.httpd.accesslog-sampled.json --execute --throttle 50000000 T326419 [13:50:54] (03CR) 10Scott French: [C:03+1] traffic: Completely depool codfw from user traffic (switchover #1) [dns] - 10https://gerrit.wikimedia.org/r/1012645 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [13:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:56] T326419: Expand kafka-logging using hosts kafka-logging[12]00[45] - https://phabricator.wikimedia.org/T326419 
[13:51:12] (03CR) 10Giuseppe Lavagetto: [C:03+2] scap::master: add rsync server for the k8s release repo [puppet] - 10https://gerrit.wikimedia.org/r/1012617 (owner: 10Giuseppe Lavagetto) [13:53:21] (03PS1) 10Ayounsi: Add Netbox script to change a server's NIC [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1012680 (https://phabricator.wikimedia.org/T360297) [13:54:40] (KubernetesRsyslogDown) firing: rsyslog on mw1363:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1363 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:55:43] !log kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file rsyslog-info.json --execute --throttle 50000000 T326419 [13:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:02] (03PS1) 10Clément Goubert: Revert "mw-parsoid: increase replicas to 155" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1011464 [13:57:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:16] (03CR) 10Clément Goubert: [C:03+2] Revert "mw-parsoid: increase replicas to 155" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1011464 (owner: 10Clément Goubert) [13:58:36] (03CR) 10Alexandros Kosiaris: [C:03+1] traffic: Completely depool codfw from user traffic (switchover #1) [dns] - 10https://gerrit.wikimedia.org/r/1012645 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [13:59:24] (03Merged) 10jenkins-bot: Revert "mw-parsoid: increase replicas to 155" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1011464 (owner: 10Clément Goubert) [13:59:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1363:9105 is 
missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1363 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:00:04] Deploy window Northward Switchover: Services + Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1400) [14:03:43] (03CR) 10Effie Mouzeli: [C:03+2] traffic: Completely depool codfw from user traffic (switchover #1) [dns] - 10https://gerrit.wikimedia.org/r/1012645 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [14:07:25] !log Completely depool codfw from user traffic - T357547 [14:07:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:30] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 [14:08:09] (03PS2) 10Ayounsi: Add Netbox script to change a server's NIC [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1012680 (https://phabricator.wikimedia.org/T360297) [14:10:19] (03PS3) 10Ayounsi: Add Netbox script to change a server's NIC [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1012680 (https://phabricator.wikimedia.org/T360297) [14:16:55] !log jiji@cumin1002 START - Cookbook sre.discovery.datacenter depool all services in codfw: Northward DC Switchover, March 2024 - T357547 [14:17:00] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 [14:20:13] (03CR) 10Giuseppe Lavagetto: "Tested the rsync command and it works as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/1012618 (owner: 10Giuseppe Lavagetto) [14:22:12] !log depooling services from codfw - T357547 [14:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:16] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 [14:25:14] 10ops-esams, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9642233 (10RobH) a:03RobH [14:25:28] 10ops-esams, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9642243 (10RobH) Chatted with @ssingh as I had neglected some items we had discussed previously: * Adjusted this from a single installation window to 2 windows, 1 week apart, falling on Wednesday. ** All... [14:25:40] (03PS4) 10Ayounsi: Add Netbox script to change a server's NIC [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1012680 (https://phabricator.wikimedia.org/T360297) [14:28:09] (03PS5) 10Ayounsi: Add Netbox script to change a server's NIC [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1012680 (https://phabricator.wikimedia.org/T360297) [14:28:42] (03CR) 10Ayounsi: "PS4 is live on netbox-next" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1012680 (https://phabricator.wikimedia.org/T360297) (owner: 10Ayounsi) [14:29:47] (03PS1) 10RobH: dbprov updates [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) [14:30:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:30:47] (03CR) 10RobH: [C:03+2] dbprov updates [puppet] - 10https://gerrit.wikimedia.org/r/1012684 
(https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:33:46] (03CR) 10CI reject: [V:04-1] dbprov updates [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:34:05] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1012629 (https://phabricator.wikimedia.org/T357750) (owner: 10Majavah) [14:34:25] (03PS2) 10RobH: dbprov updates [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) [14:35:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 47.48% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:37:16] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:04] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:38:05] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9642302 (10Papaul) Zeroize done on asw-a3 and asw-a4 [14:40:03] !log jiji@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in codfw: Northward DC Switchover, March 2024 - T357547 [14:40:08] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 [14:40:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 47.28% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:41:35] (03PS1) 10Clément Goubert: mw-on-k8s: raise replicas for add ro traffic [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1012686 (https://phabricator.wikimedia.org/T357547) [14:42:33] (03CR) 10CI reject: [V:04-1] dbprov updates [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:42:55] !log Traffic+Services switchover complete, codfw is depooled - T357547 [14:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:18] (03PS3) 10RobH: dbprov updates [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) [14:44:30] 06SRE, 06Data-Platform-SRE: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439 (10MoritzMuehlenhoff) 03NEW [14:44:38] (03CR) 10RobH: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:45:02] 06SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9642419 (10MoritzMuehlenhoff) [14:45:10] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9642420 (10MoritzMuehlenhoff) [14:45:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 49.19% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:46:48] (03PS1) 10Ssingh: P:cumin: add alias for dnsbox hosts (dns-rec/auth) [puppet] - 10https://gerrit.wikimedia.org/r/1012688 [14:46:56] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9642415 (10MoritzMuehlenhoff) [14:47:13] (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:48:27] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9642421 (10MoritzMuehlenhoff) [14:49:48] (03CR) 10Muehlenhoff: dbprov updates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:52:30] (03CR) 10Jcrespo: [C:04-1] dbprov updates [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:53:22] (03CR) 10Jcrespo: [C:04-1] "I was working on this, the right recipe is the db one, manual is only to prevent accidental reimages." [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:53:52] (03Abandoned) 10RobH: dbprov updates [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:55:03] (03CR) 10Jcrespo: [C:04-1] dbprov updates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:55:07] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::misc::phabricator [14:55:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9642483 (10RobH) My patchset had mistakes, and @jcrespo has advised he is working on these patchsets. As such, I've abandoned my patchset. 
[14:55:47] (03Restored) 10Jcrespo: dbprov updates [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:55:51] (ATSBackendErrorsHigh) firing: (3) ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:55:59] known [14:55:59] We're on it [14:56:03] (03CR) 10Jcrespo: [C:04-1] "Nah, I can amend it. Let's keep it but I will own it." [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [14:56:03] ack [14:56:07] ok, thanks [14:56:09] !incidents [14:56:10] 4526 (UNACKED) [3x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet) [14:56:14] !ack 4526 [14:56:15] 4526 (ACKED) [3x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet) [14:56:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9642488 (10RobH) a:05VRiley-WMF→03jcrespo This installation is blocked until patchsets to allow installation are complete. I've removed the assignment... [14:56:27] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9642493 (10RobH) a:05Jhancock.wm→03jcrespo This installation is blocked until patchsets to allow installation are complete. I've removed the assignment from @Jhancock.wm t... 
[14:56:47] maps isn't happy per https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=maps&var-instance=All&viewPanel=87&from=now-3h&to=now [14:56:47] (03PS1) 10Muehlenhoff: Switch mariadb::misc::phabricator to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1012689 (https://phabricator.wikimedia.org/T349619) [14:56:50] 100% CPU [14:57:28] (03CR) 10Marostegui: [C:03+1] Switch mariadb::misc::phabricator to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1012689 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:57:35] !log jiji@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [14:57:48] !log jiji@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [14:57:52] (03PS3) 10Ayounsi: Routed Ganeti: Add v6 static route to VM [puppet] - 10https://gerrit.wikimedia.org/r/995032 (https://phabricator.wikimedia.org/T300152) [14:58:08] (03CR) 10Muehlenhoff: [C:03+2] Switch mariadb::misc::phabricator to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1012689 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:58:12] !log pooling kartotherian on codfw back [14:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:04] eoghan, jelto, and arnoldokoth: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1500). 
[15:00:07] we'll probably need to restart kartotherian in eqiad [15:00:29] !log restart kartotherian on eqiad [15:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:42] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:00:48] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:00:51] (ATSBackendErrorsHigh) firing: (4) ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:01:08] !incidents [15:01:08] 4526 (ACKED) [3x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet) [15:02:16] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::misc::phabricator [15:02:36] (03CR) 10Ahmon Dancy: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1012618 (owner: 10Giuseppe Lavagetto) [15:03:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9642543 (10jcrespo) I will take care, as I discussed previously with John, but to avoid future mistakes, @RobH is there a way to transmit the desired recip... 
[15:04:11] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9642547 (10MoritzMuehlenhoff) [15:05:24] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::misc::db_inventory [15:05:51] (ATSBackendErrorsHigh) resolved: (4) ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:06:02] slowly recovering [15:06:37] (03PS1) 10Muehlenhoff: Switch mariadb::misc::db_inventory to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1012693 (https://phabricator.wikimedia.org/T349619) [15:07:16] (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:58] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: use cfssl for cloudweb in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1012629 (https://phabricator.wikimedia.org/T357750) (owner: 10Majavah) [15:09:35] (03CR) 10Effie Mouzeli: [C:03+1] mw-on-k8s: raise replicas for add ro traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012686 (https://phabricator.wikimedia.org/T357547) (owner: 10Clément Goubert) [15:11:38] 06SRE, 10observability: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9642568 (10lmata) hi @MoritzMuehlenhoff, what timeline do you need us to meet for this change? 
[15:11:51] (03CR) 10SBassett: Remove X-Webkit-CSP-Report-Only response header from foundationwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003108 (https://phabricator.wikimedia.org/T357479) (owner: 10TheDJ) [15:12:16] (JobUnavailable) resolved: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:12:26] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:12:39] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:12:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9642583 (10RobH) >>! In T355353#9642542, @jcrespo wrote: > I will take care, as I discussed previously with John, but to avoid future mistakes, @RobH is th... [15:12:46] !log kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file rsyslog-notice.json --execute --throttle 50000000 T326419 [15:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:50] T326419: Expand kafka-logging using hosts kafka-logging[12]00[45] - https://phabricator.wikimedia.org/T326419 [15:12:50] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:12:59] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [15:14:15] (03CR) 10Gergő Tisza: [C:03+1] "Do you need help deploying this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [15:14:56] 06SRE, 10observability: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9642592 (10MoritzMuehlenhoff) >>! 
In T360414#9642568, @lmata wrote: > hi @MoritzMuehlenhoff, what timeline do you need us to meet for this change? This all ties into the greater scheme of Puppet 5... [15:14:58] !log repooling cp4037 for brief time (T358109) [15:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:03] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109 [15:15:20] (03CR) 10Muehlenhoff: [C:03+2] Switch mariadb::misc::db_inventory to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1012693 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:15:26] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [15:16:58] (03CR) 10Filippo Giunchedi: [C:03+1] "The regexp is invalid due to +" [puppet] - 10https://gerrit.wikimedia.org/r/1011146 (https://phabricator.wikimedia.org/T359879) (owner: 10Klausman) [15:16:58] (03CR) 10Clément Goubert: [C:03+2] mw-on-k8s: raise replicas for add ro traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012686 (https://phabricator.wikimedia.org/T357547) (owner: 10Clément Goubert) [15:17:27] !log Raising mw-web and mw-api-ext replicas for additional read-only traffic - T357547 [15:17:28] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [15:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:32] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 [15:17:39] (03CR) 10Thiemo Kreuz (WMDE): "Thanks for asking. No, I'm not able to deploy this. I'm afraid I never fully understood how code review works in this codebase." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [15:17:44] (03CR) 10Klausman: [C:03+2] "Ah, right! Good spot, I will send a fix." 
[puppet] - 10https://gerrit.wikimedia.org/r/1011146 (https://phabricator.wikimedia.org/T359879) (owner: 10Klausman) [15:18:16] (03Merged) 10jenkins-bot: mw-on-k8s: raise replicas for add ro traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012686 (https://phabricator.wikimedia.org/T357547) (owner: 10Clément Goubert) [15:18:45] (03PS5) 10Jcrespo: mediabackups: Add newly setup storage host backup1011 [puppet] - 10https://gerrit.wikimedia.org/r/995188 (https://phabricator.wikimedia.org/T334069) [15:18:45] (03PS5) 10Jcrespo: mediabackups: Add newly setup storage host backup2011 [puppet] - 10https://gerrit.wikimedia.org/r/995189 (https://phabricator.wikimedia.org/T334069) [15:18:52] (03PS4) 10Jcrespo: installserver: Update dbprov for reimage of dbprov[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [15:18:58] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:19:18] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:19:57] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:20:18] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:20:32] (03CR) 10CI reject: [V:04-1] installserver: Update dbprov for reimage of dbprov[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [15:20:39] (03PS1) 10Klausman: profile::thanos: Fix broken regex for istio latency bucket RR [puppet] - 10https://gerrit.wikimedia.org/r/1012696 (https://phabricator.wikimedia.org/T360428) [15:20:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::misc::db_inventory [15:21:26] (03PS2) 10Klausman: profile::thanos: Fix broken regex for istio latency bucket RR [puppet] - 10https://gerrit.wikimedia.org/r/1012696 (https://phabricator.wikimedia.org/T360428) [15:22:45] !log 
cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:23:02] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:24:37] (03PS3) 10Klausman: profile::thanos: Fix broken regex for istio latency bucket RR [puppet] - 10https://gerrit.wikimedia.org/r/1012696 (https://phabricator.wikimedia.org/T360428) [15:25:13] (03CR) 10Filippo Giunchedi: [C:03+2] profile::thanos: Fix broken regex for istio latency bucket RR [puppet] - 10https://gerrit.wikimedia.org/r/1012696 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:25:27] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] profile::thanos: Fix broken regex for istio latency bucket RR [puppet] - 10https://gerrit.wikimedia.org/r/1012696 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:25:42] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [15:26:03] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [15:26:26] 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9642666 (10lmata) >>! In T360414#9642592, @MoritzMuehlenhoff wrote: > This all ties into the greater scheme of Puppet 5 and Buster deprecations, so next quarte... [15:27:48] jouncebot nowandnext [15:27:48] For the next 0 hour(s) and 32 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1500) [15:27:49] In 0 hour(s) and 32 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1600) [15:29:07] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9642678 (10ssingh) Hi @RobH: Thanks for creating the task. 
In some further discussion with @BBlack today, we decided that we will do the following: - We have decided that we will depool esams p... [15:29:32] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:29:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:29:55] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:30:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:30:33] !log starting phabricator/phorge update (T358610) [15:30:36] !log brennen@deploy2002 Started deploy [phabricator/deployment@9617e09]: deploy to phab2002 for T358610 [15:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:37] T358610: Update wmf/stable to Phorge upstream's 2023.49 stable release - https://phabricator.wikimedia.org/T358610 [15:31:00] !log brennen@deploy2002 Finished deploy [phabricator/deployment@9617e09]: deploy to phab2002 for T358610 (duration: 00m 23s) [15:31:27] !log brennen@deploy2002 Started deploy [phabricator/deployment@9617e09]: deploy to phab1004 for T358610 [15:32:09] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9642721 (10RobH) [15:32:36] (03PS1) 10Clément Goubert: mw-web: Bump replicas another 15% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012698 (https://phabricator.wikimedia.org/T357547) [15:32:47] !log brennen@deploy2002 Finished deploy [phabricator/deployment@9617e09]: deploy to phab1004 for T358610 (duration: 01m 19s) [15:33:45] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9642727 (10RobH) Remote hands 
won't have any ability to power down a host other than by pressing the front power button. It would reduce potential complexity if we power down all the hosts for t... [15:34:34] (03CR) 10Effie Mouzeli: [C:03+1] mw-web: Bump replicas another 15% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012698 (https://phabricator.wikimedia.org/T357547) (owner: 10Clément Goubert) [15:34:46] (03CR) 10Clément Goubert: [C:03+2] mw-web: Bump replicas another 15% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012698 (https://phabricator.wikimedia.org/T357547) (owner: 10Clément Goubert) [15:35:01] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446 (10klausman) 03NEW [15:35:44] (03Merged) 10jenkins-bot: mw-web: Bump replicas another 15% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012698 (https://phabricator.wikimedia.org/T357547) (owner: 10Clément Goubert) [15:35:48] !log kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file rsyslog-warning.json --execute --throttle 50000000 T326419 [15:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:53] T326419: Expand kafka-logging using hosts kafka-logging[12]00[45] - https://phabricator.wikimedia.org/T326419 [15:36:06] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2098.codfw.wmnet [15:36:16] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:36:22] (03Abandoned) 10Vgutierrez: admin: Remove cdobbins SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1011090 (owner: 10Vgutierrez) [15:36:33] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:36:35] !log restart kartotherian on codfw [15:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:01] (03PS1) 10Muehlenhoff: Switch db2098 to Puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/1012699 (https://phabricator.wikimedia.org/T349619) [15:37:38] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:37:51] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:38:58] (03CR) 10Muehlenhoff: [C:03+2] Switch db2098 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1012699 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:39:48] !log kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file udp_localhost-err.json --execute --throttle 50000000 T326419 [15:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:11] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9642787 (10MoritzMuehlenhoff) [15:40:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 49.91% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:41:21] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9642791 (10ssingh) >>! In T360430#9642721, @RobH wrote: > Remote hands won't have any ability to power down a host other than by pressing the front power button. It would reduce potential comple... 
[15:42:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2098.codfw.wmnet [15:43:38] (03PS1) 10Elukey: ml-services: update Docker image for Readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012700 (https://phabricator.wikimedia.org/T360111) [15:45:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:45:21] yes yes [15:46:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9642811 (10jcrespo) No, Robh, this is not your fault- I delayed doing it because emergencies and then offsite and then rest hours/vacations. Now, knowing t... [15:47:01] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9642812 (10Papaul) Zeroize done on all the old switches in role a [15:49:36] (03CR) 10Jcrespo: "Moritz please a quick check?" [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [15:50:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 44.5% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:52:17] 10ops-eqiad, 06SRE, 10procurement: install (2) 1.92TB SSDs from decom into prometheus100[56] - https://phabricator.wikimedia.org/T359632#9642838 (10fgiunchedi) @Jclark-ctr @VRiley-WMF please ping me on irc when you get on site tomorrow and we can coordinate, I'll be around, thank you! 
[15:53:01] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9642847 (10dcaro) [15:54:17] !log kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file udp_localhost-info.json --execute --throttle 50000000 T326419 [15:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:22] T326419: Expand kafka-logging using hosts kafka-logging[12]00[45] - https://phabricator.wikimedia.org/T326419 [15:55:13] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9642859 (10MoritzMuehlenhoff) [15:57:08] (03CR) 10AikoChou: [C:03+1] ml-services: update Docker image for Readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012700 (https://phabricator.wikimedia.org/T360111) (owner: 10Elukey) [15:59:41] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [16:00:04] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 48.51% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:00:34] (03PS1) 10Bking: cloudelastic: check/alert on cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) [16:01:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) (owner: 10Bking) [16:01:50] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] base: standard_packages: install fzf [puppet] - 10https://gerrit.wikimedia.org/r/1011091 (owner: 10Arturo Borrero Gonzalez) [16:01:58] (03CR) 10CI reject: [V:04-1] cloudelastic: check/alert on cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) (owner: 10Bking) [16:02:12] !log installing wireshark security updates [16:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:24] (03CR) 10Jcrespo: [C:03+2] installserver: Update dbprov for reimage of dbprov[12]00[56] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012684 (https://phabricator.wikimedia.org/T355353) (owner: 10RobH) [16:02:32] (03PS1) 10Papaul: Remove asw-a from homer [homer/public] - 10https://gerrit.wikimedia.org/r/1012705 (https://phabricator.wikimedia.org/T358244) [16:04:01] (03CR) 10Papaul: [C:03+2] Remove asw-a from homer [homer/public] - 10https://gerrit.wikimedia.org/r/1012705 (https://phabricator.wikimedia.org/T358244) (owner: 10Papaul) [16:04:20] (03PS1) 10Clément Goubert: mw-on-k8s: Lower idle %age for saturation alert [alerts] - 10https://gerrit.wikimedia.org/r/1012706 (https://phabricator.wikimedia.org/T357547) [16:06:01] (03PS1) 10Ahmon Dancy: Remove /w/COPYING and /w/CREDITS symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012708 (https://phabricator.wikimedia.org/T359643) 
[16:06:59] (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:31] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9642913 (10Papaul) [16:10:19] (03PS1) 10Ahmon Dancy: static.php: Handle COPYING and CREDITS files [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1012710 (https://phabricator.wikimedia.org/T359643) [16:10:20] !log installing dpdk security updates [16:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:40] (03Abandoned) 10Ahmon Dancy: static.php: Handle COPYING and CREDITS files [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1012710 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [16:13:26] <_joe_> jouncebot: nowandnext [16:13:26] For the next 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1600) [16:13:26] In 0 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1700) [16:13:40] <_joe_> ok, then I guess I can test sync-masters [16:14:01] (03CR) 10Giuseppe Lavagetto: [C:03+2] scap::master: add k8s support to scap-master-sync [puppet] - 10https://gerrit.wikimedia.org/r/1012618 (owner: 10Giuseppe Lavagetto) [16:16:25] (03CR) 10Giuseppe Lavagetto: [C:03+1] mw-on-k8s: Lower idle %age for saturation alert [alerts] - 10https://gerrit.wikimedia.org/r/1012706 (https://phabricator.wikimedia.org/T357547) (owner: 10Clément Goubert) [16:16:35] (03CR) 10Clément Goubert: [C:03+2] mw-on-k8s: Lower idle %age for saturation alert [alerts] - 10https://gerrit.wikimedia.org/r/1012706 
(https://phabricator.wikimedia.org/T357547) (owner: 10Clément Goubert) [16:16:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9642936 (10jcrespo) This should unblock both the eqiad and the codfw tasks- except if there is an unexpected bug, but the overall idea should be there. [16:16:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9642938 (10jcrespo) [16:17:01] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update Docker image for Readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012700 (https://phabricator.wikimedia.org/T360111) (owner: 10Elukey) [16:17:41] (03Merged) 10jenkins-bot: mw-on-k8s: Lower idle %age for saturation alert [alerts] - 10https://gerrit.wikimedia.org/r/1012706 (https://phabricator.wikimedia.org/T357547) (owner: 10Clément Goubert) [16:18:37] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9642948 (10jcrespo) [16:18:48] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9642944 (10jcrespo) a:05jcrespo→03Jhancock.wm Done. [16:19:38] !log oblivian@deploy2002 Started scap: null k8s-only deployment to test scap-master-sync [16:20:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.15% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:20:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9642954 (10jcrespo) a:05jcrespo→03VRiley-WMF I hope this is the right assignment, but not sure. 
[16:25:15] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.15% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:25:31] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9642974 (10Papaul) [16:26:56] (03CR) 10Fabfur: [V:03+1 C:03+2] benthos: provide fqdn as hostname to backward compatibility [puppet] - 10https://gerrit.wikimedia.org/r/1012628 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [16:28:26] !log oblivian@deploy2002 Finished scap: null k8s-only deployment to test scap-master-sync (duration: 08m 47s) [16:29:09] (03PS1) 10Fabfur: benthos: add $schema key to validate schema [puppet] - 10https://gerrit.wikimedia.org/r/1012712 (https://phabricator.wikimedia.org/T360450) [16:30:15] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:31:09] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9642996 (10Papaul) [16:31:41] (03CR) 10Elukey: [C:03+2] ml-services: update Docker image for Readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012700 (https://phabricator.wikimedia.org/T360111) (owner: 10Elukey) [16:31:49] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9642999 (10MoritzMuehlenhoff) [16:32:17] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove asw-a-codfw mgmt DNS - pt1979@cumin2002" [16:33:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove asw-a-codfw mgmt DNS - pt1979@cumin2002" [16:33:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[16:36:20] (03CR) 10Gmodena: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1012712 (https://phabricator.wikimedia.org/T360450) (owner: 10Fabfur) [16:36:48] !log dancy@deploy2002 Installing scap version "4.73.0" for 373 hosts [16:37:45] !log dancy@deploy2002 Installation of scap version "4.73.0" completed for 373 hosts [16:39:12] (03CR) 10Fabfur: [C:03+2] benthos: add $schema key to validate schema [puppet] - 10https://gerrit.wikimedia.org/r/1012712 (https://phabricator.wikimedia.org/T360450) (owner: 10Fabfur) [16:39:45] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1012715 [16:40:06] (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1012715 (owner: 10Ahmon Dancy) [16:40:56] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1012715 (owner: 10Ahmon Dancy) [16:41:54] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:42:01] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:44:49] !log oblivian@deploy2002 Started scap: null k8s-only deployment to test scap-master-sync (take 2) [16:45:09] !log kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file udp_localhost-warning.json --execute --throttle 50000000 T326419 [16:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:15] T326419: Expand kafka-logging using hosts kafka-logging[12]00[45] - https://phabricator.wikimedia.org/T326419 [16:45:51] (03PS4) 10Jdlrobson: Enable night mode on pilot wikis in AMC mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012452 (https://phabricator.wikimedia.org/T359152) [16:47:15] (03PS1) 10Ssingh: cookbooks.sre.dns: add 
roll-reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) [16:49:25] (03CR) 10Krinkle: [C:03+1] Remove /w/COPYING and /w/CREDITS symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012708 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [16:51:39] (03PS2) 10Ssingh: cookbooks.sre.dns: add roll-reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) [16:52:13] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:52:20] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:53:50] !log oblivian@deploy2002 Finished scap: null k8s-only deployment to test scap-master-sync (take 2) (duration: 09m 01s) [16:54:06] !log repooling cp4037 for brief time (T358109) [16:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:10] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109 [16:54:15] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [16:55:22] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [16:58:33] (03CR) 10Ssingh: "Ready for review (I think? :). Note that the A:dnsbox alias depends on Id881f31adb136b29c6db97263b3e1a9cc45640ca." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1700) [17:08:04] (03PS2) 10Bking: cloudelastic: check/alert on cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) [17:09:14] (03CR) 10CI reject: [V:04-1] cloudelastic: check/alert on cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) (owner: 10Bking) [17:10:09] !log sudo cumin "A:dns-rec" "disable-puppet 'merging CR 1009261'" [17:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:24] (03PS1) 10Fabfur: benthos: add simple kafka_franz batching [puppet] - 10https://gerrit.wikimedia.org/r/1012724 (https://phabricator.wikimedia.org/T360454) [17:10:31] (03PS3) 10Bking: cloudelastic: check/alert on cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) [17:11:12] (03CR) 10Ssingh: [C:03+2] P:dns::auth: skipping running authdns-update on host if not pooled [puppet] - 10https://gerrit.wikimedia.org/r/1009261 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:11:47] (03CR) 10CI reject: [V:04-1] cloudelastic: check/alert on cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) (owner: 10Bking) [17:13:55] (03PS4) 10Bking: cloudelastic: check/alert on cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) [17:14:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=dnsbox,name=dns6001.wikimedia.org,service=authdns-update [17:15:08] (03CR) 10CI reject: [V:04-1] cloudelastic: check/alert on cluster inconsistencies [puppet] - 
10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) (owner: 10Bking) [17:15:30] !log running dummy authdns-update [17:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:40] (KubernetesRsyslogDown) firing: rsyslog on mw1440:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1440 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:16:00] (03CR) 10Fabfur: [C:03+2] benthos: add simple kafka_franz batching [puppet] - 10https://gerrit.wikimedia.org/r/1012724 (https://phabricator.wikimedia.org/T360454) (owner: 10Fabfur) [17:16:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=authdns-update [17:19:13] !log sudo cumin -b1 -s120 "A:dns-rec and not P{dns6001*}" "run-puppet-agent --enable 'merging CR 1009261'" [17:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:55] !log repooling cp4037 for brief time (T358109) [17:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:59] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109 [17:19:59] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [17:22:01] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [17:30:01] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9643269 (10ssingh) Rob, once the time/date is confirmed, please let me know here or on IRC and I will send an email to sre@. Thanks! 
[17:32:24] (03PS3) 10Dzahn: admin: absent user kcv-wikimf, renamed to kcvelaga [puppet] - 10https://gerrit.wikimedia.org/r/1011187 (https://phabricator.wikimedia.org/T358658) [17:35:31] (03CR) 10Dzahn: [C:03+2] admin: absent user kcv-wikimf, renamed to kcvelaga [puppet] - 10https://gerrit.wikimedia.org/r/1011187 (https://phabricator.wikimedia.org/T358658) (owner: 10Dzahn) [17:39:21] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - 14https://phabricator.wikimedia.org/T358658#9643295 (10Dzahn) 05Open→03Resolved [17:39:44] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - 14https://phabricator.wikimedia.org/T358658#9643290 (10Dzahn) 14>>! In T358658#9618875, @KCVelaga_WMF wrote: > @cmooney all permissions and access for `kcvelaga` are... [17:40:36] 06SRE, 10SRE-Access-Requests: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098#9643296 (10Dzahn) a:05eoghan→03None [17:42:19] (03PS1) 10Dzahn: admins: revoke ssh key for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1012727 (https://phabricator.wikimedia.org/T352098) [17:46:20] (03PS1) 10Majavah: dynamicproxy: use http 1.1 for backend connections [puppet] - 10https://gerrit.wikimedia.org/r/1012728 (https://phabricator.wikimedia.org/T354116) [17:46:32] (03CR) 10Dzahn: [C:03+2] admins: revoke ssh key for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1012727 (https://phabricator.wikimedia.org/T352098) (owner: 10Dzahn) [17:46:51] 06SRE, 10Cumin, 06Infrastructure-Foundations: Feature request: When cumin is running with -b (and -s), it should display the current host being affected - https://phabricator.wikimedia.org/T355811#9643304 (10ssingh) For another data point: this can also be handy when you are running `-b1 -s` and w... 
[17:52:15] (03CR) 10Dzahn: [C:03+2] "better safe than sorry, she has already been contacted about generating a new key" [puppet] - 10https://gerrit.wikimedia.org/r/1012727 (https://phabricator.wikimedia.org/T352098) (owner: 10Dzahn) [18:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T1800) [18:00:55] o/ [18:02:25] I'm going to deploy a mediawiki-config change first. [18:04:11] (03PS1) 10Dzahn: bump version of static bugzilla image to 2024-03-19-172702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012731 (https://phabricator.wikimedia.org/T101522) [18:05:02] Nevermind. Rolling the train first. [18:05:26] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012733 (https://phabricator.wikimedia.org/T354441) [18:05:27] (03CR) 10Dzahn: "version string taken from https://gitlab.wikimedia.org/repos/sre/miscweb/bugzilla/-/jobs/228211 though it doesn't show up in https://dock" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012731 (https://phabricator.wikimedia.org/T101522) (owner: 10Dzahn) [18:05:29] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012733 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [18:06:00] !log running dummy authdns-update on dns1004 and dns6001 [18:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:32] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012733 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [18:07:40] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:10:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1440:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1440 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:16:32] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns1004.wikimedia.org,service=authdns-update [18:16:44] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns1004.wikimedia.org,service=authdns-update [18:21:11] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.23 refs T354441 [18:21:16] T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441 [18:22:36] (03CR) 10Dzahn: [C:03+2] bump version of static bugzilla image to 2024-03-19-172702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012731 (https://phabricator.wikimedia.org/T101522) (owner: 10Dzahn) [18:23:50] (03Merged) 10jenkins-bot: bump version of static bugzilla image to 2024-03-19-172702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012731 (https://phabricator.wikimedia.org/T101522) (owner: 10Dzahn) [18:31:56] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [18:33:17] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:34:18] !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:35:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012708 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [18:36:35] !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:37:13] (03Merged) 10jenkins-bot: Remove /w/COPYING and /w/CREDITS symlinks [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1012708 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy) [18:37:30] !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:37:37] !log dancy@deploy2002 Started scap: Backport for [[gerrit:1012708|Remove /w/COPYING and /w/CREDITS symlinks (T359643)]] [18:37:45] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [18:39:52] !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:39:59] !log dancy@deploy2002 dancy: Backport for [[gerrit:1012708|Remove /w/COPYING and /w/CREDITS symlinks (T359643)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:40:31] !log dancy@deploy2002 dancy: Continuing with sync [18:52:35] !log dancy@deploy2002 Finished scap: Backport for [[gerrit:1012708|Remove /w/COPYING and /w/CREDITS symlinks (T359643)]] (duration: 14m 57s) [18:52:39] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [18:53:38] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: Access to rua-dmarc@wikimedia.org - https://phabricator.wikimedia.org/T360462 (10DBu-WMF) 03NEW [18:56:07] 10ops-eqiad, 06SRE, 10Wikidata, 10wmde-wikidata-tech, and 2 others: 14Reclaim recently-decommed CP host for WDQS (see T352253) - 14https://phabricator.wikimedia.org/T358727#9643611 (10bking) 05Open→03Resolved 14Apologies for not posting this sooner. `wdqs1025` has been ready for use since the abov... 
[18:59:07] (03PS4) 10Santiago Faci: [DNM] Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) [19:00:31] (03PS5) 10Santiago Faci: Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) [19:00:55] (03CR) 10Santiago Faci: Update the WikiLambda instrumentation to use core interaction events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [19:25:17] (03PS1) 10Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1012740 [19:25:39] (03CR) 10Ahmon Dancy: [C:03+2] Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1012740 (owner: 10Ahmon Dancy) [19:26:04] (03CR) 10Ahmon Dancy: [V:03+2 C:03+2] Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1012740 (owner: 10Ahmon Dancy) [19:26:14] 10ops-eqiad, 06SRE, 06Data-Engineering: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T359702#9643764 (10Jclark-ctr) Replaced disk 5. noticed 2nd disk failure disk 7. opened another ticket for replacement of disk 7 You have successfully submitted request SR187258816. 
[19:26:42] 10ops-eqiad, 06SRE: Degraded RAID on dumpsdata1006 - https://phabricator.wikimedia.org/T360468 (10ops-monitoring-bot) 03NEW [19:34:27] 10ops-eqiad, 06SRE, 06Data-Engineering: 14Degraded RAID on dumpsdata1007 - 14https://phabricator.wikimedia.org/T359702#9643787 (10Jclark-ctr) 05Open→03Resolved 14replaced disk 7 with onhand disk will put replacement into extra storage when it arrives [19:34:40] 10ops-eqiad, 06SRE: Degraded RAID on dumpsdata1006 - https://phabricator.wikimedia.org/T360468#9643791 (10Jclark-ctr) a:03Jclark-ctr no disk issues it is rebuilding [19:34:57] 10ops-eqiad, 06SRE: 14Degraded RAID on dumpsdata1006 - 14https://phabricator.wikimedia.org/T360468#9643793 (10Jclark-ctr) 05Open→03Resolved [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240319T2000). [20:00:05] jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:38] Looks like it's just my patch today, I can self deploy [20:02:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1011457 (https://phabricator.wikimedia.org/T359983) (owner: 10Jdlrobson) [20:07:13] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:58] (03CR) 10Krinkle: [C:04-1] Use more compact PHP7 syntax where possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [20:12:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:13:03] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) (owner: 10Andrew Bogott) [20:23:22] (03Merged) 10jenkins-bot: The new class should be present alongside the old class for all page views [skins/MinervaNeue] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1011457 (https://phabricator.wikimedia.org/T359983) (owner: 10Jdlrobson) [20:23:49] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:1011457|The new class should be present alongside the old class for all page views (T359983)]] [20:23:54] T359983: Rename the 
skin night mode classes to something more sensible before they become widely used - https://phabricator.wikimedia.org/T359983 [20:25:12] (03PS1) 10Dzahn: peopleweb: set envoy::ssl_provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) [20:26:11] !log jdrewniak@deploy2002 jdrewniak and jdlrobson: Backport for [[gerrit:1011457|The new class should be present alongside the old class for all page views (T359983)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:26:52] !log jdrewniak@deploy2002 jdrewniak and jdlrobson: Continuing with sync [20:27:07] (03CR) 10CI reject: [V:04-1] peopleweb: set envoy::ssl_provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [20:28:13] (03PS5) 10Bking: cloudelastic: check/alert on cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) [20:29:47] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [20:30:49] (03PS6) 10Bking: cloudelastic: check/alert on cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) [20:32:15] (03PS7) 10Bking: cloudelastic: check/alert on cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) [20:32:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) (owner: 10Bking) [20:39:00] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:1011457|The new class should be present alongside the old class for all page views (T359983)]] (duration: 15m 11s) [20:39:06] T359983: Rename the skin night mode classes to something more sensible before they become widely used - 
https://phabricator.wikimedia.org/T359983 [20:39:32] (03PS1) 10Ahmon Dancy: mime: Register `.owl` as application/rdf+xml [core] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1012766 (https://phabricator.wikimedia.org/T171807) [20:40:15] (03PS5) 10Jdrewniak: Enable night mode on pilot wikis in AMC mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012452 (https://phabricator.wikimedia.org/T359152) (owner: 10Jdlrobson) [20:40:48] (03PS6) 10Jdrewniak: Enable night mode on pilot wikis in AMC mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012452 (https://phabricator.wikimedia.org/T359152) (owner: 10Jdlrobson) [20:41:23] (03PS7) 10Jdrewniak: Enable night mode on pilot wikis in AMC mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012452 (https://phabricator.wikimedia.org/T359152) (owner: 10Jdlrobson) [20:41:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012452 (https://phabricator.wikimedia.org/T359152) (owner: 10Jdlrobson) [20:42:31] (03Merged) 10jenkins-bot: Enable night mode on pilot wikis in AMC mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012452 (https://phabricator.wikimedia.org/T359152) (owner: 10Jdlrobson) [20:43:00] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:1012452|Enable night mode on pilot wikis in AMC mode (T359152)]] [20:43:05] T359152: Deploy initial version of night mode to pilot wikis on the mobile website for testing - https://phabricator.wikimedia.org/T359152 [20:45:41] (03CR) 10Dzahn: "hmm.. is it all or do we need to set more of the options... 
https://puppet-compiler.wmflabs.org/output/1012749/1661/people1004.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [20:46:15] !log jdrewniak@deploy2002 jdrewniak and jdlrobson: Backport for [[gerrit:1012452|Enable night mode on pilot wikis in AMC mode (T359152)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:47:19] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9644060 (10Papaul) [20:47:33] !log jdrewniak@deploy2002 jdrewniak and jdlrobson: Continuing with sync [20:50:24] 10ops-codfw, 06SRE: Degraded RAID on elastic2037 - https://phabricator.wikimedia.org/T359742#9644064 (10Papaul) @RKemper hello please see @Jhancock.wm comment above. Thank you [20:53:39] 10ops-codfw, 06SRE: install (2) 1.92TB SSDs from decom into prometheus200[56] - https://phabricator.wikimedia.org/T359631#9644068 (10Papaul) @fgiunchedi hello I will be working with you tomorrow on this since @Jhancock.wm has some things to take care of @16UTC [20:56:42] (03CR) 10Ebernhardson: [C:03+1] "puppet compiler output seems reasonable to me" [puppet] - 10https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) (owner: 10Bking) [20:56:57] 10ops-codfw, 06SRE: Degraded RAID on elastic2037 - https://phabricator.wikimedia.org/T359742#9644071 (10bking) Thanks @Papaul and @Jhancock.wm ! As you pointed out, the server is out of warranty. We're working on decommissioning in T358882 , but in the meantime, I'll get a puppet patch up to silence icinga a... 
[20:59:45] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:1012452|Enable night mode on pilot wikis in AMC mode (T359152)]] (duration: 16m 45s)
[20:59:51] T359152: Deploy initial version of night mode to pilot wikis on the mobile website for testing - https://phabricator.wikimedia.org/T359152
[21:05:09] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9644083 (Papaul) @Jhancock.wm please proceed with this task and let me know if you have any issues.
[21:10:01] (PS1) Bking: elastic2037: silence icinga alerts [puppet] - https://gerrit.wikimedia.org/r/1012751 (https://phabricator.wikimedia.org/T359742)
[21:12:57] (PS1) Majavah: P:toolforge: remove support for grid bastions [puppet] - https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665)
[21:20:21] (CR) Bking: [C:+2] cloudelastic: check/alert on cluster inconsistencies [puppet] - https://gerrit.wikimedia.org/r/1012703 (https://phabricator.wikimedia.org/T358541) (owner: Bking)
[21:20:47] (PS1) Majavah: Add python3 back to images that used to have it via webservice-runner [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012753
[21:23:06] (CR) BryanDavis: [C:+1] "It would be nice to put the python cgi runner block conditionally back in shared/lighttpd/webservice-runner too."
[docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012753 (owner: Majavah)
[21:23:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:25:59] (CR) Majavah: [C:+2] "If you mean" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012753 (owner: Majavah)
[21:26:33] (Merged) jenkins-bot: Add python3 back to images that used to have it via webservice-runner [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012753 (owner: Majavah)
[21:27:24] (CR) Ebernhardson: [C:+1] elastic2037: silence icinga alerts [puppet] - https://gerrit.wikimedia.org/r/1012751 (https://phabricator.wikimedia.org/T359742) (owner: Bking)
[21:28:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:29:41] (PS2) Bking: elastic2037: silence icinga alerts [puppet] - https://gerrit.wikimedia.org/r/1012751 (https://phabricator.wikimedia.org/T359742)
[21:33:50] (CR) Bking: [C:+2] elastic2037: silence icinga alerts [puppet] - https://gerrit.wikimedia.org/r/1012751 (https://phabricator.wikimedia.org/T359742) (owner: Bking)
[21:35:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:39:42] (PS1) Jdlrobson: Make night theme available on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1012755
(https://phabricator.wikimedia.org/T359152)
[21:44:37] (CR) Stoyofuku-wmf: [C:+1] "I feel good about this based on the other one (but looks like I don't have +2 rights sorry 😭)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1012755 (https://phabricator.wikimedia.org/T359152) (owner: Jdlrobson)
[21:47:01] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1012688 (owner: Ssingh)
[22:00:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:02:20] (PS1) Fabfur: benthos: enabled batching policy for memory buffer too [puppet] - https://gerrit.wikimedia.org/r/1012756 (https://phabricator.wikimedia.org/T360454)
[22:05:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:06:28] (CR) Fabfur: [C:+2] benthos: enabled batching policy for memory buffer too [puppet] - https://gerrit.wikimedia.org/r/1012756 (https://phabricator.wikimedia.org/T360454) (owner: Fabfur)
[22:09:54] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[22:12:04] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[22:15:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:31:16] SRE-swift-storage, Commons, serviceops: Commons thumbnails are broken for certain
large sizes of thumbnail images - https://phabricator.wikimedia.org/T358738#9644283 (tstarling) I thought there was no cross-DC replication of thumbnails. T299125#8221206 seems to support that. So it's expected that a b...
[22:32:39] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:32:46] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:33:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:39:45] (CR) Muehlenhoff: peopleweb: set envoy::ssl_provider to cfssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[22:48:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:51:31] (PS1) Andrew Bogott: puppetserver: add puppet7-facts-export-nodb.py [puppet] - https://gerrit.wikimedia.org/r/1012764 (https://phabricator.wikimedia.org/T351450)
[22:51:32] (PS1) Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - https://gerrit.wikimedia.org/r/1012765
[22:55:04] (PS2) Andrew Bogott: puppetserver: add puppet7-facts-export-nodb.py [puppet] - https://gerrit.wikimedia.org/r/1012764 (https://phabricator.wikimedia.org/T351450)
[22:55:04] (PS2) Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - https://gerrit.wikimedia.org/r/1012765
[22:59:06] (CR) CI reject: [V:-1] puppetserver: add puppet7-facts-export-nodb.py [puppet] -
https://gerrit.wikimedia.org/r/1012764 (https://phabricator.wikimedia.org/T351450) (owner: Andrew Bogott)
[22:59:35] (CR) Dzahn: peopleweb: set envoy::ssl_provider to cfssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[23:06:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:06:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:08:57] (PS1) Catrope: htmlform: Fix double escaping in Label div [core] (wmf/1.42.0-wmf.23) - https://gerrit.wikimedia.org/r/1012768 (https://phabricator.wikimedia.org/T360381)
[23:10:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:15:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:20:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:25:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:31:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs -
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:46:02] (PS1) Fabfur: benthos: switch to unix socket for performance testing [puppet] - https://gerrit.wikimedia.org/r/1012790 (https://phabricator.wikimedia.org/T360454)
[23:46:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:52:18] (CR) Fabfur: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1012790 (https://phabricator.wikimedia.org/T360454) (owner: Fabfur)
[23:55:32] (PS2) Fabfur: benthos: switch to unix socket for performance testing [puppet] - https://gerrit.wikimedia.org/r/1012790 (https://phabricator.wikimedia.org/T360454)
[23:58:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown