[00:17:57] <denisse>	 !log Adding a logger processor to the `parse_ncredir_log_format` on `ncredir2001` to examine the JSON structure - T364354
[00:17:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:00] <stashbot>	 T364354: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy - https://phabricator.wikimedia.org/T364354
[00:35:52] <denisse>	 I can't find the root cause but I think it has to do with the grok processor.
[00:36:05] <denisse>	 More specifically, incorrect parsing of its JSON output.
[00:42:35] <denisse>	 !log Writing output to `/tmp/benthos_output.txt` shows that the grok processor's output is being parsed correctly - T364354
[00:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:40] <stashbot>	 T364354: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy - https://phabricator.wikimedia.org/T364354
[00:46:36] <denisse>	 I'm not sure how to debug this further. My hypotheses about what could be causing the issue were incorrect: JSON parsing is correct and the path queries are also correct.
[00:47:04] <denisse>	 !log Reverting debug changes to their previous state - T364354
[00:47:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:57] <sukhe>	 denisse: thank you for looking into it <3. Traffic can take it up tomorrow!
[00:51:16] <denisse>	 sukhe: Running yamllint on the config file doesn't show any breaking errors. However, I'm curious about this line: `root = this.message`. I wonder if it should be `root: this.message`.
[00:51:40] <denisse>	 I think that would be the only "error" in the YAML file.
[00:53:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:53:31] <sukhe>	 denisse: I am not a benthos expert by any means, but I have seen this enough times in fabfur's patches that I think this is correct (the benthos format, that is): https://www.benthos.dev/docs/guides/bloblang/walkthrough/
[00:56:10] <denisse>	 sukhe: That makes sense, thank you. Looking at the benthos logs, it seems like bloblang can indeed be used in YAML files, so that shouldn't be the issue.
[00:57:47] <icinga-wm>	 PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:00:31] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 203 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:02:17] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 176 probes of 732 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:02:49] <icinga-wm>	 RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.63 ms
[01:03:50] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T364358 (phaultfinder) NEW
[01:05:27] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 39 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:07:19] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 38 probes of 732 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:08:09] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.4 [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1027481 (https://phabricator.wikimedia.org/T361398)
[01:08:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.4 [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1027481 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[01:21:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:26:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:28:34] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.4 [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1027481 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[01:30:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:30:56] <wikibugs>	 SRE-Access-Requests, Movement-Insights: Restore nshahquinn-wmf and hghani to analytics-product-users - https://phabricator.wikimedia.org/T364359 (nshahquinn-wmf) NEW
[01:32:00] <wikibugs>	 SRE-Access-Requests, Movement-Insights: Restore nshahquinn-wmf and hghani to analytics-product-users - https://phabricator.wikimedia.org/T364359#9776333 (nshahquinn-wmf) I didn't file separate tickets for each one of us since this is really a bug fix rather than a new request for permissions, but I'm hap...
[01:35:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0200)
[02:36:27] <jinxer-wm>	 FIRING: [6x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0300)
[03:00:13] <jinxer-wm>	 FIRING: [6x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:35] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028611 (https://phabricator.wikimedia.org/T361398)
[03:01:36] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028611 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[03:02:16] <wikibugs>	 (CR) CI reject: [V:-1] testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028611 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[03:03:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:52:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[04:00:04] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0400)
[04:04:53] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.1, 1.43.0-wmf.2 (duration: 04m 50s)
[04:25:33] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (install7001), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:00:09] <icinga-wm>	 RECOVERY - MD RAID on mw2382 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[05:13:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0600).
[06:01:20] <wikibugs>	 (CR) Hashar: "I guess I will send our customizations to upstream so we don't have to carry them over :)" [puppet] - https://gerrit.wikimedia.org/r/1027726 (owner: Paladox)
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:14:18] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: tune autoscaling for damaging, goodfaith and reverted [deployment-charts] - https://gerrit.wikimedia.org/r/1028552 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[06:20:06] <wikibugs>	 (CR) Hashar: "I have proposed changes upstream to get rid of our templates customization:" [puppet] - https://gerrit.wikimedia.org/r/1027726 (owner: Paladox)
[06:37:32] <wikibugs>	 (CR) Slyngshede: pcc: fix delete-canceled-pcc-run-dirs timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) (owner: JHathaway)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:01:27] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:10:26] <wikibugs>	 (PS1) Muehlenhoff: Revert "hiera: update installserver for magru" [puppet] - https://gerrit.wikimedia.org/r/1028707 (https://phabricator.wikimedia.org/T364016)
[07:11:20] <wikibugs>	 (PS1) Muehlenhoff: Revert "sites: update installserver for magru" [homer/public] - https://gerrit.wikimedia.org/r/1028709
[07:18:39] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1028707 (https://phabricator.wikimedia.org/T364016) (owner: Muehlenhoff)
[07:20:15] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/1028709 (owner: Muehlenhoff)
[07:20:56] <kart_>	 urbanecm: any updates from cswiki WP on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1025300? AFAIK, it was requested by the community only.
[07:21:03] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Revert "sites: update installserver for magru" [homer/public] - https://gerrit.wikimedia.org/r/1028709 (owner: Muehlenhoff)
[07:21:44] <wikibugs>	 (Merged) jenkins-bot: Revert "sites: update installserver for magru" [homer/public] - https://gerrit.wikimedia.org/r/1028709 (owner: Muehlenhoff)
[07:22:41] <wikibugs>	 (PS11) Sohom Datta: [ruwiki] Limit the use of the ContentTranslation tool [mediawiki-config] - https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: Dreamrimmer)
[07:25:09] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Revert "hiera: update installserver for magru" [puppet] - https://gerrit.wikimedia.org/r/1028707 (https://phabricator.wikimedia.org/T364016) (owner: Muehlenhoff)
[07:26:59] <NMW03>	 hi. who is the deployer now?
[07:31:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install7001.wikimedia.org with OS bullseye
[07:32:26] <wikibugs>	 (CR) Volans: [C:+1] pcc: fix delete-canceled-pcc-run-dirs timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) (owner: JHathaway)
[07:35:16] <wikibugs>	 ops-eqiad, SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9776549 (hashar)
[07:39:21] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host install7001.wikimedia.org with OS bullseye
[07:40:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install7001.wikimedia.org
[07:41:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:42:42] <wikibugs>	 (PS1) Muehlenhoff: Revert "Enable install7001 as webproxy in magru" [dns] - https://gerrit.wikimedia.org/r/1028754
[07:45:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:46:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:48:28] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Revert "Enable install7001 as webproxy in magru" [dns] - https://gerrit.wikimedia.org/r/1028754 (owner: Muehlenhoff)
[07:50:03] <zabe>	 jouncebot: nowandnext
[07:50:03] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0700)
[07:50:03] <jouncebot>	 In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1000)
[07:50:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:51:09] <wikibugs>	 (CR) Zabe: [C:+2] Stop setting wgPasswordDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/1027335 (owner: Zabe)
[07:52:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:52:21] <wikibugs>	 (PS3) Zabe: Stop setting wgPasswordDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/1027335
[07:52:28] <wikibugs>	 (CR) Zabe: Stop setting wgPasswordDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/1027335 (owner: Zabe)
[07:52:30] <wikibugs>	 (CR) Zabe: [C:+2] Stop setting wgPasswordDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/1027335 (owner: Zabe)
[07:53:14] <wikibugs>	 (Merged) jenkins-bot: Stop setting wgPasswordDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/1027335 (owner: Zabe)
[07:58:28] <wikibugs>	 (PS1) Zabe: Use OpenSSL for PBKDF2 password hashing on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1028756 (https://phabricator.wikimedia.org/T320929)
[07:59:03] <wikibugs>	 (CR) Brouberol: [C:+1] Add the wmf-java-cacerts truststore to all remaining airflow hosts [puppet] - https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181) (owner: Btullis)
[07:59:11] <wikibugs>	 (CR) CI reject: [V:-1] Use OpenSSL for PBKDF2 password hashing on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1028756 (https://phabricator.wikimedia.org/T320929) (owner: Zabe)
[08:00:07] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:1027335|Stop setting wgPasswordDefault]]
[08:00:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:00:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:00:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install7001.wikimedia.org
[08:02:08] <wikibugs>	 (PS2) Zabe: Use OpenSSL for PBKDF2 password hashing on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1028756 (https://phabricator.wikimedia.org/T320929)
[08:02:46] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:1027335|Stop setting wgPasswordDefault]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:02:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install7001.wikimedia.org
[08:02:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:03:17] <logmsgbot>	 !log zabe@deploy1002 zabe: Continuing with sync
[08:06:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install7001.wikimedia.org - jmm@cumin2002"
[08:06:43] <wikibugs>	 (CR) Slyngshede: [C:+2] P:trafficserver::backend add cloudtestidm [puppet] - https://gerrit.wikimedia.org/r/1026790 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[08:07:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install7001.wikimedia.org - jmm@cumin2002"
[08:07:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:07:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install7001.wikimedia.org on all recursors
[08:07:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install7001.wikimedia.org on all recursors
[08:07:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install7001.wikimedia.org - jmm@cumin2002"
[08:07:46] <wikibugs>	 (PS1) Brouberol: elasticsearch: defaut to rolling restarting a single node at a time [cookbooks] - https://gerrit.wikimedia.org/r/1028757 (https://phabricator.wikimedia.org/T362534)
[08:08:10] <wikibugs>	 (CR) Gehel: [C:+1] "LGTM, simple enough" [cookbooks] - https://gerrit.wikimedia.org/r/1028757 (https://phabricator.wikimedia.org/T362534) (owner: Brouberol)
[08:08:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install7001.wikimedia.org - jmm@cumin2002"
[08:08:46] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2267 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:08:52] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1457 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:10:02] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse2020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:10:25] <wikibugs>	 (CR) Btullis: [C:+1] global_config: Only expose the IP of the analytics meta master [puppet] - https://gerrit.wikimedia.org/r/1028486 (https://phabricator.wikimedia.org/T361955) (owner: Brouberol)
[08:11:04] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:11:42] <wikibugs>	 (CR) Brouberol: [V:+1 C:+2] global_config: Only expose the IP of the analytics meta master [puppet] - https://gerrit.wikimedia.org/r/1028486 (https://phabricator.wikimedia.org/T361955) (owner: Brouberol)
[08:13:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install7001.wikimedia.org with OS bullseye
[08:14:27] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1 C:+2] prometheus: use datacenters for snmp_exporter [puppet] - https://gerrit.wikimedia.org/r/1028502 (https://phabricator.wikimedia.org/T364016) (owner: Filippo Giunchedi)
[08:15:32] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1027335|Stop setting wgPasswordDefault]] (duration: 15m 24s)
[08:16:01] <wikibugs>	 (CR) Zabe: [C:+2] Use OpenSSL for PBKDF2 password hashing on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1028756 (https://phabricator.wikimedia.org/T320929) (owner: Zabe)
[08:16:49] <wikibugs>	 (Merged) jenkins-bot: Use OpenSSL for PBKDF2 password hashing on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1028756 (https://phabricator.wikimedia.org/T320929) (owner: Zabe)
[08:17:20] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:1028756|Use OpenSSL for PBKDF2 password hashing on testwiki (T320929)]]
[08:17:23] <stashbot>	 T320929: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929
[08:18:52] <taavi>	 zabe: ping me when I can deploy something please?
[08:19:00] <zabe>	 sure
[08:19:14] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Make caps an optional parameter to the Ceph::Auth::ClientAuth type [puppet] - https://gerrit.wikimedia.org/r/1026867 (https://phabricator.wikimedia.org/T364105) (owner: Btullis)
[08:19:44] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:1028756|Use OpenSSL for PBKDF2 password hashing on testwiki (T320929)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:22:40] <logmsgbot>	 !log zabe@deploy1002 zabe: Continuing with sync
[08:23:28] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: assemble snmp.yml when updating modules [puppet] - https://gerrit.wikimedia.org/r/1028759 (https://phabricator.wikimedia.org/T364016)
[08:24:46] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: assemble snmp.yml when updating modules [puppet] - https://gerrit.wikimedia.org/r/1028759 (https://phabricator.wikimedia.org/T364016)
[08:25:34] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:26:36] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2303/co" [puppet] - https://gerrit.wikimedia.org/r/1028759 (https://phabricator.wikimedia.org/T364016) (owner: Filippo Giunchedi)
[08:26:40] <urbanecm>	 kart_: feel free to go ahead please!
[08:26:56] <urbanecm>	 (Assuming no one is deploying anything atm)
[08:27:18] <wikibugs>	 (PS5) Zabe: Use OpenSSL for PBKDF2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/842522 (https://phabricator.wikimedia.org/T320929) (owner: PleaseStand)
[08:27:51] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1 C:+2] prometheus: assemble snmp.yml when updating modules [puppet] - https://gerrit.wikimedia.org/r/1028759 (https://phabricator.wikimedia.org/T364016) (owner: Filippo Giunchedi)
[08:30:19] <wikibugs>	 (CR) Btullis: [C:+2] Add the wmf-java-cacerts truststore to all remaining airflow hosts [puppet] - https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181) (owner: Btullis)
[08:30:44] <wikibugs>	 (PS1) Zabe: Revert "beta: Use OpenSSL for PBKDF2 password hashing" [mediawiki-config] - https://gerrit.wikimedia.org/r/1028760
[08:31:09] <wikibugs>	 (PS2) Zabe: Revert "beta: Use OpenSSL for PBKDF2 password hashing" [mediawiki-config] - https://gerrit.wikimedia.org/r/1028760
[08:32:06] <wikibugs>	 (CR) Majavah: [C:+1] cloudweb: Enable profile::auto_restarts::service for nutcracker [puppet] - https://gerrit.wikimedia.org/r/1026459 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:32:10] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2429 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:32:12] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: analytics-meta-replica on an-mariadb1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for mpic_staging@10.% on query. Default database: mpic_staging. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:32:18] <zabe>	 ehm
[08:32:45] <zabe>	 there was a peak of 564 errors: Uncaught MWException: Invalid IP given in XFF ...
[08:33:51] <wikibugs>	 (CR) Majavah: "I assume this is not wanted for the non-Wikitech appserver fleet?" [puppet] - https://gerrit.wikimedia.org/r/1026453 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:34:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install7001.wikimedia.org with reason: host reimage
[08:34:20] <wikibugs>	 (CR) Majavah: [C:+1] Enable profile::auto_restarts::service for docker/containerd [puppet] - https://gerrit.wikimedia.org/r/1026451 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:34:39] <zabe>	 apparently not related to my patch
[08:34:43] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1028756|Use OpenSSL for PBKDF2 password hashing on testwiki (T320929)]] (duration: 17m 22s)
[08:34:45] <stashbot>	 T320929: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929
[08:34:58] <zabe>	 taavi: I'm done
[08:35:16] <taavi>	 ack, thanks
[08:35:21] <moritzm>	 !log installing glibc security updates on buster
[08:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:10] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1028383 (owner: JMeybohm)
[08:36:51] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1023432 (owner: Majavah)
[08:36:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install7001.wikimedia.org with reason: host reimage
[08:37:33] <wikibugs>	 (Merged) jenkins-bot: wikitech: Also disable password changes when logged-in [mediawiki-config] - https://gerrit.wikimedia.org/r/1023432 (owner: Majavah)
[08:37:50] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:1023432|wikitech: Also disable password changes when logged-in]]
[08:38:20] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[08:38:47] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2267 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:38:53] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1457 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:40:01] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:40:01] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse2020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:40:20] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:1023432|wikitech: Also disable password changes when logged-in]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:40:27] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] confd: prom exporter uses resource name to find state file [puppet] - https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[08:41:03] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:41:45] <logmsgbot>	 !log taavi@deploy1002 taavi: Continuing with sync
[08:45:02] <wikibugs>	 (PS1) Stevemunene: Move datahub and datahub-staging helfile deployments to dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300)
[08:50:14] <wikibugs>	 (CR) Muehlenhoff: [C:+2] cloudweb: Enable profile::auto_restarts::service for nutcracker [puppet] - https://gerrit.wikimedia.org/r/1026459 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:53:41] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1023432|wikitech: Also disable password changes when logged-in]] (duration: 15m 50s)
[08:54:05] <taavi>	 I'm also done deploying
[08:59:22] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:59:45] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:00:20] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[09:00:42] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:01:01] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:01:07] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[09:01:19] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[09:01:32] <wikibugs>	 (CR) Klausman: [C:+1] amd-pytorch21: fix the ROCm version [docker-images/production-images] - https://gerrit.wikimedia.org/r/1028519 (https://phabricator.wikimedia.org/T362984) (owner: Elukey)
[09:01:33] <wikibugs>	 SRE-swift-storage, Commons: Commons: File:Gnome-edit-delete.svg not found - https://phabricator.wikimedia.org/T363995#9776697 (jcrespo) >>! In T363995#9775321, @jcrespo wrote: > [2024-05-06 14:33:33,903] INFO:backup '9/96/Gnome-edit-delete.svg' downloaded > [2024-05-06 14:33:33,904] INFO:backup sha256 su...
[09:02:11] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2429 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:02:24] <wikibugs>	 (PS1) Muehlenhoff: query_sever::deploy::manual: Remove obsolete class [puppet] - https://gerrit.wikimedia.org/r/1028763 (https://phabricator.wikimedia.org/T316876)
[09:02:25] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete Hiera settings [puppet] - https://gerrit.wikimedia.org/r/1028764 (https://phabricator.wikimedia.org/T316876)
[09:02:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[09:02:31] <wikibugs>	 (CR) Klausman: [C:+1] ml-services: tune autoscaling for damaging, goodfaith and reverted [deployment-charts] - https://gerrit.wikimedia.org/r/1028552 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[09:02:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[09:03:12] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[09:03:58] <jnuche>	 hi there, does anyone else need to run backports? train presync failed last night and I would like to re-run it
[09:05:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install7001.wikimedia.org with OS bullseye
[09:05:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install7001.wikimedia.org
[09:10:01] <jnuche>	 ok, will do that now
[09:10:26] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028765 (https://phabricator.wikimedia.org/T361398)
[09:10:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028765 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[09:10:30] <wikibugs>	 (PS1) Muehlenhoff: Reapply "sites: update installserver for magru" [homer/public] - https://gerrit.wikimedia.org/r/1028766
[09:11:10] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028765 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[09:11:25] <wikibugs>	 (PS1) Muehlenhoff: Reapply "hiera: update installserver for magru" [puppet] - https://gerrit.wikimedia.org/r/1028767
[09:11:35] <wikibugs>	 (CR) JMeybohm: [C:+1] amd-pytorch21: fix the ROCm version [docker-images/production-images] - https://gerrit.wikimedia.org/r/1028519 (https://phabricator.wikimedia.org/T362984) (owner: Elukey)
[09:11:39] <logmsgbot>	 !log jnuche@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.4  refs T361398
[09:11:42] <stashbot>	 T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398
[09:12:51] <wikibugs>	 (Abandoned) Zabe: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028611 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[09:13:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:16:01] <wikibugs>	 (CR) JMeybohm: [C:+1] apertium: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1028604 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[09:16:16] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Enable profile::auto_restarts::service for docker/containerd [puppet] - https://gerrit.wikimedia.org/r/1026451 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:21:23] <wikibugs>	 (CR) JMeybohm: [C:+1] api-gateway: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[09:21:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1157.eqiad.wmnet
[09:22:52] <wikibugs>	 (PS1) Muehlenhoff: Switch db1157 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028769 (https://phabricator.wikimedia.org/T349619)
[09:23:13] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: analytics-meta-replica on an-mariadb1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:25:27] <wikibugs>	 (CR) Cathal Mooney: [C:+1] Reapply "hiera: update installserver for magru" [puppet] - https://gerrit.wikimedia.org/r/1028767 (owner: Muehlenhoff)
[09:25:31] <wikibugs>	 (CR) Cathal Mooney: [C:+1] Reapply "sites: update installserver for magru" [homer/public] - https://gerrit.wikimedia.org/r/1028766 (owner: Muehlenhoff)
[09:26:56] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1157 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028769 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:27:31] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (install7001), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:31:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1157.eqiad.wmnet
[09:31:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1166.eqiad.wmnet
[09:32:42] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[09:32:56] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[09:33:02] <wikibugs>	 (CR) Hnowlan: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: Elukey)
[09:33:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T352010)', diff saved to https://phabricator.wikimedia.org/P61981 and previous config saved to /var/cache/conftool/dbconfig/20240507-093302-ladsgroup.json
[09:33:11] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:33:18] <wikibugs>	 (PS1) Muehlenhoff: Switch db1166 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028772 (https://phabricator.wikimedia.org/T349619)
[09:35:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Reapply "hiera: update installserver for magru" [puppet] - https://gerrit.wikimedia.org/r/1028767 (owner: Muehlenhoff)
[09:36:08] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:36:31] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:36:32] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1166 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028772 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:36:58] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:37:31] <wikibugs>	 (PS1) Btullis: Fix the cephosd dse-k8s-csi user caps [puppet] - https://gerrit.wikimedia.org/r/1028773 (https://phabricator.wikimedia.org/T327259)
[09:37:37] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:37:59] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:38:17] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1028773 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:38:33] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:38:34] <wikibugs>	 (CR) Filippo Giunchedi: "See inline, also let' split this in two patches: one for titan and one for thanos frontend" [puppet] - https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[09:38:56] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:39:16] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] "I'd rather not have puppet restart prometheus by itself since it can take a long time and tends to be distructive" [puppet] - https://gerrit.wikimedia.org/r/1025682 (https://phabricator.wikimedia.org/T343529) (owner: Filippo Giunchedi)
[09:39:17] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:39:33] <wikibugs>	 (PS1) Slyngshede: P:trafficserver::backend Fix URL for CloudIDM [puppet] - https://gerrit.wikimedia.org/r/1028774
[09:40:20] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Reapply "sites: update installserver for magru" [homer/public] - https://gerrit.wikimedia.org/r/1028766 (owner: Muehlenhoff)
[09:40:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1166.eqiad.wmnet
[09:41:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1189.eqiad.wmnet
[09:42:35] <wikibugs>	 (PS1) Muehlenhoff: Switch db1189 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028776 (https://phabricator.wikimedia.org/T349619)
[09:43:23] <wikibugs>	 (PS1) Slyngshede: Revert "P:trafficserver::backend add cloudtestidm" [puppet] - https://gerrit.wikimedia.org/r/1028574
[09:43:38] <wikibugs>	 (PS1) Ladsgroup: Stop writing to old columns of pagelinks in most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1028778 (https://phabricator.wikimedia.org/T352010)
[09:45:34] <wikibugs>	 (PS2) Ladsgroup: Stop writing to old columns of pagelinks in most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1028778 (https://phabricator.wikimedia.org/T352010)
[09:46:43] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Doh :-)" [puppet] - https://gerrit.wikimedia.org/r/1028774 (owner: Slyngshede)
[09:49:45] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.decommission for hosts prometheus7001.magru.wmnet
[09:50:33] <wikibugs>	 (CR) Hnowlan: [C:+1] ratelimit: Update ratelimit service to git 3fcc360 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1028532 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[09:52:26] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1189 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028776 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:54:02] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] ratelimit: Update ratelimit service to git 3fcc360 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1028532 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[09:54:27] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.dns.netbox
[09:55:17] <logmsgbot>	 !log jnuche@deploy1002 sync-world aborted: testwikis wikis to 1.43.0-wmf.4  refs T361398 (duration: 43m 38s)
[09:55:21] <stashbot>	 T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398
[09:55:47] <jnuche>	 that was an accident, I need to rerun...
[09:56:43] <logmsgbot>	 !log jnuche@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.4  refs T361398
[09:56:46] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9776836 (Ladsgroup) We already have numbers for those and they look not great for the switch: see T360589 and T211661#8377883
[09:56:49] <wikibugs>	 (CR) Stevemunene: [C:+1] Fix the cephosd dse-k8s-csi user caps [puppet] - https://gerrit.wikimedia.org/r/1028773 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:57:01] <wikibugs>	 (CR) Btullis: [C:+2] Fix the cephosd dse-k8s-csi user caps [puppet] - https://gerrit.wikimedia.org/r/1028773 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:57:52] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1002"
[09:58:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1189.eqiad.wmnet
[09:58:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1000)
[10:01:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry
[10:04:17] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2021 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:05:19] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2035 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:08:07] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2381 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:08:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1198.eqiad.wmnet
[10:09:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw
[10:12:21] <wikibugs>	 (PS1) Muehlenhoff: Switch db1198 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028783 (https://phabricator.wikimedia.org/T349619)
[10:12:48] <wikibugs>	 (CR) JMeybohm: [C:+1] admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: Effie Mouzeli)
[10:13:56] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1198 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028783 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:14:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw
[10:15:17] <wikibugs>	 (CR) Btullis: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1021892 (https://phabricator.wikimedia.org/T349397) (owner: Btullis)
[10:15:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad
[10:16:05] <logmsgbot>	 !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.4  refs T361398 (duration: 19m 22s)
[10:16:08] <stashbot>	 T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398
[10:20:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[10:21:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1198.eqiad.wmnet
[10:22:10] <wikibugs>	 (PS1) Slyngshede: P:idm Fix certificate name [puppet] - https://gerrit.wikimedia.org/r/1028785 (https://phabricator.wikimedia.org/T362128)
[10:23:02] <wikibugs>	 (CR) Hnowlan: "No swagger spec for the gateways unfortunately, but checking something like https://staging.svc.eqiad.wmnet:8087/core/v1/wikipedia/en/page" [deployment-charts] - https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[10:25:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot-master rolling restart_daemons on A:maps-master
[10:25:45] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:25:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1223.eqiad.wmnet
[10:26:09] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:26:37] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:26:59] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:27:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot-master (exit_code=0) rolling restart_daemons on A:maps-master
[10:28:07] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1002"
[10:28:07] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:28:08] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus7001.magru.wmnet
[10:29:06] <wikibugs>	 (PS1) Muehlenhoff: Switch db1223 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028786 (https://phabricator.wikimedia.org/T349619)
[10:32:32] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1028785 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[10:32:38] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1223 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028786 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:34:17] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2021 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:35:19] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2035 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:37:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1223.eqiad.wmnet
[10:38:07] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2381 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:40:03] <wikibugs>	 (PS1) Muehlenhoff: chartmuseum: Enable profile::auto_restarts::service for envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1028789 (https://phabricator.wikimedia.org/T135991)
[10:42:00] <wikibugs>	 (PS1) Muehlenhoff: Enable profile::auto_restarts::service for chartmuseum [puppet] - https://gerrit.wikimedia.org/r/1028790 (https://phabricator.wikimedia.org/T135991)
[10:45:04] <wikibugs>	 SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9776962 (MoritzMuehlenhoff)
[10:46:21] <wikibugs>	 (PS1) Fabfur: cache: Use fifo-log-demux between haproxy and benthos [puppet] - https://gerrit.wikimedia.org/r/1028791 (https://phabricator.wikimedia.org/T364379)
[10:46:42] <wikibugs>	 (CR) CI reject: [V:-1] cache: Use fifo-log-demux between haproxy and benthos [puppet] - https://gerrit.wikimedia.org/r/1028791 (https://phabricator.wikimedia.org/T364379) (owner: Fabfur)
[10:53:43] <wikibugs>	 (CR) Btullis: [C:+2] Update prometheus config to reflect matomo profile change [puppet] - https://gerrit.wikimedia.org/r/1021892 (https://phabricator.wikimedia.org/T349397) (owner: Btullis)
[10:56:31] <wikibugs>	 (PS1) Muehlenhoff: parsoid/testing: Enable profile::auto_restarts::service for envoy [puppet] - https://gerrit.wikimedia.org/r/1028793 (https://phabricator.wikimedia.org/T135991)
[11:01:27] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:01:42] <wikibugs>	 (PS1) Muehlenhoff: ci: Enable profile::auto_restarts::service for docker/containerd [puppet] - https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991)
[11:04:15] <wikibugs>	 (PS1) Muehlenhoff: ci: Enable profile::auto_restarts::service for envoy [puppet] - https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991)
[11:05:48] <hnowlan>	 !log depooling 6 codfw appservers in advance of reimaging 
[11:05:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:28] <wikibugs>	 (PS3) Jforrester: wikifunctions: Enable wasmedge resource limits in evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/1024767 (owner: Cory Massaro)
[11:13:44] <wikibugs>	 (CR) Jforrester: [C:+2] wikifunctions: Enable wasmedge resource limits in evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/1024767 (owner: Cory Massaro)
[11:14:34] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Enable wasmedge resource limits in evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/1024767 (owner: Cory Massaro)
[11:15:38] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[11:16:09] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[11:17:10] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[11:19:35] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[11:19:53] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[11:20:48] <wikibugs>	 (PS6) JMeybohm: Add new chart: ratelimit [deployment-charts] - https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310)
[11:20:54] <wikibugs>	 (CR) Hashar: [C:+1] "Actually the requests are made to a specific port as seen in the URL: `mw-jobrunner.discovery.wmnet:4448`  and the port is thus in the `Ho" [mediawiki-config] - https://gerrit.wikimedia.org/r/1020277 (owner: Hnowlan)
[11:21:10] <wikibugs>	 (CR) JMeybohm: Add new chart: ratelimit (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[11:21:37] <wikibugs>	 (CR) Majavah: "Can we use the relatively new `ClusterConfig` class instead?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1020277 (owner: Hnowlan)
[11:21:44] <wikibugs>	 (CR) CI reject: [V:-1] Add new chart: ratelimit [deployment-charts] - https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[11:22:11] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[11:23:32] <wikibugs>	 (PS7) JMeybohm: Add new chart: ratelimit [deployment-charts] - https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310)
[11:28:56] <wikibugs>	 (PS1) Muehlenhoff: Stop installing git-fat on Cassandra hosts [puppet] - https://gerrit.wikimedia.org/r/1028798 (https://phabricator.wikimedia.org/T364373)
[11:29:44] <wikibugs>	 (CR) Dreamrimmer: [ruwiki] Limit the use of the ContentTranslation tool (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: Dreamrimmer)
[11:30:39] <wikibugs>	 (PS1) Muehlenhoff: query_service: Stop installing git-fat [puppet] - https://gerrit.wikimedia.org/r/1028799 (https://phabricator.wikimedia.org/T316876)
[11:32:21] <wikibugs>	 (PS1) Muehlenhoff: Stop installing git-fat also on Buster hosts [puppet] - https://gerrit.wikimedia.org/r/1028800 (https://phabricator.wikimedia.org/T279509)
[11:33:06] <wikibugs>	 (CR) Anzx: [C:+1] [ruwiki] Limit the use of the ContentTranslation tool [mediawiki-config] - https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: Dreamrimmer)
[11:36:25] <wikibugs>	 (CR) Slyngshede: [C:+2] P:idm Fix certificate name [puppet] - https://gerrit.wikimedia.org/r/1028785 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[11:38:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:40:21] <dmacks>	 I'm getting Wikimedia\Rdbms\DBTransactionSizeError on commons, both with regular edits and file-deletion.
[11:42:03] <dmacks>	 Sporadic
[11:42:09] <wikibugs>	 (PS2) Majavah: libraryupgrader: Automatically restart celery processes [puppet] - https://gerrit.wikimedia.org/r/1027500
[11:43:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:43:35] <wikibugs>	 (CR) Majavah: [C:+2] libraryupgrader: Automatically restart celery processes [puppet] - https://gerrit.wikimedia.org/r/1027500 (owner: Majavah)
[11:44:12] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1028600 (https://phabricator.wikimedia.org/T364068) (owner: Dzahn)
[11:48:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:51:01] <wikibugs>	 (PS1) Slyngshede: P:trafficserver::backend Add cloudtestidm [puppet] - https://gerrit.wikimedia.org/r/1028805 (https://phabricator.wikimedia.org/T362128)
[11:52:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:53:52] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2304/co" [puppet] - https://gerrit.wikimedia.org/r/1028805 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[11:56:30] <wikibugs>	 (PS1) Fabfur: fifo_log_demux: add new parameters for current release [puppet] - https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383)
[11:56:50] <wikibugs>	 (CR) CI reject: [V:-1] fifo_log_demux: add new parameters for current release [puppet] - https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: Fabfur)
[11:58:17] <wikibugs>	 (PS2) Fabfur: fifo_log_demux: add new parameters for current release [puppet] - https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383)
[11:58:41] <wikibugs>	 (CR) CI reject: [V:-1] fifo_log_demux: add new parameters for current release [puppet] - https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: Fabfur)
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1200)
[12:02:12] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.ganeti.makevm for new host prometheus7001.magru.wmnet
[12:02:13] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.dns.netbox
[12:02:44] <wikibugs>	 (PS3) Fabfur: fifo_log_demux: add new parameters for current release [puppet] - https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383)
[12:03:31] <moritzm>	 !log installing ruby3.1 security updates
[12:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:02] <wikibugs>	 (Abandoned) Slyngshede: P:trafficserver::backend Fix URL for CloudIDM [puppet] - https://gerrit.wikimedia.org/r/1028774 (owner: Slyngshede)
[12:04:58] <wikibugs>	 (CR) Slyngshede: [V:+1] "slyngshede@cp1108:~$ curl -i -H "Host: cloudtestidm.wikimedia.org" https://cloudidm2001-dev.codfw.wmnet/accounts/login/ 2>/dev/null |head " [puppet] - https://gerrit.wikimedia.org/r/1028805 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[12:05:11] <logmsgbot>	 !log btullis@deploy1002 Started deploy [airflow-dags/analytics@e5ba870]: (no justification provided)
[12:05:24] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1028805 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[12:05:44] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@e5ba870]: (no justification provided) (duration: 00m 32s)
[12:07:30] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002"
[12:08:22] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002"
[12:08:22] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:08:23] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.dns.wipe-cache prometheus7001.magru.wmnet on all recursors
[12:08:26] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7001.magru.wmnet on all recursors
[12:08:46] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002"
[12:09:21] <wikibugs>	 (03PS4) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383)
[12:09:39] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002"
[12:10:29] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus7001.magru.wmnet with OS bullseye
[12:12:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hints for ruby3.1 [puppet] - 10https://gerrit.wikimedia.org/r/1028808
[12:12:23] <wikibugs>	 (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2305/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur)
[12:13:42] <wikibugs>	 (03PS5) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383)
[12:15:01] <wikibugs>	 (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2306/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur)
[12:16:30] <wikibugs>	 (03PS6) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383)
[12:17:55] <wikibugs>	 (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2307/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur)
[12:19:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add library hints for ruby3.1 [puppet] - 10https://gerrit.wikimedia.org/r/1028808 (owner: 10Muehlenhoff)
[12:26:33] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] fifo_log_demux: add new parameters for current release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur)
[12:29:57] <wikibugs>	 (03PS1) 10Brouberol: hadoop secrets: make analytics DB password available to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437)
[12:32:27] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus::ops: Remove ncredir job [puppet] - 10https://gerrit.wikimedia.org/r/1028818 (https://phabricator.wikimedia.org/T364354)
[12:32:54] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] elasticsearch: defaut to rolling restarting a single node at a time [cookbooks] - 10https://gerrit.wikimedia.org/r/1028757 (https://phabricator.wikimedia.org/T362534) (owner: 10Brouberol)
[12:33:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hadoop secrets: make analytics DB password available to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol)
[12:35:21] <wikibugs>	 (03CR) 10Ssingh: fifo_log_demux: add new parameters for current release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur)
[12:35:38] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2308/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028818 (https://phabricator.wikimedia.org/T364354) (owner: 10Vgutierrez)
[12:36:25] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] fifo_log_demux: add new parameters for current release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur)
[12:38:54] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1 C:03+2] P:trafficserver::backend Add cloudtestidm [puppet] - 10https://gerrit.wikimedia.org/r/1028805 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede)
[12:41:10] <wikibugs>	 (03PS2) 10Brouberol: hadoop: make analytics DB password available to analytics-privatedata-user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437)
[12:44:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Extend cloudbackup Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1024686 (owner: 10Muehlenhoff)
[12:46:16] <wikibugs>	 (03CR) 10Hnowlan: "Yep, the port is the issue here - This is scheduled for deploy in the backport window in 15 mins. https://gerrit.wikimedia.org/r/c/operati" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 (owner: 10Hnowlan)
[12:46:50] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[12:47:18] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] amd-pytorch21: fix the ROCm version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1028519 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey)
[12:48:13] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[12:49:01] <wikibugs>	 (03PS5) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412)
[12:51:14] <wikibugs>	 (03PS3) 10Brouberol: hadoop: make analytics DB password available to analytics-privatedata-user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437)
[12:51:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/1028819
[12:51:58] <wikibugs>	 (03PS7) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383)
[12:52:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur)
[12:53:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1028818 (https://phabricator.wikimedia.org/T364354) (owner: 10Vgutierrez)
[12:53:28] <wikibugs>	 (03PS4) 10Brouberol: hadoop: make analytics DB password available to analytics-privatedata-user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437)
[12:53:40] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] prometheus::ops: Remove ncredir job [puppet] - 10https://gerrit.wikimedia.org/r/1028818 (https://phabricator.wikimedia.org/T364354) (owner: 10Vgutierrez)
[12:54:56] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2311/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol)
[12:57:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/1028819 (owner: 10Muehlenhoff)
[12:59:10] <wikibugs>	 (03CR) 10Elukey: [C:03+2] ml-services: tune autoscaling for damaging, goodfaith and reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028552 (https://phabricator.wikimedia.org/T363336) (owner: 10Elukey)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1300).
[13:00:05] <jouncebot>	 hnowlan, DreamRimmer, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <anzx>	 o/
[13:01:22] <DreamRimmer>	 I am around
[13:01:24] <hnowlan>	 o/
[13:01:48] <wikibugs>	 (03PS8) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383)
[13:04:09] <wikibugs>	 (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2312/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur)
[13:04:26] <wikibugs>	 (03PS1) 10Vgutierrez: ncredir: Remove mtail puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385)
[13:04:37] <wikibugs>	 (03CR) 10Fabfur: fifo_log_demux: add new parameters for current release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur)
[13:04:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:05:13] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:05:26] <wikibugs>	 (03CR) 10Fabfur: [C:04-2] "Do not merge until fifo-log-demux is upgraded to 0.7.0 on all hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur)
[13:05:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:05:40] <wikibugs>	 (03PS2) 10Vgutierrez: ncredir: Remove mtail puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385)
[13:10:02] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2313/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385) (owner: 10Vgutierrez)
[13:10:13] <jinxer-wm>	 FIRING: [6x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:11:57] <wikibugs>	 (03PS3) 10Vgutierrez: ncredir: Remove mtail puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385)
[13:13:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:14:11] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:16:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:17:11] <hnowlan>	 looking ^ 
[13:17:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:19:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:19:41] <wikibugs>	 (03PS3) 10Elukey: role::swift::proxy: simplify hiera configuration for the tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412)
[13:19:52] <wikibugs>	 (03PS6) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412)
[13:20:47] <hnowlan>	 no deployers today? 
[13:21:10] <wikibugs>	 (03CR) 10Elukey: [C:03+2] role::swift::proxy: simplify hiera configuration for the tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[13:21:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:21:27] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:21:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:23:21] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "LGTM! Ran PCC on the latest patchset as well (for the removed log.pp) and it looks good https://puppet-compiler.wmflabs.org/output/1028821" [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385) (owner: 10Vgutierrez)
[13:25:13] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:25:23] <logmsgbot>	 !log filippo@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7001.magru.wmnet with OS bullseye
[13:25:24] <logmsgbot>	 !log filippo@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7001.magru.wmnet
[13:26:04] <wikibugs>	 (03PS7) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412)
[13:26:05] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] kubernetes: add 6 codfw appservers as workers [puppet] - 10https://gerrit.wikimedia.org/r/1026941 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan)
[13:27:07] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 44 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:29:38] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.ganeti.makevm for new host prometheus7001.magru.wmnet
[13:29:39] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.dns.netbox
[13:31:41] <logmsgbot>	 !log filippo@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:31:46] <logmsgbot>	 !log filippo@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7001.magru.wmnet
[13:31:58] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.decommission for hosts prometheus7001.magru.wmnet
[13:36:00] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.dns.netbox
[13:37:45] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:38:00] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:40:00] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1002"
[13:40:07] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2325.codfw.wmnet with OS bullseye
[13:40:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] elasticsearch: Remove support for sslcert SSL provider [puppet] - 10https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff)
[13:40:10] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2305.codfw.wmnet with OS bullseye
[13:40:12] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2338.codfw.wmnet with OS bullseye
[13:40:17] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2407.codfw.wmnet with OS bullseye
[13:40:19] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2390.codfw.wmnet with OS bullseye
[13:40:34] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2359.codfw.wmnet with OS bullseye
[13:40:52] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1002"
[13:40:52] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:40:52] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus7001.magru.wmnet
[13:41:59] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:42:13] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:42:31] <wikibugs>	 (03PS2) 10Muehlenhoff: Stop supporting sslcert in Profile::Pki::Provider type [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750)
[13:43:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:43:49] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:44:03] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:44:55] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.ganeti.makevm for new host prometheus7001.magru.wmnet
[13:44:57] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.dns.netbox
[13:46:15] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:46:31] <hnowlan>	 ^me, it'll clear in a sec 
[13:46:54] <wikibugs>	 (03PS1) 10Joal: Update analytics import-mediawiki-dumps [puppet] - 10https://gerrit.wikimedia.org/r/1028831
[13:47:07] <DreamRimmer>	 no deployers today?
[13:47:07] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 30 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:47:12] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[13:47:29] <logmsgbot>	 !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@ad4934c]: (no justification provided)
[13:48:01] <logmsgbot>	 !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@ad4934c]: (no justification provided) (duration: 00m 32s)
[13:48:03] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385) (owner: 10Vgutierrez)
[13:48:21] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Update analytics import-mediawiki-dumps [puppet] - 10https://gerrit.wikimedia.org/r/1028831 (owner: 10Joal)
[13:48:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:24] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002"
[13:50:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Stop supporting sslcert in Profile::Pki::Provider type [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[13:50:17] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002"
[13:50:17] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:50:17] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.dns.wipe-cache prometheus7001.magru.wmnet on all recursors
[13:50:20] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7001.magru.wmnet on all recursors
[13:50:41] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002"
[13:51:24] <wikibugs>	 (03PS8) 10JMeybohm: Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310)
[13:51:32] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002"
[13:52:13] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2434 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:53:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:53:58] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus7001.magru.wmnet with OS bullseye
[13:54:04] <wikibugs>	 (03PS1) 10Elukey: role::ml_k8s::*::worker: use Dragonly for amd-pytorch images [puppet] - 10https://gerrit.wikimedia.org/r/1028833
[13:55:57] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2379 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:56:23] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2390.codfw.wmnet with reason: host reimage
[13:56:27] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2407.codfw.wmnet with reason: host reimage
[13:56:28] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2359.codfw.wmnet with reason: host reimage
[13:56:31] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2325.codfw.wmnet with reason: host reimage
[13:56:49] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2305.codfw.wmnet with reason: host reimage
[13:57:39] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2338.codfw.wmnet with reason: host reimage
[13:57:59] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Make base.certificates compatible with chart modules and scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm)
[13:58:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:58:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:01:38] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "LGTM, we could get rid of $kafka_tls in this CR or in a following one given that we dropped support for non-TLS setups" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins)
[14:01:58] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2390.codfw.wmnet with reason: host reimage
[14:02:45] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:03:17] <logmsgbot>	 !log btullis@deploy1002 Started deploy [airflow-dags/analytics@6be7efd]: (no justification provided)
[14:03:45] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@6be7efd]: (no justification provided) (duration: 00m 27s)
[14:04:36] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2407.codfw.wmnet with reason: host reimage
[14:05:38] <wikibugs>	 (03CR) 10Bking: [C:03+1] global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:08:07] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2359.codfw.wmnet with reason: host reimage
[14:09:14] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2334 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:09:24] <wikibugs>	 (03CR) 10Muehlenhoff: "I'd say let's do a followup and proceed with this change as-is, the non TLS branch has been dead code for > years, a few more days won't m" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins)
[14:09:27] <wikibugs>	 (03PS1) 10Elukey: ml-services: update hugging face's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028839 (https://phabricator.wikimedia.org/T362984)
[14:10:44] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2305.codfw.wmnet with reason: host reimage
[14:11:43] <logmsgbot>	 !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@b543b85]: (no justification provided)
[14:12:08] <logmsgbot>	 !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@b543b85]: (no justification provided) (duration: 00m 24s)
[14:12:18] <wikibugs>	 (03PS1) 10Btullis: Revert "Update analytics import-mediawiki-dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1028579
[14:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:12:41] <wikibugs>	 (03CR) 10Joal: "LGTM! sorry for the noise" [puppet] - 10https://gerrit.wikimedia.org/r/1028579 (owner: 10Btullis)
[14:12:49] <wikibugs>	 (03CR) 10Joal: [C:03+1] Revert "Update analytics import-mediawiki-dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1028579 (owner: 10Btullis)
[14:13:02] <wikibugs>	 (03PS1) 10Hnowlan: kubernetes: make 5 eqiad api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323)
[14:13:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:13:39] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7001.magru.wmnet with reason: host reimage
[14:13:46] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2325.codfw.wmnet with reason: host reimage
[14:14:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [software/bitu] - 10https://gerrit.wikimedia.org/r/1026458 (owner: 10Slyngshede)
[14:16:47] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7001.magru.wmnet with reason: host reimage
[14:17:08] <wikibugs>	 (CR) Elukey: [C:+2] ml-services: update hugging face's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1028839 (https://phabricator.wikimedia.org/T362984) (owner: Elukey)
[14:17:09] <wikibugs>	 ops-codfw, SRE: Inbound interface errors - https://phabricator.wikimedia.org/T364358#9777558 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:17:57] <wikibugs>	 (PS1) Hnowlan: mw-web, mw-api-ext: bump replicas in advance of traffic shift [deployment-charts] - https://gerrit.wikimedia.org/r/1028842 (https://phabricator.wikimedia.org/T362323)
[14:19:42] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2338.codfw.wmnet with reason: host reimage
[14:19:43] <wikibugs>	 (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1028843
[14:20:08] <wikibugs>	 (PS1) Hnowlan: trafficserver: move k8s traffic shift to 90% [puppet] - https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323)
[14:20:46] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2390.codfw.wmnet with OS bullseye
[14:22:07] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:22:12] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2434 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:22:38] <wikibugs>	 (CR) Eevans: [C:+1] Stop installing git-fat on Cassandra hosts [puppet] - https://gerrit.wikimedia.org/r/1028798 (https://phabricator.wikimedia.org/T364373) (owner: Muehlenhoff)
[14:23:24] <wikibugs>	 (PS34) Brouberol: global_config: add elasticearch instances [puppet] - https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894)
[14:23:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:23:45] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM, just keeping the upstream default of 15% seems fine." [puppet] - https://gerrit.wikimedia.org/r/1028565 (owner: Ahmon Dancy)
[14:23:46] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2407.codfw.wmnet with OS bullseye
[14:23:59] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Stop installing git-fat on Cassandra hosts [puppet] - https://gerrit.wikimedia.org/r/1028798 (https://phabricator.wikimedia.org/T364373) (owner: Muehlenhoff)
[14:24:52] <wikibugs>	 (CR) Muehlenhoff: [C:+2] "I'll uninstall git-fat as a followup once Puppet has run on these hosts." [puppet] - https://gerrit.wikimedia.org/r/1028798 (https://phabricator.wikimedia.org/T364373) (owner: Muehlenhoff)
[14:25:58] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2379 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:27:44] <wikibugs>	 (CR) Brouberol: [C:+2] global_config: add elasticearch instances [puppet] - https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[14:28:57] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2359.codfw.wmnet with OS bullseye
[14:30:04] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2305.codfw.wmnet with OS bullseye
[14:31:27] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7001.magru.wmnet with OS bullseye
[14:31:27] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus7001.magru.wmnet
[14:32:35] <wikibugs>	 (CR) CDobbins: [C:+2] purged: add PKI cert handling (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: CDobbins)
[14:33:07] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2325.codfw.wmnet with OS bullseye
[14:33:11] <wikibugs>	 (CR) Btullis: [C:+2] Revert "Update analytics import-mediawiki-dumps" [puppet] - https://gerrit.wikimedia.org/r/1028579 (owner: Btullis)
[14:33:28] <wikibugs>	 (PS1) Hnowlan: kubernetes: make 6 codfw api appservers k8s workers [puppet] - https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074)
[14:35:08] <icinga-wm>	 PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:48] <wikibugs>	 (PS2) Filippo Giunchedi: grafana: add magru prometheus [puppet] - https://gerrit.wikimedia.org/r/1028503 (https://phabricator.wikimedia.org/T364016)
[14:35:48] <wikibugs>	 (PS2) Filippo Giunchedi: trafficserver: add prometheus-magru.w.o [puppet] - https://gerrit.wikimedia.org/r/1028504 (https://phabricator.wikimedia.org/T364016)
[14:35:48] <wikibugs>	 (PS1) Filippo Giunchedi: Revert "site: provision prometheus7001 with insetup" [puppet] - https://gerrit.wikimedia.org/r/1028848 (https://phabricator.wikimedia.org/T364016)
[14:35:59] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] wmnet: add prometheus.svc.magru [dns] - https://gerrit.wikimedia.org/r/1028464 (https://phabricator.wikimedia.org/T364016) (owner: Filippo Giunchedi)
[14:36:02] <icinga-wm>	 PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:36:03] <wikibugs>	 (PS3) Filippo Giunchedi: wmnet: add prometheus.svc.magru [dns] - https://gerrit.wikimedia.org/r/1028464 (https://phabricator.wikimedia.org/T364016)
[14:36:27] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:48] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] Revert "site: provision prometheus7001 with insetup" [puppet] - https://gerrit.wikimedia.org/r/1028848 (https://phabricator.wikimedia.org/T364016) (owner: Filippo Giunchedi)
[14:37:20] <wikibugs>	 (CR) JHathaway: [C:+2] postfix: add some recommended hardening settings [puppet] - https://gerrit.wikimedia.org/r/1024729 (https://phabricator.wikimedia.org/T325398) (owner: JHathaway)
[14:38:58] <godog>	 jhathaway: I've merged you patch too
[14:39:03] <godog>	 'your patch' even
[14:39:14] <jhathaway>	 godog: thanks
[14:39:14] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2334 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:39:32] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2338.codfw.wmnet with OS bullseye
[14:39:50] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] wmnet: add prometheus.svc.magru [dns] - https://gerrit.wikimedia.org/r/1028464 (https://phabricator.wikimedia.org/T364016) (owner: Filippo Giunchedi)
[14:40:03] <wikibugs>	 SRE, Data-Engineering, Dumps-Generation, Data-Platform-SRE (2024.05.06 - 2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9777657 (xcollazo) Hello there.  Due to T364250, the host `snapshot1011` will not be running the typical `wikidata...
[14:40:53] <wikibugs>	 (CR) Muehlenhoff: [C:+2] "Wasn't even needed to manually uninstall, when removing the Py2 packages Puppet uninstalled git-fat as well since it depended on Python 2." [puppet] - https://gerrit.wikimedia.org/r/1028798 (https://phabricator.wikimedia.org/T364373) (owner: Muehlenhoff)
[14:41:18] <hnowlan>	 !log running homer 'cr*codfw*' commit to configure BGP for new k8s codfw workers 
[14:41:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:33] <wikibugs>	 (CR) JMeybohm: [C:+1] Enable profile::auto_restarts::service for chartmuseum [puppet] - https://gerrit.wikimedia.org/r/1028790 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[14:43:38] <wikibugs>	 (CR) JMeybohm: [C:+1] chartmuseum: Enable profile::auto_restarts::service for envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1028789 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[14:44:39] <wikibugs>	 (CR) JMeybohm: [C:+1] role::ml_k8s::*::worker: use Dragonly for amd-pytorch images [puppet] - https://gerrit.wikimedia.org/r/1028833 (owner: Elukey)
[14:44:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2121.codfw.wmnet
[14:44:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Enable profile::auto_restarts::service for chartmuseum [puppet] - https://gerrit.wikimedia.org/r/1028790 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[14:44:59] <wikibugs>	 (CR) Muehlenhoff: [C:+2] chartmuseum: Enable profile::auto_restarts::service for envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1028789 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[14:46:12] <wikibugs>	 (PS1) Muehlenhoff: Switch db2121 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028850 (https://phabricator.wikimedia.org/T349619)
[14:46:35] <wikibugs>	 (CR) Ahmon Dancy: [C:+1] Stop installing git-fat also on Buster hosts [puppet] - https://gerrit.wikimedia.org/r/1028800 (https://phabricator.wikimedia.org/T279509) (owner: Muehlenhoff)
[14:47:12] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db2121 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028850 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:50:08] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Stop installing git-fat also on Buster hosts [puppet] - https://gerrit.wikimedia.org/r/1028800 (https://phabricator.wikimedia.org/T279509) (owner: Muehlenhoff)
[14:50:19] <godog>	 !log silence site=magru alerts during prometheus7001 - T364016
[14:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:24] <stashbot>	 T364016: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016
[14:50:59] <wikibugs>	 (CR) JMeybohm: [C:-1] kubernetes: make 5 eqiad api appservers k8s workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[14:51:09] <wikibugs>	 (CR) JMeybohm: [C:+1] trafficserver: move k8s traffic shift to 90% [puppet] - https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[14:51:28] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] grafana: add magru prometheus [puppet] - https://gerrit.wikimedia.org/r/1028503 (https://phabricator.wikimedia.org/T364016) (owner: Filippo Giunchedi)
[14:51:38] <wikibugs>	 (PS1) Kevin Bazira: admin_ng: add commons host header [deployment-charts] - https://gerrit.wikimedia.org/r/1027484 (https://phabricator.wikimedia.org/T363449)
[14:51:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2121.codfw.wmnet
[14:52:03] <wikibugs>	 (CR) JMeybohm: [C:+1] mw-web, mw-api-ext: bump replicas in advance of traffic shift [deployment-charts] - https://gerrit.wikimedia.org/r/1028842 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[14:52:59] <moritzm>	 !log installing mariadb-10.5 security updates (as packaged in Debian, not the wmf-mariadb packages)
[14:53:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:07] <logmsgbot>	 !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2305.codfw.wmnet|mw2325.codfw.wmnet|mw2338.codfw.wmnet|mw2359.codfw.wmnet|mw2390.codfw.wmnet|mw2407.codfw.wmnet),cluster=kubernetes,service=kubesvc
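
The conftool selector logged above combines a regex alternation on `name` with exact values for `cluster` and `service`. A minimal Python sketch of how such a selector could pick objects, assuming each `key=value` part is a regex that must match the whole tag value and that all parts must match; this illustrates the idea and is not conftool's own code. The helper names and the sample objects (including the `apache2` service value) are hypothetical.

```python
import re

def parse_selector(selector):
    """Split 'k1=v1,k2=v2' into a dict. Naive: assumes no commas inside patterns."""
    return dict(part.split("=", 1) for part in selector.split(","))

def selects(tags, selector):
    """True if every selector pattern fully matches the corresponding tag value."""
    return all(
        re.fullmatch(pattern, tags.get(key, "")) is not None
        for key, pattern in parse_selector(selector).items()
    )

# Selector copied from the log line above.
SELECTOR = (
    "name=(mw2305.codfw.wmnet|mw2325.codfw.wmnet|mw2338.codfw.wmnet|"
    "mw2359.codfw.wmnet|mw2390.codfw.wmnet|mw2407.codfw.wmnet),"
    "cluster=kubernetes,service=kubesvc"
)

# Hypothetical objects, for illustration only.
pooled_worker = {"name": "mw2305.codfw.wmnet", "cluster": "kubernetes", "service": "kubesvc"}
other_service = {"name": "mw2305.codfw.wmnet", "cluster": "kubernetes", "service": "apache2"}
other_host = {"name": "mw2306.codfw.wmnet", "cluster": "kubernetes", "service": "kubesvc"}

print(selects(pooled_worker, SELECTOR))
print(selects(other_service, SELECTOR))
print(selects(other_host, SELECTOR))
```

Under these assumptions only the first object is selected, because the other two fail either the `service` or the `name` pattern.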
[14:53:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2122.codfw.wmnet
[14:53:50] <sukhe>	 !log A:cp and A:magru: running haproxy-restart
[14:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:15] <wikibugs>	 (PS2) Hnowlan: kubernetes: make 5 eqiad api appservers k8s workers [puppet] - https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323)
[14:54:28] <wikibugs>	 (CR) Hnowlan: kubernetes: make 5 eqiad api appservers k8s workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[14:54:39] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1009.eqiad.wmnet
[14:54:40] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] "Service is up, proceeding" [puppet] - https://gerrit.wikimedia.org/r/1028504 (https://phabricator.wikimedia.org/T364016) (owner: Filippo Giunchedi)
[14:54:51] <wikibugs>	 (CR) JMeybohm: [C:-1] kubernetes: make 6 codfw api appservers k8s workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan)
[14:54:53] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9777706 (Nosferattus) @Ladsgroup: Please excuse me if I'm wrong, but I don't see how those statistics are related to what I suggested. I read those stat...
[14:55:03] <elukey>	 !log depool ms-fe1009's nginx (swift proxy) to safely apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026927
[14:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:08] <wikibugs>	 (PS1) Muehlenhoff: Switch db2122 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028852 (https://phabricator.wikimedia.org/T349619)
[14:56:42] <wikibugs>	 (CR) Elukey: [C:+2] role::ml_k8s::*::worker: use Dragonly for amd-pytorch images [puppet] - https://gerrit.wikimedia.org/r/1028833 (owner: Elukey)
[15:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1500).
[15:00:46] <wikibugs>	 (CR) Elukey: [C:+1] Make base.certificates compatible with chart modules and scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[15:01:36] <wikibugs>	 (CR) JMeybohm: [C:+2] Make base.certificates compatible with chart modules and scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[15:01:40] <wikibugs>	 (CR) JMeybohm: [C:+2] New version of base.certificates module [deployment-charts] - https://gerrit.wikimedia.org/r/1026859 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[15:02:31] <wikibugs>	 (Merged) jenkins-bot: New version of base.certificates module [deployment-charts] - https://gerrit.wikimedia.org/r/1026859 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[15:02:35] <wikibugs>	 (Merged) jenkins-bot: Make base.certificates compatible with chart modules and scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[15:03:18] <wikibugs>	 (CR) Klausman: [C:+1] admin_ng: add commons host header [deployment-charts] - https://gerrit.wikimedia.org/r/1027484 (https://phabricator.wikimedia.org/T363449) (owner: Kevin Bazira)
[15:03:36] <wikibugs>	 (CR) Elukey: [C:+2] Move ms-fe1009's envoy TLS cert to PKI [puppet] - https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: Elukey)
[15:05:07] <wikibugs>	 (PS2) Hnowlan: kubernetes: make 6 codfw api appservers k8s workers [puppet] - https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074)
[15:05:15] <wikibugs>	 (CR) Hnowlan: kubernetes: make 6 codfw api appservers k8s workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan)
[15:07:44] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9777731 (MoritzMuehlenhoff)
[15:08:03] <wikibugs>	 (CR) CI reject: [V:-1] kubernetes: make 6 codfw api appservers k8s workers [puppet] - https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan)
[15:09:09] <wikibugs>	 (PS1) Muehlenhoff: Reapply "Enable install7001 as webproxy in magru" [dns] - https://gerrit.wikimedia.org/r/1028853
[15:09:15] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db2122 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1028852 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[15:09:48] <wikibugs>	 (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1028854
[15:10:38] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Reapply "Enable install7001 as webproxy in magru" [dns] - https://gerrit.wikimedia.org/r/1028853 (owner: Muehlenhoff)
[15:12:29] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1009.eqiad.wmnet
[15:12:49] <elukey>	 !log repool ms-fe1009's envoy with PKI TLS cert
[15:12:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:57] <wikibugs>	 (PS3) Hnowlan: kubernetes: make 6 codfw api appservers k8s workers [puppet] - https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074)
[15:13:17] <godog>	 !log remove accidentally set site!=magru silence, add site=magru silence instead - T364016
[15:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:20] <stashbot>	 T364016: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016
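
The correction above hinges on the difference between an equality matcher (`site=magru`) and a negative matcher (`site!=magru`) in a silence. A minimal Python sketch of that difference, assuming the usual label-matcher semantics where a missing label compares as the empty string; the `matches` helper and the `HostDown` alert are hypothetical, the other alert names are taken from this log.

```python
def matches(labels, matchers):
    """Return True if every (name, op, value) matcher accepts the label set."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label is treated as ""
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:
            return False
    return True

# Illustrative alerts; the site labels are chosen for the example.
alerts = [
    {"alertname": "JobUnavailable", "site": "eqiad"},
    {"alertname": "SystemdUnitFailed", "site": "codfw"},
    {"alertname": "HostDown", "site": "magru"},
]

accidental = [("site", "!=", "magru")]  # silences everything except magru
intended = [("site", "=", "magru")]     # silences only magru

silenced_by_accidental = [a["site"] for a in alerts if matches(a, accidental)]
silenced_by_intended = [a["site"] for a in alerts if matches(a, intended)]

print(silenced_by_accidental)  # ['eqiad', 'codfw']
print(silenced_by_intended)    # ['magru']
```

In this sketch the accidental matcher hides alerts from every site other than magru, which is the opposite of what was intended during the prometheus7001 work.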
[15:14:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2122.codfw.wmnet
[15:14:24] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9777740 (elukey) ms-fe1009's envoy migrated to PKI! We'll wait a couple of days before proceeding with either eqiad or codfw.
[15:15:13] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:18:23] <godog>	 mmhh grafana down, taking a look
[15:18:31] <wikibugs>	 (PS1) Zabe: hieradata: Add itwiki to private wikis [puppet] - https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825)
[15:18:57] <wikibugs>	 (CR) Zabe: hieradata: Add itwiki to private wikis (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: Zabe)
[15:19:08] <godog>	 actually never mind, it was restarting for a configuration change
[15:19:31] <wikibugs>	 (PS2) Zabe: hieradata: Add arbcom_itwiki to private wikis [puppet] - https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825)
[15:19:38] <moritzm>	 !log imported nodejs 20.5.1-deb-1nodesource1 to thirdparty/node20 T362681
[15:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:41] <stashbot>	 T362681: Provide nodejs20 base images for production - https://phabricator.wikimedia.org/T362681
[15:20:44] <wikibugs>	 SRE, collaboration-services, serviceops: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415#9777749 (LSobanski) p:Triage→Medium
[15:22:25] <wikibugs>	 (CR) JMeybohm: [C:+1] kubernetes: make 5 eqiad api appservers k8s workers [puppet] - https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[15:22:47] <wikibugs>	 (CR) JMeybohm: [C:+1] kubernetes: make 6 codfw api appservers k8s workers [puppet] - https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan)
[15:24:43] <wikibugs>	 (PS1) Zabe: Add Apache configuration for wikipedia-it-arbcom.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1028858 (https://phabricator.wikimedia.org/T363825)
[15:25:50] <Amir1>	 jouncebot: nowandnext
[15:25:50] <jouncebot>	 For the next 0 hour(s) and 34 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1500)
[15:25:50] <jouncebot>	 In 0 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1600)
[15:26:52] <wikibugs>	 (CR) Bking: [C:+1] aliases: add datacenter-scoped cumin aliases for flink zk ensembles [puppet] - https://gerrit.wikimedia.org/r/1028520 (https://phabricator.wikimedia.org/T363975) (owner: Brouberol)
[15:27:43] <wikibugs>	 (PS4) Herron: pyrra: varnish: workaround site grouping limitation [puppet] - https://gerrit.wikimedia.org/r/1028854 (https://phabricator.wikimedia.org/T302995)
[15:27:56] <wikibugs>	 (CR) Herron: [C:+2] pyrra: varnish: workaround site grouping limitation [puppet] - https://gerrit.wikimedia.org/r/1028854 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[15:29:10] <hnowlan>	 !log depooling 5 eqiad api appservers in advance of reimaging to k8s workers
[15:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:00] <wikibugs>	 (CR) Ladsgroup: [C:+2] Stop writing to old columns of pagelinks in most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1028778 (https://phabricator.wikimedia.org/T352010) (owner: Ladsgroup)
[15:31:14] <wikibugs>	 (CR) Brouberol: [C:+2] aliases: add datacenter-scoped cumin aliases for flink zk ensembles [puppet] - https://gerrit.wikimedia.org/r/1028520 (https://phabricator.wikimedia.org/T363975) (owner: Brouberol)
[15:31:43] <wikibugs>	 (PS1) Elukey: role::swift::proxy: move eqiad envoys to PKI TLS certs [puppet] - https://gerrit.wikimedia.org/r/1028859 (https://phabricator.wikimedia.org/T356412)
[15:31:46] <wikibugs>	 (Merged) jenkins-bot: Stop writing to old columns of pagelinks in most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1028778 (https://phabricator.wikimedia.org/T352010) (owner: Ladsgroup)
[15:31:51] <wikibugs>	 (PS1) Elukey: role::swift::proxy: move codfw envoys to PKI TLS certs [puppet] - https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412)
[15:32:32] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]]
[15:32:38] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[15:32:39] <stashbot>	 T299947: Normalize pagelinks table - https://phabricator.wikimedia.org/T299947
[15:33:22] <wikibugs>	 (CR) Volans: aliases: add datacenter-scoped cumin aliases for flink zk ensembles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1028520 (https://phabricator.wikimedia.org/T363975) (owner: Brouberol)
[15:35:31] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1028859 (https://phabricator.wikimedia.org/T356412) (owner: Elukey)
[15:37:10] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) (owner: Elukey)
[15:38:11] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:38:15] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[15:38:16] <stashbot>	 T299947: Normalize pagelinks table - https://phabricator.wikimedia.org/T299947
[15:38:30] <wikibugs>	 (CR) Elukey: role::swift::proxy: move codfw envoys to PKI TLS certs [puppet] - https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) (owner: Elukey)
[15:41:29] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1028859 (https://phabricator.wikimedia.org/T356412) (owner: Elukey)
[15:41:52] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) (owner: Elukey)
[15:45:52] <wikibugs>	 SRE-OnFire, Data-Platform-SRE (2024.05.06 - 2024.05.26), Discovery-Search (Current work), Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9777868 (Gehel) Open→Resolved Subtasks...
[15:52:07] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[15:52:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:57:18] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1024 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:57:37] <wikibugs>	 (CR) Dzahn: [C:+2] admin: add linafaridwmde to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1028600 (https://phabricator.wikimedia.org/T364068) (owner: Dzahn)
[15:57:46] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1452 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:58:02] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[15:58:06] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:58:15] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[15:58:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T352010)', diff saved to https://phabricator.wikimedia.org/P61983 and previous config saved to /var/cache/conftool/dbconfig/20240507-155822-ladsgroup.json
[15:58:27] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[16:00:04] <jouncebot>	 jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1600). Please do the needful.
[16:00:04] <jouncebot>	 zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:38] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1352 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:00:43] <zabe>	 o/
[16:00:52] <jhathaway>	 o/
[16:00:54] <jhathaway>	 merging in
[16:01:00] <wikibugs>	 (CR) JHathaway: [C:+2] Add Apache configuration for wikipedia-it-arbcom.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1028858 (https://phabricator.wikimedia.org/T363825) (owner: Zabe)
[16:01:07] <rzl>	 thanks jhathaway
[16:01:26] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review, WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users  for  linafaridwmde - https://phabricator.wikimedia.org/T364068#9778014 (Dzahn) a:Dzahn→None
[16:01:28] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review, WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users  for  linafaridwmde - https://phabricator.wikimedia.org/T364068#9778006 (Dzahn) In progress→Resolved a:Dzahn @Lina_Farid_WMDE You have been added to the gr...
[16:01:32] <jhathaway>	 yup!
[16:02:22] <jhathaway>	 zabe, done
[16:02:27] <zabe>	 Thanks!
[16:02:54] <wikibugs>	 (CR) Dzahn: [V:+1 C:+1] "https://phabricator.wikimedia.org/T364359 says these users were supposed to get the new group but not be removed from the old group." [puppet] - https://gerrit.wikimedia.org/r/1023965 (https://phabricator.wikimedia.org/T363360) (owner: BCornwall)
[16:04:16] <taavi>	 zabe: jhathaway: iirc that change also needs a mw-on-k8s deployment (after a puppet run on deploy1002) to apply properly
[16:05:02] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] (duration: 32m 29s)
[16:05:12] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[16:05:12] <stashbot>	 T299947: Normalize pagelinks table - https://phabricator.wikimedia.org/T299947
[16:05:18] <wikibugs>	 SRE, collaboration-services, serviceops: add bullseye support to deployment server puppet role - upgrade deployment server in devtools - https://phabricator.wikimedia.org/T363415#9778028 (Dzahn)
[16:05:35] <zabe>	 ok, i can do that
[16:07:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:08:38] <denisse>	 ^  Strange, looking.
[16:08:49] <logmsgbot>	 !log zabe@deploy1002 Started scap: (no justification provided)
[16:08:49] <jhathaway>	 denisse: could be the patch we just pushed
[16:08:49] <logmsgbot>	 !log zabe@deploy1002 sync-world aborted: (no justification provided) (duration: 00m 00s)
[16:08:57] <jhathaway>	 looking also
[16:09:10] <denisse>	 jhathaway: That seems reasonable, I've ACK'd the alerts.
[16:09:22] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1028864
[16:13:06] <jhathaway>	 mutante: the issue seems to be with the user linafaridwmde you just added
[16:13:40] <jhathaway>	 I think you missed specifying their gid, so useradd assumes there is a gid that matches the uid of the user
[16:15:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:18:18] <zabe>	 jhathaway: so is it okay to do the mw-on-k8s deployment or should I wait?
[16:18:31] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy)
[16:18:51] <jhathaway>	 probably okay, but let me fix first...
[16:18:57] <zabe>	 okay:)
[16:19:12] <mutante>	 jhathaway: happy to add GID 500 but I remember checking if others like this had it and they didn't
[16:19:24] <mutante>	 it's not a shell user
[16:19:47] <hnowlan>	 appserver errors are all of the form "Lock wait timeout exceeded" - some kind of DB flakiness happening? 
[16:19:51] <mutante>	 looking at the error
[16:20:02] <jhathaway>	 if it is not a shell user, then it would be in ldap_only_users correct mutante?
[16:20:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:20:21] <jhathaway>	 puppet is trying to add a local account for that user
[16:20:23] <mutante>	 no, not for this case where it's the weird "privatedata-users without shell"
[16:20:32] <mutante>	 CI didn't like the previous PS
[16:22:12] <mutante>	 that's why I dislike that weird special case :p
[16:22:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:22:27] <wikibugs>	 (03PS1) 10Dzahn: Revert "admin: add linafaridwmde to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/1028581
[16:23:58] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "Warning: /Stage[main]/Admin/Exec[enforce-users-groups-cleanup]: Skipping because of failed dependencies" [puppet] - 10https://gerrit.wikimedia.org/r/1028581 (owner: 10Dzahn)
[16:25:46] <jhathaway>	 mutante: no shell means no ssh, which is a bit weird for sure, which users don't have gids?
[16:25:54] <mutante>	 jhathaway: reverted and starting to run cumin on --failed-only in batches
[16:26:02] <jhathaway>	 thanks
[16:26:21] <jhathaway>	 zabe: I think you are clear to go
[16:26:25] <wikibugs>	 (03PS1) 10Ladsgroup: Partial cherry-pick of I9d8409fdbd757e [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1028865 (https://phabricator.wikimedia.org/T361398)
[16:26:32] <zabe>	 alright
[16:26:44] <logmsgbot>	 !log zabe@deploy1002 Started scap: T363825
[16:26:47] <stashbot>	 T363825: Create private wikipedia_it_arbcom wiki - https://phabricator.wikimedia.org/T363825
[16:27:10] <mutante>	 jhathaway: yes, it is supposed to be no ssh.  https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Accounts_and_passwords_explained:_LDAP/Wikitech/MW_Developer_vs_shell/ssh/posix_vs_Kerberos
[16:27:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver POST/200: 0.613747498021355s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyE
[16:27:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:27:18] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1024 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:27:30] <mutante>	 I want to point out I have no relation to those mediawiki error rates ^
[16:27:36] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "Good idea, I'll send another patch for titan. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[16:28:06] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:28:51] <wikibugs>	 (03PS9) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414)
[16:29:10] <zabe>	 I do neither and that looks actually quite concerning
[16:30:11] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review, 03WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users  for  linafaridwmde - https://phabricator.wikimedia.org/T364068#9778167 (10Dzahn) 05Resolved→03Open unfortunately merging the code change caused widespread puppet failur...
[16:30:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1352 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:30:41] <wikibugs>	 (03CR) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[16:31:06] <mutante>	 jhathaway: I think you were right about the original thing that it was missing the gid line. some just have it in another order.. but I will let this calm down first
[16:31:21] <jhathaway>	 nod, sounds good
[16:31:55] <wikibugs>	 (03PS10) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414)
[16:32:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver POST/200: 0.7291964295290969s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:32:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:32:31] <zabe>	 created T364404 for the mw errors
[16:32:31] <stashbot>	 T364404: Wikimedia\Rdbms\DBTransactionSizeError: Transaction spent {time}s in writes, exceeding the 3s limit - https://phabricator.wikimedia.org/T364404
[16:34:27] <logmsgbot>	 !log zabe@deploy1002 Finished scap: T363825 (duration: 07m 42s)
[16:34:30] <stashbot>	 T363825: Create private wikipedia_it_arbcom wiki - https://phabricator.wikimedia.org/T363825
[16:36:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:36:23] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 9 CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[16:36:59] <wikibugs>	 06SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Dennis Mburugu - https://phabricator.wikimedia.org/T364320#9778211 (10DMburugu) Turnilo and Superset
[16:37:06] <hnowlan>	 I depooled some api appservers earlier in advance of reimaging. Will repool them to help soak this up, but I don't think it's strictly a capacity issue 
[16:37:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:37:27] <hnowlan>	 these lock timeouts and latency increases, along with little bumps in DB errors make me nervous
[16:38:06] <icinga-wm>	 PROBLEM - Host mwlog1002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:38:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver GET/200: 0.5382698779576534s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyEx
[16:38:22] <wikibugs>	 (03PS1) 10Btullis: Move stats misc_jobs from stat1007 to stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1028866 (https://phabricator.wikimedia.org/T353785)
[16:38:44] <dancy>	 ooh I wonder if the error rate crashed mwlog1002 somehow
[16:39:06] <wikibugs>	 (03PS11) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414)
[16:39:50] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[16:40:29] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2321/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028866 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis)
[16:41:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:42:40] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[16:42:45] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:43:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver GET/200: 0.23324278673749851s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:43:57] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "Hello team, here are the PCC results for the latest patch: https://puppet-compiler.wmflabs.org/output/1028546/2322/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[16:45:27] <wikibugs>	 (03PS4) 10Volans: sre.hosts.decommission: ask on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306)
[16:45:27] <wikibugs>	 (03PS1) 10Volans: sre.ganeti.makevm: add logging message [cookbooks] - 10https://gerrit.wikimedia.org/r/1028867
[16:46:50] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] thanos: Provision Thanos frontend TLS certificates with CFSSL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[16:47:44] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1028867 (owner: 10Volans)
[16:47:45] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:48:31] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1028799 (https://phabricator.wikimedia.org/T316876) (owner: 10Muehlenhoff)
[16:48:36] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet
[16:48:58] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1028763 (https://phabricator.wikimedia.org/T316876) (owner: 10Muehlenhoff)
[16:49:42] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1026822 (owner: 10Muehlenhoff)
[16:49:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.ganeti.makevm: add logging message [cookbooks] - 10https://gerrit.wikimedia.org/r/1028867 (owner: 10Volans)
[16:50:21] <dancy>	 mutante: Can you check on mwlog1002.eqiad.wmnet ?
[16:53:25] <mutante>	 dancy: oh, you mean the host is down... yea
[16:53:37] <dancy>	 Nod. It died suddenly
[16:53:54] <sukhe>	 I looked briefly but nothing in sel for what it's worth
[16:54:12] <denisse>	 dancy: We're looking at it and tracking the issue on this ticket: https://phabricator.wikimedia.org/T364404
[16:55:18] <mutante>	 the machine is up and sitting at login
[16:55:28] <mutante>	 so that usually means cable disconnected
[16:55:30] <mutante>	 or networking 
[16:55:36] <dancy>	 Interesting
[16:56:00] <dancy>	 Is it responsive to the login prompt?
[16:56:05] <mutante>	 logged in via mgmt
[16:56:12] <mutante>	 yes
[16:56:30] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:56:43] <mutante>	 jhathaway: fwiw that's that ^
[16:56:54] <icinga-wm>	 RECOVERY - Host mwlog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[16:56:54] <denisse>	 🫶
[16:57:06] <dancy>	 woohoo!
[16:57:26] <dancy>	 mwlog1002 load average is >12.
[16:57:40] <mutante>	 mwlog1002 is reachable again
[16:57:46] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1452 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:58:09] <mutante>	 CPU used by rsyslogd
[16:58:16] <mutante>	 load going down
[16:59:04] <mutante>	 mwlog1002 rsyslogd: omkafka: kafka error message: -181,'Local: SSL error','ssl://kafka-logging1004.eqiad.wmnet:9093/1004: SSL handshake failed: Disconnected: connecting to a PLAINTEXT broker listener? 
[16:59:52] <wikibugs>	 (03CR) 10Hashar: [C:04-1] "Thanks for the discovery of `plugin.restApi().post()` and crafting the post.  The code should overall be refined, see my various comments." [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox)
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1700)
[17:00:05] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027484 (https://phabricator.wikimedia.org/T363449) (owner: 10Kevin Bazira)
[17:03:07] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: add commons host header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027484 (https://phabricator.wikimedia.org/T363449) (owner: 10Kevin Bazira)
[17:05:24] <wikibugs>	 (03CR) 10Scott French: [C:03+2] apertium: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028604 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[17:06:25] <wikibugs>	 (03Merged) 10jenkins-bot: apertium: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028604 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[17:07:50] <wikibugs>	 (03PS1) 10Dzahn: Revert "Revert "admin: add linafaridwmde to analytics-privatedata-users"" [puppet] - 10https://gerrit.wikimedia.org/r/1028582
[17:09:30] <wikibugs>	 (03PS6) 10Herron: pyrra: etcd: add generic rules workaround [puppet] - 10https://gerrit.wikimedia.org/r/1028864 (https://phabricator.wikimedia.org/T302995)
[17:13:51] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply
[17:14:15] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply
[17:14:23] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "It is fine to restart Envoy, afaik that is only used for https://integration.wikimedia.org/ and my guess is the only side effects would be" [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[17:15:58] <wikibugs>	 (03PS4) 10JHathaway: puppetserver-deploy-code: bail out if current branch is not 'production' [puppet] - 10https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott)
[17:16:14] <wikibugs>	 (03CR) 10Herron: [C:03+2] pyrra: etcd: add generic rules workaround [puppet] - 10https://gerrit.wikimedia.org/r/1028864 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[17:16:47] <wikibugs>	 (03CR) 10JHathaway: puppetserver-deploy-code: bail out if current branch is not 'production' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott)
[17:19:19] <wikibugs>	 (03CR) 10Hashar: "The issue I have is the containers being restarted while they might be running containers. That would cause Jenkins jobs to fail abruptly " [puppet] - 10https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[17:20:20] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply
[17:20:44] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Allow 'rel="me"' in Postorius list info links to verify Mastodon link on wikis.world - https://phabricator.wikimedia.org/T364402#9778349 (10Aklapper) @Greenreaper: That sounds to me like upstream behavior to improve (not to mangle the `rel` parameter value)? Per https://www.me...
[17:21:09] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply
[17:29:28] <wikibugs>	 (03PS3) 10Scott French: api-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978)
[17:32:29] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply
[17:33:42] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply
[17:35:30] <wikibugs>	 (03PS1) 10Zabe: Avoid empty insert in SqlScoreStorage::storeScores [extensions/ORES] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1028583 (https://phabricator.wikimedia.org/T364218)
[17:38:11] <wikibugs>	 (03CR) 10Eevans: [C:03+2] New group for users of Cassandra staging (cassandra-dev) [puppet] - 10https://gerrit.wikimedia.org/r/1026194 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans)
[17:49:34] <wikibugs>	 (03PS2) 10Dzahn: Revert "Revert "admin: add linafaridwmde to analytics-privatedata-users"" [puppet] - 10https://gerrit.wikimedia.org/r/1028582
[17:50:06] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] ci: Enable profile::auto_restarts::service for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[17:50:40] <wikibugs>	 (03PS2) 10Muehlenhoff: ci: Enable profile::auto_restarts::service for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991)
[17:51:04] <wikibugs>	 (03CR) 10Dzahn: ci: Enable profile::auto_restarts::service for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[17:52:15] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert "Revert "admin: add linafaridwmde to analytics-privatedata-users"" [puppet] - 10https://gerrit.wikimedia.org/r/1028582 (owner: 10Dzahn)
[17:53:22] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[17:54:05] <wikibugs>	 (03PS1) 10Andrea Denisse: thanos: Update TLS certificate in Envoy config to match CFSSL provisioning [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414)
[17:55:38] <wikibugs>	 06SRE, 10SRE-Access-Requests, 03WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users  for  linafaridwmde - https://phabricator.wikimedia.org/T364068#9778496 (10Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028581  https://gerrit.wikimedia.org/r/c/operations/puppet/...
[17:56:42] <wikibugs>	 06SRE, 10SRE-Access-Requests, 03WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users  for  linafaridwmde - https://phabricator.wikimedia.org/T364068#9778497 (10Dzahn) 05Open→03Resolved a:03Dzahn issue fixed. user is now being created by puppet. within the next half hour...
[17:58:43] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2323/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[18:00:05] <jouncebot>	 jeena and thcipriani: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1800)
[18:00:12] <wikibugs>	 06SRE, 06collaboration-services, 06serviceops: add bullseye support to deployment server puppet role - upgrade deployment server in devtools - https://phabricator.wikimedia.org/T363415#9778509 (10Dzahn) This still needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026193  to be merged to be able to...
[18:04:55] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414 (10Jdforrester-WMF) 03NEW
[18:06:54] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "Hello team, the PCC results for this change show a NOOP however, I think this change is important because there `thanos-query.discovery.wm" [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[18:08:23] <wikibugs>	 (03CR) 10Thcipriani: "Sounds like we need to backport this pre-train, correct?" [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1028865 (https://phabricator.wikimedia.org/T361398) (owner: 10Ladsgroup)
[18:09:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] puppetserver-deploy-code: bail out if current branch is not 'production' [puppet] - 10https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott)
[18:09:16] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9778550 (10Dzahn)
[18:10:39] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9778555 (10Dzahn) @thcipriani please consider for approval  (https://wikimedia.namely.com/people/eaebb898-01ba-404e-8cf8-2ed33c4e0d04/show/personal/employee-information/)
[18:10:52] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9778558 (10Dzahn) @Mcastro Please confirm if you approve
[18:12:25] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416 (10RobH) 03NEW
[18:12:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:13:44] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9778581 (10RobH)
[18:14:22] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9778582 (10RobH)
[18:15:01] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9778588 (10ecarg) 'signing' my request this way; TY James!
[18:16:20] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9778606 (10Dzahn) fwiw - for the person who will add the production puppet role to this later:  This is only possible since just recently but should be mostly unblocked now:  details in T363415 -...
[18:17:25] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9778608 (10RobH) a:03akosiaris @akosiaris,  The parent ordering task for the deploy1002 replacement didn't have racking info, but I didn't want to stall ordering to get it so I've created this...
[18:21:16] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Allow 'rel="me"' in Postorius list info links to verify Mastodon link on wikis.world - https://phabricator.wikimedia.org/T364402#9778615 (10GreenReaper) It actually looks like it is upstream of that. [The commit adding markdown support](https://gitlab.com/mailman/postorius/-/c...
[18:23:10] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Allow 'rel="me"' in Postorius list info links to verify Mastodon link on wikis.world - https://phabricator.wikimedia.org/T364402#9778628 (10Pppery) But do note that Wikimedia tends to be slow at pulling down upstream changes, and I have no idea how fast Mailman does so.
[18:23:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:33:45] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Upstream: Allow 'rel="me"' in Postorius list info links to verify Mastodon link on wikis.world - https://phabricator.wikimedia.org/T364402#9778698 (10GreenReaper)
[18:37:26] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422 (10Eevans) 03NEW
[18:37:34] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9778735 (10Eevans) p:05Triage→03High
[18:38:48] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9778737 (10Eevans) >>! In T362033#9758428, @Volans wrote: > Maybe a little drastic option, but could we try to reimage one of those 2 server and wait few days? > That will surely wipe clean any manual proced...
[18:40:10] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Decommissioning — T364422
[18:40:13] <stashbot>	 T364422: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422
[18:40:24] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Decommissioning — T364422
[18:40:28] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9778741 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=397ec6a2-88d3-4fa3-b149-367bc8b4c353) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Decommission...
[18:51:46] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1028881
[18:53:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:54:51] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9778759 (10Mcastro) Approved.
[18:57:56] <jeena>	 I will backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1028865 and then roll the train to group0
[18:58:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:58:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:58:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1028865 (https://phabricator.wikimedia.org/T361398) (owner: 10Ladsgroup)
[19:03:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:03:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:08:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:10:05] <wikibugs>	 (PS6) Herron: pyrra: logstash: add generic rules workaround [puppet] - https://gerrit.wikimedia.org/r/1028881 (https://phabricator.wikimedia.org/T302995)
[19:10:19] <wikibugs>	 (CR) Herron: [V:+1 C:+2] pyrra: logstash: add generic rules workaround [puppet] - https://gerrit.wikimedia.org/r/1028881 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[19:18:25] <wikibugs>	 (Merged) jenkins-bot: Partial cherry-pick of I9d8409fdbd757e [core] (wmf/1.43.0-wmf.3) - https://gerrit.wikimedia.org/r/1028865 (https://phabricator.wikimedia.org/T361398) (owner: Ladsgroup)
[19:18:53] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1028865|Partial cherry-pick of I9d8409fdbd757e (T361398 T362566)]]
[19:18:58] <stashbot>	 T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398
[19:18:59] <stashbot>	 T362566: Stop growth of text table by storing ES addresses in content table - https://phabricator.wikimedia.org/T362566
[19:20:03] <wikibugs>	 (PS2) Volans: sre.ganeti.makevm: add logging message [cookbooks] - https://gerrit.wikimedia.org/r/1028867
[19:21:29] <logmsgbot>	 !log jhuneidi@deploy1002 ladsgroup and jhuneidi: Backport for [[gerrit:1028865|Partial cherry-pick of I9d8409fdbd757e (T361398 T362566)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[19:21:45] <logmsgbot>	 !log jhuneidi@deploy1002 ladsgroup and jhuneidi: Continuing with sync
[19:23:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:24:20] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:24:21] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [cookbooks] - https://gerrit.wikimedia.org/r/1028867 (owner: Volans)
[19:28:30] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:28:54] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2314 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:29:14] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1060 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:29:16] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1021 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:29:18] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1024 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:30:47] <wikibugs>	 (PS1) Aklapper: Make Translations extension work with upstream Phorge [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426)
[19:34:16] <wikibugs>	 (PS2) Aklapper: Make Translations extension work with upstream Phorge [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426)
[19:34:33] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1028865|Partial cherry-pick of I9d8409fdbd757e (T361398 T362566)]] (duration: 15m 39s)
[19:34:37] <stashbot>	 T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398
[19:34:37] <stashbot>	 T362566: Stop growth of text table by storing ES addresses in content table - https://phabricator.wikimedia.org/T362566
[19:35:56] <wikibugs>	 (CR) Aklapper: "Not perfect because still hardcoding WMF Phabricator instance in browse-uri, but good enough to make this extension work on my machine wit" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: Aklapper)
[19:43:31] <wikibugs>	 (PS1) TrainBranchBot: group0 wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028888 (https://phabricator.wikimedia.org/T361398)
[19:43:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group0 wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028888 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[19:44:21] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028888 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[19:46:30] <denisse>	 !log disabling Puppet on the Logstash hosts that serve OpenSearch dashboards to test the CFSSL certificates - T360414
[19:46:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:35] <stashbot>	 T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414
[19:50:44] <wikibugs>	 (CR) Andrea Denisse: [C:+2] logstash: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[19:52:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[19:52:48] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1416 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:52:48] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1393 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:53:42] <denisse>	 ^ looking
[19:53:48] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1458 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:54:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:55:26] <denisse>	 I can confirm the iptables rules are loaded.
[19:56:02] <denisse>	 The ferm unit looks healthy.
[19:57:35] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on 12 hosts with reason: Downtiming the Logstash hosts serving OpenSearch Dashboards as part of the cergen to CFSSL migration - T360414
[19:57:38] <stashbot>	 T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414
[19:57:55] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 12 hosts with reason: Downtiming the Logstash hosts serving OpenSearch Dashboards as part of the cergen to CFSSL migration - T360414
[19:58:30] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:58:54] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2314 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:59:14] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1060 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:59:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.19% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:59:16] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1021 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:59:18] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1024 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:59:18] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.4  refs T361398
[19:59:21] <stashbot>	 T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398
[19:59:47] <denisse>	 Those alerts self-resolved, but I'm unsure why they fired; there aren't any anomalies in the logs.
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T2000)
[20:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:00:44] <jeena>	 I think I need to roll back the train due to https://phabricator.wikimedia.org/T364428
[20:00:51] <swfrench-wmf>	 denisse: this might be the same issue as https://phabricator.wikimedia.org/T354855
[20:01:19] <swfrench-wmf>	 (if so, it should auto-resolve on the next puppet run, due to the changes to the check script)
[20:02:10] <denisse>	 swfrench-wmf: That indeed looks like it, thanks!
[20:03:24] <jeena>	 rolling back
[20:04:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:04:18] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028891 (https://phabricator.wikimedia.org/T361398)
[20:04:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028891 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[20:04:28] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1028891 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[20:04:48] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.4  refs T361398
[20:04:51] <stashbot>	 T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398
[20:06:36] <denisse>	 !log Enabling Puppet on the Logstash hosts that serve OpenSearch dashboards to migrate to CFSSL certificates - T360414
[20:06:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:42] <stashbot>	 T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414
[20:09:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:09:57] <denisse>	 !log Restarting envoyproxy and opensearch-dashboards services on the Logstash hosts that serve OpenSearch dashboards to migrate to CFSSL certificates - T360414
[20:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:10] <wikibugs>	 (CR) Andrea Denisse: [V:+2 C:+2] ssl: Remove unnecessary dummy key for the kibana hosts [labs/private] - https://gerrit.wikimedia.org/r/1026693 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[20:14:20] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:14:33] <wikibugs>	 (CR) Andrea Denisse: [C:+2] wmcs: Remove unnecesary kibana and kibana-discovery certificates [puppet] - https://gerrit.wikimedia.org/r/1026692 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[20:14:34] <wikibugs>	 ops-eqiad, Data-Engineering, DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429 (RobH) NEW
[20:15:30] <wikibugs>	 ops-eqiad, Data-Engineering, DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9779047 (RobH)
[20:16:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:17:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:17:24] <denisse>	 !log Deleting the kibana and kibana-combined certificates from the private repository - T360414
[20:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:29] <stashbot>	 T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414
[20:17:46] <wikibugs>	 ops-eqiad, SRE, DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9779052 (Jclark-ctr) On Friday Dell agreed to replace the backplane and cables. Shipped out Monday; expected arrival Tuesday.
[20:18:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:51] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.4  refs T361398 (duration: 15m 03s)
[20:19:54] <stashbot>	 T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398
[20:20:54] <wikibugs>	 (CR) Zabe: [C:+2] Avoid empty insert in SqlScoreStorage::storeScores [extensions/ORES] (wmf/1.43.0-wmf.3) - https://gerrit.wikimedia.org/r/1028583 (https://phabricator.wikimedia.org/T364218) (owner: Zabe)
[20:21:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:22:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:22:48] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1416 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:22:48] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1393 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:23:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:23:17] <wikibugs>	 (Merged) jenkins-bot: Avoid empty insert in SqlScoreStorage::storeScores [extensions/ORES] (wmf/1.43.0-wmf.3) - https://gerrit.wikimedia.org/r/1028583 (https://phabricator.wikimedia.org/T364218) (owner: Zabe)
[20:23:48] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1458 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:24:17] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:1028583|Avoid empty insert in SqlScoreStorage::storeScores (T364218)]]
[20:24:20] <stashbot>	 T364218: UnexpectedValueException: Wikimedia\Rdbms\InsertQueryBuilder::execute can't have empty $rows value (via ORES SqlScoreStorage) - https://phabricator.wikimedia.org/T364218
[20:26:39] <wikibugs>	 SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9779063 (andrea.denisse)
[20:26:55] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:1028583|Avoid empty insert in SqlScoreStorage::storeScores (T364218)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:27:14] <logmsgbot>	 !log zabe@deploy1002 zabe: Continuing with sync
[20:28:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:28:48] <wikibugs>	 (PS12) Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414)
[20:32:36] <wikibugs>	 (CR) Andrea Denisse: [V:+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[20:34:17] <wikibugs>	 (PS4) Scott French: confd: Extend confd-lint-wrap to accept a unique resource name [puppet] - https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924)
[20:34:17] <wikibugs>	 (PS4) Scott French: confd: prom exporter uses resource name to find state file [puppet] - https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924)
[20:34:17] <wikibugs>	 (PS1) Scott French: confd: confd-lint-wrap ignores positional args separator [puppet] - https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924)
[20:34:19] <wikibugs>	 (PS1) Scott French: confd: insert positional argument separator in check_cmd [puppet] - https://gerrit.wikimedia.org/r/1028898 (https://phabricator.wikimedia.org/T363924)
[20:34:58] <wikibugs>	 (CR) Scott French: [C:-1] "Thank you both for the review." [puppet] - https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[20:35:10] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2429 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:36:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:38:25] <wikibugs>	 (CR) Scott French: "Alright, I've uploaded two patches beneath this one, which pave the way for using argparse here in confd-lint-wrap. Thanks for your patien" [puppet] - https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[20:39:58] <wikibugs>	 (CR) Scott French: "Thanks in advance for the reviews. See also [0] for the rationale behind this." [puppet] - https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[20:40:18] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1028583|Avoid empty insert in SqlScoreStorage::storeScores (T364218)]] (duration: 16m 01s)
[20:40:28] <stashbot>	 T364218: UnexpectedValueException: Wikimedia\Rdbms\InsertQueryBuilder::execute can't have empty $rows value (via ORES SqlScoreStorage) - https://phabricator.wikimedia.org/T364218
[20:40:59] <wikibugs>	 SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9779078 (andrea.denisse) We have a special situation with the thanos* hosts:  The `thanos-fe` hosts TLS certificates are not provisione...
[20:41:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:47:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T352010)', diff saved to https://phabricator.wikimedia.org/P61985 and previous config saved to /var/cache/conftool/dbconfig/20240507-204701-ladsgroup.json
[20:47:05] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[20:47:57] <wikibugs>	 (CR) Scott French: "PCC diff for puppetmaster1001: https://puppet-compiler.wmflabs.org/output/1028898/2328/" [puppet] - https://gerrit.wikimedia.org/r/1028898 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[20:48:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:57:20] <wikibugs>	 (CR) Zabe: [C:+2] Use OpenSSL for PBKDF2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/842522 (https://phabricator.wikimedia.org/T320929) (owner: PleaseStand)
[20:58:09] <wikibugs>	 (Merged) jenkins-bot: Use OpenSSL for PBKDF2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/842522 (https://phabricator.wikimedia.org/T320929) (owner: PleaseStand)
[20:58:41] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:842522|Use OpenSSL for PBKDF2 password hashing (T320929)]]
[20:58:44] <stashbot>	 T320929: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929
[21:01:15] <logmsgbot>	 !log zabe@deploy1002 zabe and ki: Backport for [[gerrit:842522|Use OpenSSL for PBKDF2 password hashing (T320929)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:01:56] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Upstream: Allow 'rel="me"' in Postorius list info links to verify Mastodon link on wikis.world - https://phabricator.wikimedia.org/T364402#9779144 (GreenReaper) [Opened #305 in pypa/readme_renderer](https://github.com/pypa/readme_renderer/issues/305) for this issue. If t...
[21:02:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P61986 and previous config saved to /var/cache/conftool/dbconfig/20240507-210209-ladsgroup.json
[21:03:23] <logmsgbot>	 !log zabe@deploy1002 zabe and ki: Continuing with sync
[21:05:10] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2429 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:05:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T352010)', diff saved to https://phabricator.wikimedia.org/P61987 and previous config saved to /var/cache/conftool/dbconfig/20240507-210556-ladsgroup.json
[21:05:59] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:15:56] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:842522|Use OpenSSL for PBKDF2 password hashing (T320929)]] (duration: 17m 14s)
[21:16:00] <stashbot>	 T320929: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929
[21:17:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P61988 and previous config saved to /var/cache/conftool/dbconfig/20240507-211717-ladsgroup.json
[21:17:20] <wikibugs>	 (CR) Zabe: [C:+2] Revert "beta: Use OpenSSL for PBKDF2 password hashing" [mediawiki-config] - https://gerrit.wikimedia.org/r/1028760 (owner: Zabe)
[21:18:08] <wikibugs>	 (Merged) jenkins-bot: Revert "beta: Use OpenSSL for PBKDF2 password hashing" [mediawiki-config] - https://gerrit.wikimedia.org/r/1028760 (owner: Zabe)
[21:21:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P61989 and previous config saved to /var/cache/conftool/dbconfig/20240507-212103-ladsgroup.json
[21:22:03] <wikibugs>	 (CR) Pppery: [C:+1] "Seems reasonable to me." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: Aklapper)
[21:25:56] <wikibugs>	 (CR) Pppery: [C:+1] Make Translations extension work with upstream Phorge (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: Aklapper)
[21:28:13] <wikibugs>	 (PS3) Volans: reqconfig: add command to search IP in ipblocks [software/conftool] - https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423)
[21:29:49] <wikibugs>	 (PS3) Volans: sre.ganeti.makevm: add logging message [cookbooks] - https://gerrit.wikimedia.org/r/1028867
[21:29:58] <wikibugs>	 (CR) Volans: [C:+2] sre.ganeti.makevm: add logging message [cookbooks] - https://gerrit.wikimedia.org/r/1028867 (owner: Volans)
[21:30:42] <wikibugs>	 (CR) Volans: "All tests should work now, at least they work locally :) Thanks for the patch to support v2, that helped." [software/conftool] - https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: Volans)
[21:32:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T352010)', diff saved to https://phabricator.wikimedia.org/P61990 and previous config saved to /var/cache/conftool/dbconfig/20240507-213227-ladsgroup.json
[21:32:32] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:33:44] <wikibugs>	 (Merged) jenkins-bot: sre.ganeti.makevm: add logging message [cookbooks] - https://gerrit.wikimedia.org/r/1028867 (owner: Volans)
[21:33:56] <wikibugs>	 (CR) Volans: confd: confd-lint-wrap ignores positional args separator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[21:36:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P61991 and previous config saved to /var/cache/conftool/dbconfig/20240507-213614-ladsgroup.json
[21:51:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T352010)', diff saved to https://phabricator.wikimedia.org/P61992 and previous config saved to /var/cache/conftool/dbconfig/20240507-215122-ladsgroup.json
[21:51:27] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:54:35] <wikibugs>	 (CR) Scott French: "Thanks, Riccardo!" [puppet] - https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[22:12:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:18:43] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French)
[22:19:02] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1028898 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French)
[22:24:24] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French)
[22:24:54] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French)
[22:58:44] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "LGTM! So, this would supersede [0], right? (e.g., benefits from being directly integrated with the tool, provides a bit more detail on mat" [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans)
[23:18:26] <wikibugs>	 (PS1) Scott French: benthos: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T362978)
[23:18:28] <wikibugs>	 (PS1) Scott French: blubberoid: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1028911 (https://phabricator.wikimedia.org/T362978)
[23:25:22] <wikibugs>	 (CR) Scott French: "Again continuing alphabetically, though skipping aqs-http-gateway for the moment, as there's a bit of coordination needed. Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[23:32:03] <wikibugs>	 (CR) Scott French: "Thanks for the review, Janis, and for the tip, Hugh. That should be sufficient, as this change should ideally break things in a non-subtle" [deployment-charts] - https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[23:35:41] <jinxer-wm>	 FIRING: ProbeDown: Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:38:11] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1028926
[23:38:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1028926 (owner: TrainBranchBot)
[23:40:41] <jinxer-wm>	 RESOLVED: ProbeDown: Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:52:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:55:26] <swfrench-wmf>	 FWIW, the kubemaster2001 probe failures above look like another instance of https://phabricator.wikimedia.org/T358936 (certificates refreshed during puppet run on kubemaster2002 just prior)