[00:17:57] !log Adding a logger processor to the `parse_ncredir_log_format` on `ncredir2001` to examine the JSON structure - T364354 [00:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:00] T364354: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy - https://phabricator.wikimedia.org/T364354 [00:35:52] I can't find the root cause, but I think it has to do with the grok processor. [00:36:05] More specifically, incorrect parsing of its JSON output. [00:42:35] !log Writing output to `/tmp/benthos_output.txt` shows that the grok processor's output is being parsed correctly - T364354 [00:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:40] T364354: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy - https://phabricator.wikimedia.org/T364354 [00:46:36] I'm not sure how to debug this further. My hypotheses about what could be causing the issue are incorrect: JSON parsing is correct and the path queries are also correct. [00:47:04] !log Reverting debug changes to their previous state - T364354 [00:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:57] denisse: thank you for looking into it <3. Traffic can take it up tomorrow! [00:51:16] sukhe: Running yamllint on the config file doesn't show any breaking errors; however, I'm curious about this line: `root = this.message`. I wonder if it should be `root: this.message`. [00:51:40] I think that would be the only "error" in the YAML file.
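Editorial note on the `root = this.message` question: a minimal sketch, assuming the line sits inside a Bloblang mapping in the Benthos config. The fragment below is hypothetical and is not the actual `ncredir2001` configuration; the grok expression is a placeholder, and older Benthos releases name the `mapping` processor `bloblang`.

```yaml
# Hypothetical sketch only, not the real parse_ncredir_log_format config.
pipeline:
  processors:
    - grok:
        expressions:
          - '%{COMMONAPACHELOG}'   # placeholder pattern, assumed for illustration
    - mapping: |
        # Everything inside this block scalar is Bloblang, not YAML.
        # YAML sees it as one multi-line string, so `=` is the right operator here.
        root = this.message
```

Because the YAML parser treats the whole mapping body as a string, yamllint has nothing to flag there, and writing `root: this.message` would instead be a Bloblang syntax error.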
[00:53:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:53:31] denisse: I am not a benthos expert by any means, but I have seen this enough times in fabfur's patches that I think this is correct (the benthos format, that is): https://www.benthos.dev/docs/guides/bloblang/walkthrough/ [00:56:10] sukhe: That makes sense, thank you. Looking at the benthos logs, it seems like bloblang can indeed be used in YAML files, so that shouldn't be the issue. [00:57:47] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:00:31] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 203 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:02:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 176 probes of 732 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:02:49] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.63 ms [01:03:50] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T364358 (10phaultfinder) 03NEW [01:05:27] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 39 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:07:19] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 38 probes of 732 (alerts on 90) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:08:09] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.4 [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1027481 (https://phabricator.wikimedia.org/T361398) [01:08:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.4 [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1027481 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [01:21:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:26:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:28:34] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.4 [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1027481 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [01:30:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:30:56] 10SRE-Access-Requests, 06Movement-Insights: Restore nshahquinn-wmf and hghani to analytics-product-users - https://phabricator.wikimedia.org/T364359 (10nshahquinn-wmf) 03NEW [01:32:00] 10SRE-Access-Requests, 06Movement-Insights: 
Restore nshahquinn-wmf and hghani to analytics-product-users - https://phabricator.wikimedia.org/T364359#9776333 (10nshahquinn-wmf) I didn't file separate tickets for each one of us since this is really a bug fix rather than a new request for permissions, but I'm hap... [01:35:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0200) [02:36:27] FIRING: [6x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0300) [03:00:13] FIRING: [6x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:35] (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028611 (https://phabricator.wikimedia.org/T361398) [03:01:36] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028611 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [03:02:16] (03CR) 10CI reject: [V:04-1] testwikis wikis to 1.43.0-wmf.4 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028611 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [03:03:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0400) [04:04:53] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.1, 1.43.0-wmf.2 (duration: 04m 50s) [04:25:33] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (install7001), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:00:09] RECOVERY - MD RAID on mw2382 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [05:13:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0600). [06:01:20] (03CR) 10Hashar: "I guess I will send our customizations to upstream so we don't have to carry them over :)" [puppet] - 10https://gerrit.wikimedia.org/r/1027726 (owner: 10Paladox) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:14:18] (03CR) 10Kevin Bazira: [C:03+1] ml-services: tune autoscaling for damaging, goodfaith and reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028552 (https://phabricator.wikimedia.org/T363336) (owner: 10Elukey) [06:20:06] (03CR) 10Hashar: "I have proposed changes upstream to get rid of our templates customization:" [puppet] - 10https://gerrit.wikimedia.org/r/1027726 (owner: 10Paladox) [06:37:32] (03CR) 10Slyngshede: pcc: fix delete-canceled-pcc-run-dirs timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) (owner: 10JHathaway) [07:00:05] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:01:27] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:10:26] (03PS1) 10Muehlenhoff: Revert "hiera: update installserver for magru" [puppet] - 10https://gerrit.wikimedia.org/r/1028707 (https://phabricator.wikimedia.org/T364016) [07:11:20] (03PS1) 10Muehlenhoff: Revert "sites: update installserver for magru" [homer/public] - 10https://gerrit.wikimedia.org/r/1028709 [07:18:39] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1028707 (https://phabricator.wikimedia.org/T364016) (owner: 10Muehlenhoff) [07:20:15] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1028709 (owner: 10Muehlenhoff) [07:20:56] urbanecm: and updates from cswiki WP on, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1025300 AFAIK, it was asked by community only. [07:21:03] (03CR) 10Cathal Mooney: [C:03+2] Revert "sites: update installserver for magru" [homer/public] - 10https://gerrit.wikimedia.org/r/1028709 (owner: 10Muehlenhoff) [07:21:44] (03Merged) 10jenkins-bot: Revert "sites: update installserver for magru" [homer/public] - 10https://gerrit.wikimedia.org/r/1028709 (owner: 10Muehlenhoff) [07:22:41] (03PS11) 10Sohom Datta: [ruwiki] Limit the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [07:25:09] (03CR) 10Muehlenhoff: [C:03+2] Revert "hiera: update installserver for magru" [puppet] - 10https://gerrit.wikimedia.org/r/1028707 (https://phabricator.wikimedia.org/T364016) (owner: 10Muehlenhoff) [07:26:59] hi. 
who is the deployer now [07:31:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install7001.wikimedia.org with OS bullseye [07:32:26] (03CR) 10Volans: [C:03+1] pcc: fix delete-canceled-pcc-run-dirs timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) (owner: 10JHathaway) [07:35:16] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9776549 (10hashar) [07:39:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host install7001.wikimedia.org with OS bullseye [07:40:53] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install7001.wikimedia.org [07:41:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:42:42] (03PS1) 10Muehlenhoff: Revert "Enable install7001 as webproxy in magru" [dns] - 10https://gerrit.wikimedia.org/r/1028754 [07:45:07] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:46:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:48:28] (03CR) 10Muehlenhoff: [C:03+2] Revert "Enable install7001 as webproxy in magru" [dns] - 10https://gerrit.wikimedia.org/r/1028754 (owner: 10Muehlenhoff) [07:50:03] jouncebot: nowandnext [07:50:03] For the next 0 hour(s) and 9 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T0700) [07:50:03] In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1000) [07:50:23] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [07:51:09] (03CR) 10Zabe: [C:03+2] Stop setting wgPasswordDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027335 (owner: 10Zabe) [07:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:52:21] (03PS3) 10Zabe: Stop setting wgPasswordDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027335 [07:52:28] (03CR) 10Zabe: Stop setting wgPasswordDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027335 (owner: 10Zabe) [07:52:30] (03CR) 10Zabe: [C:03+2] Stop setting wgPasswordDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027335 (owner: 10Zabe) [07:53:14] (03Merged) 10jenkins-bot: Stop setting wgPasswordDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027335 (owner: 10Zabe) [07:58:28] (03PS1) 10Zabe: Use OpenSSL for PBKDF2 password hashing on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028756 (https://phabricator.wikimedia.org/T320929) [07:59:03] (03CR) 10Brouberol: [C:03+1] Add the wmf-java-cacerts truststore to all remaining airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181) (owner: 10Btullis) [07:59:11] (03CR) 10CI reject: [V:04-1] Use OpenSSL for PBKDF2 password hashing on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028756 (https://phabricator.wikimedia.org/T320929) (owner: 10Zabe) [08:00:07] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1027335|Stop setting 
wgPasswordDefault]] [08:00:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:00:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:00:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install7001.wikimedia.org [08:02:08] (03PS2) 10Zabe: Use OpenSSL for PBKDF2 password hashing on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028756 (https://phabricator.wikimedia.org/T320929) [08:02:46] !log zabe@deploy1002 zabe: Backport for [[gerrit:1027335|Stop setting wgPasswordDefault]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:02:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install7001.wikimedia.org [08:02:52] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:03:17] !log zabe@deploy1002 zabe: Continuing with sync [08:06:11] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install7001.wikimedia.org - jmm@cumin2002" [08:06:43] (03CR) 10Slyngshede: [C:03+2] P:trafficserver::backend add cloudtestidm [puppet] - 10https://gerrit.wikimedia.org/r/1026790 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [08:07:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install7001.wikimedia.org - jmm@cumin2002" [08:07:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:07:04] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install7001.wikimedia.org on all recursors [08:07:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
install7001.wikimedia.org on all recursors [08:07:36] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install7001.wikimedia.org - jmm@cumin2002" [08:07:46] (03PS1) 10Brouberol: elasticsearch: defaut to rolling restarting a single node at a time [cookbooks] - 10https://gerrit.wikimedia.org/r/1028757 (https://phabricator.wikimedia.org/T362534) [08:08:10] (03CR) 10Gehel: [C:03+1] "LGTM, simple enough" [cookbooks] - 10https://gerrit.wikimedia.org/r/1028757 (https://phabricator.wikimedia.org/T362534) (owner: 10Brouberol) [08:08:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install7001.wikimedia.org - jmm@cumin2002" [08:08:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw2267 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:08:52] PROBLEM - Check whether ferm is active by checking the default input chain on mw1457 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:10:02] PROBLEM - Check whether ferm is active by checking the default input chain on parse2020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:10:25] (03CR) 10Btullis: [C:03+1] global_config: Only expose the IP of the analytics meta master [puppet] - 10https://gerrit.wikimedia.org/r/1028486 (https://phabricator.wikimedia.org/T361955) (owner: 10Brouberol) [08:11:04] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1020 is CRITICAL: ERROR ferm input drop default policy not 
set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:11:42] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: Only expose the IP of the analytics meta master [puppet] - 10https://gerrit.wikimedia.org/r/1028486 (https://phabricator.wikimedia.org/T361955) (owner: 10Brouberol) [08:13:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install7001.wikimedia.org with OS bullseye [08:14:27] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] prometheus: use datacenters for snmp_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1028502 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [08:15:32] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1027335|Stop setting wgPasswordDefault]] (duration: 15m 24s) [08:16:01] (03CR) 10Zabe: [C:03+2] Use OpenSSL for PBKDF2 password hashing on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028756 (https://phabricator.wikimedia.org/T320929) (owner: 10Zabe) [08:16:49] (03Merged) 10jenkins-bot: Use OpenSSL for PBKDF2 password hashing on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028756 (https://phabricator.wikimedia.org/T320929) (owner: 10Zabe) [08:17:20] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1028756|Use OpenSSL for PBKDF2 password hashing on testwiki (T320929)]] [08:17:23] T320929: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929 [08:18:52] zabe: ping me when I can deploy something please? 
[08:19:00] sure [08:19:14] (03CR) 10Btullis: [V:03+1 C:03+2] Make caps an optional parameter to the Ceph::Auth::ClientAuth type [puppet] - 10https://gerrit.wikimedia.org/r/1026867 (https://phabricator.wikimedia.org/T364105) (owner: 10Btullis) [08:19:44] !log zabe@deploy1002 zabe: Backport for [[gerrit:1028756|Use OpenSSL for PBKDF2 password hashing on testwiki (T320929)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:22:40] !log zabe@deploy1002 zabe: Continuing with sync [08:23:28] (03PS1) 10Filippo Giunchedi: prometheus: assemble snmp.yml when updating modules [puppet] - 10https://gerrit.wikimedia.org/r/1028759 (https://phabricator.wikimedia.org/T364016) [08:24:46] (03PS2) 10Filippo Giunchedi: prometheus: assemble snmp.yml when updating modules [puppet] - 10https://gerrit.wikimedia.org/r/1028759 (https://phabricator.wikimedia.org/T364016) [08:25:34] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:26:36] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2303/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028759 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [08:26:40] kart_: feel free to go ahead please! 
[08:26:56] (Assuming no one is deploying anything atm) [08:27:18] (03PS5) 10Zabe: Use OpenSSL for PBKDF2 password hashing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842522 (https://phabricator.wikimedia.org/T320929) (owner: 10PleaseStand) [08:27:51] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] prometheus: assemble snmp.yml when updating modules [puppet] - 10https://gerrit.wikimedia.org/r/1028759 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [08:30:19] (03CR) 10Btullis: [C:03+2] Add the wmf-java-cacerts truststore to all remaining airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181) (owner: 10Btullis) [08:30:44] (03PS1) 10Zabe: Revert "beta: Use OpenSSL for PBKDF2 password hashing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028760 [08:31:09] (03PS2) 10Zabe: Revert "beta: Use OpenSSL for PBKDF2 password hashing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028760 [08:32:06] (03CR) 10Majavah: [C:03+1] cloudweb: Enable profile::auto_restarts::service for nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/1026459 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:32:10] PROBLEM - Check whether ferm is active by checking the default input chain on mw2429 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:32:12] PROBLEM - MariaDB Replica SQL: analytics-meta-replica on an-mariadb1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for mpic_staging@10.% on query. Default database: mpic_staging. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:32:18] ehm [08:32:45] there was a peak of 564 errors Uncaught MWException: Invalid IP given in XFF ... 
[08:33:51] (03CR) 10Majavah: "I assume this is not wanted for the non-Wikitech appserver fleet?" [puppet] - 10https://gerrit.wikimedia.org/r/1026453 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:34:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install7001.wikimedia.org with reason: host reimage [08:34:20] (03CR) 10Majavah: [C:03+1] Enable profile::auto_restarts::service for docker/containerd [puppet] - 10https://gerrit.wikimedia.org/r/1026451 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:34:39] apparently not related to my patch [08:34:43] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1028756|Use OpenSSL for PBKDF2 password hashing on testwiki (T320929)]] (duration: 17m 22s) [08:34:45] T320929: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929 [08:34:58] taavi: I'm done [08:35:16] ack, thanks [08:35:21] !log installing glibc security updates on buster [08:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:10] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028383 (owner: 10JMeybohm) [08:36:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023432 (owner: 10Majavah) [08:36:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install7001.wikimedia.org with reason: host reimage [08:37:33] (03Merged) 10jenkins-bot: wikitech: Also disable password changes when logged-in [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023432 (owner: 10Majavah) [08:37:50] !log taavi@deploy1002 Started scap: Backport for [[gerrit:1023432|wikitech: Also disable password changes when logged-in]] [08:38:20] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [08:38:47] RECOVERY - Check whether ferm is active by checking the default input chain on mw2267 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:38:53] RECOVERY - Check whether ferm is active by checking the default input chain on mw1457 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:40:01] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:40:01] RECOVERY - Check whether ferm is active by checking the default input chain on parse2020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:40:20] !log taavi@deploy1002 taavi: Backport for [[gerrit:1023432|wikitech: Also disable password changes when logged-in]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:40:27] (03CR) 10Filippo Giunchedi: [C:03+1] confd: prom exporter uses resource name to find state file [puppet] - 10https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [08:41:03] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:41:45] !log taavi@deploy1002 taavi: Continuing with sync [08:45:02] (03PS1) 10Stevemunene: Move datahub and datahub-staging helfile deployments to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) [08:50:14] (03CR) 10Muehlenhoff: [C:03+2] cloudweb: Enable profile::auto_restarts::service for nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/1026459 
(https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:53:41] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1023432|wikitech: Also disable password changes when logged-in]] (duration: 15m 50s) [08:54:05] I'm also done deploying [08:59:22] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [08:59:45] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:00:20] !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [09:00:42] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:01:01] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:01:07] !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:01:19] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [09:01:32] (03CR) 10Klausman: [C:03+1] amd-pytorch21: fix the ROCm version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1028519 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [09:01:33] 10SRE-swift-storage, 06Commons: Commons: File:Gnome-edit-delete.svg not found - https://phabricator.wikimedia.org/T363995#9776697 (10jcrespo) >>! In T363995#9775321, @jcrespo wrote: > [2024-05-06 14:33:33,903] INFO:backup '9/96/Gnome-edit-delete.svg' downloaded > [2024-05-06 14:33:33,904] INFO:backup sha256 su... 
[09:02:11] RECOVERY - Check whether ferm is active by checking the default input chain on mw2429 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:02:24] (03PS1) 10Muehlenhoff: query_sever::deploy::manual: Remove obsolete class [puppet] - 10https://gerrit.wikimedia.org/r/1028763 (https://phabricator.wikimedia.org/T316876) [09:02:25] (03PS1) 10Muehlenhoff: Remove obsolete Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1028764 (https://phabricator.wikimedia.org/T316876) [09:02:25] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [09:02:31] (03CR) 10Klausman: [C:03+1] ml-services: tune autoscaling for damaging, goodfaith and reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028552 (https://phabricator.wikimedia.org/T363336) (owner: 10Elukey) [09:02:33] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [09:03:12] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [09:03:58] hi there, does anyone else need to run backports? 
train presync failed last night and I would like to re-run it [09:05:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install7001.wikimedia.org with OS bullseye [09:05:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install7001.wikimedia.org [09:10:01] ok, will do that now [09:10:26] (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028765 (https://phabricator.wikimedia.org/T361398) [09:10:27] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028765 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [09:10:30] (03PS1) 10Muehlenhoff: Reapply "sites: update installserver for magru" [homer/public] - 10https://gerrit.wikimedia.org/r/1028766 [09:11:10] (03Merged) 10jenkins-bot: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028765 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [09:11:25] (03PS1) 10Muehlenhoff: Reapply "hiera: update installserver for magru" [puppet] - 10https://gerrit.wikimedia.org/r/1028767 [09:11:35] (03CR) 10JMeybohm: [C:03+1] amd-pytorch21: fix the ROCm version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1028519 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [09:11:39] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.4 refs T361398 [09:11:42] T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398 [09:12:51] (03Abandoned) 10Zabe: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028611 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [09:13:40] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:16:01] (03CR) 10JMeybohm: [C:03+1] apertium: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028604 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:16:16] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for docker/containerd [puppet] - 10https://gerrit.wikimedia.org/r/1026451 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:21:23] (03CR) 10JMeybohm: [C:03+1] api-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:21:52] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1157.eqiad.wmnet [09:22:52] (03PS1) 10Muehlenhoff: Switch db1157 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028769 (https://phabricator.wikimedia.org/T349619) [09:23:13] RECOVERY - MariaDB Replica SQL: analytics-meta-replica on an-mariadb1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:25:27] (03CR) 10Cathal Mooney: [C:03+1] Reapply "hiera: update installserver for magru" [puppet] - 10https://gerrit.wikimedia.org/r/1028767 (owner: 10Muehlenhoff) [09:25:31] (03CR) 10Cathal Mooney: [C:03+1] Reapply "sites: update installserver for magru" [homer/public] - 10https://gerrit.wikimedia.org/r/1028766 (owner: 10Muehlenhoff) [09:26:56] (03CR) 10Muehlenhoff: [C:03+2] Switch db1157 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028769 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:27:31] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (install7001), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:31:32] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.puppet.migrate-host (exit_code=0) for host db1157.eqiad.wmnet [09:31:47] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1166.eqiad.wmnet [09:32:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [09:32:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [09:33:02] (03CR) 10Hnowlan: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [09:33:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T352010)', diff saved to https://phabricator.wikimedia.org/P61981 and previous config saved to /var/cache/conftool/dbconfig/20240507-093302-ladsgroup.json [09:33:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:33:18] (03PS1) 10Muehlenhoff: Switch db1166 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028772 (https://phabricator.wikimedia.org/T349619) [09:35:54] (03CR) 10Muehlenhoff: [C:03+2] Reapply "hiera: update installserver for magru" [puppet] - 10https://gerrit.wikimedia.org/r/1028767 (owner: 10Muehlenhoff) [09:36:08] !log brouberol@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:36:31] !log brouberol@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:36:32] (03CR) 10Muehlenhoff: [C:03+2] Switch db1166 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028772 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:36:58] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:37:31] (03PS1) 10Btullis: Fix the cephosd dse-k8s-csi user caps [puppet] - 10https://gerrit.wikimedia.org/r/1028773 (https://phabricator.wikimedia.org/T327259) [09:37:37] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[09:37:59] !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:38:17] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1028773 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:38:33] !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:38:34] (03CR) 10Filippo Giunchedi: "See inline, also let' split this in two patches: one for titan and one for thanos frontend" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [09:38:56] !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:39:16] (03CR) 10Filippo Giunchedi: [C:03+2] "I'd rather not have puppet restart prometheus by itself since it can take a long time and tends to be distructive" [puppet] - 10https://gerrit.wikimedia.org/r/1025682 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [09:39:17] !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[09:39:33] (03PS1) 10Slyngshede: P:trafficserver::backend Fix URL for CloudIDM [puppet] - 10https://gerrit.wikimedia.org/r/1028774 [09:40:20] (03CR) 10Muehlenhoff: [C:03+2] Reapply "sites: update installserver for magru" [homer/public] - 10https://gerrit.wikimedia.org/r/1028766 (owner: 10Muehlenhoff) [09:40:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1166.eqiad.wmnet [09:41:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1189.eqiad.wmnet [09:42:35] (03PS1) 10Muehlenhoff: Switch db1189 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028776 (https://phabricator.wikimedia.org/T349619) [09:43:23] (03PS1) 10Slyngshede: Revert "P:trafficserver::backend add cloudtestidm" [puppet] - 10https://gerrit.wikimedia.org/r/1028574 [09:43:38] (03PS1) 10Ladsgroup: Stop writing to old columns of pagelinks in most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028778 (https://phabricator.wikimedia.org/T352010) [09:45:34] (03PS2) 10Ladsgroup: Stop writing to old columns of pagelinks in most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028778 (https://phabricator.wikimedia.org/T352010) [09:46:43] (03CR) 10Muehlenhoff: [C:03+1] "Doh :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1028774 (owner: 10Slyngshede) [09:49:45] !log filippo@cumin1002 START - Cookbook sre.hosts.decommission for hosts prometheus7001.magru.wmnet [09:50:33] (03CR) 10Hnowlan: [C:03+1] ratelimit: Update ratelimit service to git 3fcc360 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1028532 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [09:52:26] (03CR) 10Muehlenhoff: [C:03+2] Switch db1189 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028776 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:54:02] (03CR) 10JMeybohm: [V:03+2 C:03+2] ratelimit: Update ratelimit service to git 3fcc360 [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/1028532 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [09:54:27] !log filippo@cumin1002 START - Cookbook sre.dns.netbox [09:55:17] !log jnuche@deploy1002 sync-world aborted: testwikis wikis to 1.43.0-wmf.4 refs T361398 (duration: 43m 38s) [09:55:21] T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398 [09:55:47] that was an accident, I need to rerun... [09:56:43] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.4 refs T361398 [09:56:46] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9776836 (10Ladsgroup) We already have numbers for those and they look not great for the switch: see T360589 and T211661#8377883 [09:56:49] (03CR) 10Stevemunene: [C:03+1] Fix the cephosd dse-k8s-csi user caps [puppet] - 10https://gerrit.wikimedia.org/r/1028773 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:57:01] (03CR) 10Btullis: [C:03+2] Fix the cephosd dse-k8s-csi user caps [puppet] - 10https://gerrit.wikimedia.org/r/1028773 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:57:52] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1002" [09:58:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1189.eqiad.wmnet [09:58:37] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1000) [10:01:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling 
restart_daemons on A:docker-registry [10:04:17] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2021 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:05:19] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2035 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:08:07] PROBLEM - Check whether ferm is active by checking the default input chain on mw2381 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:08:52] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1198.eqiad.wmnet [10:09:35] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw [10:12:21] (03PS1) 10Muehlenhoff: Switch db1198 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028783 (https://phabricator.wikimedia.org/T349619) [10:12:48] (03CR) 10JMeybohm: [C:03+1] admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [10:13:56] (03CR) 10Muehlenhoff: [C:03+2] Switch db1198 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028783 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:14:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [10:15:17] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1021892 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [10:15:30] !log jmm@cumin2002 START - Cookbook 
sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [10:16:05] !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.4 refs T361398 (duration: 19m 22s) [10:16:08] T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398 [10:20:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [10:21:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1198.eqiad.wmnet [10:22:10] (03PS1) 10Slyngshede: P:idm Fix certificate name [puppet] - 10https://gerrit.wikimedia.org/r/1028785 (https://phabricator.wikimedia.org/T362128) [10:23:02] (03CR) 10Hnowlan: "No swagger spec for the gateways unfortunately, but checking something like https://staging.svc.eqiad.wmnet:8087/core/v1/wikipedia/en/page" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [10:25:29] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot-master rolling restart_daemons on A:maps-master [10:25:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:25:55] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1223.eqiad.wmnet [10:26:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:26:37] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:26:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:27:01] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.maps.roll-restart-reboot-master (exit_code=0) rolling restart_daemons on A:maps-master [10:28:07] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1002" [10:28:07] !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:28:08] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus7001.magru.wmnet [10:29:06] (03PS1) 10Muehlenhoff: Switch db1223 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028786 (https://phabricator.wikimedia.org/T349619) [10:32:32] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1028785 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [10:32:38] (03CR) 10Muehlenhoff: [C:03+2] Switch db1223 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028786 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:34:17] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2021 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:35:19] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2035 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:37:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1223.eqiad.wmnet [10:38:07] RECOVERY - Check whether ferm is active by checking the default input chain on mw2381 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:40:03] (03PS1) 10Muehlenhoff: chartmuseum: Enable profile::auto_restarts::service for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1028789 
(https://phabricator.wikimedia.org/T135991) [10:42:00] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/1028790 (https://phabricator.wikimedia.org/T135991) [10:45:04] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9776962 (10MoritzMuehlenhoff) [10:46:21] (03PS1) 10Fabfur: cache: Use fifo-log-demux between haproxy and benthos [puppet] - 10https://gerrit.wikimedia.org/r/1028791 (https://phabricator.wikimedia.org/T364379) [10:46:42] (03CR) 10CI reject: [V:04-1] cache: Use fifo-log-demux between haproxy and benthos [puppet] - 10https://gerrit.wikimedia.org/r/1028791 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [10:53:43] (03CR) 10Btullis: [C:03+2] Update prometheus config to reflect matomo profile change [puppet] - 10https://gerrit.wikimedia.org/r/1021892 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [10:56:31] (03PS1) 10Muehlenhoff: parsoid/testing: Enable profile::auto_restarts::service for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1028793 (https://phabricator.wikimedia.org/T135991) [11:01:27] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:01:42] (03PS1) 10Muehlenhoff: ci: Enable profile::auto_restarts::service for docker/containerd [puppet] - 10https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991) [11:04:15] (03PS1) 10Muehlenhoff: ci: Enable profile::auto_restarts::service for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991) [11:05:48] !log depooling 6 codfw appservers in advance of reimaging [11:05:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:28] (03PS3) 10Jforrester: wikifunctions: Enable wasmedge resource limits in evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024767 (owner: 10Cory Massaro) [11:13:44] (03CR) 10Jforrester: [C:03+2] wikifunctions: Enable wasmedge resource limits in evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024767 (owner: 10Cory Massaro) [11:14:34] (03Merged) 10jenkins-bot: wikifunctions: Enable wasmedge resource limits in evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024767 (owner: 10Cory Massaro) [11:15:38] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:16:09] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:17:10] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [11:19:35] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [11:19:53] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [11:20:48] (03PS6) 10JMeybohm: Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) [11:20:54] (03CR) 10Hashar: [C:03+1] "Actually the requests are made to a specific port as seen in the URL: `mw-jobrunner.discovery.wmnet:4448` and the port is thus in the `Ho" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 (owner: 10Hnowlan) [11:21:10] (03CR) 10JMeybohm: Add new chart: ratelimit (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [11:21:37] (03CR) 10Majavah: "Can we use the relatively new `ClusterConfig` class instead?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 (owner: 10Hnowlan) [11:21:44] (03CR) 10CI reject: [V:04-1] Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [11:22:11] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [11:23:32] (03PS7) 10JMeybohm: Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) [11:28:56] (03PS1) 10Muehlenhoff: Stop installing git-fat on Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/1028798 (https://phabricator.wikimedia.org/T364373) [11:29:44] (03CR) 10Dreamrimmer: [ruwiki] Limit the use of the ContentTranslation tool (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [11:30:39] (03PS1) 10Muehlenhoff: query_service: Stop installing git-fat [puppet] - 10https://gerrit.wikimedia.org/r/1028799 (https://phabricator.wikimedia.org/T316876) [11:32:21] (03PS1) 10Muehlenhoff: Stop installing git-fat also on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1028800 (https://phabricator.wikimedia.org/T279509) [11:33:06] (03CR) 10Anzx: [C:03+1] [ruwiki] Limit the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [11:36:25] (03CR) 10Slyngshede: [C:03+2] P:idm Fix certificate name [puppet] - 10https://gerrit.wikimedia.org/r/1028785 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [11:38:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:40:21] I'm getting Wikimedia\Rdbms\DBTransactionSizeError on commons, both with regular edits and file-deletion. [11:42:03] Sporadic [11:42:09] (03PS2) 10Majavah: libraryupgrader: Automatically restart celery processes [puppet] - 10https://gerrit.wikimedia.org/r/1027500 [11:43:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:43:35] (03CR) 10Majavah: [C:03+2] libraryupgrader: Automatically restart celery processes [puppet] - 10https://gerrit.wikimedia.org/r/1027500 (owner: 10Majavah) [11:44:12] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1028600 (https://phabricator.wikimedia.org/T364068) (owner: 10Dzahn) [11:48:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:51:01] (03PS1) 10Slyngshede: P:trafficserver::backend Add cloudtestidm [puppet] - 10https://gerrit.wikimedia.org/r/1028805 (https://phabricator.wikimedia.org/T362128) [11:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:53:52] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2304/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028805 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [11:56:30] (03PS1) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 
10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) [11:56:50] (03CR) 10CI reject: [V:04-1] fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [11:58:17] (03PS2) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) [11:58:41] (03CR) 10CI reject: [V:04-1] fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1200) [12:02:12] !log filippo@cumin1002 START - Cookbook sre.ganeti.makevm for new host prometheus7001.magru.wmnet [12:02:13] !log filippo@cumin1002 START - Cookbook sre.dns.netbox [12:02:44] (03PS3) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) [12:03:31] !log installing ruby3.1 security updates [12:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:02] (03Abandoned) 10Slyngshede: P:trafficserver::backend Fix URL for CloudIDM [puppet] - 10https://gerrit.wikimedia.org/r/1028774 (owner: 10Slyngshede) [12:04:58] (03CR) 10Slyngshede: [V:03+1] "slyngshede@cp1108:~$ curl -i -H "Host: cloudtestidm.wikimedia.org" https://cloudidm2001-dev.codfw.wmnet/accounts/login/ 2>/dev/null |head " [puppet] - 10https://gerrit.wikimedia.org/r/1028805 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [12:05:11] !log btullis@deploy1002 Started deploy [airflow-dags/analytics@e5ba870]: (no justification provided) [12:05:24] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1028805 
(https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [12:05:44] !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@e5ba870]: (no justification provided) (duration: 00m 32s) [12:07:30] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002" [12:08:22] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002" [12:08:22] !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:08:23] !log filippo@cumin1002 START - Cookbook sre.dns.wipe-cache prometheus7001.magru.wmnet on all recursors [12:08:26] !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7001.magru.wmnet on all recursors [12:08:46] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002" [12:09:21] (03PS4) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) [12:09:39] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002" [12:10:29] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus7001.magru.wmnet with OS bullseye [12:12:10] (03PS1) 10Muehlenhoff: Add library hints for ruby3.1 [puppet] - 10https://gerrit.wikimedia.org/r/1028808 [12:12:23] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2305/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [12:13:42] (03PS5) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) [12:15:01] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2306/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [12:16:30] (03PS6) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) [12:17:55] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2307/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [12:19:28] (03CR) 10Muehlenhoff: [C:03+2] Add library hints for ruby3.1 [puppet] - 10https://gerrit.wikimedia.org/r/1028808 (owner: 10Muehlenhoff) [12:26:33] (03CR) 10Vgutierrez: [C:04-1] fifo_log_demux: add new parameters for current release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [12:29:57] (03PS1) 10Brouberol: hadoop secrets: make analytics DB password available to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) [12:32:27] (03PS1) 10Vgutierrez: prometheus::ops: Remove ncredir job [puppet] - 10https://gerrit.wikimedia.org/r/1028818 (https://phabricator.wikimedia.org/T364354) [12:32:54] (03CR) 10Brouberol: [C:03+2] elasticsearch: defaut to rolling 
restarting a single node at a time [cookbooks] - 10https://gerrit.wikimedia.org/r/1028757 (https://phabricator.wikimedia.org/T362534) (owner: 10Brouberol) [12:33:05] (03CR) 10CI reject: [V:04-1] hadoop secrets: make analytics DB password available to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol) [12:35:21] (03CR) 10Ssingh: fifo_log_demux: add new parameters for current release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [12:35:38] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2308/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028818 (https://phabricator.wikimedia.org/T364354) (owner: 10Vgutierrez) [12:36:25] (03CR) 10Vgutierrez: [C:04-1] fifo_log_demux: add new parameters for current release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [12:38:54] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:trafficserver::backend Add cloudtestidm [puppet] - 10https://gerrit.wikimedia.org/r/1028805 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [12:41:10] (03PS2) 10Brouberol: hadoop: make analytics DB password available to analytics-privatedata-user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) [12:44:21] (03CR) 10Muehlenhoff: [C:03+2] Extend cloudbackup Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1024686 (owner: 10Muehlenhoff) [12:46:16] (03CR) 10Hnowlan: "Yep, the port is the issue here - This is scheduled for deploy in the backport window in 15 mins. 
https://gerrit.wikimedia.org/r/c/operati" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 (owner: 10Hnowlan) [12:46:50] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:47:18] (03CR) 10Elukey: [V:03+2 C:03+2] amd-pytorch21: fix the ROCm version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1028519 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [12:48:13] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:49:01] (03PS5) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) [12:51:14] (03PS3) 10Brouberol: hadoop: make analytics DB password available to analytics-privatedata-user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) [12:51:23] (03PS1) 10Muehlenhoff: Fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/1028819 [12:51:58] (03PS7) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) [12:52:22] (03CR) 10CI reject: [V:04-1] fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [12:53:22] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1028818 (https://phabricator.wikimedia.org/T364354) (owner: 10Vgutierrez) [12:53:28] (03PS4) 10Brouberol: hadoop: make analytics DB password available to analytics-privatedata-user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) [12:53:40] (03CR) 10Vgutierrez: [V:03+1 C:03+2] prometheus::ops: Remove ncredir job [puppet] - 10https://gerrit.wikimedia.org/r/1028818 (https://phabricator.wikimedia.org/T364354) (owner: 10Vgutierrez) [12:54:56] (03CR) 
10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2311/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol) [12:57:03] (03CR) 10Muehlenhoff: [C:03+2] Fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/1028819 (owner: 10Muehlenhoff) [12:59:10] (03CR) 10Elukey: [C:03+2] ml-services: tune autoscaling for damaging, goodfaith and reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028552 (https://phabricator.wikimedia.org/T363336) (owner: 10Elukey) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1300). [13:00:05] hnowlan, DreamRimmer, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:10] o/ [13:01:22] I am around [13:01:24] o/ [13:01:48] (03PS8) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) [13:04:09] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2312/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [13:04:26] (03PS1) 10Vgutierrez: ncredir: Remove mtail puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385) [13:04:37] (03CR) 10Fabfur: fifo_log_demux: add new parameters for current release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [13:04:59] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:05:13] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:05:26] (03CR) 10Fabfur: [C:04-2] "Do not merge until fifo-log-demux is upgraded to 0.7.0 on all hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [13:05:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[13:05:40] (03PS2) 10Vgutierrez: ncredir: Remove mtail puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385) [13:10:02] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2313/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385) (owner: 10Vgutierrez) [13:10:13] FIRING: [6x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:11:57] (03PS3) 10Vgutierrez: ncredir: Remove mtail puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385) [13:13:40] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:11] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:16:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:17:11] looking ^ [13:17:18] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:19:01] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:19:41] (03PS3) 10Elukey: role::swift::proxy: simplify hiera configuration for the tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) [13:19:52] (03PS6) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) [13:20:47] no deployers today? [13:21:10] (03CR) 10Elukey: [C:03+2] role::swift::proxy: simplify hiera configuration for the tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [13:21:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:21:27] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:21:43] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:23:21] (03CR) 10Ssingh: [C:03+1] "LGTM! 
Ran PCC on the latest patchset as well (for the removed log.pp) and it looks good https://puppet-compiler.wmflabs.org/output/1028821" [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385) (owner: 10Vgutierrez) [13:25:13] RESOLVED: [4x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:25:23] !log filippo@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7001.magru.wmnet with OS bullseye [13:25:24] !log filippo@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7001.magru.wmnet [13:26:04] (03PS7) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) [13:26:05] (03CR) 10Hnowlan: [C:03+2] kubernetes: add 6 codfw appservers as workers [puppet] - 10https://gerrit.wikimedia.org/r/1026941 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [13:27:07] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 44 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:29:38] !log filippo@cumin1002 START - Cookbook sre.ganeti.makevm for new host prometheus7001.magru.wmnet [13:29:39] !log filippo@cumin1002 START - Cookbook sre.dns.netbox [13:31:41] !log filippo@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:31:46] !log filippo@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7001.magru.wmnet [13:31:58] !log filippo@cumin1002 START - Cookbook sre.hosts.decommission for hosts prometheus7001.magru.wmnet [13:36:00] !log 
filippo@cumin1002 START - Cookbook sre.dns.netbox [13:37:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:38:00] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:40:00] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1002" [13:40:07] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2325.codfw.wmnet with OS bullseye [13:40:08] (03CR) 10Muehlenhoff: [C:03+2] elasticsearch: Remove support for sslcert SSL provider [puppet] - 10https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [13:40:10] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2305.codfw.wmnet with OS bullseye [13:40:12] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2338.codfw.wmnet with OS bullseye [13:40:17] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2407.codfw.wmnet with OS bullseye [13:40:19] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2390.codfw.wmnet with OS bullseye [13:40:34] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2359.codfw.wmnet with OS bullseye [13:40:52] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1002" [13:40:52] !log filippo@cumin1002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [13:40:52] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus7001.magru.wmnet [13:41:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:42:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:42:31] (03PS2) 10Muehlenhoff: Stop supporting sslcert in Profile::Pki::Provider type [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) [13:43:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:43:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:44:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:44:55] !log filippo@cumin1002 START - Cookbook sre.ganeti.makevm for new host prometheus7001.magru.wmnet [13:44:57] !log filippo@cumin1002 START - Cookbook sre.dns.netbox [13:46:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:46:31] ^me, it'll clear in a sec [13:46:54] (03PS1) 10Joal: Update analytics import-mediawiki-dumps [puppet] - 
10https://gerrit.wikimedia.org/r/1028831 [13:47:07] no deployers today? [13:47:07] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 30 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:47:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:47:29] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@ad4934c]: (no justification provided) [13:48:01] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@ad4934c]: (no justification provided) (duration: 00m 32s) [13:48:03] (03CR) 10Vgutierrez: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1028821 (https://phabricator.wikimedia.org/T364385) (owner: 10Vgutierrez) [13:48:21] (03CR) 10Btullis: [C:03+2] Update analytics import-mediawiki-dumps [puppet] - 10https://gerrit.wikimedia.org/r/1028831 (owner: 10Joal) [13:48:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:24] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002" [13:50:13] (03CR) 10Muehlenhoff: [C:03+2] Stop supporting sslcert in Profile::Pki::Provider type [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:50:17] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records 
for VM prometheus7001.magru.wmnet - filippo@cumin1002" [13:50:17] !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:17] !log filippo@cumin1002 START - Cookbook sre.dns.wipe-cache prometheus7001.magru.wmnet on all recursors [13:50:20] !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7001.magru.wmnet on all recursors [13:50:41] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002" [13:51:24] (03PS8) 10JMeybohm: Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) [13:51:32] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002" [13:52:13] PROBLEM - Check whether ferm is active by checking the default input chain on mw2434 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:53:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:53:58] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus7001.magru.wmnet with OS bullseye [13:54:04] (03PS1) 10Elukey: role::ml_k8s::*::worker: use Dragonly for amd-pytorch images [puppet] - 10https://gerrit.wikimedia.org/r/1028833 [13:55:57] PROBLEM - Check whether ferm is active by checking the default input chain on mw2379 is CRITICAL: 
ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:56:23] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2390.codfw.wmnet with reason: host reimage [13:56:27] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2407.codfw.wmnet with reason: host reimage [13:56:28] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2359.codfw.wmnet with reason: host reimage [13:56:31] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2325.codfw.wmnet with reason: host reimage [13:56:49] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2305.codfw.wmnet with reason: host reimage [13:57:39] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2338.codfw.wmnet with reason: host reimage [13:57:59] (03CR) 10Hnowlan: [C:03+1] Make base.certificates compatible with chart modules and scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [13:58:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:58:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:01:38] (03CR) 10Vgutierrez: [C:03+1] "LGTM, we could get rid of $kafka_tls in this CR or in a following one given that we dropped support for non-TLS setups" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:01:58] !log hnowlan@cumin1002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2390.codfw.wmnet with reason: host reimage [14:02:45] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:03:17] !log btullis@deploy1002 Started deploy [airflow-dags/analytics@6be7efd]: (no justification provided) [14:03:45] !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@6be7efd]: (no justification provided) (duration: 00m 27s) [14:04:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2407.codfw.wmnet with reason: host reimage [14:05:38] (03CR) 10Bking: [C:03+1] global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:08:07] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2359.codfw.wmnet with reason: host reimage [14:09:14] PROBLEM - Check whether ferm is active by checking the default input chain on mw2334 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:09:24] (03CR) 10Muehlenhoff: "I'd day let's do a followup and proceed with this change as-is, the non TLS branch has been dead code for > years, a few more days won't m" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:09:27] (03PS1) 10Elukey: ml-services: update hugging face's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028839 (https://phabricator.wikimedia.org/T362984) [14:10:44] !log hnowlan@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2305.codfw.wmnet with reason: host reimage [14:11:43] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@b543b85]: (no justification provided) [14:12:08] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@b543b85]: (no justification provided) (duration: 00m 24s) [14:12:18] (03PS1) 10Btullis: Revert "Update analytics import-mediawiki-dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1028579 [14:12:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:12:41] (03CR) 10Joal: "LGTM! sorry for the noise" [puppet] - 10https://gerrit.wikimedia.org/r/1028579 (owner: 10Btullis) [14:12:49] (03CR) 10Joal: [C:03+1] Revert "Update analytics import-mediawiki-dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1028579 (owner: 10Btullis) [14:13:02] (03PS1) 10Hnowlan: kubernetes: make 5 eqiad api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) [14:13:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:39] !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7001.magru.wmnet with reason: host reimage [14:13:46] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2325.codfw.wmnet with reason: host reimage [14:14:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1026458 (owner: 10Slyngshede) [14:16:47] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7001.magru.wmnet with reason: host reimage [14:17:08] (03CR) 10Elukey: [C:03+2] ml-services: update hugging face's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028839 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [14:17:09] 10ops-codfw, 06SRE: Inbound interface errors - https://phabricator.wikimedia.org/T364358#9777558 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:17:57] (03PS1) 10Hnowlan: mw-web, mw-api-ext: bump replicas in advance of traffic shift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028842 (https://phabricator.wikimedia.org/T362323) [14:19:42] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2338.codfw.wmnet with reason: host reimage [14:19:43] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1028843 [14:20:08] (03PS1) 10Hnowlan: trafficserver: move k8s traffic shift to 90% [puppet] - 10https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323) [14:20:46] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2390.codfw.wmnet with OS bullseye [14:22:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[14:22:12] RECOVERY - Check whether ferm is active by checking the default input chain on mw2434 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:22:38] (03CR) 10Eevans: [C:03+1] Stop installing git-fat on Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/1028798 (https://phabricator.wikimedia.org/T364373) (owner: 10Muehlenhoff) [14:23:24] (03PS34) 10Brouberol: global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) [14:23:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, just keeping the upstream default of 15% seems fine." [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [14:23:46] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2407.codfw.wmnet with OS bullseye [14:23:59] (03CR) 10Muehlenhoff: [C:03+2] Stop installing git-fat on Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/1028798 (https://phabricator.wikimedia.org/T364373) (owner: 10Muehlenhoff) [14:24:52] (03CR) 10Muehlenhoff: [C:03+2] "I'll uninstall git-fat as a followup once Puppet has run on these hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1028798 (https://phabricator.wikimedia.org/T364373) (owner: 10Muehlenhoff) [14:25:58] RECOVERY - Check whether ferm is active by checking the default input chain on mw2379 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:27:44] (03CR) 10Brouberol: [C:03+2] global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:28:57] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2359.codfw.wmnet with OS bullseye [14:30:04] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2305.codfw.wmnet with OS bullseye [14:31:27] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7001.magru.wmnet with OS bullseye [14:31:27] !log filippo@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus7001.magru.wmnet [14:32:35] (03CR) 10CDobbins: [C:03+2] purged: add PKI cert handling (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:33:07] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2325.codfw.wmnet with OS bullseye [14:33:11] (03CR) 10Btullis: [C:03+2] Revert "Update analytics import-mediawiki-dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1028579 (owner: 10Btullis) [14:33:28] (03PS1) 10Hnowlan: kubernetes: make 6 codfw api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) [14:35:08] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:35:48] (03PS2) 10Filippo Giunchedi: grafana: add magru prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1028503 (https://phabricator.wikimedia.org/T364016) [14:35:48] (03PS2) 10Filippo 
Giunchedi: trafficserver: add prometheus-magru.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1028504 (https://phabricator.wikimedia.org/T364016) [14:35:48] (03PS1) 10Filippo Giunchedi: Revert "site: provision prometheus7001 with insetup" [puppet] - 10https://gerrit.wikimedia.org/r/1028848 (https://phabricator.wikimedia.org/T364016) [14:35:59] (03CR) 10Filippo Giunchedi: [C:03+2] wmnet: add prometheus.svc.magru [dns] - 10https://gerrit.wikimedia.org/r/1028464 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [14:36:02] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:36:03] (03PS3) 10Filippo Giunchedi: wmnet: add prometheus.svc.magru [dns] - 10https://gerrit.wikimedia.org/r/1028464 (https://phabricator.wikimedia.org/T364016) [14:36:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:48] (03CR) 10Filippo Giunchedi: [C:03+2] Revert "site: provision prometheus7001 with insetup" [puppet] - 10https://gerrit.wikimedia.org/r/1028848 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [14:37:20] (03CR) 10JHathaway: [C:03+2] postfix: add some recommended hardening settings [puppet] - 10https://gerrit.wikimedia.org/r/1024729 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [14:38:58] jhathaway: I've merged you patch too [14:39:03] 'your patch' even [14:39:14] godog: thanks [14:39:14] RECOVERY - Check whether ferm is active by checking the default input chain on mw2334 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:39:32] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2338.codfw.wmnet with OS bullseye [14:39:50] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] wmnet: 
add prometheus.svc.magru [dns] - 10https://gerrit.wikimedia.org/r/1028464 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [14:40:03] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9777657 (10xcollazo) Hello there. Due to T364250, the host `snapshot1011` will not be running the typical `wikidata... [14:40:53] (03CR) 10Muehlenhoff: [C:03+2] "Wasn't even needed to manually uninstall, when removing the Py2 packages Puppet uninstalled git-fat as well since it depended on Python 2." [puppet] - 10https://gerrit.wikimedia.org/r/1028798 (https://phabricator.wikimedia.org/T364373) (owner: 10Muehlenhoff) [14:41:18] !log running homer 'cr*codfw*' commit to configure BGP for new k8s codfw workers [14:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:33] (03CR) 10JMeybohm: [C:03+1] Enable profile::auto_restarts::service for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/1028790 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:43:38] (03CR) 10JMeybohm: [C:03+1] chartmuseum: Enable profile::auto_restarts::service for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1028789 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:44:39] (03CR) 10JMeybohm: [C:03+1] role::ml_k8s::*::worker: use Dragonly for amd-pytorch images [puppet] - 10https://gerrit.wikimedia.org/r/1028833 (owner: 10Elukey) [14:44:40] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2121.codfw.wmnet [14:44:54] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/1028790 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:44:59] (03CR) 10Muehlenhoff: [C:03+2] chartmuseum: Enable profile::auto_restarts::service for envoyproxy [puppet] - 
10https://gerrit.wikimedia.org/r/1028789 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:46:12] (03PS1) 10Muehlenhoff: Switch db2121 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028850 (https://phabricator.wikimedia.org/T349619) [14:46:35] (03CR) 10Ahmon Dancy: [C:03+1] Stop installing git-fat also on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1028800 (https://phabricator.wikimedia.org/T279509) (owner: 10Muehlenhoff) [14:47:12] (03CR) 10Muehlenhoff: [C:03+2] Switch db2121 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028850 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:50:08] (03CR) 10Muehlenhoff: [C:03+2] Stop installing git-fat also on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1028800 (https://phabricator.wikimedia.org/T279509) (owner: 10Muehlenhoff) [14:50:19] !log silence site=magru alerts during prometheus7001 - T364016 [14:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:24] T364016: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016 [14:50:59] (03CR) 10JMeybohm: [C:04-1] kubernetes: make 5 eqiad api appservers k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:51:09] (03CR) 10JMeybohm: [C:03+1] trafficserver: move k8s traffic shift to 90% [puppet] - 10https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:51:28] (03CR) 10Filippo Giunchedi: [C:03+2] grafana: add magru prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1028503 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [14:51:38] (03PS1) 10Kevin Bazira: admin_ng: add commons host header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027484 (https://phabricator.wikimedia.org/T363449) [14:51:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host 
(exit_code=0) for host db2121.codfw.wmnet [14:52:03] (03CR) 10JMeybohm: [C:03+1] mw-web, mw-api-ext: bump replicas in advance of traffic shift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028842 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:52:59] !log installing mariadb-10.5 security updates (as packaged in Debian, not the wmf-mariadb packages) [14:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:07] !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2305.codfw.wmnet|mw2325.codfw.wmnet|mw2338.codfw.wmnet|mw2359.codfw.wmnet|mw2390.codfw.wmnet|mw2407.codfw.wmnet),cluster=kubernetes,service=kubesvc [14:53:11] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2122.codfw.wmnet [14:53:50] !log A:cp and A:magru: running haproxy-restart [14:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:15] (03PS2) 10Hnowlan: kubernetes: make 5 eqiad api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) [14:54:28] (03CR) 10Hnowlan: kubernetes: make 5 eqiad api appservers k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:54:39] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1009.eqiad.wmnet [14:54:40] (03CR) 10Filippo Giunchedi: [C:03+2] "Service is up, proceeding" [puppet] - 10https://gerrit.wikimedia.org/r/1028504 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [14:54:51] (03CR) 10JMeybohm: [C:04-1] kubernetes: make 6 codfw api appservers k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [14:54:53] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - 
https://phabricator.wikimedia.org/T355914#9777706 (10Nosferattus) @Ladsgroup: Please excuse me if I'm wrong, but I don't see how those statistics are related to what I suggested. I read those stat... [14:55:03] !log depool ms-fe1009's nginx (swift proxy) to safely apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026927 [14:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:08] (03PS1) 10Muehlenhoff: Switch db2122 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028852 (https://phabricator.wikimedia.org/T349619) [14:56:42] (03CR) 10Elukey: [C:03+2] role::ml_k8s::*::worker: use Dragonly for amd-pytorch images [puppet] - 10https://gerrit.wikimedia.org/r/1028833 (owner: 10Elukey) [15:00:05] eoghan, jelto, arnoldokoth, and mutante: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1500). [15:00:46] (03CR) 10Elukey: [C:03+1] Make base.certificates compatible with chart modules and scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:01:36] (03CR) 10JMeybohm: [C:03+2] Make base.certificates compatible with chart modules and scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:01:40] (03CR) 10JMeybohm: [C:03+2] New version of base.certificates module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026859 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:02:31] (03Merged) 10jenkins-bot: New version of base.certificates module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026859 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:02:35] (03Merged) 10jenkins-bot: Make base.certificates compatible with chart modules and scaffold [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:03:18] (03CR) 10Klausman: [C:03+1] admin_ng: add commons host header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027484 (https://phabricator.wikimedia.org/T363449) (owner: 10Kevin Bazira) [15:03:36] (03CR) 10Elukey: [C:03+2] Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [15:05:07] (03PS2) 10Hnowlan: kubernetes: make 6 codfw api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) [15:05:15] (03CR) 10Hnowlan: kubernetes: make 6 codfw api appservers k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [15:07:44] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9777731 (10MoritzMuehlenhoff) [15:08:03] (03CR) 10CI reject: [V:04-1] kubernetes: make 6 codfw api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [15:09:09] (03PS1) 10Muehlenhoff: Reapply "Enable install7001 as webproxy in magru" [dns] - 10https://gerrit.wikimedia.org/r/1028853 [15:09:15] (03CR) 10Muehlenhoff: [C:03+2] Switch db2122 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028852 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:09:48] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1028854 [15:10:38] (03CR) 10Muehlenhoff: [C:03+2] Reapply "Enable install7001 as webproxy in magru" [dns] - 10https://gerrit.wikimedia.org/r/1028853 (owner: 10Muehlenhoff) [15:12:29] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1009.eqiad.wmnet [15:12:49] !log repool ms-fe1009's envoy with PKI TLS cert 
[15:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:57] (03PS3) 10Hnowlan: kubernetes: make 6 codfw api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) [15:13:17] !log remove accidentally set site!=magru silence, add site=magru silence instead - T364016 [15:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:20] T364016: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016 [15:14:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2122.codfw.wmnet [15:14:24] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9777740 (10elukey) ms-fe1009's envoy migrated to PKI! We'll wait a couple of days before proceeding with either eqiad or codfw. [15:15:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:23] mmhh grafana down, taking a look [15:18:31] (03PS1) 10Zabe: hieradata: Add itwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) [15:18:57] (03CR) 10Zabe: hieradata: Add itwiki to private wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [15:19:08] actually nevermind it was restarting for configuration change [15:19:31] (03PS2) 10Zabe: hieradata: Add arbcom_itwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) [15:19:38] !log imported nodejs 20.5.1-deb-1nodesource1 to thirdparty/node20 T362681 [15:19:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:41] T362681: Provide nodejs20 base images for production - https://phabricator.wikimedia.org/T362681 [15:20:44] 06SRE, 06collaboration-services, 06serviceops: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415#9777749 (10LSobanski) p:05Triage→03Medium [15:22:25] (03CR) 10JMeybohm: [C:03+1] kubernetes: make 5 eqiad api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [15:22:47] (03CR) 10JMeybohm: [C:03+1] kubernetes: make 6 codfw api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [15:24:43] (03PS1) 10Zabe: Add Apache configuration for wikipedia-it-arbcom.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1028858 (https://phabricator.wikimedia.org/T363825) [15:25:50] jouncebot: nowandnext [15:25:50] For the next 0 hour(s) and 34 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1500) [15:25:50] In 0 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1600) [15:26:52] (03CR) 10Bking: [C:03+1] aliases: add datacenter-scoped cumin aliases for flink zk ensembles [puppet] - 10https://gerrit.wikimedia.org/r/1028520 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [15:27:43] (03PS4) 10Herron: pyrra: varnish: workaround site grouping limitation [puppet] - 10https://gerrit.wikimedia.org/r/1028854 (https://phabricator.wikimedia.org/T302995) [15:27:56] (03CR) 10Herron: [C:03+2] pyrra: varnish: workaround site grouping limitation [puppet] - 10https://gerrit.wikimedia.org/r/1028854 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:29:10] !log depooling 5 eqiad api appservers in advance of 
reimaging to k8s workers [15:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:00] (03CR) 10Ladsgroup: [C:03+2] Stop writing to old columns of pagelinks in most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028778 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [15:31:14] (03CR) 10Brouberol: [C:03+2] aliases: add datacenter-scoped cumin aliases for flink zk ensembles [puppet] - 10https://gerrit.wikimedia.org/r/1028520 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [15:31:43] (03PS1) 10Elukey: role::swift::proxy: move eqiad envoys to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1028859 (https://phabricator.wikimedia.org/T356412) [15:31:46] (03Merged) 10jenkins-bot: Stop writing to old columns of pagelinks in most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028778 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [15:31:51] (03PS1) 10Elukey: role::swift::proxy: move codfw envoys to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) [15:32:32] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] [15:32:38] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:32:39] T299947: Normalize pagelinks table - https://phabricator.wikimedia.org/T299947 [15:33:22] (03CR) 10Volans: aliases: add datacenter-scoped cumin aliases for flink zk ensembles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028520 (https://phabricator.wikimedia.org/T363975) (owner: 10Brouberol) [15:35:31] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028859 (https://phabricator.wikimedia.org/T356412) (owner: 
10Elukey) [15:37:10] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [15:38:11] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:38:15] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:38:16] T299947: Normalize pagelinks table - https://phabricator.wikimedia.org/T299947 [15:38:30] (03CR) 10Elukey: role::swift::proxy: move codfw envoys to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [15:41:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1028859 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [15:41:52] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [15:45:52] 06SRE-OnFire, 10Data-Platform-SRE (2024.05.06 - 2024.05.26), 03Discovery-Search (Current work), 10Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9777868 (10Gehel) 05Open→03Resolved Subtasks... 
[15:52:07] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [15:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:57:18] PROBLEM - Check whether ferm is active by checking the default input chain on parse1024 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:57:37] (03CR) 10Dzahn: [C:03+2] admin: add linafaridwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1028600 (https://phabricator.wikimedia.org/T364068) (owner: 10Dzahn) [15:57:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw1452 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:58:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [15:58:06] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:58:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [15:58:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T352010)', diff saved to https://phabricator.wikimedia.org/P61983 and previous config saved to /var/cache/conftool/dbconfig/20240507-155822-ladsgroup.json [15:58:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:00:04] jhathaway and rzl: Puppet request window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1600). Please do the needful. [16:00:04] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:38] PROBLEM - Check whether ferm is active by checking the default input chain on mw1352 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:00:43] o/ [16:00:52] o/ [16:00:54] merging in [16:01:00] (03CR) 10JHathaway: [C:03+2] Add Apache configuration for wikipedia-it-arbcom.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1028858 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [16:01:07] thanks jhathaway [16:01:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review, 03WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users for linafaridwmde - https://phabricator.wikimedia.org/T364068#9778014 (10Dzahn) a:05Dzahn→03None [16:01:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review, 03WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users for linafaridwmde - https://phabricator.wikimedia.org/T364068#9778006 (10Dzahn) 05In progress→03Resolved a:03Dzahn @Lina_Farid_WMDE You have been added to the gr... [16:01:32] yup! [16:02:22] zabe, done [16:02:27] Thanks! [16:02:54] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://phabricator.wikimedia.org/T364359 says these users were supposed to get the new group but not be removed from the old group." 
[puppet] - 10https://gerrit.wikimedia.org/r/1023965 (https://phabricator.wikimedia.org/T363360) (owner: 10BCornwall) [16:04:16] zabe: jhathaway: iirc that change also needs a mw-on-k8s deployment (after a puppet run on deploy1002) to apply properly [16:05:02] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] (duration: 32m 29s) [16:05:12] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:05:12] T299947: Normalize pagelinks table - https://phabricator.wikimedia.org/T299947 [16:05:18] 06SRE, 06collaboration-services, 06serviceops: add bullseye support to deployment server puppet role - upgrade deployment server in devtools - https://phabricator.wikimedia.org/T363415#9778028 (10Dzahn) [16:05:35] ok, i can do that [16:07:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:08:38] ^ Strange, looking. [16:08:49] !log zabe@deploy1002 Started scap: (no justification provided) [16:08:49] denisse: could be the patch we just pushed [16:08:49] !log zabe@deploy1002 sync-world aborted: (no justification provided) (duration: 00m 00s) [16:08:57] looking also [16:09:10] jhathaway: That seems reasonable, I've ACK'd the alerts.
[16:09:22] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1028864 [16:13:06] mutante: the issue seems to be with the user linafaridwmde you just added [16:13:40] I think you missed specifying their gid, so useradd assumes there is a gid that matches the uid of the user [16:15:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:18:18] jhathaway: so is it okay to do the mw-on-k8s deployment or should I wait? [16:18:31] (03CR) 10Dzahn: [C:03+1] coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [16:18:51] probably okay, but let me fix first... [16:18:57] okay:) [16:19:12] jhathaway: happy to add GID 500 but I remember checking if others like this had it and they didn't [16:19:24] it's not a shell user [16:19:47] appserver errors are all of the form "Lock wait timeout exceeded" - some kind of DB flakiness happening? [16:19:51] looking at the error [16:20:02] if it is not a shell user, then it would be in ldap_only_users, correct mutante?
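The gid diagnosis at 16:13:40 can be sketched as a pre-merge check. This is a minimal sketch only: the dictionary layout, the field names (`ensure`, `uid`, `gid`) and the sample accounts are assumptions made for illustration, not the actual schema of the puppet admin data, and `users_missing_gid` is a hypothetical helper rather than an existing tool.

```python
from typing import Dict, List


def users_missing_gid(users: Dict[str, dict]) -> List[str]:
    """Return the names of present users that define a uid but no gid.

    Assumed (hypothetical) layout: each user maps to a dict that may
    contain 'ensure', 'uid' and 'gid' keys.
    """
    missing = []
    for name, entry in sorted(users.items()):
        # Accounts that are being removed do not need a group.
        if entry.get("ensure", "present") != "present":
            continue
        # A uid without a gid is the case suspected in the discussion.
        if "uid" in entry and "gid" not in entry:
            missing.append(name)
    return missing


# Hypothetical sample data, not taken from the real repository.
SAMPLE_USERS = {
    "example_with_gid": {"ensure": "present", "uid": 1001, "gid": 500},
    "example_without_gid": {"ensure": "present", "uid": 1002},
    "example_absent": {"ensure": "absent", "uid": 1003},
}

if __name__ == "__main__":
    for user in users_missing_gid(SAMPLE_USERS):
        print(f"user {user} has a uid but no gid")
```

Run against the data file in CI, a check of this kind would presumably flag such an entry before merge instead of after a fleet-wide puppet run.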
[16:20:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:20:21] puppet is trying to add a local account for that user [16:20:23] no, not for this case where it's the weird "privatedata-users without shell" [16:20:32] CI didn't like the previous PS [16:22:12] that's why I dislike that weird special case :p [16:22:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:22:27] (03PS1) 10Dzahn: Revert "admin: add linafaridwmde to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/1028581 [16:23:58] (03CR) 10Dzahn: [C:03+2] "Warning: /Stage[main]/Admin/Exec[enforce-users-groups-cleanup]: Skipping because of failed dependencies" [puppet] - 10https://gerrit.wikimedia.org/r/1028581 (owner: 10Dzahn) [16:25:46] mutante: no shell means no ssh, which is a bit weird for sure, which users don't have gids? [16:25:54] jhathaway: reverted and starting to run cumin on --failed-only in batches [16:26:02] thanks [16:26:21] zabe: I think you are clear to go [16:26:25] (03PS1) 10Ladsgroup: Partial cherry-pick of I9d8409fdbd757e [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1028865 (https://phabricator.wikimedia.org/T361398) [16:26:32] alright [16:26:44] !log zabe@deploy1002 Started scap: T363825 [16:26:47] T363825: Create private wikipedia_it_arbcom wiki - https://phabricator.wikimedia.org/T363825 [16:27:10] jhathaway: yes, it is supposed to be no ssh. 
https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Accounts_and_passwords_explained:_LDAP/Wikitech/MW_Developer_vs_shell/ssh/posix_vs_Kerberos [16:27:15] FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver POST/200: 0.613747498021355s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyE [16:27:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:27:18] RECOVERY - Check whether ferm is active by checking the default input chain on parse1024 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:27:30] I want to point out I have no relation to those mediawiki error rates ^ [16:27:36] (03CR) 10Andrea Denisse: [V:03+1] "Good idea, I'll send another patch for titan. Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:28:06] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:28:51] (03PS9) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [16:29:10] Neither do I, and that actually looks quite concerning [16:30:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review, 03WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users for linafaridwmde - https://phabricator.wikimedia.org/T364068#9778167 (10Dzahn) 05Resolved→03Open unfortunately merging the code change caused widespread puppet failur... [16:30:38] RECOVERY - Check whether ferm is active by checking the default input chain on mw1352 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:30:41] (03CR) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:31:06] jhathaway: I think you were right about the original thing that it was missing the gid line. some just have it in another order.. but I will let this calm down first [16:31:21] nod, sounds good [16:31:55] (03PS10) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [16:32:15] RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver POST/200: ...
[16:32:15] 0.7291964295290969s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:32:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:32:31] created T364404 for the mw errors [16:32:31] T364404: Wikimedia\Rdbms\DBTransactionSizeError: Transaction spent {time}s in writes, exceeding the 3s limit - https://phabricator.wikimedia.org/T364404 [16:34:27] !log zabe@deploy1002 Finished scap: T363825 (duration: 07m 42s) [16:34:30] T363825: Create private wikipedia_it_arbcom wiki - https://phabricator.wikimedia.org/T363825 [16:36:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:36:23] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 9 CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:36:59] 06SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Dennis Mburugu - https://phabricator.wikimedia.org/T364320#9778211 (10DMburugu) Turnilo and Superset [16:37:06] I depooled some api appservers earlier in advance of reimaging. 
Will repool them to help soak this up, but I don't think it's strictly a capacity issue [16:37:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:37:27] these lock timeouts and latency increases, along with little bumps in DB errors make me nervous [16:38:06] PROBLEM - Host mwlog1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:38:15] FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver GET/200: 0.5382698779576534s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyEx [16:38:22] (03PS1) 10Btullis: Move stats misc_jobs from stat1007 to stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1028866 (https://phabricator.wikimedia.org/T353785) [16:38:44] ooh I wonder if the error rate crashed mwlog1002 somehow [16:39:06] (03PS11) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [16:39:50] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet [16:40:29] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2321/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028866 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [16:41:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.57% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:42:40] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:42:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:43:15] RESOLVED: [2x] MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver GET/200: 0.23324278673749851s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:43:57] (03CR) 10Andrea Denisse: [V:03+1] "Hello team, here are the PCC results for the latest patch: https://puppet-compiler.wmflabs.org/output/1028546/2322/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:45:27] (03PS4) 10Volans: sre.hosts.decommission: ask on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) [16:45:27] (03PS1) 10Volans: sre.ganeti.makevm: add logging message [cookbooks] - 10https://gerrit.wikimedia.org/r/1028867 [16:46:50] (03CR) 10Andrea Denisse: [V:03+1] thanos: Provision Thanos frontend TLS certificates with CFSSL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:47:44] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1028867 (owner: 10Volans) [16:47:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:48:31] (03CR) 10Btullis: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1028799 (https://phabricator.wikimedia.org/T316876) (owner: 10Muehlenhoff) [16:48:36] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet [16:48:58] (03CR) 10Btullis: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1028763 (https://phabricator.wikimedia.org/T316876) (owner: 10Muehlenhoff) [16:49:42] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1026822 (owner: 10Muehlenhoff) [16:49:55] (03CR) 10CI reject: [V:04-1] sre.ganeti.makevm: add logging message [cookbooks] - 10https://gerrit.wikimedia.org/r/1028867 (owner: 10Volans) [16:50:21] mutante: Can you check on mwlog1002.eqiad.wmnet? [16:53:25] dancy: oh, you mean the host is down... yea [16:53:37] Nod. It died suddenly [16:53:54] I looked briefly but nothing in sel for what it's worth [16:54:12] dancy: We're looking at it and tracking the issue on this ticket: https://phabricator.wikimedia.org/T364404 [16:55:18] the machine is up and sitting at login [16:55:28] so that usually means cable disconnected [16:55:30] or networking [16:55:36] Interesting [16:56:00] Is it responsive to the login prompt?
[16:56:05] logged in via mgmt [16:56:12] yes [16:56:30] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:56:43] jhathaway: fwiw that's that ^ [16:56:54] RECOVERY - Host mwlog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [16:56:54] 🫶 [16:57:06] woohoo! [16:57:26] mwlog1002 load average is >12. [16:57:40] mwlog1002 is reachable again [16:57:46] RECOVERY - Check whether ferm is active by checking the default input chain on mw1452 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:58:09] CPU used by rsyslogd [16:58:16] load going down [16:59:04] mwlog1002 rsyslogd: omkafka: kafka error message: -181,'Local: SSL error','ssl://kafka-logging1004.eqiad.wmnet:9093/1004: SSL handshake failed: Disconnected: connecting to a PLAINTEXT broker listener? [16:59:52] (03CR) 10Hashar: [C:04-1] "Thanks for the discovery of `plugin.restApi().post()` and crafting the post. The code should overall be refined, see my various comments." 
[software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1700) [17:00:05] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027484 (https://phabricator.wikimedia.org/T363449) (owner: 10Kevin Bazira) [17:03:07] (03Merged) 10jenkins-bot: admin_ng: add commons host header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027484 (https://phabricator.wikimedia.org/T363449) (owner: 10Kevin Bazira) [17:05:24] (03CR) 10Scott French: [C:03+2] apertium: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028604 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:06:25] (03Merged) 10jenkins-bot: apertium: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028604 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:07:50] (03PS1) 10Dzahn: Revert "Revert "admin: add linafaridwmde to analytics-privatedata-users"" [puppet] - 10https://gerrit.wikimedia.org/r/1028582 [17:09:30] (03PS6) 10Herron: pyrra: etcd: add generic rules workaround [puppet] - 10https://gerrit.wikimedia.org/r/1028864 (https://phabricator.wikimedia.org/T302995) [17:13:51] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply [17:14:15] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply [17:14:23] (03CR) 10Hashar: [C:03+1] "It is fine to restart Envoy, afaik that is only used for https://integration.wikimedia.org/ and my guess is the only side effects would be" [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:15:58] (03PS4) 10JHathaway: puppetserver-deploy-code: 
bail out if current branch is not 'production' [puppet] - 10https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [17:16:14] (03CR) 10Herron: [C:03+2] pyrra: etcd: add generic rules workaround [puppet] - 10https://gerrit.wikimedia.org/r/1028864 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:16:47] (03CR) 10JHathaway: puppetserver-deploy-code: bail out if current branch is not 'production' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [17:19:19] (03CR) 10Hashar: "The issue I have is the containers being restarted while they might be running containers. That would cause Jenkins jobs to fail abruptly " [puppet] - 10https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:20:20] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply [17:20:44] 06SRE, 10Wikimedia-Mailing-lists: Allow 'rel="me"' in Postorius list info links to verify Mastodon link on wikis.world - https://phabricator.wikimedia.org/T364402#9778349 (10Aklapper) @Greenreaper: That sounds to me like upstream behavior to improve (not to mangle the `rel` parameter value)? Per https://www.me... 
[17:21:09] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [17:29:28] (03PS3) 10Scott French: api-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) [17:32:29] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply [17:33:42] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [17:35:30] (03PS1) 10Zabe: Avoid empty insert in SqlScoreStorage::storeScores [extensions/ORES] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1028583 (https://phabricator.wikimedia.org/T364218) [17:38:11] (03CR) 10Eevans: [C:03+2] New group for users of Cassandra staging (cassandra-dev) [puppet] - 10https://gerrit.wikimedia.org/r/1026194 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [17:49:34] (03PS2) 10Dzahn: Revert "Revert "admin: add linafaridwmde to analytics-privatedata-users"" [puppet] - 10https://gerrit.wikimedia.org/r/1028582 [17:50:06] (03CR) 10Dzahn: [C:03+2] ci: Enable profile::auto_restarts::service for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:50:40] (03PS2) 10Muehlenhoff: ci: Enable profile::auto_restarts::service for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991) [17:51:04] (03CR) 10Dzahn: ci: Enable profile::auto_restarts::service for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:52:15] (03CR) 10Dzahn: [C:03+2] Revert "Revert "admin: add linafaridwmde to analytics-privatedata-users"" [puppet] - 10https://gerrit.wikimedia.org/r/1028582 (owner: 10Dzahn) [17:53:22] (03CR) 10Dzahn: [V:03+1 C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1028796 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:54:05] (03PS1) 
10Andrea Denisse: thanos: Update TLS certificate in Envoy config to match CFSSL provisioning [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) [17:55:38] 06SRE, 10SRE-Access-Requests, 03WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users for linafaridwmde - https://phabricator.wikimedia.org/T364068#9778496 (10Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028581 https://gerrit.wikimedia.org/r/c/operations/puppet/... [17:56:42] 06SRE, 10SRE-Access-Requests, 03WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users for linafaridwmde - https://phabricator.wikimedia.org/T364068#9778497 (10Dzahn) 05Open→03Resolved a:03Dzahn issue fixed. user is now being created by puppet. within the next half hour... [17:58:43] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2323/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:00:05] jeena and thcipriani: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T1800) [18:00:12] 06SRE, 06collaboration-services, 06serviceops: add bullseye support to deployment server puppet role - upgrade deployment server in devtools - https://phabricator.wikimedia.org/T363415#9778509 (10Dzahn) This still needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026193 to be merged to be able to... 
[18:04:55] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414 (10Jdforrester-WMF) 03NEW [18:06:54] (03CR) 10Andrea Denisse: [V:03+1] "Hello team, the PCC results for this change show a NOOP however, I think this change is important because there `thanos-query.discovery.wm" [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:08:23] (03CR) 10Thcipriani: "Sounds like we need to backport this pre-train, correct?" [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1028865 (https://phabricator.wikimedia.org/T361398) (owner: 10Ladsgroup) [18:09:12] (03CR) 10Andrew Bogott: [C:03+2] puppetserver-deploy-code: bail out if current branch is not 'production' [puppet] - 10https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [18:09:16] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9778550 (10Dzahn) [18:10:39] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9778555 (10Dzahn) @thcipriani please consider for approval (https://wikimedia.namely.com/people/eaebb898-01ba-404e-8cf8-2ed33c4e0d04/show/personal/employee-information/) [18:10:52] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9778558 (10Dzahn) @Mcastro Please confirm if you approve [18:12:25] 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416 (10RobH) 03NEW [18:12:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:13:44] 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9778581 (10RobH) [18:14:22] 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9778582 (10RobH) [18:15:01] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9778588 (10ecarg) 'signing' my request this way; TY James! [18:16:20] 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9778606 (10Dzahn) fwiw - for the person who will add the production puppet role to this later: This is only possible since just recently but should be mostly unblocked now: details in T363415 -... [18:17:25] 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9778608 (10RobH) a:03akosiaris @akosiaris, The parent ordering task for the deploy1002 replacement didn't have racking info, but I didn't want to stall ordering to get it so I've created this... [18:21:16] 06SRE, 10Wikimedia-Mailing-lists: Allow 'rel="me"' in Postorius list info links to verify Mastodon link on wikis.world - https://phabricator.wikimedia.org/T364402#9778615 (10GreenReaper) It actually looks like it is upstream of that. [The commit adding markdown support](https://gitlab.com/mailman/postorius/-/c... [18:23:10] 06SRE, 10Wikimedia-Mailing-lists: Allow 'rel="me"' in Postorius list info links to verify Mastodon link on wikis.world - https://phabricator.wikimedia.org/T364402#9778628 (10Pppery) But do note that Wikimedia tends to be slow at pulling down upstream changes, and I have no idea how fast Mailman does so. 
[18:23:40] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:33:45] 06SRE, 10Wikimedia-Mailing-lists, 07Upstream: Allow 'rel="me"' in Postorius list info links to verify Mastodon link on wikis.world - https://phabricator.wikimedia.org/T364402#9778698 (10GreenReaper) [18:37:26] 10ops-eqiad, 06SRE, 10Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422 (10Eevans) 03NEW [18:37:34] 10ops-eqiad, 06SRE, 10Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9778735 (10Eevans) p:05Triage→03High [18:38:48] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9778737 (10Eevans) >>! In T362033#9758428, @Volans wrote: > Maybe a little drastic option, but could we try to reimage one of those 2 server and wait few days? > That will surely wipe clean any manual proced... [18:40:10] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Decommissioning — T364422 [18:40:13] T364422: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422 [18:40:24] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Decommissioning — T364422 [18:40:28] 10ops-eqiad, 06SRE, 10Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9778741 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=397ec6a2-88d3-4fa3-b149-367bc8b4c353) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Decommission... 
[18:51:46] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1028881 [18:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:54:51] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9778759 (10Mcastro) Approved. [18:57:56] I will backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1028865 and then roll the train to group0 [18:58:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:58:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1028865 (https://phabricator.wikimedia.org/T361398) (owner: 10Ladsgroup) [19:03:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:03:15] FIRING: 
PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:08:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:10:05] (03PS6) 10Herron: pyrra: logstash: add generic rules workaround [puppet] - 10https://gerrit.wikimedia.org/r/1028881 (https://phabricator.wikimedia.org/T302995) [19:10:19] (03CR) 10Herron: [V:03+1 C:03+2] pyrra: logstash: add generic rules workaround [puppet] - 10https://gerrit.wikimedia.org/r/1028881 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:18:25] (03Merged) 10jenkins-bot: Partial cherry-pick of I9d8409fdbd757e [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1028865 (https://phabricator.wikimedia.org/T361398) (owner: 10Ladsgroup) [19:18:53] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1028865|Partial cherry-pick of I9d8409fdbd757e (T361398 T362566)]] [19:18:58] T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398 [19:18:59] T362566: Stop growth of text table by storing ES addresses in content table - https://phabricator.wikimedia.org/T362566 [19:20:03] (03PS2) 10Volans: sre.ganeti.makevm: add logging message [cookbooks] - 10https://gerrit.wikimedia.org/r/1028867 [19:21:29] !log jhuneidi@deploy1002 ladsgroup and jhuneidi: Backport for [[gerrit:1028865|Partial cherry-pick of I9d8409fdbd757e (T361398 T362566)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [19:21:45] !log jhuneidi@deploy1002 ladsgroup and jhuneidi: Continuing with sync [19:23:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:20] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:24:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/1028867 (owner: 10Volans) [19:28:30] PROBLEM - Check whether ferm is active by checking the default input chain on parse1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:28:54] PROBLEM - Check whether ferm is active by checking the default input chain on mw2314 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:29:14] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1060 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:29:16] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1021 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:29:18] PROBLEM - Check whether ferm is active by checking the default input chain on parse1024 is CRITICAL: ERROR ferm input drop default policy not set, ferm might 
not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:30:47] (03PS1) 10Aklapper: Make Translations extension work with upstream Phorge [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) [19:34:16] (03PS2) 10Aklapper: Make Translations extension work with upstream Phorge [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) [19:34:33] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1028865|Partial cherry-pick of I9d8409fdbd757e (T361398 T362566)]] (duration: 15m 39s) [19:34:37] T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398 [19:34:37] T362566: Stop growth of text table by storing ES addresses in content table - https://phabricator.wikimedia.org/T362566 [19:35:56] (03CR) 10Aklapper: "Not perfect because still hardcoding WMF Phabricator instance in browse-uri, but good enough to make this extension work on my machine wit" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: 10Aklapper) [19:43:31] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028888 (https://phabricator.wikimedia.org/T361398) [19:43:33] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028888 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [19:44:21] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028888 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [19:46:30] !log disabling Puppet on the Logstash hosts that serve OpenSearch dashboards to test the CFSSL certificates - T360414 [19:46:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:35] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [19:50:44] (03CR) 10Andrea Denisse: [C:03+2] logstash: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [19:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:52:48] PROBLEM - Check whether ferm is active by checking the default input chain on mw1416 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:52:48] PROBLEM - Check whether ferm is active by checking the default input chain on mw1393 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:53:42] ^ looking [19:53:48] PROBLEM - Check whether ferm is active by checking the default input chain on mw1458 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:55:26] I can confirm the iptables rules are loaded. [19:56:02] The ferm unit looks healthy. 
[19:57:35] !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on 12 hosts with reason: Downtiming the Logstash hosts serving OpenSearch Dashboards as part of the cergen to CFSSL migration - T360414 [19:57:38] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [19:57:55] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 12 hosts with reason: Downtiming the Logstash hosts serving OpenSearch Dashboards as part of the cergen to CFSSL migration - T360414 [19:58:30] RECOVERY - Check whether ferm is active by checking the default input chain on parse1014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:58:54] RECOVERY - Check whether ferm is active by checking the default input chain on mw2314 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:59:14] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1060 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:59:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.19% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:59:16] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1021 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:59:18] RECOVERY - Check whether ferm is active by checking the default input chain on parse1024 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:59:18] !log jhuneidi@deploy1002 rebuilt 
and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.4 refs T361398 [19:59:21] T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398 [19:59:47] Those alerts self resolved, but I'm unsure as to why they fired, there aren't any anomalies on the logs. [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240507T2000) [20:00:04] No Gerrit patches in the queue for this window AFAICS. [20:00:44] I think I need to roll back train due to https://phabricator.wikimedia.org/T364428 [20:00:51] denisse: this might be the same issue as https://phabricator.wikimedia.org/T354855 [20:01:19] (if so, it should auto-resolve on the next puppet run, due to the changes to the check script) [20:02:10] swfrench-wmf: That indeed looks like it, thanks! [20:03:24] rolling back [20:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:04:18] (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028891 (https://phabricator.wikimedia.org/T361398) [20:04:22] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028891 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [20:04:28] (03Merged) 10jenkins-bot: testwikis wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028891 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [20:04:48] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.4 refs 
T361398 [20:04:51] T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398 [20:06:36] !log Enabling Puppet on the Logstash hosts that serve OpenSearch dashboards to migrate to CFSSL certificates - T360414 [20:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:42] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [20:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:09:57] !log Restarting envoyproxy and opensearch-dashboards services on the Logstash hosts that serve OpenSearch dashboards to migrate to CFSSL certificates - T360414 [20:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:10] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] ssl: Remove unnecessary dummy key for the kibana hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1026693 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:14:20] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:14:33] (03CR) 10Andrea Denisse: [C:03+2] wmcs: Remove unnecesary kibana and kibana-discovery certificates [puppet] - 10https://gerrit.wikimedia.org/r/1026692 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:14:34] 10ops-eqiad, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429 (10RobH) 03NEW [20:15:30] 10ops-eqiad, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install 
an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9779047 (10RobH) [20:16:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:17:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:17:24] !log Deleting the kibana and kibana-combined certificates from the private repository - T360414 [20:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:29] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [20:17:46] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9779052 (10Jclark-ctr) Friday dell agreed to replace Backplane and cables. shipped out Monday expected arrival Tuesday. 
[20:18:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:51] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.4 refs T361398 (duration: 15m 03s)
[20:19:54] T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398
[20:20:54] (CR) Zabe: [C:+2] Avoid empty insert in SqlScoreStorage::storeScores [extensions/ORES] (wmf/1.43.0-wmf.3) - https://gerrit.wikimedia.org/r/1028583 (https://phabricator.wikimedia.org/T364218) (owner: Zabe)
[20:21:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:22:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:22:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw1416 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:22:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw1393 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:23:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.79% idle - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:23:17] (Merged) jenkins-bot: Avoid empty insert in SqlScoreStorage::storeScores [extensions/ORES] (wmf/1.43.0-wmf.3) - https://gerrit.wikimedia.org/r/1028583 (https://phabricator.wikimedia.org/T364218) (owner: Zabe)
[20:23:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw1458 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:24:17] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1028583|Avoid empty insert in SqlScoreStorage::storeScores (T364218)]]
[20:24:20] T364218: UnexpectedValueException: Wikimedia\Rdbms\InsertQueryBuilder::execute can't have empty $rows value (via ORES SqlScoreStorage) - https://phabricator.wikimedia.org/T364218
[20:26:39] SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9779063 (andrea.denisse)
[20:26:55] !log zabe@deploy1002 zabe: Backport for [[gerrit:1028583|Avoid empty insert in SqlScoreStorage::storeScores (T364218)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:27:14] !log zabe@deploy1002 zabe: Continuing with sync
[20:28:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:28:48] (PS12) Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - https://gerrit.wikimedia.org/r/1028546
(https://phabricator.wikimedia.org/T360414)
[20:32:36] (CR) Andrea Denisse: [V:+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[20:34:17] (PS4) Scott French: confd: Extend confd-lint-wrap to accept a unique resource name [puppet] - https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924)
[20:34:17] (PS4) Scott French: confd: prom exporter uses resource name to find state file [puppet] - https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924)
[20:34:17] (PS1) Scott French: confd: confd-lint-wrap ignores positional args separator [puppet] - https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924)
[20:34:19] (PS1) Scott French: confd: insert positional argument separator in check_cmd [puppet] - https://gerrit.wikimedia.org/r/1028898 (https://phabricator.wikimedia.org/T363924)
[20:34:58] (CR) Scott French: [C:-1] "Thank you both for the review."
[puppet] - https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[20:35:10] PROBLEM - Check whether ferm is active by checking the default input chain on mw2429 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:36:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:38:25] (CR) Scott French: "Alright, I've uploaded two patches beneath this one, which pave the way for using argparse here in confd-lint-wrap. Thanks for your patien" [puppet] - https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[20:39:58] (CR) Scott French: "Thanks in advance for the reviews. See also [0] for the rationale behind this." [puppet] - https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[20:40:18] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1028583|Avoid empty insert in SqlScoreStorage::storeScores (T364218)]] (duration: 16m 01s)
[20:40:28] T364218: UnexpectedValueException: Wikimedia\Rdbms\InsertQueryBuilder::execute can't have empty $rows value (via ORES SqlScoreStorage) - https://phabricator.wikimedia.org/T364218
[20:40:59] SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9779078 (andrea.denisse) We have a special situation with the thanos* hosts: The `thanos-fe` hosts TLS certificates are not provisione...
[20:41:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:47:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T352010)', diff saved to https://phabricator.wikimedia.org/P61985 and previous config saved to /var/cache/conftool/dbconfig/20240507-204701-ladsgroup.json
[20:47:05] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[20:47:57] (CR) Scott French: "PCC diff for puppetmaster1001: https://puppet-compiler.wmflabs.org/output/1028898/2328/" [puppet] - https://gerrit.wikimedia.org/r/1028898 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[20:48:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:57:20] (CR) Zabe: [C:+2] Use OpenSSL for PBKDF2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/842522 (https://phabricator.wikimedia.org/T320929) (owner: PleaseStand)
[20:58:09] (Merged) jenkins-bot: Use OpenSSL for PBKDF2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/842522 (https://phabricator.wikimedia.org/T320929) (owner: PleaseStand)
[20:58:41] !log zabe@deploy1002 Started scap: Backport for [[gerrit:842522|Use OpenSSL for PBKDF2 password hashing (T320929)]]
[20:58:44] T320929: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929
[21:01:15] !log zabe@deploy1002 zabe and ki: Backport for
[[gerrit:842522|Use OpenSSL for PBKDF2 password hashing (T320929)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:01:56] SRE, Wikimedia-Mailing-lists, Upstream: Allow 'rel="me"' in Postorius list info links to verify Mastodon link on wikis.world - https://phabricator.wikimedia.org/T364402#9779144 (GreenReaper) [Opened #305 in pupa/readme_renderer](https://github.com/pypa/readme_renderer/issues/305) for this issue. If t...
[21:02:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P61986 and previous config saved to /var/cache/conftool/dbconfig/20240507-210209-ladsgroup.json
[21:03:23] !log zabe@deploy1002 zabe and ki: Continuing with sync
[21:05:10] RECOVERY - Check whether ferm is active by checking the default input chain on mw2429 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:05:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T352010)', diff saved to https://phabricator.wikimedia.org/P61987 and previous config saved to /var/cache/conftool/dbconfig/20240507-210556-ladsgroup.json
[21:05:59] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:15:56] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:842522|Use OpenSSL for PBKDF2 password hashing (T320929)]] (duration: 17m 14s)
[21:16:00] T320929: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929
[21:17:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P61988 and previous config saved to /var/cache/conftool/dbconfig/20240507-211717-ladsgroup.json
[21:17:20] (CR) Zabe: [C:+2] Revert "beta: Use OpenSSL for PBKDF2 password hashing" [mediawiki-config] - https://gerrit.wikimedia.org/r/1028760
(owner: Zabe)
[21:18:08] (Merged) jenkins-bot: Revert "beta: Use OpenSSL for PBKDF2 password hashing" [mediawiki-config] - https://gerrit.wikimedia.org/r/1028760 (owner: Zabe)
[21:21:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P61989 and previous config saved to /var/cache/conftool/dbconfig/20240507-212103-ladsgroup.json
[21:22:03] (CR) Pppery: [C:+1] "Seems reasonable to me." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: Aklapper)
[21:25:56] (CR) Pppery: [C:+1] Make Translations extension work with upstream Phorge (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: Aklapper)
[21:28:13] (PS3) Volans: reqconfig: add command to search IP in ipblocks [software/conftool] - https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423)
[21:29:49] (PS3) Volans: sre.ganeti.makevm: add logging message [cookbooks] - https://gerrit.wikimedia.org/r/1028867
[21:29:58] (CR) Volans: [C:+2] sre.ganeti.makevm: add logging message [cookbooks] - https://gerrit.wikimedia.org/r/1028867 (owner: Volans)
[21:30:42] (CR) Volans: "All tests should work now, at least they work locally :) Thanks for the patch to support v2, that helped."
[software/conftool] - https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: Volans)
[21:32:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T352010)', diff saved to https://phabricator.wikimedia.org/P61990 and previous config saved to /var/cache/conftool/dbconfig/20240507-213227-ladsgroup.json
[21:32:32] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:33:44] (Merged) jenkins-bot: sre.ganeti.makevm: add logging message [cookbooks] - https://gerrit.wikimedia.org/r/1028867 (owner: Volans)
[21:33:56] (CR) Volans: confd: confd-lint-wrap ignores positional args separator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[21:36:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P61991 and previous config saved to /var/cache/conftool/dbconfig/20240507-213614-ladsgroup.json
[21:51:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T352010)', diff saved to https://phabricator.wikimedia.org/P61992 and previous config saved to /var/cache/conftool/dbconfig/20240507-215122-ladsgroup.json
[21:51:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:54:35] (CR) Scott French: "Thanks, Riccardo!"
[puppet] - https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[22:12:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:18:43] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[22:19:02] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1028898 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[22:24:24] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[22:24:54] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[22:58:44] (CR) Scott French: [C:+1] "LGTM! So, this would supersede [0], right? (e.g., benefits from being directly integrated with the tool, provides a bit more detail on mat" [software/conftool] - https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: Volans)
[23:18:26] (PS1) Scott French: benthos: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T362978)
[23:18:28] (PS1) Scott French: blubberoid: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1028911 (https://phabricator.wikimedia.org/T362978)
[23:25:22] (CR) Scott French: "Again continuing alphabetically, though skipping aqs-http-gateway for the moment, as there's a bit of coordination needed. Thanks!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[23:32:03] (CR) Scott French: "Thanks for the review, Janis, and for the tip, Hugh. That should be sufficient, as this change should ideally break things in a non-subtle" [deployment-charts] - https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[23:35:41] FIRING: ProbeDown: Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:38:11] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1028926
[23:38:11] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1028926 (owner: TrainBranchBot)
[23:40:41] RESOLVED: ProbeDown: Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:55:26] FWIW, the kubemaster2001 probe failures above look like another instance of https://phabricator.wikimedia.org/T358936 (certificates refreshed during puppet run on kubemaster2002 just prior)