[00:04:01] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (Eevans)
[00:04:55] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (Eevans) Open→Resolved done.
[00:06:40] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.ganeti.GanetiError: Error while performing request to RAPI - https://phabricator.wikimedia.org/T353379 (Dzahn) ooh, ok! thanks
[00:34:06] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release
[00:34:07] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=93) on GitLab host gitlab1003.wikimedia.org with reason: security release
[00:38:02] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release
[00:38:03] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release
[00:40:24] !log dzahn@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release
[00:40:24] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release
[00:42:56] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bullseye
[00:51:16] jouncebot: nowandnext
[00:51:16] No deployments scheduled for the next 6 hour(s) and 8 minute(s)
[00:51:16] In 6 hour(s) and 8 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T0700)
[00:51:16] In 6 hour(s) and 8 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T0700)
[00:52:13] hashar: jouncebot should know your preferences now. Thanks for the patch!
[00:53:45] (Device rebooted) firing: (2) Alert for device ps1-b3-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[00:58:45] (Device rebooted) firing: (2) Device ps1-b3-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[01:03:45] (Device rebooted) firing: (2) Device ps1-b6-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[01:04:19] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:15:00 on gitlab1003.wikimedia.org with reason: upgrade gitlab1003 to new version https://phabricator.wikmedia.org/T353375
[01:04:35] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gitlab1003.wikimedia.org with reason: upgrade gitlab1003 to new version https://phabricator.wikmedia.org/T353375
[01:08:45] (Device rebooted) resolved: Device ps1-b7-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[01:19:45] (Device rebooted) firing: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[01:24:45] (Device rebooted) resolved: Device ps1-b8-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[01:25:51] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gitlab1003.wikimedia.org with reason: upgrade gitlab1003 to new version https://phabricator.wikmedia.org/T353375
[01:26:08] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gitlab1003.wikimedia.org with reason: upgrade gitlab1003 to new version https://phabricator.wikmedia.org/T353375
[01:32:08] (CertAlmostExpired) firing: (2) Certificate for service kubestagemaster2001:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster2001:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:36:12] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:37:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:17:38] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:46] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:59] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:28] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:39] !log brion running cleanupOrphanedTranscodes on commonswiki on mwmaint2002
[02:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:32] !log cleanupOrphanedTranscodes complete. requeueTranscodes continues... forever and ever and ever
[03:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:59] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:24:30] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testhost2001.codfw.wmnet with OS bullseye
[03:28:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:31:28] RECOVERY - Check unit status of geoip_update_main on puppetserver1001 is OK: OK: Status of the systemd unit geoip_update_main https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:31:28] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:04] RECOVERY - Check unit status of geoip_update_main on puppetmaster1001 is OK: OK: Status of the systemd unit geoip_update_main https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:47:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:48:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:32:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:33:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:32:23] (CertAlmostExpired) firing: (2) Certificate for service kubestagemaster2001:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster2001:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:51:00] (PS1) Marostegui: pc1015: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/982953
[05:55:32] (CR) Marostegui: [C: +2] pc1015: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/982953 (owner: Marostegui)
[06:00:40] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:20:18] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2023-12-17 03:07:37 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:23:16] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T0700)
[07:00:05] kormat, marostegui, and Amir1: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T0700).
[07:04:38] (PS1) Ilias Sarantopoulos: llm: update image with sentencepiece [deployment-charts] - https://gerrit.wikimedia.org/r/983079 (https://phabricator.wikimedia.org/T351740)
[07:14:52] (CR) Ilias Sarantopoulos: [C: +2] llm: update image with sentencepiece [deployment-charts] - https://gerrit.wikimedia.org/r/983079 (https://phabricator.wikimedia.org/T351740) (owner: Ilias Sarantopoulos)
[07:15:43] (Merged) jenkins-bot: llm: update image with sentencepiece [deployment-charts] - https://gerrit.wikimedia.org/r/983079 (https://phabricator.wikimedia.org/T351740) (owner: Ilias Sarantopoulos)
[07:16:52] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[07:28:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:28:18] (PS1) Slyngshede: P:debmonitor::server Add Prometheus Blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/983108 (https://phabricator.wikimedia.org/T350694)
[07:35:31] SRE, Infrastructure-Foundations: Setup cumin1002 - https://phabricator.wikimedia.org/T353419 (MoritzMuehlenhoff)
[07:39:02] PROBLEM - cassandra-a CQL 10.192.16.240:9042 on restbase2029 is CRITICAL: connect to address 10.192.16.240 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[07:39:20] PROBLEM - Check systemd state on restbase2029 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:34] PROBLEM - cassandra-a SSL 10.192.16.240:7000 on restbase2029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[07:39:37] (PS1) Muehlenhoff: Add cumin1002 to puppet [puppet] - https://gerrit.wikimedia.org/r/983127 (https://phabricator.wikimedia.org/T353419)
[07:40:30] PROBLEM - cassandra-a service on restbase2029 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:43:16] (CR) Muehlenhoff: [C: +2] Add cumin1002 to puppet [puppet] - https://gerrit.wikimedia.org/r/983127 (https://phabricator.wikimedia.org/T353419) (owner: Muehlenhoff)
[07:43:56] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:13] SRE, Infrastructure-Foundations, netops: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (ayounsi) Another trigger, less likely, is if someone deletes a cable connected to a configured interface. Then automation will want to remove the description...
[07:48:11] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1182.eqiad.wmnet onto db1233.eqiad.wmnet
[07:49:16] RECOVERY - Check systemd state on puppetserver1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[07:49:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[07:50:08] RECOVERY - Check systemd state on puppetserver2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:12] RECOVERY - Check systemd state on puppetserver2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:20] (PS2) Slyngshede: P:debmonitor::server Add Prometheus Blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/983108 (https://phabricator.wikimedia.org/T350694)
[07:50:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host cumin1002.eqiad.wmnet
[07:50:43] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:51:23] (CR) Filippo Giunchedi: [C: +1] Create initial stub role for logging-hd and configure for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/982801 (https://phabricator.wikimedia.org/T352517) (owner: Muehlenhoff)
[07:53:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cumin1002.eqiad.wmnet - jmm@cumin2002"
[07:54:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cumin1002.eqiad.wmnet - jmm@cumin2002"
[07:54:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:54:20] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache cumin1002.eqiad.wmnet on all recursors
[07:54:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cumin1002.eqiad.wmnet on all recursors
[07:54:49] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cumin1002.eqiad.wmnet - jmm@cumin2002"
[07:55:25] (PS1) Arnaudb: mariadb: toggle notification db1226 [puppet] - https://gerrit.wikimedia.org/r/982870 (https://phabricator.wikimedia.org/T344036)
[07:55:26] RECOVERY - cassandra-a service on restbase2029 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:55:31] (CR) Filippo Giunchedi: "LGTM, see inline" [alerts] - https://gerrit.wikimedia.org/r/982808 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[07:55:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cumin1002.eqiad.wmnet - jmm@cumin2002"
[07:55:46] RECOVERY - Check systemd state on restbase2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:55:56] RECOVERY - cassandra-a SSL 10.192.16.240:7000 on restbase2029 is OK: SSL OK - Certificate restbase2029-a valid until 2025-12-05 16:11:10 +0000 (expires in 722 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[07:56:54] RECOVERY - cassandra-a CQL 10.192.16.240:9042 on restbase2029 is OK: TCP OK - 0.032 second response time on 10.192.16.240 port 9042 https://phabricator.wikimedia.org/T93886
[07:57:18] (CR) Filippo Giunchedi: [C: +1] kubernetes::master Fix logic for certificate_expiry_days [puppet] - https://gerrit.wikimedia.org/r/982889 (https://phabricator.wikimedia.org/T353233) (owner: JMeybohm)
[07:57:45] (PS1) Muehlenhoff: Switch cumin1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/983129 (https://phabricator.wikimedia.org/T353419)
[08:00:05] Amir1, apergos, and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T0800).
[08:00:06] apergos and MatmaRex: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:35] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:00:45] hi
[08:00:54] morning! there is a trainee signed up today to learn how to deploy, and we have some patches including one of mine. if any of my co-deployers are around, it would be lovely if one could do the screenshare/typing piece of this while I do the commentary for the training. note that the trainee is not yet in the channel, let's hope there is not timezone confusion :-)
[08:01:11] (CR) Muehlenhoff: [C: +2] Switch cumin1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/983129 (https://phabricator.wikimedia.org/T353419) (owner: Muehlenhoff)
[08:01:15] i have just a couple of hopefully no-op changes, i hope someone can deploy them for me :D
[08:01:32] hello! give me a moment to look at the patches, since they were just added a little bit ago.
[08:01:37] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:01:56] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host cumin1002.eqiad.wmnet with OS bullseye
[08:02:33] SRE, Infrastructure-Foundations, Patch-For-Review: Setup cumin1002 - https://phabricator.wikimedia.org/T353419 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host cumin1002.eqiad.wmnet with OS bullseye
[08:02:44] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:02:48] first one looks good
[08:04:02] i'm not usually around at this time of day, so i didn't schedule them earlier
[08:04:55] second one looks... amusing and hopefully ok :-D
[08:05:30] and third one looks ok based on my quick scan of the second one and the related task
[08:06:17] I'm going to go ahead and self-deploy my own change while we wait for subbu to arrive for the training, if he hasn't by the time I'm done, I'll start with yours
[08:07:06] (CR) TrainBranchBot: [C: +2] "Approved by ariel@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: ArielGlenn)
[08:07:53] (Merged) jenkins-bot: use virtual db domain for CentralAuth and GlobalBlocking [mediawiki-config] - https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: ArielGlenn)
[08:08:13] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:10:15] !log ariel@deploy2002 Started scap: Backport for [[gerrit:971967|use virtual db domain for CentralAuth and GlobalBlocking (T348486)]]
[08:10:22] T348486: Migrate CentralAuth to use a virtual database domain - https://phabricator.wikimedia.org/T348486
[08:11:04] note that change I352499f978cb8d390911142ec723ace67eb5632e which was merged but not deployed, impacting labs config only, will also be going out
[08:11:42] !log ariel@deploy2002 ariel: Backport for [[gerrit:971967|use virtual db domain for CentralAuth and GlobalBlocking (T348486)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:13:42] !log ariel@deploy2002 ariel: Continuing with sync
[08:20:48] !log ariel@deploy2002 Finished scap: Backport for [[gerrit:971967|use virtual db domain for CentralAuth and GlobalBlocking (T348486)]] (duration: 10m 33s)
[08:20:52] T348486: Migrate CentralAuth to use a virtual database domain - https://phabricator.wikimedia.org/T348486
[08:21:45] looks ok to me, I'll give it a minute to make sure nothing weird crops up, then we'll move alonf
[08:21:47] *along
[08:23:31] right, onward
[08:23:35] (CR) TrainBranchBot: [C: +2] "Approved by ariel@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/982441 (https://phabricator.wikimedia.org/T314947) (owner: Bartosz Dziewoński)
[08:24:17] (Merged) jenkins-bot: Remove references to refreshMessageBlobs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/982441 (https://phabricator.wikimedia.org/T314947) (owner: Bartosz Dziewoński)
[08:24:19] I think our trainee will not be coming this morning, so I'll leave the google meet and we'll reschedule
[08:24:28] RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:24:34] apergos: we could probably do all of my backports at once to speed it up
[08:24:42] !log ariel@deploy2002 Started scap: Backport for [[gerrit:982441|Remove references to refreshMessageBlobs.php (T314947)]]
[08:24:47] T314947: Remove old refreshMessageBlobs.php script from WikimediaMaintenance - https://phabricator.wikimedia.org/T314947
[08:25:32] (CR) JMeybohm: [V: +1 C: +2] kubernetes::master Fix logic for certificate_expiry_days [puppet] - https://gerrit.wikimedia.org/r/982889 (https://phabricator.wikimedia.org/T353233) (owner: JMeybohm)
[08:26:11] !log ariel@deploy2002 ariel and matmarex: Backport for [[gerrit:982441|Remove references to refreshMessageBlobs.php (T314947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:26:22] the scap for these ought to be much quicker, I would be more comfortable one at a time if you don't mind
[08:26:36] and there's the first one, please test (if it can be, heh)
[08:26:41] on an mwdebug host.
[08:26:57] sure
[08:27:18] nothing to test here, right? that file is not web-accessible
[08:27:31] I don't see how it can be tested either tbh :-)
[08:27:36] proceeding
[08:27:39] !log ariel@deploy2002 ariel and matmarex: Continuing with sync
[08:28:54] the other two aren't directly testable either. my plan was to watch the error logs from RunSingleJob.php and make sure they don't change
[08:29:11] a good plan
[08:29:27] (CR) Marostegui: [C: +1] mariadb: toggle notification db1226 [puppet] - https://gerrit.wikimedia.org/r/982870 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[08:32:48] (this query: https://logstash.wikimedia.org/goto/2accede6853f520a5cc53e9be41b1923)
[08:33:54] 👍
[08:34:06] !log drain eqiad-codfw Arelion link for 100G migration
[08:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:03] !log ariel@deploy2002 Finished scap: Backport for [[gerrit:982441|Remove references to refreshMessageBlobs.php (T314947)]] (duration: 10m 20s)
[08:35:11] T314947: Remove old refreshMessageBlobs.php script from WikimediaMaintenance - https://phabricator.wikimedia.org/T314947
[08:35:26] let's wait the one minute, more for form's sake than anything else with this change, then I'll move to the next one
[08:37:07] and moving on.
[08:37:15] (CR) TrainBranchBot: [C: +2] "Approved by ariel@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: Bartosz Dziewoński)
[08:38:00] (Merged) jenkins-bot: RunSingleJob.php: Remove overly complicated error handling [mediawiki-config] - https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: Bartosz Dziewoński)
[08:38:23] !log ariel@deploy2002 Started scap: Backport for [[gerrit:982414|RunSingleJob.php: Remove overly complicated error handling (T353262)]]
[08:38:27] T353262: Remove writing to $wgCommandLineMode from RunSingleJob.php - https://phabricator.wikimedia.org/T353262
[08:38:51] (PS1) Slyngshede: C:puppetmaster::monitoring Add Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/983131 (https://phabricator.wikimedia.org/T350694)
[08:39:53] !log ariel@deploy2002 matmarex and ariel: Backport for [[gerrit:982414|RunSingleJob.php: Remove overly complicated error handling (T353262)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:40:18] right. no mwdebug testing, moving on
[08:40:21] !log ariel@deploy2002 matmarex and ariel: Continuing with sync
[08:42:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1226 (re)pooling @ 10%: Post clone db1226 repooling', diff saved to https://phabricator.wikimedia.org/P54398 and previous config saved to /var/cache/conftool/dbconfig/20231214-084228-arnaudb.json
[08:42:39] (CR) Arnaudb: [C: +2] mariadb: toggle notification db1226 [puppet] - https://gerrit.wikimedia.org/r/982870 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[08:43:23] (PS2) Slyngshede: C:puppetmaster::monitoring Add Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/983131 (https://phabricator.wikimedia.org/T350694)
[08:46:11] (PS3) Slyngshede: C:puppetmaster::monitoring Add Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/983131 (https://phabricator.wikimedia.org/T350694)
[08:47:02] !log ariel@deploy2002 Finished scap: Backport for [[gerrit:982414|RunSingleJob.php: Remove overly complicated error handling (T353262)]] (duration: 08m 39s)
[08:47:06] T353262: Remove writing to $wgCommandLineMode from RunSingleJob.php - https://phabricator.wikimedia.org/T353262
[08:47:48] I'd like to give this one a couple minutes instead of the one minute ;-) next scap command is all queued up and ready to go after that
[08:49:02] hashar: if you're running the train today, we might go over by 5 mins
[08:49:23] (PS1) Marostegui: orchestrator.conf.json.erb: Add Arnaud to orchestrator [puppet] - https://gerrit.wikimedia.org/r/983135
[08:49:25] (PS4) Slyngshede: C:puppetmaster::monitoring Add Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/983131 (https://phabricator.wikimedia.org/T350694)
[08:49:27] arnaudb: ^
[08:49:51] RunSingleJob.php error logs look normal
[08:49:58] (CR) Alexandros Kosiaris: [C: -1] Update cxserver to 2023-12-04-083437-production (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: KartikMistry)
[08:50:04] (PS2) Marostegui: orchestrator.conf.json.erb: Add Arnaudb to orchestrator [puppet] - https://gerrit.wikimedia.org/r/983135
[08:50:12] (CR) Arnaudb: [C: +1] orchestrator.conf.json.erb: Add Arnaudb to orchestrator [puppet] - https://gerrit.wikimedia.org/r/983135 (owner: Marostegui)
[08:50:16] next patch
[08:50:25] (CR) TrainBranchBot: [C: +2] "Approved by ariel@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/982415 (https://phabricator.wikimedia.org/T353262) (owner: Bartosz Dziewoński)
[08:50:39] (CR) Marostegui: [C: +2] orchestrator.conf.json.erb: Add Arnaudb to orchestrator [puppet] - https://gerrit.wikimedia.org/r/983135 (owner: Marostegui)
[08:51:11] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/982415/ merge conflict, please resolve MatmaRex
[08:51:17] scap is unable to manage that on its own
[08:51:32] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/895/console" [puppet] - https://gerrit.wikimedia.org/r/983131 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:52:46] (Merged) jenkins-bot: RunSingleJob.php: Stop writing to $wgCommandLineMode [mediawiki-config] - https://gerrit.wikimedia.org/r/982415 (https://phabricator.wikimedia.org/T353262) (owner: Bartosz Dziewoński)
[08:52:48] (PS3) Alexandros Kosiaris: function-orchestrator: Bump mesh.configuration:1.6.x to latest patch level [deployment-charts] - https://gerrit.wikimedia.org/r/982822 (https://phabricator.wikimedia.org/T352906)
[08:53:00] welp
[08:53:06] (PS1) David Caro: grid: disable hardcoded memory overcmommit on weblight [puppet] - https://gerrit.wikimedia.org/r/983139
[08:54:17] seems like jenkins took care of it, moving on
[08:54:26] !log ariel@deploy2002 Started scap: Backport for [[gerrit:982415|RunSingleJob.php: Stop writing to $wgCommandLineMode (T353262)]]
[08:54:31] T353262: Remove writing to $wgCommandLineMode from RunSingleJob.php - https://phabricator.wikimedia.org/T353262
[08:55:07] (PS3) Slyngshede: P:debmonitor::server Add Prometheus Blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/983108 (https://phabricator.wikimedia.org/T350694)
[08:55:52] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:56:17] !log ariel@deploy2002 ariel and matmarex: Backport for [[gerrit:982415|RunSingleJob.php: Stop writing to $wgCommandLineMode (T353262)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:56:31] no mwdebug checks blah blah blah, continuing on
[08:56:34] !log ariel@deploy2002 ariel and matmarex: Continuing with sync
[08:57:06] jenkins and gerrit use different methods to merge patches, and sometimes they get different results. it's always confusing when it happens
[08:57:07] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/896/console" [puppet] - https://gerrit.wikimedia.org/r/983108 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:57:22] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:57:31] although usually i've seen it go the other way, jenkins approves tests but gerrit won't merge it
[08:57:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1226 (re)pooling @ 20%: Post clone db1226 repooling', diff saved to https://phabricator.wikimedia.org/P54399 and previous config saved to /var/cache/conftool/dbconfig/20231214-085733-arnaudb.json
[08:59:09] (CR) Alexandros Kosiaris: [C: +2] function-orchestrator: Bump mesh.configuration:1.6.x to latest patch level [deployment-charts] - https://gerrit.wikimedia.org/r/982822 (https://phabricator.wikimedia.org/T352906) (owner: Alexandros Kosiaris)
[08:59:32] yes I guess I usually see gerrit complain about something I can merge in by hand without issues
[09:00:06] brennen and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T0900)
[09:00:13] (Merged) jenkins-bot: function-orchestrator: Bump mesh.configuration:1.6.x to latest patch level [deployment-charts] - https://gerrit.wikimedia.org/r/982822 (https://phabricator.wikimedia.org/T352906) (owner: Alexandros Kosiaris)
[09:00:45] !log jayme@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster2001.codfw.wmnet
[09:01:09] almost done
[09:01:18] SRE-swift-storage, ops-codfw, ops-eqiad, DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (Volans) After discussions in yesterday's office hours it seems that remote IPMI is working correctly as the host does reboot and does try to boot...
[09:02:12] (CR) Filippo Giunchedi: [C: +1] P:debmonitor::server Add Prometheus Blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/983108 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:03:31] !log ariel@deploy2002 Finished scap: Backport for [[gerrit:982415|RunSingleJob.php: Stop writing to $wgCommandLineMode (T353262)]] (duration: 09m 05s)
[09:03:36] T353262: Remove writing to $wgCommandLineMode from RunSingleJob.php - https://phabricator.wikimedia.org/T353262
[09:03:37] (PS3) Alexandros Kosiaris: Bump mesh.configuration:1.4.x to latest patch level [deployment-charts] - https://gerrit.wikimedia.org/r/982823 (https://phabricator.wikimedia.org/T352906)
[09:03:43] live, go watch those logs for a bit please :-)
[09:03:53] thanks
[09:06:00] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[09:06:50] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[09:06:51] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[09:07:26] !log jayme@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster2001.codfw.wmnet
[09:07:47] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1182.eqiad.wmnet onto db1233.eqiad.wmnet
[09:08:00] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[09:08:01] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[09:08:19] MatmaRex: seems ok to me.
[09:09:11] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [09:10:03] !log UTC morning backport and config window done [09:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:37] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cumin1002.eqiad.wmnet with OS bullseye [09:10:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host cumin1002.eqiad.wmnet [09:10:43] 10SRE, 10Infrastructure-Foundations: Setup cumin1002 - https://phabricator.wikimedia.org/T353419 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host cumin1002.eqiad.wmnet with OS bullseye executed with errors: - cumin1002 (**FAIL**) - Removed from Puppet and PuppetDB... [09:10:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] Bump mesh.configuration:1.4.x to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982823 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [09:12:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host cumin1002.eqiad.wmnet with OS bullseye [09:12:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1226 (re)pooling @ 30%: Post clone db1226 repooling', diff saved to https://phabricator.wikimedia.org/P54400 and previous config saved to /var/cache/conftool/dbconfig/20231214-091238-arnaudb.json [09:15:53] (03Merged) 10jenkins-bot: Bump mesh.configuration:1.4.x to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982823 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [09:16:26] (03PS1) 10Arnaudb: mariadb: repooling 3 hosts [puppet] - 10https://gerrit.wikimedia.org/r/982871 (https://phabricator.wikimedia.org/T344036) [09:18:57] (03PS5) 10Slyngshede: C:puppetmaster::monitoring Add Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/983131 (https://phabricator.wikimedia.org/T350694) 
[09:19:50] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [09:20:11] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [09:20:12] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [09:20:49] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [09:20:50] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [09:21:39] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [09:22:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cumin1002.eqiad.wmnet with reason: host reimage [09:22:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:22:48] !log delete raw replica blocks for prometheus/ops (only one replica) in codfw - T351927 [09:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:53] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927 [09:24:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:24:18] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [09:24:31] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [09:24:32] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [09:24:47] !log update all the other services. 
T352906 [09:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:52] T352906: mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 [09:24:52] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [09:24:54] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [09:25:19] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [09:25:36] 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10ayounsi) [09:25:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cumin1002.eqiad.wmnet with reason: host reimage [09:27:21] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [09:27:27] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [09:27:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1226 (re)pooling @ 40%: Post clone db1226 repooling', diff saved to https://phabricator.wikimedia.org/P54401 and previous config saved to /var/cache/conftool/dbconfig/20231214-092743-arnaudb.json [09:30:08] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:32:23] (CertAlmostExpired) firing: (2) Certificate for service kubestagemaster2001:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster2001:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:35:52] (03PS1) 10Ayounsi: Remove load-balancing VRRP master pinning [homer/public] - 10https://gerrit.wikimedia.org/r/983143 (https://phabricator.wikimedia.org/T307551) [09:38:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cumin1002.eqiad.wmnet with OS bullseye [09:38:25] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [09:38:42] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [09:38:43] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [09:39:20] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [09:39:22] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [09:40:01] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [09:42:11] (03PS1) 10Muehlenhoff: homer: Add profile option to disable homer for a given host [puppet] - 10https://gerrit.wikimedia.org/r/983144 (https://phabricator.wikimedia.org/T353419) [09:42:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1226 (re)pooling @ 50%: Post clone db1226 repooling', diff saved to https://phabricator.wikimedia.org/P54402 and previous config saved to /var/cache/conftool/dbconfig/20231214-094248-arnaudb.json [09:44:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983144 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [09:48:50] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [09:49:10] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [09:49:11] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [09:49:13] (03CR) 10Ladsgroup: [C: 03+1] mariadb: repooling 3 hosts [puppet] - 10https://gerrit.wikimedia.org/r/982871 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:49:25] (03CR) 10Arnaudb: [C: 03+2] mariadb: repooling 3 hosts [puppet] - 10https://gerrit.wikimedia.org/r/982871 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:49:30] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [09:49:31] 
!log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [09:49:52] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [09:51:07] !log Restarting CI Jenkins [09:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:37] (03PS1) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) [09:51:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1232 (re)pooling @ 5%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54404 and previous config saved to /var/cache/conftool/dbconfig/20231214-095149-arnaudb.json [09:52:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1237 (re)pooling @ 5%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54405 and previous config saved to /var/cache/conftool/dbconfig/20231214-095208-arnaudb.json [09:52:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1248 (re)pooling @ 5%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54406 and previous config saved to /var/cache/conftool/dbconfig/20231214-095228-arnaudb.json [09:56:36] !log remove >= 3 months old thanos blocks for prometheus/ops in eqiad/codfw and only for a single replica - T351927 [09:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:41] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927 [09:57:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1226 (re)pooling @ 60%: Post clone db1226 repooling', diff saved to https://phabricator.wikimedia.org/P54407 and previous config saved to /var/cache/conftool/dbconfig/20231214-095753-arnaudb.json [09:58:23] (03PS2) 10Muehlenhoff: homer: Add profile option to disable homer for a given host [puppet] - 10https://gerrit.wikimedia.org/r/983144 (https://phabricator.wikimedia.org/T353419) [09:58:36] !log 
akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:58:48] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:58:50] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [09:58:53] (03PS1) 10Arnaudb: mariadb: toggle notification for db1233 [puppet] - 10https://gerrit.wikimedia.org/r/982872 (https://phabricator.wikimedia.org/T344036) [09:59:42] (03CR) 10Volans: [C: 03+1] "LGTM for the immediate workaround, but we need the hiera to disable homer to be set before the first installation" [puppet] - 10https://gerrit.wikimedia.org/r/983144 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [09:59:43] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [09:59:44] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [10:00:18] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [10:00:25] (03CR) 10Ladsgroup: "icinga is still red: https://icinga.wikimedia.org/icinga/" [puppet] - 10https://gerrit.wikimedia.org/r/982872 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:00:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983144 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [10:04:30] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [10:04:41] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [10:04:42] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [10:04:50] (03CR) 10Arnaudb: mariadb: toggle notification for db1233 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982872 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:05:01] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply 
[10:05:02] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [10:05:21] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [10:05:43] (03CR) 10Volans: [C: 04-1] "forgot one comment" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [10:05:51] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:06:52] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [10:06:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1232 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54408 and previous config saved to /var/cache/conftool/dbconfig/20231214-100654-arnaudb.json [10:06:55] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:07:10] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [10:07:11] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [10:07:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1237 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54409 and previous config saved to /var/cache/conftool/dbconfig/20231214-100713-arnaudb.json [10:07:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1248 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54410 and previous config saved to /var/cache/conftool/dbconfig/20231214-100733-arnaudb.json [10:07:38] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [10:07:39] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [10:07:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1233 (re)pooling @ 5%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54411 and previous config saved to /var/cache/conftool/dbconfig/20231214-100756-arnaudb.json [10:08:08] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [10:08:28] (03PS1) 10Muehlenhoff: Apply cluster::management role to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/983147 (https://phabricator.wikimedia.org/T353419) [10:08:57] (03CR) 10Ladsgroup: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/982872 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:09:08] (03CR) 10Arnaudb: [C: 03+2] mariadb: toggle notification for db1233 [puppet] - 10https://gerrit.wikimedia.org/r/982872 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:09:17] (03CR) 10Muehlenhoff: [C: 03+2] homer: Add profile option to disable homer for a given host [puppet] - 
10https://gerrit.wikimedia.org/r/983144 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [10:09:39] arnaudb: I'll merge your patch along, ok? [10:09:46] please moritzm :) [10:09:59] (03PS1) 10Elukey: ml-services: update Docker image for article-description [deployment-charts] - 10https://gerrit.wikimedia.org/r/983148 (https://phabricator.wikimedia.org/T352750) [10:10:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983147 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [10:10:43] arnaudb: merged [10:11:24] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [10:11:36] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [10:11:38] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [10:11:53] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [10:11:54] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [10:12:12] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [10:12:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1226 (re)pooling @ 70%: Post clone db1226 repooling', diff saved to https://phabricator.wikimedia.org/P54412 and previous config saved to /var/cache/conftool/dbconfig/20231214-101258-arnaudb.json [10:13:44] (03PS7) 10Kamila Součková: kube-state-metrics: DRY network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) [10:13:50] (03CR) 10Kamila Součková: kube-state-metrics: DRY network policy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [10:13:52] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [10:14:04] !log akosiaris@deploy2002 
helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [10:14:05] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [10:14:20] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [10:14:22] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [10:14:41] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [10:16:17] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new cumin1002 host - jmm@cumin2002" [10:17:50] (03CR) 10Volans: [C: 04-1] "Some missing bits:" [puppet] - 10https://gerrit.wikimedia.org/r/983147 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [10:17:53] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update Docker image for article-description [deployment-charts] - 10https://gerrit.wikimedia.org/r/983148 (https://phabricator.wikimedia.org/T352750) (owner: 10Elukey) [10:18:46] (03CR) 10Elukey: [C: 03+2] ml-services: update Docker image for article-description [deployment-charts] - 10https://gerrit.wikimedia.org/r/983148 (https://phabricator.wikimedia.org/T352750) (owner: 10Elukey) [10:18:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new cumin1002 host - jmm@cumin2002" [10:21:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1232 (re)pooling @ 25%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54413 and previous config saved to /var/cache/conftool/dbconfig/20231214-102159-arnaudb.json [10:22:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1237 (re)pooling @ 25%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54414 and previous config saved to /var/cache/conftool/dbconfig/20231214-102218-arnaudb.json [10:22:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1248 (re)pooling @ 25%: Post clone 
repooling', diff saved to https://phabricator.wikimedia.org/P54415 and previous config saved to /var/cache/conftool/dbconfig/20231214-102238-arnaudb.json [10:23:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1233 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54416 and previous config saved to /var/cache/conftool/dbconfig/20231214-102301-arnaudb.json [10:24:16] (03PS1) 10AikoChou: Add a testing stream for page-prediction-change events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982873 (https://phabricator.wikimedia.org/T349919) [10:26:31] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [10:28:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1226 (re)pooling @ 80%: Post clone db1226 repooling', diff saved to https://phabricator.wikimedia.org/P54417 and previous config saved to /var/cache/conftool/dbconfig/20231214-102803-arnaudb.json [10:29:01] (03CR) 10AikoChou: Add a testing stream for page-prediction-change events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982873 (https://phabricator.wikimedia.org/T349919) (owner: 10AikoChou) [10:31:15] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:35:26] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10cmooney) >>! In T350179#9405805, @Volans wrote: > After discussions in yesterday's office hours it seems that remote IPMI is working correctly as... 
[10:37:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1232 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54418 and previous config saved to /var/cache/conftool/dbconfig/20231214-103704-arnaudb.json [10:37:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1237 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54419 and previous config saved to /var/cache/conftool/dbconfig/20231214-103723-arnaudb.json [10:37:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1248 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54420 and previous config saved to /var/cache/conftool/dbconfig/20231214-103743-arnaudb.json [10:37:51] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:38:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1233 (re)pooling @ 15%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54421 and previous config saved to /var/cache/conftool/dbconfig/20231214-103806-arnaudb.json [10:38:55] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:41:29] (03CR) 10Fabfur: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/966885 (https://phabricator.wikimedia.org/T336385) (owner: 10Hnowlan) [10:42:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:42:29] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [10:43:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1226 (re)pooling @ 90%: Post clone db1226 repooling', diff saved to https://phabricator.wikimedia.org/P54422 and previous config saved to /var/cache/conftool/dbconfig/20231214-104308-arnaudb.json [10:43:27] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:43:33] (03CR) 10Clément Goubert: k8s topology labels: add row to rack transition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [10:45:07] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update codfw-eqiad transport ptr - ayounsi@cumin1001" [10:46:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update codfw-eqiad transport ptr - ayounsi@cumin1001" [10:46:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:47:30] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route all requests for /api/rest_v1/metrics to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/966885 (https://phabricator.wikimedia.org/T336385) (owner: 10Hnowlan) [10:52:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1232 (re)pooling @ 75%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54423 and previous config saved to /var/cache/conftool/dbconfig/20231214-105209-arnaudb.json [10:52:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1237 (re)pooling @ 75%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54424 and previous config saved to 
/var/cache/conftool/dbconfig/20231214-105228-arnaudb.json [10:52:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1248 (re)pooling @ 75%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54425 and previous config saved to /var/cache/conftool/dbconfig/20231214-105248-arnaudb.json [10:53:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1233 (re)pooling @ 20%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54426 and previous config saved to /var/cache/conftool/dbconfig/20231214-105311-arnaudb.json [10:57:07] (CertAlmostExpired) resolved: (2) Certificate for service kubestagemaster2001:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster2001:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:58:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1226 (re)pooling @ 100%: Post clone db1226 repooling', diff saved to https://phabricator.wikimedia.org/P54427 and previous config saved to /var/cache/conftool/dbconfig/20231214-105814-arnaudb.json [10:59:19] (03PS1) 10FNegri: team-wmcs: improve cloudvirt alerts [alerts] - 10https://gerrit.wikimedia.org/r/983156 [11:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1100). 
[11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1100) [11:01:07] (CertAlmostExpired) firing: (2) Certificate for service kubestagemaster2001:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster2001:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:02:22] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:06:15] <_joe_> !log restarted apache2 on lists1001 [11:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1232 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54428 and previous config saved to /var/cache/conftool/dbconfig/20231214-110714-arnaudb.json [11:07:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1237 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54429 and previous config saved to /var/cache/conftool/dbconfig/20231214-110733-arnaudb.json [11:07:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1248 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54430 and previous config saved to /var/cache/conftool/dbconfig/20231214-110754-arnaudb.json [11:08:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1233 (re)pooling @ 25%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54431 and previous config saved to /var/cache/conftool/dbconfig/20231214-110816-arnaudb.json [11:12:57] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab Replica to new version [11:16:44] (03PS1) 10Sg912: Geo analytics image version change [deployment-charts] - 
10https://gerrit.wikimedia.org/r/983160 [11:21:40] (03PS3) 10Bartosz Dziewoński: Update expected RunSingleJob.php status code [puppet] - 10https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265) [11:21:42] (03CR) 10Hnowlan: [C: 04-1] Geo analytics image version change (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/983160 (owner: 10Sg912) [11:23:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1233 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54432 and previous config saved to /var/cache/conftool/dbconfig/20231214-112321-arnaudb.json [11:23:40] (03PS2) 10Sg912: Geo analytics image version change [deployment-charts] - 10https://gerrit.wikimedia.org/r/983160 [11:24:08] !log jayme@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster2002.codfw.wmnet [11:24:10] (03CR) 10Sg912: Geo analytics image version change (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/983160 (owner: 10Sg912) [11:25:47] !log jayme@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster1001.eqiad.wmnet [11:27:49] (03CR) 10Fabfur: [C: 03+1] "Looks good to me, let me know if you need some help with restarts or other operational stuff!" 
[puppet] - 10https://gerrit.wikimedia.org/r/974120 (owner: 10Stevemunene) [11:27:51] (03CR) 10Btullis: Define the spark-history chart (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [11:28:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:29:18] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:28] (03CR) 10Hnowlan: [C: 03+1] Geo analytics image version change [deployment-charts] - 10https://gerrit.wikimedia.org/r/983160 (owner: 10Sg912) [11:30:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster2002.codfw.wmnet [11:31:21] !log jayme@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster1001.eqiad.wmnet [11:33:17] (03CR) 10Sg912: [C: 03+1] Geo analytics image version change [deployment-charts] - 10https://gerrit.wikimedia.org/r/983160 (owner: 10Sg912) [11:34:00] (03PS2) 10Kamila Součková: mobileapps: 75% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976223 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [11:35:20] (03CR) 10Santiago Faci: [C: 03+1] "It looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/983160 (owner: 10Sg912) [11:38:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1233 (re)pooling @ 75%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54433 and previous config saved to /var/cache/conftool/dbconfig/20231214-113826-arnaudb.json [11:41:48] (03PS1) 10Kamila Součková: mw-api-int: replicas x125% [deployment-charts] - 10https://gerrit.wikimedia.org/r/983163 (https://phabricator.wikimedia.org/T350846) [11:42:10] (03CR) 10Btullis: "Apologies for the extra work, but I think we decided that, at the expense of repeating the config, we would create two independent helmfil" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) (owner: 10Brouberol) [11:42:39] (03PS1) 10JMeybohm: Exclude custom probes from generic alerts [alerts] - 10https://gerrit.wikimedia.org/r/983164 (https://phabricator.wikimedia.org/T353233) [11:42:52] !log jayme@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster1002.eqiad.wmnet [11:44:08] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:53] (03CR) 10Clément Goubert: [C: 03+1] mw-api-int: replicas x125% [deployment-charts] - 10https://gerrit.wikimedia.org/r/983163 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [11:47:45] (03CR) 10Kamila Součková: [C: 03+2] mw-api-int: replicas x125% [deployment-charts] - 10https://gerrit.wikimedia.org/r/983163 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [11:47:53] (03CR) 10Hnowlan: [C: 03+2] Geo analytics image version change [deployment-charts] - 10https://gerrit.wikimedia.org/r/983160 (owner: 10Sg912) [11:48:40] (03Merged) 10jenkins-bot: mw-api-int: replicas x125% [deployment-charts] - 10https://gerrit.wikimedia.org/r/983163 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila 
Součková) [11:48:58] (03Merged) 10jenkins-bot: Geo analytics image version change [deployment-charts] - 10https://gerrit.wikimedia.org/r/983160 (owner: 10Sg912) [11:49:22] !log jayme@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster1002.eqiad.wmnet [11:49:52] (03CR) 10Muehlenhoff: Apply cluster::management role to cumin1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983147 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [11:51:07] (CertAlmostExpired) resolved: (2) Certificate for service kubestagemaster2001:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster2001:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:51:13] (03PS2) 10Muehlenhoff: Apply cluster::management role to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/983147 (https://phabricator.wikimedia.org/T353419) [11:51:25] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:51:30] (03PS10) 10Slyngshede: Move Debmonitor client code to separate repository. [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 [11:51:59] (03CR) 10Slyngshede: Move Debmonitor client code to separate repository. (0312 comments) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [11:52:55] (03CR) 10Hnowlan: [C: 03+1] mw-on-k8s: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982431 (owner: 10Clément Goubert) [11:53:00] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/983147 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [11:53:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/982846 (https://phabricator.wikimedia.org/T352838) (owner: 10Btullis) [11:53:09] (03CR) 10CI reject: [V: 04-1] Move Debmonitor client code to separate repository. 
[software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [11:53:20] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982431 (owner: 10Clément Goubert) [11:53:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1233 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54434 and previous config saved to /var/cache/conftool/dbconfig/20231214-115332-arnaudb.json [11:54:00] (03CR) 10Btullis: [C: 03+2] Add a spark system user/group for the spark-history service [puppet] - 10https://gerrit.wikimedia.org/r/982846 (https://phabricator.wikimedia.org/T352838) (owner: 10Btullis) [11:54:32] (03Merged) 10jenkins-bot: mw-on-k8s: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982431 (owner: 10Clément Goubert) [11:55:07] (CertAlmostExpired) firing: (2) Certificate for service kubestagemaster2001:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster2001:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:57:00] (03PS1) 10Kamila Součková: Revert "mw-api-int: replicas x125%" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982840 [11:57:48] jouncebot: nowandnext [11:57:48] For the next 0 hour(s) and 2 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1100) [11:57:48] For the next 0 hour(s) and 2 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1100) [11:57:48] In 1 hour(s) and 2 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1300) [11:58:15] (03CR) 10Kamila Součková: [C: 03+2] Revert "mw-api-int: replicas x125%" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982840 (owner: 10Kamila Součková) [11:59:07] (03Merged) 
10jenkins-bot: Revert "mw-api-int: replicas x125%" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982840 (owner: 10Kamila Součková) [12:01:03] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [12:01:22] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [12:01:42] (03PS1) 10Kamila Součková: Revert "Revert "mw-api-int: replicas x125%"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982841 [12:01:58] (03CR) 10Clément Goubert: [C: 03+1] Revert "Revert "mw-api-int: replicas x125%"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982841 (owner: 10Kamila Součková) [12:02:19] (03CR) 10Muehlenhoff: [C: 03+2] Apply cluster::management role to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/983147 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [12:03:02] (03CR) 10Kamila Součková: [C: 03+2] Revert "Revert "mw-api-int: replicas x125%"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982841 (owner: 10Kamila Součková) [12:03:19] !log cgoubert@deploy2002 Started scap: Deploying php-fpm-exporter 0.0.3 - 982431, mw-api-int: replicas x125% - 982841 [12:03:21] !log cgoubert@deploy2002 sync-world aborted: Deploying php-fpm-exporter 0.0.3 - 982431, mw-api-int: replicas x125% - 982841 (duration: 00m 02s) [12:04:05] (03Merged) 10jenkins-bot: Revert "Revert "mw-api-int: replicas x125%"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982841 (owner: 10Kamila Součková) [12:05:55] !log cgoubert@deploy2002 Started scap: Deploying php-fpm-exporter 0.0.3 - 982431, mw-api-int: replicas x125% - 982841 [12:08:08] (03PS1) 10Muehlenhoff: homer: Fix typo in parameter name [puppet] - 10https://gerrit.wikimedia.org/r/983166 [12:09:25] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/982927 (owner: 10Ryan Kemper) [12:10:11] !log cgoubert@deploy2002 Finished scap: Deploying php-fpm-exporter 0.0.3 - 982431, mw-api-int: replicas 
x125% - 982841 (duration: 04m 16s) [12:12:50] PROBLEM - Check systemd state on kubernetes2026 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:34] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv [12:14:34] e - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IP [12:14:34] ve - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:14:35] (KubernetesCalicoDown) firing: (29) kubemaster2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:15:00] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv [12:15:00] e - 
kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IP [12:15:00] ve - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:15:09] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/983166 (owner: 10Muehlenhoff) [12:15:52] PROBLEM - Check systemd state on kubernetes2016 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:42] (03PS1) 10Muehlenhoff: Make cumin1002 a DB admin host [puppet] - 10https://gerrit.wikimedia.org/r/983169 (https://phabricator.wikimedia.org/T353419) [12:16:58] (03CR) 10Muehlenhoff: [C: 03+2] homer: Fix typo in parameter name [puppet] - 10https://gerrit.wikimedia.org/r/983166 (owner: 10Muehlenhoff) [12:17:34] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 39.09% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:18:16] (MediaWikiMemcachedHighErrorRate) firing: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:18:47] (ProbeDown) firing: Service 
thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:19:34] (KubernetesCalicoDown) firing: (53) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:19:59] (SwaggerProbeHasFailures) firing: (3) Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:20:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:10] calico-node is running on both though [12:20:16] RECOVERY - Check systemd state on kubernetes2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:32] (ProbeDown) firing: (3) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:48] topranks: can you check on the cr- side ? [12:20:54] here too [12:21:07] * topranks looking [12:21:12] k8s codfw high latency? [12:21:36] here [12:21:38] or network in general? 
[12:21:40] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - recommendation-api_4632: Servers kubernetes2046.codfw.wmnet, kubernetes2060.codfw.wmnet, kubernetes2058.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2032.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2024.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2 [12:21:40] w.wmnet, kubernetes2022.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2029.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2033.codfw.wmnet, kubernetes2044.codfw.wmnet, kubernetes2041.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2035.codfw.wmnet are marked down but pooled: mobileapps_4102: Servers kubernetes2017.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2053.co [12:21:40] t, kubernetes2007.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2043.codfw.wmnet, kubernetes2041 https://wikitech.wikimedia.org/wiki/PyBal [12:21:50] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - recommendation-api_4632: Servers kubernetes2046.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2030.codfw.wmnet, kubernetes2054.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2047.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2 [12:21:50] w.wmnet, kubernetes2043.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2056.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2033.codfw.wmnet, kubernetes2044.codfw.wmnet, kubernetes2051.codfw.wmnet, kubernetes2027.codfw.wmnet, kubernetes2057.codfw.wmnet are marked down but pooled: mw-jobrunner_4448: Servers kubernetes2060.codfw.wmnet, 
kubernetes2025.codfw.wmnet, kubernetes2038.codfw.wmnet, kubernetes2024. [12:21:50] net, kubernetes2052.codfw.wmnet, kubernetes2034.codfw.wmnet, kubernetes2048.codfw.wmnet, kubernetes2047.codfw.wmnet, kubernetes2036.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes20 https://wikitech.wikimedia.org/wiki/PyBal [12:22:00] (ProbeDown) firing: (5) Service cxserver:4002 has failed probes (http_cxserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:22:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 13.12% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:22:30] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2016 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:22:36] CDN 5XX is spiking too [12:22:49] (03PS1) 10Muehlenhoff: Mark profile::homer::private_git_peer as optional [puppet] - 10https://gerrit.wikimedia.org/r/983174 (https://phabricator.wikimedia.org/T353419) [12:22:58] BGP to the LVS is ok [12:23:06] (03PS2) 10Muehlenhoff: Mark profile::homer::private_git_peer as optional [puppet] - 10https://gerrit.wikimedia.org/r/983174 (https://phabricator.wikimedia.org/T353419) [12:23:16] (MediaWikiMemcachedHighErrorRate) resolved: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:23:24] <_joe_> it's not just k8s [12:23:43] <_joe_> and I think the problem is probably memcached? 
[12:23:47] (ProbeDown) firing: (12) Service cxserver:4002 has failed probes (http_cxserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:11] _joe_: see -sre too [12:24:34] (KubernetesCalicoDown) firing: (67) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:24:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:24:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [12:24:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:24:59] (SwaggerProbeHasFailures) firing: (7) Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:25:07] (ProbeDown) firing: (4) Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:32] (ProbeDown) firing: (10) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [12:26:32] (03CR) 10Volans: [C: 03+1] "makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/983174 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [12:27:00] (ProbeDown) firing: (20) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:27:01] (03CR) 10Muehlenhoff: [C: 03+2] Mark profile::homer::private_git_peer as optional [puppet] - 10https://gerrit.wikimedia.org/r/983174 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [12:27:09] (PHPFPMTooBusy) firing: (4) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:27:09] (MediaWikiLatencyExceeded) firing: (2) Average latency high: codfw mw-api-ext (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:28:48] (ProbeDown) firing: (21) Service 
appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:29:22] PROBLEM - MediaWiki edit session loss on graphite1005 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 [12:29:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:29:35] (KubernetesCalicoDown) firing: (66) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:29:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:29:50] (KubernetesCalicoDown) firing: (66) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:29:59] (SwaggerProbeHasFailures) firing: (9) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:30:07] (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:30:32] (ProbeDown) firing: (10) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:30:44] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:30:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [12:30:56] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:31:06] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 286, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:31:34] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 203, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:32:01] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: (2) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [12:32:06] (ProbeDown) resolved: (20) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:32:12] (PHPFPMTooBusy) resolved: (4) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 41.99% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:32:17] (MediaWikiLatencyExceeded) resolved: (2) Average latency high: codfw mw-api-ext (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:32:23] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=codfw%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:34:35] (KubernetesCalicoDown) resolved: (66) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:34:42] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=upload&var-origin=kartotherian.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHi [12:34:49] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [12:34:59] (SwaggerProbeHasFailures) firing: (9) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:35:07] (ProbeDown) resolved: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:35:19] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:35:32] (ProbeDown) resolved: (9) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:37:01] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (2) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [12:37:12] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=codfw%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:37:40] (03PS1) 10Muehlenhoff: Set a default value for profile::homer::private_git_peer [puppet] - 10https://gerrit.wikimedia.org/r/983177 (https://phabricator.wikimedia.org/T353419) [12:38:28] (03PS1) 10Dreamy Jazz: CheckUser: Enable read new for event tables migration everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983178 (https://phabricator.wikimedia.org/T341829) [12:38:52] (03PS1) 10Arnaudb: mariadb: productionize db1234 db1249 [puppet] - 10https://gerrit.wikimedia.org/r/982874 (https://phabricator.wikimedia.org/T344036) [12:39:42] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=upload&var-origin=kartotherian.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrors [12:39:59] (SwaggerProbeHasFailures) resolved: (8) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:40:19] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:41:24] RECOVERY - Check systemd state on kubernetes2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:54] !log 
isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [12:44:18] RECOVERY - MediaWiki edit session loss on graphite1005 is OK: OK: Less than 30.00% above the threshold [10.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 [12:45:19] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:45:29] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [12:45:45] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [12:47:59] (03PS1) 10Elukey: profile::pyrra::filesystem: remove Lift Wing Pilot [puppet] - 10https://gerrit.wikimedia.org/r/983179 (https://phabricator.wikimedia.org/T352756) [12:49:40] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/899/console" [puppet] - 10https://gerrit.wikimedia.org/r/983179 (https://phabricator.wikimedia.org/T352756) (owner: 10Elukey) [12:52:16] (03CR) 10Marostegui: "Are they bookworm? 
If not, they are missing wmf106 key" [puppet] - 10https://gerrit.wikimedia.org/r/982874 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [12:53:00] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2016 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:54:39] (03CR) 10Arnaudb: mariadb: productionize db1234 db1249 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982874 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [12:57:39] (03CR) 10Cathal Mooney: [C: 03+1] "Looks alright-ish to me :P" [homer/public] - 10https://gerrit.wikimedia.org/r/983143 (https://phabricator.wikimedia.org/T307551) (owner: 10Ayounsi) [12:58:18] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: eqiad: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T353295 (10MoritzMuehlenhoff) >>! In T353295#9404582, @BCornwall wrote: > @MoritzMuehlenhoff The other instances also have 10G. Would you still recommend that despite it bri... 
[12:58:25] (03PS2) 10Arnaudb: mariadb: productionize db1234 db1249 [puppet] - 10https://gerrit.wikimedia.org/r/982874 (https://phabricator.wikimedia.org/T344036) [12:58:38] (03CR) 10Arnaudb: mariadb: productionize db1234 db1249 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982874 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1300) [13:01:06] (03CR) 10Marostegui: "If you are cloning from a 10.4 host, remember to run: mysql_upgrade on the new hosts once mariadb is back up" [puppet] - 10https://gerrit.wikimedia.org/r/982874 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [13:01:17] (03CR) 10Marostegui: [C: 03+1] mariadb: productionize db1234 db1249 [puppet] - 10https://gerrit.wikimedia.org/r/982874 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [13:02:28] (03CR) 10Arnaudb: mariadb: productionize db1234 db1249 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982874 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [13:02:33] (03CR) 10Muehlenhoff: [C: 03+2] Create initial stub role for logging-hd and configure for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/982801 (https://phabricator.wikimedia.org/T352517) (owner: 10Muehlenhoff) [13:02:35] (03CR) 10Arnaudb: [C: 03+2] mariadb: productionize db1234 db1249 [puppet] - 10https://gerrit.wikimedia.org/r/982874 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [13:04:11] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/983177 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [13:05:50] (03CR) 10Muehlenhoff: [C: 03+2] Set a default value for profile::homer::private_git_peer [puppet] - 10https://gerrit.wikimedia.org/r/983177 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [13:08:48] !log arnaudb@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: provisionning db1249.eqiad.wmnet - T344036 [13:09:00] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [13:09:04] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: provisionning db1249.eqiad.wmnet - T344036 [13:09:08] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: provisionning db1249.eqiad.wmnet - T344036 [13:09:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: provisionning db1249.eqiad.wmnet - T344036 [13:10:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1149 in db1249 for T344036', diff saved to https://phabricator.wikimedia.org/P54435 and previous config saved to /var/cache/conftool/dbconfig/20231214-131017-arnaudb.json [13:10:27] (03PS1) 10Muehlenhoff: homer: One more default value [puppet] - 10https://gerrit.wikimedia.org/r/983182 (https://phabricator.wikimedia.org/T353419) [13:12:51] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1149.eqiad.wmnet onto db1249.eqiad.wmnet [13:17:38] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: provisionning db1234.eqiad.wmnet - T344036 [13:17:49] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [13:17:53] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: provisionning db1234.eqiad.wmnet - T344036 [13:17:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: provisionning db1234.eqiad.wmnet - T344036 [13:18:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1234.eqiad.wmnet 
with reason: provisionning db1234.eqiad.wmnet - T344036 [13:18:54] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::pyrra::filesystem: remove Lift Wing Pilot [puppet] - 10https://gerrit.wikimedia.org/r/983179 (https://phabricator.wikimedia.org/T352756) (owner: 10Elukey) [13:19:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1134 in db1234 for T344036', diff saved to https://phabricator.wikimedia.org/P54436 and previous config saved to /var/cache/conftool/dbconfig/20231214-131913-arnaudb.json [13:19:15] (03CR) 10Filippo Giunchedi: [C: 03+1] Exclude custom probes from generic alerts [alerts] - 10https://gerrit.wikimedia.org/r/983164 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [13:21:16] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1134.eqiad.wmnet onto db1234.eqiad.wmnet [13:21:41] 10SRE-swift-storage, 10Observability-Metrics, 10Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi We're at ~85% used now, as per cleanup in T351927. 
We'll be tracking followups there too, resolving this [13:21:51] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/983182 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [13:22:48] (03CR) 10Muehlenhoff: [C: 03+2] homer: One more default value [puppet] - 10https://gerrit.wikimedia.org/r/983182 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [13:23:00] (PuppetFailure) firing: Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:24:17] (03PS3) 10Brouberol: spark-history: define helmfile configuration and release values [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) [13:24:20] (03CR) 10Brouberol: spark-history: define helmfile configuration and release values (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) (owner: 10Brouberol) [13:27:59] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:28:18] (03PS1) 10Muehlenhoff: Add cumin1002 to list of DB root clients [puppet] - 10https://gerrit.wikimedia.org/r/983185 (https://phabricator.wikimedia.org/T353419) [13:30:25] (03PS2) 10Muehlenhoff: Add cumin1002 to list of DB root clients [puppet] - 10https://gerrit.wikimedia.org/r/983185 (https://phabricator.wikimedia.org/T353419) [13:31:20] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/983185 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [13:31:29] (03CR) 10Klausman: [C: 03+1] profile::pyrra::filesystem: remove Lift Wing Pilot [puppet] - 10https://gerrit.wikimedia.org/r/983179 
(https://phabricator.wikimedia.org/T352756) (owner: 10Elukey) [13:33:28] (03CR) 10Muehlenhoff: [C: 03+2] Add cumin1002 to list of DB root clients [puppet] - 10https://gerrit.wikimedia.org/r/983185 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [13:36:14] (03PS17) 10Brouberol: Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) [13:36:38] (03PS1) 10Ilias Sarantopoulos: ml-services: update outlink topic model image [deployment-charts] - 10https://gerrit.wikimedia.org/r/983189 (https://phabricator.wikimedia.org/T352834) [13:37:40] (03PS1) 10Arnaudb: mariadb: decommission hosts [puppet] - 10https://gerrit.wikimedia.org/r/982875 (https://phabricator.wikimedia.org/T350458) [13:38:43] (03PS2) 10Muehlenhoff: Make cumin1002 a DB admin host [puppet] - 10https://gerrit.wikimedia.org/r/983169 (https://phabricator.wikimedia.org/T353419) [13:39:30] (03CR) 10Marostegui: [C: 03+1] mariadb: decommission hosts [puppet] - 10https://gerrit.wikimedia.org/r/982875 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [13:39:42] (03CR) 10Arnaudb: [C: 03+2] mariadb: decommission hosts [puppet] - 10https://gerrit.wikimedia.org/r/982875 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [13:41:06] (03CR) 10Klausman: [C: 03+1] ml-services: update outlink topic model image [deployment-charts] - 10https://gerrit.wikimedia.org/r/983189 (https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos) [13:42:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'decommissionning hosts', diff saved to https://phabricator.wikimedia.org/P54437 and previous config saved to /var/cache/conftool/dbconfig/20231214-134203-arnaudb.json [13:42:43] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1132.eqiad.wmnet [13:43:43] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1137.eqiad.wmnet [13:44:51] !log arnaudb@cumin1001 START 
- Cookbook sre.hosts.decommission for hosts db1148.eqiad.wmnet [13:45:40] (03PS2) 10Ilias Sarantopoulos: ml-services: update outlink topic model image on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/983189 (https://phabricator.wikimedia.org/T352834) [13:48:07] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [13:49:59] (PuppetZeroResources) firing: Puppet has failed generate resources on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:50:07] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1132.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [13:50:18] (03CR) 10Herron: [C: 03+1] profile::pyrra::filesystem: remove Lift Wing Pilot [puppet] - 10https://gerrit.wikimedia.org/r/983179 (https://phabricator.wikimedia.org/T352756) (owner: 10Elukey) [13:50:36] (03CR) 10AikoChou: [C: 03+1] ml-services: update outlink topic model image on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/983189 (https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos) [13:51:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1132.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [13:51:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:51:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1132.eqiad.wmnet [13:51:19] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [13:51:41] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update outlink topic model image on ml-staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/983189 (https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos) [13:52:14] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:30] (03Merged) 10jenkins-bot: ml-services: update outlink topic model image on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/983189 (https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos) [13:52:34] 10ops-eqiad, 10decommission-hardware: decommission db1132.eqiad.wmnet - https://phabricator.wikimedia.org/T353447 (10ABran-WMF) 05In progress→03Open a:05ABran-WMF→03None [13:52:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:43] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1148.eqiad.wmnet [13:52:57] !log installing netty security updates [13:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:03] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [13:53:15] 10ops-eqiad, 10decommission-hardware: decommission db1148.eqiad.wmnet - https://phabricator.wikimedia.org/T353449 (10ABran-WMF) 05In progress→03Open a:05ABran-WMF→03None [13:53:27] (03PS1) 10JMeybohm: calico: Remove CPU limits globally, bump memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/983191 [13:53:34] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:25] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[13:54:27] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:54:28] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1137.eqiad.wmnet [13:54:37] 10ops-eqiad, 10decommission-hardware: decommission db1137.eqiad.wmnet - https://phabricator.wikimedia.org/T353448 (10ABran-WMF) 05In progress→03Open a:05ABran-WMF→03None [13:56:04] !log installing reportbug bugfix updates on buster [13:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:19] (03PS1) 10Brouberol: spark3: set the spark history server domain as yarn.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) [13:58:23] (03PS1) 10Brouberol: yarn: proxy the spark job history requests to the spark history service [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1400) [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:00:13] good, I’m not around :) [14:00:21] (03PS2) 10Brouberol: spark3: set the spark history server domain as yarn.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) [14:00:23] (03PS2) 10Brouberol: yarn: proxy the spark job history requests to the spark history service [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) [14:00:45] (03CR) 10Brouberol: Define the spark-history chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [14:01:28] !log installing ruby-loofah security updates [14:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:36] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:03:31] (03PS1) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/983194 [14:03:41] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:07:55] !log installing ruby-rails-html-sanitizer security updates [14:07:57] (03PS3) 10Brouberol: yarn: proxy the spark job history requests to the spark history service [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) [14:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:11] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/900/con" [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:14:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:11] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:33] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.365 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:16:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51008 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:17:05] (03CR) 10Btullis: "Looks good. All I'd say now is let's remove some production/test values from the chart itself. I can see that it defaults to the test clus" [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [14:17:18] (03CR) 10DCausse: [C: 03+2] cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/983194 (owner: 10DCausse) [14:18:07] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/983194 (owner: 10DCausse) [14:22:15] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:22:43] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:23:03] (03PS1) 10David Caro: NeutronAgentDown: deduplicate alert [alerts] - 10https://gerrit.wikimedia.org/r/983197 [14:23:33] dcausse: when you deploy cirrus-streaming-updater you 'll see some changes regarding the mesh image, those are safe. [14:23:57] I assume you saw them already, so I am a bit late to the party. 
[14:24:09] (03CR) 10Btullis: spark-history: define helmfile configuration and release values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) (owner: 10Brouberol) [14:24:24] akosiaris: yes just saw those, and happy to see them because we'll need them soon to connect to cloudelastic from envoy, thanks! :) [14:27:37] (03PS1) 10Muehlenhoff: Disable dbbbackups for cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/983198 (https://phabricator.wikimedia.org/T353419) [14:27:49] (03PS2) 10JMeybohm: calico: Remove CPU limits globally, bump memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/983191 [14:29:18] (03CR) 10Jcrespo: [C: 03+1] Disable dbbbackups for cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/983198 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [14:29:44] (03CR) 10Muehlenhoff: [C: 03+2] Disable dbbbackups for cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/983198 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [14:31:01] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:03] (03CR) 10Kamila Součková: [C: 03+1] calico: Remove CPU limits globally, bump memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/983191 (owner: 10JMeybohm) [14:32:09] (03CR) 10Clément Goubert: [C: 03+1] calico: Remove CPU limits globally, bump memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/983191 (owner: 10JMeybohm) [14:33:25] (03CR) 10Alexandros Kosiaris: [C: 03+1] calico: Remove CPU limits globally, bump memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/983191 (owner: 10JMeybohm) [14:36:50] (03CR) 10JMeybohm: [C: 03+2] calico: Remove CPU limits globally, bump memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/983191 (owner: 
10JMeybohm) [14:37:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:26] (03CR) 10CI reject: [V: 04-1] calico: Remove CPU limits globally, bump memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/983191 (owner: 10JMeybohm) [14:37:32] (03CR) 10JMeybohm: [C: 03+2] Exclude custom probes from generic alerts [alerts] - 10https://gerrit.wikimedia.org/r/983164 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [14:38:54] (03Merged) 10jenkins-bot: Exclude custom probes from generic alerts [alerts] - 10https://gerrit.wikimedia.org/r/983164 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [14:39:56] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] calico: Remove CPU limits globally, bump memory for wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/983191 (owner: 10JMeybohm) [14:43:20] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:43:22] (03PS1) 10David Caro: NodeDown: deduplicate alert [alerts] - 10https://gerrit.wikimedia.org/r/983201 [14:43:50] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:44:09] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:44:33] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:44:57] (03CR) 10Ottomata: [C: 03+1] Add a testing stream for page-prediction-change events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982873 (https://phabricator.wikimedia.org/T349919) (owner: 10AikoChou) [14:45:01] !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:45:29] !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[14:45:45] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:46:08] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:49:29] (03CR) 10Btullis: [C: 03+1] yarn: proxy the spark job history requests to the spark history service [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:49:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:50:07] (CertAlmostExpired) resolved: (2) Certificate for service kubestagemaster2001:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster2001:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:53:47] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:05] (03CR) 10Btullis: "I think that there might be a few additional roles where we need to set this into the spark3-default.conf" [puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:57:04] (03CR) 10Btullis: "Actually, I think that there might be an efficient way to set this." 
[puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [15:05:07] (03PS2) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) [15:06:31] (03PS3) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) [15:10:22] (03PS4) 10Brouberol: spark-history: define helmfile configuration and release values [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) [15:12:53] (03CR) 10Brouberol: spark-history: define helmfile configuration and release values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) (owner: 10Brouberol) [15:14:25] (03PS5) 10Brouberol: spark-history: define helmfile configuration and release values [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) [15:15:52] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [15:15:52] (03PS4) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) [15:16:08] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [15:16:09] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [15:16:24] (03CR) 10CI reject: [V: 04-1] memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [15:17:02] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [15:17:04] !log akosiaris@deploy2002 helmfile 
[eqiad] START helmfile.d/services/eventgate-analytics: apply [15:19:58] (03PS5) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) [15:27:26] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [15:27:51] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [15:27:54] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [15:27:55] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [15:27:58] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [15:27:59] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [15:28:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:28:27] !log jayme@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:28:33] !log jayme@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:28:35] !log jayme@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:28:39] !log jayme@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:28:41] !log jayme@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:28:46] !log jayme@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:28:48] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:28:53] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:28:54] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[15:29:00] (03PS1) 10Ilias Sarantopoulos: ml-services: increase cpu for nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/983204 (https://phabricator.wikimedia.org/T351740) [15:29:03] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:29:34] (03PS3) 10Brouberol: spark3: set the spark history server domain as yarn.wikimedia.org for analytics-hadoop [puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) [15:29:36] (03PS4) 10Brouberol: yarn: proxy the spark job history requests to the spark history service [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) [15:30:00] (HelmReleaseBadStatus) firing: Helm release eventgate-analytics/production on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:30:36] (03CR) 10Elukey: [C: 03+1] ml-services: increase cpu for nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/983204 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [15:30:52] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@4946bb7]: (no justification provided) [15:31:12] (03CR) 10FNegri: [C: 03+1] "This looks good and I think we can merge it!" 
[alerts] - 10https://gerrit.wikimedia.org/r/983197 (owner: 10David Caro) [15:31:41] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@4946bb7]: (no justification provided) (duration: 00m 48s) [15:32:46] (03PS18) 10Brouberol: Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) [15:32:53] (03CR) 10CI reject: [V: 04-1] spark3: set the spark history server domain as yarn.wikimedia.org for analytics-hadoop [puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [15:33:08] (03CR) 10Brouberol: Define the spark-history chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [15:33:28] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: increase cpu for nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/983204 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [15:34:05] (03PS6) 10Brouberol: spark-history: define helmfile configuration and release values [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) [15:34:24] (03Merged) 10jenkins-bot: ml-services: increase cpu for nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/983204 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [15:34:47] (03PS2) 10AikoChou: Add a testing stream for page-prediction-change events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982873 (https://phabricator.wikimedia.org/T349919) [15:35:16] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [15:35:22] (03PS4) 10Brouberol: spark3: set the spark history server domain for analytics-hadoop [puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) [15:35:24] (03PS5) 10Brouberol: yarn: proxy the 
spark job history requests to the spark history service [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) [15:35:25] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [15:36:26] (03CR) 10Effie Mouzeli: "PCC https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/906/label=puppet5-compiler-node/console" [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [15:37:03] (03PS1) 10Dzahn: planet: remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/983207 (https://phabricator.wikimedia.org/T348392) [15:39:47] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [15:40:00] (HelmReleaseBadStatus) resolved: Helm release eventgate-analytics/production on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:40:02] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [15:40:03] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [15:40:20] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [15:40:21] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [15:40:38] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [15:41:23] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [15:42:03] !log dzahn@cumin2002 
START - Cookbook sre.hosts.decommission for hosts planet2002.codfw.wmnet [15:42:07] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [15:42:21] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [15:42:22] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [15:42:35] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [15:42:36] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [15:42:51] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [15:43:33] (CR) Elukey: [V: +1 C: +2] profile::pyrra::filesystem: remove Lift Wing Pilot [puppet] - https://gerrit.wikimedia.org/r/983179 (https://phabricator.wikimedia.org/T352756) (owner: Elukey) [15:44:44] (PS1) DCausse: cirrus-streaming-updater: bump envoy resources [deployment-charts] - https://gerrit.wikimedia.org/r/983208 (https://phabricator.wikimedia.org/T353460) [15:46:09] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1149.eqiad.wmnet onto db1249.eqiad.wmnet [15:46:22] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [15:46:31] SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (Dzahn) planet1002/planet2002 - deleted, replaced with bookworm VMs [15:46:57] (PS1) Arnaudb: mariadb: toggle notifications for db1249 [puppet] - https://gerrit.wikimedia.org/r/982883 (https://phabricator.wikimedia.org/T344036) [15:47:53] (PS5) Brouberol: spark3: set the spark history server domain for analytics-hadoop [puppet] - https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) [15:47:55] (PS6) Brouberol: yarn: proxy the spark job history
requests to the spark history service [puppet] - https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) [15:48:18] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [15:48:32] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [15:48:33] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [15:48:48] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [15:48:49] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [15:48:57] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [15:49:07] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [15:49:09] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [15:49:10] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [15:49:41] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [15:49:42] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [15:49:57] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [15:50:01] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [15:50:02] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [15:50:19] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [15:50:31] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [15:50:32] (CR) David Caro: NeutronAgentDown: deduplicate alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/983197 (owner: David Caro) [15:50:37] !log akosiaris@deploy2002 helmfile [staging] DONE
helmfile.d/services/image-suggestion: apply [15:50:38] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [15:50:49] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [15:50:50] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [15:50:51] (PS6) Brouberol: spark3: set the spark history server domain for analytics-hadoop [puppet] - https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) [15:50:53] (PS7) Brouberol: yarn: proxy the spark job history requests to the spark history service [puppet] - https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) [15:51:15] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [15:51:16] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [15:51:17] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [15:51:34] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [15:51:45] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [15:51:48] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [15:51:49] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [15:52:30] (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: Brouberol) [15:52:30] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [15:52:48] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [15:52:50] !log
akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [15:52:56] (PS1) David Caro: wmcs: use critical severity instead of task [alerts] - https://gerrit.wikimedia.org/r/983209 [15:53:06] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply [15:53:18] (CR) David Caro: NeutronAgentDown: deduplicate alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/983197 (owner: David Caro) [15:53:29] (CR) David Caro: [C: +2] NeutronAgentDown: deduplicate alert [alerts] - https://gerrit.wikimedia.org/r/983197 (owner: David Caro) [15:53:31] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: planet2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [15:53:32] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [15:53:33] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/media-analytics: apply [15:53:36] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [15:53:37] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply [15:53:50] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [15:53:52] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [15:53:57] (PS2) Arnaudb: mariadb: toggle notifications for db1249 [puppet] - https://gerrit.wikimedia.org/r/982883 (https://phabricator.wikimedia.org/T344036) [15:54:05] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [15:54:07] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [15:54:13] (PS3) Arnaudb: mariadb: toggle notifications for db1249 and db1234 [puppet] - https://gerrit.wikimedia.org/r/982883
(https://phabricator.wikimedia.org/T344036) [15:54:18] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [15:54:19] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [15:54:33] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [15:54:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: planet2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [15:54:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:54:40] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts planet2002.codfw.wmnet [15:54:48] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [15:55:04] (CR) Dzahn: [C: +2] site: remove buster VMs from planet regex [puppet] - https://gerrit.wikimedia.org/r/982157 (https://phabricator.wikimedia.org/T348392) (owner: Dzahn) [15:55:25] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [15:55:26] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [15:55:44] (Merged) jenkins-bot: NeutronAgentDown: deduplicate alert [alerts] - https://gerrit.wikimedia.org/r/983197 (owner: David Caro) [15:56:08] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:56:38] (PS2) Dzahn: planet: remove support for buster [puppet] - https://gerrit.wikimedia.org/r/983207 (https://phabricator.wikimedia.org/T348392) [15:56:48] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [15:57:16] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [15:57:17] !log akosiaris@deploy2002 helmfile [codfw] START
helmfile.d/services/page-analytics: apply [15:57:25] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [15:57:49] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:57:50] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:57:52] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [15:57:53] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [15:58:23] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [15:58:47] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [15:59:02] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [15:59:03] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [15:59:30] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [15:59:31] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [15:59:38] (CR) Marostegui: [C: +1] mariadb: toggle notifications for db1249 and db1234 [puppet] - https://gerrit.wikimedia.org/r/982883 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb) [16:00:00] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [16:00:04] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:00:06] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:01:34] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [16:01:35] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [16:01:40] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:01:46]
!log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [16:01:48] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [16:01:52] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [16:01:53] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [16:01:58] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [16:02:28] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [16:02:34] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:02:40] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [16:02:41] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply [16:02:44] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [16:02:45] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [16:03:01] (CR) FNegri: [C: +1] NodeDown: deduplicate alert [alerts] - https://gerrit.wikimedia.org/r/983201 (owner: David Caro) [16:03:06] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply [16:03:07] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [16:03:15] (CR) FNegri: [C: +1] wmcs: use critical severity instead of task [alerts] - https://gerrit.wikimedia.org/r/983209 (owner: David Caro) [16:03:22] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/output/983207/910/" [puppet] - https://gerrit.wikimedia.org/r/983207 (https://phabricator.wikimedia.org/T348392) (owner: Dzahn) [16:03:30] (CR) David Caro: [C: +2] NodeDown: deduplicate alert [alerts] - https://gerrit.wikimedia.org/r/983201 (owner: David Caro)
[16:03:32] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [16:03:37] (CR) David Caro: [C: +2] wmcs: use critical severity instead of task [alerts] - https://gerrit.wikimedia.org/r/983209 (owner: David Caro) [16:03:40] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [16:03:41] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [16:04:08] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:04:18] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:04:19] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [16:04:32] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [16:04:33] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [16:04:43] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [16:04:45] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [16:04:48] (Merged) jenkins-bot: NodeDown: deduplicate alert [alerts] - https://gerrit.wikimedia.org/r/983201 (owner: David Caro) [16:04:55] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [16:04:56] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [16:05:08] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:05:14] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [16:05:19] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:05:21] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [16:05:21] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[16:05:22] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [16:05:35] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [16:05:37] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [16:05:40] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [16:05:40] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [16:05:43] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [16:05:54] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [16:05:54] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [16:05:55] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [16:05:56] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:05:57] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [16:06:07] (Merged) jenkins-bot: wmcs: use critical severity instead of task [alerts] - https://gerrit.wikimedia.org/r/983209 (owner: David Caro) [16:06:10] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [16:06:11] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [16:06:13] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:06:15] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:06:23] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [16:06:30] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [16:06:32] !log akosiaris@deploy2002 helmfile [codfw] DONE
helmfile.d/services/shellbox-syntaxhighlight: apply [16:06:33] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:06:37] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [16:06:38] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [16:06:54] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:06:57] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [16:06:58] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [16:07:01] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: (2) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:07:02] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/similar-users: apply [16:07:14] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/similar-users: apply [16:07:15] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/similar-users: apply [16:07:25] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [16:07:33] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/similar-users: apply [16:07:34] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/similar-users: apply [16:07:55] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/similar-users: apply [16:08:14] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:08:37] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: 
apply [16:08:48] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (3) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:08:55] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [16:09:25] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:10:03] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:11:45] (CR) Bking: [C: +1] "LGTM...but I am kinda curious if other applications are running into this problem. I couldn't find anything else from a quick check of the" [deployment-charts] - https://gerrit.wikimedia.org/r/983208 (https://phabricator.wikimedia.org/T353460) (owner: DCausse) [16:12:34] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (8) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:14:15] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [16:14:37] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [16:14:38] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [16:14:43] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [16:14:56] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [16:14:57] !log akosiaris@deploy2002 helmfile [codfw] START
helmfile.d/services/termbox: apply [16:15:02] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [16:15:03] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [16:15:19] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:15:23] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [16:15:25] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply [16:15:27] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:15:28] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:15:38] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [16:15:56] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [16:15:59] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [16:16:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1134.eqiad.wmnet onto db1234.eqiad.wmnet [16:16:24] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [16:16:27] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [16:16:28] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [16:16:37] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [16:16:38] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [16:16:41] (CR) Arnaudb: [C: +2] mariadb: toggle notifications for db1249 and db1234 [puppet] - https://gerrit.wikimedia.org/r/982883 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb) [16:16:56] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [16:17:01] !log akosiaris@deploy2002 helmfile [codfw] DONE
helmfile.d/services/wikifeeds: apply [16:17:03] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [16:17:10] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [16:17:11] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [16:17:21] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [16:17:23] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [16:17:24] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [16:17:33] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [16:17:34] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [16:17:40] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [16:18:01] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [16:19:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1234 (re)pooling @ 1%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54441 and previous config saved to /var/cache/conftool/dbconfig/20231214-161910-arnaudb.json [16:19:14] (PS1) Ilias Sarantopoulos: ml-services: fix langid image tag [deployment-charts] - https://gerrit.wikimedia.org/r/983214 [16:19:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1249 (re)pooling @ 1%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54442 and previous config saved to /var/cache/conftool/dbconfig/20231214-161915-arnaudb.json [16:19:18] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:19:19] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:19:59] (PuppetFailure) firing: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet -
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:20:47] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:23:26] (PS1) Alexandros Kosiaris: Revert "cirrusSearchCheckerJob: Revert to baremetal" [deployment-charts] - https://gerrit.wikimedia.org/r/982843 (https://phabricator.wikimedia.org/T352906) [16:23:43] (PS2) Alexandros Kosiaris: Revert "cirrusSearchCheckerJob: Revert to baremetal" [deployment-charts] - https://gerrit.wikimedia.org/r/982843 (https://phabricator.wikimedia.org/T352906) [16:24:15] !log updates of all wikikube services done T352906 [16:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:22] T352906: mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 [16:24:46] (CR) David Caro: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/983156 (owner: FNegri) [16:26:55] (CR) FNegri: [C: +2] team-wmcs: improve cloudvirt alerts [alerts] - https://gerrit.wikimedia.org/r/983156 (owner: FNegri) [16:28:11] (Merged) jenkins-bot: team-wmcs: improve cloudvirt alerts [alerts] - https://gerrit.wikimedia.org/r/983156 (owner: FNegri) [16:28:39] SRE, SRE-Access-Requests: Replace Kbrown's old ssh public key with a new one - https://phabricator.wikimedia.org/T353467 (Nahid) [16:29:19] SRE, SRE-Access-Requests: Replace Kbrown's old ssh public key with a new one - https://phabricator.wikimedia.org/T353467 (Kbrown) [16:34:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1234 (re)pooling @ 2%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54443 and previous config saved to /var/cache/conftool/dbconfig/20231214-163416-arnaudb.json [16:34:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1249 (re)pooling @ 2%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54444 and previous config saved to
/var/cache/conftool/dbconfig/20231214-163420-arnaudb.json [16:35:04] (CR) AikoChou: Add a testing stream for page-prediction-change events (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/982873 (https://phabricator.wikimedia.org/T349919) (owner: AikoChou) [16:37:27] (PS7) Brouberol: spark-history: define helmfile configuration and release values [deployment-charts] - https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) [16:38:47] (CR) Alexandros Kosiaris: [C: +2] Revert "cirrusSearchCheckerJob: Revert to baremetal" [deployment-charts] - https://gerrit.wikimedia.org/r/982843 (https://phabricator.wikimedia.org/T352906) (owner: Alexandros Kosiaris) [16:39:34] (Merged) jenkins-bot: Revert "cirrusSearchCheckerJob: Revert to baremetal" [deployment-charts] - https://gerrit.wikimedia.org/r/982843 (https://phabricator.wikimedia.org/T352906) (owner: Alexandros Kosiaris) [16:40:04] (PS1) Hnowlan: changeprop-jobqueue: move PublishStashedFile back to non-k8s jobrunner [deployment-charts] - https://gerrit.wikimedia.org/r/983216 (https://phabricator.wikimedia.org/T349796) [16:42:17] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [16:42:33] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [16:42:34] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:42:56] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:42:57] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:43:23] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:46:40] (CR) Brouberol: "I checked out the spark-history from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/978629 and manage to render the chart" [deployment-charts] -
https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) (owner: Brouberol) [16:49:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1234 (re)pooling @ 4%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54445 and previous config saved to /var/cache/conftool/dbconfig/20231214-164921-arnaudb.json [16:49:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1249 (re)pooling @ 4%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54446 and previous config saved to /var/cache/conftool/dbconfig/20231214-164925-arnaudb.json [17:00:05] jhathaway and rzl: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1700). [17:00:05] MatmaRex: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:49] hi [17:01:20] i have an annoying situation, my puppet patch depends on a mediawiki-config patch [17:01:33] (CR) Btullis: [C: +1] Define the spark-history chart [deployment-charts] - https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol) [17:01:45] and the mediawiki-config patch depends on the puppet one. basically, they need to be deployed at the same time [17:01:53] (CR) Btullis: [C: +1] spark-history: define helmfile configuration and release values [deployment-charts] - https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) (owner: Brouberol) [17:02:59] MatmaRex: oh, because it's an httpbb test, got it [17:03:49] rzl: yeah. i have no idea how those tests work, to be honest.
but i am hoping someone here can make it work, without causing alerts or anything like that :) [17:04:08] (PS4) Bartosz Dziewoński: RunSingleJob.php: Fix use of MWExceptionHandler before it's defined [mediawiki-config] - https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) [17:04:14] okay, in some sense it'd be neat if httpbb could say `assert_status_in: [422, 500]` but that's not an option (and this is a pretty rare edge case so it's not really worth adding it) [17:04:26] honestly my advice is, we could do one of two things [17:04:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1234 (re)pooling @ 8%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54448 and previous config saved to /var/cache/conftool/dbconfig/20231214-170428-arnaudb.json [17:04:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1249 (re)pooling @ 8%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54449 and previous config saved to /var/cache/conftool/dbconfig/20231214-170438-arnaudb.json [17:05:21] option #1 is just deploy the config patch, then the httpbb patch, and if we get an IRC alert in between we'll just ignore it 🤷 it doesn't page anybody, and we would know what the cause is, so as long as we don't leave it that way for long, nbd [17:05:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:05:50] (that's predicated on the fact that it'd only be a few minutes, and there's a pretty good chance the hourly test doesn't run in that interval anyway) [17:06:14] option #2 is we delete the test from httpbb, then deploy the config patch, then deploy an updated test [17:06:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:09] rzl: i see, thanks, i can do either of these.
i guess #1 would be simpler, so if you're okay with that, it seems good to me. i wasn't sure that it wasn't going to page anyone or do some other annoying alerting [17:07:20] yeah I definitely appreciate you being cautious about it [17:07:21] IRC alerts are already full of random stuff :) [17:07:51] my investment here is I wrote httpbb and I'm really, really trying to keep us in a state where we don't just get used to it when it alerts, cause then it's useless :P mixed success but we do what we can [17:08:30] but, in this case, I think that's a reasonable thing to do -- with the fallback that if it takes longer to fix than expected, for whatever reason, we'll modify or temporarily remove the test to resolve the alert [17:09:15] ha, yeah [17:10:15] I won't deploy the config change, but ping me whenever you're ready (whether it's now or during the B&C window later on) and I can deploy the test change [17:12:29] hmm, i can't deploy it myself [17:13:52] and i wasn't going to wait until the next backport window today. 
i guess i'll schedule it some time next week, and find someone who can deploy both patches then [17:14:08] (03PS1) 10FNegri: [toolsdb] Kill queries taking longer than 1 hour [puppet] - 10https://gerrit.wikimedia.org/r/983221 (https://phabricator.wikimedia.org/T353093) [17:14:09] thanks for the advice though rzl [17:18:11] MatmaRex: sure thing, and happy to do the httpbb part any time :) FWIW if you end up going route #2 (say, because it's easier to coordinate) you don't have to delete the whole test, just the `assert_status_in` line [17:18:57] alright, thanks [17:19:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1234 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54450 and previous config saved to /var/cache/conftool/dbconfig/20231214-171934-arnaudb.json [17:19:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1249 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54451 and previous config saved to /var/cache/conftool/dbconfig/20231214-171943-arnaudb.json [17:20:42] (03CR) 10DCausse: [C: 03+2] cirrus-streaming-updater: bump envoy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/983208 (https://phabricator.wikimedia.org/T353460) (owner: 10DCausse) [17:21:22] er, `assert_status` that is [17:21:41] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump envoy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/983208 (https://phabricator.wikimedia.org/T353460) (owner: 10DCausse) [17:22:50] (03CR) 10Brion VIBBER: Remove obsolete lost GPG key for Brion (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981583 (owner: 10Brion VIBBER) [17:23:40] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:24:05] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:25:15] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move 
MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (10akosiaris) [17:27:59] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:28:42] (03CR) 10Jforrester: Remove obsolete lost GPG key for Brion (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981583 (owner: 10Brion VIBBER) [17:31:59] (03PS1) 10BryanDavis: shellbox: Bump to 2023-12-14-055615 [deployment-charts] - 10https://gerrit.wikimedia.org/r/983222 (https://phabricator.wikimedia.org/T351744) [17:32:30] (03PS1) 10Subramanya Sastry: Revert "Temporarily disable isPreview in Parsoid's rendering" [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982845 [17:34:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1234 (re)pooling @ 20%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54452 and previous config saved to /var/cache/conftool/dbconfig/20231214-173439-arnaudb.json [17:34:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1249 (re)pooling @ 20%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54453 and previous config saved to /var/cache/conftool/dbconfig/20231214-173448-arnaudb.json [17:36:09] rzl: i have a puppet request window question (not for today's window). i have a change that requires a private repo file to be created at the same time, but i don't have access to the private repo. so that will need someone else to create/update the file in the private repo for me. does this seem ok for a puppet window? [17:36:24] for reference, this is the change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/982914 [17:37:12] (just trying to sort out a usable process for us fr-tech folks since we will now have a prod vm we are responsible for. 
[17:38:17] Puppet, Instrument-ClientError: Google Translate and other translate services triggering client error alert - https://phabricator.wikimedia.org/T351738 (Jdlrobson) Open→Resolved a:Jdlrobson Thank you for helping me with this @colewhite! I can confirm the drop today! [17:41:26] dwisehaupt: yeah absolutely [17:42:04] as long as it's otherwise suitable (manageable blast radius, +1 from your team, etc) including a private-repo patch is fine [17:42:37] (I mean obviously a literal gerrit +1 on the private repo doesn't work, but as long as it's been reviewed in principle) [17:43:57] ok cool. is there a preferred way to provide that private repo patch? i.e.: is what i included in the comment on that changeset ok? it's a bit hard since i have no access to the private repo, just the labs/private one. [17:44:23] in this case, i [17:44:42] don't need to know what the passwords/hashes are since they will drop on the filesystem and i can get them from there. [17:46:33] yeah there wouldn't be code-review tooling or anything, somebody would literally just ssh in there and open a text editor, so as long as you give them clear enough instructions on what you need [17:46:50] that sounds perfect. [17:46:51] and you'll be around during the window of course, so you can talk it out as needed [17:46:57] yeah, for sure. [17:48:10] thanks. this should work fine for us. 
[17:49:03] 👍 [17:49:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1234 (re)pooling @ 25%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54455 and previous config saved to /var/cache/conftool/dbconfig/20231214-174944-arnaudb.json [17:49:48] jhathaway: lmk if any of the above doesn't sound right to you :) [17:49:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1249 (re)pooling @ 25%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54456 and previous config saved to /var/cache/conftool/dbconfig/20231214-174953-arnaudb.json [17:51:25] rzl: thanks, I think that sounds as best as we can do at present, regarding the private repo [18:00:06] bd808: How many deployers does it take to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1800). [18:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1800) [18:04:32] (03CR) 10Jbond: [C: 03+1] "LGTM, not sure if admin has spec test but if it dose would be nice to add a test for this" [puppet] - 10https://gerrit.wikimedia.org/r/981418 (owner: 10Majavah) [18:04:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1234 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54457 and previous config saved to /var/cache/conftool/dbconfig/20231214-180449-arnaudb.json [18:04:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1249 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54458 and previous config saved to /var/cache/conftool/dbconfig/20231214-180458-arnaudb.json [18:08:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/982854 (https://phabricator.wikimedia.org/T353314) (owner: 10JMeybohm) [18:10:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/983131 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [18:16:45] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [18:16:46] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [18:16:46] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:16:50] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:17:28] here [18:19:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1234 (re)pooling @ 75%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54459 and previous config saved to /var/cache/conftool/dbconfig/20231214-181954-arnaudb.json [18:20:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1249 (re)pooling @ 75%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54460 and previous config saved to /var/cache/conftool/dbconfig/20231214-182003-arnaudb.json [18:21:46] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [18:21:46] (Primary outbound port utilisation over 80% #page) 
resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [18:21:46] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:21:50] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:27:43] 10SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (10Yann) The free license is irrevocable, and all the files were license reviewed. [18:35:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1234 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54461 and previous config saved to /var/cache/conftool/dbconfig/20231214-183459-arnaudb.json [18:35:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1249 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54462 and previous config saved to /var/cache/conftool/dbconfig/20231214-183508-arnaudb.json [18:42:43] 10SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (10PantheraLeo1359531) Alright, thanks! 
:) [18:55:12] (03CR) 10Cathal Mooney: [C: 03+1] reports: network, remove rdb from no IPv6 list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/979399 (https://phabricator.wikimedia.org/T271142) (owner: 10Volans) [19:00:06] brennen and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T1900) [19:00:41] o/ [19:03:34] !log 1.42.0-wmf.9 (T350085) status: no current blockers, although we should keep an eye on T353400. rolling to all wikis. [19:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:55] T350085: 1.42.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T350085 [19:03:56] T353400: TypeError: Argument 3 passed to Wikimedia\Parsoid\Wt2Html\PP\Processors\MarkFosteredContent::moveFosteredAnnotations() must be an instance of Wikimedia\Parsoid\DOM\Element, instance of Wikimedia\Parsoid\DOM\DocumentFragment giv - https://phabricator.wikimedia.org/T353400 [19:04:09] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983254 (https://phabricator.wikimedia.org/T350085) [19:04:11] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983254 (https://phabricator.wikimedia.org/T350085) (owner: 10TrainBranchBot) [19:04:55] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983254 (https://phabricator.wikimedia.org/T350085) (owner: 10TrainBranchBot) [19:11:11] (03PS6) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) [19:12:15] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.9 refs T350085 [19:12:33] T350085: 1.42.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T350085 [19:28:18] 
(PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:35:38] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1024 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:35:58] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: load-dcatap-weekly.service,wdqs-categories.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:59] bd808: thanks for the jouncebot update :-] [19:57:19] (03PS2) 10Ryan Kemper: wdqs: decom wdqs10[09-10] [puppet] - 10https://gerrit.wikimedia.org/r/982933 (https://phabricator.wikimedia.org/T351671) [20:01:11] (03CR) 10Bking: [C: 03+1] wdqs: decom wdqs10[09-10] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982933 (https://phabricator.wikimedia.org/T351671) (owner: 10Ryan Kemper) [20:02:23] !log jmm@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM moscovium.eqiad.wmnet [20:05:04] (03PS3) 10Ryan Kemper: wdqs: decom wdqs10[09-10] [puppet] - 10https://gerrit.wikimedia.org/r/982933 (https://phabricator.wikimedia.org/T351671) [20:05:09] (03CR) 10Ryan Kemper: wdqs: decom wdqs10[09-10] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982933 (https://phabricator.wikimedia.org/T351671) (owner: 10Ryan Kemper) [20:05:47] (03CR) 10Majavah: [C: 03+1] Enable action blocks for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) (owner: 10MilkyDefer) [20:06:17] !log jmm@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM moscovium.eqiad.wmnet [20:06:47] (03PS4) 10Ryan Kemper: wdqs: decom wdqs10[09-10] [puppet] - 
10https://gerrit.wikimedia.org/r/982933 (https://phabricator.wikimedia.org/T351671) [20:07:13] (03PS5) 10Ryan Kemper: wdqs: decom wdqs10[09-10] [puppet] - 10https://gerrit.wikimedia.org/r/982933 (https://phabricator.wikimedia.org/T351671) [20:07:19] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982933 (https://phabricator.wikimedia.org/T351671) (owner: 10Ryan Kemper) [20:13:47] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [20:17:07] hashar: I'm glad it worked. :) [20:17:38] its message has been a concern to me since for ever [20:17:54] and then I remembered I once tried to get rid of them :D [20:19:59] (PuppetFailure) firing: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:21:24] I'm going to deploy updates to Shellbox so that the wikis can have some SyntaxHighlight improvements. I ended up being AFK during my normal deploy slot for this stuff earlier today because #reasons, but I'm here now. 
:) [20:21:35] (03CR) 10BryanDavis: [C: 03+2] shellbox: Bump to 2023-12-14-055615 [deployment-charts] - 10https://gerrit.wikimedia.org/r/983222 (https://phabricator.wikimedia.org/T351744) (owner: 10BryanDavis) [20:22:07] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: decom wdqs10[09-10] [puppet] - 10https://gerrit.wikimedia.org/r/982933 (https://phabricator.wikimedia.org/T351671) (owner: 10Ryan Kemper) [20:22:42] (03Merged) 10jenkins-bot: shellbox: Bump to 2023-12-14-055615 [deployment-charts] - 10https://gerrit.wikimedia.org/r/983222 (https://phabricator.wikimedia.org/T351744) (owner: 10BryanDavis) [20:23:11] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs[1009-1010].eqiad.wmnet [20:29:43] (03PS1) 10Ryan Kemper: wdqs: remove ldf check [puppet] - 10https://gerrit.wikimedia.org/r/983260 (https://phabricator.wikimedia.org/T347355) [20:30:37] (03PS2) 10Ryan Kemper: wdqs: remove ldf check [puppet] - 10https://gerrit.wikimedia.org/r/983260 (https://phabricator.wikimedia.org/T347355) [20:30:49] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983260 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [20:31:19] !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox [20:33:56] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1024 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:36:48] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1024 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:37:23] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [20:37:40] (03CR) 10Bking: [C: 03+1] wdqs: remove ldf check [puppet] - 10https://gerrit.wikimedia.org/r/983260 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [20:37:49] 
(03CR) 10Ryan Kemper: [C: 03+2] wdqs: remove ldf check [puppet] - 10https://gerrit.wikimedia.org/r/983260 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [20:37:52] !log ryankemper@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs[1009-1010].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001" [20:37:56] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [20:38:03] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [20:38:20] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [20:38:27] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [20:38:47] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [20:38:53] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [20:39:19] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [20:39:26] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [20:39:53] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [20:39:57] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs[1009-1010].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001" [20:39:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:39:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs[1009-1010].eqiad.wmnet [20:42:11] (03Abandoned) 10Bking: wdqs: Change LDF monitoring URI [puppet] - 10https://gerrit.wikimedia.org/r/982172 
(https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [20:44:25] (03PS1) 10Majavah: alertmanager: also inhibit criticals below a page [puppet] - 10https://gerrit.wikimedia.org/r/983262 [20:45:09] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [20:45:17] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission wdqs10[09-10].eqiad.wmnet - https://phabricator.wikimedia.org/T353482 (10RKemper) [20:45:26] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission wdqs10[09-10].eqiad.wmnet - https://phabricator.wikimedia.org/T353482 (10RKemper) [20:45:55] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [20:46:02] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [20:46:34] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [20:46:41] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [20:47:06] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [20:47:13] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [20:48:09] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [20:48:16] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [20:48:55] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [20:49:25] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [20:50:10] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [20:50:16] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [20:50:34] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [20:50:41] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [20:51:09] !log 
bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [20:51:15] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [20:51:26] 10SRE, 10ops-eqiad, 10cloud-services-team: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) Server is in warranty Confirmed: Service Request 181697839 was successfully submitted. [20:51:48] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [20:51:55] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [20:51:58] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Aki Nakanishi - https://phabricator.wikimedia.org/T353363 (10ANakanishi_WMF) Hi @jhathaway, thanks for the guidance! I've set up a developer account. [20:52:32] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [20:52:47] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10taavi) [20:52:50] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10ANakanishi_WMF) @Himejijo can you follow up and create a developer account? Thanks! [20:52:57] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10taavi) [20:56:23] 10SRE, 10ops-eqiad: Degraded RAID on kubernetes1060 - https://phabricator.wikimedia.org/T353165 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [21:00:07] brennen and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport and config training deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231214T2100). [21:00:07] Kizule and subbu: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:04] o/ [21:01:52] o/ [21:02:03] o/ [21:02:58] Hi, has backport started? Sorry for being late 2 minutes, but I'm guessing that patch from subbu will go firstly. [21:03:00] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:03:28] Kizule: all you missed was folks checking in. :) [21:04:33] * bd808 wishes that it was a simple thing to give everyone access to bnc services for backscroll [21:04:52] I found it in logs, so it's alright, I'm updated. [21:05:01] 10SRE, 10ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (10Jclark-ctr) a:03Jclark-ctr Opened ticket with Dell Confirmed: Service Request 181698485 was successfully submitted. [21:05:12] Kizule: we're just getting started, and yeah, we'll do the patch first :) [21:05:26] i've been running an ircd for a while that provides backscroll as a builtin, definitely is a nice thing to have. [21:05:54] thcipriani: Sure, my running of namespaceDupes.php on few projects will surely take some time, hopefully we won't run out of the time. [21:07:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ssastry@deploy2002 using scap backport" [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982845 (owner: 10Subramanya Sastry) [21:07:34] Firstly on small Serbian projects (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/dblists/s3.dblist#721) and then on Serbian Wikipedia if everything goes well. 
:) [21:07:59] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:09:18] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (10jhathaway) a:03jhathaway [21:13:10] (03Merged) 10jenkins-bot: Revert "Temporarily disable isPreview in Parsoid's rendering" [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982845 (owner: 10Subramanya Sastry) [21:13:28] !log ssastry@deploy2002 Started scap: Backport for [[gerrit:982845|Revert "Temporarily disable isPreview in Parsoid's rendering"]] [21:14:48] !log ssastry@deploy2002 ssastry: Backport for [[gerrit:982845|Revert "Temporarily disable isPreview in Parsoid's rendering"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:15:14] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10Himejijo) >>! In T353370#9407923, @ANakanishi_WMF wrote: > @Himejijo can you follow up and create a developer account? Thanks! Should al... [21:17:36] I am deploying right now. [21:18:11] (03PS1) 10Cathal Mooney: Refactor server provision script to select params based on profile [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) [21:18:13] The code is on mwdebug and there isn't anything to test because wikitech isn't supported by the mwdebug extension. 
But, i know this is safe since it is a revert of a temporary patch I got backported yday [21:18:22] !log ssastry@deploy2002 ssastry: Continuing with sync [21:18:43] (03CR) 10CI reject: [V: 04-1] Refactor server provision script to select params based on profile [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) (owner: 10Cathal Mooney) [21:24:07] !log ssastry@deploy2002 Finished scap: Backport for [[gerrit:982845|Revert "Temporarily disable isPreview in Parsoid's rendering"]] (duration: 10m 38s) [21:26:11] PROBLEM - Check systemd state on mw2442 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:54] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/output/983146/914/" [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [21:32:36] Is everything alright? Backport of 982845 has finished before like 10 minutes. [21:32:43] or so [21:33:29] Kizule: yep, just doing some explainers/training things [21:33:39] oh okay then [21:34:11] ok, getting set up for namespaceDupes now [21:34:54] (03PS7) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) [21:35:04] You could probably run it with ` | phaste`, so I can see output as well, before running namespaceDupes.php with --fix. [21:35:24] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/dblists/s3.dblist#721 [21:35:35] Basically mwscript namespaceDupes.php srwikibooks | phaste and so on, I think. 
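The check-then-fix flow proposed above (run the script without `--fix`, pipe the output to `phaste` for review, and only then rerun with `--fix`) can be sketched as follows. This is a minimal illustration, not a confirmed invocation: the helper name `namespace_dupes_commands` is hypothetical, and the argument order is an assumption taken from the chat messages. The helper only builds command strings and does not execute anything.

```python
import shlex


def namespace_dupes_commands(wiki: str) -> dict[str, str]:
    """Build (but do not run) the two command lines discussed in the channel.

    Hypothetical helper: the argument order is assumed from the chat,
    not taken from mwscript documentation.
    """
    base = ["mwscript", "namespaceDupes.php", wiki]
    return {
        # Report-only pass; output goes to a paste so others can review it.
        "check": shlex.join(base) + " | phaste",
        # Second pass, to be run only after the paste has been reviewed.
        "fix": shlex.join(base + ["--fix"]),
    }


if __name__ == "__main__":
    cmds = namespace_dupes_commands("srwikibooks")
    print(cmds["check"])
    print(cmds["fix"])
```

Keeping the two passes as separate, explicit commands mirrors the review step requested in the channel: nothing is changed until someone has looked at the paste.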
[21:37:56] (CR) Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/983146/915/" [puppet] - https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) (owner: Effie Mouzeli) [21:38:40] alright, I'll start with srwikibooks and get the paste up shortly [21:39:28] Okay :) [21:39:41] oh, well, srwikibooks had 0 pages to fix [21:39:49] lemme keep working through the other sr projects [21:40:01] Okay, for 0 pages to fix there is no need for pastes. :) [21:41:22] srwikinews is here: https://phabricator.wikimedia.org/P54464 [21:42:36] thcipriani: srwikinews is good to go with --fix [21:42:59] doing [21:44:17] looks good: https://phabricator.wikimedia.org/P54465 moving to srwikiquote [21:44:35] To me as well. [21:45:02] 0 there; srwikisource next [21:47:26] Serbian Wikipedia is going to be pretty big then. :D [21:50:57] srwikisource looks scary for the same reason that got us last time: https://phabricator.wikimedia.org/P54466 [21:51:06] images that are used on a ton of pages [21:51:24] namespaceDupes.php is now honoring replication, so it looks good to go to me. [21:51:54] I'm tracking Grafana to avoid the scary scenario from last time. [21:53:09] well, I was watching Grafana last time and replag didn't start until the primary choked its way through the full batch, so it was 45 minutes before replag even began on the secondaries. I'm checking in with DBAs about the safety here before I go ahead [21:53:56] if I don't hear back, I may wait on this until I get the affirmative since it caused so much trouble last time through (tl;dr: I'm nervous) [21:54:49] We have a limit in place now; it's not going to send a batch of 1000-something pages. [21:55:29] If Amir1 is around, I think he can say whether this is right to do or not. [21:56:50] It's Serbian Wikipedia that is questionable, not srwikisource with 171 pages. 
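The mitigation being debated here, a cap on batch size plus waiting for replication between batches, follows a common pattern that can be sketched generically. This is not the implementation in namespaceDupes.php; every name below (`process_in_batches`, `apply_fix`, `get_lag_seconds`, the default sizes and thresholds) is hypothetical and chosen for illustration only.

```python
import time
from typing import Callable, Iterator, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(
    items: Sequence[str],
    apply_fix: Callable[[str], None],
    get_lag_seconds: Callable[[], float],
    batch_size: int = 100,
    max_lag: float = 1.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply `apply_fix` to every item, pausing after each batch until the
    reported replication lag is at or below `max_lag` seconds.

    All parameters are illustrative assumptions, not real MediaWiki settings.
    """
    done = 0
    for batch in batches(items, batch_size):
        for item in batch:
            apply_fix(item)
            done += 1
        # Let replicas catch up before sending more writes to the primary.
        while get_lag_seconds() > max_lag:
            sleep(poll_interval)
    return done
```

Small batches matter because, as noted above, lag on the secondaries may only become visible after the primary has finished a large batch; checking between small batches gives earlier feedback than one check at the end.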
[22:01:04] yeah, but tons of pages use this image (which evidently doesn't exist): https://sr.wikisource.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%A8%D1%82%D0%B0_%D0%B2%D0%BE%D0%B4%D0%B8_%D0%BE%D0%B2%D0%B0%D0%BC%D0%BE/%D0%94%D0%B0%D1%82%D0%BE%D1%82%D0%B5%D0%BA%D0%B0:Murat_Sipan_vinjeta.jpg [22:02:04] yeah, let's wait on this and try it at a different time, thanks for understanding Kizule I appreciate it. [22:02:36] No, no. Let's check srwiktionary at least. [22:02:50] I'm going to remove that non-existing image, okay. THanks for letting me know. [22:04:07] sure, I can run the check on the remaining ones if that's helpful [22:06:08] Sure, check srwiktionary and then wait for green light to do it for Serbian Wikipedia. [22:06:50] https://phabricator.wikimedia.org/P54467 [22:07:25] those are also in the category of: makes me nervous. Images used on tons of pages. [22:08:30] Why would 8 pages make you nervous_ [22:08:32] ? [22:10:20] well cause they're not pages, they're files, and they're files that are each used on 100s of pages: https://sr.wiktionary.org/wiki/%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%A8%D1%82%D0%B0_%D0%B2%D0%BE%D0%B4%D0%B8_%D0%BE%D0%B2%D0%B0%D0%BC%D0%BE/%D0%94%D0%B0%D1%82%D0%BE%D1%82%D0%B5%D0%BA%D0%B0:Commons-logo.svg [22:10:41] You can run it with --fix, as it's 8 links. 8 links can't cause a downtime at all. [22:10:55] I'm not doing that today, sorry [22:14:04] As I'm really tired of this disorganization (which isn't first time to happen), okay, I'll just leave it as is. [22:14:55] I understand it's frustrating and I'm sorry about that [22:15:23] I would totally understand if there is a lot of pages. [22:15:33] Links whatever. [22:15:34] But there is not a lot of pages in Serbian Wiktionary. [22:16:16] Like it is 171 on Serbian Wikisource, and you weren't comfortable because file don't exist. Okay, you are right about that and it's better to remove non-existing image from pages. 
[22:17:56] I'll take care about the Serbian Wikisource until next week. And we can give this a try next week. [22:18:27] it's the end of several long weeks for us not long before a major holiday, deployment during backport windows is entirely at the discretion of the deployer, and the last time this came up was an extremely unpleasant experience. patience is appreciated. [22:19:17] thanks Kizule that sounds good, that'll give time for someone who understands the mitigations we put in place to comment about the safety of the operation. [22:19:18] I don't think that we really need to do this in 2024. [22:20:12] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:20:13] thcipriani: Sounds good, see you next week. [22:20:18] o/ [22:44:46] (PS1) BCornwall: site.pp: Add acmechief1002 [puppet] - https://gerrit.wikimedia.org/r/983276 (https://phabricator.wikimedia.org/T352242) [22:46:40] (CR) BCornwall: [C: +2] site.pp: Add acmechief1002 [puppet] - https://gerrit.wikimedia.org/r/983276 (https://phabricator.wikimedia.org/T352242) (owner: BCornwall) [22:48:16] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host acmechief1002.eqiad.wmnet with OS bookworm [22:54:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:54:52] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:57:32] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:57:33] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief1002.eqiad.wmnet with reason: host reimage [22:57:42] RECOVERY - mailman list info on lists1001 is OK:
HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.473 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:02:33] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief1002.eqiad.wmnet with reason: host reimage [23:17:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host acmechief1002.eqiad.wmnet with OS bookworm [23:20:15] PROBLEM - Ensure that passive node gets the certificates from the active node as expected on acmechief1002 is CRITICAL: FILE_AGE CRITICAL: File not found - /var/lib/acme-chief/certs/.rsync.status https://wikitech.wikimedia.org/wiki/Acme-chief [23:28:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:28:28] (PS1) JHathaway: lists: rename repo templates to be compatible with Windows [puppet] - https://gerrit.wikimedia.org/r/983278 (https://phabricator.wikimedia.org/T282308) [23:29:08] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/983278 (https://phabricator.wikimedia.org/T282308) (owner: JHathaway) [23:38:45] (Device rebooted) firing: Alert for device ps1-c4-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:42:14] (PS1) JHathaway: pki: rename intermediates to prevent aux.pem cloning on Windows [puppet] - https://gerrit.wikimedia.org/r/983279 (https://phabricator.wikimedia.org/T282308) [23:43:13] SRE, Wikimedia-Mailing-lists: Ensure windows files are not commited to operations/puppet - https://phabricator.wikimedia.org/T353487 (jhathaway) [23:43:24] SRE, Wikimedia-Mailing-lists: Ensure windows files are not commited to operations/puppet - https://phabricator.wikimedia.org/T353487 (jhathaway) p:Triage→Low [23:43:45] (Device rebooted) resolved:
Device ps1-c4-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:44:39] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/983279 (https://phabricator.wikimedia.org/T282308) (owner: JHathaway) [23:45:18] SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T353295 (BCornwall) Open→Resolved a:BCornwall [23:45:24] (CR) CI reject: [V: -1] pki: rename intermediates to prevent aux.pem cloning on Windows [puppet] - https://gerrit.wikimedia.org/r/983279 (https://phabricator.wikimedia.org/T282308) (owner: JHathaway) [23:47:50] (PS2) JHathaway: pki: rename intermediates to prevent aux.pem cloning on Windows [puppet] - https://gerrit.wikimedia.org/r/983279 (https://phabricator.wikimedia.org/T282308) [23:47:55] SRE, Wikimedia-Mailing-lists: Ensure filenames invalid in windows are not commited to operations/puppet - https://phabricator.wikimedia.org/T353487 (Novem_Linguae) [23:48:01] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/983279 (https://phabricator.wikimedia.org/T282308) (owner: JHathaway) [23:50:38] (PS1) Cwhite: Configure and enable StatsLib for production [mediawiki-config] - https://gerrit.wikimedia.org/r/983229 (https://phabricator.wikimedia.org/T343024) [23:51:03] (CR) Dwisehaupt: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/916/con" [puppet] - https://gerrit.wikimedia.org/r/982914 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)