[00:08:01] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:34] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [00:19:59] (PuppetFailure) firing: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [00:34:45] (Device rebooted) firing: Alert for device ps1-c7-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [00:38:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/983230 [00:38:51] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/983230 (owner: 10TrainBranchBot) [00:39:45] (Device rebooted) resolved: Device ps1-c7-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [00:59:52] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/983230 (owner: 10TrainBranchBot) [01:04:40] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [01:06:17] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:59] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:45:46] 10SRE, 10ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (10Jclark-ctr) @MoritzMuehlenhoff. Can you attach complete smartctl output no errors on tsr report so dell can tell what drive is reporting error [02:37:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:57:48] military reduced operatians [03:08:47] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:28:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:44:52] (Device rebooted) firing: Alert for device ps1-d3-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [03:49:52] (Device rebooted) resolved: Device ps1-d3-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [04:17:34] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [04:20:00] (PuppetFailure) firing: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:04:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [05:07:59] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:08:03] (03PS1) 10Marostegui: installserver: Do not reimage db1239 [puppet] - 10https://gerrit.wikimedia.org/r/983307 [06:08:51] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:10:59] (03CR) 10Marostegui: [C: 03+2] installserver: Do not reimage db1239 [puppet] - 10https://gerrit.wikimedia.org/r/983307 (owner: 10Marostegui) [06:36:04] (03PS1) 10Marostegui: db1153: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/983308 (https://phabricator.wikimedia.org/T353499) [07:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231215T0700) [07:28:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:34:13] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:35:35] RECOVERY - SSH on wdqs1023 is OK: SSH OK 
- OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:51:24] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: fix langid image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/983214 (owner: 10Ilias Sarantopoulos) [07:55:07] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231215T0800) [08:02:34] (03PS1) 10Muehlenhoff: Create /var/lib/atlas [puppet] - 10https://gerrit.wikimedia.org/r/983316 (https://phabricator.wikimedia.org/T353419) [08:16:19] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:17:35] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [08:17:47] (03CR) 10Elukey: [C: 03+1] ml-services: fix langid image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/983214 (owner: 10Ilias Sarantopoulos) [08:20:14] (PuppetFailure) firing: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:21:06] (03CR) 10Ayounsi: [C: 03+1] "+1 if pcc is happy" [puppet] - 10https://gerrit.wikimedia.org/r/983316 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [08:23:26] (03CR) 10Brouberol: [C: 03+2] spark-history: define helmfile configuration and release values [deployment-charts] - 
10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) (owner: 10Brouberol) [08:23:55] (03CR) 10Brouberol: [C: 03+2] Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [08:24:32] (03CR) 10Ayounsi: "This change seems to have broken Puppet on at least netflow1002" [puppet] - 10https://gerrit.wikimedia.org/r/982846 (https://phabricator.wikimedia.org/T352838) (owner: 10Btullis) [08:26:29] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:27:09] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:03] (03CR) 10Muehlenhoff: [C: 03+2] Create /var/lib/atlas [puppet] - 10https://gerrit.wikimedia.org/r/983316 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [08:31:17] (03PS1) 10Elukey: pyrra::filesystem::config: propagate ensure to the file resource [puppet] - 10https://gerrit.wikimedia.org/r/983352 [08:32:58] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/918/con" [puppet] - 10https://gerrit.wikimedia.org/r/983352 (owner: 10Elukey) [08:33:35] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: fix langid image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/983214 (owner: 10Ilias Sarantopoulos) [08:33:43] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:29] (03Merged) 10jenkins-bot: ml-services: fix langid image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/983214 (owner: 10Ilias Sarantopoulos) 
[08:34:41] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:12] (03CR) 10Elukey: "I am not 100% sure if ensure => file is needed, from the puppet docs "present" should be fine as well, but lemme know otherwise." [puppet] - 10https://gerrit.wikimedia.org/r/983352 (owner: 10Elukey) [08:36:45] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:38:24] (03CR) 10Muehlenhoff: pyrra::filesystem::config: propagate ensure to the file resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983352 (owner: 10Elukey) [08:39:12] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [08:41:35] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:50:09] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:45] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:51:47] (03PS1) 10Muehlenhoff: On cumin1001 print a MOTD to use a different host [puppet] - 10https://gerrit.wikimedia.org/r/983355 (https://phabricator.wikimedia.org/T353419) [08:56:24] !log shutdown already down IPv6 BGP session from ulsfo to the office [08:56:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:00] (PuppetFailure) resolved: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:00:05] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "Good idea!" [puppet] - 10https://gerrit.wikimedia.org/r/983262 (owner: 10Majavah) [09:00:53] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 137, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:01:46] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:17] 10ops-eqiad: ps1-e8-eqiad down - https://phabricator.wikimedia.org/T353503 (10ayounsi) p:05Triage→03High [09:03:40] ACKNOWLEDGEMENT - ps1-e8-eqiad-infeed-load-tower-B-phase-Z on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T353503 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:03:40] ACKNOWLEDGEMENT - ps1-e8-eqiad-infeed-load-tower-B-phase-Y on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T353503 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:03:40] ACKNOWLEDGEMENT - ps1-e8-eqiad-infeed-load-tower-B-phase-X on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T353503 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:03:40] ACKNOWLEDGEMENT - ps1-e8-eqiad-infeed-load-tower-A-phase-Z on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T353503 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:03:40] ACKNOWLEDGEMENT - ps1-e8-eqiad-infeed-load-tower-A-phase-Y on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T353503 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:03:41] ACKNOWLEDGEMENT - ps1-e8-eqiad-infeed-load-tower-A-phase-X on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T353503 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:03:41] ACKNOWLEDGEMENT - Host ps1-e8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T353503 [09:04:17] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [09:04:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:05:11] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:19] (03PS2) 10Cathal Mooney: Refactor server provision script to select params based on profile [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) [09:05:49] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:05:49] !log installing Linux 6.1.67 packages on Bookworm hosts [09:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:12] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:08:15] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:09:05] (03CR) 10Filippo Giunchedi: [C: 03+1] pyrra::filesystem::config: propagate ensure to the file resource [puppet] - 10https://gerrit.wikimedia.org/r/983352 (owner: 10Elukey) [09:12:48] (03CR) 10Elukey: [C: 03+2] pyrra::filesystem::config: propagate ensure to the file resource [puppet] - 10https://gerrit.wikimedia.org/r/983352 (owner: 10Elukey) [09:14:53] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:15:02] 10Puppet, 10SRE: Ensure filenames invalid in windows are not commited to operations/puppet - https://phabricator.wikimedia.org/T353487 (10Peachey88) [09:17:16] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10MatthewVernon) The plural of anecdote is not data, but: I had one system (ms-fe2013) that did this when rebooted by the reimage cookbook; I did a... 
[09:19:23] (03Abandoned) 10Slyngshede: Blackbox alerting for urldownloaders [alerts] - 10https://gerrit.wikimedia.org/r/981289 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:21:35] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:09] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:23:12] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:23:16] (03CR) 10Muehlenhoff: [C: 03+1] Add a spark system user/group for the spark-history service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982846 (https://phabricator.wikimedia.org/T352838) (owner: 10Btullis) [09:26:21] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983355 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [09:33:03] (03PS2) 10Brouberol: kube_private: cleanup obsolete spark-history private data directories [puppet] - 10https://gerrit.wikimedia.org/r/983357 (https://phabricator.wikimedia.org/T351816) [09:33:07] (03CR) 10Brouberol: "/!\ We should only merge this after having created the new spark/spark-history{-test}.svc.eqiad.wmmnet@WIKIMEDIA keytabs and principals, a" [puppet] - 10https://gerrit.wikimedia.org/r/983357 (https://phabricator.wikimedia.org/T351816) (owner: 10Brouberol) [09:39:49] RECOVERY - SSH on 
wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:39:49] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:00] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10ayounsi) 05Resolved→03Open Could someone update https://wikitech.wikimedia.org/wiki/SRE/Inf... [09:47:33] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:43] (03PS1) 10JMeybohm: kubernetes::master: Switch blackbox check to paging [puppet] - 10https://gerrit.wikimedia.org/r/983359 (https://phabricator.wikimedia.org/T353233) [09:49:39] (03PS1) 10JMeybohm: kubernetes::master: Drop absent resource [puppet] - 10https://gerrit.wikimedia.org/r/983360 [09:50:08] (03CR) 10Hashar: "Once you got the source code merged ( https://gerrit.wikimedia.org/r/c/operations/software/debmonitor-client/+/981463/ ) and have pushed a" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [09:51:37] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/920/con" [puppet] - 10https://gerrit.wikimedia.org/r/983360 (owner: 10JMeybohm) [09:54:18] (03CR) 10Btullis: [C: 03+1] kube_private: cleanup obsolete spark-history private data directories [puppet] - 10https://gerrit.wikimedia.org/r/983357 (https://phabricator.wikimedia.org/T351816) (owner: 10Brouberol) [09:55:02] (03PS3) 10AikoChou: 
Add a testing stream for page-prediction-change events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982873 (https://phabricator.wikimedia.org/T349919) [10:00:12] (03CR) 10AikoChou: [C: 03+2] "Thanks for the review :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982873 (https://phabricator.wikimedia.org/T349919) (owner: 10AikoChou) [10:01:23] (03Merged) 10jenkins-bot: Add a testing stream for page-prediction-change events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982873 (https://phabricator.wikimedia.org/T349919) (owner: 10AikoChou) [10:03:59] 10SRE, 10Data-Engineering, 10Data-Platform-SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10Gehel) [10:09:52] (03PS3) 10Cathal Mooney: Refactor server provision script to select params based on profile [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) [10:10:23] (03PS1) 10Ayounsi: Remove ID for gnmic user [puppet] - 10https://gerrit.wikimedia.org/r/983362 (https://phabricator.wikimedia.org/T352838) [10:12:27] (03PS2) 10JMeybohm: pki::multirootca: Merge custom profiles on top of default_profiles [puppet] - 10https://gerrit.wikimedia.org/r/982854 (https://phabricator.wikimedia.org/T353314) [10:12:29] (03PS1) 10JMeybohm: pki::multirootca: Override the server profiles expiry for k8s staging [puppet] - 10https://gerrit.wikimedia.org/r/983363 (https://phabricator.wikimedia.org/T353314) [10:12:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/983362 (https://phabricator.wikimedia.org/T352838) (owner: 10Ayounsi) [10:13:30] (03CR) 10Ayounsi: [C: 03+2] Remove ID for gnmic user [puppet] - 10https://gerrit.wikimedia.org/r/983362 (https://phabricator.wikimedia.org/T352838) (owner: 10Ayounsi) [10:14:19] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:42] (03PS2) 10Klausman: admin_ng: Increase resources for control plane in ML-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 [10:17:57] (03CR) 10David Caro: [C: 03+1] "LGTM, we might want to increase the timeout if many queries are getting killed, we can try getting some info from quarry/superset on query" [puppet] - 10https://gerrit.wikimedia.org/r/983221 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [10:20:24] (03CR) 10FNegri: [C: 03+2] [toolsdb] Kill queries taking longer than 1 hour [puppet] - 10https://gerrit.wikimedia.org/r/983221 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [10:25:55] (03CR) 10Elukey: [C: 04-1] "There are some issues:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 (owner: 10Klausman) [10:28:56] (03PS3) 10Klausman: admin_ng: Increase resources for control plane in ML-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 [10:28:58] (03PS1) 10Jelto: sre.gitlab.upgrade: add error handling for broadcast message creation [cookbooks] - 10https://gerrit.wikimedia.org/r/983364 (https://phabricator.wikimedia.org/T353375) [10:31:53] (03CR) 10Marostegui: [C: 03+2] db1153: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/983308 (https://phabricator.wikimedia.org/T353499) (owner: 10Marostegui) [10:35:10] (03PS8) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) [10:36:49] (03CR) 10FNegri: [C: 03+1] "Looking forward to using this feature, as I have a YubiKey but found the GPG workaround a bit cumbersome." 
[puppet] - 10https://gerrit.wikimedia.org/r/981418 (owner: 10Majavah) [10:38:03] (03PS1) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983365 (https://phabricator.wikimedia.org/T349619) [10:39:49] (03CR) 10Elukey: [C: 04-1] admin_ng: Increase resources for control plane in ML-* (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 (owner: 10Klausman) [10:40:51] (03PS9) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) [10:41:34] (03PS2) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS (mc-gp1001) [puppet] - 10https://gerrit.wikimedia.org/r/983365 (https://phabricator.wikimedia.org/T349619) [10:42:35] (03PS4) 10Klausman: admin_ng: Increase resources for control plane in ML-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 [10:43:03] (03PS5) 10Klausman: admin_ng: Increase resources for control plane in ML-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 [10:47:05] (03CR) 10Marostegui: "This is fine, but it still requires manual grants to be deployed as this file is only used for tracking." [puppet] - 10https://gerrit.wikimedia.org/r/983169 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [10:47:18] (03CR) 10Effie Mouzeli: "PCC NOOP https://puppet-compiler.wmflabs.org/output/983146/922/" [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [10:49:21] (03CR) 10Elukey: [C: 04-1] "Please check the CI's output.." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 (owner: 10Klausman) [10:49:49] (03PS6) 10Klausman: admin_ng: Increase resources for control plane in ML-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 [10:50:11] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [10:50:20] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/983365/924/" [puppet] - 10https://gerrit.wikimedia.org/r/983365 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [10:50:22] 10SRE-tools, 10Infrastructure-Foundations: Decommission cookbook: lock per switch - https://phabricator.wikimedia.org/T353513 (10ayounsi) [10:51:34] (03CR) 10Klausman: admin_ng: Increase resources for control plane in ML-* (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 (owner: 10Klausman) [10:51:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/983365 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [10:52:30] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [10:53:51] (03PS3) 10Muehlenhoff: Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [10:54:15] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [10:54:21] (03CR) 10CI reject: [V: 04-1] Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [10:56:58] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[10:58:36] (03PS1) 10FNegri: [toolsdb] Fix slow query logging condition [puppet] - 10https://gerrit.wikimedia.org/r/983368 (https://phabricator.wikimedia.org/T353093) [11:00:25] (03CR) 10David Caro: [C: 03+1] "I feel so identified xd" [puppet] - 10https://gerrit.wikimedia.org/r/983368 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [11:01:42] (03CR) 10FNegri: [C: 03+2] [toolsdb] Fix slow query logging condition [puppet] - 10https://gerrit.wikimedia.org/r/983368 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [11:03:00] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:06:55] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:07:15] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:08:19] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) I got some graphs here with the increases of the error count: https://grafana-rw.wikimedia.org/d/P1tFnn3Mk... [11:08:44] (03PS1) 10Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. 
[puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) [11:08:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:09:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:09:26] (03CR) 10CI reject: [V: 04-1] C:puppetmaster::monitoring Prometheus stats for Puppetmerge. [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:10:09] (03PS1) 10Hashar: wm-pcc: only act on Puppet repositories [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/983377 (https://phabricator.wikimedia.org/T353181) [11:11:21] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) That might be all of them since nov 28th: ` dcaro@urcuchillay$ head -n 1 Hard\ drives\ health\ -\ increase... [11:11:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.451 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:11:51] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:12:48] (03PS2) 10Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. 
[puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) [11:15:56] (03CR) 10Majavah: [C: 03+2] alertmanager: also inhibit criticals below a page [puppet] - 10https://gerrit.wikimedia.org/r/983262 (owner: 10Majavah) [11:17:13] (03CR) 10EoghanGaffney: [C: 03+1] sre.gitlab.upgrade: add error handling for broadcast message creation [cookbooks] - 10https://gerrit.wikimedia.org/r/983364 (https://phabricator.wikimedia.org/T353375) (owner: 10Jelto) [11:17:29] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:18:33] (03PS3) 10EoghanGaffney: [apt-staging] Deploy gitlab-package-puller script [puppet] - 10https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004) [11:22:17] (03CR) 10EoghanGaffney: [apt-staging] Deploy gitlab-package-puller script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney) [11:27:08] (03CR) 10Hashar: [C: 03+2] wm-pcc: only act on Puppet repositories [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/983377 (https://phabricator.wikimedia.org/T353181) (owner: 10Hashar) [11:27:13] (03CR) 10JMeybohm: "just FYI" [puppet] - 10https://gerrit.wikimedia.org/r/983359 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [11:27:19] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/925/con" [puppet] - 10https://gerrit.wikimedia.org/r/983363 (https://phabricator.wikimedia.org/T353314) (owner: 10JMeybohm) [11:27:43] (03Merged) 10jenkins-bot: wm-pcc: only act on Puppet repositories [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/983377 (https://phabricator.wikimedia.org/T353181) (owner: 
10Hashar) [11:28:16] !log hashar@deploy2002 Started deploy [gerrit/gerrit@304c63a]: wm-pcc: only act on Puppet repositories - T353181 [11:28:21] T353181: Gerrit UI shows PCC data on non-puppet.git patches that had experimental builds - https://phabricator.wikimedia.org/T353181 [11:28:24] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@304c63a]: wm-pcc: only act on Puppet repositories - T353181 (duration: 00m 08s) [11:28:25] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:31:42] ^ that deploy was an update to a JavaScript file served by Gerrit, so it only affects the frontend web UI [11:31:50] no Java daemon has been hurt/restarted in the process [11:31:54] hence why I went ahead [11:33:25] (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:34:30] (03CR) 10JMeybohm: [C: 03+1] "The CI diff looks awful, but I think it will DTRT" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [11:36:52] (PuppetFailure) resolved: Puppet has failed on netflow2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:42:36] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10MatthewVernon) 05Open→03Resolved Done, thanks for the reminder. 
[11:54:13] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/926/con" [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:06:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [12:07:21] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) Got all the increments since we have data in prometheus 29th of Nov (data dump from grafana + a bit of pan... [12:07:27] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/927/con" [puppet] - 10https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney) [12:10:30] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney) [12:18:25] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [12:19:55] (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Add script to pull artifacts from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/979912 (owner: 10EoghanGaffney) [12:20:17] (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Add script to pull artifacts from gitlab (032 comments) 
[puppet] - 10https://gerrit.wikimedia.org/r/979912 (owner: 10EoghanGaffney) [12:20:22] (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Deploy gitlab-package-puller script [puppet] - 10https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney) [12:25:13] !log hashar@deploy2002 Started deploy [integration/docroot@7f6c112]: doc: add integration/tox-jenkins-override - T353515 [12:25:19] T353515: Publish documentation for integration/tox-jenkins-override - https://phabricator.wikimedia.org/T353515 [12:25:19] !log hashar@deploy2002 Finished deploy [integration/docroot@7f6c112]: doc: add integration/tox-jenkins-override - T353515 (duration: 00m 06s) [12:32:20] (03CR) 10Brouberol: [C: 03+2] kube_private: cleanup obsolete spark-history private data directories [puppet] - 10https://gerrit.wikimedia.org/r/983357 (https://phabricator.wikimedia.org/T351816) (owner: 10Brouberol) [12:38:17] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/928/con" [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:39:24] brouberol: remember: Access and Excel! 
[12:42:23] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/929/con" [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:42:26] (03PS1) 10Brouberol: kube_private: fix path to obsolete spark-history private data directories [puppet] - 10https://gerrit.wikimedia.org/r/983391 (https://phabricator.wikimedia.org/T351816) [12:44:01] (03PS3) 10Ayounsi: k8s topology labels: add row to rack transition [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) [12:46:53] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/930/con" [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:49:04] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [12:49:19] (03PS1) 10EoghanGaffney: [apt-staging] Swap content for source in file directive [puppet] - 10https://gerrit.wikimedia.org/r/983395 [12:52:58] (03CR) 10Jelto: [C: 03+1] [apt-staging] Swap content for source in file directive [puppet] - 10https://gerrit.wikimedia.org/r/983395 (owner: 10EoghanGaffney) [12:53:28] (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Swap content for source in file directive [puppet] - 10https://gerrit.wikimedia.org/r/983395 (owner: 10EoghanGaffney) [12:55:16] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [12:55:45] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [12:56:12] (03PS4) 10Cathal Mooney: Refactor server provision script to select params based on profile [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) [13:00:27] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/931/con" [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:01:50] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/932/con" [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:06:12] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (10cmooney) @ayounsi, @Volans I have uploaded the above patch to add the functionality as descr... 
[13:06:52] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:09:19] (03PS2) 10Hnowlan: changeprop-jobqueue: move PublishStashedFile back to non-k8s jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/983216 (https://phabricator.wikimedia.org/T349796) [13:11:23] (03CR) 10Btullis: [C: 03+1] kube_private: fix path to obsolete spark-history private data directories [puppet] - 10https://gerrit.wikimedia.org/r/983391 (https://phabricator.wikimedia.org/T351816) (owner: 10Brouberol) [13:13:09] (03CR) 10Brouberol: [C: 03+2] kube_private: fix path to obsolete spark-history private data directories [puppet] - 10https://gerrit.wikimedia.org/r/983391 (https://phabricator.wikimedia.org/T351816) (owner: 10Brouberol) [13:17:14] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (10cmooney) p:05Triage→03Medium [13:20:10] (03CR) 10Elukey: [C: 03+1] admin_ng: Increase resources for control plane in ML-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 (owner: 10Klausman) [13:22:32] (03CR) 10Cathal Mooney: "LGTM, but I will let service-ops approve in case they'd rather it work differently." [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [13:23:28] (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.5 [software/homer] - 10https://gerrit.wikimedia.org/r/983398 [13:24:15] (03PS1) 10Brouberol: deployment_server: revert temporary cleanup patch now that cleanup is done [puppet] - 10https://gerrit.wikimedia.org/r/983400 [13:25:56] (03CR) 10Cathal Mooney: [C: 03+1] "Makes sense yep, LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/982428 (owner: 10Muehlenhoff) [13:27:11] (03CR) 10Brouberol: [C: 03+2] deployment_server: revert temporary cleanup patch now that cleanup is done [puppet] - 10https://gerrit.wikimedia.org/r/983400 (owner: 10Brouberol) [13:28:22] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: add error handling for broadcast message creation [cookbooks] - 10https://gerrit.wikimedia.org/r/983364 (https://phabricator.wikimedia.org/T353375) (owner: 10Jelto) [13:31:22] (03PS3) 10Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) [13:33:03] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: add error handling for broadcast message creation [cookbooks] - 10https://gerrit.wikimedia.org/r/983364 (https://phabricator.wikimedia.org/T353375) (owner: 10Jelto) [13:36:25] (03PS1) 10Elukey: recommendation-api: update statsd configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) [13:36:27] (03PS1) 10Elukey: services: deploy the new rec-api-ng Docker image in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/983404 (https://phabricator.wikimedia.org/T349118) [13:37:17] (03CR) 10Cathal Mooney: [C: 03+1] CHANGELOG: add changelogs for release v0.6.5 [software/homer] - 10https://gerrit.wikimedia.org/r/983398 (owner: 10Ayounsi) [13:38:03] (03PS2) 10Elukey: recommendation-api: update statsd configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) [13:38:05] (03PS2) 10Elukey: services: deploy the new rec-api-ng Docker image in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/983404 (https://phabricator.wikimedia.org/T349118) [13:39:49] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Test upgrade GitLab Replica with insufficient API key 
[13:39:49] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: Test upgrade GitLab Replica with insufficient API key [13:40:16] (03CR) 10Elukey: "The alternative is to go prometheus-only directly, let me know what you prefer." [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) (owner: 10Elukey) [13:46:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 10%: candidate master repooling', diff saved to https://phabricator.wikimedia.org/P54473 and previous config saved to /var/cache/conftool/dbconfig/20231215-134603-arnaudb.json [13:52:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'depool db2179 to repool w/ api', diff saved to https://phabricator.wikimedia.org/P54474 and previous config saved to /var/cache/conftool/dbconfig/20231215-135228-arnaudb.json [13:52:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 25%: candidate master proper repooling', diff saved to https://phabricator.wikimedia.org/P54475 and previous config saved to /var/cache/conftool/dbconfig/20231215-135257-arnaudb.json [13:53:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [13:54:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [13:54:38] (03PS4) 10Muehlenhoff: Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [13:55:09] (03CR) 10CI reject: [V: 04-1] Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [13:59:29] (03PS5) 10Muehlenhoff: Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [14:00:01] (03CR) 10CI 
reject: [V: 04-1] Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [14:01:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 20%: candidate master repooling', diff saved to https://phabricator.wikimedia.org/P54476 and previous config saved to /var/cache/conftool/dbconfig/20231215-140108-arnaudb.json [14:02:40] (03CR) 10Klausman: [C: 03+2] admin_ng: Increase resources for control plane in ML-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 (owner: 10Klausman) [14:05:17] (03Merged) 10jenkins-bot: admin_ng: Increase resources for control plane in ML-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/983217 (owner: 10Klausman) [14:06:07] (03PS6) 10Muehlenhoff: Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [14:07:29] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:07:53] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:08:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 50%: candidate master proper repooling', diff saved to https://phabricator.wikimedia.org/P54477 and previous config saved to /var/cache/conftool/dbconfig/20231215-140802-arnaudb.json [14:09:09] (03PS1) 10Bking: wdqs: Set default Accept: header [puppet] - 10https://gerrit.wikimedia.org/r/983415 (https://phabricator.wikimedia.org/T347355) [14:11:40] (03PS7) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [14:14:33] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [14:16:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 40%: candidate master repooling', diff saved to https://phabricator.wikimedia.org/P54478 and previous config saved to /var/cache/conftool/dbconfig/20231215-141613-arnaudb.json [14:16:49] (03PS11) 10Slyngshede: Move Debmonitor client code to separate repository. 
[software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 [14:18:27] (03CR) 10JMeybohm: "After I98cd27c608722a96d5814d6150f1136c66130651 the networkpolicy config will no longer be needed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/978129 (https://phabricator.wikimedia.org/T264625) (owner: 10CDanis) [14:19:28] (03CR) 10CDanis: [aux-k8s-eqiad] add kube-state-metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978129 (https://phabricator.wikimedia.org/T264625) (owner: 10CDanis) [14:22:08] (03PS1) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/983419 [14:22:32] (03PS1) 10Muehlenhoff: librenms: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/983420 [14:23:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: candidate master proper repooling', diff saved to https://phabricator.wikimedia.org/P54479 and previous config saved to /var/cache/conftool/dbconfig/20231215-142307-arnaudb.json [14:25:27] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:26:32] 10SRE, 10Infrastructure-Foundations: Migrate Spicerack logs from cumin1001 to cumin1002? - https://phabricator.wikimedia.org/T353523 (10MoritzMuehlenhoff) [14:27:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on db2194.codfw.wmnet with reason: production freeze will occur before cookbook is finished [14:27:02] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:27:18] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on db2194.codfw.wmnet with reason: production freeze will occur before cookbook is finished [14:27:24] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[14:27:51] (03PS5) 10Ayounsi: Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) [14:27:55] 10SRE, 10Infrastructure-Foundations: Move pwstore repository from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353524 (10MoritzMuehlenhoff) [14:28:22] (03CR) 10Majavah: "What problems are you seeing that this could fix? I'm a bit nervous changing these settings as I fear it'll break something as we run quit" [puppet] - 10https://gerrit.wikimedia.org/r/983139 (owner: 10David Caro) [14:30:13] (03PS6) 10Ayounsi: Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) [14:30:24] 10SRE, 10Infrastructure-Foundations: Remove cumin1001 from router ACLs - https://phabricator.wikimedia.org/T353525 (10MoritzMuehlenhoff) [14:31:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 80%: candidate master repooling', diff saved to https://phabricator.wikimedia.org/P54480 and previous config saved to /var/cache/conftool/dbconfig/20231215-143118-arnaudb.json [14:31:31] 10SRE, 10Infrastructure-Foundations: Migrate dbbackups from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353526 (10MoritzMuehlenhoff) [14:31:55] (03PS1) 10Elukey: profile::thanos: remove Pyrra recording rule for Istio [puppet] - 10https://gerrit.wikimedia.org/r/983421 (https://phabricator.wikimedia.org/T352756) [14:32:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983420 (owner: 10Muehlenhoff) [14:34:44] (03CR) 10CI reject: [V: 04-1] Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [14:35:11] (03PS1) 10Brouberol: spark-history: Add missing service template include [deployment-charts] - 10https://gerrit.wikimedia.org/r/983422 
(https://phabricator.wikimedia.org/T351722) [14:35:16] (03CR) 10David Caro: grid: disable hardcoded memory overcmommit on weblight (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983139 (owner: 10David Caro) [14:36:52] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:53] (03CR) 10DCausse: [C: 03+2] cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/983419 (owner: 10DCausse) [14:37:35] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/983419 (owner: 10DCausse) [14:37:50] (03PS1) 10Bking: wdqs: Set default Accept: header [puppet] - 10https://gerrit.wikimedia.org/r/983415 (https://phabricator.wikimedia.org/T347355) [14:38:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: candidate master proper repooling', diff saved to https://phabricator.wikimedia.org/P54481 and previous config saved to /var/cache/conftool/dbconfig/20231215-143812-arnaudb.json [14:38:43] (03PS2) 10Brouberol: spark-history: Add missing service template include [deployment-charts] - 10https://gerrit.wikimedia.org/r/983422 (https://phabricator.wikimedia.org/T351722) [14:39:57] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:40:00] (03CR) 10JMeybohm: [C: 03+1] spark-history: Add missing service template include [deployment-charts] - 10https://gerrit.wikimedia.org/r/983422 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [14:40:08] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:41:30] (03CR) 10Brouberol: [C: 03+1] wdqs: Set default Accept: header [puppet] - 
10https://gerrit.wikimedia.org/r/983415 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [14:42:10] (03CR) 10Brouberol: [C: 03+2] spark-history: Add missing service template include [deployment-charts] - 10https://gerrit.wikimedia.org/r/983422 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [14:42:56] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/933/c" [puppet] - 10https://gerrit.wikimedia.org/r/983359 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [14:44:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [14:45:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [14:45:40] (03PS1) 10Muehlenhoff: prometheus::snmp_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/983425 [14:46:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [14:46:10] (03CR) 10CI reject: [V: 04-1] prometheus::snmp_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/983425 (owner: 10Muehlenhoff) [14:46:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 100%: candidate master repooling', diff saved to https://phabricator.wikimedia.org/P54482 and previous config saved to /var/cache/conftool/dbconfig/20231215-144624-arnaudb.json [14:46:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [14:46:40] (03PS7) 10Ayounsi: Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) [14:46:51] (03CR) 10Bking: [C: 03+2] wdqs: Set default Accept: header [puppet] - 10https://gerrit.wikimedia.org/r/983415 (https://phabricator.wikimedia.org/T347355) (owner: 
10Bking) [14:51:00] 10SRE-tools, 10Infrastructure-Foundations: Decommission cookbook: lock per switch - https://phabricator.wikimedia.org/T353513 (10ABran-WMF) To add a bit of contextual informations: this was triggered during T353449 and T353448 which were run seconds apart [14:52:16] (03PS2) 10Muehlenhoff: prometheus::snmp_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/983425 [14:52:22] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::master: Switch blackbox check to paging [puppet] - 10https://gerrit.wikimedia.org/r/983359 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [14:55:06] (03CR) 10Ayounsi: [C: 03+1] librenms: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/983420 (owner: 10Muehlenhoff) [14:56:52] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:57:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983425 (owner: 10Muehlenhoff) [14:58:31] (03PS2) 10Klausman: ml-services: make sure nllb-200-gpu is running only in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/983424 [15:00:05] (03PS3) 10Majavah: admin: POC: allow using security key backed SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/981418 [15:00:07] (03PS1) 10Majavah: admin: add spec test for hashuser [puppet] - 10https://gerrit.wikimedia.org/r/983429 [15:00:09] (03PS1) 10Majavah: admin: add security key based keys for taavi [puppet] - 10https://gerrit.wikimedia.org/r/983430 [15:00:33] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[15:00:44] (03CR) 10Majavah: admin: POC: allow using security key backed SSH keys (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/981418 (owner: 10Majavah) [15:00:55] (03CR) 10CI reject: [V: 04-1] admin: POC: allow using security key backed SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/981418 (owner: 10Majavah) [15:00:57] (03CR) 10CI reject: [V: 04-1] admin: add spec test for hashuser [puppet] - 10https://gerrit.wikimedia.org/r/983429 (owner: 10Majavah) [15:01:03] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:01:25] (03CR) 10CI reject: [V: 04-1] admin: add security key based keys for taavi [puppet] - 10https://gerrit.wikimedia.org/r/983430 (owner: 10Majavah) [15:01:33] (03PS1) 10Eevans: restbase2032: fix erroneous row [puppet] - 10https://gerrit.wikimedia.org/r/983431 (https://phabricator.wikimedia.org/T352468) [15:02:38] (03CR) 10Eevans: [C: 03+2] restbase2032: fix erroneous row [puppet] - 10https://gerrit.wikimedia.org/r/983431 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [15:02:40] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/983430 (owner: 10Majavah) [15:04:27] (03CR) 10Klausman: [C: 03+1] profile::thanos: remove Pyrra recording rule for Istio [puppet] - 10https://gerrit.wikimedia.org/r/983421 (https://phabricator.wikimedia.org/T352756) (owner: 10Elukey) [15:04:47] (03PS2) 10Majavah: admin: add spec test for hashuser [puppet] - 10https://gerrit.wikimedia.org/r/983429 [15:04:49] (03PS4) 10Majavah: admin: POC: allow using security key backed SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/981418 [15:04:52] (03PS2) 10Majavah: admin: add security key based keys for taavi [puppet] - 10https://gerrit.wikimedia.org/r/983430 [15:05:08] (03CR) 10Ilias Sarantopoulos: "Nice approach! 
But for production we need to include eqiad and exclude codfw as we GPUs on eqiad only." [deployment-charts] - 10https://gerrit.wikimedia.org/r/983424 (owner: 10Klausman) [15:05:14] (03CR) 10Herron: [C: 03+1] profile::thanos: remove Pyrra recording rule for Istio [puppet] - 10https://gerrit.wikimedia.org/r/983421 (https://phabricator.wikimedia.org/T352756) (owner: 10Elukey) [15:05:45] (03CR) 10CI reject: [V: 04-1] admin: add spec test for hashuser [puppet] - 10https://gerrit.wikimedia.org/r/983429 (owner: 10Majavah) [15:05:49] (03CR) 10CI reject: [V: 04-1] admin: POC: allow using security key backed SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/981418 (owner: 10Majavah) [15:06:26] (03PS3) 10Klausman: ml-services: make sure nllb-200-gpu is running only in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/983424 [15:07:40] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/983430 (owner: 10Majavah) [15:07:50] (03CR) 10Klausman: "PTAL" [deployment-charts] - 10https://gerrit.wikimedia.org/r/983424 (owner: 10Klausman) [15:12:37] (03PS1) 10Ilias Sarantopoulos: ml-services: manually set number of threads for nllb-cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/983436 (https://phabricator.wikimedia.org/T351740) [15:13:35] (03PS1) 10Muehlenhoff: rsync::server: Add support for creating nftables-compatible firewall services [puppet] - 10https://gerrit.wikimedia.org/r/983437 [15:14:18] (03CR) 10CI reject: [V: 04-1] rsync::server: Add support for creating nftables-compatible firewall services [puppet] - 10https://gerrit.wikimedia.org/r/983437 (owner: 10Muehlenhoff) [15:14:57] (03PS1) 10Bking: wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) [15:15:19] (03CR) 10Bking: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:15:28] (03CR) 10CI reject: [V: 04-1] wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:16:55] (03PS3) 10Majavah: admin: add spec test for hashuser [puppet] - 10https://gerrit.wikimedia.org/r/983429 [15:16:57] (03PS5) 10Majavah: admin: POC: allow using security key backed SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/981418 [15:16:59] (03PS3) 10Majavah: admin: add security key based keys for taavi [puppet] - 10https://gerrit.wikimedia.org/r/983430 [15:17:37] (03CR) 10CI reject: [V: 04-1] admin: add spec test for hashuser [puppet] - 10https://gerrit.wikimedia.org/r/983429 (owner: 10Majavah) [15:17:50] (03CR) 10CI reject: [V: 04-1] admin: POC: allow using security key backed SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/981418 (owner: 10Majavah) [15:18:08] (03PS2) 10Muehlenhoff: rsync::server: Add support for creating nftables-compatible firewall services [puppet] - 10https://gerrit.wikimedia.org/r/983437 [15:19:00] (03PS4) 10Majavah: admin: add spec test for hashuser [puppet] - 10https://gerrit.wikimedia.org/r/983429 [15:19:02] (03PS6) 10Majavah: admin: POC: allow using security key backed SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/981418 [15:19:04] (03PS4) 10Majavah: admin: add security key based keys for taavi [puppet] - 10https://gerrit.wikimedia.org/r/983430 [15:19:11] (03CR) 10Effie Mouzeli: [C: 03+2] memcached: provide the CA cert when listening to TLS [puppet] - 10https://gerrit.wikimedia.org/r/983146 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [15:20:37] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM but please update the commit message to reflect eqiad/codfw status" [deployment-charts] - 10https://gerrit.wikimedia.org/r/983424 (owner: 10Klausman) [15:21:01] (03PS3) 10Effie Mouzeli: memcached: provide the CA cert 
when listening to TLS (mc-gp1001) [puppet] - 10https://gerrit.wikimedia.org/r/983365 (https://phabricator.wikimedia.org/T349619) [15:21:05] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-October-December): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10elukey) Hi @santhosh, sorry for the lag but I missed the notification! >>! In T335491#9369595, @santhosh w... [15:21:11] (03PS2) 10Bking: wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) [15:21:42] (03CR) 10Effie Mouzeli: [C: 03+2] memcached: provide the CA cert when listening to TLS (mc-gp1001) [puppet] - 10https://gerrit.wikimedia.org/r/983365 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [15:21:44] (03CR) 10CI reject: [V: 04-1] wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:22:00] (03PS4) 10Klausman: ml-services: make sure nllb-200-gpu is running only where we have GPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/983424 [15:22:10] (03PS3) 10Bking: wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) [15:22:25] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:22:39] (03CR) 10CI reject: [V: 04-1] wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:22:58] (03CR) 10Klausman: ml-services: make sure nllb-200-gpu is running only where we have GPUs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/983424 (owner: 10Klausman) [15:23:22] (03CR) 10Klausman: [C: 03+2] ml-services: make sure nllb-200-gpu is running only where we have GPUs [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/983424 (owner: 10Klausman) [15:23:39] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/983430 (owner: 10Majavah) [15:23:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jijiki) Prep work for memcached hosts is in place; those hosts are using each host's puppet certs for TLS, and migrating to puppet7 needs a minor tweak due... [15:24:08] (03Merged) 10jenkins-bot: ml-services: make sure nllb-200-gpu is running only where we have GPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/983424 (owner: 10Klausman) [15:24:23] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/983279 (https://phabricator.wikimedia.org/T282308) (owner: 10JHathaway) [15:24:33] (03PS4) 10Btullis: Switch presto from Puppet to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) [15:24:41] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/983278 (https://phabricator.wikimedia.org/T282308) (owner: 10JHathaway) [15:25:16] !log klausman@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[15:27:02] (03PS4) 10Bking: wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) [15:27:10] (03CR) 10Elukey: [C: 03+2] profile::thanos: remove Pyrra recording rule for Istio [puppet] - 10https://gerrit.wikimedia.org/r/983421 (https://phabricator.wikimedia.org/T352756) (owner: 10Elukey) [15:27:34] (03CR) 10CI reject: [V: 04-1] wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:28:25] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:28:30] (03PS5) 10Btullis: Switch presto from Puppet to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) [15:29:14] (03PS1) 10Esanders: DiscussionTools: Enable permalinks backend on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983440 [15:29:46] (03PS6) 10Btullis: Switch presto from Puppet to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) [15:32:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jijiki) [15:33:03] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1148.eqiad.wmnet - https://phabricator.wikimedia.org/T353449 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF [15:33:05] (03PS5) 10Bking: wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) [15:33:34] (03CR) 10CI reject: [V: 04-1] wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:34:30] 10SRE, 10SRE-tools, 
10Infrastructure-Foundations, 10Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jijiki) [15:34:42] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/937/con" [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [15:35:08] (03PS6) 10Bking: wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) [15:35:49] (03PS1) 10Alexandros Kosiaris: tlsproxy::envoy: Allow specifying a percentage to be traced [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) [15:36:41] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1137.eqiad.wmnet - https://phabricator.wikimedia.org/T353448 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF [15:36:56] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1148.eqiad.wmnet - https://phabricator.wikimedia.org/T353449 (10VRiley-WMF) [15:39:20] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing swift/puppet problems - example reimage - https://phabricator.wikimedia.org/T308644 (10MatthewVernon) I think we're at the point where we're going to move to the more reliable approach on a rolling basis as hardware gets... 
[15:40:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983437 (owner: 10Muehlenhoff) [15:40:56] (03PS2) 10Ilias Sarantopoulos: ml-services: manually set number of threads for nllb-cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/983436 (https://phabricator.wikimedia.org/T351740) [15:41:17] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1132.eqiad.wmnet - https://phabricator.wikimedia.org/T353447 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF [15:41:46] (03PS3) 10Ilias Sarantopoulos: ml-services: manually set number of threads for nllb-cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/983436 (https://phabricator.wikimedia.org/T351740) [15:42:29] (03CR) 10Elukey: [C: 03+1] ml-services: manually set number of threads for nllb-cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/983436 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [15:42:58] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: manually set number of threads for nllb-cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/983436 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [15:43:21] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::snmp_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/983425 (owner: 10Muehlenhoff) [15:43:50] (03Merged) 10jenkins-bot: ml-services: manually set number of threads for nllb-cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/983436 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [15:44:20] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[15:50:54] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [15:57:45] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:59:57] 10SRE, 10ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (10MoritzMuehlenhoff) @Jclark-ctr Sure, see below: ` smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-25-amd64] (local build) Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org === START OF... [16:18:25] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:24:20] (03PS1) 10JMeybohm: Add more calico alerts [alerts] - 10https://gerrit.wikimedia.org/r/983446 (https://phabricator.wikimedia.org/T353463) [16:26:03] (03CR) 10CI reject: [V: 04-1] Add more calico alerts [alerts] - 10https://gerrit.wikimedia.org/r/983446 (https://phabricator.wikimedia.org/T353463) (owner: 10JMeybohm) [16:26:22] (03CR) 10Brouberol: "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:37:24] (03CR) 10Btullis: [C: 03+1] "Nice." 
[puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:40:14] (03PS2) 10JMeybohm: Add more calico alerts [alerts] - 10https://gerrit.wikimedia.org/r/983446 (https://phabricator.wikimedia.org/T353463) [16:51:08] (03PS1) 10DCausse: rdf_streaming_updater: drop RdfStreamingUpdaterNotEnoughTaskSlots [alerts] - 10https://gerrit.wikimedia.org/r/983449 (https://phabricator.wikimedia.org/T350784) [16:51:10] (03PS1) 10DCausse: rdf-streaming-updater: switch to flink-app dashboard [alerts] - 10https://gerrit.wikimedia.org/r/983450 (https://phabricator.wikimedia.org/T350784) [16:52:54] (03PS1) 10DCausse: charts: remove flink-session-cluster chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/983451 (https://phabricator.wikimedia.org/T350784) [16:53:13] (03PS7) 10Bking: wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) [16:53:30] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: switch to flink-app dashboard [alerts] - 10https://gerrit.wikimedia.org/r/983450 (https://phabricator.wikimedia.org/T350784) (owner: 10DCausse) [16:53:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:07:17] (03PS2) 10DCausse: rdf-streaming-updater: switch to flink-app dashboard [alerts] - 10https://gerrit.wikimedia.org/r/983450 (https://phabricator.wikimedia.org/T350784) [17:08:25] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:12:21] 10SRE, 10ops-eqiad, 10DC-Ops: Audit of WMCS Servers Using Single & Dual Switchports - https://phabricator.wikimedia.org/T349756 (10cmooney) Hey @wiki_willy indeed we made a good start to this migration early in the year (see T319184), before other priorities interrupted 
progress. Having checked on the hosts... [17:20:29] 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10taavi) This can move forward now, although due to the nature of Rabbit this needs to be coordinated to avoid unnecessary downtime. It's fine if this goes to... [17:32:07] 10SRE, 10ops-eqiad, 10DC-Ops: Audit of WMCS Servers Using Single & Dual Switchports - https://phabricator.wikimedia.org/T349756 (10taavi) >>! In T349756#9409811, @cmooney wrote: > //cloudvirt1025 - cloudvirt1030// are set to "offline" in Netbox, so I'm not sure what the situation with those is. They all rem... [18:12:21] 10SRE, 10SRE-Access-Requests: Replace Kbrown's old ssh public key with a new one - https://phabricator.wikimedia.org/T353467 (10jhathaway) @Kbrown confirmed the key via slack. [18:12:37] (03PS1) 10JHathaway: admin: new ssh key for kbrown [puppet] - 10https://gerrit.wikimedia.org/r/983462 (https://phabricator.wikimedia.org/T353467) [18:29:45] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/983363 (https://phabricator.wikimedia.org/T353314) (owner: 10JMeybohm) [18:31:07] (03CR) 10Jbond: [C: 03+1] "<3 didn;t expect you to create one 😊" [puppet] - 10https://gerrit.wikimedia.org/r/983429 (owner: 10Majavah) [18:31:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/981418 (owner: 10Majavah) [18:31:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/983430 (owner: 10Majavah) [18:34:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/983279 (https://phabricator.wikimedia.org/T282308) (owner: 10JHathaway) [18:36:43] (03PS1) 10Herron: wip: initial packaging [debs/python-verlib2] - 10https://gerrit.wikimedia.org/r/983468 [18:40:09] (03PS1) 10Dzahn: ssl: update certificate for webserver-misc-apps.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/983470 
(https://phabricator.wikimedia.org/T333510) [18:41:11] (03CR) 10Dzahn: [C: 03+2] "openssl x509 -noout -text -in webserver-misc-apps.discovery.wmnet.crt | grep DNS" [puppet] - 10https://gerrit.wikimedia.org/r/983470 (https://phabricator.wikimedia.org/T333510) (owner: 10Dzahn) [19:00:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/982912 (https://phabricator.wikimedia.org/T353186) (owner: 10JHathaway) [19:01:00] (03PS1) 10Dzahn: Revert "Revert "microsites/query_service: enable TLS when monitoring commons-query"" [puppet] - 10https://gerrit.wikimedia.org/r/983303 [19:06:12] (03PS2) 10Dzahn: Revert "Revert "microsites/query_service: enable TLS when monitoring commons-query"" [puppet] - 10https://gerrit.wikimedia.org/r/983303 [19:12:22] (03PS3) 10Dzahn: Revert "Revert "microsites/query_service: enable TLS when monitoring commons-query"" [puppet] - 10https://gerrit.wikimedia.org/r/983303 [19:12:29] (03CR) 10Dzahn: "amending because meanwhile we duplicated the check, one for each team" [puppet] - 10https://gerrit.wikimedia.org/r/983303 (owner: 10Dzahn) [19:16:01] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/983207/938/" [puppet] - 10https://gerrit.wikimedia.org/r/983207 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn) [19:28:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:33:01] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:34:09] (03CR) 10JHathaway: [C: 03+2] "thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/982912 (https://phabricator.wikimedia.org/T353186) (owner: 10JHathaway) [19:36:36] 10SRE, 10SRE-swift-storage, 10ops-codfw, 
10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10Jhancock.wm) [19:37:27] (03CR) 10CDanis: [C: 03+1] pki: rename intermediates to prevent aux.pem cloning on Windows [puppet] - 10https://gerrit.wikimedia.org/r/983279 (https://phabricator.wikimedia.org/T282308) (owner: 10JHathaway) [19:37:28] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (10jhathaway) 05Open→03Resolved a:03jhathaway @Sandeeps merged, enjoy! [19:39:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (10Sandeeps) @jhathaway Thank you [19:44:43] (03CR) 10JHathaway: [C: 03+2] "thanks @jbond @cdanis for the reviews" [puppet] - 10https://gerrit.wikimedia.org/r/983279 (https://phabricator.wikimedia.org/T282308) (owner: 10JHathaway) [19:46:27] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10Jhancock.wm) [19:47:18] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10Jhancock.wm) a:03Jhancock.wm [19:58:49] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:37] (03PS1) 10Herron: initial import from upstream 0.2.0 [debs/python-verlib2] - 10https://gerrit.wikimedia.org/r/983475 [20:03:07] (03PS2) 10Herron: wip: initial packaging [debs/python-verlib2] - 10https://gerrit.wikimedia.org/r/983468 [20:04:32] (03CR) 10Dzahn: [V: 03+1 C: 03+2] planet: remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/983207 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn) [20:05:19] PROBLEM - Check unit 
status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:09:30] (03PS4) 10Dzahn: Revert "Revert "microsites/query_service: enable TLS when monitoring commons-query"" [puppet] - 10https://gerrit.wikimedia.org/r/983303 [20:14:07] (03CR) 10Dzahn: [C: 03+2] "only changing the one for serviceops-collabd, not the one for search-platform, making it a clean revert. and we can see if it actually wor" [puppet] - 10https://gerrit.wikimedia.org/r/983303 (owner: 10Dzahn) [20:18:24] (03CR) 10Majavah: [C: 03+2] admin: add spec test for hashuser [puppet] - 10https://gerrit.wikimedia.org/r/983429 (owner: 10Majavah) [20:18:26] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [20:26:38] !log milimetric@deploy2002 Started deploy [analytics/refinery@eeb98ac]: Syncing changes to HDFS [20:37:57] (03PS1) 10Herron: wip: initial packaging [debs/python-grafana-client] - 10https://gerrit.wikimedia.org/r/983477 [20:44:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:45:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:47:57] (03PS1) 10Herron: initial import of 3.10.0 [debs/python-grafana-client] - 10https://gerrit.wikimedia.org/r/983480 [20:48:17] (03PS2) 10Herron: wip: initial packaging [debs/python-grafana-client] - 10https://gerrit.wikimedia.org/r/983477 [20:49:55] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: 
HTTP/1.1 200 OK - 51009 bytes in 4.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:50:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.299 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:51:09] (03CR) 10Herron: "something to get the ball rolling on packaging this, has a build dependency on I9e13b423f6177e479d2c47543e36b212605ea0dd" [debs/python-grafana-client] - 10https://gerrit.wikimedia.org/r/983477 (owner: 10Herron) [20:51:27] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10jhathaway) [20:52:02] (03PS3) 10Herron: grafana-client: initial packaging [debs/python-grafana-client] - 10https://gerrit.wikimedia.org/r/983477 [20:52:48] (03PS3) 10Herron: verlib2: initial packaging [debs/python-verlib2] - 10https://gerrit.wikimedia.org/r/983468 [20:53:13] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10jhathaway) @ANakanishi_WMF please approve [20:54:02] (03CR) 10Herron: [V: 03+1] grafana: add dashboard datasource usage (graphite) exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [20:54:27] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:45] (03PS1) 10Dzahn: roles/hieradata: rename serviceops-collab team [puppet] - 10https://gerrit.wikimedia.org/r/983481 [20:56:21] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10jhathaway) @Milimetric please approve [20:57:47] RECOVERY - Check unit 
status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:00:22] 10SRE, 10ops-eqiad: ps1-e8-eqiad down - https://phabricator.wikimedia.org/T353503 (10wiki_willy) @Jclark-ctr or @VRiley-WMF - can one of you take a look at this one? Thanks, Willy [21:00:57] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Aki Nakanishi - https://phabricator.wikimedia.org/T353363 (10jhathaway) thanks @ANakanishi_WMF would you also please confirm your anakanishi@wikimedia.org email address associated with your devel... [21:04:45] (03PS1) 10Dzahn: site/cumin: rename insetup role for collaborative services [puppet] - 10https://gerrit.wikimedia.org/r/983485 [21:05:15] (03CR) 10CI reject: [V: 04-1] site/cumin: rename insetup role for collaborative services [puppet] - 10https://gerrit.wikimedia.org/r/983485 (owner: 10Dzahn) [21:08:09] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/983485 (owner: 10Dzahn) [21:08:26] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:12:29] (03CR) 10JHathaway: [C: 03+1] Install community_civicrm on crm role [puppet] - 10https://gerrit.wikimedia.org/r/982914 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [21:24:07] RECOVERY - Check systemd state on mw2442 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:26:24] !log running puppet on all prometheus* [21:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:08] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Re-images sometimes fail as the cert request goes to the wrong puppet master - 
https://phabricator.wikimedia.org/T353558 (10jhathaway) [21:32:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Re-images sometimes fail as the cert request goes to the wrong puppet master - https://phabricator.wikimedia.org/T353558 (10jhathaway) p:05Triage→03Medium [21:32:45] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Re-images sometimes fail as the cert request goes to the wrong puppet master - https://phabricator.wikimedia.org/T353558 (10jhathaway) a:05Volans→03None [21:35:08] (03CR) 10CDanis: "This looks reasonable, but can you re-run pcc to explicitly include mwdebug1001?" [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [21:44:57] (03CR) 10Bking: [C: 03+1] "Neither colons nor snakes stays these couriers from the swift completion of their appointed rounds..." [puppet] - 10https://gerrit.wikimedia.org/r/983278 (https://phabricator.wikimedia.org/T282308) (owner: 10JHathaway) [21:48:24] !log milimetric@deploy2002 Finished deploy [analytics/refinery@eeb98ac]: Syncing changes to HDFS (duration: 81m 46s) [21:48:49] !log milimetric@deploy2002 Started deploy [analytics/refinery@eeb98ac] (thin): Syncing changes to HDFS [21:48:55] !log milimetric@deploy2002 Finished deploy [analytics/refinery@eeb98ac] (thin): Syncing changes to HDFS (duration: 00m 06s) [22:03:26] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@5090fdc]: (no justification provided) [22:03:51] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@5090fdc]: (no justification provided) (duration: 00m 25s) [22:19:33] (03PS1) 10Dzahn: query_service: force TLS for monitoring for search-platform team [puppet] - 10https://gerrit.wikimedia.org/r/983491 (https://phabricator.wikimedia.org/T333510) [22:30:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/983462 (https://phabricator.wikimedia.org/T353467) (owner: 10JHathaway) [22:38:46] (03PS1) 10Peter Fischer: Bump services/cirrus-streaming-updater version [deployment-charts] - 10https://gerrit.wikimedia.org/r/983495 [22:39:53] (03CR) 10Peter Fischer: [C: 03+2] Bump services/cirrus-streaming-updater version [deployment-charts] - 10https://gerrit.wikimedia.org/r/983495 (owner: 10Peter Fischer) [22:40:44] (03Merged) 10jenkins-bot: Bump services/cirrus-streaming-updater version [deployment-charts] - 10https://gerrit.wikimedia.org/r/983495 (owner: 10Peter Fischer) [22:42:27] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:42:57] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:06:33] !log milimetric@deploy2002 Started deploy [airflow-dags/platform_eng@160d0f0]: (no justification provided) [23:06:58] !log milimetric@deploy2002 Finished deploy [airflow-dags/platform_eng@160d0f0]: (no justification provided) (duration: 00m 25s) [23:28:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:37:46] (03CR) 10JHathaway: [C: 03+2] lists: rename repo templates to be compatible with Windows [puppet] - 10https://gerrit.wikimedia.org/r/983278 (https://phabricator.wikimedia.org/T282308) (owner: 10JHathaway) [23:46:22] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@9600237]: (no justification provided) [23:46:29] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:46:50] !log htriedman@deploy2002 Finished deploy 
[airflow-dags/platform_eng@9600237]: (no justification provided) (duration: 00m 27s) [23:56:22] (03PS1) 10JHathaway: pki: fix rename of intermediates [puppet] - 10https://gerrit.wikimedia.org/r/983504 (https://phabricator.wikimedia.org/T282308) [23:57:28] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983504 (https://phabricator.wikimedia.org/T282308) (owner: 10JHathaway)