[00:00:20] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:48] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:40] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "I confirmed the files in /srv/org/wikimedia/releases/ (the document root) are identical on back backends and that the existing timer and r" [puppet] - 10https://gerrit.wikimedia.org/r/893577 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [00:09:46] (03PS2) 10Dzahn: switch releases.wikimedia.org backends rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/893577 (https://phabricator.wikimedia.org/T330960) [00:09:52] PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 1748 MB (3% inode=96%): /tmp 1748 MB (3% inode=96%): /var/tmp 1748 MB (3% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops [00:13:39] !log switching releases.wikimedia.org from eqiad to codfw - T330960 [00:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:46] T330960: switch releases.wikimedia.org from eqiad to codfw - https://phabricator.wikimedia.org/T330960 [00:20:01] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [00:29:24] (03CR) 10Cwhite: [C: 03+1] "Looks good, thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [00:31:20] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:04:32] (03PS1) 10Aaron Schulz: DNM: set "s3" as the default section name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893834 [01:04:34] (03PS1) 10Aaron Schulz: Use pt-heartbeat for all non-static external clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893835 (https://phabricator.wikimedia.org/T129093) [01:12:27] !log releases1002: deleting /usr/local/sbin/sync-srv-org-wikimedia-reprepro-releases1002.eqiad.wmnet which confusingly contains an rsync command to rsync from releases1001 which does not exist anymore T330960 [01:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:34] T330960: switch releases.wikimedia.org from eqiad to codfw - https://phabricator.wikimedia.org/T330960 [01:12:36] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [01:15:01] (CirrusSearchNodeIndexingNotIncreasing) firing: (2) Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [01:24:28] RECOVERY - NTP peers on dns1001 is OK: NTP OK: Offset 0.070128 secs https://wikitech.wikimedia.org/wiki/NTP [01:31:47] 
(03PS2) 10Dzahn: releases: add monitor for releases-jenkins.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/893828 [01:34:13] (03CR) 10CI reject: [V: 04-1] releases: add monitor for releases-jenkins.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/893828 (owner: 10Dzahn) [01:34:16] (03PS2) 10Dzahn: switch releases.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/893576 (https://phabricator.wikimedia.org/T330960) [01:34:34] (03PS3) 10Dzahn: releases: add monitor for releases-jenkins.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/893828 (https://phabricator.wikimedia.org/T330960) [01:38:20] (03PS3) 10Dzahn: switch releases.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/893576 (https://phabricator.wikimedia.org/T330960) [01:38:34] (03CR) 10Dzahn: [C: 03+2] switch releases.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/893576 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [01:47:53] 10SRE, 10API Platform, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data point - https://phabricator.wikimedia.org/T324675 (10Etonkovidova) Checked in `w... 
[02:06:25] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn) https://releases.wikimedia.org switched to codfw (T330960) [02:09:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:16:57] 10SRE-Access-Requests: shell user "toan" - address couldn't be found - https://phabricator.wikimedia.org/T331081 (10Dzahn) [02:17:24] (03PS1) 10Ssingh: prometheus: update pdnsrec job for bullseye upgrade of dnsrec hosts [puppet] - 10https://gerrit.wikimedia.org/r/893836 (https://phabricator.wikimedia.org/T321309) [02:23:35] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39934/console" [puppet] - 10https://gerrit.wikimedia.org/r/893836 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [02:24:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:34] (03CR) 10Krinkle: [C: 03+2] build: Fix missing git config for git-stash command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893821 (https://phabricator.wikimedia.org/T331020) (owner: 10Krinkle) [02:26:21] (03Merged) 10jenkins-bot: build: Fix missing git config for git-stash command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893821 (https://phabricator.wikimedia.org/T331020) (owner: 10Krinkle) [02:33:56] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:37:25] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) @cmooney I was about to update the table but I can't only you can. So for everything going from A1 to Bx and A8 to Bx should be 12m (x=1,2,3,4,5,6,7,8). I will g... [02:38:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:43:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:08:13] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T331073 (10Peachey88) [03:08:18] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10Peachey88) [03:09:41] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T331068 (10Peachey88) [03:09:46] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10Peachey88) [03:09:52] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T331064 (10Peachey88) [03:09:56] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10Peachey88) [03:10:05] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T331059 (10Peachey88) [03:10:08] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded 
RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10Peachey88) [03:41:00] (03CR) 10Krinkle: Migrate $wmfRealm calls to $wmgRealm (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759300 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [04:08:56] (03PS1) 10Aaron Schulz: DNM: add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) [04:09:37] (03CR) 10CI reject: [V: 04-1] DNM: add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [04:15:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:38:20] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-snowick-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:01] (CirrusSearchNodeIndexingNotIncreasing) firing: (2) Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [05:20:24] PROBLEM - NTP peers on dns1001 is CRITICAL: NTP CRITICAL: Server has the LI_ALARM bit set, Offset 0.552156 secs (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP [05:55:40] (03PS1) 10Marostegui: install_server: Do not reimage db2187 [puppet] - 10https://gerrit.wikimedia.org/r/893845 (https://phabricator.wikimedia.org/T326596) [05:57:01] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2187 [puppet] - 
10https://gerrit.wikimedia.org/r/893845 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [05:57:18] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10Marostegui) 05Open→03Resolved The raid is back in optimal status [06:00:47] (03CR) 10Marostegui: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/893803 (https://phabricator.wikimedia.org/T329499) (owner: 10Jcrespo) [06:12:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [06:16:32] 10SRE, 10Maps: Remove bbcrewind.co.uk exemption for Wikimedia Maps - https://phabricator.wikimedia.org/T331087 (10Legoktm) p:05Triage→03Low [06:18:46] (03PS1) 10Legoktm: varnish: Remove bbcrewind exemption for Wikimedia Maps [puppet] - 10https://gerrit.wikimedia.org/r/893846 (https://phabricator.wikimedia.org/T331087) [06:20:35] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10Legoktm) Thanks for the update - filed {T331087} to track the removal. 
[06:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:33:46] PROBLEM - NTP peers on dns1001 is CRITICAL: NTP CRITICAL: Server has the LI_ALARM bit set, Offset 0.542774 secs (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP [07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230303T0700) [07:19:02] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:26:26] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 181, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:27:49] TheresNoTime: I need you know please [07:29:08] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4044 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 81052 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [07:29:35] !log truncate /var/log/auth.log.1 on krb1001 to free space (root partition almost filled up) [07:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:58] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4044 is OK: SSL OK - OCSP staple validity for wikipedia.org has 350942 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [07:34:22] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: update pdnsrec job for bullseye upgrade of dnsrec hosts [puppet] - 10https://gerrit.wikimedia.org/r/893836 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [07:45:42] (03PS1) 10Hashar: Revert 
"switch releases.wikimedia.org backends rsync direction" [puppet] - 10https://gerrit.wikimedia.org/r/893778 (https://phabricator.wikimedia.org/T330960) [07:45:44] (03PS1) 10Hashar: Revert "switch releases.wikimedia.org from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/893779 (https://phabricator.wikimedia.org/T330960) [07:47:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [07:47:54] RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops [07:49:15] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10ayounsi) > @ayounsi interested to hear your thoughts, personally my instinct is to stick with the Spine1->CR1 and Spine2->CR2 setup, keeping things the same as Eqiad. A... [07:50:32] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10ayounsi) [07:57:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/893836 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230303T0800) [08:24:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:31:54] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:35:13] (03CR) 10Elukey: [C: 03+1] "Not super expert in the cookbook and Redfish API, but the logic looks good." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/893706 (owner: 10Volans) [08:36:55] !log restarting ntp in dns1001 [08:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:02] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Remove redundant wgOriginTrials and wgFeaturePolicyReportOnly settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891314 (owner: 10Krinkle) [08:38:14] RECOVERY - NTP peers on dns1001 is OK: NTP OK: Offset 0.000657 secs https://wikitech.wikimedia.org/wiki/NTP [08:39:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.533 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:40:38] (03CR) 10Elukey: [C: 03+2] sre.k8s.upgrade-cluster: use verbatim_hosts=True for alert manager (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/888202 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [08:40:57] (03CR) 10Elukey: [C: 03+2] sre.hosts.reimage: add full path for facter and run clear dchp earlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: 10Elukey) [08:42:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49709 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:45:14] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Install software version upgrade [08:45:36] (03PS1) 10Muehlenhoff: Readd kelhurd to LDAP access table [puppet] - 10https://gerrit.wikimedia.org/r/893992 (https://phabricator.wikimedia.org/T323943) [08:45:42] 10SRE, 10SRE-Access-Requests: shell user "toan" - address couldn't be found - https://phabricator.wikimedia.org/T331081 (10Aklapper) For the records, I disabled the Phab account `@toan` a 10 days ago for the same reason (email bounces). > Does our offboarding process between WMDE and WMF work as it should? 
I'... [08:47:18] (03CR) 10Muehlenhoff: [C: 03+2] Readd kelhurd to LDAP access table [puppet] - 10https://gerrit.wikimedia.org/r/893992 (https://phabricator.wikimedia.org/T323943) (owner: 10Muehlenhoff) [08:48:24] !log restart pybal on lvs1020 (standby) and then on lvs1019 (active) to pick up monitoring change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/893008) [08:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:03] !log restart pybal on lvs2010 (standby) and then on lvs2009 (active) to pick up monitoring change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/893008) [08:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:40] all done! [09:01:56] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade [09:03:32] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/893676 (owner: 10Jbond) [09:04:21] 10SRE, 10Scap, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10jnuche) 05Open→03Resolved Bullseye hosts now get specific Python wheels created on our base bullseye image. [09:05:45] (03CR) 10Elukey: [C: 03+2] ml-services: Deploy nsfw model with debian bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/891501 (https://phabricator.wikimedia.org/T329612) (owner: 10Ilias Sarantopoulos) [09:07:42] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:08:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/893674 (owner: 10Jbond) [09:10:03] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:10:17] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:14:32] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:14:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:15:01] (CirrusSearchNodeIndexingNotIncreasing) firing: (2) Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [09:16:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.790 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:18:02] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:23:13] (03PS1) 10Jbond: puppet: update log rotate to call /usr/lib/rsyslog/rsyslog-rotate [puppet] - 10https://gerrit.wikimedia.org/r/893997 [09:23:48] 10SRE, 10Observability-Metrics: Grafana: CVE-2022-39307 CVE-2022-39306 - https://phabricator.wikimedia.org/T322829 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This got addressed along with the changes made at T328405 [09:24:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39935/console" [puppet] - 10https://gerrit.wikimedia.org/r/893997 (owner: 10Jbond) [09:27:11] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Install software 
version upgrade [09:29:47] (03PS1) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN adn RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [09:30:09] (03CR) 10CI reject: [V: 04-1] hadoop: automate refresh of exclude nodes in NN adn RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [09:31:33] (03PS2) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN adn RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [09:31:56] (03CR) 10CI reject: [V: 04-1] hadoop: automate refresh of exclude nodes in NN adn RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [09:31:58] (03PS3) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [09:32:20] (03CR) 10CI reject: [V: 04-1] hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [09:33:12] (03CR) 10Vgutierrez: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/893997 (owner: 10Jbond) [09:34:38] (03PS4) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [09:35:21] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet: update log rotate to call /usr/lib/rsyslog/rsyslog-rotate [puppet] - 10https://gerrit.wikimedia.org/r/893997 (owner: 10Jbond) [09:36:06] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39936/console" [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [09:36:10] (03PS1) 10Phuedx: Stop refining SpecialMuteSubmit events [puppet] - 10https://gerrit.wikimedia.org/r/894000 (https://phabricator.wikimedia.org/T329718) [09:37:08] (03PS5) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [09:38:13] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39937/console" [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [09:39:14] 10SRE, 10SRE-Access-Requests: shell user "toan" - address couldn't be found - https://phabricator.wikimedia.org/T331081 (10Lucas_Werkmeister_WMDE) FWIW, @toan is indeed no longer working at WMDE. I’ll see if I can find someone on our side who knows more about the offboarding process. 
[09:39:26] (03CR) 10Jbond: [V: 03+1 C: 03+2] kerberos: update the motd so its not to big [puppet] - 10https://gerrit.wikimedia.org/r/893674 (owner: 10Jbond) [09:40:21] (03PS2) 10Jbond: apt: update the motd so its not to big [puppet] - 10https://gerrit.wikimedia.org/r/893676 [09:40:31] (03CR) 10Jbond: [V: 03+2 C: 03+2] apt: update the motd so its not to big [puppet] - 10https://gerrit.wikimedia.org/r/893676 (owner: 10Jbond) [09:43:50] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:44:00] (03PS6) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [09:44:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:45:22] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39938/console" [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [09:45:24] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade [09:45:32] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49709 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:45:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:45:59] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Install software version upgrade [09:46:14] (03PS7) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and 
RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [09:46:31] 10SRE, 10SRE-Access-Requests: shell user "toan" - address couldn't be found - https://phabricator.wikimedia.org/T331081 (10MoritzMuehlenhoff) >>! In T331081#8662987, @Lucas_Werkmeister_WMDE wrote: > FWIW, @toan is indeed no longer working at WMDE. I’ll see if I can find someone on our side who knows more about... [09:49:20] (03PS1) 10Muehlenhoff: Remove access for toan [puppet] - 10https://gerrit.wikimedia.org/r/894008 (https://phabricator.wikimedia.org/T331081) [09:49:45] (03PS1) 10MVernon: install_server: use newer partman setup for new ms backends [puppet] - 10https://gerrit.wikimedia.org/r/894009 (https://phabricator.wikimedia.org/T308677) [09:50:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for toan [puppet] - 10https://gerrit.wikimedia.org/r/894008 (https://phabricator.wikimedia.org/T331081) (owner: 10Muehlenhoff) [09:51:00] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39939/console" [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [09:51:22] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/894009 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [09:54:08] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Tobias Andersson out of all services on: 909 hosts [09:54:33] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Tobias Andersson out of all services on: 909 hosts [09:55:35] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Tobias Andersson out of all services on: 1119 hosts [09:56:07] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Tobias Andersson out of all services on: 1119 hosts [09:57:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: shell user "toan" - address couldn't be 
found - https://phabricator.wikimedia.org/T331081 (10MoritzMuehlenhoff) Access has been removed and I filed https://phabricator.wikimedia.org/T331100 for HDFS/stat homes. [10:00:19] 10SRE, 10SRE-Access-Requests: Complete offboarding for Jonas Kress - https://phabricator.wikimedia.org/T331102 (10Lucas_Werkmeister_WMDE) [10:01:00] (03CR) 10Jbond: [C: 03+1] "LGTM, i also created https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/859470/5/cookbooks/sre/swift/convert-disks.py for converting " [puppet] - 10https://gerrit.wikimedia.org/r/894009 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [10:05:04] (03CR) 10MVernon: [C: 03+2] install_server: use newer partman setup for new ms backends [puppet] - 10https://gerrit.wikimedia.org/r/894009 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [10:05:40] (03PS1) 10Muehlenhoff: Remove access for jk [puppet] - 10https://gerrit.wikimedia.org/r/894011 (https://phabricator.wikimedia.org/T331102) [10:09:55] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, and 2 others: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10MatthewVernon) @Papaul there's now a partman recipe for these new nodes (see the above merged CR). [10:10:22] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10MatthewVernon) Thanks! 
[10:11:19] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for jk [puppet] - 10https://gerrit.wikimedia.org/r/894011 (https://phabricator.wikimedia.org/T331102) (owner: 10Muehlenhoff) [10:12:15] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jonas Kress (WMDE) out of all services on: 1119 hosts [10:12:46] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jonas Kress (WMDE) out of all services on: 1119 hosts [10:13:45] (03PS6) 10Jbond: convrt-ssds: update cookbook to reimage ms-be with new partition schema [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) [10:13:58] (03CR) 10Jbond: convrt-ssds: update cookbook to reimage ms-be with new partition schema (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [10:14:44] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Complete offboarding for Jonas Kress - https://phabricator.wikimedia.org/T331102 (10MoritzMuehlenhoff) 05Open→03Resolved p:05Triage→03Medium a:03MoritzMuehlenhoff Access has been removed. I filed https://phabricator.wikimedia.org/T331108 for HDFS/st... 
[10:15:31] 10SRE, 10SRE-Access-Requests: shell user "toan" - address couldn't be found - https://phabricator.wikimedia.org/T331081 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff [10:17:25] (03CR) 10MVernon: [C: 03+1] "Hi," [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [10:18:02] (03PS1) 10JMeybohm: admin: Update ssh key for rmaung [puppet] - 10https://gerrit.wikimedia.org/r/894012 (https://phabricator.wikimedia.org/T330335) [10:18:29] (03CR) 10Jbond: icinga: update casts in icinga_status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893797 (owner: 10Jbond) [10:19:22] (03CR) 10JMeybohm: [C: 03+2] admin: Update ssh key for rmaung [puppet] - 10https://gerrit.wikimedia.org/r/894012 (https://phabricator.wikimedia.org/T330335) (owner: 10JMeybohm) [10:20:49] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10JMeybohm) 05Open→03Resolved a:03JMeybohm >>! In T330335#8661559, @Rmaung wrote: > @JMeybohm yes, please! Ok, done. Next tim... 
[10:21:54] (03PS2) 10Jbond: icinga: update casts in icinga_status [puppet] - 10https://gerrit.wikimedia.org/r/893797 [10:23:37] (03CR) 10Jbond: icinga: update casts in icinga_status (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/893797 (owner: 10Jbond) [10:23:40] (03CR) 10Jbond: [C: 03+2] icinga: update casts in icinga_status [puppet] - 10https://gerrit.wikimedia.org/r/893797 (owner: 10Jbond) [10:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:25:42] !log installing 5.10.162 kernels on buster systems running Linux 5.10 [10:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:08] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Research intern nickifeajika - https://phabricator.wikimedia.org/T330993 (10JMeybohm) [10:30:01] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Install software version upgrade [10:30:16] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10nfraison) @Cmjohnson this node has strange behaviour on raid/disks All disks are really slow compare to ones on other nodes. After looking at that it has indeed bad Current Cache policy set... [10:30:47] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Research intern nickifeajika - https://phabricator.wikimedia.org/T330993 (10JMeybohm) Hi @nickifeajika, I've added the template for access requests to this task and would kindly ask to fill in the fields by editing the descripti... 
[10:37:08] yes [10:39:57] !log restart ntp.service in dns2001 [10:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:32] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10JMeybohm) [10:45:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10JMeybohm) Welcome @aranyap! Please add the ssh key type to your request as well (assuming it's a ed25519 key, but just to be precise about it). @... [10:46:47] (03PS7) 10Jbond: convrt-ssds: update cookbook to reimage ms-be with new partition schema [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) [10:51:01] (03PS8) 10Jbond: convrt-ssds: update cookbook to reimage ms-be with new partition schema [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) [10:53:52] (03CR) 10Jbond: convrt-ssds: update cookbook to reimage ms-be with new partition schema (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [10:55:09] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) (owner: 10Hashar) [10:55:16] (03PS8) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [10:55:18] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: 10Hashar) [10:55:38] (03CR) 10CI reject: [V: 04-1] hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 
10Nicolas Fraison) [11:05:11] ACKNOWLEDGEMENT - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service MVernon known issue - T327253 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [11:08:40] (03CR) 10Arturo Borrero Gonzalez: "does this require some kind of migration or compat period? Would things break if this is merged as-is?" [puppet] - 10https://gerrit.wikimedia.org/r/893759 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [11:11:19] (03CR) 10Arturo Borrero Gonzalez: "same question as in the other patch. Do you think this is safe to merge as-is without compat phase? Wondering if we'll have breakage in ca" [puppet] - 10https://gerrit.wikimedia.org/r/893545 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [11:13:36] !log imported PHP 7.4 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2+icu67u1 to component/icu67 (build of PHP against co-installable ICU67) T329491 [11:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:43] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [11:16:00] (03CR) 10Arturo Borrero Gonzalez: labstore: Send prom stats for getent_check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813898 (https://phabricator.wikimedia.org/T313444) (owner: 10David Caro) [11:17:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [11:17:13] (03CR) 10Arturo Borrero Gonzalez: "removing myself from reviewers in an attempt to cleanup my gerrit dashboard. Please ping me if/when we want to restart work on this front." 
[puppet] - 10https://gerrit.wikimedia.org/r/737774 (owner: 10Jbond) [11:17:16] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [11:18:36] (03CR) 10Arturo Borrero Gonzalez: "still interested in getting this merged? This needs manual rebase anyway." [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [11:19:10] (03Abandoned) 10Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:19:41] (03Abandoned) 10Arturo Borrero Gonzalez: wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 (owner: 10Arturo Borrero Gonzalez) [11:21:14] (03PS4) 10Arturo Borrero Gonzalez: wmcs.labstore: add some alerts for labstore [alerts] - 10https://gerrit.wikimedia.org/r/813926 (owner: 10David Caro) [11:22:43] (03CR) 10Arturo Borrero Gonzalez: "naive idea: if NFS hardware servers are going away, perhaps don't introduce this change? Not sure what is the timeline though." 
[alerts] - 10https://gerrit.wikimedia.org/r/813926 (owner: 10David Caro) [11:23:58] (03CR) 10Filippo Giunchedi: icinga: update casts in icinga_status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893797 (owner: 10Jbond) [11:25:17] (03Abandoned) 10Arturo Borrero Gonzalez: wmcs: vps: start_instance_with_prefix: refactor and fix default behavior [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/756589 (owner: 10Arturo Borrero Gonzalez) [11:25:29] (03Abandoned) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: support creating more than 1 instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755707 (owner: 10Arturo Borrero Gonzalez) [11:26:08] (03Abandoned) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build/deplo code into a manager class [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774459 (owner: 10Arturo Borrero Gonzalez) [11:27:41] (03PS1) 10Elukey: profile::service_proxy::envoy: add support for inference [puppet] - 10https://gerrit.wikimedia.org/r/894014 (https://phabricator.wikimedia.org/T330414) [11:29:03] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [11:29:07] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Do... [11:29:17] (03CR) 10Arturo Borrero Gonzalez: "please rebase and merge if still relevant." 
[alerts] - 10https://gerrit.wikimedia.org/r/813915 (https://phabricator.wikimedia.org/T313444) (owner: 10David Caro) [11:29:19] (03PS1) 10Elukey: kserve: upgrade to 0.10 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/894015 (https://phabricator.wikimedia.org/T331114) [11:29:54] 10SRE, 10SRE-Access-Requests: offboarding for Tobias Schumann (contractor for Wikimedia Deutschland) - https://phabricator.wikimedia.org/T331116 (10Lucas_Werkmeister_WMDE) [11:34:21] (03PS1) 10Btullis: Add an entry in the service catalog for the aqs service running in codfw [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) [11:35:15] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39940/console" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [11:35:32] (03PS1) 10Muehlenhoff: Remove LDAP access for tobias-schumann-wmde-ext [puppet] - 10https://gerrit.wikimedia.org/r/894018 (https://phabricator.wikimedia.org/T331116) [11:36:55] (03PS2) 10Btullis: Add an entry in the service catalog for the aqs service running in codfw [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) [11:37:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for tobias-schumann-wmde-ext [puppet] - 10https://gerrit.wikimedia.org/r/894018 (https://phabricator.wikimedia.org/T331116) (owner: 10Muehlenhoff) [11:37:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: offboarding for Tobias Schumann (contractor for Wikimedia Deutschland) - https://phabricator.wikimedia.org/T331116 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Thanks! LDAP access for cn=nda and cn=wmde has been removed. 
[11:42:43] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39941/console" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [11:45:17] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39942/console" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [11:49:46] (03PS1) 10Btullis: Add forward and reverse entries for aqs.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/894024 (https://phabricator.wikimedia.org/T331115) [11:50:08] (03PS1) 10Arturo Borrero Gonzalez: wmcs: nfs: primary: introduce missing hiera keys for maintain_dbusers [puppet] - 10https://gerrit.wikimedia.org/r/894025 (https://phabricator.wikimedia.org/T304040) [12:00:13] (03PS2) 10Arturo Borrero Gonzalez: wmcs: nfs: primary: introduce missing hiera keys for maintain_dbusers [puppet] - 10https://gerrit.wikimedia.org/r/894025 (https://phabricator.wikimedia.org/T304040) [12:01:14] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/894025/39944/" [puppet] - 10https://gerrit.wikimedia.org/r/894025 (https://phabricator.wikimedia.org/T304040) (owner: 10Arturo Borrero Gonzalez) [12:07:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: offboarding for Tobias Schumann (contractor for Wikimedia Deutschland) - https://phabricator.wikimedia.org/T331116 (10Aklapper) @MoritzMuehlenhoff Is there a checklist template? I'd like to see disabling Phabricator accounts included. (And ideally SUL account... [12:11:42] 10SRE, 10SRE-Access-Requests: offboarding for Tobias Schumann (contractor for Wikimedia Deutschland) - https://phabricator.wikimedia.org/T331116 (10MoritzMuehlenhoff) >>! 
In T331116#8663460, @Aklapper wrote: > @MoritzMuehlenhoff Is there a checklist template? I'd like to see disabling Phabricator accounts incl... [12:14:05] (03Abandoned) 10Nicolas Fraison: aqs: set DNS entry for aqs.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/893789 (owner: 10Nicolas Fraison) [12:14:09] (03Abandoned) 10Nicolas Fraison: aqs: set up lvs for aqs codfw [puppet] - 10https://gerrit.wikimedia.org/r/893763 (owner: 10Nicolas Fraison) [12:17:06] (03PS1) 10Jbond: prometheus: ensure we update the mtime if no update required [puppet] - 10https://gerrit.wikimedia.org/r/894026 [12:21:58] (03Abandoned) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: introduce some caching logic in the wrapper [puppet] - 10https://gerrit.wikimedia.org/r/779515 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [12:23:43] (03CR) 10Arturo Borrero Gonzalez: "please rebase and merged if still required." [puppet] - 10https://gerrit.wikimedia.org/r/853539 (owner: 10David Caro) [12:26:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "perhaps another good idea is to introduce a datatype for all harbor config parameters, that can be reused in the profile hiera lookup and " [puppet] - 10https://gerrit.wikimedia.org/r/893480 (owner: 10David Caro) [12:30:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: update wmcs-k8s-get-cert for certificates/v1 [puppet] - 10https://gerrit.wikimedia.org/r/890502 (https://phabricator.wikimedia.org/T292238) (owner: 10Majavah) [12:31:19] 10SRE, 10Infrastructure-Foundations: Tweak Kerberos auth logging - https://phabricator.wikimedia.org/T331123 (10MoritzMuehlenhoff) [12:31:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:36:58] 
(KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:37:36] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10cmooney) a:03cmooney [12:43:28] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) @ayounsi thanks for updating the desc! @papaul I'll update the table with the info provided and get back to you if any more questions. I'll also put together... [12:49:08] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [12:51:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack: remove osmdb dns records [puppet] - 10https://gerrit.wikimedia.org/r/892903 (https://phabricator.wikimedia.org/T323159) (owner: 10Majavah) [12:53:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "let's collect +1 from Andrew as well here." 
[puppet] - 10https://gerrit.wikimedia.org/r/887299 (https://phabricator.wikimedia.org/T302404) (owner: 10Majavah) [12:55:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "will merge next week (friday as I write this comment)" [puppet] - 10https://gerrit.wikimedia.org/r/884307 (owner: 10Muehlenhoff) [12:55:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [12:55:25] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [12:56:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, will merge next week (friday as I write this comment)" [puppet] - 10https://gerrit.wikimedia.org/r/881885 (owner: 10Muehlenhoff) [12:56:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Needs manual rebase apparently. LGTM otherwise." [puppet] - 10https://gerrit.wikimedia.org/r/881887 (owner: 10Muehlenhoff) [12:57:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] clouddumps: Enable profile::auto_restarts::service for rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/881391 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:02:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 (https://phabricator.wikimedia.org/T325617) (owner: 10David Caro) [13:04:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "this should be ready to merge, no?" 
[puppet] - 10https://gerrit.wikimedia.org/r/868634 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [13:13:08] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 178, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:13:13] (03CR) 10David Caro: [C: 03+2] harbor: move to epp template for the config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893480 (owner: 10David Caro) [13:15:01] (CirrusSearchNodeIndexingNotIncreasing) firing: (2) Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:15:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 20485 [13:15:35] (03PS9) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [13:15:48] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 83, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:15:55] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 20485 [13:16:10] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 20485 [13:16:25] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 20485 [13:21:10] (03PS1) 10Filippo Giunchedi: prometheus: don't repeat pint source SPDX header [puppet] - 10https://gerrit.wikimedia.org/r/894038 (https://phabricator.wikimedia.org/T309182) [13:23:01] (03PS2) 10Filippo Giunchedi: prometheus: don't repeat pint source SPDX header [puppet] - 10https://gerrit.wikimedia.org/r/894038 
(https://phabricator.wikimedia.org/T309182) [13:38:10] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4037 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 58910 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [13:40:00] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4037 is OK: SSL OK - OCSP staple validity for wikipedia.org has 328800 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/HTTPS [13:43:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/894038 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:44:23] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: don't repeat pint source SPDX header [puppet] - 10https://gerrit.wikimedia.org/r/894038 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:47:06] ACKNOWLEDGEMENT - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds John Bond T326848 https://wikitech.wikimedia.org/wiki/Swift [13:47:51] ACKNOWLEDGEMENT - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@search.service,wmf_auto_restart_airflow-webserver@search.service Btullis Host not yet in service: T327970 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:51] ACKNOWLEDGEMENT - Checks that the airflow database for airflow search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow db check did not succeed Btullis Host not yet in service: T327970 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:47:52] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env 
AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed Btullis Host not yet in service: T327970 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:48:18] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin1001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [13:48:18] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [13:49:08] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:10] 10SRE, 10Infrastructure-Foundations: Tweak Kerberos auth logging - https://phabricator.wikimedia.org/T331123 (10MoritzMuehlenhoff) The logrotate config for auth.log is included in the /etc/logrotate.d/rsyslog config shipped in the Debian package. There's no built-in way to override an existing config with anot... [13:51:19] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [13:51:24] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... 
[13:53:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Prod-Kubernetes, 10serviceops: kubernetes202[34] implementation tracking - https://phabricator.wikimedia.org/T313871 (10JMeybohm) 05Open→03Resolved Was was done as part of {T326340} [13:53:59] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10JMeybohm) [13:54:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jgreen) [13:56:32] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10Cmjohnson) We can replace the BBU, let's get the disk replaced first and then create a new ticket for a BBU [13:59:29] (03PS1) 10Jgreen: Add hosts frbast1002,frmon1002,frpig1002, remove frauth1001,frpm1001 [dns] - 10https://gerrit.wikimedia.org/r/894044 (https://phabricator.wikimedia.org/T319460) [14:02:19] (03CR) 10Jgreen: [C: 03+2] Add hosts frbast1002,frmon1002,frpig1002, remove frauth1001,frpm1001 [dns] - 10https://gerrit.wikimedia.org/r/894044 (https://phabricator.wikimedia.org/T319460) (owner: 10Jgreen) [14:02:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [14:02:55] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [14:03:56] ACKNOWLEDGEMENT - Check systemd state on dumpsdata1005 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_xmldumps.service John Bond T331129 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:37] (03CR) 10Btullis: [V: 03+1] "This only compiles for the new aqs2* servers in the changed environment, so the PCC run doesn't show a diff. 
However, looking at the raw o" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [14:08:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10nskaggs) [14:08:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10nskaggs) [14:09:23] !log bking@cumin2002 banning elastic1053-59 from the cluster in preparation for T322082 [14:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:29] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [14:10:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host urldownloader1003.wikimedia.org [14:10:24] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:15:04] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff) [14:16:33] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader1003.wikimedia.org - jmm@cumin2002" [14:16:42] (03PS1) 10Muehlenhoff: Failover urldownloader [dns] - 10https://gerrit.wikimedia.org/r/894047 (https://phabricator.wikimedia.org/T329073) [14:17:40] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10jbond) [14:18:02] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ayounsi) [14:18:28] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - 
https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff) [14:19:02] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff) [14:19:18] (03CR) 10Andrew Bogott: OpenStack: rename 'projectadmin' role to 'member' role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893759 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [14:20:01] (CirrusSearchNodeIndexingNotIncreasing) firing: (2) Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:21:55] (03PS2) 10Btullis: Add forward and reverse entries for aqs.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/894024 (https://phabricator.wikimedia.org/T331115) [14:22:30] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:43] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [14:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:24:51] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... 
[14:25:01] (CirrusSearchNodeIndexingNotIncreasing) resolved: (2) Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:26:57] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: rerack [14:27:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: rerack [14:27:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=86f268fc-ff2d-4948-aa3f-f9d831ed4c29) set by bking@cumin2002 for 1 day, 0:00:00 on 14 host(s) and their... [14:27:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader1003.wikimedia.org - jmm@cumin2002" [14:27:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:27:42] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache urldownloader1003.wikimedia.org on all recursors [14:27:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) urldownloader1003.wikimedia.org on all recursors [14:28:31] (03PS1) 10Btullis: Switch the druid datasource for aqs to use the latest mediwiki_history [puppet] - 10https://gerrit.wikimedia.org/r/894049 [14:30:12] (03CR) 10Dzahn: [C: 03+2] Revert "switch releases.wikimedia.org backends rsync direction" [puppet] - 10https://gerrit.wikimedia.org/r/893778 (https://phabricator.wikimedia.org/T330960) (owner: 10Hashar) [14:30:29] (03PS1) 10Andrew Bogott: Openstack codfw1dev: remove the local 
cinder code from cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/894050 (https://phabricator.wikimedia.org/T324729) [14:31:46] (03CR) 10Andrew Bogott: [C: 03+2] Openstack codfw1dev: remove the local cinder code from cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/894050 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [14:33:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/894026 (owner: 10Jbond) [14:34:48] (03CR) 10Jbond: [C: 03+2] prometheus: ensure we update the mtime if no update required [puppet] - 10https://gerrit.wikimedia.org/r/894026 (owner: 10Jbond) [14:35:42] (03CR) 10Ssingh: [V: 03+1 C: 03+2] prometheus: update pdnsrec job for bullseye upgrade of dnsrec hosts [puppet] - 10https://gerrit.wikimedia.org/r/893836 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:36:05] (03CR) 10Dzahn: [C: 03+2] "revert first, discussion later. needs follow-up though what exactly it means that it lives solely on one backend. the apache site exists o" [dns] - 10https://gerrit.wikimedia.org/r/893779 (https://phabricator.wikimedia.org/T330960) (owner: 10Hashar) [14:36:34] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4043 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 55405 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [14:36:51] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) A basic installation can get kicked off now, there's a few glitches to resolve this (e.g. it prompts for partman questions, so probably something changed in the prese... 
[14:37:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host urldownloader1003.wikimedia.org [14:37:50] (03CR) 10Jbond: puppetmaster - hiera: order site after role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [14:38:24] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4043 is OK: SSL OK - OCSP staple validity for wikipedia.org has 325296 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:39:32] (03PS2) 10Dzahn: Revert "switch releases.wikimedia.org from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/893779 (https://phabricator.wikimedia.org/T330960) (owner: 10Hashar) [14:44:32] (03CR) 10Jbond: puppetmaster - hiera: order site after role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [14:44:56] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10lbowmaker) [14:45:04] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service,rsync-srv-patches-releases-primary.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:08] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-patches-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:29] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10lbowmaker) [14:46:23] (03CR) 10Ilias Sarantopoulos: [C: 03+1] profile::service_proxy::envoy: add support for 
inference [puppet] - 10https://gerrit.wikimedia.org/r/894014 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [14:48:54] (03CR) 10Jbond: [C: 03+1] "lgtm but you may want to collect a +1 from traffic" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [14:55:20] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10elukey) [14:56:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host urldownloader1004.wikimedia.org [14:56:48] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:57:02] (03PS1) 10Herron: grafana: serve grafana/grafana-rw from codfw [puppet] - 10https://gerrit.wikimedia.org/r/894056 (https://phabricator.wikimedia.org/T329073) [14:57:58] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10fgiunchedi) [14:58:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/894024 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [14:58:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader1004.wikimedia.org - jmm@cumin2002" [14:59:05] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1053.eqiad.wmnet'] [15:00:40] (03PS2) 10Herron: grafana: serve grafana/grafana-rw from codfw [puppet] - 10https://gerrit.wikimedia.org/r/894056 (https://phabricator.wikimedia.org/T329073) [15:01:01] 10SRE, 10ops-codfw, 10decommission-hardware: decommission wdqs200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T331074 (10Papaul) a:05Papaul→03Jhancock.wm [15:02:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered
by cookbooks.sre.dns.netbox: Add records for VM urldownloader1004.wikimedia.org - jmm@cumin2002" [15:02:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:02:27] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache urldownloader1004.wikimedia.org on all recursors [15:02:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) urldownloader1004.wikimedia.org on all recursors [15:02:32] RECOVERY - Check systemd state on parse2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:14] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:10] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:06] (03CR) 10Cwhite: grafana: serve grafana/grafana-rw from codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894056 (https://phabricator.wikimedia.org/T329073) (owner: 10Herron) [15:07:59] (03PS3) 10Herron: grafana: serve grafana/grafana-rw from codfw [puppet] - 10https://gerrit.wikimedia.org/r/894056 (https://phabricator.wikimedia.org/T329073) [15:08:28] (03CR) 10Herron: grafana: serve grafana/grafana-rw from codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894056 (https://phabricator.wikimedia.org/T329073) (owner: 10Herron) [15:08:37] (03CR) 10David Caro: [C: 03+2] "yep, merging" [puppet] - 10https://gerrit.wikimedia.org/r/868634 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [15:08:42] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:30] RECOVERY - Check systemd state on 
releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:31] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10herron) [15:11:25] (03PS5) 10David Caro: karma: add metrcsinfra alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) [15:11:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1053.eqiad.wmnet'] [15:12:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host urldownloader1004.wikimedia.org [15:12:36] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1053.eqiad.wmnet'] [15:12:40] (03CR) 10Cwhite: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/894056 (https://phabricator.wikimedia.org/T329073) (owner: 10Herron) [15:12:43] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn) [15:15:37] (03PS1) 10Elukey: kserve: upgrade to kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/894058 (https://phabricator.wikimedia.org/T331114) [15:16:41] (03CR) 10CI reject: [V: 04-1] kserve: upgrade to kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/894058 (https://phabricator.wikimedia.org/T331114) (owner: 10Elukey) [15:17:39] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [15:18:04] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service Hnowlan Service to be removed 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:36] (03PS2) 10Elukey: kserve: upgrade to kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/894058 (https://phabricator.wikimedia.org/T331114) [15:19:32] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:47] ^ looking at that, we had to revert the rsync direction. more often than not rsync and needs reset-failed [15:21:05] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1053.eqiad.wmnet'] [15:21:22] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:16] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10hnowlan) [15:23:30] (03CR) 10David Caro: wmcs: add ldap getent speed alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/813915 (https://phabricator.wikimedia.org/T313444) (owner: 10David Caro) [15:23:59] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1054.eqiad.wmnet'] [15:24:32] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1055.eqiad.wmnet'] [15:24:53] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1054.eqiad.wmnet'] [15:25:08] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1056.eqiad.wmnet'] [15:25:35] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1057.eqiad.wmnet'] 
[15:25:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:26:12] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1054.eqiad.wmnet'] [15:26:13] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/894056 (https://phabricator.wikimedia.org/T329073) (owner: 10Herron) [15:26:31] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1057.eqiad.wmnet'] [15:26:37] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1057.eqiad.wmnet'] [15:27:03] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10Papaul) 05Open→03Resolved I check the interface again today no errors `` Input errors: Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Bucket drops: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0,...
[15:27:06] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1054.eqiad.wmnet'] [15:27:21] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1054.eqiad.wmnet'] [15:27:32] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1057.eqiad.wmnet'] [15:27:41] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1057.eqiad.wmnet'] [15:28:16] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1054.eqiad.wmnet'] [15:28:35] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1057.eqiad.wmnet'] [15:28:40] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:30] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:59] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET endpoints) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:32:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1055.eqiad.wmnet'] [15:32:48] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1056.eqiad.wmnet'] [15:33:41] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware 
upgrade firmware for hosts ['elastic1056.eqiad.wmnet'] [15:33:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1056.eqiad.wmnet'] [15:35:33] (03PS1) 10BBlack: ntp: remove peer xleave option [puppet] - 10https://gerrit.wikimedia.org/r/894062 [15:36:38] (03CR) 10Vgutierrez: [C: 03+1] ntp: remove peer xleave option [puppet] - 10https://gerrit.wikimedia.org/r/894062 (owner: 10BBlack) [15:36:55] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1053.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:36:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) thanks for the update! Please let me know if there is something I can do to help with this (... [15:37:07] (03CR) 10BBlack: [C: 03+2] ntp: remove peer xleave option [puppet] - 10https://gerrit.wikimedia.org/r/894062 (owner: 10BBlack) [15:38:40] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1055.eqiad.wmnet'] [15:38:42] 10SRE-tools, 10Infrastructure-Foundations: firmware-upgrade cookbook fails after sucessful upgrade - https://phabricator.wikimedia.org/T331135 (10ayounsi) [15:38:53] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1056.eqiad.wmnet'] [15:39:20] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1056.eqiad.wmnet'] [15:39:38] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:40] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts 
['elastic1056.eqiad.wmnet'] [15:41:28] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:57] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@8d9af3e]: Deploying latest image_suggestions DAG on platform_eng Airflow instance [15:43:19] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@8d9af3e]: Deploying latest image_suggestions DAG on platform_eng Airflow instance (duration: 00m 21s) [15:43:50] RECOVERY - IPMI Sensor Status on ml-cache1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:44:14] (03PS1) 10Muehlenhoff: libraryupgrader: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/894063 [15:44:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] OpenStack: rename 'projectadmin' role to 'member' role [puppet] - 10https://gerrit.wikimedia.org/r/893759 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [15:45:12] (03CR) 10Tacsipacsi: "Can we get this merged? It’s been waiting with two +1’s for almost ten months." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [15:45:34] (03CR) 10Arturo Borrero Gonzalez: puppetmaster - hiera: order site after role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [15:45:54] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1053.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:46:33] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1055.eqiad.wmnet'] [15:47:02] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1056.eqiad.wmnet'] [15:47:41] 10SRE-tools, 10Infrastructure-Foundations: firmware-upgrade cookbook fails after sucessful upgrade - https://phabricator.wikimedia.org/T331135 (10jbond) this could just be down to idrac being a bit flaky, but ill see if i can recreate [15:48:48] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:18] ^ annoying but harmless.. 
and super common [15:49:32] attempting proper fix regardless [15:50:38] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Cmjohnson) updated network switches [15:50:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [15:51:43] mutante: fyi i noticed on one of the releases server that one of the rsync jobs and puppet are conflicting [15:52:48] jbond: thanks! yea, see my comment right above. it's annoying me and I will try to fix it, but it's [15:53:14] mutante: ahh ok cool i wasn;t sure if it was the same issue or not [15:53:16] it's not breaking anything but sure is confusing [15:53:25] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@ad17aa9]: (no justification provided) [15:53:32] i also noted that group 705 doesn;t exist on releases1002 [15:53:38] also the part that the _name_ of the sync script does not match the command inside it :p [15:53:48] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@ad17aa9]: (no justification provided) (duration: 00m 22s) [15:53:54] so perhaps need some reserved group id but didn;t look much more then that [15:54:26] mutante: for the second bit you can set syslog_identifer on the systemd::timer::job [15:56:45] jbond: I don't think group 705 is needed on either of them. it's group 838 though, deployment-jenkins vs deployment [15:57:12] jbond: btw, thanks for fixing rsync thingie on apt*. That was more trivial than I thought. For some reason I thought you had already flipped that in hiera [15:57:23] and it was doing something else wrong..
but yea:) [15:58:18] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4038 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 50502 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:59:46] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:08] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4038 is OK: SSL OK - OCSP staple validity for wikipedia.org has 320392 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:01:12] no probs it was something i had recently fixed for gitlab so was fresh in my mind [16:01:36] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:15] (03CR) 10Jbond: puppetmaster - hiera: order site after role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [16:03:20] (03Abandoned) 10Jbond: puppetmaster - hiera: order site after role [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [16:06:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/894063 (owner: 10Muehlenhoff) [16:08:56] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:04] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 476, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:14] !log bking@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "update location of elastic1053 - bking@cumin2002 - T322082" [16:09:20] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [16:10:20] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update location of elastic1053 - bking@cumin2002 - T322082" [16:10:46] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:59] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ssingh) [16:18:47] (03PS1) 10AOkoth: vrts: mask/unmask services on non-active host [puppet] - 10https://gerrit.wikimedia.org/r/894086 (https://phabricator.wikimedia.org/T323515) [16:19:54] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:44] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2070.codfw.wmnet with OS bullseye [16:23:18] (03CR) 10Dzahn: "Hi Arnold, what is the background story here. Why do we want everything to be masked? 
Just to be sure nothing can write to the db or is th" [puppet] - 10https://gerrit.wikimedia.org/r/894086 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [16:23:22] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye [16:23:28] (03PS2) 10AOkoth: vrts: mask/unmask services on non-active host [puppet] - 10https://gerrit.wikimedia.org/r/894086 (https://phabricator.wikimedia.org/T323515) [16:26:29] (03PS25) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [16:27:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [16:29:02] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:51] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10MatthewVernon) [16:30:23] (03CR) 10AOkoth: vrts: mask/unmask services on non-active host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894086 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [16:30:52] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:10] (03PS3) 10AOkoth: vrts: mask/unmask services on non-active host [puppet] - 10https://gerrit.wikimedia.org/r/894086 (https://phabricator.wikimedia.org/T323515) [16:33:33] 10SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10Ottomata) Approved. [16:34:48] (03PS1) 10Herron: wip [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/894089 [16:35:17] 10SRE, 10ops-codfw, 10decommission-hardware: decommission wdqs200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T331074 (10Jhancock.wm) a:05Jhancock.wm→03Papaul [16:36:05] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1058.eqiad.wmnet'] [16:36:52] (03PS1) 10Dzahn: releases: ensure rsync timers are removed when switching backends [puppet] - 10https://gerrit.wikimedia.org/r/894090 (https://phabricator.wikimedia.org/T330960) [16:37:20] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1059.eqiad.wmnet'] [16:38:08] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:20] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1060.eqiad.wmnet'] [16:38:32] !log bking@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['elastic1060.eqiad.wmnet'] [16:38:38] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1060.eqiad.wmnet'] [16:39:36] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/894086/39949/" [puppet] - 10https://gerrit.wikimedia.org/r/894086 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [16:39:46] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1061.eqiad.wmnet'] [16:41:46] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:07] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1058.eqiad.wmnet'] [16:44:25] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1059.eqiad.wmnet'] [16:45:26] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1060.eqiad.wmnet'] [16:46:51] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1061.eqiad.wmnet'] [16:48:56] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:44] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:32] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1054.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:58:04] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:28] cdanis arnoldokoth nothing to report hope it continues to be quite and enjoy your weekend [16:59:33] (cc Emperor ) [17:00:08] jbond: Thanks. I hope so too. 
🤞 [17:01:30] !log bking@cumin2002 ban elastic1059-1066 T322082 [17:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:35] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [17:01:46] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:42] (03PS1) 10Esanders: Enable VE on more namespaces on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) [17:04:48] (03CR) 10Dzahn: [C: 03+2] "this should do it, though the compiler output might look confusing. will double check: https://puppet-compiler.wmflabs.org/output/894090/3" [puppet] - 10https://gerrit.wikimedia.org/r/894090 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [17:08:43] (03CR) 10Dzahn: [C: 03+2] "yea,fixed. the offending timer is gone on the primary server now:" [puppet] - 10https://gerrit.wikimedia.org/r/894090 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [17:10:06] (03CR) 10Dzahn: [C: 03+2] "so with the loop in place, ALL secondary servers will still pull from the one primary server but timers on the primary server itself are r" [puppet] - 10https://gerrit.wikimedia.org/r/894090 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [17:12:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1054.mgmt.eqiad.wmnet with reboot policy GRACEFUL [17:18:22] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:14] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:46] 10SRE,
10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10colewhite) [17:26:35] ffs, now it fails on the _other_ backend [17:26:44] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10Jhancock.wm) a:03Papaul [17:29:24] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:56] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on releases2002.codfw.wmnet with reason: debugging [17:30:11] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on releases2002.codfw.wmnet with reason: debugging [17:31:14] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2070.codfw.wmnet with reason: host reimage [17:35:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2070.codfw.wmnet with reason: host reimage [17:35:29] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1054 - bking@cumin2002 - T322082" [17:35:36] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [17:37:12] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic1054 - bking@cumin2002 - T322082" [17:37:17] (03PS2) 10Krinkle: mc: Add new $wgWANObjectCache setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889245 
(https://phabricator.wikimedia.org/T329680) [17:37:38] (03CR) 10Krinkle: [C: 03+2] mc: Add new $wgWANObjectCache setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889245 (https://phabricator.wikimedia.org/T329680) (owner: 10Krinkle) [17:38:28] (03Merged) 10jenkins-bot: mc: Add new $wgWANObjectCache setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889245 (https://phabricator.wikimedia.org/T329680) (owner: 10Krinkle) [17:40:49] 10Puppet, 10Infrastructure-Foundations, 10Instrument-ClientError, 10Observability-Logging, 10patch-welcome: Prevent Firefox and Chrome extensions from being able to trigger alerts - https://phabricator.wikimedia.org/T330680 (10Jdlrobson) 05Open→03Resolved a:03Jdlrobson Thanks a bunch @colewhite - t... [17:43:20] (03CR) 10Dzahn: [C: 03+2] "this fixes it on the primary but introduces a new problem on the secondary. because rsync::quickdatacopy itself does a bunch of "if src/de" [puppet] - 10https://gerrit.wikimedia.org/r/894090 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [17:44:01] (03PS1) 10Dzahn: Revert "releases: ensure rsync timers are removed when switching backends" [puppet] - 10https://gerrit.wikimedia.org/r/894072 [17:44:27] (03PS2) 10Dzahn: Revert "releases: ensure rsync timers are removed when switching backends" [puppet] - 10https://gerrit.wikimedia.org/r/894072 (https://phabricator.wikimedia.org/T330960) [17:45:39] PROBLEM - IPMI Sensor Status on mw1435 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:46:54] (03CR) 10CI reject: [V: 04-1] Revert "releases: ensure rsync timers are removed when switching backends" [puppet] - 10https://gerrit.wikimedia.org/r/894072 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [17:47:59] !log krinkle@deploy2002 Synchronized wmf-config/mc.php: Ic55725cc500d99: Prepare 
mc.php for next week train (duration: 07m 39s) [17:48:05] gotta love it when CI hates you for a clean rervert [17:48:36] ah, it's just commit message again [17:48:53] (03PS3) 10Dzahn: Revert "releases: ensure rsync timers are removed when switching backends" [puppet] - 10https://gerrit.wikimedia.org/r/894072 (https://phabricator.wikimedia.org/T330960) [17:48:56] Yeah, it’s just a too wordy revert reason [17:49:26] this rsync stuff is super annoying :) [17:49:29] ack RhinosF1 [17:51:20] (03CR) 10CI reject: [V: 04-1] Revert "releases: ensure rsync timers are removed when switching backends" [puppet] - 10https://gerrit.wikimedia.org/r/894072 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [17:52:06] (03PS1) 10Hashar: wm-checks-api: support the Early Warning bot [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/894099 (https://phabricator.wikimedia.org/T330850) [17:52:20] mutante: you need a like [17:52:22] Line [17:52:28] Between reason and Bug: [17:53:49] (03PS4) 10Dzahn: Revert "releases: ensure rsync timers are removed when switching backends" [puppet] - 10https://gerrit.wikimedia.org/r/894072 (https://phabricator.wikimedia.org/T330960) [17:54:13] * RhinosF1 kicks jerkins and asks it to be nice to mutante [17:54:17] yea:) it's only the 1000th time that is the reason :) [17:54:41] mutante: I think Jenkins is ready to finish for the weekend [17:54:41] the other 1001 times it's the line length in the revert that's the issue :) [17:56:08] sukhe: lol, that was the case in the previous PS, how could you tell :) [17:56:27] mutante: suffered through that :P [17:56:29] :) [17:57:11] mutante: V+2 \o/ [17:57:27] Now you can merge and run away for the weekend [17:58:29] RhinosF1: you are too optimistic. this revert just goes back to the problem before the merge :) [17:58:51] but I think I will solve it by manually deleting a file .. 
for now [17:59:13] (03CR) 10Dzahn: [C: 03+2] Revert "releases: ensure rsync timers are removed when switching backends" [puppet] - 10https://gerrit.wikimedia.org/r/894072 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [17:59:26] Oh [17:59:48] with this change: issue on server in codfw, without this change: issue on server in eqiad :) [17:59:59] because rsync::quickdatacopy needs to be on BOTH [18:00:12] and then it tries to handle internally whether the timer should be active [18:05:33] RhinosF1: funny! you are actually right after all. because "do this change and then revert it" is actually a way to fix it. [18:06:39] timer got removed along with needed rsyncd config.. then rsyncd config got added back but timer wasn't. so the status now is ok. except it will be an issue again in the future [18:06:48] 10SRE, 10Traffic, 10Patch-For-Review: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (10BCornwall) 05Open→03Stalled [18:06:59] PROBLEM - IPMI Sensor Status on elastic1088 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:13:18] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:16:28] !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloucephosd - cmjohnson@cumin1001" [18:17:48] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1056.mgmt.eqiad.wmnet with reboot policy GRACEFUL [18:22:37] (03PS1) 10Jelto: gitlab_runner: add optional docker registry proxy to runners [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) [18:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:25:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1056.mgmt.eqiad.wmnet with reboot policy GRACEFUL [18:28:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloucephosd - cmjohnson@cumin1001" [18:28:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:30:42] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) @MatthewVernon on ms-be2070 the OS install did complete with no issues using the partman recipe and server did boot into the OS. However, after... [18:31:52] (03PS1) 10Ssingh: sites.yaml: remove authdns[12]001 [homer/public] - 10https://gerrit.wikimedia.org/r/894102 (https://phabricator.wikimedia.org/T330670) [18:32:42] (03CR) 10Ssingh: "DO NOT MERGE till Monday March 6." 
[homer/public] - 10https://gerrit.wikimedia.org/r/894102 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [18:36:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:40:50] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2070.codfw.wmnet with OS bullseye [18:40:57] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye executed with erro... [18:41:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:42:25] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1056 - bking@cumin2002 - T322082" [18:42:30] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [18:43:31] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic1056 - bking@cumin2002 - T322082" [18:52:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10KFrancis) @JMeybohm Confirming the NDA is on file. Please proceed with the access request. Thanks! 
[18:53:45] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (10KFrancis) @JMeybohm Confirming the NDA has been signed. Please proceed with the access request. Thanks! [18:55:58] 10SRE-tools, 10Infrastructure-Foundations: firmware-upgrade cookbook fails after successful upgrade - https://phabricator.wikimedia.org/T331135 (10Aklapper) [18:56:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10aranyap) [18:56:51] (03CR) 10Ayounsi: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/894102 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [18:59:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2070.codfw.wmnet with OS bullseye [18:59:43] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye [19:01:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10aranyap) >>! In T331067#8663266, @JMeybohm wrote: > Welcome @aranyap! Please add the ssh key type to your request as well (assuming it's a ed25519... [19:02:18] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1055.mgmt.eqiad.wmnet with reboot policy GRACEFUL [19:10:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10Jcross) I am the contract contact person, and the end date can be listed as June 30, 2023. Thanks! 
[19:11:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1055.mgmt.eqiad.wmnet with reboot policy GRACEFUL [19:15:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2070.codfw.wmnet with reason: host reimage [19:18:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2070.codfw.wmnet with reason: host reimage [19:21:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10aranyap) [19:32:32] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082" [19:32:39] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [19:36:04] !log bking@cumin2002 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082" [19:36:28] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082" [19:39:10] !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082" [19:39:16] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [19:39:23] RECOVERY - IPMI Sensor Status on elastic1088 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:40:18] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1057.mgmt.eqiad.wmnet with reboot policy GRACEFUL [19:42:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1035.mgmt.eqiad.wmnet with reboot 
policy FORCED [19:48:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1057.mgmt.eqiad.wmnet with reboot policy GRACEFUL [19:49:57] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic hosts - bking@cumin2002 - T322082" [19:50:03] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [19:51:45] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic hosts - bking@cumin2002 - T322082" [19:52:36] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) on the second run i got ` Booting from Hard drive C: GRUB [19:52:47] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1035.mgmt.eqiad.wmnet with reboot policy FORCED [19:53:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1036.mgmt.eqiad.wmnet with reboot policy FORCED [19:54:28] (03PS1) 10Zabe: beta: Add deployment-db13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894112 (https://phabricator.wikimedia.org/T331019) [19:54:53] (03CR) 10Zabe: [C: 03+2] beta: Add deployment-db13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894112 (https://phabricator.wikimedia.org/T331019) (owner: 10Zabe) [19:56:20] (03Merged) 10jenkins-bot: beta: Add deployment-db13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894112 (https://phabricator.wikimedia.org/T331019) (owner: 10Zabe) [19:56:50] 10SRE, 10Traffic: purged rdkafka crashes: assert: rkq->rkq_refcnt > 0 - https://phabricator.wikimedia.org/T293605 (10BCornwall) 05Open→03Resolved a:03BCornwall I'm not seeing evidence that this is an issue any more. Please re-open if this re-occurs! 
[20:02:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10Jclark-ctr) [20:05:29] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1058.mgmt.eqiad.wmnet with reboot policy GRACEFUL [20:09:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1036.mgmt.eqiad.wmnet with reboot policy FORCED [20:10:00] (03PS1) 10Zabe: beta: Pool deployment-db13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894114 (https://phabricator.wikimedia.org/T331019) [20:11:16] (03CR) 10Zabe: [C: 03+2] beta: Pool deployment-db13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894114 (https://phabricator.wikimedia.org/T331019) (owner: 10Zabe) [20:12:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1037.mgmt.eqiad.wmnet with reboot policy FORCED [20:12:18] (03Merged) 10jenkins-bot: beta: Pool deployment-db13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894114 (https://phabricator.wikimedia.org/T331019) (owner: 10Zabe) [20:13:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1058.mgmt.eqiad.wmnet with reboot policy GRACEFUL [20:14:08] 10SRE, 10DNS, 10Traffic: Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (10BCornwall) 05Stalled→03Resolved a:03BCornwall Looks like this was forgotten to be resolved. 
[20:17:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1037.mgmt.eqiad.wmnet with reboot policy FORCED [20:17:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1038.mgmt.eqiad.wmnet with reboot policy FORCED [20:23:27] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1058 - bking@cumin2002 - T322082" [20:23:34] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [20:23:47] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1038.mgmt.eqiad.wmnet with reboot policy FORCED [20:24:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1039.mgmt.eqiad.wmnet with reboot policy FORCED [20:25:05] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic1058 - bking@cumin2002 - T322082" [20:29:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1039.mgmt.eqiad.wmnet with reboot policy FORCED [20:30:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1040.mgmt.eqiad.wmnet with reboot policy FORCED [20:33:32] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1059.mgmt.eqiad.wmnet with reboot policy GRACEFUL [20:35:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1040.mgmt.eqiad.wmnet with reboot policy FORCED [20:37:55] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4043 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 37324 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [20:39:45] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4043 is OK: SSL OK - OCSP staple validity for wikipedia.org has 300014 seconds left:Certificate 
*.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/HTTPS [20:40:18] hm [20:41:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1059.mgmt.eqiad.wmnet with reboot policy GRACEFUL [20:50:41] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2070.codfw.wmnet with OS bullseye [20:50:49] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye executed with erro... [20:52:56] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1059 - bking@cumin2002 - T322082" [20:53:02] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [20:55:21] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic1059 - bking@cumin2002 - T322082" [20:58:15] !log bking@cumin2002 persistently unban all elastic nodes in eqiad T322082 [20:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:21] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [22:02:59] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4037 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 32221 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [22:04:49] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4037 is OK: SSL OK - OCSP staple validity for wikipedia.org has 294910 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 81 days) 
https://wikitech.wikimedia.org/wiki/HTTPS [22:23:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:28:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:40:51] 10SRE-swift-storage: Bring ms-fe201[3-4] into service - https://phabricator.wikimedia.org/T331178 (10Eevans) [22:41:50] 10SRE-swift-storage: Bring ms-fe201[3-4] into service - https://phabricator.wikimedia.org/T331178 (10Eevans) p:05Triage→03Medium [22:44:12] (03PS1) 10Eevans: swift: add ms-fe201[3-4] as new Swift proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/894122 (https://phabricator.wikimedia.org/T331178) [22:59:47] PROBLEM - Host ms-be2070 is DOWN: PING CRITICAL - Packet loss = 100% [23:07:38] (03CR) 10EoghanGaffney: [C: 03+1] P:phabricator::aphlict: Set deploy_root as git safe.dir [puppet] - 10https://gerrit.wikimedia.org/r/893762 (owner: 10Clément Goubert)