[00:00:27] (PS2) Jforrester: mathoid: Upgrade image from 2023-11-03-103402 to 2024-06-18-233457 [deployment-charts] - https://gerrit.wikimedia.org/r/1047201 (https://phabricator.wikimedia.org/T349118)
[00:00:53] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1047199 (owner: TrainBranchBot)
[00:05:44] (PS2) Scott French: service: move data-gateway service to production [puppet] - https://gerrit.wikimedia.org/r/1032593 (https://phabricator.wikimedia.org/T364921)
[00:05:44] (PS2) Scott French: envoy: add data-gateway service listener [puppet] - https://gerrit.wikimedia.org/r/1032599 (https://phabricator.wikimedia.org/T364921)
[00:37:59] RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 75%, RTA = 30.36 ms
[00:39:39] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[00:51:25] FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:40:55] SRE, Cassandra, Data Products, serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9905781 (Scott_French)
[01:59:31] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 477.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:10:16] (PS2) Scott French: drivers/etcd: only attempt to load existing configs [software/conftool] - https://gerrit.wikimedia.org/r/1047193 (https://phabricator.wikimedia.org/T367919)
[02:34:47] FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5
minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[02:38:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:55:48] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:31] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:01:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:04:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:15:48] FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:36:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[03:41:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[04:01:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:43] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:30:17] SRE, Wikimedia-Mailing-lists: MM3/postorius: takes too long to load - https://phabricator.wikimedia.org/T314247#9905830 (Krd) >>! In T314247#9905328, @Dzahn wrote: > Mailman migrated to a new server and a new version just now. Did this get faster? Nope.
[04:37:31] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 471.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:51:25] FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:57:59] RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 71%, RTA = 30.46 ms
[04:59:03] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[05:09:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'test depool db1169', diff saved to https://phabricator.wikimedia.org/P65168 and previous config saved to /var/cache/conftool/dbconfig/20240619-050951-marostegui.json
[05:10:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'repool db1169', diff saved to https://phabricator.wikimedia.org/P65169 and previous config saved to /var/cache/conftool/dbconfig/20240619-051014-marostegui.json
[05:12:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P65170 and previous config saved to /var/cache/conftool/dbconfig/20240619-051233-root.json
[05:12:45] (CR) Giuseppe Lavagetto: [C:+2] drivers/etcd: only attempt to load existing configs [software/conftool] - https://gerrit.wikimedia.org/r/1047193 (https://phabricator.wikimedia.org/T367919) (owner: Scott French)
[05:12:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65171 and previous config saved to /var/cache/conftool/dbconfig/20240619-051248-root.json
[05:14:31] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication
lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:15:45] (Merged) jenkins-bot: drivers/etcd: only attempt to load existing configs [software/conftool] - https://gerrit.wikimedia.org/r/1047193 (https://phabricator.wikimedia.org/T367919) (owner: Scott French)
[05:16:48] (PS1) Marostegui: Revert^3 "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1047244
[05:17:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P65172 and previous config saved to /var/cache/conftool/dbconfig/20240619-051659-root.json
[05:17:15] (CR) Marostegui: [C:+2] Revert^3 "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1047244 (owner: Marostegui)
[05:18:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65173 and previous config saved to /var/cache/conftool/dbconfig/20240619-051809-root.json
[05:19:17] (PS1) Giuseppe Lavagetto: Release 3.0.1 [software/conftool] - https://gerrit.wikimedia.org/r/1047245
[05:20:12] (PS2) Giuseppe Lavagetto: Release 3.0.1 [software/conftool] - https://gerrit.wikimedia.org/r/1047245 (https://phabricator.wikimedia.org/T367919)
[05:24:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[05:24:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[05:26:33] (CR) Giuseppe Lavagetto: [C:+2] Release 3.0.1 [software/conftool] - https://gerrit.wikimedia.org/r/1047245 (https://phabricator.wikimedia.org/T367919) (owner: Giuseppe Lavagetto)
[05:27:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: Repooling', diff saved to
https://phabricator.wikimedia.org/P65174 and previous config saved to /var/cache/conftool/dbconfig/20240619-052754-root.json
[05:29:26] (Merged) jenkins-bot: Release 3.0.1 [software/conftool] - https://gerrit.wikimedia.org/r/1047245 (https://phabricator.wikimedia.org/T367919) (owner: Giuseppe Lavagetto)
[05:32:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P65175 and previous config saved to /var/cache/conftool/dbconfig/20240619-053205-root.json
[05:33:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65176 and previous config saved to /var/cache/conftool/dbconfig/20240619-053315-root.json
[05:42:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T364069)', diff saved to https://phabricator.wikimedia.org/P65177 and previous config saved to /var/cache/conftool/dbconfig/20240619-054214-marostegui.json
[05:42:20] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:43:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65178 and previous config saved to /var/cache/conftool/dbconfig/20240619-054259-root.json
[05:44:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65179 and previous config saved to /var/cache/conftool/dbconfig/20240619-054443-root.json
[05:47:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P65180 and previous config saved to /var/cache/conftool/dbconfig/20240619-054710-root.json
[05:48:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65181 and
previous config saved to /var/cache/conftool/dbconfig/20240619-054820-root.json
[05:51:50] (PS2) KartikMistry: testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1047014 (https://phabricator.wikimedia.org/T367852)
[05:59:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65182 and previous config saved to /var/cache/conftool/dbconfig/20240619-055948-root.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T0600)
[06:02:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P65183 and previous config saved to /var/cache/conftool/dbconfig/20240619-060216-root.json
[06:03:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65184 and previous config saved to /var/cache/conftool/dbconfig/20240619-060326-root.json
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:56] <_joe_> !log deleting manually thirdparty/conda repositories from reprepro T364550
[06:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:00] T364550: Remove unused thirdparty/conda repository - https://phabricator.wikimedia.org/T364550
[06:08:13] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] -
https://gerrit.wikimedia.org/r/1047014 (https://phabricator.wikimedia.org/T367852) (owner: KartikMistry)
[06:08:18] <_joe_> !log uploaded newer python-conftool packages T367919
[06:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:23] T367919: Avoid error logging while searching configs during normal operation - https://phabricator.wikimedia.org/T367919
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:10:21] (PS1) Slyngshede: Update Debian packaging to work with Tomcat 10. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1047254 (https://phabricator.wikimedia.org/T367487)
[06:14:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65185 and previous config saved to /var/cache/conftool/dbconfig/20240619-061454-root.json
[06:17:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P65186 and previous config saved to /var/cache/conftool/dbconfig/20240619-061721-root.json
[06:18:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65187 and previous config saved to /var/cache/conftool/dbconfig/20240619-061831-root.json
[06:21:10] <_joe_> !log upgrading conftool everywhere T367919
[06:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:15] T367919: Avoid error logging while searching configs during normal operation - https://phabricator.wikimedia.org/T367919
[06:22:19] SRE,
LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905914 (WMDE-leszek) I approve from WMDE's side. Thank you.
[06:27:29] (PS7) Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[06:30:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65188 and previous config saved to /var/cache/conftool/dbconfig/20240619-062959-root.json
[06:33:05] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[06:33:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65189 and previous config saved to /var/cache/conftool/dbconfig/20240619-063337-root.json
[06:34:47] FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[06:38:24] (CR) Dzahn: [C:+1] "has approval now https://phabricator.wikimedia.org/T367184#9905914" [puppet] - https://gerrit.wikimedia.org/r/1047176 (https://phabricator.wikimedia.org/T367184) (owner: Dzahn)
[06:39:42] (CR) Ayounsi: [C:+2] Prepare for netbox-dev [puppet] - https://gerrit.wikimedia.org/r/1047081 (https://phabricator.wikimedia.org/T336275) (owner: Elukey)
[06:40:16] !log merge Puppet "Prepare for netbox-dev" CR1047081
[06:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:28] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven -
https://phabricator.wikimedia.org/T367184#9905920 (Dzahn) Thanks! all looks good to me and is ready for review and merge. just a US holiday here tomorrow, but this will be done soon.
[06:44:10] (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2964/console" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[06:45:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65190 and previous config saved to /var/cache/conftool/dbconfig/20240619-064505-root.json
[06:45:28] (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2966/console" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[06:45:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:47:37] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:51:59] !log stop db1240:s1, wipe and reimport db1240:s3 T367162
[06:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:05] T367162: db1240.s3 index issues - https://phabricator.wikimedia.org/T367162
[06:55:42] (PS8) Slyngshede: P:idp Allow upgrade to Tomcat 10.
[puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[06:59:49] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:00:05] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65191 and previous config saved to /var/cache/conftool/dbconfig/20240619-070010-root.json
[07:00:20] \o
[07:00:56] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047014 (https://phabricator.wikimedia.org/T367852) (owner: KartikMistry)
[07:01:46] (Merged) jenkins-bot: testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1047014 (https://phabricator.wikimedia.org/T367852) (owner: KartikMistry)
[07:02:39] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1047014|testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia (T367852)]]
[07:02:44] T367852: Enable MinT for Wiki Readers MVP on Test Wiki - https://phabricator.wikimedia.org/T367852
[07:07:16] !log kartik@deploy1002 kartik: Backport for [[gerrit:1047014|testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia (T367852)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:08:15] (CR) Muehlenhoff: [C:+1] "LGTM (but need to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047086
first)" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:12:33] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:12:56] !log kartik@deploy1002 kartik: Continuing with sync
[07:13:29] I'll submit followup patch as it seems testwiki won't be useful.
[07:15:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65192 and previous config saved to /var/cache/conftool/dbconfig/20240619-071516-root.json
[07:15:48] FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:17:23] (PS1) KartikMistry: igwiki: Enable MinT for Wikipedia readers [mediawiki-config] - https://gerrit.wikimedia.org/r/1047382 (https://phabricator.wikimedia.org/T363464)
[07:19:59] !log Deploy schema change on old s7 eqiad master db1160 dbmaint T364069
[07:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:04] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[07:20:36] (CR) Arnaudb: [C:+1] cephadm: limit mgr daemons to _admin-labelled hosts [puppet] - https://gerrit.wikimedia.org/r/1047117 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:21:37] (CR) Muehlenhoff: [C:+2] profile::java: Add support for Java 21 [puppet] - https://gerrit.wikimedia.org/r/1047086 (https://phabricator.wikimedia.org/T367487) (owner: Muehlenhoff)
[07:22:33] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.15 seconds
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:22:51] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1047014|testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia (T367852)]] (duration: 20m 12s)
[07:22:56] T367852: Enable MinT for Wiki Readers MVP on Test Wiki - https://phabricator.wikimedia.org/T367852
[07:23:55] (CR) Ayounsi: [C:+2] Netbox 4: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[07:27:30] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047382 (https://phabricator.wikimedia.org/T363464) (owner: KartikMistry)
[07:28:46] (Merged) jenkins-bot: igwiki: Enable MinT for Wikipedia readers [mediawiki-config] - https://gerrit.wikimedia.org/r/1047382 (https://phabricator.wikimedia.org/T363464) (owner: KartikMistry)
[07:28:59] (CR) Muehlenhoff: [C:+1] "Looks good" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1047254 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:29:20] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1047382|igwiki: Enable MinT for Wikipedia readers (T363464)]]
[07:29:24] T363464: Enable MinT for Wikipedia readers MVP on a wiki - https://phabricator.wikimedia.org/T363464
[07:29:50] (CR) Slyngshede: [C:+2] Update Debian packaging to work with Tomcat 10. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1047254 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:29:54] (CR) Slyngshede: [V:+2 C:+2] Update Debian packaging to work with Tomcat 10. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1047254 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:30:48] (PS9) Slyngshede: P:idp Allow upgrade to Tomcat 10.
[puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[07:33:26] "07:32:22 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-06-19-072928-publish (ran as mwdeploy@mw2321.codfw.wmnet) returned [255]: ssh: connect to host mw2321.codfw.wmnet port 22: Connection timed out" -- seems mw2321 down?
[07:33:54] !log kartik@deploy1002 kartik: Backport for [[gerrit:1047382|igwiki: Enable MinT for Wikipedia readers (T363464)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:35:50] kart_: yeah, it's been unreachable for a couple of days now: https://phabricator.wikimedia.org/T367702
[07:36:03] there's some work going on to keep this kind of issue from affecting deployments: https://phabricator.wikimedia.org/T367862
[07:36:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[07:36:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[07:38:44] !log kartik@deploy1002 kartik: Continuing with sync
[07:39:08] jnuche: Thanks!
[07:41:00] (PS1) Muehlenhoff: Add new IRC servers also to the k8s hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702)
[07:42:00] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[07:42:34] (PS11) Marostegui: mariadb: bugfixes mysql_legacy [software/spicerack] - https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: Arnaudb)
[07:44:42] (PS2) Clément Goubert: httpbb: Remove appserver hourly tests [puppet] - https://gerrit.wikimedia.org/r/1047107 (https://phabricator.wikimedia.org/T362323)
[07:46:59] ops-codfw, SRE, DC-Ops, Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9906001 (klausman) >>! In T357415#9905563, @Papaul wrote: > **Information2** > The server has only the the SFT-OOB-LIC license which is the Supermicro Out of band OOB li...
[07:48:16] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1047382|igwiki: Enable MinT for Wikipedia readers (T363464)]] (duration: 18m 55s)
[07:48:20] T363464: Enable MinT for Wikipedia readers MVP on a wiki - https://phabricator.wikimedia.org/T363464
[07:54:14] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host netbox-dev2003.codfw.wmnet
[07:54:15] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[07:54:29] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[07:55:27] RECOVERY - Host ml-cache2001 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms
[07:56:40] (CR) MVernon: [C:+2] cephadm: limit mgr daemons to _admin-labelled hosts [puppet] - https://gerrit.wikimedia.org/r/1047117 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:57:25] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netbox-dev2003.codfw.wmnet - ayounsi@cumin1002"
[07:57:41] SRE, SRE-Unowned, Wikimedia-IRC-RC-Server, Patch-For-Review: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702#9906036 (MoritzMuehlenhoff) >>! In T331702#9902698, @MoritzMuehlenhoff wrote: > Bullseye-based servers are up and running, one can connect to irc1002.wiki...
[07:58:03] (CR) Giuseppe Lavagetto: [C:+1] "LGTM, but please add this also to mw-debug" [deployment-charts] - https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702) (owner: Muehlenhoff)
[07:58:47] RESOLVED: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:59:15] (CR) MVernon: [C:+2] Move moss-fe{1,2}001 back to apus cluster [puppet] - https://gerrit.wikimedia.org/r/1047033 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:59:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netbox-dev2003.codfw.wmnet - ayounsi@cumin1002"
[07:59:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:59:20] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache netbox-dev2003.codfw.wmnet on all recursors
[07:59:24] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox-dev2003.codfw.wmnet on all recursors
[07:59:50] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netbox-dev2003.codfw.wmnet - ayounsi@cumin1002"
[08:00:04] jnuche and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T0800).
[08:00:19] hi there, will deploy the train in the next few minutes
[08:00:27] (PS12) Arnaudb: mariadb: bugfixes mysql_legacy [software/spicerack] - https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496)
[08:00:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netbox-dev2003.codfw.wmnet - ayounsi@cumin1002"
[08:01:35] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host netbox-dev2003.codfw.wmnet with OS bookworm
[08:02:10] (PS1) Fabfur: hiera: upgrade haproxy to 2.8 on eqsin [puppet] - https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756)
[08:02:59] (PS1) TrainBranchBot: group1 wikis to 1.43.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1047437 (https://phabricator.wikimedia.org/T361404)
[08:03:01] (CR) TrainBranchBot: [C:+2] group1 wikis to 1.43.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1047437 (https://phabricator.wikimedia.org/T361404) (owner: TrainBranchBot)
[08:03:20] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[08:03:41] (Merged) jenkins-bot: group1 wikis to 1.43.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1047437 (https://phabricator.wikimedia.org/T361404) (owner: TrainBranchBot)
[08:04:43] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 302.10 ms
[08:09:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[08:09:48] (PS3) Arturo Borrero Gonzalez: toolforge: haproxy: check the k8s api-server /healthz endpoint [puppet] -
10https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389) [08:09:58] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389) (owner: 10Arturo Borrero Gonzalez) [08:11:07] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [08:11:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe1001.eqiad.wmnet with OS bookworm [08:12:07] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906053 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm [08:12:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bookworm [08:13:03] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm [08:13:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [08:15:20] (03PS2) 10Muehlenhoff: Add new IRC servers also to the k8s hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702) [08:15:43] (03CR) 10Muehlenhoff: "Ack, updated the patch." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [08:16:07] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for DE test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1047060 (owner: 10Muehlenhoff) [08:16:19] RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 90%, RTA = 392.21 ms [08:17:18] Emperor: I'll merge your patch along, ok? "Move moss-fe{1,2}001 back to apus cluster" [08:17:30] (03PS1) 10Clément Goubert: mediawiki: Remove bare-metal cluster alerts [alerts] - 10https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) [08:17:37] (03CR) 10CI reject: [V:04-1] mediawiki: Remove bare-metal cluster alerts [alerts] - 10https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [08:17:58] (03PS6) 10JMeybohm: admin_ng: Add toggles for PSP to PSS migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507) [08:18:04] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.10 refs T361404 [08:18:09] T361404: 1.43.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T361404 [08:18:31] moritzm: please do [08:19:17] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1046596 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [08:20:20] ack, merged [08:20:28] (03CR) 10Muehlenhoff: [C:03+2] Drop ldap-admins access group from mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/1046596 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [08:21:05] (03PS3) 10Clément Goubert: mediawiki: Remove bare-metal cluster alerts [alerts] - 10https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) [08:22:43] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [08:23:37] !log 
mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe1001.eqiad.wmnet with OS bookworm [08:23:51] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm executed with errors: - moss-fe1001 (... [08:23:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe1001.eqiad.wmnet with OS bookworm [08:24:06] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [08:24:15] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm [08:25:11] (03CR) 10Arnaudb: [C:03+2] mariadb: prometheus config tweak for db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [08:25:45] (03PS10) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. 
[puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) [08:26:14] (03CR) 10Vgutierrez: [C:04-1] "you're missing hieradata/hosts/cp5030.yaml && hieradata/hosts/cp5032.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [08:28:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 225, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:29:48] (03PS1) 10Jon Harald Søby: Add new protection level (user) for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) [08:30:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) (owner: 10Jon Harald Søby) [08:30:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [08:30:45] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2001.codfw.wmnet with OS bookworm [08:30:54] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906132 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm executed with errors: - moss-fe2001 (... 
[08:31:07] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bookworm [08:31:17] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm [08:34:13] (03CR) 10Ayounsi: [V:03+2 C:03+2] Netbox deploy for 4.0.3 [software/netbox-deploy] (dev) - 10https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:34:28] (03PS2) 10Fabfur: hiera: upgrade haproxy to 2.8 on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) [08:34:39] (03CR) 10Fabfur: "Good catch, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [08:35:19] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [08:35:24] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 15830 [08:35:29] (03PS1) 10Fabfur: benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718) [08:35:55] (03PS1) 10Muehlenhoff: Default to use acmechief1002 [puppet] - 10https://gerrit.wikimedia.org/r/1047443 (https://phabricator.wikimedia.org/T365799) [08:36:29] 06SRE, 10Acme-chief, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#9906174 (10MoritzMuehlenhoff) [08:36:57] (03PS6) 10Clément Goubert: mediawiki: Remove bare-metal cluster alerts [alerts] - 10https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) [08:38:07] !log 
mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage [08:38:30] (03CR) 10Vgutierrez: [C:03+1] hiera: upgrade haproxy to 2.8 on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [08:39:18] (03PS2) 10Fabfur: benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718) [08:39:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15830 [08:40:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage [08:42:37] (03PS11) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) [08:42:40] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 (10Clement_Goubert) 03NEW [08:42:51] (03PS1) 10Muehlenhoff: No longer refer to setting the acmechief hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1047444 (https://phabricator.wikimedia.org/T365799) [08:43:17] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2972/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [08:44:03] (03CR) 10Alexandros Kosiaris: [C:03+1] Add new IRC servers also to the k8s hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [08:44:32] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1047041 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol) [08:44:34] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2973/console" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [08:44:56] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:45:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:46:31] (03CR) 10Brouberol: [C:03+2] ATS: replace service by discovery record for all DSE services [puppet] - 10https://gerrit.wikimedia.org/r/1047041 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol) [08:46:33] (03CR) 10Muehlenhoff: [C:03+2] mailman: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1042999 (owner: 10Muehlenhoff) [08:46:44] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52197 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:46:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.219 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:47:04] brouberol: I'll merge your patch along, ok? [08:47:13] yes please! 
[08:48:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [08:48:42] RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 90%, RTA = 384.18 ms [08:50:28] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [08:51:20] !incidents [08:51:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [08:51:20] 4758 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [08:51:25] FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:51:30] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to 2.8 on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [08:51:53] !incidents [08:51:53] 4758 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [08:52:14] !log upgrading eqsin cp hosts to haproxy 2.8.10 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047436) (T367756) [08:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:19] Amir1: sorry, that was the ack expiring on db1165 [08:52:19] T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [08:52:27] aaah [08:52:31] that explains it [08:54:50] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5017.*} and A:cp [08:55:05] PROBLEM - Host mr1-eqsin.oob is DOWN: PING 
CRITICAL - Packet loss = 100% [08:56:51] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2974/console" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [08:57:40] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5017.*} and A:cp [08:58:49] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5025.*} and A:cp [08:59:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe1001.eqiad.wmnet with OS bookworm [09:00:06] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906267 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm completed: - moss-fe1001 (**PASS**)... [09:00:14] (03PS1) 10Ilias Sarantopoulos: ml-services: switch to llama3-8B-instruct [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047448 (https://phabricator.wikimedia.org/T354870) [09:00:37] (03PS1) 10Ayounsi: netbox-dev2003: disable validators [puppet] - 10https://gerrit.wikimedia.org/r/1047449 [09:00:53] (03CR) 10Muehlenhoff: "How long will it be unavailable? Is it just a puppet run or are more steps needed? 
If it's break we can also just access some missed conne" [puppet] - 10https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861) (owner: 10Vgutierrez) [09:01:11] (03CR) 10Ayounsi: [C:03+2] netbox-dev2003: disable validators [puppet] - 10https://gerrit.wikimedia.org/r/1047449 (owner: 10Ayounsi) [09:01:12] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5025.*} and A:cp [09:02:05] (03CR) 10Klausman: [C:03+1] ml-services: switch to llama3-8B-instruct [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047448 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [09:03:29] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: switch to llama3-8B-instruct [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047448 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [09:04:23] (03Merged) 10jenkins-bot: ml-services: switch to llama3-8B-instruct [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047448 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [09:05:02] (03PS2) 10Clément Goubert: Start removing legacy bare metal listeners [puppet] - 10https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) [09:05:55] (03PS1) 10Ayounsi: Netbox: add bookworm support for DB module [puppet] - 10https://gerrit.wikimedia.org/r/1047450 [09:06:42] (03CR) 10Vgutierrez: "process looks like this:" [puppet] - 10https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861) (owner: 10Vgutierrez) [09:09:25] (03CR) 10Hnowlan: [C:03+1] mediawiki: Remove bare-metal cluster alerts [alerts] - 10https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [09:09:56] (03CR) 10Ayounsi: [C:03+2] Netbox: add bookworm support for DB module [puppet] - 10https://gerrit.wikimedia.org/r/1047450 (owner: 10Ayounsi) [09:10:29] 06SRE, 
06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9906284 (10eoghan) 05Open→03Resolved The maintenance was completed yesterday and so far the serv... [09:10:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2001.codfw.wmnet with OS bookworm [09:11:00] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [09:11:03] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Remove bare-metal cluster alerts [alerts] - 10https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [09:11:06] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906288 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm completed: - moss-fe2001 (**PASS**)... 
[09:12:14] (03Merged) 10jenkins-bot: mediawiki: Remove bare-metal cluster alerts [alerts] - 10https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [09:13:41] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on netbox-dev2003.codfw.wmnet with reason: host reimage [09:14:52] (03PS1) 10Clément Goubert: kubernetes: Reimage 6 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1047451 (https://phabricator.wikimedia.org/T351074) [09:15:07] !log Depooling mw2400.codfw.wmnet,mw2403.codfw.wmnet,mw2404.codfw.wmnet,mw2405.codfw.wmnet,mw2408.codfw.wmnet,mw2409.codfw.wmnet for reimage - T351074 [09:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:12] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [09:16:10] RESOLVED: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:16:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netbox-dev2003.codfw.wmnet with reason: host reimage [09:20:24] (03Abandoned) 10Clément Goubert: Rename jobrunners to videoscalers [alerts] - 10https://gerrit.wikimedia.org/r/1019852 (owner: 10Alexandros Kosiaris) [09:21:03] RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 77%, RTA = 325.12 ms [09:21:06] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [09:22:23] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [09:22:57] (03CR) 10Muehlenhoff: P:idp Allow upgrade to Tomcat 10. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [09:24:47] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw wikikube worker nodes - https://phabricator.wikimedia.org/T367286#9906315 (10Clement_Goubert) 05Open→03Declined [09:25:34] (03PS1) 10Brouberol: karapace: disable the systemd service to see if errors surface [puppet] - 10https://gerrit.wikimedia.org/r/1047452 (https://phabricator.wikimedia.org/T363461) [09:27:17] (03PS1) 10Ladsgroup: Remove pagelinks override in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047453 (https://phabricator.wikimedia.org/T367940) [09:27:27] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [09:28:25] (03CR) 10Ladsgroup: [C:03+2] Remove pagelinks override in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047453 (https://phabricator.wikimedia.org/T367940) (owner: 10Ladsgroup) [09:29:04] (03Merged) 10jenkins-bot: Remove pagelinks override in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047453 (https://phabricator.wikimedia.org/T367940) (owner: 10Ladsgroup) [09:32:29] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [09:34:59] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:36:11] (03CR) 10Giuseppe Lavagetto: [C:03+1] kubernetes: Reimage 6 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1047451 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [09:37:51] RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 90%, RTA = 362.96 ms [09:38:16] (03CR) 10Clément Goubert: [C:03+2] kubernetes: Reimage 6 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1047451 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [09:40:10] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-eqsin [09:40:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2400 to wikikube-worker2011 [09:40:31] (03PS12) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) [09:40:34] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [09:40:58] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2975/console" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [09:43:16] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2400 to wikikube-worker2011 - cgoubert@cumin1002" [09:44:19] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2976/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [09:44:41] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [09:44:50] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1047465 [09:45:20] (03CR) 10Majavah: [C:03+1] toolforge: haproxy: check the k8s api-server /healthz endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389) (owner: 10Arturo Borrero Gonzalez) [09:46:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2400 to wikikube-worker2011 - cgoubert@cumin1002" [09:46:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:46:07] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2011 [09:46:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2011 [09:46:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2400 to wikikube-worker2011 [09:47:07] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2403 to wikikube-worker2012 [09:47:24] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [09:47:36] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "netbox-dev2003 - ayounsi@cumin1002" [09:49:27] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [09:51:07] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "netbox-dev2003 - ayounsi@cumin1002" [09:51:23] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2403 to wikikube-worker2012 - cgoubert@cumin1002" [09:51:53] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: haproxy: check the k8s api-server /healthz endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389) (owner: 10Arturo Borrero Gonzalez) 
[09:53:07] FIRING: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:53:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2403 to wikikube-worker2012 - cgoubert@cumin1002" [09:53:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:53:49] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2012 [09:54:46] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [09:55:32] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [09:58:07] RESOLVED: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:58:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2012 [09:58:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2403 to wikikube-worker2012 [09:59:24] (03CR) 10Kamila Součková: [C:03+2] admin: add Audrey Penven to ldap_only (wmde/nda) [puppet] - 10https://gerrit.wikimedia.org/r/1047176 (https://phabricator.wikimedia.org/T367184) (owner: 10Dzahn) [09:59:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2404 to wikikube-worker2013 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1000) [10:00:09] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:00:30] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [10:01:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T364069)', diff saved to https://phabricator.wikimedia.org/P65194 and previous config saved to /var/cache/conftool/dbconfig/20240619-100118-marostegui.json [10:01:23] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:03:23] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2404 to wikikube-worker2013 - cgoubert@cumin1002" [10:04:53] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 239.93 ms [10:05:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2404 to wikikube-worker2013 - cgoubert@cumin1002" [10:05:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:05:34] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2013 [10:05:39] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [10:05:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2013 [10:06:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2404 to wikikube-worker2013 [10:06:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2405 to wikikube-worker2014 [10:06:49] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:09:29] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renaming mw2405 to wikikube-worker2014 - cgoubert@cumin1002" [10:12:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2405 to wikikube-worker2014 - cgoubert@cumin1002" [10:12:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:12:32] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2014 [10:12:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2014 [10:12:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2405 to wikikube-worker2014 [10:14:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2408 to wikikube-worker2017 [10:14:29] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:16:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P65195 and previous config saved to /var/cache/conftool/dbconfig/20240619-101625-marostegui.json [10:16:48] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2408 to wikikube-worker2017 - cgoubert@cumin1002" [10:17:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2408 to wikikube-worker2017 - cgoubert@cumin1002" [10:17:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:17:55] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2017 [10:18:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2017 [10:18:18] 
!log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2408 to wikikube-worker2017 [10:18:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2409 to wikikube-worker2018 [10:18:44] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:21:03] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2409 to wikikube-worker2018 - cgoubert@cumin1002" [10:22:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2409 to wikikube-worker2018 - cgoubert@cumin1002" [10:22:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:22:21] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2018 [10:23:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2018 [10:23:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2409 to wikikube-worker2018 [10:23:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2011.codfw.wmnet with OS bullseye [10:24:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2012.codfw.wmnet with OS bullseye [10:24:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2013.codfw.wmnet with OS bullseye [10:24:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [10:24:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2014.codfw.wmnet with OS bullseye [10:24:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: 
Maintenance [10:25:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T367856)', diff saved to https://phabricator.wikimedia.org/P65196 and previous config saved to /var/cache/conftool/dbconfig/20240619-102504-marostegui.json [10:25:09] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:25:26] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2017.codfw.wmnet with OS bullseye [10:25:40] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2018.codfw.wmnet with OS bullseye [10:29:59] (03PS1) 10Jelto: gitlab: add custom nginx config to block manual Trusted Runners edits [puppet] - 10https://gerrit.wikimedia.org/r/1047470 (https://phabricator.wikimedia.org/T366786) [10:31:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65197 and previous config saved to /var/cache/conftool/dbconfig/20240619-103109-root.json [10:31:24] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9906515 (10kamila) 05In progress→03Resolved a:03kamila Done, though it's my first time doing clinic duty, so let me know if it doesn't work :D [10:32:19] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T367736#9906530 (10Clement_Goubert) [10:32:35] (03PS1) 10Kamila Součková: Extend access for AndyRussG [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) [10:32:38] (03PS1) 10Ladsgroup: mariadb: Remove direct grants on mailman databases [puppet] - 10https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833) [10:32:44] (03CR) 10CI reject: [V:04-1] gitlab: add custom nginx config to block manual Trusted Runners edits [puppet] - 10https://gerrit.wikimedia.org/r/1047470 
(https://phabricator.wikimedia.org/T366786) (owner: 10Jelto) [10:33:08] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2977/console" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [10:33:56] (03CR) 10Fabfur: "I think we could retry to apply this in ulsfo now that HAProxy is at version 2.8.10" [puppet] - 10https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [10:34:01] (03PS2) 10Ladsgroup: mariadb: Remove direct grants on mailman databases [puppet] - 10https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833) [10:34:09] (03PS2) 10Jelto: gitlab: add custom nginx config to block manual Trusted Runners edits [puppet] - 10https://gerrit.wikimedia.org/r/1047470 (https://phabricator.wikimedia.org/T366786) [10:34:47] FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput [10:35:50] (03CR) 10Muehlenhoff: [C:03+2] Add new IRC servers also to the k8s hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [10:36:33] !log jmm@deploy1002 Started scap: (no justification provided) [10:37:33] (03CR) 10Vgutierrez: [C:03+1] benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [10:37:43] (03CR) 10Marostegui: "let's double check if they exist in the db, and if they do, let's kill them" [puppet] - 10https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833) (owner: 10Ladsgroup) [10:37:47] (03CR) 10Marostegui: 
[C:03+1] mariadb: Remove direct grants on mailman databases [puppet] - 10https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833) (owner: 10Ladsgroup) [10:38:35] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2978/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047470 (https://phabricator.wikimedia.org/T366786) (owner: 10Jelto) [10:39:22] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2011.codfw.wmnet with reason: host reimage [10:39:33] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2012.codfw.wmnet with reason: host reimage [10:40:06] !log jmm@deploy1002 Finished scap: (no justification provided) (duration: 04m 03s) [10:40:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage [10:40:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage [10:40:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2018.codfw.wmnet with reason: host reimage [10:40:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage [10:41:29] (03CR) 10Ladsgroup: "I will!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833) (owner: 10Ladsgroup) [10:41:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2011.codfw.wmnet with reason: host reimage [10:43:36] (03CR) 10Muehlenhoff: Extend access for AndyRussG (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [10:43:39] (03CR) 10Hnowlan: [C:03+1] httpbb: Remove appserver hourly tests [puppet] - 10https://gerrit.wikimedia.org/r/1047107 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [10:43:45] (03CR) 10Ladsgroup: [C:04-1] "Sounds good to me, let me double check where this list is exactly used." [puppet] - 10https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190) (owner: 10Ladsgroup) [10:44:30] (03CR) 10Hnowlan: [C:03+1] envoy: add data-gateway service listener [puppet] - 10https://gerrit.wikimedia.org/r/1032599 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [10:44:32] RESOLVED: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput [10:44:33] (03CR) 10Clément Goubert: [C:03+2] httpbb: Remove appserver hourly tests [puppet] - 10https://gerrit.wikimedia.org/r/1047107 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [10:44:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2018.codfw.wmnet with reason: host reimage [10:45:17] (03CR) 10Hnowlan: service: move data-gateway service to production [puppet] - 10https://gerrit.wikimedia.org/r/1032593 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [10:45:39] (03CR) 
10Muehlenhoff: "Looks good, but lists1001 needs to be set to role::insetup::buster first" [puppet] - 10https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706) (owner: 10Dzahn) [10:45:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:46:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65198 and previous config saved to /var/cache/conftool/dbconfig/20240619-104614-root.json [10:47:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage [10:49:32] (03PS2) 10Ladsgroup: prometheus: Change footer icon ping url [puppet] - 10https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190) [10:51:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage [10:51:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [10:52:28] (03PS2) 10Hnowlan: admin_ng: bump limits for shellbox-video [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) [10:52:39] (03CR) 10Hnowlan: "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [10:55:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage [10:55:41] (03CR) 10Clément Goubert: [C:03+1] admin_ng: bump limits for shellbox-video [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [10:58:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2012.codfw.wmnet with reason: host reimage [10:58:54] (03CR) 10Hnowlan: [C:03+2] admin_ng: bump limits for shellbox-video [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [11:00:05] mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1100). [11:00:48] (03CR) 10Btullis: [C:03+1] "LGTM, thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603) (owner: 10Brouberol) [11:01:13] (03CR) 10Btullis: [C:03+1] wdqs graph-split: add final svcs [dns] - 10https://gerrit.wikimedia.org/r/1042160 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [11:01:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65199 and previous config saved to /var/cache/conftool/dbconfig/20240619-110120-root.json [11:01:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2011.codfw.wmnet with OS bullseye [11:01:55] (03Merged) 10jenkins-bot: admin_ng: bump limits for shellbox-video [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [11:03:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2018.codfw.wmnet with OS bullseye [11:03:27] (03CR) 10Btullis: [C:03+1] "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1047452 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [11:04:24] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [11:04:25] FIRING: SystemdUnitFailed: ferm.service on kubernetes2053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:01] jouncebot: nowandnext [11:05:01] For the next 0 hour(s) and 54 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1100) [11:05:01] In 1 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1300) [11:06:01] !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:07:20] !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:07:55] !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:07:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2017.codfw.wmnet with OS bullseye [11:08:29] !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:09:25] FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes2053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:49] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:11:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2014.codfw.wmnet with OS bullseye [11:12:05] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[11:13:51] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:14:15] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:14:30] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [11:15:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2013.codfw.wmnet with OS bullseye [11:15:55] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [11:16:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65200 and previous config saved to /var/cache/conftool/dbconfig/20240619-111625-root.json [11:17:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2012.codfw.wmnet with OS bullseye [11:18:40] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host netbox-dev2003.codfw.wmnet with OS bookworm [11:18:40] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host netbox-dev2003.codfw.wmnet [11:20:39] PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7379.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:25:39] RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:26:00] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [11:27:29] (03PS1) 10Ayounsi: Add "netbox-dev" to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1047482 (https://phabricator.wikimedia.org/T336275) [11:28:04] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047482 (https://phabricator.wikimedia.org/T336275) 
(owner: 10Ayounsi) [11:31:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65201 and previous config saved to /var/cache/conftool/dbconfig/20240619-113131-root.json [11:34:14] (03PS2) 10Kamila Součková: Extend access for AndyRussG [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) [11:34:25] FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes2053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:34:39] (03CR) 10Kamila Součková: Extend access for AndyRussG (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [11:35:54] (03CR) 10Majavah: "question inline from me who is not at all familiar with the process" [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [11:36:49] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-eqsin [11:39:25] RESOLVED: [3x] SystemdUnitFailed: ferm.service on kubernetes2053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:00] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [11:46:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65203 and previous config saved 
to /var/cache/conftool/dbconfig/20240619-114636-root.json [11:50:26] (03PS1) 10Fabfur: hiera: test downgrading haproxy on cp5017 [puppet] - 10https://gerrit.wikimedia.org/r/1047483 (https://phabricator.wikimedia.org/T367756) [11:50:44] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [11:52:18] (03PS2) 10Fabfur: hiera: test downgrading haproxy on cp5017 [puppet] - 10https://gerrit.wikimedia.org/r/1047483 (https://phabricator.wikimedia.org/T367756) [11:53:13] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047483 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [11:55:18] (03CR) 10Fabfur: [C:03+2] hiera: test downgrading haproxy on cp5017 [puppet] - 10https://gerrit.wikimedia.org/r/1047483 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [11:57:20] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5017.*} and A:cp [11:57:29] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bookworm [11:58:51] (03PS1) 10Majavah: hieradata: Migrate cloudvirt1042 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1047486 (https://phabricator.wikimedia.org/T364457) [11:59:17] (03Abandoned) 10Majavah: hieradata: Move cloudvirt-wdqs* to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1046676 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah) [12:00:23] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5017.*} and A:cp [12:00:50] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [12:01:05] (03PS2) 10Majavah: hieradata: Move cloudvirt1042 to OVS and single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1047486 (https://phabricator.wikimedia.org/T319184) [12:01:14] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on 
idp-test1002.wikimedia.org with reason: CAS 7 upgrade [12:01:28] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on idp-test1002.wikimedia.org with reason: CAS 7 upgrade [12:01:38] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9906713 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d9d9df4b-e647-4f8e-8b55-811d9f86d7d0) set by slyngshede@cumin1002 for 5 days, 0:00:00 on 1 host(s) an... [12:01:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65204 and previous config saved to /var/cache/conftool/dbconfig/20240619-120142-root.json [12:02:03] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet [12:02:05] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:02:20] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idp Allow upgrade to Tomcat 10. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:02:36] (03CR) 10Majavah: [C:03+2] hieradata: Move cloudvirt1042 to OVS and single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1047486 (https://phabricator.wikimedia.org/T319184) (owner: 10Majavah) [12:02:58] slyngs: please merge mine too [12:03:02] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet [12:03:27] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet [12:03:52] taavi: Will do [12:04:46] Done [12:04:55] (03CR) 10Brouberol: [C:03+2] karapace: disable the systemd service to see if errors surface [puppet] - 10https://gerrit.wikimedia.org/r/1047452 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [12:06:21] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet [12:07:02] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [12:07:30] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [12:07:38] (03PS1) 10Muehlenhoff: Point codfw and codfw1dev to use the eqiad LDAP ro servers as well [puppet] - 10https://gerrit.wikimedia.org/r/1047488 (https://phabricator.wikimedia.org/T367861) [12:08:14] !log Will test-replace the PXE chainloader (/srv/tftpboot/lpxelinux.0) on install2003 with a newer version to see if it fixes the ldlinux.c32 error. Puppet will be disabled on that machine for the duration. 
[12:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:49] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [12:11:02] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [12:11:39] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [12:12:00] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [12:12:21] (03PS1) 10Majavah: hieradata: Move cloudvirt1043 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1047489 (https://phabricator.wikimedia.org/T364457) [12:13:09] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm [12:13:36] (03CR) 10Majavah: [C:03+2] hieradata: Move cloudvirt1043 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1047489 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah) [12:14:13] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [12:14:34] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [12:15:07] PROBLEM - karapace http server on karapace1001 is CRITICAL: connect to address 10.64.0.24 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Karapace [12:16:13] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm [12:16:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9906779 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm [12:16:32] (03CR) 10Muehlenhoff: [C:03+2] Stop syncing swift rings on Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1043128 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:19:21] (03PS1) 
10Hnowlan: shellbox-video: drop requests/replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047491 (https://phabricator.wikimedia.org/T357309) [12:19:33] !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:20:00] !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:21:59] !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:22:06] !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:23:37] (03PS1) 10Fabfur: hiera: test upgrading cp5017 to haproxy 2.8.9 [puppet] - 10https://gerrit.wikimedia.org/r/1047492 (https://phabricator.wikimedia.org/T367963) [12:24:32] (03CR) 10Fabfur: [C:03+2] hiera: test upgrading cp5017 to haproxy 2.8.9 [puppet] - 10https://gerrit.wikimedia.org/r/1047492 (https://phabricator.wikimedia.org/T367963) (owner: 10Fabfur) [12:24:50] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet [12:26:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043513 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:31:29] PROBLEM - karapace http server on karapace1002 is CRITICAL: connect to address 10.64.0.5 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Karapace [12:31:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:32:21] (03CR) 10Kamila Součková: [C:03+1] "LGTM except see inline" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047491 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [12:32:59] (03CR) 10Muehlenhoff: [C:03+2] thanos: Limit access to swift ring sync to Puppet 
7 servers [puppet] - 10https://gerrit.wikimedia.org/r/1043513 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:33:58] (03CR) 10Vgutierrez: [C:03+1] Point codfw and codfw1dev to use the eqiad LDAP ro servers as well [puppet] - 10https://gerrit.wikimedia.org/r/1047488 (https://phabricator.wikimedia.org/T367861) (owner: 10Muehlenhoff) [12:34:43] !log Puppet management of install2004 restored, lpxelinux.0 also restored. [12:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:38] !log taavi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1042.eqiad.wmnet with OS bookworm [12:36:22] !log taavi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet'] [12:36:39] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet'] [12:36:43] !log taavi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet'] [12:37:04] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet'] [12:37:21] !log taavi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet'] [12:37:54] 06SRE, 10SRE-Access-Requests, 10wikitech.wikimedia.org: Update "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T367914#9906851 (10kamila) 05Open→03Resolved a:03kamila [12:38:08] !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:38:12] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet [12:38:17] !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:40:56] !log homer 'cr*codfw*' commit 'T351074' [12:41:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:00] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:41:21] (03PS1) 10Muehlenhoff: Move update-netboot-image.sh to the puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/1047495 (https://phabricator.wikimedia.org/T365798) [12:44:53] 06SRE, 06Infrastructure-Foundations: Update pxelinux in tftpboot environment - https://phabricator.wikimedia.org/T367970 (10MoritzMuehlenhoff) 03NEW [12:44:55] 06SRE, 06Infrastructure-Foundations: Update pxelinux in tftpboot environment - https://phabricator.wikimedia.org/T367970#9906873 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:45:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047495 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:45:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [12:45:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[12:45:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:45:51] (03CR) 10Brouberol: [C:03+2] datahub-gms: enable prometheus scraping of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603) (owner: 10Brouberol) [12:51:06] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2011.codfw.wmnet|wikikube-worker2012.codfw.wmnet|wikikube-worker2013.codfw.wmnet|wikikube-worker2014.codfw.wmnet|wikikube-worker2017.codfw.wmnet|wikikube-worker2018.codfw.wmnet),cluster=kubernetes,service=kubesvc [12:52:21] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet'] [12:52:34] there's a local diff under /srv/deployment-charts/helmfile.d/services/sessionstore/values-staging.yaml preventing the git pull of new changes. I'm going to stash it [12:55:30] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: cloudvirt1042, cloudvirt1043 fails to boot after a reimage - https://phabricator.wikimedia.org/T367971#9906903 (10taavi) [12:57:25] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9906906 (10kamila) [12:58:41] (03CR) 10Ssingh: [C:03+2] conftool-data: add ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046675 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:10] o/ [13:00:24] * Lucas_WMDE doesn’t see anything to deploy either [13:02:31] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9906921 (10kamila) @DMburugu can you please confirm that you have read the [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibilities... [13:30:58] (03PS1) 10Klausman: hiera: Fix wrong machine name in ML k8s config [puppet] - 10https://gerrit.wikimedia.org/r/1047508 [13:31:54] !log taavi@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet'] [13:31:58] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet [13:32:14] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=ntp-a [13:32:31] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage [13:32:35] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5017.*} and A:cp [13:33:01] (03PS3) 10Elukey: service:uwsgi: remove scap support [puppet] - 10https://gerrit.wikimedia.org/r/1047503 [13:33:41] (03PS1) 10Muehlenhoff: Remove puppet checkout on pybaltest [puppet] - 10https://gerrit.wikimedia.org/r/1047509 (https://phabricator.wikimedia.org/T365798) [13:33:51] (03CR) 10Ilias Sarantopoulos: [C:03+1] hiera: Fix wrong machine name in ML k8s config [puppet] - 10https://gerrit.wikimedia.org/r/1047508 (owner: 10Klausman) [13:34:02] (03CR) 10Klausman: [C:03+2] hiera: Fix wrong machine name in ML k8s config [puppet] - 10https://gerrit.wikimedia.org/r/1047508 (owner: 10Klausman) [13:35:20] !log fabfur@cumin1002 END (PASS) 
- Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5017.*} and A:cp [13:35:25] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1043.eqiad.wmnet with OS bookworm [13:35:55] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage [13:35:59] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet [13:39:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043516 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:39:45] (03PS3) 10Fabfur: benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718) [13:40:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:11] !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002" [13:41:51] !log taavi@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet'] [13:42:28] !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002" [13:43:57] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bookworm [13:47:41] 06SRE, 06Infrastructure-Foundations: firmware-update: spicerack.redfish.RedfishError: iDRAC is not ready. The configuration values cannot be accessed. Please retry after a few minutes. 
- https://phabricator.wikimedia.org/T367974 (10taavi) 03NEW [13:48:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9907018 (10Papaul) @Clement_Goubert it is a U.S holiday today can we please rescheduled this for tomorrow . Thank you Sorry about that [13:48:48] !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002" [13:48:50] (03PS2) 10Hnowlan: services_proxy: add shellbox-video listener [puppet] - 10https://gerrit.wikimedia.org/r/1047098 (https://phabricator.wikimedia.org/T357309) [13:49:46] !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002" [13:50:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: cloudvirt1042, cloudvirt1043 fails to boot after a reimage - https://phabricator.wikimedia.org/T367971#9907023 (10taavi) As suggested by volans I tried running the firmware-upgrade cookbook on the other cumin server which h... 
[13:50:41] !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002" [13:51:46] !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002" [13:53:26] !log taavi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1042.eqiad.wmnet with OS bookworm [13:53:45] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bookworm [13:54:12] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2003.codfw.wmnet with reason: host reimage [13:55:24] (03PS4) 10Elukey: service:uwsgi: remove scap support [puppet] - 10https://gerrit.wikimedia.org/r/1047503 [13:55:46] (03CR) 10CI reject: [V:04-1] service:uwsgi: remove scap support [puppet] - 10https://gerrit.wikimedia.org/r/1047503 (owner: 10Elukey) [13:56:06] (03CR) 10Hnowlan: [C:03+2] services_proxy: add shellbox-video listener [puppet] - 10https://gerrit.wikimedia.org/r/1047098 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [13:56:54] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1047503 (owner: 10Elukey) [13:57:13] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2003.codfw.wmnet with reason: host reimage [13:57:32] (03PS1) 10VolkerE: Optimize static footer 'a Wikimedia project' icon further [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190) [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1400) [14:00:14] (03CR) 
10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1047503 (owner: 10Elukey) [14:00:29] (03PS5) 10Elukey: service:uwsgi: remove scap support [puppet] - 10https://gerrit.wikimedia.org/r/1047503 [14:01:05] (03CR) 10Elukey: "I explicitly added python3-venv since it was removed by the scap cleanup, if not required I can skip it." [puppet] - 10https://gerrit.wikimedia.org/r/1047503 (owner: 10Elukey) [14:01:17] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1044.eqiad.wmnet with OS bookworm [14:02:40] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1047503 (owner: 10Elukey) [14:03:37] PROBLEM - Host ml-staging2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:02] (03PS1) 10Hnowlan: shellbox-video: set timeout to one day [puppet] - 10https://gerrit.wikimedia.org/r/1047523 (https://phabricator.wikimedia.org/T357309) [14:07:01] !log taavi@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet'] [14:07:26] (03PS1) 10Brouberol: Datahub: bump chart version to let hemfile use newly released subcharts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047524 (https://phabricator.wikimedia.org/T366603) [14:07:27] !log taavi@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet'] [14:07:33] FIRING: KubernetesCalicoDown: ml-staging2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:08:09] (03CR) 10Btullis: [C:03+1] Datahub: bump chart version to let hemfile 
use newly released subcharts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047524 (https://phabricator.wikimedia.org/T366603) (owner: 10Brouberol) [14:08:09] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [14:08:54] !log taavi@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet'] [14:08:55] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage [14:09:03] (03CR) 10Brouberol: [C:03+2] Datahub: bump chart version to let hemfile use newly released subcharts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047524 (https://phabricator.wikimedia.org/T366603) (owner: 10Brouberol) [14:09:38] !log taavi@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet'] [14:09:45] !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - klausman@cumin2002" [14:09:55] !log taavi@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet'] [14:10:12] (03CR) 10Effie Mouzeli: [C:03+1] shellbox-video: set timeout to one day [puppet] - 10https://gerrit.wikimedia.org/r/1047523 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [14:10:44] !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - klausman@cumin2002" [14:10:50] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2003.codfw.wmnet with OS bookworm [14:10:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: 
Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9907108 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm completed: - ml-stag... [14:11:05] !log taavi@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet'] [14:11:53] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage [14:12:07] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [14:12:33] FIRING: [4x] KubernetesCalicoDown: ml-staging-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:12:49] (03CR) 10Hnowlan: [C:03+2] shellbox-video: set timeout to one day [puppet] - 10https://gerrit.wikimedia.org/r/1047523 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [14:13:09] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [14:14:45] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [14:16:21] RECOVERY - Host ml-staging2001 is UP: PING OK - Packet loss = 0%, RTA = 79.12 ms [14:17:00] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [14:17:33] RESOLVED: [4x] KubernetesCalicoDown: ml-staging-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:19:17] !log taavi@cumin1002 START - Cookbook 
sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm [14:19:29] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [14:19:39] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ml-staging2003 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear, flush_l1d} https://wikitech.wikimedia.org/wiki/Microcode [14:20:11] (03PS4) 10Hnowlan: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) [14:20:52] (03CR) 10CI reject: [V:04-1] Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [14:21:05] (03PS1) 10Brouberol: superset-next: upgrade to Superset 4.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047528 (https://phabricator.wikimedia.org/T366060) [14:21:29] (03PS5) 10Hnowlan: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) [14:22:08] (03CR) 10CI reject: [V:04-1] Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [14:23:23] !log installing pymysql security updates [14:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9907185 (10klausman) [14:24:58] !log installing libvpx security updates [14:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:39] (03CR) 10Elukey: [C:03+2] cli: modify get_distro_name to return the version id [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 
(https://phabricator.wikimedia.org/T240193) (owner: 10Elukey) [14:27:18] (03PS6) 10Hnowlan: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) [14:28:03] (03Merged) 10jenkins-bot: cli: modify get_distro_name to return the version id [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: 10Elukey) [14:32:23] (03PS7) 10Hnowlan: Add shellbox-video vars/config, enable on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) [14:34:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: cloudvirt1042, cloudvirt1043 fails to boot after a reimage - https://phabricator.wikimedia.org/T367971#9907209 (10taavi) 05Open→03Resolved a:05Jclark-ctr→03taavi The reimages finished succesfully after a firmwar... [14:34:58] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage [14:35:38] !log installing nano security updates [14:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:56] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1002" [14:37:56] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage [14:38:37] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1002" [14:38:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155,1158].eqiad.wmnet with reason: Long schema change [14:38:39] !log taavi@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host cloudvirt1042.eqiad.wmnet with OS bookworm [14:38:47] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:47] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155,1158].eqiad.wmnet with reason: Long schema change [14:39:34] (03CR) 10Btullis: "Is this still to be reverted, or should we abandon it?" [puppet] - 10https://gerrit.wikimedia.org/r/1037014 (owner: 10Bking) [14:40:08] (03CR) 10Btullis: [C:03+1] superset-next: upgrade to Superset 4.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047528 (https://phabricator.wikimedia.org/T366060) (owner: 10Brouberol) [14:40:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Long schema change [14:40:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Long schema change [14:42:35] !log Deploy schema change on s2 eqiad master dbmaint T364069 [14:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:40] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:45:56] (03CR) 10Fabfur: [C:03+2] benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1047442 
(https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [14:50:50] 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9907245 (10Papaul) Yes i don't think this approach will work for codfw, like @cmooney said: "codfw dc-ops match the switch port... [14:50:55] (03PS1) 10Fabfur: benthos:cache: fixed typo in field name [puppet] - 10https://gerrit.wikimedia.org/r/1047536 (https://phabricator.wikimedia.org/T365718) [14:51:49] (03PS1) 10Hnowlan: shellbox-video: drop timeout slightly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309) [14:52:59] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:54:21] (03CR) 10Fabfur: [C:03+2] benthos:cache: fixed typo in field name [puppet] - 10https://gerrit.wikimedia.org/r/1047536 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [14:55:09] (03PS1) 10Peter Fischer: Search update pipeline: retry 429s at HTTP client level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047538 (https://phabricator.wikimedia.org/T362310) [14:55:54] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: retry 429s at HTTP client level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047538 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [14:56:51] (03Merged) 10jenkins-bot: Search update pipeline: retry 429s at HTTP client level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047538 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [14:57:28] (03CR) 10Kosta Harlan: "I think it's ready to go, just had been waiting for T360070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010953 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan) [14:57:32] (03PS3) 10Dreamy Jazz: 
extension-list: Add IPReputation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010953 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan) [14:58:31] (03CR) 10Ayounsi: [C:03+1] "ohh awesome !" [puppet] - 10https://gerrit.wikimedia.org/r/1047503 (owner: 10Elukey) [14:58:47] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:38] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9907287 (10Clement_Goubert) No worries, I'll extend the downtime, and we'll leave it like that for you to move. [14:59:48] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker2003.codfw.wmnet [14:59:48] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.remove-downtime (exit_code=97) for wikikube-worker2003.codfw.wmnet [15:00:01] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw2282.codfw.wmnet [15:00:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2282.codfw.wmnet [15:00:17] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9907288 (10DMburugu) Approved [15:01:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on mw2282.codfw.wmnet with reason: Host move [15:01:23] !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:01:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on mw2282.codfw.wmnet with reason: Host move [15:01:36] !log pfischer@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [15:01:37] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9907289 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=111d8ee1-db67-4ba6-a57a-50da8c8dc4ff) set by cgoubert@cumin1002 for 2 days, 0:00:0... [15:02:06] (03CR) 10Kosta Harlan: "Also, would be happy for anyone who sees this to merge and sync it. Otherwise I will try to get to that next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010953 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan) [15:02:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9907292 (10DMburugu) @Dzahn Sorry for the tag mix up. @kamila Yes, I have read the [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsi... [15:03:17] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:04:20] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2006.wikimedia.org,service=ntp-c [15:05:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [15:06:00] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[15:06:00] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:06:11] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:07:15] 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9907307 (10fkaelin) I confirm that this request is legit, also adding @XiaoXiao-WMF as manager. As for an approvers list, please add myself and @XiaoXiao-WMF (assuming your access is set... [15:07:32] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1043.eqiad.wmnet with OS bookworm [15:07:37] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS info - pt1979@cumin2002" [15:08:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS info - pt1979@cumin2002" [15:08:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:11:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9907339 (10klausman) [15:11:29] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9907322 (10Dzahn) Additionally added Audrey to the WMF-NDA group in Phabricator (https://phabricator.wikimedia.org/project/members/61/) That's based on T358578 (like for wmf group -> http... 
[15:12:36] (03PS1) 10Ayounsi: Standalone Netbox: set redis' appendfilename [puppet] - 10https://gerrit.wikimedia.org/r/1047542 (https://phabricator.wikimedia.org/T336275) [15:12:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9907350 (10klausman) Machine is imaged and running. The PXE boot was "fixed" by an ugly hack mentioned in T304483#9906962 While the firmware problem remains, at least we a... [15:16:44] !log sudo cumin -b1 -s120 'A:dnsbox' 'run-puppet-agent --enable "merging CR 1046685"': T366360 [15:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:49] T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360 [15:18:35] (03PS1) 10Elukey: debian: update target distribution [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/1047544 (https://phabricator.wikimedia.org/T365253) [15:19:54] (03PS2) 10Ayounsi: Standalone Netbox: set redis' appendfilename [puppet] - 10https://gerrit.wikimedia.org/r/1047542 (https://phabricator.wikimedia.org/T336275) [15:20:50] (03PS1) 10Fabfur: benthos:cache: delete message if sequence number is missing [puppet] - 10https://gerrit.wikimedia.org/r/1047545 (https://phabricator.wikimedia.org/T365718) [15:22:25] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.06.17 - 2024.07.07): Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9907372 (10Jhancock.wm) @Clement_Goubert and I were troubleshooting a similar issue on June 4th for kubernetes2030 kubernetes2033 kubernetes2035 the only thing... [15:22:41] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:22:51] (03CR) 10Ssingh: [V:03+1] "ntp-a being advertised from all sites (currently rolling out). This is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1047073 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [15:23:18] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:23:36] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:23:37] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:24:00] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:24:08] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:25:09] (03CR) 10Fabfur: [C:03+2] benthos:cache: delete message if sequence number is missing [puppet] - 10https://gerrit.wikimedia.org/r/1047545 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [15:25:48] (03CR) 10JMeybohm: [C:03+1] "Good point. Let me nit on naming a bit:" [alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [15:27:04] (03CR) 10JMeybohm: "If you find the time, please double check me on this one." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [15:28:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9907386 (10Jhancock.wm) The server has had the DIMM reseated. 
[15:30:10] (03CR) 10Elukey: [C:03+1] Standalone Netbox: set redis' appendfilename [puppet] - 10https://gerrit.wikimedia.org/r/1047542 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:31:45] RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:32:25] !log taavi@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1042 [15:32:36] (03PS1) 10DCausse: cirrus-streaming-updater: increase elasticsearch-bulk-max-action-size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047546 [15:32:44] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T367648#9907393 (10Jhancock.wm) the iDrac light is rapidly blinking amber. I tried rebooting just the idrac by holding down the button for 45 seconds. it failed. The next troubleshooting step is to reboot and drain the power... 
[15:32:46] !log taavi@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1042 [15:33:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:33:57] (03CR) 10Ayounsi: [C:03+2] Standalone Netbox: set redis' appendfilename [puppet] - 10https://gerrit.wikimedia.org/r/1047542 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:36:40] (03CR) 10Kamila Součková: [C:03+1] "just note that we need to remember to also tell MediaWiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [15:38:01] (03CR) 10Alexandros Kosiaris: [C:03+1] mathoid: Upgrade image from 2023-11-03-103402 to 2024-06-18-233457 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047201 (https://phabricator.wikimedia.org/T349118) (owner: 10Jforrester) [15:40:17] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9907428 (10kamila) @AndyRussG since you're now with WMDE, we'd like to update your email, can you please let me know your @wikimedia.de email address? 
[15:41:31] (03CR) 10Elukey: [V:03+1 C:03+2] service:uwsgi: remove scap support [puppet] - 10https://gerrit.wikimedia.org/r/1047503 (owner: 10Elukey) [15:41:37] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: increase elasticsearch-bulk-max-action-size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047546 (owner: 10DCausse) [15:42:39] (03Merged) 10jenkins-bot: cirrus-streaming-updater: increase elasticsearch-bulk-max-action-size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047546 (owner: 10DCausse) [15:42:48] (03PS1) 10Majavah: hieradata: Move cloudvirt1044 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1047551 (https://phabricator.wikimedia.org/T364457) [15:43:37] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bookworm [15:43:46] (03CR) 10Brouberol: [C:03+2] superset-next: upgrade to Superset 4.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047528 (https://phabricator.wikimedia.org/T366060) (owner: 10Brouberol) [15:44:12] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:44:47] (03PS1) 10Ayounsi: sre.deploy.python-code: add missing f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/1047553 [15:45:02] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [15:45:17] (03CR) 10Volans: [C:03+1] "LGTM, thanks for the fix" [cookbooks] - 10https://gerrit.wikimedia.org/r/1047553 (owner: 10Ayounsi) [15:46:00] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [15:46:31] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:47:13] (03CR) 10Ladsgroup: rpc: Update function call in RunSingleJob (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038785 (https://phabricator.wikimedia.org/T363839) (owner: 10Ladsgroup) [15:49:25] FIRING: SystemdUnitFailed: 
ferm.service on mw2353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:41] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:51:00] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:52:36] (03CR) 10JMeybohm: [C:03+1] debian: update target distribution [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/1047544 (https://phabricator.wikimedia.org/T365253) (owner: 10Elukey) [15:53:35] (03Merged) 10jenkins-bot: sre.deploy.python-code: add missing f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/1047553 (owner: 10Ayounsi) [15:54:25] RESOLVED: SystemdUnitFailed: ferm.service on mw2353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:28] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox-dev to netbox-dev2003.codfw.wmnet with reason: Netbox 4 on netbox-dev2003 - ayounsi@cumin1002 - T336275 [15:55:28] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox-dev to netbox-dev2003.codfw.wmnet with reason: Netbox 4 on netbox-dev2003 - ayounsi@cumin1002 - T336275 [15:55:33] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [15:56:55] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 463, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:58:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702#9907461 (10Jhancock.wm) tried a few things but ultimately had to power cycle the server to get it back up. Lemme know if it looks good. 
[15:58:40] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.06.17 - 2024.07.07): Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9907465 (10Papaul) @Jhancock.wm you can physical power cycle it at anytime [15:59:09] 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9907466 (10Papaul) Power supply will be shipped out today [16:02:59] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox-dev to netbox-dev2003.codfw.wmnet with reason: Netbox 4 on netbox-dev2003 - ayounsi@cumin1002 - T336275 [16:03:04] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [16:07:29] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:07:57] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:09:27] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [16:12:19] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw2321.codfw.wmnet back to active - cgoubert@cumin1002" [16:13:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw2321.codfw.wmnet back to active - cgoubert@cumin1002" [16:13:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:13:42] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2102.codw.wmnet - https://phabricator.wikimedia.org/T366892#9907482 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:15:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox-dev to netbox-dev2003.codfw.wmnet with reason: Netbox 4 on netbox-dev2003 - ayounsi@cumin1002 - T336275 
[16:15:59] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [16:17:48] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9907489 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:19:40] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: service=(ntp-a|ntp-b|ntp-c) [16:27:29] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=mw2321.codfw.wmnet,cluster=kubernetes,service=kubesvc [16:27:54] !log pooling and uncordoning mw2321.codfw.wmnet - T367702 [16:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:58] T367702: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702 [16:29:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702#9907521 (10Clement_Goubert) 05Open→03Resolved Looks back up, putting it back to Active and running homer brought back BGP connectivit... [16:31:39] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:34:05] 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9907538 (10cmooney) 05Open→03Resolved a:03cmooney Cool thanks @papaul. I guess we can see how we get on over the nex... [16:40:32] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907547 (10Papaul) All the netbox part is done waiting. 
[16:41:02] (03CR) 10Ssingh: [V:03+1 C:03+2] durum: switch NTP peers to ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046689 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [16:42:04] !log sudo cumin 'A:durum' 'run-puppet-agent' to switch timesyncd NTP pools to ntp-[abc].anycast.wmnet: T366360 [16:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:09] T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360 [16:47:25] FIRING: SystemdUnitFailed: ferm.service on mw2422:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:25] (03PS2) 10Hnowlan: shellbox-video: drop timeout slightly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309) [16:50:31] (03CR) 10Hnowlan: "Added a note" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [16:50:42] (03CR) 10Hnowlan: [C:03+2] shellbox-video: drop timeout slightly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [16:51:42] (03Merged) 10jenkins-bot: shellbox-video: drop timeout slightly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [16:52:25] FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:54:48] (03PS4) 10Ssingh: P:bird::anycast_monitoring: add monitoring for 10.3.0.[5-7]/32 [puppet] - 10https://gerrit.wikimedia.org/r/1046757 
(https://phabricator.wikimedia.org/T366360) [16:55:35] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2986/co" [puppet] - 10https://gerrit.wikimedia.org/r/1046757 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [16:55:48] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:56:41] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:59:29] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl200[12] to a new rack - kamila@cumin1002" [17:00:02] (03PS1) 10Kamila Součková: Revert^2 "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047560 [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1700) [17:00:27] (03CR) 10CI reject: [V:04-1] Revert^2 "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047560 (owner: 10Kamila Součková) [17:00:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl200[12] to a new rack - kamila@cumin1002" [17:00:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:00:50] (03PS1) 10Kamila Součková: Revert^2 "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047561 [17:01:01] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2001 [17:01:03] !log kamila@cumin1002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl2001 [17:01:27] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2002 [17:01:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl2002 [17:02:25] FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:02:45] (03PS2) 10Kamila Součková: Revert^2 "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047561 [17:04:30] (03PS1) 10Superpes15: [tlywiki] Change the logo and wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047562 (https://phabricator.wikimedia.org/T366431) [17:05:30] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1044.eqiad.wmnet with OS bookworm [17:08:54] 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9907616 (10Papaul) I don't think we have a lot of servers right that have 10G NIC put using the 1G NIC. Most of the servers... [17:13:48] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [17:16:07] * brennen looks around [17:16:54] i'm not here, ignore me, but based on a skim of tasks and logs, i think there are no train things needed for the upcoming window. [17:17:10] somebody ping me if i'm wrong and i'll materialize for a bit. 
[17:17:25] FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:18] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl200[12] to a new rack - kamila@cumin1002" [17:21:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl200[12] to a new rack - kamila@cumin1002" [17:21:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:22:25] FIRING: [5x] SystemdUnitFailed: ferm.service on kubernetes2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:25:15] (03CR) 10Kamila Součková: [C:03+2] Revert^2 "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047561 (owner: 10Kamila Součková) [17:25:43] (03PS2) 10Kamila Součková: Revert^2 "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047560 [17:27:16] (03CR) 10Kamila Součková: [C:03+2] Revert^2 "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047560 (owner: 10Kamila Součková) [17:27:25] FIRING: [5x] SystemdUnitFailed: ferm.service on kubernetes2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:41:17] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - 
kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:43:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367856)', diff saved to https://phabricator.wikimedia.org/P65207 and previous config saved to /var/cache/conftool/dbconfig/20240619-174338-marostegui.json [17:43:44] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [17:45:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:47:25] RESOLVED: SystemdUnitFailed: ferm.service on wikikube-worker2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:51:30] FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:53:47] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:56:30] RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', 
diff saved to https://phabricator.wikimedia.org/P65208 and previous config saved to /var/cache/conftool/dbconfig/20240619-175846-marostegui.json [18:00:05] jnuche and brennen: MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1800). Please do the needful. [18:13:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P65209 and previous config saved to /var/cache/conftool/dbconfig/20240619-181353-marostegui.json [18:21:51] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2001.codfw.wmnet with OS bullseye [18:22:01] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907715 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2001.co... 
[18:29:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367856)', diff saved to https://phabricator.wikimedia.org/P65210 and previous config saved to /var/cache/conftool/dbconfig/20240619-182900-marostegui.json [18:29:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [18:29:05] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [18:29:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [18:29:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T367856)', diff saved to https://phabricator.wikimedia.org/P65211 and previous config saved to /var/cache/conftool/dbconfig/20240619-182922-marostegui.json [18:34:45] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-ctrl2001.codfw.wmnet with OS bullseye [18:35:21] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2001.codfw.wmnet with OS bullseye [18:35:33] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907723 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2001.co... [18:40:16] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2002.codfw.wmnet with OS bullseye [18:40:28] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907725 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.co... 
[18:48:25] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl2002.codfw.wmnet with OS bullseye [18:48:32] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907733 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.... [18:49:13] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2002.codfw.wmnet with OS bullseye [18:49:18] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907734 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.co... [18:51:28] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2001.codfw.wmnet with reason: host reimage [18:54:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2001.codfw.wmnet with reason: host reimage [19:05:12] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: host reimage [19:08:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: host reimage [19:18:59] (03PS1) 10Pppery: Add fallback languages for Phabricator [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1047593 [19:30:20] (03PS2) 10Pppery: Add fallback languages for Phabricator [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1047593 [19:33:41] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T2000). [20:00:05] Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:37] Hi :P [20:02:57] 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9907796 (10MunizaA) [20:11:10] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047600 (https://phabricator.wikimedia.org/T366940) [20:12:34] Can anyone deploy? :O [20:16:18] I can [20:16:56] (03CR) 10Zabe: [C:03+2] [tlywiki] Change the logo and wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047562 (https://phabricator.wikimedia.org/T366431) (owner: 10Superpes15) [20:17:13] Oh thanks :D [20:17:29] (03CR) 10Pppery: "Also upstream at https://we.phorge.it/D25695, since for ancient historical reasons some of the locale files are upstream." 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1047593 (owner: 10Pppery) [20:17:37] (03Merged) 10jenkins-bot: [tlywiki] Change the logo and wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047562 (https://phabricator.wikimedia.org/T366431) (owner: 10Superpes15) [20:19:12] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1047562|[tlywiki] Change the logo and wordmark/tagline (T366431)]] [20:19:17] T366431: Change logo for Talysh Wikipedia - https://phabricator.wikimedia.org/T366431 [20:23:52] !log zabe@deploy1002 superpes, zabe: Backport for [[gerrit:1047562|[tlywiki] Change the logo and wordmark/tagline (T366431)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:23:56] Testing :) [20:24:38] Looks fine thanks :P zabe [20:24:42] !log zabe@deploy1002 superpes, zabe: Continuing with sync [20:31:31] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907853 (10Papaul) @kamila 2001 and 2002 are ready ` papaul@lsw1-b7-codfw> show interfaces descriptions | match wiki* xe-0/0/42... [20:32:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [20:32:31] hi. if any deployers are feeling bored, want to ship a beta-only config change too? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1047122 [20:32:40] (03PS3) 10Bartosz Dziewoński: Revert "Show experimental login popup links on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891) [20:32:55] sure [20:33:25] thanks! 
[20:33:46] (03CR) 10Zabe: [C:03+2] Revert "Show experimental login popup links on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891) (owner: 10Bartosz Dziewoński) [20:33:54] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1047562|[tlywiki] Change the logo and wordmark/tagline (T366431)]] (duration: 14m 41s) [20:33:59] T366431: Change logo for Talysh Wikipedia - https://phabricator.wikimedia.org/T366431 [20:34:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891) (owner: 10Bartosz Dziewoński) [20:34:28] (03Merged) 10jenkins-bot: Revert "Show experimental login popup links on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891) (owner: 10Bartosz Dziewoński) [20:34:59] Many thanks for your assistance :3 [20:36:03] yw:) [20:37:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [20:53:51] 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9907910 (10Papaul) Hello Papaul, Thank you for the information provided. This is the information for the replacement of the faulty PEM with serial number 1F188120554. The RMA ID is # R20051... 
[20:55:25] FIRING: [3x] SystemdUnitFailed: etcd.service on wikikube-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:57:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:57:41] FIRING: [2x] ProbeDown: Service wikikube-ctrl2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:59:44] <_joe_> kamila_: can I assume that's you? a downtime just expired AFAICS [21:00:04] <_joe_> well it's 11 pm, someone in a better TZ will respond I hope [21:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T2100) [21:00:46] _joe_: uh, yeah, exactly [21:00:50] I'm sorry [21:00:59] forgot that they expire... [21:04:30] I am not near a computer for another 45 mins or so but I ACKed it for now [21:04:49] will downtime then unless someone does it before [21:06:56] I have put in an alertmanager silence [21:07:00] I am trying out the `ACK!` thing [21:08:33] !log oblivian@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wikikube-ctrl[2001-2002].codfw.wmnet with reason: Reimage --kamila [21:08:47] !log oblivian@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wikikube-ctrl[2001-2002].codfw.wmnet with reason: Reimage --kamila [21:09:11] thanks all. [21:09:18] marking as resolved as well.
[21:55:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:17] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:29:53] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:29:53] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:44:18] (03PS1) 10Volans: Adapt build system to latest images settings [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1047607 [22:44:18] (03PS1) 10Volans: Release v0.6.6 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1047608 [22:44:53] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:44:55] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:45:19] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:05:12] !log zabe@mwmaint1002:~$ mwscript createAndPromote.php u4cwiki Superpes15 REDACTED --bureaucrat --sysop [23:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:24] !log zabe@mwmaint1002:~$ mwscript createAndPromote.php arbcom_itwiki Superpes15 REDACTED --bureaucrat --sysop [23:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:27] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status 
[23:31:53] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:31:55] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:33:41] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1047610 [23:38:32] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1047610 (owner: 10TrainBranchBot) [23:39:59] (03CR) 10Krinkle: [C:03+1] rpc: Update function call in RunSingleJob (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038785 (https://phabricator.wikimedia.org/T363839) (owner: 10Ladsgroup) [23:50:43] 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9908106 (10Zabe) >>! In T367839#9902539, @MatthewVernon wrote: > Have the swift containers been generated for these wikis? I can't find any ob...