[00:00:27] <wikibugs>	 (PS2) Jforrester: mathoid: Upgrade image from 2023-11-03-103402 to 2024-06-18-233457 [deployment-charts] - https://gerrit.wikimedia.org/r/1047201 (https://phabricator.wikimedia.org/T349118)
[00:00:53] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1047199 (owner: TrainBranchBot)
[00:05:44] <wikibugs>	 (PS2) Scott French: service: move data-gateway service to production [puppet] - https://gerrit.wikimedia.org/r/1032593 (https://phabricator.wikimedia.org/T364921)
[00:05:44] <wikibugs>	 (PS2) Scott French: envoy: add data-gateway service listener [puppet] - https://gerrit.wikimedia.org/r/1032599 (https://phabricator.wikimedia.org/T364921)
[00:37:59] <icinga-wm_>	 RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 75%, RTA = 30.36 ms
[00:39:39] <icinga-wm_>	 PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[00:51:25] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:40:55] <wikibugs>	 SRE, Cassandra, Data Products, serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9905781 (Scott_French)
[01:59:31] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 477.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:10:16] <wikibugs>	 (PS2) Scott French: drivers/etcd: only attempt to load existing configs [software/conftool] - https://gerrit.wikimedia.org/r/1047193 (https://phabricator.wikimedia.org/T367919)
[02:34:47] <jinxer-wm>	 FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[02:38:46] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:48] <jinxer-wm>	 FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:55:48] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:31] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:01:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:04:43] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:15:48] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:36:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[03:41:45] <jinxer-wm>	 FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[04:01:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:43] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:30:17] <wikibugs>	 SRE, Wikimedia-Mailing-lists: MM3/postorius: takes too long to load - https://phabricator.wikimedia.org/T314247#9905830 (Krd) >>! In T314247#9905328, @Dzahn wrote: > Mailman migrated to a new server and a new version just now.  Did this get faster?  Nope.
[04:37:31] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 471.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:51:25] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:57:59] <icinga-wm_>	 RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 71%, RTA = 30.46 ms
[04:59:03] <icinga-wm_>	 PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[05:09:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'test depool db1169', diff saved to https://phabricator.wikimedia.org/P65168 and previous config saved to /var/cache/conftool/dbconfig/20240619-050951-marostegui.json
[05:10:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'repool db1169', diff saved to https://phabricator.wikimedia.org/P65169 and previous config saved to /var/cache/conftool/dbconfig/20240619-051014-marostegui.json
[05:12:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P65170 and previous config saved to /var/cache/conftool/dbconfig/20240619-051233-root.json
[05:12:45] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] drivers/etcd: only attempt to load existing configs [software/conftool] - https://gerrit.wikimedia.org/r/1047193 (https://phabricator.wikimedia.org/T367919) (owner: Scott French)
[05:12:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65171 and previous config saved to /var/cache/conftool/dbconfig/20240619-051248-root.json
[05:14:31] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:15:45] <wikibugs>	 (Merged) jenkins-bot: drivers/etcd: only attempt to load existing configs [software/conftool] - https://gerrit.wikimedia.org/r/1047193 (https://phabricator.wikimedia.org/T367919) (owner: Scott French)
[05:16:48] <wikibugs>	 (PS1) Marostegui: Revert^3 "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1047244
[05:17:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P65172 and previous config saved to /var/cache/conftool/dbconfig/20240619-051659-root.json
[05:17:15] <wikibugs>	 (CR) Marostegui: [C:+2] Revert^3 "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1047244 (owner: Marostegui)
[05:18:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65173 and previous config saved to /var/cache/conftool/dbconfig/20240619-051809-root.json
[05:19:17] <wikibugs>	 (PS1) Giuseppe Lavagetto: Release 3.0.1 [software/conftool] - https://gerrit.wikimedia.org/r/1047245
[05:20:12] <wikibugs>	 (PS2) Giuseppe Lavagetto: Release 3.0.1 [software/conftool] - https://gerrit.wikimedia.org/r/1047245 (https://phabricator.wikimedia.org/T367919)
[05:24:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[05:24:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[05:26:33] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] Release 3.0.1 [software/conftool] - https://gerrit.wikimedia.org/r/1047245 (https://phabricator.wikimedia.org/T367919) (owner: Giuseppe Lavagetto)
[05:27:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65174 and previous config saved to /var/cache/conftool/dbconfig/20240619-052754-root.json
[05:29:26] <wikibugs>	 (Merged) jenkins-bot: Release 3.0.1 [software/conftool] - https://gerrit.wikimedia.org/r/1047245 (https://phabricator.wikimedia.org/T367919) (owner: Giuseppe Lavagetto)
[05:32:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P65175 and previous config saved to /var/cache/conftool/dbconfig/20240619-053205-root.json
[05:33:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65176 and previous config saved to /var/cache/conftool/dbconfig/20240619-053315-root.json
[05:42:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T364069)', diff saved to https://phabricator.wikimedia.org/P65177 and previous config saved to /var/cache/conftool/dbconfig/20240619-054214-marostegui.json
[05:42:20] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:43:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65178 and previous config saved to /var/cache/conftool/dbconfig/20240619-054259-root.json
[05:44:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65179 and previous config saved to /var/cache/conftool/dbconfig/20240619-054443-root.json
[05:47:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P65180 and previous config saved to /var/cache/conftool/dbconfig/20240619-054710-root.json
[05:48:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65181 and previous config saved to /var/cache/conftool/dbconfig/20240619-054820-root.json
[05:51:50] <wikibugs>	 (PS2) KartikMistry: testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1047014 (https://phabricator.wikimedia.org/T367852)
[05:59:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65182 and previous config saved to /var/cache/conftool/dbconfig/20240619-055948-root.json
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T0600)
[06:02:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P65183 and previous config saved to /var/cache/conftool/dbconfig/20240619-060216-root.json
[06:03:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65184 and previous config saved to /var/cache/conftool/dbconfig/20240619-060326-root.json
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:56] <_joe_>	 !log deleting manually thirdparty/conda repositories from reprepro T364550
[06:06:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:00] <stashbot>	 T364550: Remove unused thirdparty/conda repository - https://phabricator.wikimedia.org/T364550
[06:08:13] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047014 (https://phabricator.wikimedia.org/T367852) (owner: KartikMistry)
[06:08:18] <_joe_>	 !log uploaded newer python-conftool packages T367919
[06:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:23] <stashbot>	 T367919: Avoid error logging while searching configs during normal operation - https://phabricator.wikimedia.org/T367919
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:10:21] <wikibugs>	 (PS1) Slyngshede: Update Debian packaging to work with Tomcat 10. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1047254 (https://phabricator.wikimedia.org/T367487)
[06:14:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65185 and previous config saved to /var/cache/conftool/dbconfig/20240619-061454-root.json
[06:17:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P65186 and previous config saved to /var/cache/conftool/dbconfig/20240619-061721-root.json
[06:18:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65187 and previous config saved to /var/cache/conftool/dbconfig/20240619-061831-root.json
[06:21:10] <_joe_>	 !log upgrading conftool everywhere T367919
[06:21:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:15] <stashbot>	 T367919: Avoid error logging while searching configs during normal operation - https://phabricator.wikimedia.org/T367919
[06:22:19] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905914 (WMDE-leszek) I approve from WMDE's side. Thank you.
[06:27:29] <wikibugs>	 (PS7) Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[06:30:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65188 and previous config saved to /var/cache/conftool/dbconfig/20240619-062959-root.json
[06:33:05] <wikibugs>	 (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[06:33:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65189 and previous config saved to /var/cache/conftool/dbconfig/20240619-063337-root.json
[06:34:47] <jinxer-wm>	 FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[06:38:24] <wikibugs>	 (CR) Dzahn: [C:+1] "has approval now https://phabricator.wikimedia.org/T367184#9905914" [puppet] - https://gerrit.wikimedia.org/r/1047176 (https://phabricator.wikimedia.org/T367184) (owner: Dzahn)
[06:39:42] <wikibugs>	 (CR) Ayounsi: [C:+2] Prepare for netbox-dev [puppet] - https://gerrit.wikimedia.org/r/1047081 (https://phabricator.wikimedia.org/T336275) (owner: Elukey)
[06:40:16] <XioNoX>	 !log merge Puppet "Prepare for netbox-dev" CR1047081
[06:40:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:28] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905920 (Dzahn) Thanks! all looks good to me and is ready for review and merge. just a US holiday here tomorrow, but this will be done soon.
[06:44:10] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2964/console" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[06:45:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65190 and previous config saved to /var/cache/conftool/dbconfig/20240619-064505-root.json
[06:45:28] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2966/console" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[06:45:48] <jinxer-wm>	 FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:47:37] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:51:59] <jynus>	 !log stop db1240:s1, wipe and reimport db1240:s3 T367162
[06:52:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:05] <stashbot>	 T367162: db1240.s3 index issues - https://phabricator.wikimedia.org/T367162
[06:55:42] <wikibugs>	 (PS8) Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[06:59:49] <wikibugs>	 (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T0700).
[07:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65191 and previous config saved to /var/cache/conftool/dbconfig/20240619-070010-root.json
[07:00:20] <kart_>	 \o
[07:00:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047014 (https://phabricator.wikimedia.org/T367852) (owner: KartikMistry)
[07:01:46] <wikibugs>	 (Merged) jenkins-bot: testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1047014 (https://phabricator.wikimedia.org/T367852) (owner: KartikMistry)
[07:02:39] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:1047014|testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia (T367852)]]
[07:02:44] <stashbot>	 T367852: Enable MinT for Wiki Readers MVP on Test Wiki - https://phabricator.wikimedia.org/T367852
[07:07:16] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:1047014|testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia (T367852)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:08:15] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM (but need to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047086 first)" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:12:33] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:12:56] <logmsgbot>	 !log kartik@deploy1002 kartik: Continuing with sync
[07:13:29] <kart_>	 I'll submit followup patch as it seems testwiki won't be useful.
[07:15:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65192 and previous config saved to /var/cache/conftool/dbconfig/20240619-071516-root.json
[07:15:48] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:17:23] <wikibugs>	 (PS1) KartikMistry: igwiki: Enable MinT for Wikipedia readers [mediawiki-config] - https://gerrit.wikimedia.org/r/1047382 (https://phabricator.wikimedia.org/T363464)
[07:19:59] <marostegui>	 !log Deploy schema change on old s7 eqiad master db1160 dbmaint T364069
[07:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:04] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[07:20:36] <wikibugs>	 (CR) Arnaudb: [C:+1] cephadm: limit mgr daemons to _admin-labelled hosts [puppet] - https://gerrit.wikimedia.org/r/1047117 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:21:37] <wikibugs>	 (CR) Muehlenhoff: [C:+2] profile::java: Add support for Java 21 [puppet] - https://gerrit.wikimedia.org/r/1047086 (https://phabricator.wikimedia.org/T367487) (owner: Muehlenhoff)
[07:22:33] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:22:51] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1047014|testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia (T367852)]] (duration: 20m 12s)
[07:22:56] <stashbot>	 T367852: Enable MinT for Wiki Readers MVP on Test Wiki - https://phabricator.wikimedia.org/T367852
[07:23:55] <wikibugs>	 (CR) Ayounsi: [C:+2] Netbox 4: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[07:27:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047382 (https://phabricator.wikimedia.org/T363464) (owner: KartikMistry)
[07:28:46] <wikibugs>	 (Merged) jenkins-bot: igwiki: Enable MinT for Wikipedia readers [mediawiki-config] - https://gerrit.wikimedia.org/r/1047382 (https://phabricator.wikimedia.org/T363464) (owner: KartikMistry)
[07:28:59] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1047254 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:29:20] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:1047382|igwiki: Enable MinT for Wikipedia readers (T363464)]]
[07:29:24] <stashbot>	 T363464: Enable MinT for Wikipedia readers MVP on a wiki - https://phabricator.wikimedia.org/T363464
[07:29:50] <wikibugs>	 (CR) Slyngshede: [C:+2] Update Debian packaging to work with Tomcat 10. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1047254 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:29:54] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] Update Debian packaging to work with Tomcat 10. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1047254 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:30:48] <wikibugs>	 (PS9) Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[07:33:26] <kart_>	 "07:32:22 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-06-19-072928-publish (ran as mwdeploy@mw2321.codfw.wmnet) returned [255]: ssh: connect to host mw2321.codfw.wmnet port 22: Connection timed out" -- seems mw2321 down?
[07:33:54] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:1047382|igwiki: Enable MinT for Wikipedia readers (T363464)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:35:50] <jnuche>	 kart_: yeah, it's been unreachable for a couple of days now: https://phabricator.wikimedia.org/T367702
[07:36:03] <jnuche>	 there's some work going on to keep this kind of issue from affecting deployments: https://phabricator.wikimedia.org/T367862
[07:36:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[07:36:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[07:38:44] <logmsgbot>	 !log kartik@deploy1002 kartik: Continuing with sync
[07:39:08] <kart_>	 jnuche: Thanks!
[07:41:00] <wikibugs>	 (PS1) Muehlenhoff: Add new IRC servers also to the k8s hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702)
[07:42:00] <jinxer-wm>	 FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[07:42:34] <wikibugs>	 (PS11) Marostegui: mariadb: bugfixes mysql_legacy [software/spicerack] - https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: Arnaudb)
[07:44:42] <wikibugs>	 (PS2) Clément Goubert: httpbb: Remove appserver hourly tests [puppet] - https://gerrit.wikimedia.org/r/1047107 (https://phabricator.wikimedia.org/T362323)
[07:46:59] <wikibugs>	 ops-codfw, SRE, DC-Ops, Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9906001 (klausman) >>! In T357415#9905563, @Papaul wrote: > **Information2** > The server has only the the SFT-OOB-LIC license which is the Supermicro Out of band OOB li...
[07:48:16] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1047382|igwiki: Enable MinT for Wikipedia readers (T363464)]] (duration: 18m 55s)
[07:48:20] <stashbot>	 T363464: Enable MinT for Wikipedia readers MVP on a wiki - https://phabricator.wikimedia.org/T363464
[07:54:14] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host netbox-dev2003.codfw.wmnet
[07:54:15] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[07:54:29] <icinga-wm_>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[07:55:27] <icinga-wm_>	 RECOVERY - Host ml-cache2001 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms
[07:56:40] <wikibugs>	 (CR) MVernon: [C:+2] cephadm: limit mgr daemons to _admin-labelled hosts [puppet] - https://gerrit.wikimedia.org/r/1047117 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:57:25] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netbox-dev2003.codfw.wmnet - ayounsi@cumin1002"
[07:57:41] <wikibugs>	 SRE, SRE-Unowned, Wikimedia-IRC-RC-Server, Patch-For-Review: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702#9906036 (MoritzMuehlenhoff) >>! In T331702#9902698, @MoritzMuehlenhoff wrote: > Bullseye-based servers are up and running, one can connect to irc1002.wiki...
[07:58:03] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] "LGTM, but please add this also to mw-debug" [deployment-charts] - https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702) (owner: Muehlenhoff)
[07:58:47] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:59:15] <wikibugs>	 (CR) MVernon: [C:+2] Move moss-fe{1,2}001 back to apus cluster [puppet] - https://gerrit.wikimedia.org/r/1047033 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:59:20] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netbox-dev2003.codfw.wmnet - ayounsi@cumin1002"
[07:59:20] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:59:20] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache netbox-dev2003.codfw.wmnet on all recursors
[07:59:24] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox-dev2003.codfw.wmnet on all recursors
[07:59:50] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netbox-dev2003.codfw.wmnet - ayounsi@cumin1002"
[08:00:04] <jouncebot>	 jnuche and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T0800).
[08:00:19] <jnuche>	 hi there, will deploy the train in the next few minutes
[08:00:27] <wikibugs>	 (PS12) Arnaudb: mariadb: bugfixes mysql_legacy [software/spicerack] - https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496)
[08:00:47] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netbox-dev2003.codfw.wmnet - ayounsi@cumin1002"
[08:01:35] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host netbox-dev2003.codfw.wmnet with OS bookworm
[08:02:10] <wikibugs>	 (PS1) Fabfur: hiera: upgrade haproxy to 2.8 on eqsin [puppet] - https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756)
[08:02:59] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.43.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1047437 (https://phabricator.wikimedia.org/T361404)
[08:03:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 wikis to 1.43.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1047437 (https://phabricator.wikimedia.org/T361404) (owner: TrainBranchBot)
[08:03:20] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[08:03:41] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.43.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1047437 (https://phabricator.wikimedia.org/T361404) (owner: TrainBranchBot)
[08:04:43] <icinga-wm_>	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 302.10 ms
[08:09:43] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[08:09:48] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: toolforge: haproxy: check the k8s api-server /healthz endpoint [puppet] - https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389)
[08:09:58] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389) (owner: Arturo Borrero Gonzalez)
[08:11:07] <icinga-wm_>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[08:11:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe1001.eqiad.wmnet with OS bookworm
[08:12:07] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906053 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm
[08:12:53] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bookworm
[08:13:03] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906054 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm
[08:13:46] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[08:15:20] <wikibugs>	 (PS2) Muehlenhoff: Add new IRC servers also to the k8s hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702)
[08:15:43] <wikibugs>	 (CR) Muehlenhoff: "Ack, updated the patch." [deployment-charts] - https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702) (owner: Muehlenhoff)
[08:16:07] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for DE test cluster [puppet] - https://gerrit.wikimedia.org/r/1047060 (owner: Muehlenhoff)
[08:16:19] <icinga-wm_>	 RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 90%, RTA = 392.21 ms
[08:17:18] <moritzm>	 Emperor: I'll merge your patch along, ok? "Move moss-fe{1,2}001 back to apus cluster" 
[08:17:30] <wikibugs>	 (PS1) Clément Goubert: mediawiki: Remove bare-metal cluster alerts [alerts] - https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323)
[08:17:37] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki: Remove bare-metal cluster alerts [alerts] - https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert)
[08:17:58] <wikibugs>	 (PS6) JMeybohm: admin_ng: Add toggles for PSP to PSS migration [deployment-charts] - https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507)
[08:18:04] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.10  refs T361404
[08:18:09] <stashbot>	 T361404: 1.43.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T361404
[08:18:31] <Emperor>	 moritzm: please do
[08:19:17] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1046596 (https://phabricator.wikimedia.org/T367490) (owner: Muehlenhoff)
[08:20:20] <moritzm>	 ack, merged
[08:20:28] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Drop ldap-admins access group from mwmaint hosts [puppet] - https://gerrit.wikimedia.org/r/1046596 (https://phabricator.wikimedia.org/T367490) (owner: Muehlenhoff)
[08:21:05] <wikibugs>	 (PS3) Clément Goubert: mediawiki: Remove bare-metal cluster alerts [alerts] - https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323)
[08:22:43] <icinga-wm_>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:37] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe1001.eqiad.wmnet with OS bookworm
[08:23:51] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906067 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm executed with errors: - moss-fe1001 (...
[08:23:59] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe1001.eqiad.wmnet with OS bookworm
[08:24:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[08:24:15] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906068 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm
[08:25:11] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: prometheus config tweak for db1125 [puppet] - https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: Arnaudb)
[08:25:45] <wikibugs>	 (PS10) Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[08:26:14] <wikibugs>	 (CR) Vgutierrez: [C:-1] "you're missing hieradata/hosts/cp5030.yaml && hieradata/hosts/cp5032.yaml" [puppet] - https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[08:28:51] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 225, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:29:48] <wikibugs>	 (PS1) Jon Harald Søby: Add new protection level (user) for nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943)
[08:30:26] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) (owner: Jon Harald Søby)
[08:30:29] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[08:30:45] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2001.codfw.wmnet with OS bookworm
[08:30:54] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906132 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm executed with errors: - moss-fe2001 (...
[08:31:07] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bookworm
[08:31:17] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906136 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm
[08:34:13] <wikibugs>	 (CR) Ayounsi: [V:+2 C:+2] Netbox deploy for 4.0.3 [software/netbox-deploy] (dev) - https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[08:34:28] <wikibugs>	 (PS2) Fabfur: hiera: upgrade haproxy to 2.8 on eqsin [puppet] - https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756)
[08:34:39] <wikibugs>	 (CR) Fabfur: "Good catch, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[08:35:19] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[08:35:24] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 15830
[08:35:29] <wikibugs>	 (PS1) Fabfur: benthos:cache: switch to rfc5424 format [puppet] - https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718)
[08:35:55] <wikibugs>	 (PS1) Muehlenhoff: Default to use acmechief1002 [puppet] - https://gerrit.wikimedia.org/r/1047443 (https://phabricator.wikimedia.org/T365799)
[08:36:29] <wikibugs>	 SRE, Acme-chief, Infrastructure-Foundations, Puppet-Infrastructure, and 2 others: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#9906174 (MoritzMuehlenhoff)
[08:36:57] <wikibugs>	 (PS6) Clément Goubert: mediawiki: Remove bare-metal cluster alerts [alerts] - https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323)
[08:38:07] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage
[08:38:30] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: upgrade haproxy to 2.8 on eqsin [puppet] - https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[08:39:18] <wikibugs>	 (PS2) Fabfur: benthos:cache: switch to rfc5424 format [puppet] - https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718)
[08:39:46] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15830
[08:40:38] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage
[08:42:37] <wikibugs>	 (PS11) Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[08:42:40] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, Release-Engineering-Team (Seen): Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 (Clement_Goubert) NEW
[08:42:51] <wikibugs>	 (PS1) Muehlenhoff: No longer refer to setting the acmechief hosts [cookbooks] - https://gerrit.wikimedia.org/r/1047444 (https://phabricator.wikimedia.org/T365799)
[08:43:17] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2972/co" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[08:44:03] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] Add new IRC servers also to the k8s hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702) (owner: Muehlenhoff)
[08:44:32] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good, thanks." [puppet] - https://gerrit.wikimedia.org/r/1047041 (https://phabricator.wikimedia.org/T367768) (owner: Brouberol)
[08:44:34] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2973/console" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[08:44:56] <icinga-wm_>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:45:54] <icinga-wm_>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:46:31] <wikibugs>	 (CR) Brouberol: [C:+2] ATS: replace service by discovery record for all DSE services [puppet] - https://gerrit.wikimedia.org/r/1047041 (https://phabricator.wikimedia.org/T367768) (owner: Brouberol)
[08:46:33] <wikibugs>	 (CR) Muehlenhoff: [C:+2] mailman: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1042999 (owner: Muehlenhoff)
[08:46:44] <icinga-wm_>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52197 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:46:46] <icinga-wm_>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.219 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:47:04] <moritzm>	 brouberol: I'll merge your patch along, ok?
[08:47:13] <brouberol>	 yes please!
[08:48:21] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage
[08:48:42] <icinga-wm_>	 RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 90%, RTA = 384.18 ms
[08:50:28] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[08:51:20] <Amir1>	 !incidents
[08:51:20] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage
[08:51:20] <sirenbot>	 4758 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[08:51:25] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:51:30] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: upgrade haproxy to 2.8 on eqsin [puppet] - https://gerrit.wikimedia.org/r/1047436 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[08:51:53] <Amir1>	 !incidents
[08:51:53] <sirenbot>	 4758 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[08:52:14] <fabfur>	 !log upgrading eqsin cp hosts to haproxy 2.8.10 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047436) (T367756)
[08:52:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:19] <hnowlan>	 Amir1: sorry, that was the ack expiring on db1165
[08:52:19] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[08:52:27] <Amir1>	 aaah
[08:52:31] <Amir1>	 that explains it
[08:54:50] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5017.*} and A:cp
[08:55:05] <icinga-wm_>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[08:56:51] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2974/console" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[08:57:40] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5017.*} and A:cp
[08:58:49] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5025.*} and A:cp
[08:59:24] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe1001.eqiad.wmnet with OS bookworm
[09:00:06] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906267 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm completed: - moss-fe1001 (**PASS**)...
[09:00:14] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: switch to llama3-8B-instruct [deployment-charts] - https://gerrit.wikimedia.org/r/1047448 (https://phabricator.wikimedia.org/T354870)
[09:00:37] <wikibugs>	 (PS1) Ayounsi: netbox-dev2003: disable validators [puppet] - https://gerrit.wikimedia.org/r/1047449
[09:00:53] <wikibugs>	 (CR) Muehlenhoff: "How long will it be unavailable? Is it just a puppet run or are more steps needed? If it's break we can also just access some missed conne" [puppet] - https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861) (owner: Vgutierrez)
[09:01:11] <wikibugs>	 (CR) Ayounsi: [C:+2] netbox-dev2003: disable validators [puppet] - https://gerrit.wikimedia.org/r/1047449 (owner: Ayounsi)
[09:01:12] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5025.*} and A:cp
[09:02:05] <wikibugs>	 (CR) Klausman: [C:+1] ml-services: switch to llama3-8B-instruct [deployment-charts] - https://gerrit.wikimedia.org/r/1047448 (https://phabricator.wikimedia.org/T354870) (owner: Ilias Sarantopoulos)
[09:03:29] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] ml-services: switch to llama3-8B-instruct [deployment-charts] - https://gerrit.wikimedia.org/r/1047448 (https://phabricator.wikimedia.org/T354870) (owner: Ilias Sarantopoulos)
[09:04:23] <wikibugs>	 (Merged) jenkins-bot: ml-services: switch to llama3-8B-instruct [deployment-charts] - https://gerrit.wikimedia.org/r/1047448 (https://phabricator.wikimedia.org/T354870) (owner: Ilias Sarantopoulos)
[09:05:02] <wikibugs>	 (PS2) Clément Goubert: Start removing legacy bare metal listeners [puppet] - https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949)
[09:05:55] <wikibugs>	 (PS1) Ayounsi: Netbox: add bookworm support for DB module [puppet] - https://gerrit.wikimedia.org/r/1047450
[09:06:42] <wikibugs>	 (CR) Vgutierrez: "process looks like this:" [puppet] - https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861) (owner: Vgutierrez)
[09:09:25] <wikibugs>	 (CR) Hnowlan: [C:+1] mediawiki: Remove bare-metal cluster alerts [alerts] - https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert)
[09:09:56] <wikibugs>	 (CR) Ayounsi: [C:+2] Netbox: add bookworm support for DB module [puppet] - https://gerrit.wikimedia.org/r/1047450 (owner: Ayounsi)
[09:10:29] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review, User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9906284 (eoghan) Open→Resolved The maintenance was completed yesterday and so far the serv...
[09:10:56] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2001.codfw.wmnet with OS bookworm
[09:11:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[09:11:03] <wikibugs>	 (CR) Clément Goubert: [C:+2] mediawiki: Remove bare-metal cluster alerts [alerts] - https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert)
[09:11:06] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9906288 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm completed: - moss-fe2001 (**PASS**)...
[09:12:14] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Remove bare-metal cluster alerts [alerts] - https://gerrit.wikimedia.org/r/1047439 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert)
[09:13:41] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on netbox-dev2003.codfw.wmnet with reason: host reimage
[09:14:52] <wikibugs>	 (PS1) Clément Goubert: kubernetes: Reimage 6 appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1047451 (https://phabricator.wikimedia.org/T351074)
[09:15:07] <claime>	 !log Depooling mw2400.codfw.wmnet,mw2403.codfw.wmnet,mw2404.codfw.wmnet,mw2405.codfw.wmnet,mw2408.codfw.wmnet,mw2409.codfw.wmnet for reimage - T351074
[09:15:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:12] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[09:16:10] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:16:17] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netbox-dev2003.codfw.wmnet with reason: host reimage
[09:20:24] <wikibugs>	 (Abandoned) Clément Goubert: Rename jobrunners to videoscalers [alerts] - https://gerrit.wikimedia.org/r/1019852 (owner: Alexandros Kosiaris)
[09:21:03] <icinga-wm_>	 RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 77%, RTA = 325.12 ms
[09:21:06] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[09:22:23] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[09:22:57] <wikibugs>	 (CR) Muehlenhoff: P:idp Allow upgrade to Tomcat 10. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[09:24:47] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw wikikube worker nodes - https://phabricator.wikimedia.org/T367286#9906315 (Clement_Goubert) Open→Declined
[09:25:34] <wikibugs>	 (PS1) Brouberol: karapace: disable the systemd service to see if errors surface [puppet] - https://gerrit.wikimedia.org/r/1047452 (https://phabricator.wikimedia.org/T363461)
[09:27:17] <wikibugs>	 (PS1) Ladsgroup: Remove pagelinks override in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1047453 (https://phabricator.wikimedia.org/T367940)
[09:27:27] <icinga-wm_>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:25] <wikibugs>	 (CR) Ladsgroup: [C:+2] Remove pagelinks override in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1047453 (https://phabricator.wikimedia.org/T367940) (owner: Ladsgroup)
[09:29:04] <wikibugs>	 (Merged) jenkins-bot: Remove pagelinks override in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1047453 (https://phabricator.wikimedia.org/T367940) (owner: Ladsgroup)
[09:32:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[09:34:59] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:36:11] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] kubernetes: Reimage 6 appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1047451 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[09:37:51] <icinga-wm_>	 RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 90%, RTA = 362.96 ms
[09:38:16] <wikibugs>	 (CR) Clément Goubert: [C:+2] kubernetes: Reimage 6 appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1047451 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[09:40:10] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-eqsin
[09:40:17] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2400 to wikikube-worker2011
[09:40:31] <wikibugs>	 (PS12) Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[09:40:34] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[09:40:58] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2975/console" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[09:43:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2400 to wikikube-worker2011 - cgoubert@cumin1002"
[09:44:19] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2976/co" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[09:44:41] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[09:44:50] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1047465
[09:45:20] <wikibugs>	 (CR) Majavah: [C:+1] toolforge: haproxy: check the k8s api-server /healthz endpoint [puppet] - https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389) (owner: Arturo Borrero Gonzalez)
[09:46:07] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2400 to wikikube-worker2011 - cgoubert@cumin1002"
[09:46:07] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:46:07] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2011
[09:46:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2011
[09:46:48] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2400 to wikikube-worker2011
[09:47:07] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2403 to wikikube-worker2012
[09:47:24] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[09:47:36] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "netbox-dev2003 - ayounsi@cumin1002"
[09:49:27] <icinga-wm_>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[09:51:07] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "netbox-dev2003 - ayounsi@cumin1002"
[09:51:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2403 to wikikube-worker2012 - cgoubert@cumin1002"
[09:51:53] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] toolforge: haproxy: check the k8s api-server /healthz endpoint [puppet] - https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389) (owner: Arturo Borrero Gonzalez)
[09:53:07] <jinxer-wm>	 FIRING: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:53:48] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2403 to wikikube-worker2012 - cgoubert@cumin1002"
[09:53:49] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:53:49] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2012
[09:54:46] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[09:55:32] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[09:58:07] <jinxer-wm>	 RESOLVED: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:58:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2012
[09:58:48] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2403 to wikikube-worker2012
[09:59:24] <wikibugs>	 (CR) Kamila Součková: [C:+2] admin: add Audrey Penven to ldap_only (wmde/nda) [puppet] - https://gerrit.wikimedia.org/r/1047176 (https://phabricator.wikimedia.org/T367184) (owner: Dzahn)
[09:59:53] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2404 to wikikube-worker2013
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1000)
[10:00:09] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[10:00:30] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1
[10:01:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T364069)', diff saved to https://phabricator.wikimedia.org/P65194 and previous config saved to /var/cache/conftool/dbconfig/20240619-100118-marostegui.json
[10:01:23] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[10:03:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2404 to wikikube-worker2013 - cgoubert@cumin1002"
[10:04:53] <icinga-wm_>	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 239.93 ms
[10:05:34] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2404 to wikikube-worker2013 - cgoubert@cumin1002"
[10:05:34] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:05:34] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2013
[10:05:39] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[10:05:55] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2013
[10:06:13] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2404 to wikikube-worker2013
[10:06:30] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2405 to wikikube-worker2014
[10:06:49] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[10:09:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2405 to wikikube-worker2014 - cgoubert@cumin1002"
[10:12:32] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2405 to wikikube-worker2014 - cgoubert@cumin1002"
[10:12:32] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:12:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2014
[10:12:49] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2014
[10:12:58] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2405 to wikikube-worker2014
[10:14:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2408 to wikikube-worker2017
[10:14:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[10:16:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P65195 and previous config saved to /var/cache/conftool/dbconfig/20240619-101625-marostegui.json
[10:16:48] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2408 to wikikube-worker2017 - cgoubert@cumin1002"
[10:17:52] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2408 to wikikube-worker2017 - cgoubert@cumin1002"
[10:17:52] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:17:55] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2017
[10:18:10] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2017
[10:18:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2408 to wikikube-worker2017
[10:18:28] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2409 to wikikube-worker2018
[10:18:44] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[10:21:03] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2409 to wikikube-worker2018 - cgoubert@cumin1002"
[10:22:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2409 to wikikube-worker2018 - cgoubert@cumin1002"
[10:22:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:22:21] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2018
[10:23:05] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2018
[10:23:14] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2409 to wikikube-worker2018
[10:23:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2011.codfw.wmnet with OS bullseye
[10:24:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2012.codfw.wmnet with OS bullseye
[10:24:30] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2013.codfw.wmnet with OS bullseye
[10:24:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[10:24:50] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2014.codfw.wmnet with OS bullseye
[10:24:58] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[10:25:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T367856)', diff saved to https://phabricator.wikimedia.org/P65196 and previous config saved to /var/cache/conftool/dbconfig/20240619-102504-marostegui.json
[10:25:09] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[10:25:26] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2017.codfw.wmnet with OS bullseye
[10:25:40] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2018.codfw.wmnet with OS bullseye
[10:29:59] <wikibugs>	 (PS1) Jelto: gitlab: add custom nginx config to block manual Trusted Runners edits [puppet] - https://gerrit.wikimedia.org/r/1047470 (https://phabricator.wikimedia.org/T366786)
[10:31:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65197 and previous config saved to /var/cache/conftool/dbconfig/20240619-103109-root.json
[10:31:24] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9906515 (kamila) In progress→Resolved a:kamila Done, though it's my first time doing clinic duty, so let me know if it doesn't work :D
[10:32:19] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T367736#9906530 (Clement_Goubert)
[10:32:35] <wikibugs>	 (PS1) Kamila Součková: Extend access for AndyRussG [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681)
[10:32:38] <wikibugs>	 (PS1) Ladsgroup: mariadb: Remove direct grants on mailman databases [puppet] - https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833)
[10:32:44] <wikibugs>	 (CR) CI reject: [V:-1] gitlab: add custom nginx config to block manual Trusted Runners edits [puppet] - https://gerrit.wikimedia.org/r/1047470 (https://phabricator.wikimedia.org/T366786) (owner: Jelto)
[10:33:08] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2977/console" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[10:33:56] <wikibugs>	 (CR) Fabfur: "I think we could retry to apply this in ulsfo now that HAProxy is at version 2.8.10" [puppet] - https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[10:34:01] <wikibugs>	 (PS2) Ladsgroup: mariadb: Remove direct grants on mailman databases [puppet] - https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833)
[10:34:09] <wikibugs>	 (PS2) Jelto: gitlab: add custom nginx config to block manual Trusted Runners edits [puppet] - https://gerrit.wikimedia.org/r/1047470 (https://phabricator.wikimedia.org/T366786)
[10:34:47] <jinxer-wm>	 FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[10:35:50] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add new IRC servers also to the k8s hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1047430 (https://phabricator.wikimedia.org/T331702) (owner: Muehlenhoff)
[10:36:33] <logmsgbot>	 !log jmm@deploy1002 Started scap: (no justification provided)
[10:37:33] <wikibugs>	 (CR) Vgutierrez: [C:+1] benthos:cache: switch to rfc5424 format [puppet] - https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[10:37:43] <wikibugs>	 (CR) Marostegui: "let's double check if they exist in the db, and if they do, let's kill them" [puppet] - https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833) (owner: Ladsgroup)
[10:37:47] <wikibugs>	 (CR) Marostegui: [C:+1] mariadb: Remove direct grants on mailman databases [puppet] - https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833) (owner: Ladsgroup)
[10:38:35] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2978/co" [puppet] - https://gerrit.wikimedia.org/r/1047470 (https://phabricator.wikimedia.org/T366786) (owner: Jelto)
[10:39:22] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2011.codfw.wmnet with reason: host reimage
[10:39:33] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2012.codfw.wmnet with reason: host reimage
[10:40:06] <logmsgbot>	 !log jmm@deploy1002 Finished scap: (no justification provided) (duration: 04m 03s)
[10:40:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage
[10:40:35] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage
[10:40:50] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2018.codfw.wmnet with reason: host reimage
[10:40:52] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage
[10:41:29] <wikibugs>	 (CR) Ladsgroup: "I will!" [puppet] - https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833) (owner: Ladsgroup)
[10:41:49] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2011.codfw.wmnet with reason: host reimage
[10:43:36] <wikibugs>	 (CR) Muehlenhoff: Extend access for AndyRussG (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: Kamila Součková)
[10:43:39] <wikibugs>	 (CR) Hnowlan: [C:+1] httpbb: Remove appserver hourly tests [puppet] - https://gerrit.wikimedia.org/r/1047107 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert)
[10:43:45] <wikibugs>	 (CR) Ladsgroup: [C:-1] "Sounds good to me, let me double check where this list is exactly used." [puppet] - https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190) (owner: Ladsgroup)
[10:44:30] <wikibugs>	 (CR) Hnowlan: [C:+1] envoy: add data-gateway service listener [puppet] - https://gerrit.wikimedia.org/r/1032599 (https://phabricator.wikimedia.org/T364921) (owner: Scott French)
[10:44:32] <jinxer-wm>	 RESOLVED: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[10:44:33] <wikibugs>	 (CR) Clément Goubert: [C:+2] httpbb: Remove appserver hourly tests [puppet] - https://gerrit.wikimedia.org/r/1047107 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert)
[10:44:36] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2018.codfw.wmnet with reason: host reimage
[10:45:17] <wikibugs>	 (CR) Hnowlan: service: move data-gateway service to production [puppet] - https://gerrit.wikimedia.org/r/1032593 (https://phabricator.wikimedia.org/T364921) (owner: Scott French)
[10:45:39] <wikibugs>	 (CR) Muehlenhoff: "Looks good, but lists1001 needs to be set to role::insetup::buster first" [puppet] - https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706) (owner: Dzahn)
[10:45:48] <jinxer-wm>	 FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:46:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65198 and previous config saved to /var/cache/conftool/dbconfig/20240619-104614-root.json
[10:47:58] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage
[10:49:32] <wikibugs>	 (PS2) Ladsgroup: prometheus: Change footer icon ping url [puppet] - https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190)
[10:51:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage
[10:51:58] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[10:52:28] <wikibugs>	 (PS2) Hnowlan: admin_ng: bump limits for shellbox-video [deployment-charts] - https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309)
[10:52:39] <wikibugs>	 (CR) Hnowlan: "Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[10:55:08] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage
[10:55:41] <wikibugs>	 (CR) Clément Goubert: [C:+1] admin_ng: bump limits for shellbox-video [deployment-charts] - https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[10:58:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2012.codfw.wmnet with reason: host reimage
[10:58:54] <wikibugs>	 (CR) Hnowlan: [C:+2] admin_ng: bump limits for shellbox-video [deployment-charts] - https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[11:00:05] <jouncebot>	 mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1100).
[11:00:48] <wikibugs>	 (CR) Btullis: [C:+1] "LGTM, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603) (owner: Brouberol)
[11:01:13] <wikibugs>	 (CR) Btullis: [C:+1] wdqs graph-split: add final svcs [dns] - https://gerrit.wikimedia.org/r/1042160 (https://phabricator.wikimedia.org/T364364) (owner: Stevemunene)
[11:01:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65199 and previous config saved to /var/cache/conftool/dbconfig/20240619-110120-root.json
[11:01:42] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2011.codfw.wmnet with OS bullseye
[11:01:55] <wikibugs>	 (Merged) jenkins-bot: admin_ng: bump limits for shellbox-video [deployment-charts] - https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[11:03:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2018.codfw.wmnet with OS bullseye
[11:03:27] <wikibugs>	 (CR) Btullis: [C:+1] "Thanks." [puppet] - https://gerrit.wikimedia.org/r/1047452 (https://phabricator.wikimedia.org/T363461) (owner: Brouberol)
[11:04:24] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[11:04:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: ferm.service on kubernetes2053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:01] <hnowlan>	 jouncebot: nowandnext
[11:05:01] <jouncebot>	 For the next 0 hour(s) and 54 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1100)
[11:05:01] <jouncebot>	 In 1 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1300)
[11:06:01] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:07:20] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:07:55] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:07:56] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2017.codfw.wmnet with OS bullseye
[11:08:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:09:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes2053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:10:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:11:23] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2014.codfw.wmnet with OS bullseye
[11:12:05] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:13:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:14:15] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:14:30] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[11:15:07] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2013.codfw.wmnet with OS bullseye
[11:15:55] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[11:16:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65200 and previous config saved to /var/cache/conftool/dbconfig/20240619-111625-root.json
[11:17:50] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2012.codfw.wmnet with OS bullseye
[11:18:40] <logmsgbot>	 !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host netbox-dev2003.codfw.wmnet with OS bookworm
[11:18:40] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host netbox-dev2003.codfw.wmnet
[11:20:39] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7379.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:25:39] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:26:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[11:27:29] <wikibugs>	 (PS1) Ayounsi: Add "netbox-dev" to deployment server [puppet] - https://gerrit.wikimedia.org/r/1047482 (https://phabricator.wikimedia.org/T336275)
[11:28:04] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047482 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[11:31:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65201 and previous config saved to /var/cache/conftool/dbconfig/20240619-113131-root.json
[11:34:14] <wikibugs>	 (PS2) Kamila Součková: Extend access for AndyRussG [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681)
[11:34:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes2053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:34:39] <wikibugs>	 (CR) Kamila Součková: Extend access for AndyRussG (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: Kamila Součková)
[11:35:54] <wikibugs>	 (CR) Majavah: "question inline from me who is not at all familiar with the process" [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: Kamila Součková)
[11:36:49] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-eqsin
[11:39:25] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: ferm.service on kubernetes2053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:42:00] <jinxer-wm>	 FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[11:46:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65203 and previous config saved to /var/cache/conftool/dbconfig/20240619-114636-root.json
[11:50:26] <wikibugs>	 (PS1) Fabfur: hiera: test downgrading haproxy on cp5017 [puppet] - https://gerrit.wikimedia.org/r/1047483 (https://phabricator.wikimedia.org/T367756)
[11:50:44] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[11:52:18] <wikibugs>	 (PS2) Fabfur: hiera: test downgrading haproxy on cp5017 [puppet] - https://gerrit.wikimedia.org/r/1047483 (https://phabricator.wikimedia.org/T367756)
[11:53:13] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047483 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[11:55:18] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: test downgrading haproxy on cp5017 [puppet] - https://gerrit.wikimedia.org/r/1047483 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[11:57:20] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5017.*} and A:cp
[11:57:29] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bookworm
[11:58:51] <wikibugs>	 (PS1) Majavah: hieradata: Migrate cloudvirt1042 to OVS [puppet] - https://gerrit.wikimedia.org/r/1047486 (https://phabricator.wikimedia.org/T364457)
[11:59:17] <wikibugs>	 (Abandoned) Majavah: hieradata: Move cloudvirt-wdqs* to OVS [puppet] - https://gerrit.wikimedia.org/r/1046676 (https://phabricator.wikimedia.org/T364457) (owner: Majavah)
[12:00:23] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5017.*} and A:cp
[12:00:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[12:01:05] <wikibugs>	 (PS2) Majavah: hieradata: Move cloudvirt1042 to OVS and single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1047486 (https://phabricator.wikimedia.org/T319184)
[12:01:14] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on idp-test1002.wikimedia.org with reason: CAS 7 upgrade
[12:01:28] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on idp-test1002.wikimedia.org with reason: CAS 7 upgrade
[12:01:38] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9906713 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d9d9df4b-e647-4f8e-8b55-811d9f86d7d0) set by slyngshede@cumin1002 for 5 days, 0:00:00 on 1 host(s) an...
[12:01:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65204 and previous config saved to /var/cache/conftool/dbconfig/20240619-120142-root.json
[12:02:03] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet
[12:02:05] <wikibugs>	 (CR) Slyngshede: [V:+1 C:+2] P:idp Allow upgrade to Tomcat 10. [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[12:02:20] <wikibugs>	 (CR) Slyngshede: [V:+1 C:+2] P:idp Allow upgrade to Tomcat 10. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[12:02:36] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Move cloudvirt1042 to OVS and single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1047486 (https://phabricator.wikimedia.org/T319184) (owner: Majavah)
[12:02:58] <taavi>	 slyngs: please merge mine too
[12:03:02] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet
[12:03:27] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet
[12:03:52] <slyngs>	 taavi: Will do
[12:04:46] <slyngs>	 Done
[12:04:55] <wikibugs>	 (CR) Brouberol: [C:+2] karapace: disable the systemd service to see if errors surface [puppet] - https://gerrit.wikimedia.org/r/1047452 (https://phabricator.wikimedia.org/T363461) (owner: Brouberol)
[12:06:21] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet
[12:07:02] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[12:07:30] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[12:07:38] <wikibugs>	 (PS1) Muehlenhoff: Point codfw and codfw1dev to use the eqiad LDAP ro servers as well [puppet] - https://gerrit.wikimedia.org/r/1047488 (https://phabricator.wikimedia.org/T367861)
[12:08:14] <klausman>	 !log Will test-replace the PXE chainloader (/srv/tftpboot/lpxelinux.0) on install2003 with a newer version to see if it fixes the ldlinux.c32 error. Puppet will be disabled on that machine for the duration.
[12:08:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[12:11:02] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[12:11:39] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[12:12:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[12:12:21] <wikibugs>	 (PS1) Majavah: hieradata: Move cloudvirt1043 to OVS [puppet] - https://gerrit.wikimedia.org/r/1047489 (https://phabricator.wikimedia.org/T364457)
[12:13:09] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm
[12:13:36] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Move cloudvirt1043 to OVS [puppet] - https://gerrit.wikimedia.org/r/1047489 (https://phabricator.wikimedia.org/T364457) (owner: Majavah)
[12:14:13] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[12:14:34] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[12:15:07] <icinga-wm_>	 PROBLEM - karapace http server on karapace1001 is CRITICAL: connect to address 10.64.0.24 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Karapace
[12:16:13] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm
[12:16:27] <wikibugs>	 ops-codfw, SRE, DC-Ops, Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9906779 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm
[12:16:32] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Stop syncing swift rings on Puppet 5 frontends [puppet] - https://gerrit.wikimedia.org/r/1043128 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[12:19:21] <wikibugs>	 (PS1) Hnowlan: shellbox-video: drop requests/replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1047491 (https://phabricator.wikimedia.org/T357309)
[12:19:33] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[12:20:00] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:21:59] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[12:22:06] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:23:37] <wikibugs>	 (PS1) Fabfur: hiera: test upgrading cp5017 to haproxy 2.8.9 [puppet] - https://gerrit.wikimedia.org/r/1047492 (https://phabricator.wikimedia.org/T367963)
[12:24:32] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: test upgrading cp5017 to haproxy 2.8.9 [puppet] - https://gerrit.wikimedia.org/r/1047492 (https://phabricator.wikimedia.org/T367963) (owner: Fabfur)
[12:24:50] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet
[12:26:40] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1043513 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[12:31:29] <icinga-wm_>	 PROBLEM - karapace http server on karapace1002 is CRITICAL: connect to address 10.64.0.5 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Karapace
[12:31:45] <jinxer-wm>	 FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[12:32:21] <wikibugs>	 (CR) Kamila Součková: [C:+1] "LGTM except see inline" [deployment-charts] - https://gerrit.wikimedia.org/r/1047491 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[12:32:59] <wikibugs>	 (CR) Muehlenhoff: [C:+2] thanos: Limit access to swift ring sync to Puppet 7 servers [puppet] - https://gerrit.wikimedia.org/r/1043513 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[12:33:58] <wikibugs>	 (CR) Vgutierrez: [C:+1] Point codfw and codfw1dev to use the eqiad LDAP ro servers as well [puppet] - https://gerrit.wikimedia.org/r/1047488 (https://phabricator.wikimedia.org/T367861) (owner: Muehlenhoff)
[12:34:43] <klausman>	 !log Puppet management of install2004 restored, lpxelinux.0 also restored. 
[12:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:38] <logmsgbot>	 !log taavi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1042.eqiad.wmnet with OS bookworm
[12:36:22] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet']
[12:36:39] <logmsgbot>	 !log taavi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet']
[12:36:43] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet']
[12:37:04] <logmsgbot>	 !log taavi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet']
[12:37:21] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet']
[12:37:54] <wikibugs>	 SRE, SRE-Access-Requests, wikitech.wikimedia.org: Update "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T367914#9906851 (kamila) Open→Resolved a:kamila
[12:38:08] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[12:38:12] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet
[12:38:17] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:40:56] <claime>	 !log homer 'cr*codfw*' commit 'T351074'
[12:41:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:00] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[12:41:21] <wikibugs>	 (PS1) Muehlenhoff: Move update-netboot-image.sh to the puppetserver module [puppet] - https://gerrit.wikimedia.org/r/1047495 (https://phabricator.wikimedia.org/T365798)
[12:44:53] <wikibugs>	 SRE, Infrastructure-Foundations: Update pxelinux in tftpboot environment - https://phabricator.wikimedia.org/T367970 (MoritzMuehlenhoff) NEW
[12:44:55] <wikibugs>	 SRE, Infrastructure-Foundations: Update pxelinux in tftpboot environment - https://phabricator.wikimedia.org/T367970#9906873 (MoritzMuehlenhoff) p:Triage→Medium
[12:45:19] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047495 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[12:45:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[12:45:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ...
[12:45:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[12:45:51] <wikibugs>	 (CR) Brouberol: [C:+2] datahub-gms: enable prometheus scraping of metrics [deployment-charts] - https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603) (owner: Brouberol)
[12:51:06] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2011.codfw.wmnet|wikikube-worker2012.codfw.wmnet|wikikube-worker2013.codfw.wmnet|wikikube-worker2014.codfw.wmnet|wikikube-worker2017.codfw.wmnet|wikikube-worker2018.codfw.wmnet),cluster=kubernetes,service=kubesvc
[12:52:21] <logmsgbot>	 !log taavi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet']
[12:52:34] <brouberol>	 there's a local diff under /srv/deployment-charts/helmfile.d/services/sessionstore/values-staging.yaml preventing the git pull of new changes. I'm going to stash it
[12:55:30] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: cloudvirt1042, cloudvirt1043 fails to boot after a reimage - https://phabricator.wikimedia.org/T367971#9906903 (taavi)
[12:57:25] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9906906 (kamila)
[12:58:41] <wikibugs>	 (CR) Ssingh: [C:+2] conftool-data: add ntp-[abc].anycast.wmnet [puppet] - https://gerrit.wikimedia.org/r/1046675 (https://phabricator.wikimedia.org/T366360) (owner: Ssingh)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1300).
[13:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:10] <Lucas_WMDE>	 o/
[13:00:24] * Lucas_WMDE doesn’t see anything to deploy either
[13:02:31] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9906921 (kamila) @DMburugu can you please confirm that you have read the [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibilities...
[13:30:58] <wikibugs>	 (PS1) Klausman: hiera: Fix wrong machine name in ML k8s config [puppet] - https://gerrit.wikimedia.org/r/1047508
[13:31:54] <logmsgbot>	 !log taavi@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet']
[13:31:58] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet
[13:32:14] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=ntp-a
[13:32:31] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage
[13:32:35] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5017.*} and A:cp
[13:33:01] <wikibugs>	 (PS3) Elukey: service:uwsgi: remove scap support [puppet] - https://gerrit.wikimedia.org/r/1047503
[13:33:41] <wikibugs>	 (PS1) Muehlenhoff: Remove puppet checkout on pybaltest [puppet] - https://gerrit.wikimedia.org/r/1047509 (https://phabricator.wikimedia.org/T365798)
[13:33:51] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] hiera: Fix wrong machine name in ML k8s config [puppet] - https://gerrit.wikimedia.org/r/1047508 (owner: Klausman)
[13:34:02] <wikibugs>	 (CR) Klausman: [C:+2] hiera: Fix wrong machine name in ML k8s config [puppet] - https://gerrit.wikimedia.org/r/1047508 (owner: Klausman)
[13:35:20] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5017.*} and A:cp
[13:35:25] <logmsgbot>	 !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1043.eqiad.wmnet with OS bookworm
[13:35:55] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage
[13:35:59] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet
[13:39:23] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1043516 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[13:39:45] <wikibugs>	 (PS3) Fabfur: benthos:cache: switch to rfc5424 format [puppet] - https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718)
[13:40:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:11] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002"
[13:41:51] <logmsgbot>	 !log taavi@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1042.eqiad.wmnet']
[13:42:28] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002"
[13:43:57] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bookworm
[13:47:41] <wikibugs>	 SRE, Infrastructure-Foundations: firmware-update: spicerack.redfish.RedfishError: iDRAC is not ready. The configuration values cannot be accessed. Please retry after a few minutes. - https://phabricator.wikimedia.org/T367974 (taavi) NEW
[13:48:18] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9907018 (Papaul) @Clement_Goubert it is a U.S. holiday today, can we please reschedule this for tomorrow? Thank you, sorry about that
[13:48:48] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002"
[13:48:50] <wikibugs>	 (PS2) Hnowlan: services_proxy: add shellbox-video listener [puppet] - https://gerrit.wikimedia.org/r/1047098 (https://phabricator.wikimedia.org/T357309)
[13:49:46] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002"
[13:50:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: cloudvirt1042, cloudvirt1043 fails to boot after a reimage - https://phabricator.wikimedia.org/T367971#9907023 (taavi) As suggested by volans I tried running the firmware-upgrade cookbook on the other cumin server which h...
[13:50:41] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002"
[13:51:46] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Trying to fix Puppet error on ml-staging2003 - klausman@cumin2002"
[13:53:26] <logmsgbot>	 !log taavi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1042.eqiad.wmnet with OS bookworm
[13:53:45] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bookworm
[13:54:12] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2003.codfw.wmnet with reason: host reimage
[13:55:24] <wikibugs>	 (PS4) Elukey: service:uwsgi: remove scap support [puppet] - https://gerrit.wikimedia.org/r/1047503
[13:55:46] <wikibugs>	 (CR) CI reject: [V:-1] service:uwsgi: remove scap support [puppet] - https://gerrit.wikimedia.org/r/1047503 (owner: Elukey)
[13:56:06] <wikibugs>	 (CR) Hnowlan: [C:+2] services_proxy: add shellbox-video listener [puppet] - https://gerrit.wikimedia.org/r/1047098 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[13:56:54] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1047503 (owner: Elukey)
[13:57:13] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2003.codfw.wmnet with reason: host reimage
[13:57:32] <wikibugs>	 (PS1) VolkerE: Optimize static footer 'a Wikimedia project' icon further [mediawiki-config] - https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190)
[14:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1400)
[14:00:14] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1047503 (owner: Elukey)
[14:00:29] <wikibugs>	 (PS5) Elukey: service:uwsgi: remove scap support [puppet] - https://gerrit.wikimedia.org/r/1047503
[14:01:05] <wikibugs>	 (CR) Elukey: "I explicitly added python3-venv since it was removed by the scap cleanup, if not required I can skip it." [puppet] - https://gerrit.wikimedia.org/r/1047503 (owner: Elukey)
[14:01:17] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1044.eqiad.wmnet with OS bookworm
[14:02:40] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1047503 (owner: Elukey)
[14:03:37] <icinga-wm_>	 PROBLEM - Host ml-staging2001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:05:02] <wikibugs>	 (PS1) Hnowlan: shellbox-video: set timeout to one day [puppet] - https://gerrit.wikimedia.org/r/1047523 (https://phabricator.wikimedia.org/T357309)
[14:07:01] <logmsgbot>	 !log taavi@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet']
[14:07:26] <wikibugs>	 (PS1) Brouberol: Datahub: bump chart version to let hemfile use newly released subcharts [deployment-charts] - https://gerrit.wikimedia.org/r/1047524 (https://phabricator.wikimedia.org/T366603)
[14:07:27] <logmsgbot>	 !log taavi@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet']
[14:07:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-staging2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:08:09] <wikibugs>	 (CR) Btullis: [C:+1] Datahub: bump chart version to let hemfile use newly released subcharts [deployment-charts] - https://gerrit.wikimedia.org/r/1047524 (https://phabricator.wikimedia.org/T366603) (owner: Brouberol)
[14:08:09] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[14:08:54] <logmsgbot>	 !log taavi@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet']
[14:08:55] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage
[14:09:03] <wikibugs>	 (CR) Brouberol: [C:+2] Datahub: bump chart version to let hemfile use newly released subcharts [deployment-charts] - https://gerrit.wikimedia.org/r/1047524 (https://phabricator.wikimedia.org/T366603) (owner: Brouberol)
[14:09:38] <logmsgbot>	 !log taavi@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet']
[14:09:45] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - klausman@cumin2002"
[14:09:55] <logmsgbot>	 !log taavi@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet']
[14:10:12] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] shellbox-video: set timeout to one day [puppet] - https://gerrit.wikimedia.org/r/1047523 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[14:10:44] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - klausman@cumin2002"
[14:10:50] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2003.codfw.wmnet with OS bookworm
[14:10:59] <wikibugs>	 ops-codfw, SRE, DC-Ops, Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9907108 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm completed: - ml-stag...
[14:11:05] <logmsgbot>	 !log taavi@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1043.eqiad.wmnet']
[14:11:53] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage
[14:12:07] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[14:12:33] <jinxer-wm>	 FIRING: [4x] KubernetesCalicoDown: ml-staging-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:12:49] <wikibugs>	 (CR) Hnowlan: [C:+2] shellbox-video: set timeout to one day [puppet] - https://gerrit.wikimedia.org/r/1047523 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[14:13:09] <jinxer-wm>	 RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[14:14:45] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[14:16:21] <icinga-wm_>	 RECOVERY - Host ml-staging2001 is UP: PING OK - Packet loss = 0%, RTA = 79.12 ms
[14:17:00] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[14:17:33] <jinxer-wm>	 RESOLVED: [4x] KubernetesCalicoDown: ml-staging-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:19:17] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm
[14:19:29] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[14:19:39] <icinga-wm_>	 PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ml-staging2003 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear, flush_l1d} https://wikitech.wikimedia.org/wiki/Microcode
[14:20:11] <wikibugs>	 (PS4) Hnowlan: Add shellbox-video vars/config [mediawiki-config] - https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309)
[14:20:52] <wikibugs>	 (CR) CI reject: [V:-1] Add shellbox-video vars/config [mediawiki-config] - https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[14:21:05] <wikibugs>	 (PS1) Brouberol: superset-next: upgrade to Superset 4.0.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1047528 (https://phabricator.wikimedia.org/T366060)
[14:21:29] <wikibugs>	 (PS5) Hnowlan: Add shellbox-video vars/config [mediawiki-config] - https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309)
[14:22:08] <wikibugs>	 (CR) CI reject: [V:-1] Add shellbox-video vars/config [mediawiki-config] - https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[14:23:23] <moritzm>	 !log installing pymysql security updates
[14:23:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:41] <wikibugs>	 ops-codfw, SRE, DC-Ops, Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9907185 (klausman)
[14:24:58] <moritzm>	 !log installing libvpx security updates
[14:25:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:39] <wikibugs>	 (CR) Elukey: [C:+2] cli: modify get_distro_name to return the version id [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: Elukey)
[14:27:18] <wikibugs>	 (PS6) Hnowlan: Add shellbox-video vars/config [mediawiki-config] - https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309)
[14:28:03] <wikibugs>	 (Merged) jenkins-bot: cli: modify get_distro_name to return the version id [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: Elukey)
[14:32:23] <wikibugs>	 (PS7) Hnowlan: Add shellbox-video vars/config, enable on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309)
[14:34:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: cloudvirt1042, cloudvirt1043 fails to boot after a reimage - https://phabricator.wikimedia.org/T367971#9907209 (taavi) Open→Resolved a:Jclark-ctr→taavi The reimages finished successfully after a firmwar...
[14:34:58] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage
[14:35:38] <moritzm>	 !log installing nano security updates
[14:35:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:56] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1002"
[14:37:56] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage
[14:38:37] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1002"
[14:38:38] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155,1158].eqiad.wmnet with reason: Long schema change
[14:38:39] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1042.eqiad.wmnet with OS bookworm
[14:38:47] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:47] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:06] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155,1158].eqiad.wmnet with reason: Long schema change
[14:39:34] <wikibugs>	 (CR) Btullis: "Is this still to be reverted, or should we abandon it?" [puppet] - https://gerrit.wikimedia.org/r/1037014 (owner: Bking)
[14:40:08] <wikibugs>	 (CR) Btullis: [C:+1] superset-next: upgrade to Superset 4.0.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1047528 (https://phabricator.wikimedia.org/T366060) (owner: Brouberol)
[14:40:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Long schema change
[14:40:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Long schema change
[14:42:35] <marostegui>	 !log Deploy schema change on s2 eqiad master  dbmaint T364069
[14:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:40] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[14:45:56] <wikibugs>	 (CR) Fabfur: [C:+2] benthos:cache: switch to rfc5424 format [puppet] - https://gerrit.wikimedia.org/r/1047442 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[14:50:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9907245 (Papaul) Yes i don't think this approach will work for codfw, like @cmooney said: "codfw dc-ops match the switch port...
[14:50:55] <wikibugs>	 (PS1) Fabfur: benthos:cache: fixed typo in field name [puppet] - https://gerrit.wikimedia.org/r/1047536 (https://phabricator.wikimedia.org/T365718)
[14:51:49] <wikibugs>	 (PS1) Hnowlan: shellbox-video: drop timeout slightly [deployment-charts] - https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309)
[14:52:59] <icinga-wm_>	 RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:54:21] <wikibugs>	 (CR) Fabfur: [C:+2] benthos:cache: fixed typo in field name [puppet] - https://gerrit.wikimedia.org/r/1047536 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[14:55:09] <wikibugs>	 (PS1) Peter Fischer: Search update pipeline: retry 429s at HTTP client level [deployment-charts] - https://gerrit.wikimedia.org/r/1047538 (https://phabricator.wikimedia.org/T362310)
[14:55:54] <wikibugs>	 (CR) Peter Fischer: [C:+2] Search update pipeline: retry 429s at HTTP client level [deployment-charts] - https://gerrit.wikimedia.org/r/1047538 (https://phabricator.wikimedia.org/T362310) (owner: Peter Fischer)
[14:56:51] <wikibugs>	 (Merged) jenkins-bot: Search update pipeline: retry 429s at HTTP client level [deployment-charts] - https://gerrit.wikimedia.org/r/1047538 (https://phabricator.wikimedia.org/T362310) (owner: Peter Fischer)
[14:57:28] <wikibugs>	 (CR) Kosta Harlan: "I think it's ready to go, just had been waiting for T360070" [mediawiki-config] - https://gerrit.wikimedia.org/r/1010953 (https://phabricator.wikimedia.org/T360067) (owner: Kosta Harlan)
[14:57:32] <wikibugs>	 (PS3) Dreamy Jazz: extension-list: Add IPReputation [mediawiki-config] - https://gerrit.wikimedia.org/r/1010953 (https://phabricator.wikimedia.org/T360067) (owner: Kosta Harlan)
[14:58:31] <wikibugs>	 (CR) Ayounsi: [C:+1] "ohh awesome !" [puppet] - https://gerrit.wikimedia.org/r/1047503 (owner: Elukey)
[14:58:47] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9907287 (10Clement_Goubert) No worries, I'll extend the downtime, and we'll leave it like that for you to move.
[14:59:48] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker2003.codfw.wmnet
[14:59:48] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.remove-downtime (exit_code=97) for wikikube-worker2003.codfw.wmnet
[15:00:01] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw2282.codfw.wmnet
[15:00:02] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2282.codfw.wmnet
[15:00:17] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9907288 (10DMburugu) Approved
[15:01:10] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on mw2282.codfw.wmnet with reason: Host move
[15:01:23] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[15:01:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on mw2282.codfw.wmnet with reason: Host move
[15:01:36] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:01:37] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9907289 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=111d8ee1-db67-4ba6-a57a-50da8c8dc4ff) set by cgoubert@cumin1002 for 2 days, 0:00:0...
[15:02:06] <wikibugs>	 (03CR) 10Kosta Harlan: "Also, would be happy for anyone who sees this to merge and sync it. Otherwise I will try to get to that next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010953 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan)
[15:02:14] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9907292 (10DMburugu) @Dzahn Sorry for the tag mix up.  @kamila Yes, I have read the [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsi...
[15:03:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:04:20] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2006.wikimedia.org,service=ntp-c
[15:05:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[15:06:00] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[15:06:11] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:07:15] <wikibugs>	 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9907307 (10fkaelin) I confirm that this request is legit, also adding @XiaoXiao-WMF as manager.   As for an approvers list, please add myself and @XiaoXiao-WMF  (assuming your access is set...
[15:07:32] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1043.eqiad.wmnet with OS bookworm
[15:07:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS info - pt1979@cumin2002"
[15:08:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS info - pt1979@cumin2002"
[15:08:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:11:19] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9907339 (10klausman)
[15:11:29] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9907322 (10Dzahn) Additionally added Audrey to the WMF-NDA group in Phabricator (https://phabricator.wikimedia.org/project/members/61/)  That's based on T358578 (like for wmf group -> http...
[15:12:36] <wikibugs>	 (03PS1) 10Ayounsi: Standalone Netbox: set redis' appendfilename [puppet] - 10https://gerrit.wikimedia.org/r/1047542 (https://phabricator.wikimedia.org/T336275)
[15:12:59] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9907350 (10klausman) Machine is imaged and running. The PXE boot was "fixed" by an ugly hack mentioned in T304483#9906962 While the firmware problem remains, at least we a...
[15:16:44] <sukhe>	 !log sudo cumin -b1 -s120 'A:dnsbox' 'run-puppet-agent --enable "merging CR 1046685"': T366360
[15:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:49] <stashbot>	 T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360
[15:18:35] <wikibugs>	 (03PS1) 10Elukey: debian: update target distribution [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/1047544 (https://phabricator.wikimedia.org/T365253)
[15:19:54] <wikibugs>	 (03PS2) 10Ayounsi: Standalone Netbox: set redis' appendfilename [puppet] - 10https://gerrit.wikimedia.org/r/1047542 (https://phabricator.wikimedia.org/T336275)
[15:20:50] <wikibugs>	 (03PS1) 10Fabfur: benthos:cache: delete message if sequence number is missing [puppet] - 10https://gerrit.wikimedia.org/r/1047545 (https://phabricator.wikimedia.org/T365718)
[15:22:25] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.06.17 - 2024.07.07): Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9907372 (10Jhancock.wm) @Clement_Goubert and I were troubleshooting a similar issue on June 4th for kubernetes2030 kubernetes2033 kubernetes2035  the only thing...
[15:22:41] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:22:51] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "ntp-a being advertised from all sites (currently rolling out). This is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/1047073 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[15:23:18] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:23:36] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:23:37] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:24:00] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:24:08] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:25:09] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] benthos:cache: delete message if sequence number is missing [puppet] - 10https://gerrit.wikimedia.org/r/1047545 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[15:25:48] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "Good point. Let me nit on naming a bit:" [alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French)
[15:27:04] <wikibugs>	 (03CR) 10JMeybohm: "If you find the time, please double check me on this one." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[15:28:09] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9907386 (10Jhancock.wm) The server has had the DIMM reseated.
[15:30:10] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Standalone Netbox: set redis' appendfilename [puppet] - 10https://gerrit.wikimedia.org/r/1047542 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[15:31:45] <jinxer-wm>	 RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[15:32:25] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1042
[15:32:36] <wikibugs>	 (03PS1) 10DCausse: cirrus-streaming-updater: increase elasticsearch-bulk-max-action-size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047546
[15:32:44] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T367648#9907393 (10Jhancock.wm) The iDRAC light is rapidly blinking amber. I tried rebooting just the iDRAC by holding down the button for 45 seconds. It failed. The next troubleshooting step is to reboot and drain the power...
[15:32:46] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1042
[15:33:26] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:33:57] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Standalone Netbox: set redis' appendfilename [puppet] - 10https://gerrit.wikimedia.org/r/1047542 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[15:36:40] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "just note that we need to remember to also tell MediaWiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[15:38:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] mathoid: Upgrade image from 2023-11-03-103402 to 2024-06-18-233457 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047201 (https://phabricator.wikimedia.org/T349118) (owner: 10Jforrester)
[15:40:17] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9907428 (10kamila) @AndyRussG since you're now with WMDE, we'd like to update your email, can you please let me know your @wikimedia.de email address?
[15:41:31] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] service:uwsgi: remove scap support [puppet] - 10https://gerrit.wikimedia.org/r/1047503 (owner: 10Elukey)
[15:41:37] <wikibugs>	 (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: increase elasticsearch-bulk-max-action-size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047546 (owner: 10DCausse)
[15:42:39] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-streaming-updater: increase elasticsearch-bulk-max-action-size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047546 (owner: 10DCausse)
[15:42:48] <wikibugs>	 (03PS1) 10Majavah: hieradata: Move cloudvirt1044 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1047551 (https://phabricator.wikimedia.org/T364457)
[15:43:37] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bookworm
[15:43:46] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] superset-next: upgrade to Superset 4.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047528 (https://phabricator.wikimedia.org/T366060) (owner: 10Brouberol)
[15:44:12] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:44:47] <wikibugs>	 (03PS1) 10Ayounsi: sre.deploy.python-code: add missing f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/1047553
[15:45:02] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[15:45:17] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, thanks for the fix" [cookbooks] - 10https://gerrit.wikimedia.org/r/1047553 (owner: 10Ayounsi)
[15:46:00] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[15:46:31] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:47:13] <wikibugs>	 (03CR) 10Ladsgroup: rpc: Update function call in RunSingleJob (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038785 (https://phabricator.wikimedia.org/T363839) (owner: 10Ladsgroup)
[15:49:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: ferm.service on mw2353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:50:41] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[15:51:00] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:52:36] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] debian: update target distribution [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/1047544 (https://phabricator.wikimedia.org/T365253) (owner: 10Elukey)
[15:53:35] <wikibugs>	 (03Merged) 10jenkins-bot: sre.deploy.python-code: add missing f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/1047553 (owner: 10Ayounsi)
[15:54:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: ferm.service on mw2353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:55:28] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox-dev to netbox-dev2003.codfw.wmnet with reason: Netbox 4 on netbox-dev2003 - ayounsi@cumin1002 - T336275
[15:55:28] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox-dev to netbox-dev2003.codfw.wmnet with reason: Netbox 4 on netbox-dev2003 - ayounsi@cumin1002 - T336275
[15:55:33] <stashbot>	 T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275
[15:56:55] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 463, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:58:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702#9907461 (10Jhancock.wm) tried a few things but ultimately had to power cycle the server to get it back up. Lemme know if it looks good.
[15:58:40] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.06.17 - 2024.07.07): Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9907465 (10Papaul) @Jhancock.wm you can physically power cycle it at any time
[15:59:09] <wikibugs>	 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9907466 (10Papaul) Power supply will be shipped out today
[16:02:59] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox-dev to netbox-dev2003.codfw.wmnet with reason: Netbox 4 on netbox-dev2003 - ayounsi@cumin1002 - T336275
[16:03:04] <stashbot>	 T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275
[16:07:29] <icinga-wm_>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:07:57] <icinga-wm_>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:09:27] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[16:12:19] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw2321.codfw.wmnet back to active - cgoubert@cumin1002"
[16:13:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw2321.codfw.wmnet back to active - cgoubert@cumin1002"
[16:13:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:13:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2102.codfw.wmnet - https://phabricator.wikimedia.org/T366892#9907482 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:15:54] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox-dev to netbox-dev2003.codfw.wmnet with reason: Netbox 4 on netbox-dev2003 - ayounsi@cumin1002 - T336275
[16:15:59] <stashbot>	 T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275
[16:17:48] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9907489 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:19:40] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: service=(ntp-a|ntp-b|ntp-c)
[16:27:29] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=mw2321.codfw.wmnet,cluster=kubernetes,service=kubesvc
[16:27:54] <claime>	 !log pooling and uncordoning mw2321.codfw.wmnet - T367702
[16:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:58] <stashbot>	 T367702: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702
[16:29:19] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702#9907521 (10Clement_Goubert) 05Open→03Resolved Looks back up, putting it back to Active and running homer brought back BGP connectivit...
[16:31:39] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:34:05] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9907538 (10cmooney) 05Open→03Resolved a:03cmooney Cool thanks @papaul.  I guess we can see how we get on over the nex...
[16:40:32] <wikibugs>	 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907547 (10Papaul) All the netbox part is done waiting.
[16:41:02] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] durum: switch NTP peers to ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046689 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[16:42:04] <sukhe>	 !log sudo cumin 'A:durum' 'run-puppet-agent' to switch timesyncd NTP pools to ntp-[abc].anycast.wmnet: T366360
[16:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:09] <stashbot>	 T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360
[16:47:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: ferm.service on mw2422:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:49:25] <wikibugs>	 (03PS2) 10Hnowlan: shellbox-video: drop timeout slightly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309)
[16:50:31] <wikibugs>	 (03CR) 10Hnowlan: "Added a note" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[16:50:42] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] shellbox-video: drop timeout slightly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[16:51:42] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-video: drop timeout slightly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047537 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[16:52:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:54:48] <wikibugs>	 (03PS4) 10Ssingh: P:bird::anycast_monitoring: add monitoring for 10.3.0.[5-7]/32 [puppet] - 10https://gerrit.wikimedia.org/r/1046757 (https://phabricator.wikimedia.org/T366360)
[16:55:35] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2986/co" [puppet] - 10https://gerrit.wikimedia.org/r/1046757 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[16:55:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:41] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[16:59:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl200[12] to a new rack - kamila@cumin1002"
[17:00:02] <wikibugs>	 (03PS1) 10Kamila Součková: Revert^2 "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047560
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1700)
[17:00:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert^2 "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047560 (owner: 10Kamila Součková)
[17:00:48] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl200[12] to a new rack - kamila@cumin1002"
[17:00:48] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:00:50] <wikibugs>	 (03PS1) 10Kamila Součková: Revert^2 "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047561
[17:01:01] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2001
[17:01:03] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl2001
[17:01:27] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2002
[17:01:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl2002
[17:02:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:02:45] <wikibugs>	 (03PS2) 10Kamila Součková: Revert^2 "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047561
[17:04:30] <wikibugs>	 (03PS1) 10Superpes15: [tlywiki] Change the logo and wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047562 (https://phabricator.wikimedia.org/T366431)
[17:05:30] <logmsgbot>	 !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1044.eqiad.wmnet with OS bookworm
[17:08:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9907616 (10Papaul) I don't think we have a lot of servers right now that have a 10G NIC but are using the 1G NIC. Most of the servers...
[17:13:48] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[17:16:07] * brennen looks around
[17:16:54] <brennen>	 i'm not here, ignore me, but based on a skim of tasks and logs, i think there are no train things needed for the upcoming window.
[17:17:10] <brennen>	 somebody ping me if i'm wrong and i'll materialize for a bit.
[17:17:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:20:18] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl200[12] to a new rack - kamila@cumin1002"
[17:21:13] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl200[12] to a new rack - kamila@cumin1002"
[17:21:13] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:22:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: ferm.service on kubernetes2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:25:15] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] Revert^2 "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047561 (owner: 10Kamila Součková)
[17:25:43] <wikibugs>	 (03PS2) 10Kamila Součková: Revert^2 "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047560
[17:27:16] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] Revert^2 "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1047560 (owner: 10Kamila Součková)
[17:27:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: ferm.service on kubernetes2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:41:17] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:43:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367856)', diff saved to https://phabricator.wikimedia.org/P65207 and previous config saved to /var/cache/conftool/dbconfig/20240619-174338-marostegui.json
[17:43:44] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[17:45:07] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:47:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: ferm.service on wikikube-worker2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:51:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:53:47] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:56:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:58:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P65208 and previous config saved to /var/cache/conftool/dbconfig/20240619-175846-marostegui.json
[18:00:05] <jouncebot>	 jnuche and brennen: MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T1800). Please do the needful.
[18:13:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P65209 and previous config saved to /var/cache/conftool/dbconfig/20240619-181353-marostegui.json
[18:21:51] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2001.codfw.wmnet with OS bullseye
[18:22:01] <wikibugs>	 ops-codfw, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907715 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2001.co...
[18:29:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367856)', diff saved to https://phabricator.wikimedia.org/P65210 and previous config saved to /var/cache/conftool/dbconfig/20240619-182900-marostegui.json
[18:29:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance
[18:29:05] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[18:29:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance
[18:29:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T367856)', diff saved to https://phabricator.wikimedia.org/P65211 and previous config saved to /var/cache/conftool/dbconfig/20240619-182922-marostegui.json
[18:34:45] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-ctrl2001.codfw.wmnet with OS bullseye
[18:35:21] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2001.codfw.wmnet with OS bullseye
[18:35:33] <wikibugs>	 ops-codfw, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907723 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2001.co...
[18:40:16] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2002.codfw.wmnet with OS bullseye
[18:40:28] <wikibugs>	 ops-codfw, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907725 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.co...
[18:48:25] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl2002.codfw.wmnet with OS bullseye
[18:48:32] <wikibugs>	 ops-codfw, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907733 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2002.codfw....
[18:49:13] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2002.codfw.wmnet with OS bullseye
[18:49:18] <wikibugs>	 ops-codfw, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907734 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.co...
[18:51:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2001.codfw.wmnet with reason: host reimage
[18:54:37] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2001.codfw.wmnet with reason: host reimage
[19:05:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: host reimage
[19:08:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: host reimage
[19:18:59] <wikibugs>	 (PS1) Pppery: Add fallback languages for Phabricator [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1047593
[19:30:20] <wikibugs>	 (PS2) Pppery: Add fallback languages for Phabricator [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1047593
[19:33:41] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T2000).
[20:00:05] <jouncebot>	 Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:37] <Superpes>	 Hi :P
[20:02:57] <wikibugs>	 SRE, SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9907796 (MunizaA)
[20:11:10] <wikibugs>	 (PS1) Santiago Faci: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1047600 (https://phabricator.wikimedia.org/T366940)
[20:12:34] <Superpes>	 Can anyone deploy? :O
[20:16:18] <zabe>	 I can
[20:16:56] <wikibugs>	 (CR) Zabe: [C:+2] [tlywiki] Change the logo and wordmark/tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1047562 (https://phabricator.wikimedia.org/T366431) (owner: Superpes15)
[20:17:13] <Superpes>	 Oh thanks :D
[20:17:29] <wikibugs>	 (CR) Pppery: "Also upstream at https://we.phorge.it/D25695, since for ancient historical reasons some of the locale files are upstream." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1047593 (owner: Pppery)
[20:17:37] <wikibugs>	 (Merged) jenkins-bot: [tlywiki] Change the logo and wordmark/tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1047562 (https://phabricator.wikimedia.org/T366431) (owner: Superpes15)
[20:19:12] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:1047562|[tlywiki] Change the logo and wordmark/tagline (T366431)]]
[20:19:17] <stashbot>	 T366431: Change logo for Talysh Wikipedia - https://phabricator.wikimedia.org/T366431
[20:23:52] <logmsgbot>	 !log zabe@deploy1002 superpes, zabe: Backport for [[gerrit:1047562|[tlywiki] Change the logo and wordmark/tagline (T366431)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:23:56] <Superpes>	 Testing :)
[20:24:38] <Superpes>	 Looks fine thanks :P zabe
[20:24:42] <logmsgbot>	 !log zabe@deploy1002 superpes, zabe: Continuing with sync
[20:31:31] <wikibugs>	 ops-codfw, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9907853 (Papaul) @kamila 2001 and 2002 are ready ` papaul@lsw1-b7-codfw> show interfaces descriptions | match wiki* xe-0/0/42...
[20:32:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[20:32:31] <MatmaRex>	 hi. if any deployers are feeling bored, want to ship a beta-only config change too? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1047122
[20:32:40] <wikibugs>	 (PS3) Bartosz Dziewoński: Revert "Show experimental login popup links on the beta cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891)
[20:32:55] <zabe>	 sure
[20:33:25] <MatmaRex>	 thanks!
[20:33:46] <wikibugs>	 (CR) Zabe: [C:+2] Revert "Show experimental login popup links on the beta cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891) (owner: Bartosz Dziewoński)
[20:33:54] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1047562|[tlywiki] Change the logo and wordmark/tagline (T366431)]] (duration: 14m 41s)
[20:33:59] <stashbot>	 T366431: Change logo for Talysh Wikipedia - https://phabricator.wikimedia.org/T366431
[20:34:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891) (owner: Bartosz Dziewoński)
[20:34:28] <wikibugs>	 (Merged) jenkins-bot: Revert "Show experimental login popup links on the beta cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891) (owner: Bartosz Dziewoński)
[20:34:59] <Superpes>	 Many thanks for your assistance :3
[20:36:03] <zabe>	 yw:)
[20:37:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[20:53:51] <wikibugs>	 ops-eqdfw, SRE, DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9907910 (Papaul) Hello Papaul,  Thank you for the information provided.  This is the information for the replacement of the faulty PEM with serial number 1F188120554. The RMA ID is # R20051...
[20:55:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: etcd.service on wikikube-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:57:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:57:41] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wikikube-ctrl2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:59:44] <_joe_>	 kamila_: can I assume that's you? a downtime just expired AFAICS
[21:00:04] <_joe_>	 well it's 11 pm, someone in a better TZ will respond I hope
[21:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240619T2100)
[21:00:46] <kamila_>	 _joe_: uh, yeah, exactly
[21:00:50] <kamila_>	 I'm sorry
[21:00:59] <kamila_>	 forgot that they expire...
[21:04:30] <sukhe>	 I am not near a computer for another 45 mins or so but I ACKed it for now
[21:04:49] <sukhe>	 will downtime then unless someone does it before
[21:06:56] <cdanis>	 I have put in an alertmanager silence
[21:07:00] <cdanis>	 I am trying out the `ACK!` thing
[21:08:33] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wikikube-ctrl[2001-2002].codfw.wmnet with reason: Reimage --kamila
[21:08:47] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wikikube-ctrl[2001-2002].codfw.wmnet with reason: Reimage --kamila
[21:09:11] <sukhe>	 thanks all.
[21:09:18] <sukhe>	 marking as resolved as well.
[21:55:48] <jinxer-wm>	 FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:29:17] <icinga-wm_>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:29:53] <icinga-wm_>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:29:53] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:44:18] <wikibugs>	 (PS1) Volans: Adapt build system to latest images settings [software/homer/deploy] - https://gerrit.wikimedia.org/r/1047607
[22:44:18] <wikibugs>	 (PS1) Volans: Release v0.6.6 [software/homer/deploy] - https://gerrit.wikimedia.org/r/1047608
[22:44:53] <icinga-wm_>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:44:55] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:45:19] <icinga-wm_>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:05:12] <zabe>	 !log zabe@mwmaint1002:~$ mwscript createAndPromote.php u4cwiki Superpes15 REDACTED --bureaucrat --sysop
[23:05:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:24] <zabe>	 !log zabe@mwmaint1002:~$ mwscript createAndPromote.php arbcom_itwiki Superpes15 REDACTED --bureaucrat --sysop
[23:05:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:27] <icinga-wm_>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:31:53] <icinga-wm_>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:31:55] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:33:41] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:38:32] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1047610
[23:38:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1047610 (owner: TrainBranchBot)
[23:39:59] <wikibugs>	 (CR) Krinkle: [C:+1] rpc: Update function call in RunSingleJob (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1038785 (https://phabricator.wikimedia.org/T363839) (owner: Ladsgroup)
[23:50:43] <wikibugs>	 SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9908106 (Zabe) >>! In T367839#9902539, @MatthewVernon wrote: > Have the swift containers been generated for these wikis? I can't find any ob...