[00:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0000)
[00:10:35] <jinxer-wm>	 FIRING: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:35] <jinxer-wm>	 RESOLVED: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:38:23] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112861
[00:38:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112861 (owner: 10TrainBranchBot)
[00:44:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10478091 (10phaultfinder)
[00:59:14] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112861 (owner: 10TrainBranchBot)
[01:08:47] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112863
[01:08:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112863 (owner: 10TrainBranchBot)
[01:31:16] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112863 (owner: 10TrainBranchBot)
[01:33:27] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478122 (10thcipriani) > deployment POSIX group  Approved as `deployment` gr...
[01:33:54] <wikibugs>	 (03CR) 10Thcipriani: [C:03+1] admin/data: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (owner: 10Klausman)
[01:46:22] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/a762b343b40fe38171f766309bee9f00e5029cc1d5d72196fa007b9b4489dc54/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:06:22] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:08:16] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.13 [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1112864 (https://phabricator.wikimedia.org/T382364)
[02:08:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.13 [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1112864 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot)
[02:28:32] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.13 [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1112864 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot)
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:31] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0300)
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:10:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0400)
[04:01:44] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112868 (https://phabricator.wikimedia.org/T382364)
[04:01:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112868 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot)
[04:02:32] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112868 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot)
[04:02:58] <logmsgbot>	 !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.13  refs T382364
[04:03:02] <stashbot>	 T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364
[04:11:26] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:21:24] <icinga-wm>	 PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/9249d506f6c493ccd9a605f0a29558143bfeec6e067778a29a480114f9f6ac6b/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops
[04:41:24] <icinga-wm>	 RECOVERY - Disk space on deploy2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops
[04:51:26] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:54:20] <wikibugs>	 (03PS1) 10Kevin Bazira: changeprop: add liftwing article-country stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295)
[05:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0500)
[05:01:52] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.13  refs T382364 (duration: 58m 53s)
[05:01:55] <stashbot>	 T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364
[05:04:57] <logmsgbot>	 !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.8 (duration: 04m 55s)
[05:13:36] <icinga-wm>	 PROBLEM - Disk space on kafka-logging1004 is CRITICAL: DISK CRITICAL - free space: /srv 159774 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-logging1004&var-datasource=eqiad+prometheus/ops
[05:17:31] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[05:18:30] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[05:40:59] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2025-01-20-172318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112871 (https://phabricator.wikimedia.org/T377966)
[05:42:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:47:48] <icinga-wm>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:47:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:09:40] <wikibugs>	 (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: 10ZhaoFJx)
[06:17:31] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[06:18:30] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[06:23:10] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:45] <wikibugs>	 (03PS1) 10Marostegui: db2207,db2148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113014 (https://phabricator.wikimedia.org/T384272)
[06:32:24] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2207,db2148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113014 (https://phabricator.wikimedia.org/T384272) (owner: 10Marostegui)
[06:33:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2207 db2148 T384272', diff saved to https://phabricator.wikimedia.org/P72164 and previous config saved to /var/cache/conftool/dbconfig/20250121-063301-marostegui.json
[06:34:01] <wikibugs>	 (03CR) 10Anzx: [C:03+1] "looks good to me please schedule for backport, @zhaofjx@gmail.com you don't have to add reviewer unless you have any doubt, just saying pe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: 10ZhaoFJx)
[06:34:04] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10478451 (10Ladsgroup) All 16 containers of 00 to 0f have been cleaned up. Starting 10 to 1f now.
[06:34:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2212 with weight 0 T383690', diff saved to https://phabricator.wikimedia.org/P72165 and previous config saved to /var/cache/conftool/dbconfig/20250121-063416-root.json
[06:35:01] <logmsgbot>	 !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s1 T383690
[06:35:32] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2212 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1111266 (https://phabricator.wikimedia.org/T383690) (owner: 10Gerrit maintenance bot)
[06:37:26] <logmsgbot>	 !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db[2148,2207].codfw.wmnet with reason: Rebuild and upgrade db2207 db2148
[06:38:04] <logmsgbot>	 !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: Rebuild and upgrade db2189
[06:40:47] <marostegui>	 !log Starting s1 codfw failover from db2203 to db2212 - T383690
[06:41:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s1 codfw as read-only for maintenance - T383690', diff saved to https://phabricator.wikimedia.org/P72166 and previous config saved to /var/cache/conftool/dbconfig/20250121-064104-root.json
[06:43:21] <Amir1>	 Thanks!
[06:43:28] <marostegui>	 This is not going well though
[06:45:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:45:37] <marostegui>	 I will have to finish this manually, it got stuck on the old master semi sync
[06:45:38] <marostegui>	 great
[06:45:40] <marostegui>	 Amir1: ^
[06:45:53] <Amir1>	 shit
[06:46:05] <Amir1>	 what can I do to help?
[06:46:35] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'Promote db2212 to s1 primary and set section read-write T383690', diff saved to https://phabricator.wikimedia.org/P72167 and previous config saved to /var/cache/conftool/dbconfig/20250121-064634-root.json
[06:46:51] <marostegui>	 Amir1: can you check if you can edit enwiki now?
[06:46:56] <Amir1>	 sure
[06:47:23] <Amir1>	 edits are coming in
[06:47:36] <Amir1>	 my edits are getting saved too
[06:47:41] <marostegui>	 good
[06:49:10] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[06:49:15] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[06:50:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:50:24] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1111267 (https://phabricator.wikimedia.org/T383690) (owner: 10Gerrit maintenance bot)
[06:51:00] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[06:51:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2203 T383690', diff saved to https://phabricator.wikimedia.org/P72168 and previous config saved to /var/cache/conftool/dbconfig/20250121-065114-marostegui.json
[06:51:18] <stashbot>	 T383690: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T383690
[06:51:31] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[06:52:02] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: rebuilding index
[06:56:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2216 T384273', diff saved to https://phabricator.wikimedia.org/P72169 and previous config saved to /var/cache/conftool/dbconfig/20250121-065640-marostegui.json
[06:56:47] <stashbot>	 T384273: Rebuild db2203 - https://phabricator.wikimedia.org/T384273
[06:58:25] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: rebuilding index
[06:59:06] <wikibugs>	 (03PS1) 10Marostegui: db2203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113015 (https://phabricator.wikimedia.org/T384273)
[06:59:46] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113015 (https://phabricator.wikimedia.org/T384273) (owner: 10Marostegui)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0700)
[07:00:05] <jouncebot>	 marostegui and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0700).
[07:00:18] <Amir1>	 the jouncebot missed all the fun
[07:02:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2216.codfw.wmnet onto db2203.codfw.wmnet
[07:10:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:21:04] <wikibugs>	 (03CR) 10ZhaoFJx: "Thank you for information!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: 10ZhaoFJx)
[07:27:07] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Two bugfixes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1113070
[07:27:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Two bugfixes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1113070 (owner: 10Giuseppe Lavagetto)
[07:28:35] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1002"
[07:28:37] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1002
[07:29:08] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1002
[07:29:09] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1002"
[07:41:59] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@868de0c]: 202412 Backfill: Fixes on ExternalTaskMarker experiment
[07:42:31] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@868de0c]: 202412 Backfill: Fixes on ExternalTaskMarker experiment (duration: 00m 32s)
[07:56:35] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[07:59:26] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:01:11] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Mailing list for administrators of Indonesian projects - https://phabricator.wikimedia.org/T384135#10478525 (10Ladsgroup) 05Open→03Resolved done: https://lists.wikimedia.org/postorius/lists/wiki-id-admins.lists.wikimedia.org
[08:02:13] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[08:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:32] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:34:46] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira)
[08:34:50] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10478534 (10MoritzMuehlenhoff)
[08:36:13] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2024.codfw.wmnet with reason: remove from cluster for reimage
[08:36:18] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10478535 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=de610340-5385-4389-b2bb-b869e4134a65) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[08:48:18] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.debmonitor.remove-hosts: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655)
[08:50:44] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:51:58] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2216.codfw.wmnet onto db2203.codfw.wmnet
[08:58:49] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[08:59:08] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:00:05] <jouncebot>	 hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0900)
[09:01:47] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478569 (10isarantopoulos)
[09:03:52] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:04:09] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:07:19] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Thanks! I've suggested one alternative option inline. Up to you, LGTM in both cases." [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[09:10:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2024.codfw.wmnet with OS bookworm
[09:10:12] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10478580 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS bookworm
[09:14:54] <wikibugs>	 (03PS1) 10Brouberol: airflow: remove useless separator in pod spec confusing the config checksum computation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113077 (https://phabricator.wikimedia.org/T384275)
[09:15:54] <wikibugs>	 (03PS2) 10Brouberol: airflow: remove useless separator in pod spec confusing the config checksum computation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113077 (https://phabricator.wikimedia.org/T384275)
[09:16:55] <wikibugs>	 (03PS3) 10Brouberol: airflow: remove useless separator in pod spec confusing the config checksum computation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113077 (https://phabricator.wikimedia.org/T384275)
[09:22:50] <wikibugs>	 (03PS2) 10DCausse: wdqs: enable new event stream api config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112763 (https://phabricator.wikimedia.org/T374919)
[09:26:58] <wikibugs>	 (03PS2) 10Muehlenhoff: sre.debmonitor.remove-hosts: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655)
[09:27:13] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Map rest_v1/page/(html|title)/ to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1112188 (https://phabricator.wikimedia.org/T374683)
[09:28:09] <wikibugs>	 (03CR) 10Muehlenhoff: sre.debmonitor.remove-hosts: Reduce logging to SAL (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[09:28:52] <wikibugs>	 (03CR) 10DCausse: [C:03+2] wdqs: enable new event stream api config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112763 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse)
[09:29:17] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[09:29:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] Map rest_v1/page/(html|title)/ to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1112188 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris)
[09:29:37] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:29:39] <Emperor>	 !incidents
[09:29:40] <sirenbot>	 5622 (UNACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[09:29:40] <sirenbot>	 5611 (RESOLVED)  db2189 (paged)/MariaDB Replica SQL: s2 (paged)
[09:29:40] <sirenbot>	 5621 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr1-esams.wikimedia.org)
[09:29:40] <sirenbot>	 5620 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[09:29:40] <sirenbot>	 5619 (RESOLVED)  db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[09:29:41] <sirenbot>	 5618 (RESOLVED)  db2148 (paged)/MariaDB Replica SQL: s2 (paged)
[09:29:41] <sirenbot>	 5617 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[09:29:41] <sirenbot>	 5616 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[09:29:41] <sirenbot>	 5615 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[09:29:42] <sirenbot>	 5614 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[09:29:42] <sirenbot>	 5613 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr1-eqiad.wikimedia.org)
[09:29:43] <sirenbot>	 5612 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr1-eqiad.wikimedia.org)
[09:29:43] <effie>	 het 
[09:29:47] <Emperor>	 !ack 5622
[09:29:48] <sirenbot>	 5622 (ACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[09:29:48] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:29:56] <wikibugs>	 (03CR) 10Volans: sre.debmonitor.remove-hosts: Reduce logging to SAL (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[09:29:56] <wikibugs>	 (03Merged) 10jenkins-bot: wdqs: enable new event stream api config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112763 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse)
[09:30:07] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:30:19] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:30:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[09:30:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[09:30:31] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:32:01] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:32:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.debmonitor.remove-hosts: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[09:32:28] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:32:42] <wikibugs>	 (03PS3) 10Muehlenhoff: sre.debmonitor.remove-hosts: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655)
[09:33:16] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[09:33:33] <wikibugs>	 (03CR) 10David Caro: wmcs: Migrate iowait stalling alerts to the alerts.git repository (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[09:33:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2024.codfw.wmnet with reason: host reimage
[09:34:32] <wikibugs>	 (03CR) 10David Caro: wmcs: Migrate iowait stalling alerts to the alerts.git repository (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[09:34:33] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: remove useless separator in pod spec confusing the config checksum computation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113077 (https://phabricator.wikimedia.org/T384275) (owner: 10Brouberol)
[09:34:49] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: remove useless separator in pod spec confusing the config checksum computation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113077 (https://phabricator.wikimedia.org/T384275) (owner: 10Brouberol)
[09:35:10] <wikibugs>	 (03CR) 10David Caro: [C:03+1] wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[09:35:29] <wikibugs>	 (03PS1) 10DCausse: wdqs: add missing page_change_content_models config entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113080 (https://phabricator.wikimedia.org/T374919)
[09:37:34] <wikibugs>	 (03CR) 10DCausse: [C:03+2] wdqs: add missing page_change_content_models config entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113080 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse)
[09:37:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2024.codfw.wmnet with reason: host reimage
[09:38:52] <wikibugs>	 (03Merged) 10jenkins-bot: wdqs: add missing page_change_content_models config entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113080 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse)
[09:39:07] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:39:17] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[09:39:26] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:42:25] <wikibugs>	 (03CR) 10David Caro: wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[09:43:00] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "Just the comment leftover, LGTM otherwise" [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[09:44:42] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM as is, just the comments need updating, thanks a lot!" [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[09:45:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72172 and previous config saved to /var/cache/conftool/dbconfig/20250121-094537-root.json
[09:46:31] <wikibugs>	 (03PS1) 10DCausse: wdqs: add missing config entry main_output_stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113082 (https://phabricator.wikimedia.org/T374919)
[09:46:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72173 and previous config saved to /var/cache/conftool/dbconfig/20250121-094637-root.json
[09:47:25] <godog>	 !log set udp_localhost-info retention.bytes=100000000000 on kafka-logging - T384233
[09:47:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:29] <stashbot>	 T384233: Unexpected utilization increase in udp_localhost-info kafka-logging topic - https://phabricator.wikimedia.org/T384233
[09:47:42] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:49:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] sre.debmonitor.remove-hosts: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[09:50:16] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] "I am not familiar with the specific traffic patterns, but the alert declaration LGTM." [alerts] - 10https://gerrit.wikimedia.org/r/1111300 (https://phabricator.wikimedia.org/T373459) (owner: 10DCausse)
[09:50:24] <wikibugs>	 (03CR) 10DCausse: [C:03+2] wdqs: add missing config entry main_output_stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113082 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse)
[09:51:44] <wikibugs>	 (03Merged) 10jenkins-bot: wdqs: add missing config entry main_output_stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113082 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse)
[09:52:18] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:52:43] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:53:03] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.ganeti.resource-report: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655)
[09:53:36] <icinga-wm>	 RECOVERY - Disk space on kafka-logging1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-logging1004&var-datasource=eqiad+prometheus/ops
[09:57:21] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.idm.logout: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113087 (https://phabricator.wikimedia.org/T324655)
[09:57:23] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.puppet.renew-cert: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113088 (https://phabricator.wikimedia.org/T324655)
[09:57:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:58:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2024.codfw.wmnet with OS bookworm
[09:58:32] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10478723 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS bookworm completed: - ganeti202...
[10:00:32] <godog>	 !log set udp_localhost-info retention.bytes=300000000000 on kafka-logging (back to original value) - T384233
[10:00:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:36] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478726 (10isarantopoulos)
[10:00:36] <stashbot>	 T384233: Unexpected utilization increase in udp_localhost-info kafka-logging topic - https://phabricator.wikimedia.org/T384233
[10:00:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72174 and previous config saved to /var/cache/conftool/dbconfig/20250121-100042-root.json
[10:01:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72175 and previous config saved to /var/cache/conftool/dbconfig/20250121-100142-root.json
[10:01:57] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T384281 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[10:02:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384281 (10ops-monitoring-bot) 03NEW
[10:03:05] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove config to shift AT&T traffic away from Lumen transit [homer/public] - 10https://gerrit.wikimedia.org/r/1113090 (https://phabricator.wikimedia.org/T384253)
[10:03:55] <moritzm>	 !log installing intel-microcode security updates
[10:03:57] <moritzm>	 !log installing python-tornado security updates
[10:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:16] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Remove config to shift AT&T traffic away from Lumen transit [homer/public] - 10https://gerrit.wikimedia.org/r/1113090 (https://phabricator.wikimedia.org/T384253) (owner: 10Cathal Mooney)
[10:04:58] <wikibugs>	 (03Merged) 10jenkins-bot: Remove config to shift AT&T traffic away from Lumen transit [homer/public] - 10https://gerrit.wikimedia.org/r/1113090 (https://phabricator.wikimedia.org/T384253) (owner: 10Cathal Mooney)
[10:08:57] <wikibugs>	 (03CR) 10Jelto: [C:04-1] "Looks mostly good but I left some comments in-line." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[10:10:28] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478740 (10isarantopoulos) I approve both as a manager and owner of the ml g...
[10:10:43] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478742 (10isarantopoulos)
[10:11:19] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[10:11:23] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478744 (10isarantopoulos)
[10:11:50] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[10:12:30] <effie>	 jouncebot: now
[10:12:30] <jouncebot>	 For the next 0 hour(s) and 47 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0900)
[10:12:34] <effie>	 jouncebot: next
[10:12:35] <jouncebot>	 In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1100)
[10:15:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72176 and previous config saved to /var/cache/conftool/dbconfig/20250121-101548-root.json
[10:16:38] <wikibugs>	 (03PS4) 10JMeybohm: admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984)
[10:16:39] <wikibugs>	 (03PS9) 10JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984)
[10:16:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2024 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1112767 (owner: 10Muehlenhoff)
[10:16:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72177 and previous config saved to /var/cache/conftool/dbconfig/20250121-101648-root.json
[10:18:47] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] wikikube: rename mw147[0-5] -> wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1112828 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková)
[10:19:07] <wikibugs>	 (03CR) 10Jelto: [C:04-1] miscweb: support os-reports deployment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[10:20:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
[10:26:42] <topranks>	 !log adjust VRRP priorities for public and analytics vlans on eqiad CRs to balance traffic 
[10:26:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:06] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Should be okay to deploy at any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) (owner: 10Novem Linguae)
[10:29:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
[10:30:09] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4838/co" [puppet] - 10https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[10:30:11] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Manage VRRP priority from Netbox - https://phabricator.wikimedia.org/T381873#10478784 (10cmooney) 05Open→03Resolved a:03cmooney This is all complete and I've set priorities in Netbox to balance traffic from the 4 legacy rows in eqiad across the CRs there.
[10:30:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) (owner: 10Novem Linguae)
[10:30:52] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[10:30:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72179 and previous config saved to /var/cache/conftool/dbconfig/20250121-103053-root.json
[10:31:19] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: site expansion for kafka::logging role description [puppet] - 10https://gerrit.wikimedia.org/r/1113096
[10:31:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72180 and previous config saved to /var/cache/conftool/dbconfig/20250121-103153-root.json
[10:32:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: "The motd doesn't get updated because the resulting shell script fails:" [puppet] - 10https://gerrit.wikimedia.org/r/1113096 (owner: 10Filippo Giunchedi)
[10:33:37] <icinga-wm>	 RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 81.62 ms
[10:34:25] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: fix site expansion for role description [puppet] - 10https://gerrit.wikimedia.org/r/1113096
[10:35:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: recording rules for mw edit rates [puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi)
[10:40:40] <topranks>	 !log de-pref Chicago routes learnt on core routers in Dallas
[10:40:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10478825 (10cmooney) FWIW I have made the same change in codfw for routes learnt from eqord (Chicago).  Locally-learnt routes will now be preferred unless the AS-Path from Chicago...
[10:45:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72181 and previous config saved to /var/cache/conftool/dbconfig/20250121-104559-root.json
[10:46:20] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1113098 (https://phabricator.wikimedia.org/T384284)
[10:46:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72182 and previous config saved to /var/cache/conftool/dbconfig/20250121-104658-root.json
[10:51:32] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[10:54:51] <wikibugs>	 (03PS1) 10Btullis: Temporarily disable gobblin timers on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1113101 (https://phabricator.wikimedia.org/T380619)
[10:55:18] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM, just a note there" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[10:55:55] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Temporarily disable gobblin timers on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1113101 (https://phabricator.wikimedia.org/T380619) (owner: 10Btullis)
[10:56:00] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Temporarily disable gobblin timers on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1113101 (https://phabricator.wikimedia.org/T380619) (owner: 10Btullis)
[10:58:09] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1100)
[11:00:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2024.codfw.wmnet to cluster codfw and group A
[11:01:22] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10478911 (10MoritzMuehlenhoff)
[11:01:38] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2024.codfw.wmnet to cluster codfw and group A
[11:02:35] <effie>	 jouncebot: now
[11:02:35] <jouncebot>	 For the next 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1100)
[11:03:03] <icinga-wm>	 RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[11:03:38] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mw-(web|api-ext)-next: bump replicas and update TODO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112078 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[11:04:51] <wikibugs>	 (03Merged) 10jenkins-bot: mw-(web|api-ext)-next: bump replicas and update TODO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112078 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[11:05:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113096 (owner: 10Filippo Giunchedi)
[11:05:49] <wikibugs>	 (03PS1) 10Brouberol: airflow: re-introduce KRB5_KEYTAB in the task pod env [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113103 (https://phabricator.wikimedia.org/T384282)
[11:06:01] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, output of upstream diff looks similar `~/git/calico$ git diff --stat v3.23.3 v3.29.1 -- ./libcalico-go/config/crd`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[11:07:34] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "lgtm! I've noticed that the templating for liftwing rules adds the comment header before each rule - not something to be fixed in this rev" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira)
[11:07:58] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: re-introduce KRB5_KEYTAB in the task pod env [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113103 (https://phabricator.wikimedia.org/T384282) (owner: 10Brouberol)
[11:08:30] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[11:09:01] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:09:29] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:10:04] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1113104 (https://phabricator.wikimedia.org/T384287)
[11:10:08] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1113105 (https://phabricator.wikimedia.org/T384287)
[11:10:17] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:10:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:10:36] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:11:06] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[11:11:58] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:12:01] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:12:26] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti2019 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113106
[11:12:53] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[11:12:53] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/research AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[11:13:18] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:13:21] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:13:42] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:14:03] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[11:14:12] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:14:28] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: re-introduce KRB5_KEYTAB in the task pod env [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113103 (https://phabricator.wikimedia.org/T384282) (owner: 10Brouberol)
[11:14:38] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113088 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[11:14:54] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113087 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[11:15:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288 (10cmooney) 03NEW p:05Triage→03Medium
[11:15:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10478967 (10cmooney)
[11:16:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10478971 (10cmooney)
[11:18:10] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[11:18:47] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[11:19:00] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[11:19:46] <wikibugs>	 (03CR) 10Volans: "LGTM, but I've suggested how to make it not log at all to SAL" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[11:22:06] <wikibugs>	 (03PS4) 10Máté Szabó: Enable electionadmin user group on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: 10Dreamrimmer)
[11:23:11] <wikibugs>	 (03PS2) 10Muehlenhoff: sre.idm.logout: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113087 (https://phabricator.wikimedia.org/T324655)
[11:25:40] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:25:50] <wikibugs>	 (03PS1) 10Brouberol: airflow-analytics: migrate scheduler and database to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113108 (https://phabricator.wikimedia.org/T380619)
[11:26:15] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:26:45] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[11:26:51] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:27:06] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:28:34] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:28:37] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:29:19] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[11:29:43] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:29:54] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:30:37] <icinga-wm>	 PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet loss = 100%
[11:30:38] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:30:40] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:31:28] <wikibugs>	 (03PS2) 10Brouberol: airflow-analytics: migrate scheduler and database to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113108 (https://phabricator.wikimedia.org/T380619)
[11:31:45] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[11:32:18] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:32:20] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[11:32:40] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[11:33:03] <icinga-wm>	 PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%): /tmp 0 MB (0% inode=86%): /var/tmp 0 MB (0% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[11:34:07] <icinga-wm>	 RECOVERY - Host restbase2037 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms
[11:34:36] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) (owner: 10Audrey Penven)
[11:34:42] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[11:35:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:36:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:37:02] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479063 (10jcrespo) Indeed, that's documented at...
[11:37:53] <wikibugs>	 (03PS3) 10Brouberol: airflow-analytics: migrate scheduler and database to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113108 (https://phabricator.wikimedia.org/T380619)
[11:38:09] <icinga-wm>	 PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:38:40] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-analytics: migrate scheduler and database to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113108 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[11:38:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] sre.idm.logout: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113087 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[11:38:53] <wikibugs>	 (03PS2) 10Muehlenhoff: sre.puppet.renew-cert: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113088 (https://phabricator.wikimedia.org/T324655)
[11:39:23] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:40:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:41:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:42:18] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:42:42] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] rest-gateway: add params to config, rework citoid path matching (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[11:43:48] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: add params to config, rework citoid path matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[11:44:50] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: rebuilding index
[11:45:08] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479077 (10jcrespo)
[11:45:37] <icinga-wm>	 PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet loss = 100%
[11:47:13] <wikibugs>	 (03PS5) 10Scott French: service::catalog: enable monitoring for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1101124 (https://phabricator.wikimedia.org/T377040)
[11:47:18] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:47:20] <hnowlan>	 that seems bad
[11:47:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72184 and previous config saved to /var/cache/conftool/dbconfig/20250121-114728-root.json
[11:47:31] <claime>	 err yeah
[11:47:46] <hnowlan>	 I'll depool it
[11:47:56] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] service::catalog: enable monitoring for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1101124 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[11:48:15] <icinga-wm>	 RECOVERY - Host restbase2037 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms
[11:48:29] <logmsgbot>	 !log hnowlan@cumin2002 conftool action : set/pooled=no; selector: name=restbase2037.codfw.wmnet
[11:48:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72185 and previous config saved to /var/cache/conftool/dbconfig/20250121-114836-root.json
[11:49:42] <wikibugs>	 (03CR) 10Klausman: [C:03+1] changeprop: add liftwing article-country stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira)
[11:49:50] <hnowlan>	 I don't really care if it's back, looks like the host has bad memory 
[11:49:53] <hnowlan>	 https://phabricator.wikimedia.org/T383820
[11:50:36] <wikibugs>	 10ops-codfw, 06SRE, 10Cassandra, 06DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10479104 (10hnowlan) This host went down again this morning, same DIMM errors. I've depooled it for the time being.  ` 11:30 <+icinga-wm> PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet...
[11:50:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:51:41] <hnowlan>	 ^ possibly a knock-on?
[11:52:06] <kart_>	 OK to deploy cxserver?
[11:53:55] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479109 (10MoritzMuehlenhoff)
[11:54:07] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:54:23] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:54:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[11:54:55] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479112 (10ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[11:54:57] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:55:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:56:06] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479118 (10jcrespo)
[11:56:52] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479119 (10jcrespo) I will be adding now the LDAP...
[11:57:29] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:57:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[11:59:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[11:59:14] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2019.codfw.wmnet
[12:00:06] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479138 (10jcrespo) >>! In T384239#10478122, @thc...
[12:00:10] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.122 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:00:22] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:00:43] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "change and diff looks reasonable to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[12:00:48] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:00:49] <kart_>	 I'll just go ahead :)
[12:01:03] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-01-20-172318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112871 (https://phabricator.wikimedia.org/T377966) (owner: 10KartikMistry)
[12:01:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to drbd
[12:02:25] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2025-01-20-172318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112871 (https://phabricator.wikimedia.org/T377966) (owner: 10KartikMistry)
[12:02:34] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1112226 (https://phabricator.wikimedia.org/T367315) (owner: 10Muehlenhoff)
[12:02:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72186 and previous config saved to /var/cache/conftool/dbconfig/20250121-120234-root.json
[12:02:54] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1112228 (https://phabricator.wikimedia.org/T367315) (owner: 10Muehlenhoff)
[12:03:34] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:03:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72187 and previous config saved to /var/cache/conftool/dbconfig/20250121-120341-root.json
[12:04:22] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2189.codfw.wmnet
[12:04:23] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:04:54] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:05:08] <icinga-wm>	 PROBLEM - SSH on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:05:08] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:05:16] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[12:05:49] <federico3>	 !log updating db2189.codfw.wmnet for https://phabricator.wikimedia.org/T384202
[12:05:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:16] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479174 (10ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet switching disk type to drbd
[12:07:18] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:07:25] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] changeprop: add liftwing article-country stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira)
[12:07:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:08:00] <icinga-wm>	 RECOVERY - SSH on ms-fe1014 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:08:28] <icinga-wm>	 PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet loss = 100%
[12:08:46] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.927 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:08:52] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: add liftwing article-country stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira)
[12:08:56] <hnowlan>	 downtiming restbase2037 for a day
[12:08:57] <logmsgbot>	 !log hnowlan@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on restbase2037.codfw.wmnet with reason: Memory issues, rebooting frequently. Depooled. T383820
[12:08:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479189 (10jcrespo) WMF LDAP group added: https://...
[12:09:00] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:09:01] <stashbot>	 T383820: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820
[12:09:04] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2189.codfw.wmnet
[12:09:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:09:48] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[12:10:20] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[12:12:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:13:12] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:14:04] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:14:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:14:53] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[12:15:28] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[12:15:53] <kart_>	 !log Updated cxserver to 2025-01-20-172318-production (T377966, T377813)
[12:15:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:58] <stashbot>	 T377966: Make cxserver Logstash logs readable and reliable - https://phabricator.wikimedia.org/T377966
[12:15:58] <stashbot>	 T377813: Migrate cxserver code from CommonJS to ESM / ECMAScript - https://phabricator.wikimedia.org/T377813
[12:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:17:00] <icinga-wm>	 PROBLEM - MD RAID on ms-fe1014 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:17:01] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-fe1014 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T384297 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:17:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-fe1014 - https://phabricator.wikimedia.org/T384297 (10ops-monitoring-bot) 03NEW
[12:17:13] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479205 (10jcrespo) @SuzanneWood-WMDE A reminder that this is mainly blocked on you providing your public ssh key out of band and your manager confirming/approving the request.
[12:17:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: fix site expansion for role description [puppet] - 10https://gerrit.wikimedia.org/r/1113096 (owner: 10Filippo Giunchedi)
[12:17:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72189 and previous config saved to /var/cache/conftool/dbconfig/20250121-121739-root.json
[12:18:16] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:39] <arnaudb>	 nftables
[12:18:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72190 and previous config saved to /var/cache/conftool/dbconfig/20250121-121847-root.json
[12:18:51] <arnaudb>	 oops wrong window
[12:19:12] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.242 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:21:50] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:22:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] sre.puppet.renew-cert: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113088 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff)
[12:22:18] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:23:08] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.571 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:27:35] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply
[12:27:52] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[12:29:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2004.codfw.wmnet to drbd
[12:31:05] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10479227 (10jcrespo) a:05DSantamaria→03jcrespo You can prove your identity based on your linked account here on phab, based on phab...
[12:31:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[12:32:02] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479233 (10ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[12:32:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[12:32:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[12:32:20] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[12:32:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain
[12:32:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72193 and previous config saved to /var/cache/conftool/dbconfig/20250121-123245-root.json
[12:32:55] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:33:07] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479235 (10ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet switching disk type to plain
[12:33:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain
[12:33:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72194 and previous config saved to /var/cache/conftool/dbconfig/20250121-123352-root.json
[12:34:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:34:53] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.898 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:37:55] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:38:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[12:38:35] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479244 (10ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[12:40:30] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Remove set user permissions from m1 backup user grants [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902)
[12:40:31] <wikibugs>	 (03PS1) 10Jcrespo: admin: Add dsantamaria to the list of ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/1113122 (https://phabricator.wikimedia.org/T384169)
[12:41:13] <wikibugs>	 (03PS2) 10Jcrespo: admin: Add dsantamaria to the list of ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/1113122 (https://phabricator.wikimedia.org/T384169)
[12:41:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:42:13] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.097 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:42:53] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.947 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:47:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:47:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72197 and previous config saved to /var/cache/conftool/dbconfig/20250121-124750-root.json
[12:48:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72198 and previous config saved to /var/cache/conftool/dbconfig/20250121-124857-root.json
[12:49:13] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:49:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:52:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10479262 (10cmooney) So looking at a specific peer - 2620:0:863:1:198:35:26:6 on cr4-ulsfo - I can see the SNMP 'index...
[12:53:03] <icinga-wm>	 RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[12:53:09] <wikibugs>	 (03PS1) 10Effie Mouzeli: php8.1-cli: introduce opcache and JIT [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113124 (https://phabricator.wikimedia.org/T384294)
[12:53:27] <wikibugs>	 (03PS1) 10Elukey: mapnik: fix paths for mapnik directories [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285)
[12:54:14] <MatmaRex>	 i just saw a different Gerrit interface for a moment, then it went down
[12:54:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:54:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:54:43] <MatmaRex>	 i don't see anything in SAL, was that expected?
[12:58:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove firewall rule for rsync on archiva [puppet] - 10https://gerrit.wikimedia.org/r/1112226 (https://phabricator.wikimedia.org/T367315) (owner: 10Muehlenhoff)
[12:58:32] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479280 (10jcrespo)
[12:59:02] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[12:59:15] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[12:59:28] <wikibugs>	 (03PS2) 10Effie Mouzeli: php8.1-cli: introduce opcache and JIT [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113124 (https://phabricator.wikimedia.org/T384294)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1300)
[13:00:10] <wikibugs>	 (03PS1) 10David Caro: toolforge::base: add cron to all boxes [puppet] - 10https://gerrit.wikimedia.org/r/1113128 (https://phabricator.wikimedia.org/T384250)
[13:00:36] <wikibugs>	 (03PS6) 10Jcrespo: admin: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman)
[13:01:06] <wikibugs>	 (03CR) 10Jcrespo: admin: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman)
[13:01:07] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[13:01:15] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[13:01:49] <wikibugs>	 06SRE, 10superset.wikimedia.org: Degraded Superset functionality during a high-traffic incident - https://phabricator.wikimedia.org/T384301 (10MatthewVernon) 03NEW
[13:02:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:02:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] toolforge::base: add cron to all boxes [puppet] - 10https://gerrit.wikimedia.org/r/1113128 (https://phabricator.wikimedia.org/T384250) (owner: 10David Caro)
[13:03:55] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:04:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 4.560 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:07:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:09:03] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479371 (10SuzanneWood-WMDE) @WMDECyn can you please approve?
[13:09:23] <effie>	 !incidents
[13:09:24] <sirenbot>	 5623 (UNACKED)  Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[13:09:24] <sirenbot>	 5622 (RESOLVED)  NELHigh sre (thanos-rule tcp.timed_out)
[13:09:24] <sirenbot>	 5611 (RESOLVED)  db2189 (paged)/MariaDB Replica SQL: s2 (paged)
[13:09:24] <sirenbot>	 5621 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr1-esams.wikimedia.org)
[13:09:25] <sirenbot>	 5620 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[13:09:25] <sirenbot>	 5619 (RESOLVED)  db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[13:09:25] <sirenbot>	 5618 (RESOLVED)  db2148 (paged)/MariaDB Replica SQL: s2 (paged)
[13:09:25] <sirenbot>	 5617 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[13:09:26] <sirenbot>	 5616 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[13:09:26] <sirenbot>	 5615 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[13:09:27] <sirenbot>	 5614 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[13:09:27] <sirenbot>	 5613 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr1-eqiad.wikimedia.org)
[13:09:28] <sirenbot>	 5612 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr1-eqiad.wikimedia.org)
[13:09:36] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479373 (10SuzanneWood-WMDE) Hi @jcrespo - sorry I don't understand "providing your public ssh key out of band", what do I need to do?
[13:10:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:11:32] <wikibugs>	 06SRE, 06Data-Platform-SRE, 10superset.wikimedia.org: Degraded Superset functionality during a high-traffic incident - https://phabricator.wikimedia.org/T384301#10479385 (10BTullis) Tagging this with #data-platform-sre for triage. I suspect that the errors in Superset may have been caused by timeouts queryin...
[13:11:46] <wikibugs>	 (03PS7) 10Jcrespo: admin: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman)
[13:15:40] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "Did some changes to commit message and patch, asking for an SRE sanity check before rebase and deploy." [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman)
[13:15:49] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479405 (10WMDECyn) Sorry for late response, approving this request from WMDE side
[13:16:41] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[13:17:18] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[13:20:07] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons.
[13:21:18] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-analytics: migrate scheduler and database to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113108 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[13:22:45] <logmsgbot>	 !log btullis@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-launcher1002.eqiad.wmnet with reason: Migrating to kubernetes
[13:22:55] <logmsgbot>	 !log btullis@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-launcher1002.eqiad.wmnet with reason: Migrating to kubernetes
[13:23:34] <wikibugs>	 (03PS1) 10Jelto: gerrit: block alibaba Cloud IPs [puppet] - 10https://gerrit.wikimedia.org/r/1113133
[13:24:43] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] gerrit: block alibaba Cloud IPs [puppet] - 10https://gerrit.wikimedia.org/r/1113133 (owner: 10Jelto)
[13:24:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:25:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.756 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:26:34] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: block alibaba Cloud IPs [puppet] - 10https://gerrit.wikimedia.org/r/1113133 (owner: 10Jelto)
[13:26:57] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[13:27:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[13:27:37] <wikibugs>	 (03PS3) 10Effie Mouzeli: php8.1-cli: introduce opcache and JIT [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113124 (https://phabricator.wikimedia.org/T384294)
[13:28:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:30:12] <wikibugs>	 (03PS1) 10Jelto: gerrit: lower throttling threshold to 15 parallel connections [puppet] - 10https://gerrit.wikimedia.org/r/1113135
[13:30:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:30:55] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.521 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:32:15] <wikibugs>	 14SRE-Sprint-Week-Sustainability-March2023, 06Data-Persistence-Automations, 06DBA, 13Patch-For-Review, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#10479472 (10Marostegui) @FCeratto-WMF t...
[13:32:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:34:14] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] gerrit: lower throttling threshold to 15 parallel connections [puppet] - 10https://gerrit.wikimedia.org/r/1113135 (owner: 10Jelto)
[13:34:33] <wikibugs>	 (03CR) 10LSobanski: [C:03+1] gerrit: lower throttling threshold to 15 parallel connections [puppet] - 10https://gerrit.wikimedia.org/r/1113135 (owner: 10Jelto)
[13:35:23] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:35:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479488 (10jcrespo) >>! In T384018#10479373, @SuzanneWood-WMDE wrote: > Hi @jcrespo - sorry I don't understand "providing your public ssh key out of band", what do I need to do?  Ye...
[13:35:53] <wikibugs>	 (03PS1) 10Brouberol: airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619)
[13:36:17] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.584 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:36:40] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: lower throttling threshold to 15 parallel connections [puppet] - 10https://gerrit.wikimedia.org/r/1113135 (owner: 10Jelto)
[13:36:43] <wikibugs>	 (03PS1) 10Mvolz: Update Zotero translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113137 (https://phabricator.wikimedia.org/T384165)
[13:37:20] <wikibugs>	 (03CR) 10Btullis: [C:03+2] airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[13:37:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:37:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[13:38:12] <wikibugs>	 (03CR) 10Brouberol: [V:03+2] airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[13:38:50] <wikibugs>	 (03CR) 10Btullis: [V:03+2 C:03+2] airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[13:38:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update Zotero translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113137 (https://phabricator.wikimedia.org/T384165) (owner: 10Mvolz)
[13:39:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10479515 (10cmooney) >>! In T384258#10477783, @ssingh wrote: > Might be a red herring: The only thing I see that might...
[13:40:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:41:07] <wikibugs>	 (03Merged) 10jenkins-bot: airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[13:43:43] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[13:43:49] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[13:45:23] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:45:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:46:33] <wikibugs>	 (03PS1) 10DCausse: flink-app: better support for properties file format [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113139
[13:47:17] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.552 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:47:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:49:27] <wikibugs>	 (03PS1) 10Pmiazga: Disable new WebAuthn credentials creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113141 (https://phabricator.wikimedia.org/T378402)
[13:49:38] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "Create certificates for Typha/Felix mTLS" [puppet] - 10https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[13:51:12] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[13:51:15] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[13:51:53] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team, 13Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479562 (10jcrespo) a:03jc...
[13:52:48] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479563 (10jcrespo)
[13:53:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113122 (https://phabricator.wikimedia.org/T384169) (owner: 10Jcrespo)
[13:53:22] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10479565 (10MoritzMuehlenhoff) @DSantamaria As a note for future reference: These days the simpler process is to si...
[13:54:50] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=no; selector: name=ms-fe1014.eqiad.wmnet
[13:55:02] <Emperor>	 !log hard-reboot ms-fe1014
[13:55:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:07] <yerdua_wmde>	 I have a config change in the window coming up soon, but will only be available from 14:30 UTC onward
[13:55:07] <wikibugs>	 (03Merged) 10jenkins-bot: Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[13:55:13] <wikibugs>	 (03Merged) 10jenkins-bot: Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[13:55:45] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10479568 (10jcrespo) Thanks, @MoritzMuehlenhoff , this was sort of something I realized later, on my side, as it ha...
[13:56:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM. Please note that the Kerberos principal must be created separately after merging as documented here: https://wikitech.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman)
[13:56:48] <wikibugs>	 (03PS1) 10Brouberol: airlow: restore Api kerberos auth by mounting the keytab into the webserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113143 (https://phabricator.wikimedia.org/T384282)
[13:57:11] <icinga-wm>	 PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100%
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1400).
[14:00:05] <jouncebot>	 ihurbain, DreamRimmer, and yerdua_wmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:43] <ihurbain>	 o/ hello - i'd need a deployer pretty please!
[14:00:52] <Lucas_WMDE>	 o/
[14:01:17] <Lucas_WMDE>	 I can deploy
[14:01:18] <ihurbain>	 (assuming gerrit is now stable enough, though)
[14:01:21] <Lucas_WMDE>	 assuming Gerrit cooperates
[14:01:25] <ihurbain>	 Lucas_WMDE: that'd be most appreciated :)
[14:01:27] <ihurbain>	 hah
[14:01:28] <DreamRimmer>	 o/
[14:02:17] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317 (10MatthewVernon) 03NEW
[14:02:35] <Lucas_WMDE>	 let’s try our luck
[14:02:44] <Lucas_WMDE>	 actually, one sec
[14:02:47] <wikibugs>	 (03PS3) 10Jcrespo: admin: Add dsantamaria to the list of ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/1113122 (https://phabricator.wikimedia.org/T384169)
[14:02:48] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113143 (https://phabricator.wikimedia.org/T384282) (owner: 10Brouberol)
[14:03:02] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317#10479635 (10MatthewVernon) p:05Triage→03High
[14:03:41] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] admin: Add dsantamaria to the list of ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/1113122 (https://phabricator.wikimedia.org/T384169) (owner: 10Jcrespo)
[14:03:52] <Lucas_WMDE>	 waiting before deployment per #_security
[14:04:00] <ihurbain>	 nod
[14:06:18] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airlow: restore Api kerberos auth by mounting the keytab into the webserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113143 (https://phabricator.wikimedia.org/T384282) (owner: 10Brouberol)
[14:09:18] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons.
[14:09:27] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[14:10:08] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[14:10:08] <anzx>	 Lucas_WMDE: could you check whether https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1112563 can be merged?
[14:11:00] <Lucas_WMDE>	 anzx: that looks unrelated to the deployment window?
[14:11:14] <anzx>	 yes unrelated
[14:14:20] <wikibugs>	 (03PS4) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195)
[14:14:20] <wikibugs>	 (03PS1) 10Vgutierrez: acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837)
[14:14:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin)
[14:14:50] <Lucas_WMDE>	 ihurbain: starting now
[14:14:55] <wikibugs>	 (03PS2) 10Pmiazga: Disable new WebAuthn credentials creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113141 (https://phabricator.wikimedia.org/T378402)
[14:14:56] <ihurbain>	 thank you :)
[14:15:26] <wikibugs>	 (03Merged) 10jenkins-bot: Remove KartographerParsoidSupport flag from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin)
[14:15:54] * Lucas_WMDE watches zuul pull the new wmf.13 branch in allllll the repositories
[14:16:00] <Lucas_WMDE>	 uh. s/zuul/scap/ lol
[14:16:19] <ihurbain>	 such fun! :P
[14:16:20] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] "I appreciate the reminder!" [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman)
[14:16:22] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10479732 (10jcrespo) @DSantamaria : you have been added to the wmf group, which means you can now access to superse...
[14:16:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1111932|Remove KartographerParsoidSupport flag from configuration (T340134)]]
[14:16:30] <stashbot>	 T340134: Feature flag addition/removal for Parsoid support for Kartographer - https://phabricator.wikimedia.org/T340134
[14:16:47] <wikibugs>	 (03PS8) 10Jcrespo: admin: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman)
[14:16:51] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4839/console" [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez)
[14:17:45] <wikibugs>	 (03PS2) 10Elukey: mapnik: fix paths for mapnik directories [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285)
[14:18:18] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] "neat! Left you a question about edge cases. Merge at will if it is not relevant." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113139 (owner: 10DCausse)
[14:18:30] <wikibugs>	 (03CR) 10Elukey: "Final result:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey)
[14:19:31] <wikibugs>	 (03PS5) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195)
[14:19:55] <wikibugs>	 (03PS1) 10Brouberol: airflow-analytics: remove import configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113145 (https://phabricator.wikimedia.org/T380619)
[14:20:05] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[14:20:27] <wikibugs>	 (03PS1) 10Filippo Giunchedi: chartmuseum: remove icinga-based http checks [puppet] - 10https://gerrit.wikimedia.org/r/1113146 (https://phabricator.wikimedia.org/T384324)
[14:21:03] <Emperor>	 !incidents
[14:21:03] <sirenbot>	 5623 (ACKED)  Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[14:21:03] <sirenbot>	 5622 (RESOLVED)  NELHigh sre (thanos-rule tcp.timed_out)
[14:21:04] <sirenbot>	 5621 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr1-esams.wikimedia.org)
[14:21:04] <sirenbot>	 5620 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[14:21:04] <sirenbot>	 5619 (RESOLVED)  db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[14:21:04] <sirenbot>	 5618 (RESOLVED)  db2148 (paged)/MariaDB Replica SQL: s2 (paged)
[14:21:04] <sirenbot>	 5617 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[14:21:05] <sirenbot>	 5616 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[14:21:05] <sirenbot>	 5615 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[14:21:06] <sirenbot>	 5614 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[14:21:06] <sirenbot>	 5613 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr1-eqiad.wikimedia.org)
[14:21:07] <sirenbot>	 5612 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr1-eqiad.wikimedia.org)
[14:21:09] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-analytics: remove import configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113145 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[14:21:14] <Emperor>	 !resolve 5623
[14:21:15] <sirenbot>	 5623 (RESOLVED)  Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[14:21:22] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10479772 (10MoritzMuehlenhoff) >>! In T384169#10479568, @jcrespo wrote: > Thanks, @MoritzMuehlenhoff , this was sor...
[14:21:24] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-analytics: remove import configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113145 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[14:22:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[14:23:00] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[14:24:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 ihurbain, lucaswerkmeister-wmde: Backport for [[gerrit:1111932|Remove KartographerParsoidSupport flag from configuration (T340134)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:24:53] <stashbot>	 T340134: Feature flag addition/removal for Parsoid support for Kartographer - https://phabricator.wikimedia.org/T340134
[14:25:02] <wikibugs>	 (03PS2) 10Vgutierrez: acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837)
[14:25:02] <wikibugs>	 (03PS6) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195)
[14:25:04] <ihurbain>	 testing
[14:25:12] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] mapnik: fix paths for mapnik directories [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey)
[14:25:58] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[14:26:18] <wikibugs>	 (03CR) 10Ssingh: acme_chief: Fix handling of default account (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez)
[14:26:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[14:26:52] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4840/console" [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez)
[14:26:56] <ihurbain>	 Lucas_WMDE: looks good from here on mwdebug, you can proceed
[14:27:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 ihurbain, lucaswerkmeister-wmde: Continuing with sync
[14:27:03] <Lucas_WMDE>	 ok, thanks!
[14:27:13] <Lucas_WMDE>	 (was there anything to test beyond “it’s not broken”? just curious ^^)
[14:27:18] <ihurbain>	 (no)
[14:27:21] <Lucas_WMDE>	 ok ^^
[14:27:58] <ihurbain>	 (well, the "it's not broken" involves "cleaning up a few page caches and double checking that kartographer is still behaving with parsoid")
[14:28:25] <Lucas_WMDE>	 ok, cool
[14:28:54] <wikibugs>	 (03CR) 10DCausse: [C:03+2] flink-app: better support for properties file format (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113139 (owner: 10DCausse)
[14:29:01] <ihurbain>	 and i was PREPARED :D
[14:29:09] <Lucas_WMDE>	 that’s always good :D
[14:30:23] <wikibugs>	 (03CR) 10Jcrespo: [V:03+2 C:03+2] admin: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman)
[14:30:33] <wikibugs>	 (03Merged) 10jenkins-bot: flink-app: better support for properties file format [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113139 (owner: 10DCausse)
[14:35:00] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:35:16] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:35:34] <wikibugs>	 (03PS3) 10Vgutierrez: acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837)
[14:35:34] <wikibugs>	 (03PS7) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195)
[14:36:02] <wikibugs>	 (03CR) 10Vgutierrez: acme_chief: Fix handling of default account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez)
[14:36:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez)
[14:36:13] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111932|Remove KartographerParsoidSupport flag from configuration (T340134)]] (duration: 19m 46s)
[14:36:13] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez)
[14:36:17] <stashbot>	 T340134: Feature flag addition/removal for Parsoid support for Kartographer - https://phabricator.wikimedia.org/T340134
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:50] <Lucas_WMDE>	 ok
[14:36:56] <Lucas_WMDE>	 DreamRimmer next I think
[14:37:01] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] acme_chief: Fix handling of default account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez)
[14:37:09] * Lucas_WMDE peeks at diffConfig
[14:37:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) (owner: 10Novem Linguae)
[14:37:35] <ihurbain>	 thanks Lucas_WMDE ! 
[14:37:39] <Lucas_WMDE>	 np :)
[14:38:07] <wikibugs>	 (03Merged) 10jenkins-bot: enable 2 factor authentication for enwiki page movers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) (owner: 10Novem Linguae)
[14:38:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1107936|enable 2 factor authentication for enwiki page movers (T382879)]]
[14:38:39] <wikibugs>	 (03PS1) 10Btullis: airflow-analytics: Allow access to the mw-api via service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113149 (https://phabricator.wikimedia.org/T380619)
[14:38:39] <stashbot>	 T382879: Add oathauth-enable permission to extendedmover group on enwiki - https://phabricator.wikimedia.org/T382879
[14:38:57] <wikibugs>	 (03PS4) 10Vgutierrez: acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837)
[14:38:57] <wikibugs>	 (03PS8) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195)
[14:40:09] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez)
[14:40:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey)
[14:40:27] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4841/console" [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez)
[14:40:36] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez)
[14:41:36] <DreamRimmer>	 oauth change looks good to me
[14:41:42] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2019.codfw.wmnet with reason: remove from cluster for reimage
[14:41:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[14:41:46] <Lucas_WMDE>	 it hasn’t even deployed yet :P
[14:41:47] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] mapnik: fix paths for mapnik directories [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey)
[14:41:52] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479882 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=05c11855-71d5-489c-8ed8-13baa1a2b7b9) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[14:41:55] <wikibugs>	 (03PS1) 10DCausse: wdqs: fix staging stream names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113150 (https://phabricator.wikimedia.org/T374919)
[14:42:01] <DreamRimmer>	 But i can see
[14:42:08] <Lucas_WMDE>	 then I guess you got lucky
[14:42:22] <Lucas_WMDE>	 and hit one of the k8s deployments that were already done
[14:42:34] <Lucas_WMDE>	 but please wait until scap says it’s okay to test, it’s much less confusing that way imho
[14:42:37] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[14:42:53] <DreamRimmer>	 np
[14:43:31] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479886 (10jcrespo) 05Open→03Resolved Acc...
[14:43:40] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:44:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:44:33] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10479889 (10Volans) If I understand the db structure correctly that should convert into this query:  ` select * from b...
[14:44:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[14:45:05] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 novemlinguae, lucaswerkmeister-wmde: Backport for [[gerrit:1107936|enable 2 factor authentication for enwiki page movers (T382879)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:45:09] <stashbot>	 T382879: Add oathauth-enable permission to extendedmover group on enwiki - https://phabricator.wikimedia.org/T382879
[14:45:17] <Lucas_WMDE>	 DreamRimmer: now it’s ready for testing ^^
[14:45:34] <Lucas_WMDE>	 looks good to me afaict
[14:45:35] <DreamRimmer>	 checking
[14:45:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10479894 (10RobH) @cmooney,  I'm updating the order task, but this was delivered in December so I can open a remote hands to get it fixed.  Do we need to schedule th...
[14:45:40] <wikibugs>	 (03CR) 10DCausse: [C:03+2] wdqs: fix staging stream names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113150 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse)
[14:45:50] <wikibugs>	 (03PS9) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195)
[14:46:06] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[14:46:32] <DreamRimmer>	 looks good
[14:46:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[14:47:10] <wikibugs>	 (03Merged) 10jenkins-bot: wdqs: fix staging stream names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113150 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse)
[14:47:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 novemlinguae, lucaswerkmeister-wmde: Continuing with sync
[14:47:45] <Lucas_WMDE>	 ok, thanks for checking!
[14:49:16] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:49:36] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:50:15] <wikibugs>	 (03PS10) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195)
[14:50:16] <wikibugs>	 (03PS1) 10Vgutierrez: profile::acme_chief: Use Acme_chief::Account type [puppet] - 10https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195)
[14:52:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] profile::acme_chief: Use Acme_chief::Account type [puppet] - 10https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[14:53:03] <wikibugs>	 (03PS2) 10Brouberol: global_config: add the IP of the dyna proxy [puppet] - 10https://gerrit.wikimedia.org/r/1113151 (https://phabricator.wikimedia.org/T380619)
[14:53:54] <wikibugs>	 (03CR) 10Btullis: [C:03+1] global_config: add the IP of the dyna proxy [puppet] - 10https://gerrit.wikimedia.org/r/1113151 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[14:54:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] global_config: add the IP of the dyna proxy [puppet] - 10https://gerrit.wikimedia.org/r/1113151 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[14:54:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1107936|enable 2 factor authentication for enwiki page movers (T382879)]] (duration: 16m 10s)
[14:54:49] <stashbot>	 T382879: Add oathauth-enable permission to extendedmover group on enwiki - https://phabricator.wikimedia.org/T382879
[14:54:52] <Lucas_WMDE>	 alright
[14:55:01] <Lucas_WMDE>	 DreamRimmer: should be done now
[14:55:16] <Lucas_WMDE>	 yerdua_wmde: do you have time now?
[14:55:23] <Lucas_WMDE>	 (sorry if I missed a message from you, the channel is pretty busy ^^)
[14:55:26] <DreamRimmer>	 thanks :)
[14:55:32] <yerdua_wmde>	 I'm here
[14:55:34] <Lucas_WMDE>	 yay
[14:55:37] <Lucas_WMDE>	 jouncebot: nowandnext
[14:55:38] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1400)
[14:55:38] <jouncebot>	 In 1 hour(s) and 4 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1600)
[14:55:47] <Lucas_WMDE>	 ok, I think we’ll just overrun the window a bit
[14:55:55] <Lucas_WMDE>	 unless someone else is burning to deploy something of their own
[14:55:59] * Lucas_WMDE listens for a few seconds
[14:56:03] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:56:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:56:43] <Lucas_WMDE>	 let’s go
[14:56:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) (owner: 10Audrey Penven)
[14:57:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] "Let's get this out, then we can reason on how to improve the chart in general regarding all the feature flags and duplication we have." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[14:57:40] <wikibugs>	 (03Merged) 10jenkins-bot: Add known-good regexes for WikibaseQualityConstraints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) (owner: 10Audrey Penven)
[14:58:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1112261|Add known-good regexes for WikibaseQualityConstraints (T380751)]]
[14:58:12] <stashbot>	 T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - https://phabricator.wikimedia.org/T380751
[15:00:10] <wikibugs>	 (03PS2) 10Vgutierrez: profile::acme_chief: Use Acme_chief::Account type [puppet] - 10https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195)
[15:00:10] <wikibugs>	 (03PS11) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195)
[15:00:22] * Lucas_WMDE looks for some test item with many extids
[15:00:35] <Lucas_WMDE>	 (under the assumption that many extids ≈ many format constraints)
[15:00:54] <Lucas_WMDE>	 whyyyyyyy https://www.wikidata.org/wiki/Q6382438
[15:01:01] <Lucas_WMDE>	 6688 identifiers ._.
[15:01:10] <Lucas_WMDE>	 (and not with one of the allowlisted regexes, so useless for testing)
[15:01:22] <yerdua_wmde>	 omg
[15:02:19] <wikibugs>	 (03PS1) 10Brouberol: airflow-analytics: allow the egress to ATS for task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113159 (https://phabricator.wikimedia.org/T380619)
[15:02:36] <Lucas_WMDE>	 https://www.wikidata.org/wiki/Q1744 sure, whatever
[15:02:40] <Lucas_WMDE>	 509 extids
[15:02:52] <Lucas_WMDE>	 many of them probably not allowlisted but hopefully enough are that we’ll be able to see a performance difference
[15:03:46] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4846/console" [puppet] - 10https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[15:04:06] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:04:34] <wikibugs>	 06SRE: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332 (10LPasqual_WMF) 03NEW
[15:04:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, audreypenven: Backport for [[gerrit:1112261|Add known-good regexes for WikibaseQualityConstraints (T380751)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:04:47] <stashbot>	 T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - https://phabricator.wikimedia.org/T380751
[15:04:48] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:05:37] <Lucas_WMDE>	 okay, should be ready to test
[15:05:49] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-analytics: allow the egress to ATS for task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113159 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[15:06:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-analytics: allow the egress to ATS for task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113159 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[15:06:04] <Lucas_WMDE>	 yerdua_wmde: any ideas for how we can test this?
[15:06:28] <yerdua_wmde>	 I was just about to ask you if you had ideas
[15:06:39] <Lucas_WMDE>	 my idea is to curl https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=Q1744&status=*&format=json&formatversion=2
[15:06:42] <Lucas_WMDE>	 (constraint check on that item)
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:49] <Lucas_WMDE>	 with and without -H 'X-Wikimedia-Debug: backend=k8s-mwdebug'
[15:06:54] <Lucas_WMDE>	 (per https://wikitech.wikimedia.org/wiki/WikimediaDebug#Command-line_usage)
[15:07:00] <Lucas_WMDE>	 and see if the time is different
[15:07:06] <Lucas_WMDE>	 without -H: ca. 42 seconds
[15:07:18] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[15:07:20] <Lucas_WMDE>	 with -H: 44s
[15:07:21] <Lucas_WMDE>	 dangit
[15:07:26] <Lucas_WMDE>	 lemme try that again :'D
[15:07:32] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[15:08:10] <yerdua_wmde>	 is there any way to see if it touched shellbox?
[15:08:19] <Lucas_WMDE>	 I think using XHGui might work
[15:08:33] <Lucas_WMDE>	 (ca. 44s on the second curl with -H btw, dangit)
[15:08:51] <Lucas_WMDE>	 if I turn on the WikimediaDebug extension and enable XHGui, and then load the URL in the browser
[15:09:02] <Lucas_WMDE>	 I should get some useful data there
[15:09:12] <Lucas_WMDE>	 (after waiting ca. 44 seconds for the request to finish ^^)
[15:09:47] <Lucas_WMDE>	 nooo firefox don’t time out :(
[15:10:27] <Lucas_WMDE>	 well, I guess I can still find the request in xhgui anyway
[15:10:28] <Lucas_WMDE>	 https://performance.wikimedia.org/xhgui/
[15:10:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:10:35] <Lucas_WMDE>	 leads to https://performance.wikimedia.org/xhgui/run/view?id=678fb8c3de9320ac29a8953b
[15:10:44] <Lucas_WMDE>	 if we look around we should be able to see how many times the different FormatChecker methods are called
[15:11:04] <Lucas_WMDE>	 o_O https://performance.wikimedia.org/xhgui/run/symbol?id=678fb8c3de9320ac29a8953b&symbol=WikibaseQuality%5CConstraintReport%5CConstraintCheck%5CHelper%5CFormatCheckerHelper%3A%3ArunRegexCheck
[15:11:09] <Lucas_WMDE>	 “runRegexCheck called no functions”
[15:11:31] <yerdua_wmde>	 uh.. what?
[15:11:46] <Lucas_WMDE>	 I also tried the request again and it timed out at the MediaWiki level (RequestTimeoutException)
[15:11:49] <Lucas_WMDE>	 maybe I should pick a smaller item
[15:12:58] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[15:13:01] <Lucas_WMDE>	 https://www.wikidata.org/wiki/Q415 should have at least one format constraint matching the allowlist, I think
[15:13:13] <Lucas_WMDE>	 damn, no
[15:13:20] <Lucas_WMDE>	  [1-9][0-9]{0,6} isn’t quite in the list
[15:14:36] <Lucas_WMDE>	 seems to be harder than I thought to find items with format constraints in that list :(
[15:14:45] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] chartmuseum: remove icinga-based http checks [puppet] - 10https://gerrit.wikimedia.org/r/1113146 (https://phabricator.wikimedia.org/T384324) (owner: 10Filippo Giunchedi)
[15:15:14] <wikibugs>	 (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[15:15:49] * Lucas_WMDE tries something else with https://w.wiki/Co2o
[15:16:36] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10480082 (10jcrespo) @Neslihan_Turan_WMDE I wasn't able to find a developer account with that cn, dn or email. My guess is you sent your SUL (wiki) account, not your developer account,...
[15:16:45] <Lucas_WMDE>	 slightly improved query https://w.wiki/Co2s
[15:17:00] <Lucas_WMDE>	 yeah, sure, random municipality in Finland https://www.wikidata.org/wiki/Q51909
[15:17:05] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10480083 (10Jhancock.wm) hey, was out sick the last half of last week. got this from Dell:    I understand the situation. Upon reviewing the details, I noticed that the disks ins...
[15:17:40] <Lucas_WMDE>	 yerdua_wmde: okay, new xhgui is here https://performance.wikimedia.org/xhgui/run/view?id=678fba7ca891a98639287cb7
[15:17:59] <Lucas_WMDE>	 here, that looks better https://performance.wikimedia.org/xhgui/run/symbol?id=678fba7ca891a98639287cb7&symbol=WikibaseQuality%5CConstraintReport%5CConstraintCheck%5CChecker%5CFormatChecker%3A%3ArunRegexCheck
[15:18:25] <wikibugs>	 (03CR) 10David Caro: [C:03+2] toolforge::base: add cron to all boxes [puppet] - 10https://gerrit.wikimedia.org/r/1113128 (https://phabricator.wikimedia.org/T384250) (owner: 10David Caro)
[15:18:25] <Lucas_WMDE>	 so, 318 calls to runRegexCheck(), of which 313 went to runRegexCheckUsingShellbox() and 5 went to FormatCheckerHelper
[15:18:34] <logmsgbot>	 !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2010.codfw.wmnet with reason: Server moving within rack
[15:18:40] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10480097 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ffdb0a96-3214-40cf-acd0-ec05d4bf5539) set by jayme@cumin1002 for 2:00:00 on 1 host(s) and their servi...
[15:18:58] <Lucas_WMDE>	 so it’s at least doing something
[15:19:11] <Lucas_WMDE>	 even if the “hit rate” seems to be much lower than I hoped for
[15:19:31] <Lucas_WMDE>	 (and I also checked that there’s no difference between the result output with and without -H 'X-Wikimedia-Debug: backend=k8s-mwdebug')
[15:19:41] <yerdua_wmde>	 if I'm reading it right, it did the check without shellbox 5 times
[15:19:48] <Lucas_WMDE>	 yeah
[15:20:06] <yerdua_wmde>	 and one would be enough to prove that it succeeded in using the config values
[15:20:25] <Lucas_WMDE>	 yup
[15:20:34] <Lucas_WMDE>	 so we can continue the deployment for now
[15:20:36] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10480114 (10MatthewVernon) Wait, didn't we buy this server and all of its drives spinning and SSD from Dell? And now they're saying they're all the wrong drives?!?
[15:20:45] <Lucas_WMDE>	 and then maybe look more into whether we want to allowlist more format strings
[15:20:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, audreypenven: Continuing with sync
[15:21:12] <Lucas_WMDE>	 perhaps the format strings we configured are used on many different properties, but each of those properties is only used relatively rarely…
[15:21:23] <wikibugs>	 (03PS8) 10Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502)
[15:23:22] <yerdua_wmde>	 right. maybe it's worth adding more, or swapping out for format strings that are used more
[15:23:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10480120 (10cmooney) Thanks @volans you have helped me a lot with this and given me confidence to look at the DB.  I s...
[15:23:40] <Lucas_WMDE>	 yeah
[15:24:00] <Lucas_WMDE>	 might also be worth adding some statsd (or prometheus…) tracking for how often a regex is allowlisted vs. not
[15:24:10] <Lucas_WMDE>	 so we can see what the hit rate is overall
[15:24:16] <Lucas_WMDE>	 not just on some random cherrypicked items
[15:24:21] <yerdua_wmde>	 makes sense
[15:24:34] <Lucas_WMDE>	 (then again, I guess the question is whether that’s still prioritized ^^)
[15:24:56] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[15:25:19] <yerdua_wmde>	 and I'm assuming this is a problem for another window
[15:25:30] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[15:27:26] <wikibugs>	 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10480125 (10jcrespo) I believe this is something to be handled by #traffic at varnish level, more than a maps task. Is this something you handle (I am not familiar with the process) @Vgutierrez @...
[15:27:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[15:27:45] <Lucas_WMDE>	 yeah, definitely :)
[15:27:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112261|Add known-good regexes for WikibaseQualityConstraints (T380751)]] (duration: 29m 44s)
[15:27:55] <stashbot>	 T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - https://phabricator.wikimedia.org/T380751
[15:28:01] <wikibugs>	 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10480128 (10jcrespo) p:05Triage→03High
[15:28:01] <Lucas_WMDE>	 yay
[15:28:19] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[15:28:21] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[15:28:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:52] <wikibugs>	 (03PS1) 10Jelto: gerrit: change blackbox checks to collaboration-services/task [puppet] - 10https://gerrit.wikimedia.org/r/1113163
[15:29:34] <wikibugs>	 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10480136 (10ssingh) Thanks @jcrespo; Traffic will take care of it.  @MSantos: This requires your approval before we can continue. Thanks.
[15:30:24] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10480150 (10WMDECyn) Approving this request from WMDE side
[15:30:25] <wikibugs>	 (03PS1) 10Jelto: Revert "gerrit: lower throttling threshold to 15 parallel connections" [puppet] - 10https://gerrit.wikimedia.org/r/1113164
[15:31:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:31:25] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] Revert "gerrit: lower throttling threshold to 15 parallel connections" [puppet] - 10https://gerrit.wikimedia.org/r/1113164 (owner: 10Jelto)
[15:31:58] <wikibugs>	 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10480174 (10ssingh) a:03ssingh
[15:32:49] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] gerrit: change blackbox checks to collaboration-services/task [puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto)
[15:34:53] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Uploading: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10480204 (10jcrespo) Hey, @Underbar_dk  is that happening still? Please provide the data suggested at https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_iss...
[15:35:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10480208 (10cmooney) It also appears we are getting values populated for AcceptedPrefixes for IPv6 peers for some devi...
[15:37:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2019 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113106 (owner: 10Muehlenhoff)
[15:38:19] <wikibugs>	 (03CR) 10Ottomata: "Added a couple of comments!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira)
[15:40:11] <icinga-wm>	 RECOVERY - Host restbase2037 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms
[15:41:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:42:30] <wikibugs>	 06SRE, 10DNS, 06Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480253 (10jcrespo) I believe authentication on blusky happens through DNS. Adding #DNS and #Traffic for awareness.  I can handle this, as we did it to authenticate the search engines consoles.  @LPasqual...
[15:43:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:44:17] <wikibugs>	 06SRE, 10DNS, 06Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480259 (10jcrespo) p:05Triage→03Medium
[15:44:20] <wikibugs>	 06SRE, 10DNS, 06Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480260 (10jcrespo) a:03jcrespo
[15:44:23] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:45:51] <wikibugs>	 (03CR) 10Jelto: [C:03+2] Revert "gerrit: lower throttling threshold to 15 parallel connections" [puppet] - 10https://gerrit.wikimedia.org/r/1113164 (owner: 10Jelto)
[15:47:18] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:47:24] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10480265 (10BCornwall) Yeah, this isn't an acceptable answer. They need to be more specific, I'm smelling their vagueness comes from not wanting to spend time/money.
[15:47:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2019.codfw.wmnet with OS bookworm
[15:47:36] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10480267 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2019.codfw.wmnet with OS bookworm
[15:48:27] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.hosts.reimage: Skip the vlan migration reminder for ganeti nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1113167
[15:51:05] <wikibugs>	 (03PS2) 10Brouberol: airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620)
[15:51:31] <jinxer-wm>	 RESOLVED: Not accepting/receiving prefixes from anycast BGP peer: Device cr4-ulsfo.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[15:51:49] <wikibugs>	 (03PS3) 10Brouberol: airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620)
[15:52:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2004-dev
[15:52:05] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2004-dev
[15:52:18] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:52:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-main2010
[15:52:26] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol)
[15:52:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-main2010
[15:53:29] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[15:53:54] <wikibugs>	 (CR) DCausse: [C:+2] search: add alerts for weighted_tags indexing throughput (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1111300 (https://phabricator.wikimedia.org/T373459) (owner: DCausse)
[15:55:11] <wikibugs>	 (Merged) jenkins-bot: search: add alerts for weighted_tags indexing throughput [alerts] - https://gerrit.wikimedia.org/r/1111300 (https://phabricator.wikimedia.org/T373459) (owner: DCausse)
[15:55:25] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1470-1475].eqiad.wmnet
[15:56:16] <wikibugs>	 (CR) Btullis: [C:+1] airflow: enable the injection of custom config files in the worker pod [deployment-charts] - https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620) (owner: Brouberol)
[15:57:14] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: enable the injection of custom config files in the worker pod [deployment-charts] - https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620) (owner: Brouberol)
[15:57:15] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm 🍿" [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[15:58:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:58:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:00:04] <jouncebot>	 jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1600).
[16:00:08] <wikibugs>	 ops-codfw, SRE, Cassandra, DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10480331 (Eevans) p:Medium→High >>! In T383820#10479192, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/PqHEiJQBKFqum...
[16:00:37] <wikibugs>	 (CR) Clément Goubert: php8.1: introduce JIT (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113138 (https://phabricator.wikimedia.org/T384294) (owner: Effie Mouzeli)
[16:00:53] <wikibugs>	 SRE, DNS, Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480336 (LPasqual_WMF) Thank you for such a quick reply, @jcrespo.  Here's the info you requested: Host: _atproto Type: TXT Value: did=did:plc:plla3i7zproko3ekdnkoykhe  And a screenshot, just in case: {...
[16:00:54] <wikibugs>	 (CR) JMeybohm: [C:+2] admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[16:01:03] <jinxer-wm>	 RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[16:01:12] <wikibugs>	 (CR) Xcollazo: [V:+1 C:+1] "Verified content is indeed as revised on this patch." [puppet] - https://gerrit.wikimedia.org/r/1112123 (owner: Pppery)
[16:02:04] <wikibugs>	 (CR) Mvolz: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1113137 (https://phabricator.wikimedia.org/T384165) (owner: Mvolz)
[16:03:05] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main2010.codfw.wmnet
[16:03:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2010.codfw.wmnet
[16:03:25] <wikibugs>	 SRE, Commons, MediaWiki-Uploading: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10480346 (Underbar_dk) Yes. This is still happening on my desktop.  I am finding that this issue is more likely to trigger when I try to upload multiple files at on...
[16:03:26] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[16:03:44] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10480348 (JMeybohm) >>! In T381788#10480097, @ops-monitoring-bot wrote: > Icinga downtime and Alertmanager silence (ID=ffdb0a96-3214-40cf-acd0-ec05d4bf5539) set by jayme@cumin1...
[16:03:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:03:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1470-1475].eqiad.wmnet
[16:03:59] <wikibugs>	 (CR) Kamila Součková: [C:+2] wikikube: rename mw147[0-5] -> wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1112828 (https://phabricator.wikimedia.org/T365571) (owner: Kamila Součková)
[16:04:09] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[16:04:34] <wikibugs>	 (PS1) Jcrespo: wikipedia.org: Add AT Protocol/Bluesky verification [dns] - https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332)
[16:04:50] <wikibugs>	 (Merged) jenkins-bot: admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[16:05:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:47] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1470 to wikikube-worker1123
[16:06:06] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[16:09:12] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv
[16:09:12] <icinga-wm>	 e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:09:14] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv
[16:09:14] <icinga-wm>	 e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:09:15] <wikibugs>	 (CR) Ssingh: wikipedia.org: Add AT Protocol/Bluesky verification (1 comment) [dns] - https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) (owner: Jcrespo)
[16:09:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1470 to wikikube-worker1123 - kamila@cumin1002"
[16:10:05] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1471 to wikikube-worker1124
[16:10:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1470 to wikikube-worker1123 - kamila@cumin1002"
[16:10:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:10:07] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1123
[16:10:10] <wikibugs>	 (CR) Jcrespo: wikipedia.org: Add AT Protocol/Bluesky verification (1 comment) [dns] - https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) (owner: Jcrespo)
[16:10:25] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[16:10:45] <wikibugs>	 (PS2) Jcrespo: wikipedia.org: Add AT Protocol/Bluesky verification [dns] - https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332)
[16:10:58] <moritzm>	 !log installing gstreamer1.0 security updates
[16:11:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:09] <wikibugs>	 (CR) Jcrespo: wikipedia.org: Add AT Protocol/Bluesky verification (1 comment) [dns] - https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) (owner: Jcrespo)
[16:11:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1123
[16:11:26] <wikibugs>	 (CR) Ssingh: [C:+1] "Thanks for creating the patch!" [dns] - https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) (owner: Jcrespo)
[16:11:54] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1470 to wikikube-worker1123
[16:12:21] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1113167 (owner: Muehlenhoff)
[16:12:52] <wikibugs>	 (CR) Jcrespo: [C:+2] wikipedia.org: Add AT Protocol/Bluesky verification [dns] - https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) (owner: Jcrespo)
[16:13:13] <wikibugs>	 (PS1) Brouberol: airflow: add missing airflow.worker.extra-config-volumes [deployment-charts] - https://gerrit.wikimedia.org/r/1113172 (https://phabricator.wikimedia.org/T380619)
[16:13:26] <logmsgbot>	 !log jynus@dns1004 START - running authdns-update
[16:14:01] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1471 to wikikube-worker1124 - kamila@cumin1002"
[16:14:07] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10480398 (cmooney) Running the poller manually on netmon1003 I can also see it's getting the right value back, but i...
[16:14:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1472 to wikikube-worker1125
[16:14:17] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1471 to wikikube-worker1124 - kamila@cumin1002"
[16:14:17] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:14:18] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1124
[16:14:20] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[16:14:23] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service restbase2037-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:15:18] <logmsgbot>	 !log jynus@dns1004 END - running authdns-update
[16:15:26] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1124
[16:15:42] <wikibugs>	 (PS2) Brouberol: airflow: add missing airflow.worker.extra-config-volumes [deployment-charts] - https://gerrit.wikimedia.org/r/1113172 (https://phabricator.wikimedia.org/T380619)
[16:16:05] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1471 to wikikube-worker1124
[16:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:17:18] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:17:21] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: add missing airflow.worker.extra-config-volumes [deployment-charts] - https://gerrit.wikimedia.org/r/1113172 (https://phabricator.wikimedia.org/T380619) (owner: Brouberol)
[16:18:18] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1472 to wikikube-worker1125 - kamila@cumin1002"
[16:18:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on mw1473:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:18:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1472 to wikikube-worker1125 - kamila@cumin1002"
[16:18:57] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:18:57] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1125
[16:19:10] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1473 to wikikube-worker1126
[16:19:30] <wikibugs>	 SRE, DNS, Traffic, Patch-For-Review: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480414 (jcrespo) @LPasqual_WMF The deploy for `@wikipedia.org` should already be working, but don't be surprised if you get an error (there could be ~5 minutes of cache), if it ha...
[16:19:30] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[16:19:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2019.codfw.wmnet with reason: host reimage
[16:19:42] <papaul>	 !log power down ms-be2088 for maintenance
[16:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:53] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[16:20:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1125
[16:20:34] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[16:20:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1472 to wikikube-worker1125
[16:22:02] <icinga-wm>	 PROBLEM - Host ms-be2088 is DOWN: PING CRITICAL - Packet loss = 100%
[16:23:13] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1473 to wikikube-worker1126 - kamila@cumin1002"
[16:23:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2019.codfw.wmnet with reason: host reimage
[16:23:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1474 to wikikube-worker1127
[16:23:28] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1473 to wikikube-worker1126 - kamila@cumin1002"
[16:23:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:23:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1126
[16:23:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:23:46] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[16:24:36] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1126
[16:25:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1473 to wikikube-worker1126
[16:27:18] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:27:56] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1474 to wikikube-worker1127 - kamila@cumin1002"
[16:28:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1475 to wikikube-worker1128
[16:28:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1474 to wikikube-worker1127 - kamila@cumin1002"
[16:28:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:28:31] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1127
[16:28:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[16:29:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1127
[16:30:24] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1474 to wikikube-worker1127
[16:31:19] <wikibugs>	 (CR) Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: Andrea Denisse)
[16:31:47] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: ZhaoFJx)
[16:32:31] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1475 to wikikube-worker1128 - kamila@cumin1002"
[16:32:41] <wikibugs>	 (PS6) Andrea Denisse: wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502)
[16:33:00] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1475 to wikikube-worker1128 - kamila@cumin1002"
[16:33:00] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:33:00] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1128
[16:33:05] <wikibugs>	 SRE, DNS, Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480451 (LPasqual_WMF) @jcrespo Happy to say it is already working! [[ https://bsky.app/profile/wikipedia.org | @wikipedia.org ]] is live.  Thanks, Jaime and team. I'll follow up with a separate ticket...
[16:33:16] <wikibugs>	 (CR) Andrea Denisse: wmcs: Migrate iowait stalling alerts to the alerts.git repository (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: Andrea Denisse)
[16:34:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1128
[16:34:14] <wikibugs>	 ops-codfw, SRE, Cassandra, DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10480460 (Jhancock.wm) swapped B1 to A1. gotta let run and see if it crashes again. might not. sometimes that's all it needs. (Thanks for your patience, i was unexpectedly out the last half of la...
[16:34:48] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1475 to wikikube-worker1128
[16:34:54] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase2037.codfw.wmnet
[16:34:54] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2037.codfw.wmnet
[16:35:09] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1123.eqiad.wmnet wikikube-worker1124.eqiad.wmnet wikikube-worker1125.eqiad.wmnet wikikube-worker1126.eqiad.wmnet wikikube-worker1127.eqiad.wmnet wikikube-worker1128.eqiad.wmnet on all recursors
[16:35:13] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1123.eqiad.wmnet wikikube-worker1124.eqiad.wmnet wikikube-worker1125.eqiad.wmnet wikikube-worker1126.eqiad.wmnet wikikube-worker1127.eqiad.wmnet wikikube-worker1128.eqiad.wmnet on all recursors
[16:35:57] <wikibugs>	 (CR) BCornwall: [C:+1] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1113105 (https://phabricator.wikimedia.org/T384287) (owner: Gerrit maintenance bot)
[16:36:10] <wikibugs>	 ops-codfw, SRE, Cassandra, DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10480474 (Eevans) >>! In T383820#10480460, @Jhancock.wm wrote: > swapped B1 to A1. gotta let run and see if it crashes again. might not. sometimes that's all it needs. (Thanks for your patience,...
[16:36:17] <wikibugs>	 SRE, DNS, Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480475 (jcrespo) Open→Resolved
[16:38:02] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: Vgutierrez)
[16:39:57] <wikibugs>	 SRE, Traffic, Data-Engineering (Q3 2024 January 1st - March 31th), Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10480486 (Ahoelzl)
[16:40:33] <wikibugs>	 (CR) BCornwall: [C:+1] systemd: added option to remain after exit [puppet] - https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: Fabfur)
[16:41:08] <wikibugs>	 (PS1) Brouberol: Revert "global_config: add the IP of the dyna proxy" [puppet] - https://gerrit.wikimedia.org/r/1113176 (https://phabricator.wikimedia.org/T380619)
[16:42:21] <wikibugs>	 (CR) BCornwall: [C:+1] profile::acme_chief: Use Acme_chief::Account type [puppet] - https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195) (owner: Vgutierrez)
[16:43:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2019.codfw.wmnet with OS bookworm
[16:43:57] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10480502 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2019.codfw.wmnet with OS bookworm completed: - ganeti201...
[16:44:02] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1123.eqiad.wmnet with OS bookworm
[16:44:05] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1123
[16:44:05] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1123
[16:44:07] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1124.eqiad.wmnet with OS bookworm
[16:44:10] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1124
[16:44:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1124
[16:44:15] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1125.eqiad.wmnet with OS bookworm
[16:44:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1125
[16:44:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1125
[16:44:21] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1126.eqiad.wmnet with OS bookworm
[16:44:25] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1126
[16:44:25] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1126
[16:44:27] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1127.eqiad.wmnet with OS bookworm
[16:44:30] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1127
[16:44:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1127
[16:44:38] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1128.eqiad.wmnet with OS bookworm
[16:44:42] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1128
[16:44:42] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1128
[16:44:44] <icinga-wm>	 RECOVERY - Host ms-be2088 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[16:44:45] <wikibugs>	 (CR) Brouberol: [C:+2] Revert "global_config: add the IP of the dyna proxy" [puppet] - https://gerrit.wikimedia.org/r/1113176 (https://phabricator.wikimedia.org/T380619) (owner: Brouberol)
[16:44:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10480510 (phaultfinder)
[16:46:36] <wikibugs>	 (CR) Vgutierrez: [V:+1 C:+2] profile::acme_chief: Use Acme_chief::Account type [puppet] - https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195) (owner: Vgutierrez)
[16:54:07] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:54:50] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[16:55:03] <wikibugs>	 (PS1) Btullis: Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - https://gerrit.wikimedia.org/r/1113177
[16:55:03] <wikibugs>	 (CR) BCornwall: [C:+1] hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: Vgutierrez)
[16:55:10] <wikibugs>	 (CR) CI reject: [V:-1] Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - https://gerrit.wikimedia.org/r/1113177 (owner: Btullis)
[16:58:24] <wikibugs>	 (PS2) Btullis: Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - https://gerrit.wikimedia.org/r/1113177
[16:59:58] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1123.eqiad.wmnet with reason: host reimage
[17:00:03] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) (owner: Slyngshede)
[17:00:05] <jouncebot>	 jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:10] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1125.eqiad.wmnet with reason: host reimage
[17:00:13] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1124.eqiad.wmnet with reason: host reimage
[17:00:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1127.eqiad.wmnet with reason: host reimage
[17:00:31] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1128.eqiad.wmnet with reason: host reimage
[17:02:06] <wikibugs>	 (CR) Brouberol: [C:+1] Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - https://gerrit.wikimedia.org/r/1113177 (owner: Btullis)
[17:02:10] <wikibugs>	 (CR) Btullis: [C:+2] Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - https://gerrit.wikimedia.org/r/1113177 (owner: Btullis)
[17:03:13] <wikibugs>	 (Merged) jenkins-bot: Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - https://gerrit.wikimedia.org/r/1113177 (owner: Btullis)
[17:03:26] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1123.eqiad.wmnet with reason: host reimage
[17:04:34] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[17:05:08] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[17:06:43] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10480665 (Jhancock.wm) Open→Resolved a:Jhancock.wm This has been completed. Thank you for your help!
[17:06:46] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1124.eqiad.wmnet with reason: host reimage
[17:08:06] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480676 (10Jhancock.wm)
[17:08:08] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: reoute testwiki citoid calls to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1113178 (https://phabricator.wikimedia.org/T361576)
[17:08:29] <wikibugs>	 (03PS1) 10Btullis: Revert "Temporarily disable gobblin timers on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/1113179
[17:09:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1125.eqiad.wmnet with reason: host reimage
[17:09:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345 (10cmooney) 03NEW p:05Triage→03Medium
[17:10:38] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480706 (10Jhancock.wm)
[17:10:56] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Revert "Temporarily disable gobblin timers on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/1113179 (owner: 10Btullis)
[17:11:09] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480707 (10Jhancock.wm)
[17:12:33] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1128.eqiad.wmnet with reason: host reimage
[17:15:21] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: route citoid via rest-gateway for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1113182 (https://phabricator.wikimedia.org/T361576)
[17:16:38] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1127.eqiad.wmnet with reason: host reimage
[17:17:45] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480734 (10Jhancock.wm)
[17:20:38] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10480754 (10fnegri) @RobH do you think that this can be done in the next one/two weeks? We need these servers to...
[17:23:25] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480785 (10Papaul) @Jelto when do you think will be a best time  for you or someone in your team to help us relocate some of those mw a...
[17:24:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1123.eqiad.wmnet with OS bookworm
[17:24:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10480793 (10phaultfinder)
[17:24:51] <wikibugs>	 (03PS1) 10Federico Ceratto: site.pp, db2133.yaml: Remove db2133 [puppet] - 10https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343)
[17:26:16] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 06serviceops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480800 (10Jelto) >>! In T383709#10480784, @Papaul wrote: > @Jelto when do you think will be a best time  for you or so...
[17:26:26] <wikibugs>	 (03CR) 10Marostegui: site.pp, db2133.yaml: Remove db2133 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343) (owner: 10Federico Ceratto)
[17:28:16] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1124.eqiad.wmnet with OS bookworm
[17:29:12] <wikibugs>	 (03PS1) 10DCausse: wdqs: bump image to 0.3.153 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113184 (https://phabricator.wikimedia.org/T374919)
[17:29:39] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10480813 (10RobH) >>! In T382412#10480754, @fnegri wrote: > @RobH do you think that this can be done in the next...
[17:30:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:30:34] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 06serviceops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480819 (10Papaul) @Jelto thanks please let us know when best works for you for the gerrit2002. Thanks
[17:32:02] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1125.eqiad.wmnet with OS bookworm
[17:33:11] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 10decommission-hardware: decommission mw2282.codfw.wmnet - https://phabricator.wikimedia.org/T384226#10480828 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:33:19] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10480831 (10thcipriani) >>! In T384018#10477272, @jcrespo wrote: > To try to speed up confirmations, 'restricted' is documented at data.yml to require @thcipriani approval. So asking...
[17:34:50] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[17:35:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:35:34] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1128.eqiad.wmnet with OS bookworm
[17:35:38] <wikibugs>	 (03CR) 10DCausse: [C:03+2] wdqs: bump image to 0.3.153 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113184 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse)
[17:36:42] <wikibugs>	 (03Merged) 10jenkins-bot: wdqs: bump image to 0.3.153 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113184 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse)
[17:37:53] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10480844 (10cmooney) >>! In T384288#10479894, @RobH wrote: > I'm assuming we need to schedule it, and we should give them a couple days notice if we want a set sched...
[17:38:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1127.eqiad.wmnet with OS bookworm
[17:38:56] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns2004.wikimedia.org [reason: T383709]
[17:39:40] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[17:39:54] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[17:41:58] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1126.eqiad.wmnet with OS bookworm
[17:42:36] <logmsgbot>	 !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2004.wikimedia.org with reason: T383709
[17:42:46] <wikibugs>	 (03PS1) 10Vgutierrez: acme_chief: Allow specifying an account per certificate [puppet] - 10https://gerrit.wikimedia.org/r/1113187 (https://phabricator.wikimedia.org/T384195)
[17:43:14] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 06serviceops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480855 (10JMeybohm) mw2259 and mw2278 are to be decommed (T354791, T384043) mw2355 is now wikikube-worker2229 (T383862...
[17:46:49] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:46:50] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:47:52] <sukhe>	 ^ expected
[17:52:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns2004
[17:53:06] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns2004
[17:54:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10480884 (10phaultfinder)
[17:56:43] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112101 (https://phabricator.wikimedia.org/T383942) (owner: 10Jdlrobson)
[17:56:52] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10480892 (10jcrespo) Is there anything else to do here (are there any concerns left?), other than fixing documenta...
[17:58:01] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113187 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[17:58:36] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns2004.wikimedia.org
[17:58:37] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2004.wikimedia.org
[17:58:44] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 06serviceops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480893 (10Jhancock.wm)
[17:58:47] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns2004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[17:58:49] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns2004 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[17:59:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#10480904 (10jcrespo) I hope the tagging is ok, as you are doing the work. Let me know if I can help with some reviews.
[17:59:41] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns2004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[17:59:41] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns2004 is OK: OK: UP (pid=2955) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[17:59:55] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:59:55] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:00:05] <jouncebot>	 swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1800).
[18:00:08] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10480907 (10Ottomata) Olja approved, so no concerns left.  Just needs to be implemented by fixing docs, etc.  Than...
[18:00:16] <swfrench-wmf>	 o/
[18:00:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1126.eqiad.wmnet with OS bookworm
[18:00:21] <swfrench-wmf>	 I'll get started shortly
[18:00:21] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1126
[18:00:21] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1126
[18:00:33] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10480909 (10jcrespo) a:03jcrespo
[18:00:36] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10480910 (10jcrespo) p:05Triage→03Medium
[18:02:42] <swfrench-wmf>	 !log disabling puppet on A:cp-text ahead of ATS mapping change - T377042
[18:03:02] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns2004.wikimedia.org [reason: T383709]
[18:03:30] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[18:04:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10480929 (10phaultfinder)
[18:05:11] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[18:06:08] <wikibugs>	 (03CR) 10Scott French: [C:03+2] trafficserver: add mw-php-migration to mapping_rules [puppet] - 10https://gerrit.wikimedia.org/r/1082581 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French)
[18:12:12] <wikibugs>	 06SRE: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350 (10LPasqual_WMF) 03NEW
[18:12:39] <wikibugs>	 06SRE: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10480967 (10LPasqual_WMF) I am pasting below the DNS information, with a screenshot: Host: _atproto Type: TXT Value: did=did:plc:vwdzejaw4wkxh2wvkjlcubal {F58240567}
[18:13:38] <wikibugs>	 06SRE: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10480971 (10ssingh) Hi @LPasqual_WMF: confirming that this is an additional request for @wikimediafoundation.org, in addition to @wikipedia.org?
[18:14:36] <wikibugs>	 06SRE: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10480982 (10LPasqual_WMF) @ssingh Hi, that's correct!
[18:16:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1126.eqiad.wmnet with reason: host reimage
[18:17:24] <wikibugs>	 (03PS1) 10Ssingh: wikimediafoundation.org: add TXT record for Bluesky verification [dns] - 10https://gerrit.wikimedia.org/r/1113191 (https://phabricator.wikimedia.org/T384350)
[18:19:15] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1113191 (https://phabricator.wikimedia.org/T384350) (owner: 10Ssingh)
[18:19:50] <wikibugs>	 (03CR) 10Ssingh: [V:03+2 C:03+2] "Thanks for the review Jaime." [dns] - 10https://gerrit.wikimedia.org/r/1113191 (https://phabricator.wikimedia.org/T384350) (owner: 10Ssingh)
[18:19:52] <swfrench-wmf>	 !log validated routing behavior on cp4040 (applied at 18:10 UTC) - T377042
[18:19:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1126.eqiad.wmnet with reason: host reimage
[18:20:11] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[18:20:44] <wikibugs>	 06SRE, 10DNS, 13Patch-For-Review: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10481063 (10jcrespo)
[18:21:07] <wikibugs>	 06SRE, 10DNS, 13Patch-For-Review: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10481077 (10jcrespo) p:05Triage→03Medium a:03ssingh
[18:21:59] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[18:23:07] <wikibugs>	 06SRE, 10DNS, 13Patch-For-Review: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10481147 (10ssingh) ` $ dig _atproto.wikimediafoundation.org TXT +short  "did=did:plc:vwdzejaw4wkxh2wvkjlcubal" `  @LPasqual_WMF : Please try verifying now.
[18:23:36] <swfrench-wmf>	 !log started incrementally running puppet on A:cp-text for ATS mapping change - T377042
[18:24:23] <wikibugs>	 06SRE, 10DNS, 13Patch-For-Review: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10481149 (10LPasqual_WMF) @ssingh Confirming it worked. Thank you so much for taking care of this so quickly!
[18:24:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481151 (10phaultfinder)
[18:25:18] <wikibugs>	 06SRE, 10DNS, 13Patch-For-Review: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10481156 (10ssingh) 05Open→03Resolved
[18:27:03] <topranks>	 !log disable-puppet on netflow7001 to test gnmic bgp endpoint
[18:27:16] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] acme_chief: Allow specifying an account per certificate [puppet] - 10https://gerrit.wikimedia.org/r/1113187 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[18:28:23] <volans>	 !log restarting stashbot
[18:33:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:35:42] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7001.magru.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[18:35:44] <topranks>	 ^^ sry this was me forgetting to downtime 
[18:35:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10481199 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d0f01fc7-5a29-49c5-8292-aebad021ff73) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[18:38:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1126.eqiad.wmnet with OS bookworm
[18:39:08] <wikibugs>	 (03PS1) 10Clare Ming: Fix schema version for CTR instrument [extensions/WikimediaEvents] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113192 (https://phabricator.wikimedia.org/T384333)
[18:39:43] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113192 (https://phabricator.wikimedia.org/T384333) (owner: 10Clare Ming)
[18:40:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10481232 (10kamila)
[18:44:06] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] trafficserver: reoute testwiki citoid calls to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1113178 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan)
[18:45:57] <wikibugs>	 (03PS1) 10Raymond Ndibe: [wmcs::kubeadm::core] remove kubeadm-flags.env [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245)
[18:49:34] <swfrench-wmf>	 !log finished running puppet on A:cp-text for ATS mapping change - T377042
[18:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:38] <stashbot>	 T377042: Support cookie-driven fractional migration to PHP 8.1 deployments of mw-web and mw-api-ext - https://phabricator.wikimedia.org/T377042
[18:53:12] <wikibugs>	 (03CR) 10AOkoth: miscweb: support os-reports deployment (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[18:53:36] <wikibugs>	 (03PS7) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[18:53:51] <wikibugs>	 (03PS8) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[18:54:43] <wikibugs>	 (03CR) 10Scott French: [C:03+1] trafficserver: reoute testwiki citoid calls to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1113178 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan)
[18:55:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[18:57:46] <wikibugs>	 (03PS1) 10Brouberol: airflow: DRY extra volume mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113198 (https://phabricator.wikimedia.org/T380619)
[18:58:25] <wikibugs>	 (03PS2) 10Brouberol: airflow: DRY extra volume mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113198 (https://phabricator.wikimedia.org/T380619)
[19:00:05] <jouncebot>	 brennen and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1900)
[19:03:13] <wikibugs>	 (03PS9) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[19:03:22] <wikibugs>	 (03PS10) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[19:05:21] <brennen>	 o/
[19:13:20] <wikibugs>	 (03CR) 10Cwhite: "Hey folks, would you be willing to check this change set for accuracy and completeness?  Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105972 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite)
[19:14:01] <wikibugs>	 (03CR) 10Kevin Bazira: "thank you for the comments and sharing the documentation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira)
[19:14:20] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113200 (https://phabricator.wikimedia.org/T382364)
[19:14:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113200 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot)
[19:15:08] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113200 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot)
[19:18:01] <wikibugs>	 (03CR) 10Ottomata: EventStreamConfig: Add mediawiki.article_country_prediction_change stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira)
[19:21:55] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] "Looks good!" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113192 (https://phabricator.wikimedia.org/T384333) (owner: 10Clare Ming)
[19:23:55] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10481483 (10KFrancis) Please provide Neslihan's WMDE email address.  Thanks!
[19:25:00] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[19:25:20] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[19:25:23] <stashbot>	 dcausse@deploy2002: Failed to log message to wiki. Somebody should check the error logs.
[19:26:55] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.13  refs T382364
[19:26:59] <stashbot>	 T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364
[19:27:36] <wikibugs>	 (03CR) 10Scott French: "Thank you both for the reviews! FYI, since the routing component of this is now live, I'll move forward with deploying this tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080388 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French)
[19:31:52] <logmsgbot>	 !log jebe@deploy2002 Started deploy [airflow-dags/analytics_product@0aa9d7c]: (no justification provided)
[19:32:25] <logmsgbot>	 !log jebe@deploy2002 Finished deploy [airflow-dags/analytics_product@0aa9d7c]: (no justification provided) (duration: 00m 35s)
[19:40:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481542 (10phaultfinder)
[19:43:43] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7001.magru.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[19:43:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10481570 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=26b7dbb9-1906-4b10-a433-cc2ffb6bdb61) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[19:44:57] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[19:45:24] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] "Merging as it was already approved by @dcaro@wikimedia.org, I just removed leftover comments." [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[19:45:28] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+2 C:03+2] wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[19:45:49] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] "Merging as it was already approved by @dcaro@wikimedia.org, I just removed leftover comments." [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[19:46:08] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+2 C:03+2] wmcs: Migrate network saturation alerts to the alerts.git repository (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[19:46:30] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] wmcs: Remove Puppet files for migrated Prometheus alerts [puppet] - 10https://gerrit.wikimedia.org/r/1111340 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse)
[19:54:40] <wikibugs>	 (03PS2) 10CDanis: draft: allow k8s NodeJS apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295
[19:58:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:59:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C:04-2] Remove nutcracker from cloudweb hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: 10Majavah)
[19:59:41] <wikibugs>	 (03Abandoned) 10Andrew Bogott: Remove nutcracker from cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: 10Majavah)
[20:01:17] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] backy2: on Bullseye, hack around a silly package name mismatch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) (owner: 10Andrew Bogott)
[20:01:52] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Add ceph config for cloudcephosd103[5-8] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344) (owner: 10Andrew Bogott)
[20:02:22] <wikibugs>	 (03CR) 10CDanis: [C:03+1] thanos: further reduce trace sampling [puppet] - 10https://gerrit.wikimedia.org/r/1112700 (https://phabricator.wikimedia.org/T378190) (owner: 10Filippo Giunchedi)
[20:05:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:06:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481628 (10phaultfinder)
[20:11:24] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 #page on db2175 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:11:31] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:12:38] <herron>	 !incidents
[20:12:39] <sirenbot>	 5624 (UNACKED)  db2175 (paged)/MariaDB Replica SQL: s2 (paged)
[20:12:39] <sirenbot>	 5623 (RESOLVED)  Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[20:12:39] <sirenbot>	 5622 (RESOLVED)  NELHigh sre (thanos-rule tcp.timed_out)
[20:12:39] <sirenbot>	 5621 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr1-esams.wikimedia.org)
[20:12:40] <sirenbot>	 5620 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[20:12:40] <sirenbot>	 5619 (RESOLVED)  db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[20:12:52] <herron>	 !ack 5624
[20:12:52] <sirenbot>	 5624 (ACKED)  db2175 (paged)/MariaDB Replica SQL: s2 (paged)
[20:14:14] <Amir1>	 I have no access to pc right now. Can you depool it until I get back?
[20:14:21] <federico3>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2175&from=now-30m&to=now
[20:14:45] <logmsgbot>	 !log herron@cumin1002 dbctl commit (dc=all): 'depool db2175', diff saved to https://phabricator.wikimedia.org/P72208 and previous config saved to /var/cache/conftool/dbconfig/20250121-201444-herron.json
[20:14:55] <herron>	 Amir1: you bet, just did
[20:15:03] <marostegui>	 I am fixing it
[20:15:11] <marostegui>	 Should be fixed now
[20:15:21] <marostegui>	 But let's leave it depooled so I can upgrade it tomorrow 
[20:15:23] <Amir1>	 Oh thank you both!
[20:15:24] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s2 #page on db2175 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:15:29] <herron>	 thanks marostegui ok sounds good
[20:15:33] <marostegui>	 Thanks herron for the depool 
[20:15:37] <herron>	 np!
[20:16:09] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:16:19] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:21:00] <wikibugs>	 (CR) Ssingh: "Abandoning this because we are also updating eqiad in here, which is not required. Doing a new patch to make review easier, comparing agai" [dns] - https://gerrit.wikimedia.org/r/1101908 (https://phabricator.wikimedia.org/T380858) (owner: CDobbins)
[20:21:04] <wikibugs>	 (Abandoned) Ssingh: Remove eqiad from public and private IP spaces [dns] - https://gerrit.wikimedia.org/r/1101908 (https://phabricator.wikimedia.org/T380858) (owner: CDobbins)
[20:21:40] <wikibugs>	 (PS1) Ssingh: geo-maps: put eqiad at lowest priority for T380858 [dns] - https://gerrit.wikimedia.org/r/1113205 (https://phabricator.wikimedia.org/T380858)
[20:23:05] <wikibugs>	 (CR) Ssingh: "For reviewers: the idea is to ensure that eqiad is lowest priority for non-eqiad DCs." [dns] - https://gerrit.wikimedia.org/r/1113205 (https://phabricator.wikimedia.org/T380858) (owner: Ssingh)
[20:23:54] <wikibugs>	 (CR) Herron: [C:+1] "good call" [puppet] - https://gerrit.wikimedia.org/r/1112700 (https://phabricator.wikimedia.org/T378190) (owner: Filippo Giunchedi)
[20:35:45] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[20:35:47] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[20:42:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481696 (phaultfinder)
[20:47:25] <wikibugs>	 (CR) GergesShamon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/222255 (owner: Matanya)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T2100).
[21:00:05] <jouncebot>	 ZhaoFJx, Jdlrobson, and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:23] <cjming>	 o/
[21:01:27] <cjming>	 i can deploy
[21:02:50] <Jdlrobson>	 o/
[21:05:05] <cjming>	 ZhaoFJx: are you around?
[21:05:19] <cjming>	 if not, i can start with your patch Jdlrobson
[21:06:09] <wikibugs>	 (PS2) Jdlrobson: Enable Vector 2022 and dark mode on Azerbaijani wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1112101 (https://phabricator.wikimedia.org/T383942)
[21:07:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1112101 (https://phabricator.wikimedia.org/T383942) (owner: Jdlrobson)
[21:08:09] <wikibugs>	 (Merged) jenkins-bot: Enable Vector 2022 and dark mode on Azerbaijani wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1112101 (https://phabricator.wikimedia.org/T383942) (owner: Jdlrobson)
[21:08:39] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1112101|Enable Vector 2022 and dark mode on Azerbaijani wikis (T383942)]]
[21:08:44] <stashbot>	 T383942: Jan 20, 2025: Vector 2022 and dark mode deployments - https://phabricator.wikimedia.org/T383942
[21:10:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481804 (phaultfinder)
[21:10:49] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:14:11] <cjming>	 Jdlrobson: up on test servers if you want to check
[21:14:52] <logmsgbot>	 !log cjming@deploy2002 cjming, jdlrobson: Backport for [[gerrit:1112101|Enable Vector 2022 and dark mode on Azerbaijani wikis (T383942)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:14:56] <stashbot>	 T383942: Jan 20, 2025: Vector 2022 and dark mode deployments - https://phabricator.wikimedia.org/T383942
[21:16:04] <Jdlrobson>	 cjming: on it
[21:16:38] <Jdlrobson>	 LGTM cjming 
[21:16:47] <cjming>	 great!
[21:16:52] <logmsgbot>	 !log cjming@deploy2002 cjming, jdlrobson: Continuing with sync
[21:21:46] <ZhaoFJx>	 Sorry I think I am kind of late
[21:21:53] <ZhaoFJx>	 Is the deployment still ongoing?
[21:22:08] <cjming>	 hi ZhaoFJx - yes i can do your patch next
[21:22:24] <ZhaoFJx>	 Thank you cjming
[21:23:16] <cjming>	 np!
[21:23:42] <Jdlrobson>	 thanks cjming ! Looks good!
[21:23:46] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112101|Enable Vector 2022 and dark mode on Azerbaijani wikis (T383942)]] (duration: 15m 06s)
[21:23:50] <wikibugs>	 (PS3) ZhaoFJx: cawiki: Create templateeditor & protection level [mediawiki-config] - https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145)
[21:23:51] <stashbot>	 T383942: Jan 20, 2025: Vector 2022 and dark mode deployments - https://phabricator.wikimedia.org/T383942
[21:24:15] <cjming>	 Jdlrobson: yay! should be live
[21:24:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:24:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: ZhaoFJx)
[21:25:26] <wikibugs>	 (Merged) jenkins-bot: cawiki: Create templateeditor & protection level [mediawiki-config] - https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: ZhaoFJx)
[21:25:55] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1112838|cawiki: Create templateeditor & protection level (T384145)]]
[21:26:00] <stashbot>	 T384145: Create template editor user group and protection level in cawiki - https://phabricator.wikimedia.org/T384145
[21:28:14] <wikibugs>	 (CR) Scott French: php8.1: introduce JIT (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113138 (https://phabricator.wikimedia.org/T384294) (owner: Effie Mouzeli)
[21:31:23] <cjming>	 ZhaoFJx: on test servers if you'd like to test - lmk if/when to sync
[21:31:58] <ZhaoFJx>	 sure
[21:32:04] <logmsgbot>	 !log cjming@deploy2002 zhaofjx, cjming: Backport for [[gerrit:1112838|cawiki: Create templateeditor & protection level (T384145)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:32:08] <stashbot>	 T384145: Create template editor user group and protection level in cawiki - https://phabricator.wikimedia.org/T384145
[21:32:10] <ZhaoFJx>	 all good in https://ca.wikipedia.org/wiki/Especial:Drets_dels_grups_d%27usuaris
[21:32:16] <cjming>	 nice
[21:32:19] <logmsgbot>	 !log cjming@deploy2002 zhaofjx, cjming: Continuing with sync
[21:34:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481941 (phaultfinder)
[21:35:37] <wikibugs>	 (CR) BCornwall: [C:+1] geo-maps: put eqiad at lowest priority for T380858 [dns] - https://gerrit.wikimedia.org/r/1113205 (https://phabricator.wikimedia.org/T380858) (owner: Ssingh)
[21:39:20] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112838|cawiki: Create templateeditor & protection level (T384145)]] (duration: 13m 24s)
[21:39:24] <stashbot>	 T384145: Create template editor user group and protection level in cawiki - https://phabricator.wikimedia.org/T384145
[21:40:14] <cjming>	 ZhaoFJx: should be live :)
[21:40:34] <ZhaoFJx>	 Checked, thank you!
[21:40:37] <ZhaoFJx>	 Have a good one
[21:40:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113192 (https://phabricator.wikimedia.org/T384333) (owner: Clare Ming)
[21:46:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:49:51] <wikibugs>	 (Merged) jenkins-bot: Fix schema version for CTR instrument [extensions/WikimediaEvents] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113192 (https://phabricator.wikimedia.org/T384333) (owner: Clare Ming)
[21:50:23] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1113192|Fix schema version for CTR instrument (T384333)]]
[21:50:28] <stashbot>	 T384333: Wrong schema used in the CTR instrument (so experimentation fragment is empty for every event) - https://phabricator.wikimedia.org/T384333
[21:55:19] <logmsgbot>	 !log cjming@deploy2002 cjming: Backport for [[gerrit:1113192|Fix schema version for CTR instrument (T384333)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:55:33] <logmsgbot>	 !log cjming@deploy2002 cjming: Continuing with sync
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T2200)
[22:02:29] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113192|Fix schema version for CTR instrument (T384333)]] (duration: 12m 05s)
[22:02:33] <stashbot>	 T384333: Wrong schema used in the CTR instrument (so experimentation fragment is empty for every event) - https://phabricator.wikimedia.org/T384333
[22:03:21] <wikibugs>	 (PS1) BCornwall: slo_template: update SLO dates to current window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1113212
[22:04:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:06:42] <cjming>	 !log end of UTC late backport window
[22:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:51:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:53:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:01:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1121 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:12:09] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1136/IPv4: Connect - KPN, AS1136/IPv6: Connect - KPN https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:13:17] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:14:32] <wikibugs>	 (PS1) Scott French: shellbox-constraints: 1 eqiad replica on 8.1 (change 1/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113217 (https://phabricator.wikimedia.org/T377038)
[23:14:33] <wikibugs>	 (PS1) Scott French: shellbox-constraints: all eqiad replicas on 8.1 (change 2/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113218 (https://phabricator.wikimedia.org/T377038)
[23:14:34] <wikibugs>	 (PS1) Scott French: shellbox-constraints: all replicas on PHP 8.1 (change 3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113219 (https://phabricator.wikimedia.org/T377038)
[23:21:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:53:25] <wikibugs>	 (CR) Dzahn: "collaboration-services-releng is supposed to work. The "receiver" is "name: 'collaboration-services-releng-critical" which should be the c" [puppet] - https://gerrit.wikimedia.org/r/1113163 (owner: Jelto)
[23:55:07] <mutante>	 !incidents
[23:55:08] <sirenbot>	 5624 (RESOLVED)  db2175 (paged)/MariaDB Replica SQL: s2 (paged)
[23:55:08] <sirenbot>	 5623 (RESOLVED)  Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[23:55:08] <sirenbot>	 5622 (RESOLVED)  NELHigh sre (thanos-rule tcp.timed_out)
[23:55:08] <sirenbot>	 5621 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr1-esams.wikimedia.org)
[23:55:08] <sirenbot>	 5620 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[23:56:40] <jinxer-wm>	 FIRING: [3x] KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:58:38] <wikibugs>	 (CR) Scott French: "For clarity, I should probably mention explicitly:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1080388 (https://phabricator.wikimedia.org/T377042) (owner: Scott French)
[23:58:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:58:52] <wikibugs>	 (CR) Dzahn: "The alert that is being changed here is for the SSH port, not for Apache. When looking at incident history I see that the page was only a " [puppet] - https://gerrit.wikimedia.org/r/1113163 (owner: Jelto)