[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0000) [00:10:35] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:35] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:38:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112861 [00:38:23] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112861 (owner: 10TrainBranchBot) [00:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10478091 (10phaultfinder) [00:59:14] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112861 (owner: 10TrainBranchBot) [01:08:47] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112863 [01:08:47] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112863 (owner: 10TrainBranchBot) [01:31:16] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112863 (owner: 10TrainBranchBot) [01:33:27] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & 
deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478122 (10thcipriani) > deployment POSIX group Approved as `deployment` gr... [01:33:54] (03CR) 10Thcipriani: [C:03+1] admin/data: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (owner: 10Klausman) [01:46:22] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/a762b343b40fe38171f766309bee9f00e5029cc1d5d72196fa007b9b4489dc54/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:06:22] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:08:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.13 [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1112864 (https://phabricator.wikimedia.org/T382364) [02:08:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.13 [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1112864 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot) [02:28:32] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.13 [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1112864 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot) [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:51:31] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org 
- Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0300) [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:10:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0400) [04:01:44] (PS1) TrainBranchBot: testwikis to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1112868 (https://phabricator.wikimedia.org/T382364) [04:01:46] (CR) TrainBranchBot: [C:+2] testwikis to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1112868 (https://phabricator.wikimedia.org/T382364) (owner: TrainBranchBot) [04:02:32] (Merged) jenkins-bot: testwikis to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1112868 (https://phabricator.wikimedia.org/T382364) (owner: TrainBranchBot) [04:02:58] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.13 refs T382364 [04:03:02] T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364 [04:11:26] PROBLEM - BGP status
on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:21:24] PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/9249d506f6c493ccd9a605f0a29558143bfeec6e067778a29a480114f9f6ac6b/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [04:41:24] RECOVERY - Disk space on deploy2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [04:51:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:54:20] (03PS1) 10Kevin Bazira: changeprop: add liftwing article-country stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0500) [05:01:52] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.13 refs T382364 (duration: 58m 53s) [05:01:55] T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364 [05:04:57] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.8 (duration: 04m 55s) [05:13:36] PROBLEM - Disk space on kafka-logging1004 is 
CRITICAL: DISK CRITICAL - free space: /srv 159774 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-logging1004&var-datasource=eqiad+prometheus/ops [05:17:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [05:18:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:40:59] (03PS1) 10KartikMistry: Update cxserver to 2025-01-20-172318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112871 (https://phabricator.wikimedia.org/T377966) [05:42:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:47:48] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:47:57] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:09:40] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: 10ZhaoFJx) [06:17:31] RESOLVED: Primary outbound port utilisation 
over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [06:18:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [06:23:10] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:23:14] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:31:45] (03PS1) 10Marostegui: db2207,db2148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113014 (https://phabricator.wikimedia.org/T384272) [06:32:24] (03CR) 10Marostegui: [C:03+2] db2207,db2148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113014 (https://phabricator.wikimedia.org/T384272) (owner: 10Marostegui) [06:33:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2207 db2148 T384272', diff saved to https://phabricator.wikimedia.org/P72164 and previous config saved to /var/cache/conftool/dbconfig/20250121-063301-marostegui.json [06:34:01] (03CR) 10Anzx: [C:03+1] "looks good to me please schedule for backport, @zhaofjx@gmail.com you don't have to add reveiwer unless you have any doubt, just saying pe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: 10ZhaoFJx) [06:34:04] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - 
https://phabricator.wikimedia.org/T379942#10478451 (10Ladsgroup) All 16 containers of 00 to 0f have been cleaned up. Starting 10 to 1f now. [06:34:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2212 with weight 0 T383690', diff saved to https://phabricator.wikimedia.org/P72165 and previous config saved to /var/cache/conftool/dbconfig/20250121-063416-root.json [06:35:01] !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s1 T383690 [06:35:32] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2212 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1111266 (https://phabricator.wikimedia.org/T383690) (owner: 10Gerrit maintenance bot) [06:37:26] !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db[2148,2207].codfw.wmnet with reason: Rebuild and upgrade db2207 db2148 [06:38:04] !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: Rebuild and upgrade db2189 [06:40:47] !log Starting s1 codfw failover from db2203 to db2212 - T383690 [06:41:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s1 codfw as read-only for maintenance - T383690', diff saved to https://phabricator.wikimedia.org/P72166 and previous config saved to /var/cache/conftool/dbconfig/20250121-064104-root.json [06:43:21] Thanks! 
[06:43:28] This is not going well though [06:45:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:45:37] I will have to finish this manually, it got stuck on the old master semi sync [06:45:38] great [06:45:40] Amir1: ^ [06:45:53] shit [06:46:05] what can I do to help? [06:46:35] !log marostegui@cumin2002 dbctl commit (dc=all): 'Promote db2212 to s1 primary and set section read-write T383690', diff saved to https://phabricator.wikimedia.org/P72167 and previous config saved to /var/cache/conftool/dbconfig/20250121-064634-root.json [06:46:51] Amir1: can you check if you can edit enwiki now? [06:46:56] sure [06:47:23] edits are coming in [06:47:36] my edits are getting saved too [06:47:41] good [06:49:10] !log marostegui@dns1006 START - running authdns-update [06:49:15] !log marostegui@dns1006 START - running authdns-update [06:50:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:50:24] (CR) Marostegui: [C:+2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1111267 (https://phabricator.wikimedia.org/T383690) (owner: Gerrit maintenance bot) [06:51:00] !log marostegui@dns1006 END - running authdns-update [06:51:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2203 T383690', diff saved to https://phabricator.wikimedia.org/P72168 and previous config saved to
/var/cache/conftool/dbconfig/20250121-065114-marostegui.json [06:51:18] T383690: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T383690 [06:51:31] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [06:52:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: rebuilding index [06:56:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2216 T384273', diff saved to https://phabricator.wikimedia.org/P72169 and previous config saved to /var/cache/conftool/dbconfig/20250121-065640-marostegui.json [06:56:47] T384273: Rebuild db2203 - https://phabricator.wikimedia.org/T384273 [06:58:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: rebuilding index [06:59:06] (03PS1) 10Marostegui: db2203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113015 (https://phabricator.wikimedia.org/T384273) [06:59:46] (03CR) 10Marostegui: [C:03+2] db2203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113015 (https://phabricator.wikimedia.org/T384273) (owner: 10Marostegui) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0700) [07:00:05] marostegui and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0700). 
[07:00:18] the jouncebot missed all the fun [07:02:03] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2216.codfw.wmnet onto db2203.codfw.wmnet [07:10:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:04] (03CR) 10ZhaoFJx: "Thank you for information!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: 10ZhaoFJx) [07:27:07] (03PS1) 10Giuseppe Lavagetto: Two bugfixes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1113070 [07:27:22] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Two bugfixes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1113070 (owner: 10Giuseppe Lavagetto) [07:28:35] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1002" [07:28:37] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1002 [07:29:08] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1002 [07:29:09] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1002" [07:41:59] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@868de0c]: 202412 Backfill: Fixes on ExternalTaskMarker experiment [07:42:31] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@868de0c]: 202412 Backfill: Fixes on ExternalTaskMarker experiment (duration: 00m 32s) [07:56:35] !log kevinbazira@deploy2002 helmfile 
[ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [07:59:26] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:01:11] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for administrators of Indonesian projects - https://phabricator.wikimedia.org/T384135#10478525 (10Ladsgroup) 05Open→03Resolved done: https://lists.wikimedia.org/postorius/lists/wiki-id-admins.lists.wikimedia.org [08:02:13] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [08:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:25:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:34:46] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [08:34:50] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10478534 (10MoritzMuehlenhoff) [08:36:13] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2024.codfw.wmnet with reason: remove from cluster for reimage [08:36:18] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10478535 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=de610340-5385-4389-b2bb-b869e4134a65) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [08:48:18] (03PS1) 10Muehlenhoff: sre.debmonitor.remove-hosts: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) [08:50:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:51:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2216.codfw.wmnet onto db2203.codfw.wmnet [08:58:49] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [08:59:08] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0900) [09:01:47] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478569 (10isarantopoulos) [09:03:52] !log dcausse@deploy2002 helmfile 
[staging] START helmfile.d/services/rdf-streaming-updater: apply [09:04:09] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:07:19] (03CR) 10Volans: [C:03+1] "Thanks! I've suggested one alternative option inline. Up to you, LGTM in both cases." [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [09:10:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2024.codfw.wmnet with OS bookworm [09:10:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10478580 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS bookworm [09:14:54] (03PS1) 10Brouberol: airflow: remove useless separator in pod spec confusing the config checksum computation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113077 (https://phabricator.wikimedia.org/T384275) [09:15:54] (03PS2) 10Brouberol: airflow: remove useless separator in pod spec confusing the config checksum computation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113077 (https://phabricator.wikimedia.org/T384275) [09:16:55] (03PS3) 10Brouberol: airflow: remove useless separator in pod spec confusing the config checksum computation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113077 (https://phabricator.wikimedia.org/T384275) [09:22:50] (03PS2) 10DCausse: wdqs: enable new event stream api config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112763 (https://phabricator.wikimedia.org/T374919) [09:26:58] (03PS2) 10Muehlenhoff: sre.debmonitor.remove-hosts: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) [09:27:13] (03PS2) 10Alexandros Kosiaris: Map rest_v1/page/(html|title)/ to rest-gateway [puppet] - 
10https://gerrit.wikimedia.org/r/1112188 (https://phabricator.wikimedia.org/T374683) [09:28:09] (03CR) 10Muehlenhoff: sre.debmonitor.remove-hosts: Reduce logging to SAL (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [09:28:52] (03CR) 10DCausse: [C:03+2] wdqs: enable new event stream api config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112763 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:29:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [09:29:20] (03CR) 10Alexandros Kosiaris: [C:03+2] Map rest_v1/page/(html|title)/ to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1112188 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [09:29:37] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:29:39] !incidents [09:29:40] 5622 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out) [09:29:40] 5611 (RESOLVED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [09:29:40] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org) [09:29:40] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [09:29:40] 5619 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged) [09:29:41] 5618 (RESOLVED) db2148 (paged)/MariaDB Replica SQL: s2 (paged) [09:29:41] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [09:29:41] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [09:29:41] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc 
(cr2-eqord.wikimedia.org) [09:29:42] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [09:29:42] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [09:29:43] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [09:29:43] het [09:29:47] !ack 5622 [09:29:48] 5622 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [09:29:48] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:29:56] (03CR) 10Volans: sre.debmonitor.remove-hosts: Reduce logging to SAL (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [09:29:56] (03Merged) 10jenkins-bot: wdqs: enable new event stream api config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112763 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:30:07] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:30:19] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:30:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [09:30:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [09:30:31] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:32:01] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:32:12] (03CR) 10CI reject: [V:04-1] sre.debmonitor.remove-hosts: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [09:32:28] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:32:42] (03PS3) 10Muehlenhoff: sre.debmonitor.remove-hosts: 
Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) [09:33:16] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [09:33:33] (03CR) 10David Caro: wmcs: Migrate iowait stalling alerts to the alerts.git repository (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [09:33:57] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2024.codfw.wmnet with reason: host reimage [09:34:32] (03CR) 10David Caro: wmcs: Migrate iowait stalling alerts to the alerts.git repository (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [09:34:33] (03CR) 10Btullis: [C:03+1] airflow: remove useless separator in pod spec confusing the config checksum computation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113077 (https://phabricator.wikimedia.org/T384275) (owner: 10Brouberol) [09:34:49] (03CR) 10Brouberol: [C:03+2] airflow: remove useless separator in pod spec confusing the config checksum computation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113077 (https://phabricator.wikimedia.org/T384275) (owner: 10Brouberol) [09:35:10] (03CR) 10David Caro: [C:03+1] wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [09:35:29] (03PS1) 10DCausse: wdqs: add missing page_change_content_models config entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113080 (https://phabricator.wikimedia.org/T374919) [09:37:34] (03CR) 10DCausse: [C:03+2] wdqs: add missing page_change_content_models config entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113080 
(https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:37:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2024.codfw.wmnet with reason: host reimage [09:38:52] (03Merged) 10jenkins-bot: wdqs: add missing page_change_content_models config entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113080 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:39:07] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:39:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [09:39:26] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:42:25] (03CR) 10David Caro: wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [09:43:00] (03CR) 10David Caro: [C:03+1] "Just the comment leftover, LGTM otherwise" [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [09:44:42] (03CR) 10David Caro: [C:03+1] "LGTM as is, just the comments need updating, thanks a lot!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [09:45:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72172 and previous config saved to /var/cache/conftool/dbconfig/20250121-094537-root.json [09:46:31] (03PS1) 10DCausse: wdqs: add missing config entry main_output_stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113082 (https://phabricator.wikimedia.org/T374919) [09:46:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72173 and previous config saved to /var/cache/conftool/dbconfig/20250121-094637-root.json [09:47:25] !log set udp_localhost-info retention.bytes=100000000000 on kafka-logging - T384233 [09:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:29] T384233: Unexpected utilization increase in udp_localhost-info kafka-logging topic - https://phabricator.wikimedia.org/T384233 [09:47:42] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:49:03] (03CR) 10Muehlenhoff: [C:03+2] sre.debmonitor.remove-hosts: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113075 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [09:50:16] (03CR) 10Gmodena: [C:03+1] "I am not familiar with the specific traffic patterns, but the alert declaration LGTM." 
[alerts] - 10https://gerrit.wikimedia.org/r/1111300 (https://phabricator.wikimedia.org/T373459) (owner: 10DCausse) [09:50:24] (03CR) 10DCausse: [C:03+2] wdqs: add missing config entry main_output_stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113082 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:51:44] (03Merged) 10jenkins-bot: wdqs: add missing config entry main_output_stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113082 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:52:18] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:52:43] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:53:03] (03PS1) 10Muehlenhoff: sre.ganeti.resource-report: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655) [09:53:36] RECOVERY - Disk space on kafka-logging1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-logging1004&var-datasource=eqiad+prometheus/ops [09:57:21] (03PS1) 10Muehlenhoff: sre.idm.logout: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113087 (https://phabricator.wikimedia.org/T324655) [09:57:23] (03PS1) 10Muehlenhoff: sre.puppet.renew-cert: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113088 (https://phabricator.wikimedia.org/T324655) [09:57:26] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:58:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2024.codfw.wmnet with OS bookworm [09:58:32] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw 
to Bookworm - https://phabricator.wikimedia.org/T382508#10478723 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS bookworm completed: - ganeti202... [10:00:32] !log set udp_localhost-info retention.bytes=300000000000 on kafka-logging (back to original value) - T384233 [10:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:36] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478726 (10isarantopoulos) [10:00:36] T384233: Unexpected utilization increase in udp_localhost-info kafka-logging topic - https://phabricator.wikimedia.org/T384233 [10:00:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72174 and previous config saved to /var/cache/conftool/dbconfig/20250121-100042-root.json [10:01:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72175 and previous config saved to /var/cache/conftool/dbconfig/20250121-100142-root.json [10:01:57] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T384281 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [10:02:03] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384281 (10ops-monitoring-bot) 03NEW [10:03:05] (03PS1) 10Cathal Mooney: Remove config to shift AT&T traffic away from Lumen transit [homer/public] - 10https://gerrit.wikimedia.org/r/1113090 (https://phabricator.wikimedia.org/T384253) [10:03:55] !log installing intel-microcode security updates [10:03:57] !log installing 
python-tornado security updates [10:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:16] (03CR) 10Cathal Mooney: [C:03+2] Remove config to shift AT&T traffic away from Lumen transit [homer/public] - 10https://gerrit.wikimedia.org/r/1113090 (https://phabricator.wikimedia.org/T384253) (owner: 10Cathal Mooney) [10:04:58] (03Merged) 10jenkins-bot: Remove config to shift AT&T traffic away from Lumen transit [homer/public] - 10https://gerrit.wikimedia.org/r/1113090 (https://phabricator.wikimedia.org/T384253) (owner: 10Cathal Mooney) [10:08:57] (03CR) 10Jelto: [C:04-1] "Looks mostly good but I left some comments in-line." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [10:10:28] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478740 (10isarantopoulos) I approve both as a manager and owner of the ml g... 
[10:10:43] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478742 (10isarantopoulos) [10:11:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [10:11:23] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10478744 (10isarantopoulos) [10:11:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [10:12:30] jouncebot: now [10:12:30] For the next 0 hour(s) and 47 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T0900) [10:12:34] jouncebot: next [10:12:35] In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1100) [10:15:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72176 and previous config saved to /var/cache/conftool/dbconfig/20250121-101548-root.json [10:16:38] (03PS4) 10JMeybohm: admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) [10:16:39] (03PS9) 10JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) [10:16:46] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2024 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1112767 (owner: 10Muehlenhoff) [10:16:48] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'db2148 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72177 and previous config saved to /var/cache/conftool/dbconfig/20250121-101648-root.json [10:18:47] (03CR) 10JMeybohm: [C:03+1] wikikube: rename mw147[0-5] -> wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1112828 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [10:19:07] (03CR) 10Jelto: [C:04-1] miscweb: support os-reports deployment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [10:20:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [10:26:42] !log adjust VRRP priorities for public and analytics vlans on eqiad CRs to balance traffic [10:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:06] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Should be okay to deploy at any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) (owner: 10Novem Linguae) [10:29:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet [10:30:09] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4838/co" [puppet] - 10https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [10:30:11] 06SRE, 06Infrastructure-Foundations, 10netops: Manage VRRP priority from Netbox - https://phabricator.wikimedia.org/T381873#10478784 (10cmooney) 05Open→03Resolved a:03cmooney This is all complete and I've set priorities in Netbox to balance traffic from the 4 legacy rows in eqiad across the CRs there. 
[10:30:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) (owner: 10Novem Linguae) [10:30:52] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [10:30:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72179 and previous config saved to /var/cache/conftool/dbconfig/20250121-103053-root.json [10:31:19] (03PS1) 10Filippo Giunchedi: hieradata: site expansion for kafka::logging role description [puppet] - 10https://gerrit.wikimedia.org/r/1113096 [10:31:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72180 and previous config saved to /var/cache/conftool/dbconfig/20250121-103153-root.json [10:32:16] (03CR) 10Filippo Giunchedi: "The motd doesn't get updated because the resulting shell script fails:" [puppet] - 10https://gerrit.wikimedia.org/r/1113096 (owner: 10Filippo Giunchedi) [10:33:37] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 81.62 ms [10:34:25] (03PS2) 10Filippo Giunchedi: hieradata: fix site expansion for role description [puppet] - 10https://gerrit.wikimedia.org/r/1113096 [10:35:38] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: recording rules for mw edit rates [puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [10:40:40] !log de-pref Chicago routes learnt on core routers in Dallas [10:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:56] 06SRE, 06Infrastructure-Foundations, 10netops: Improve Eqiad
outbound traffic balance - https://phabricator.wikimedia.org/T384253#10478825 (10cmooney) FWIW I have made the same change in codfw for routes learnt from eqord (Chicago). Locally-learnt routes will now be preferred unless the AS-Path from Chicago... [10:45:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72181 and previous config saved to /var/cache/conftool/dbconfig/20250121-104559-root.json [10:46:20] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1113098 (https://phabricator.wikimedia.org/T384284) [10:46:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72182 and previous config saved to /var/cache/conftool/dbconfig/20250121-104658-root.json [10:51:32] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [10:54:51] (03PS1) 10Btullis: Temporarily disable gobblin timers on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1113101 (https://phabricator.wikimedia.org/T380619) [10:55:18] (03CR) 10David Caro: [C:03+1] "LGTM, just a note there" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [10:55:55] (03CR) 10Brouberol: [C:03+1] Temporarily disable gobblin timers on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1113101 (https://phabricator.wikimedia.org/T380619) (owner: 10Btullis) [10:56:00] (03CR) 10Btullis: [C:03+2] Temporarily disable gobblin timers on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1113101 (https://phabricator.wikimedia.org/T380619) (owner: 10Btullis) [10:58:09] (03CR) 10David Caro: 
[C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1100) [11:00:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2024.codfw.wmnet to cluster codfw and group A [11:01:22] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10478911 (10MoritzMuehlenhoff) [11:01:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2024.codfw.wmnet to cluster codfw and group A [11:02:35] jouncebot: now [11:02:35] For the next 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1100) [11:03:03] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [11:03:38] (03CR) 10Effie Mouzeli: [C:03+2] mw-(web|api-ext)-next: bump replicas and update TODO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112078 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [11:04:51] (03Merged) 10jenkins-bot: mw-(web|api-ext)-next: bump replicas and update TODO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112078 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [11:05:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113096 (owner: 10Filippo Giunchedi) [11:05:49] (03PS1) 10Brouberol: airflow: re-introduce KRB5_KEYTAB in the task pod env [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113103 (https://phabricator.wikimedia.org/T384282) [11:06:01] (03CR) 10Jelto: [C:03+1] "lgtm, output of upstream 
diff looks similar `~/git/calico$ git diff --stat v3.23.3 v3.29.1 -- ./libcalico-go/config/crd`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:07:34] (03CR) 10Hnowlan: [C:03+1] "lgtm! I've noticed that the templating for liftwing rules adds the comment header before each rule - not something to be fixed in this rev" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [11:07:58] (03CR) 10Btullis: [C:03+1] airflow: re-introduce KRB5_KEYTAB in the task pod env [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113103 (https://phabricator.wikimedia.org/T384282) (owner: 10Brouberol) [11:08:30] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:09:01] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:09:29] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:10:04] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1113104 (https://phabricator.wikimedia.org/T384287) [11:10:08] (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1113105 (https://phabricator.wikimedia.org/T384287) [11:10:17] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:10:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:36] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:11:06] !log jiji@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/mw-web: apply [11:11:58] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:12:01] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:12:26] (03PS1) 10Muehlenhoff: Switch ganeti2019 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113106 [11:12:53] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:12:53] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/research AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:13:18] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:13:21] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:13:42] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:14:03] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:14:12] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:14:28] (03CR) 10Brouberol: [C:03+2] airflow: re-introduce KRB5_KEYTAB in the task pod env [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113103 (https://phabricator.wikimedia.org/T384282) (owner: 10Brouberol) [11:14:38] (03CR) 10Volans: [C:03+1] "LGTM, thanks" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1113088 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [11:14:54] (03CR) 10Volans: [C:03+1] "LGTM, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113087 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [11:15:19] 06SRE, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288 (10cmooney) 03NEW p:05Triage→03Medium [11:15:35] 06SRE, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10478967 (10cmooney) [11:16:42] 06SRE, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10478971 (10cmooney) [11:18:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [11:18:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [11:19:00] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:19:46] (03CR) 10Volans: "LGTM, but I've suggested how to make it not log at all to SAL" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [11:22:06] (03PS4) 10Máté Szabó: Enable electionadmin user group on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: 10Dreamrimmer) [11:23:11] (03PS2) 10Muehlenhoff: sre.idm.logout: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113087 (https://phabricator.wikimedia.org/T324655) [11:25:40] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:25:50] (03PS1) 10Brouberol: airflow-analytics: migrate scheduler and database to Kubernetes [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1113108 (https://phabricator.wikimedia.org/T380619) [11:26:15] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:26:45] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:26:51] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:27:06] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:28:34] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:28:37] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:29:19] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:29:43] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:29:54] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:30:37] PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet loss = 100% [11:30:38] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:30:40] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:31:28] (03PS2) 10Brouberol: airflow-analytics: migrate scheduler and database to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113108 (https://phabricator.wikimedia.org/T380619) [11:31:45] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:32:18] FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:32:20] !log dcausse@deploy2002 helmfile [staging] DONE 
helmfile.d/services/rdf-streaming-updater: apply [11:32:40] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:33:03] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%): /tmp 0 MB (0% inode=86%): /var/tmp 0 MB (0% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [11:34:07] RECOVERY - Host restbase2037 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [11:34:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) (owner: 10Audrey Penven) [11:34:42] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:35:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:36:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:37:02] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479063 (10jcrespo) Indeed, that's documented at... 
[11:37:53] (03PS3) 10Brouberol: airflow-analytics: migrate scheduler and database to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113108 (https://phabricator.wikimedia.org/T380619) [11:38:09] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:38:40] (03CR) 10Btullis: [C:03+1] airflow-analytics: migrate scheduler and database to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113108 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [11:38:41] (03CR) 10Muehlenhoff: [C:03+2] sre.idm.logout: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113087 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [11:38:53] (03PS2) 10Muehlenhoff: sre.puppet.renew-cert: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113088 (https://phabricator.wikimedia.org/T324655) [11:39:23] FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:40:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:41:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:42:18] RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:42] (03CR) 10Mvolz: [C:03+2] rest-gateway: add params to config, rework citoid path matching (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [11:43:48] (03Merged) 10jenkins-bot: rest-gateway: add params to config, rework citoid path matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [11:44:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: rebuilding index [11:45:08] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479077 (10jcrespo) [11:45:37] PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet loss = 100% [11:47:13] (03PS5) 10Scott French: service::catalog: enable monitoring for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1101124 (https://phabricator.wikimedia.org/T377040) [11:47:18] FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:20] that seems bad [11:47:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72184 and previous config saved to /var/cache/conftool/dbconfig/20250121-114728-root.json [11:47:31] err yeah [11:47:46] I'll depool it [11:47:56] (03CR) 10Effie Mouzeli: [C:03+2] service::catalog: enable 
monitoring for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1101124 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [11:48:15] RECOVERY - Host restbase2037 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [11:48:29] !log hnowlan@cumin2002 conftool action : set/pooled=no; selector: name=restbase2037.codfw.wmnet [11:48:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72185 and previous config saved to /var/cache/conftool/dbconfig/20250121-114836-root.json [11:49:42] (03CR) 10Klausman: [C:03+1] changeprop: add liftwing article-country stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [11:49:50] I don't really care if it's back, looks like the host has bad memory [11:49:53] https://phabricator.wikimedia.org/T383820 [11:50:36] 10ops-codfw, 06SRE, 10Cassandra, 06DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10479104 (10hnowlan) This host went down again this morning, same DIMM errors. I've depooled it for the time being. ` 11:30 <+icinga-wm> PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet... [11:50:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:51:41] ^ possibly a knock-on? [11:52:06] OK to deploy cxserver? 
[11:53:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479109 (10MoritzMuehlenhoff) [11:54:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:54:23] RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:54:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [11:54:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479112 (10ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs [11:54:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:55:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:56:06] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479118 (10jcrespo) [11:56:52] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - 
https://phabricator.wikimedia.org/T384239#10479119 (10jcrespo) I will be adding now the LDAP... [11:57:29] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:57:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [11:59:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [11:59:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2019.codfw.wmnet [12:00:06] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479138 (10jcrespo) >>! In T384239#10478122, @thc... [12:00:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.122 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:00:22] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:00:43] (03CR) 10Jelto: [C:03+1] "change and diff looks reasonable to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:00:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:00:49] I'll just go ahead :) [12:01:03] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-01-20-172318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112871 (https://phabricator.wikimedia.org/T377966) (owner: 10KartikMistry) [12:01:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to drbd [12:02:25] (03Merged) 10jenkins-bot: Update cxserver to 2025-01-20-172318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112871 (https://phabricator.wikimedia.org/T377966) (owner: 10KartikMistry) [12:02:34] (03CR) 10Btullis: [C:03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1112226 (https://phabricator.wikimedia.org/T367315) (owner: 10Muehlenhoff) [12:02:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72186 and previous config saved to /var/cache/conftool/dbconfig/20250121-120234-root.json [12:02:54] (03CR) 10Btullis: [C:03+1] "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1112228 (https://phabricator.wikimedia.org/T367315) (owner: 10Muehlenhoff) [12:03:34] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:03:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72187 and previous config saved to /var/cache/conftool/dbconfig/20250121-120341-root.json [12:04:22] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2189.codfw.wmnet [12:04:23] FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:04:54] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:05:08] PROBLEM - SSH on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:05:08] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:05:16] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:05:49] !log updating db2189.codfw.wmnet for https://phabricator.wikimedia.org/T384202 [12:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479174 (10ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet switching disk type to drbd [12:07:18] FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:25] (03CR) 10Hnowlan: [C:03+2] changeprop: add liftwing article-country stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [12:07:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:08:00] RECOVERY - SSH on ms-fe1014 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:08:28] PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet loss = 100% [12:08:46] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.927 second response time https://wikitech.wikimedia.org/wiki/Swift [12:08:52] (03Merged) 10jenkins-bot: changeprop: add liftwing article-country stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112126 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [12:08:56] downtiming restbase2037 for a day [12:08:57] !log hnowlan@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on restbase2037.codfw.wmnet with reason: Memory issues, rebooting frequently. Depooled. T383820 [12:08:58] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479189 (10jcrespo) WMF LDA group added: https://... 
[12:09:00] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:01] T383820: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820 [12:09:04] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2189.codfw.wmnet [12:09:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:09:48] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:10:20] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:12:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:13:12] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:14:04] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.010 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:14:53] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:15:28] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:15:53] !log Updated cxserver to 2025-01-20-172318-production (T377966, T377813) [12:15:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:58] T377966: Make cxserver Logstash logs readable and reliable - https://phabricator.wikimedia.org/T377966 [12:15:58] T377813: Migrate cxserver code from CommonJS to ESM / ECMAScript - https://phabricator.wikimedia.org/T377813 [12:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:17:00] PROBLEM - MD RAID on ms-fe1014 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:17:01] ACKNOWLEDGEMENT - MD RAID on ms-fe1014 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T384297 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:17:13] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-fe1014 - https://phabricator.wikimedia.org/T384297 (10ops-monitoring-bot) 03NEW [12:17:13] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479205 (10jcrespo) @SuzanneWood-WMDE A reminder that this is mainly blocked on you providing your public ssh key out of band and your manager confirming/approving the request. 
[12:17:38] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: fix site expansion for role description [puppet] - 10https://gerrit.wikimedia.org/r/1113096 (owner: 10Filippo Giunchedi) [12:17:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72189 and previous config saved to /var/cache/conftool/dbconfig/20250121-121739-root.json [12:18:16] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:39] nftables [12:18:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72190 and previous config saved to /var/cache/conftool/dbconfig/20250121-121847-root.json [12:18:51] oops wrong window [12:19:12] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.242 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:50] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:22:03] (03CR) 10Muehlenhoff: [C:03+2] sre.puppet.renew-cert: Reduce logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113088 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [12:22:18] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:23:08] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.571 second response time https://wikitech.wikimedia.org/wiki/Swift [12:27:35] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [12:27:52] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:29:45] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2004.codfw.wmnet to drbd [12:31:05] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10479227 (10jcrespo) a:05DSantamaria→03jcrespo You can proof your indentity based on your linked account here on phab, based on phab... [12:31:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [12:32:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479233 (10ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs [12:32:05] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:32:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [12:32:20] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:32:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain [12:32:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72193 and previous config saved to /var/cache/conftool/dbconfig/20250121-123245-root.json [12:32:55] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:33:07] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479235 (10ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet switching disk type to plain [12:33:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of 
aux-k8s-etcd2004.codfw.wmnet to plain [12:33:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72194 and previous config saved to /var/cache/conftool/dbconfig/20250121-123352-root.json [12:34:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:34:53] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.898 second response time https://wikitech.wikimedia.org/wiki/Swift [12:37:55] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:38:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [12:38:35] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479244 (10ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs [12:40:30] (03PS2) 10Jcrespo: dbbackups: Remove set user permissions from m1 backup user grants [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) [12:40:31] (03PS1) 10Jcrespo: admin: Add dsantamaria to the list of ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/1113122 (https://phabricator.wikimedia.org/T384169) [12:41:13] (03PS2) 10Jcrespo: admin: Add dsantamaria to the list of ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/1113122 (https://phabricator.wikimedia.org/T384169) [12:41:21] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:42:13] RECOVERY - 
Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.097 second response time https://wikitech.wikimedia.org/wiki/Swift [12:42:53] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.947 second response time https://wikitech.wikimedia.org/wiki/Swift [12:47:21] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:47:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72197 and previous config saved to /var/cache/conftool/dbconfig/20250121-124750-root.json [12:48:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72198 and previous config saved to /var/cache/conftool/dbconfig/20250121-124857-root.json [12:49:13] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Swift [12:49:42] FIRING: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:52:47] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10479262 (10cmooney) So looking at a specific peer - 2620:0:863:1:198:35:26:6 on cr4-ulsfo - I can see the SNMP 'index... 
[12:53:03] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [12:53:09] (03PS1) 10Effie Mouzeli: php8.1-cli: introduce opcache and JIT [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113124 (https://phabricator.wikimedia.org/T384294) [12:53:27] (03PS1) 10Elukey: mapnik: fix paths for mapnik directories [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285) [12:54:14] i just saw a different Gerrit interface for a moment, then it went down [12:54:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:54:43] i don't see anything in SAL, was that expected? 
[12:58:04] (03CR) 10Muehlenhoff: [C:03+2] Remove firewall rule for rsync on archiva [puppet] - 10https://gerrit.wikimedia.org/r/1112226 (https://phabricator.wikimedia.org/T367315) (owner: 10Muehlenhoff) [12:58:32] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479280 (10jcrespo) [12:59:02] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:59:15] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:59:28] (03PS2) 10Effie Mouzeli: php8.1-cli: introduce opcache and JIT [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113124 (https://phabricator.wikimedia.org/T384294) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1300) [13:00:10] (03PS1) 10David Caro: toolforge::base: add cron to all boxes [puppet] - 10https://gerrit.wikimedia.org/r/1113128 (https://phabricator.wikimedia.org/T384250) [13:00:36] (03PS6) 10Jcrespo: admin: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman) [13:01:06] (03CR) 10Jcrespo: admin: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman) [13:01:07] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:01:15] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:01:49] 06SRE, 10superset.wikimedia.org: Degraded Superset functionality during a high-traffic incident - https://phabricator.wikimedia.org/T384301 (10MatthewVernon) 03NEW [13:02:31] FIRING: [2x] ProbeDown: Service 
gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:51] (03CR) 10Andrew Bogott: [C:03+1] toolforge::base: add cron to all boxes [puppet] - 10https://gerrit.wikimedia.org/r/1113128 (https://phabricator.wikimedia.org/T384250) (owner: 10David Caro) [13:03:55] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:04:51] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 4.560 second response time https://wikitech.wikimedia.org/wiki/Swift [13:07:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:03] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479371 (10SuzanneWood-WMDE) @WMDECyn can you please approve? 
[13:09:23] !incidents [13:09:24] 5623 (UNACKED) Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage [13:09:24] 5622 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [13:09:24] 5611 (RESOLVED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [13:09:24] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org) [13:09:25] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [13:09:25] 5619 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged) [13:09:25] 5618 (RESOLVED) db2148 (paged)/MariaDB Replica SQL: s2 (paged) [13:09:25] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [13:09:26] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [13:09:26] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [13:09:27] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [13:09:27] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [13:09:28] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [13:09:36] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479373 (10SuzanneWood-WMDE) Hi @jcrespo - sorry I don't understand "providing your public ssh key out of band", what do I need to do? 
[13:10:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:11:32] 06SRE, 06Data-Platform-SRE, 10superset.wikimedia.org: Degraded Superset functionality during a high-traffic incident - https://phabricator.wikimedia.org/T384301#10479385 (10BTullis) Tagging this with #data-platform-sre for triage. I suspect that the errors in Superset may have been caused by timeouts queryin... [13:11:46] (03PS7) 10Jcrespo: admin: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman) [13:15:40] (03CR) 10Jcrespo: [C:03+1] "Did some changes to commit message and patch, asking for an SRE sanity check before rebase and deploy." [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman) [13:15:49] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479405 (10WMDECyn) Sorry for late response, approving this request from WMDE side [13:16:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [13:17:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [13:20:07] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[13:21:18] (03CR) 10Brouberol: [C:03+2] airflow-analytics: migrate scheduler and database to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113108 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [13:22:45] !log btullis@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-launcher1002.eqiad.wmnet with reason: Migrating to kubernetes [13:22:55] !log btullis@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-launcher1002.eqiad.wmnet with reason: Migrating to kubernetes [13:23:34] (03PS1) 10Jelto: gerrit: block alibaba Cloud IPs [puppet] - 10https://gerrit.wikimedia.org/r/1113133 [13:24:43] (03CR) 10Effie Mouzeli: [C:03+1] gerrit: block alibaba Cloud IPs [puppet] - 10https://gerrit.wikimedia.org/r/1113133 (owner: 10Jelto) [13:24:59] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:25:49] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.756 second response time https://wikitech.wikimedia.org/wiki/Swift [13:26:34] (03CR) 10Jelto: [C:03+2] gerrit: block alibaba Cloud IPs [puppet] - 10https://gerrit.wikimedia.org/r/1113133 (owner: 10Jelto) [13:26:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [13:27:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [13:27:37] (03PS3) 10Effie Mouzeli: php8.1-cli: introduce opcache and JIT [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113124 (https://phabricator.wikimedia.org/T384294) [13:28:59] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:30:12] (03PS1) 10Jelto: gerrit: lower throttling threshold to 15 parallel connections [puppet] - 
10https://gerrit.wikimedia.org/r/1113135 [13:30:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:55] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.521 second response time https://wikitech.wikimedia.org/wiki/Swift [13:32:15] 14SRE-Sprint-Week-Sustainability-March2023, 06Data-Persistence-Automations, 06DBA, 13Patch-For-Review, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#10479472 (10Marostegui) @FCeratto-WMF t... [13:32:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:14] (03CR) 10Arnaudb: [C:03+1] gerrit: lower throttling threshold to 15 parallel connections [puppet] - 10https://gerrit.wikimedia.org/r/1113135 (owner: 10Jelto) [13:34:33] (03CR) 10LSobanski: [C:03+1] gerrit: lower throttling threshold to 15 parallel connections [puppet] - 10https://gerrit.wikimedia.org/r/1113135 (owner: 10Jelto) [13:35:23] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:35:31] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479488 (10jcrespo) >>! In T384018#10479373, @SuzanneWood-WMDE wrote: > Hi @jcrespo - sorry I don't understand "providing your public ssh key out of band", what do I need to do? Ye... 
[13:35:53] (03PS1) 10Brouberol: airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) [13:36:17] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.584 second response time https://wikitech.wikimedia.org/wiki/Swift [13:36:40] (03CR) 10Jelto: [C:03+2] gerrit: lower throttling threshold to 15 parallel connections [puppet] - 10https://gerrit.wikimedia.org/r/1113135 (owner: 10Jelto) [13:36:43] (03PS1) 10Mvolz: Update Zotero translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113137 (https://phabricator.wikimedia.org/T384165) [13:37:20] (03CR) 10Btullis: [C:03+2] airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [13:37:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:56] (03CR) 10CI reject: [V:04-1] airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [13:38:12] (03CR) 10Brouberol: [V:03+2] airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [13:38:50] (03CR) 10Btullis: [V:03+2 C:03+2] airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [13:38:59] (03CR) 10CI reject: [V:04-1] Update Zotero translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113137 
(https://phabricator.wikimedia.org/T384165) (owner: 10Mvolz) [13:39:23] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10479515 (10cmooney) >>! In T384258#10477783, @ssingh wrote: > Might be a red herring: The only thing I see that might... [13:40:42] FIRING: [3x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:41:07] (03Merged) 10jenkins-bot: airflow-analytics: fix DB cluster size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113136 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [13:43:43] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [13:43:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [13:45:23] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:45:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:46:33] (03PS1) 10DCausse: flink-app: better support for properties file format [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113139 [13:47:17] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.552 second response time https://wikitech.wikimedia.org/wiki/Swift [13:47:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:27] (03PS1) 10Pmiazga: Disable new WebAuthn credentials creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113141 (https://phabricator.wikimedia.org/T378402) [13:49:38] (03CR) 10JMeybohm: [C:03+2] Revert "Create certificates for Typha/Felix mTLS" [puppet] - 10https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [13:51:12] (03CR) 10JMeybohm: [C:03+2] Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:51:15] (03CR) 10JMeybohm: [C:03+2] Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:51:53] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team, 13Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479562 (10jcrespo) a:03jc... [13:52:48] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10479563 (10jcrespo) [13:53:19] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113122 (https://phabricator.wikimedia.org/T384169) (owner: 10Jcrespo) [13:53:22] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10479565 (10MoritzMuehlenhoff) @DSantamaria As a note for future reference: These days the simpler process is to si... 
[13:54:50] !log mvernon@cumin2002 conftool action : set/pooled=no; selector: name=ms-fe1014.eqiad.wmnet [13:55:02] !log hard-reboot ms-fe1014 [13:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:07] I have a config change in the window coming up soon, but will only be available from 14:30 UTC onward [13:55:07] (03Merged) 10jenkins-bot: Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:55:13] (03Merged) 10jenkins-bot: Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:55:45] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10479568 (10jcrespo) Thanks, @MoritzMuehlenhoff , this was sort of something I realized later, on my side, as it ha... [13:56:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM. Please note that the Kerberos principal must be created separately after merging as documented here: https://wikitech.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman) [13:56:48] (03PS1) 10Brouberol: airlow: restore Api kerberos auth by mounting the keytab into the webserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113143 (https://phabricator.wikimedia.org/T384282) [13:57:11] PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100% [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1400). 
[14:00:05] ihurbain, DreamRimmer, and yerdua_wmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:43] o/ hello - i'd need a deployer pretty please! [14:00:52] o/ [14:01:17] I can deploy [14:01:18] (assuming gerrit is now stable enough, though) [14:01:21] assuming Gerrit cooperates [14:01:25] Lucas_WMDE: that'd be most appreciated :) [14:01:27] hah [14:01:28] o/ [14:02:17] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317 (10MatthewVernon) 03NEW [14:02:35] let’s try our luck [14:02:44] actually, one sec [14:02:47] (03PS3) 10Jcrespo: admin: Add dsantamaria to the list of ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/1113122 (https://phabricator.wikimedia.org/T384169) [14:02:48] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113143 (https://phabricator.wikimedia.org/T384282) (owner: 10Brouberol) [14:03:02] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) 
- https://phabricator.wikimedia.org/T384317#10479635 (10MatthewVernon) p:05Triage→03High [14:03:41] (03CR) 10Jcrespo: [C:03+2] admin: Add dsantamaria to the list of ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/1113122 (https://phabricator.wikimedia.org/T384169) (owner: 10Jcrespo) [14:03:52] waiting before deployment per #_security [14:04:00] nod [14:06:18] (03CR) 10Brouberol: [C:03+2] airlow: restore Api kerberos auth by mounting the keytab into the webserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113143 (https://phabricator.wikimedia.org/T384282) (owner: 10Brouberol) [14:09:18] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [14:09:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:10:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [14:10:08] Lucas_WMDE: could you check if https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1112563 can be merged? [14:11:00] anzx: that looks unrelated to the deployment window? 
[14:11:14] yes unrelated [14:14:20] (03PS4) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [14:14:20] (03PS1) 10Vgutierrez: acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) [14:14:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [14:14:50] ihurbain: starting now [14:14:55] (03PS2) 10Pmiazga: Disable new WebAuthn credentials creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113141 (https://phabricator.wikimedia.org/T378402) [14:14:56] thank you :) [14:15:26] (03Merged) 10jenkins-bot: Remove KartographerParsoidSupport flag from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [14:15:54] * Lucas_WMDE watches zuul pull the new wmf.13 branch in allllll the repositories [14:16:00] uh. s/zuul/scap/ lol [14:16:19] such fun! :P [14:16:20] (03CR) 10Jcrespo: [C:03+2] "I appreciate the reminder!" [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman) [14:16:22] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10479732 (10jcrespo) @DSantamaria : you have been added to the wmf group, which means you can now access to superse... 
[14:16:26] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1111932|Remove KartographerParsoidSupport flag from configuration (T340134)]] [14:16:30] T340134: Feature flag addition/removal for Parsoid support for Kartographer - https://phabricator.wikimedia.org/T340134 [14:16:47] (03PS8) 10Jcrespo: admin: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman) [14:16:51] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4839/console" [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [14:17:45] (03PS2) 10Elukey: mapnik: fix paths for mapnik directories [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285) [14:18:18] (03CR) 10Gmodena: [C:03+1] "neat! Left you a question about edge cases. Merge at will if it is not relevant." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1113139 (owner: 10DCausse) [14:18:30] (03CR) 10Elukey: "Final result:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey) [14:19:31] (03PS5) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [14:19:55] (03PS1) 10Brouberol: airflow-analytics: remove import configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113145 (https://phabricator.wikimedia.org/T380619) [14:20:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [14:20:27] (03PS1) 10Filippo Giunchedi: chartmuseum: remove icinga-based http checks [puppet] - 10https://gerrit.wikimedia.org/r/1113146 (https://phabricator.wikimedia.org/T384324) [14:21:03] !incidents [14:21:03] 5623 (ACKED) Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage [14:21:03] 5622 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [14:21:04] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org) [14:21:04] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [14:21:04] 5619 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged) [14:21:04] 5618 (RESOLVED) db2148 (paged)/MariaDB Replica SQL: s2 (paged) [14:21:04] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [14:21:05] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [14:21:05] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [14:21:06] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc 
(cr2-eqiad.wikimedia.org) [14:21:06] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [14:21:07] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [14:21:09] (03CR) 10Btullis: [C:03+1] airflow-analytics: remove import configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113145 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [14:21:14] !resolve 5623 [14:21:15] 5623 (RESOLVED) Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage [14:21:22] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10479772 (10MoritzMuehlenhoff) >>! In T384169#10479568, @jcrespo wrote: > Thanks, @MoritzMuehlenhoff , this was sor... [14:21:24] (03CR) 10Brouberol: [C:03+2] airflow-analytics: remove import configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113145 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [14:22:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [14:23:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [14:24:49] !log lucaswerkmeister-wmde@deploy2002 ihurbain, lucaswerkmeister-wmde: Backport for [[gerrit:1111932|Remove KartographerParsoidSupport flag from configuration (T340134)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:24:53] T340134: Feature flag addition/removal for Parsoid support for Kartographer - https://phabricator.wikimedia.org/T340134 [14:25:02] (03PS2) 10Vgutierrez: acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) [14:25:02] (03PS6) 10Vgutierrez: hiera: Add pki.goog 
staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [14:25:04] testing [14:25:12] (03CR) 10Jgiannelos: [C:03+1] mapnik: fix paths for mapnik directories [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey) [14:25:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [14:26:18] (03CR) 10Ssingh: acme_chief: Fix handling of default account (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [14:26:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [14:26:52] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4840/console" [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [14:26:56] Lucas_WMDE: looks good from here on mwdebug, you can proceed [14:27:00] !log lucaswerkmeister-wmde@deploy2002 ihurbain, lucaswerkmeister-wmde: Continuing with sync [14:27:03] ok, thanks! [14:27:13] (was there anything to test beyond “it’s not broken”? 
just curious ^^) [14:27:18] (no) [14:27:21] ok ^^ [14:27:58] (well, the "it's not broken" involves cleaning up a few page caches and double checking that kartographer is still behaving with parsoid) [14:28:25] ok, cool [14:28:54] (03CR) 10DCausse: [C:03+2] flink-app: better support for properties file format (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113139 (owner: 10DCausse) [14:29:01] and i was PREPARED :D [14:29:09] that’s always good :D [14:30:23] (03CR) 10Jcrespo: [V:03+2 C:03+2] admin: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (https://phabricator.wikimedia.org/T384239) (owner: 10Klausman) [14:30:33] (03Merged) 10jenkins-bot: flink-app: better support for properties file format [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113139 (owner: 10DCausse) [14:35:00] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:35:16] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:35:34] (03PS3) 10Vgutierrez: acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) [14:35:34] (03PS7) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [14:36:02] (03CR) 10Vgutierrez: acme_chief: Fix handling of default account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [14:36:07] (03CR) 10CI reject: [V:04-1] acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [14:36:13] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111932|Remove KartographerParsoidSupport flag from 
configuration (T340134)]] (duration: 19m 46s) [14:36:13] (03CR) 10Ssingh: [C:03+1] acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [14:36:17] T340134: Feature flag addition/removal for Parsoid support for Kartographer - https://phabricator.wikimedia.org/T340134 [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:50] ok [14:36:56] DreamRimmer next I think [14:37:01] (03CR) 10Ssingh: [C:03+1] acme_chief: Fix handling of default account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [14:37:09] * Lucas_WMDE peeks at diffConfig [14:37:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) (owner: 10Novem Linguae) [14:37:35] thanks Lucas_WMDE ! 
[14:37:39] np :) [14:38:07] (03Merged) 10jenkins-bot: enable 2 factor authentication for enwiki page movers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) (owner: 10Novem Linguae) [14:38:35] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1107936|enable 2 factor authentication for enwiki page movers (T382879)]] [14:38:39] (03PS1) 10Btullis: airflow-analytics: Allow access to the mw-api via service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113149 (https://phabricator.wikimedia.org/T380619) [14:38:39] T382879: Add oathauth-enable permission to extendedmover group on enwiki - https://phabricator.wikimedia.org/T382879 [14:38:57] (03PS4) 10Vgutierrez: acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) [14:38:57] (03PS8) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [14:40:09] (03CR) 10Ssingh: [C:03+1] acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [14:40:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey) [14:40:27] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4841/console" [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [14:40:36] (03CR) 10Vgutierrez: [V:03+1 C:03+2] acme_chief: Fix handling of default account [puppet] - 10https://gerrit.wikimedia.org/r/1113144 (https://phabricator.wikimedia.org/T370837) (owner: 
10Vgutierrez) [14:41:36] oauth change looks good to me [14:41:42] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2019.codfw.wmnet with reason: remove from cluster for reimage [14:41:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:41:46] it hasn’t even deployed yet :P [14:41:47] (03CR) 10Elukey: [V:03+2 C:03+2] mapnik: fix paths for mapnik directories [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113125 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey) [14:41:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10479882 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=05c11855-71d5-489c-8ed8-13baa1a2b7b9) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [14:41:55] (03PS1) 10DCausse: wdqs: fix staging stream names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113150 (https://phabricator.wikimedia.org/T374919) [14:42:01] But i can see [14:42:08] then I guess you got lucky [14:42:22] and hit one of the k8s deployments that were already done [14:42:34] but please wait until scap says it’s okay to test, it’s much less confusing that way imho [14:42:37] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [14:42:53] np [14:43:31] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10479886 (10jcrespo) 05Open→03Resolved Acc... 
[14:43:40] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:44:03] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:44:33] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10479889 (10Volans) If I understand the db structure correctly that should convert into this query: ` select * from b... [14:44:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [14:45:05] !log lucaswerkmeister-wmde@deploy2002 novemlinguae, lucaswerkmeister-wmde: Backport for [[gerrit:1107936|enable 2 factor authentication for enwiki page movers (T382879)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:45:09] T382879: Add oathauth-enable permission to extendedmover group on enwiki - https://phabricator.wikimedia.org/T382879 [14:45:17] DreamRimmer: now it’s ready for testing ^^ [14:45:34] looks good to me afaict [14:45:35] checking [14:45:35] 06SRE, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10479894 (10RobH) @cmooney, I'm updating the order task, but this was delivered in December so I can open a remote hands to get it fixed. Do we need to schedule th... 
[14:45:40] (03CR) 10DCausse: [C:03+2] wdqs: fix staging stream names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113150 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [14:45:50] (03PS9) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [14:46:06] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [14:46:32] looks good [14:46:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:47:10] (03Merged) 10jenkins-bot: wdqs: fix staging stream names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113150 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [14:47:39] !log lucaswerkmeister-wmde@deploy2002 novemlinguae, lucaswerkmeister-wmde: Continuing with sync [14:47:45] ok, thanks for checking! 
[14:49:16] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:49:36] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:50:15] (03PS10) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [14:50:16] (03PS1) 10Vgutierrez: profile::acme_chief: Use Acme_chief::Account type [puppet] - 10https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195) [14:52:23] (03CR) 10CI reject: [V:04-1] profile::acme_chief: Use Acme_chief::Account type [puppet] - 10https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [14:53:03] (03PS2) 10Brouberol: global_config: add the IP of the dyna proxy [puppet] - 10https://gerrit.wikimedia.org/r/1113151 (https://phabricator.wikimedia.org/T380619) [14:53:54] (03CR) 10Btullis: [C:03+1] global_config: add the IP of the dyna proxy [puppet] - 10https://gerrit.wikimedia.org/r/1113151 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [14:54:34] (03CR) 10Brouberol: [C:03+2] global_config: add the IP of the dyna proxy [puppet] - 10https://gerrit.wikimedia.org/r/1113151 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [14:54:45] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1107936|enable 2 factor authentication for enwiki page movers (T382879)]] (duration: 16m 10s) [14:54:49] T382879: Add oathauth-enable permission to extendedmover group on enwiki - https://phabricator.wikimedia.org/T382879 [14:54:52] alright [14:55:01] DreamRimmer: should be done now [14:55:16] yerdua_wmde: do you have time now? 
[14:55:23] (sorry if I missed a message from you, the channel is pretty busy ^^) [14:55:26] thanks :) [14:55:32] I'm here [14:55:34] yay [14:55:37] jouncebot: nowandnext [14:55:38] For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1400) [14:55:38] In 1 hour(s) and 4 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1600) [14:55:47] ok, I think we’ll just overrun the window a bit [14:55:55] unless someone else is burning to deploy something of their own [14:55:59] * Lucas_WMDE listens for a few seconds [14:56:03] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:56:40] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:56:43] let’s go [14:56:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) (owner: 10Audrey Penven) [14:57:26] (03CR) 10Giuseppe Lavagetto: [C:03+1] "Let's get this out, then we can reason on how to improve the chart in general regarding all the feature flags and duplication we have." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:57:40] (03Merged) 10jenkins-bot: Add known-good regexes for WikibaseQualityConstraints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) (owner: 10Audrey Penven) [14:58:07] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1112261|Add known-good regexes for WikibaseQualityConstraints (T380751)]] [14:58:12] T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - https://phabricator.wikimedia.org/T380751 [15:00:10] (03PS2) 10Vgutierrez: profile::acme_chief: Use Acme_chief::Account type [puppet] - 10https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195) [15:00:10] (03PS11) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [15:00:22] * Lucas_WMDE looks for some test item with many extids [15:00:35] (under the assumption that many extids ≈ many format constraints) [15:00:54] whyyyyyyy https://www.wikidata.org/wiki/Q6382438 [15:01:01] 6688 identifiers ._. 
[15:01:10] (and not with one of the allowlisted regexes, so useless for testing) [15:01:22] omg [15:02:19] (03PS1) 10Brouberol: airflow-analytics: allow the egress to ATS for task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113159 (https://phabricator.wikimedia.org/T380619) [15:02:36] https://www.wikidata.org/wiki/Q1744 sure, whatever [15:02:40] 509 extids [15:02:52] many of them probably not allowlisted but hopefully enough are that we’ll be able to see a performance difference [15:03:46] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4846/console" [puppet] - 10https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [15:04:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:04:34] 06SRE: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332 (10LPasqual_WMF) 03NEW [15:04:40] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, audreypenven: Backport for [[gerrit:1112261|Add known-good regexes for WikibaseQualityConstraints (T380751)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:04:47] T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - https://phabricator.wikimedia.org/T380751 [15:04:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:05:37] okay, should be ready to test [15:05:49] (03CR) 10Btullis: [C:03+1] airflow-analytics: allow the egress to ATS for task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113159 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [15:06:03] (03CR) 10Brouberol: [C:03+2] airflow-analytics: allow the egress to ATS for task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113159 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [15:06:04] yerdua_wmde: any ideas for how we can test this? [15:06:28] I was just about to ask you if you had ideas [15:06:39] my idea is to curl https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=Q1744&status=*&format=json&formatversion=2 [15:06:42] (constraint check on that item) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:49] with and without -H 'X-Wikimedia-Debug: backend=k8s-mwdebug' [15:06:54] (per https://wikitech.wikimedia.org/wiki/WikimediaDebug#Command-line_usage) [15:07:00] and see if the time is different [15:07:06] without -H: ca. 42 seconds [15:07:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [15:07:20] with -H: 44s [15:07:21] dangit [15:07:26] lemme try that again :'D [15:07:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [15:08:10] is there any way to see if it touched shellbox? [15:08:19] I think using XHGui might work [15:08:33] (ca. 44s on the second curl with -H btw, dangit) [15:08:51] if I turn on the WikimediaDebug extension and enable XHGui, and then load the URL in the browser [15:09:02] I should get some useful data there [15:09:12] (after waiting ca. 
44 seconds for the request to finish ^^) [15:09:47] nooo firefox don’t time out :( [15:10:27] well, I guess I can still find the request in xhgui anyway [15:10:28] https://performance.wikimedia.org/xhgui/ [15:10:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:35] leads to https://performance.wikimedia.org/xhgui/run/view?id=678fb8c3de9320ac29a8953b [15:10:44] if we look around we should be able to see how many times the different FormatChecker methods are called [15:11:04] o_O https://performance.wikimedia.org/xhgui/run/symbol?id=678fb8c3de9320ac29a8953b&symbol=WikibaseQuality%5CConstraintReport%5CConstraintCheck%5CHelper%5CFormatCheckerHelper%3A%3ArunRegexCheck [15:11:09] “runRegexCheck called no functions” [15:11:31] uh.. what? [15:11:46] I also tried the request again and it timed out at the MediaWiki level (RequestTimeoutException) [15:11:49] maybe I should pick a smaller item [15:12:58] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:13:01] https://www.wikidata.org/wiki/Q415 should have at least one format constraint matching the allowlist, I think [15:13:13] damn, no [15:13:20] [1-9][0-9]{0,6} isn’t quite in the list [15:14:36] seems to be harder than I thought to find items with format constraints in that list :( [15:14:45] (03CR) 10JMeybohm: [C:03+1] chartmuseum: remove icinga-based http checks [puppet] - 10https://gerrit.wikimedia.org/r/1113146 (https://phabricator.wikimedia.org/T384324) (owner: 10Filippo Giunchedi) [15:15:14] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:15:49] * 
Lucas_WMDE tries something else with https://w.wiki/Co2o [15:16:36] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10480082 (10jcrespo) @Neslihan_Turan_WMDE I wasn't able to find a developer account with that cn, dn or email. My guess is you sent your SUL (wiki) account, not your developer account,... [15:16:45] slightly improved query https://w.wiki/Co2s [15:17:00] yeah, sure, random municipality in Finland https://www.wikidata.org/wiki/Q51909 [15:17:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10480083 (10Jhancock.wm) hey, was out sick the last half of last week. got this from Dell: I understand the situation. Upon reviewing the details, I noticed that the disks ins... [15:17:40] yerdua_wmde: okay, new xhgui is here https://performance.wikimedia.org/xhgui/run/view?id=678fba7ca891a98639287cb7 [15:17:59] here, that looks better https://performance.wikimedia.org/xhgui/run/symbol?id=678fba7ca891a98639287cb7&symbol=WikibaseQuality%5CConstraintReport%5CConstraintCheck%5CChecker%5CFormatChecker%3A%3ArunRegexCheck [15:18:25] (03CR) 10David Caro: [C:03+2] toolforge::base: add cron to all boxes [puppet] - 10https://gerrit.wikimedia.org/r/1113128 (https://phabricator.wikimedia.org/T384250) (owner: 10David Caro) [15:18:25] so, 318 calls to runRegexCheck(), of which 313 went to runRegexCheckUsingShellbox() and 5 went to FormatCheckerHelper [15:18:34] !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2010.codfw.wmnet with reason: Server moving within rack [15:18:40] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10480097 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ffdb0a96-3214-40cf-acd0-ec05d4bf5539) set by jayme@cumin1002 for 2:00:00 on 1 host(s) and 
their servi... [15:18:58] so it’s at least doing something [15:19:11] even if the “hit rate” seems to be much lower than I hoped for [15:19:31] (and I also checked that there’s no difference between the result output with and without -H 'X-Wikimedia-Debug: backend=k8s-mwdebug') [15:19:41] if I'm reading it right, it did the check without shellbox 5 times [15:19:48] yeah [15:20:06] and one would be enough to prove that it succeeded in using the config values [15:20:25] yup [15:20:34] so we can continue the deployment for now [15:20:36] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10480114 (10MatthewVernon) Wait, didn't we buy this server and all of its drives spinning and SSD from Dell? And now they're saying they're all the wrong drives?!? [15:20:45] and then maybe look more into whether we want to allowlist more format strings [15:20:49] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, audreypenven: Continuing with sync [15:21:12] perhaps the format strings we configured are used on many different properties, but each of those properties is only used relatively rarely… [15:21:23] (03PS8) 10Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) [15:23:22] right. maybe it's worth adding more, or swapping out for format strings that are used more [15:23:37] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10480120 (10cmooney) Thanks @volans you have helped me a lot with this and given me confidence to look at the DB. I s... [15:23:40] yeah [15:24:00] might also be worth adding some statsd (or prometheus…) tracking for how often a regex is allowlisted vs. 
not [15:24:10] so we can see what the hit rate is overall [15:24:16] not just on some random cherrypicked items [15:24:21] makes sense [15:24:34] (then again, I guess the question is whether that’s still prioritized ^^) [15:24:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [15:25:19] and I'm assuming this is a problem for another window [15:25:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [15:27:26] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10480125 (10jcrespo) I believe this is something to be handled by #traffic at varnish level, more than a maps task. Is this something you handle (I am not familiar with the process) @Vgutierrez @... [15:27:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [15:27:45] yeah, definitely :) [15:27:51] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112261|Add known-good regexes for WikibaseQualityConstraints (T380751)]] (duration: 29m 44s) [15:27:55] T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - https://phabricator.wikimedia.org/T380751 [15:28:01] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10480128 (10jcrespo) p:05Triage→03High [15:28:01] yay [15:28:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [15:28:21] !log UTC afternoon backport+config window done [15:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:52] (03PS1) 10Jelto: gerrit: change blackbox checks to collaboration-services/task [puppet] - 10https://gerrit.wikimedia.org/r/1113163 [15:29:34] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps 
usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10480136 (10ssingh) Thanks @jcrespo; Traffic will take care of it. @MSantos: This requires your approval before we can continue. Thanks. [15:30:24] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10480150 (10WMDECyn) Approving this request from WMDE side [15:30:25] (03PS1) 10Jelto: Revert "gerrit: lower throttling threshold to 15 parallel connections" [puppet] - 10https://gerrit.wikimedia.org/r/1113164 [15:31:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:31:25] (03CR) 10Arnaudb: [C:03+1] Revert "gerrit: lower throttling threshold to 15 parallel connections" [puppet] - 10https://gerrit.wikimedia.org/r/1113164 (owner: 10Jelto) [15:31:58] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10480174 (10ssingh) a:03ssingh [15:32:49] (03CR) 10Arnaudb: [C:03+1] gerrit: change blackbox checks to collaboration-services/task [puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto) [15:34:53] 06SRE, 06Commons, 10MediaWiki-Uploading: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10480204 (10jcrespo) Hey, @Underbar_dk is that happening still? Please provide the data suggested at https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_iss... 
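Editor's note on the backport test discussed earlier in this window: the XHGui run showed 318 calls to runRegexCheck(), of which 313 went through Shellbox and 5 were handled without it. The sketch below is only an illustration of that dispatch and of the "hit rate" that was suggested as a metric. It is not the WikibaseQualityConstraints code: the allowlist entries, the `shellbox_stub` helper, the function names, and the use of full-string matching are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical allowlist of known-good patterns. The real list lives in the
# extension's configuration and is not reproduced here.
KNOWN_GOOD_REGEXES = frozenset({r"[1-9][0-9]{0,5}", r"[A-Z]{2}"})

# Counts how many checks took each path, similar in spirit to the
# statsd/prometheus tracking proposed in the channel.
calls = Counter()


def shellbox_stub(pattern: str, text: str) -> bool:
    """Stand-in for the sandboxed Shellbox call; it just matches locally."""
    return re.fullmatch(pattern, text) is not None


def run_regex_check(pattern: str, text: str) -> bool:
    """Check text against pattern, recording which path handled it."""
    if pattern in KNOWN_GOOD_REGEXES:
        calls["local"] += 1
        # Assumption: the whole value has to match the pattern.
        return re.fullmatch(pattern, text) is not None
    calls["shellbox"] += 1
    return shellbox_stub(pattern, text)


def hit_rate(local: int, shellbox: int) -> float:
    """Fraction of checks that avoided Shellbox (0.0 if there were none)."""
    total = local + shellbox
    return local / total if total else 0.0


if __name__ == "__main__":
    # Call counts reported from the XHGui run for Q51909.
    print(f"hit rate: {hit_rate(5, 313):.2%}")
```

With the counts from the log, 5 of 318 checks avoided Shellbox, which is why the hit rate looked lower than hoped. A pattern such as `[1-9][0-9]{0,6}`, which was reported as not quite in the list, would take the Shellbox path in this sketch.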
[15:35:19] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10480208 (10cmooney) It also appears we are getting values populated for AcceptedPrefixes for IPv6 peers for some devi... [15:37:43] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2019 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113106 (owner: 10Muehlenhoff) [15:38:19] (03CR) 10Ottomata: "Added a couple of comments!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [15:40:11] RECOVERY - Host restbase2037 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms [15:41:25] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:42:30] 06SRE, 10DNS, 06Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480253 (10jcrespo) I believe authentication on Bluesky happens through DNS. Adding #DNS and #Traffic for awareness. I can handle this, as we did it to authenticate the search engines consoles. @LPasqual... 
[15:43:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:17] 06SRE, 10DNS, 06Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480259 (10jcrespo) p:05Triage→03Medium [15:44:20] 06SRE, 10DNS, 06Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480260 (10jcrespo) a:03jcrespo [15:44:23] FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:51] (03CR) 10Jelto: [C:03+2] Revert "gerrit: lower throttling threshold to 15 parallel connections" [puppet] - 10https://gerrit.wikimedia.org/r/1113164 (owner: 10Jelto) [15:47:18] FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:47:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10480265 (10BCornwall) Yeah, this isn't an acceptable answer. They need to be more specific, I'm smelling their vagueness comes from not wanting to spend time/money. 
[15:47:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2019.codfw.wmnet with OS bookworm [15:47:36] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10480267 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2019.codfw.wmnet with OS bookworm [15:48:27] (03PS1) 10Muehlenhoff: sre.hosts.reimage: Skip the vlan migration reminder for ganeti nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1113167 [15:51:05] (03PS2) 10Brouberol: airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620) [15:51:31] RESOLVED: Not accepting/receiving prefixes from anycast BGP peer: Device cr4-ulsfo.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [15:51:49] (03PS3) 10Brouberol: airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620) [15:52:02] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2004-dev [15:52:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2004-dev [15:52:18] RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-main2010 [15:52:26] (03CR) 10Btullis: [C:03+1] airflow: enable the injection of 
custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [15:52:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-main2010 [15:53:29] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:53:54] (03CR) 10DCausse: [C:03+2] search: add alerts for weighted_tags indexing throughput (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1111300 (https://phabricator.wikimedia.org/T373459) (owner: 10DCausse) [15:55:11] (03Merged) 10jenkins-bot: search: add alerts for weighted_tags indexing throughput [alerts] - 10https://gerrit.wikimedia.org/r/1111300 (https://phabricator.wikimedia.org/T373459) (owner: 10DCausse) [15:55:25] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1470-1475].eqiad.wmnet [15:56:16] (03CR) 10Btullis: [C:03+1] airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [15:57:14] (03CR) 10Brouberol: [C:03+2] airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [15:57:15] (03CR) 10Jelto: [C:03+1] "lgtm 🍿" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:58:48] FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:58:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes 
(http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:00:04] jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1600). [16:00:08] 10ops-codfw, 06SRE, 10Cassandra, 06DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10480331 (10Eevans) p:05Medium→03High >>! In T383820#10479192, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/PqHEiJQBKFqum... [16:00:37] (03CR) 10Clément Goubert: php8.1: introduce JIT (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113138 (https://phabricator.wikimedia.org/T384294) (owner: 10Effie Mouzeli) [16:00:53] 06SRE, 10DNS, 06Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480336 (10LPasqual_WMF) Thank you for such a quick reply, @jcrespo. Here's the info you requested: Host: _atproto Type: TXT Value: did=did:plc:plla3i7zproko3ekdnkoykhe And a screenshot, just in case: {... 
[16:00:54] (03CR) 10JMeybohm: [C:03+2] admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [16:01:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:01:12] (03CR) 10Xcollazo: [V:03+1 C:03+1] "Verified content is indeed as revised on this patch." [puppet] - 10https://gerrit.wikimedia.org/r/1112123 (owner: 10Pppery) [16:02:04] (03CR) 10Mvolz: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113137 (https://phabricator.wikimedia.org/T384165) (owner: 10Mvolz) [16:03:05] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main2010.codfw.wmnet [16:03:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2010.codfw.wmnet [16:03:25] 06SRE, 06Commons, 10MediaWiki-Uploading: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10480346 (10Underbar_dk) Yes. This is still happening on my desktop. I am finding that this issue is more likely to trigger when I try to upload multiple files at on... [16:03:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [16:03:44] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10480348 (10JMeybohm) >>! In T381788#10480097, @ops-monitoring-bot wrote: > Icinga downtime and Alertmanager silence (ID=ffdb0a96-3214-40cf-acd0-ec05d4bf5539) set by jayme@cumin1... 
[16:03:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:03:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1470-1475].eqiad.wmnet [16:03:59] (03CR) 10Kamila Součková: [C:03+2] wikikube: rename mw147[0-5] -> wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1112828 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [16:04:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [16:04:34] (03PS1) 10Jcrespo: wikipedia.org: Add AT Protocol/Bluesky verification [dns] - 10https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) [16:04:50] (03Merged) 10jenkins-bot: admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [16:05:31] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:47] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1470 to wikikube-worker1123 [16:06:06] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:09:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, 
AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv [16:09:12] e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:14] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv [16:09:14] e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:15] (03CR) 10Ssingh: wikipedia.org: Add AT Protocol/Bluesky verification (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) (owner: 10Jcrespo) [16:09:50] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1470 to wikikube-worker1123 - kamila@cumin1002" [16:10:05] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1471 to wikikube-worker1124 [16:10:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1470 to wikikube-worker1123 - kamila@cumin1002" [16:10:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:10:07] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1123 [16:10:10] (03CR) 10Jcrespo: wikipedia.org: Add AT Protocol/Bluesky verification (031 comment) [dns] - 
10https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) (owner: 10Jcrespo) [16:10:25] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:10:45] (03PS2) 10Jcrespo: wikipedia.org: Add AT Protocol/Bluesky verification [dns] - 10https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) [16:10:58] !log installing gstreamer1.0 security updates [16:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:09] (03CR) 10Jcrespo: wikipedia.org: Add AT Protocol/Bluesky verification (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) (owner: 10Jcrespo) [16:11:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1123 [16:11:26] (03CR) 10Ssingh: [C:03+1] "Thanks for creating the patch!" [dns] - 10https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) (owner: 10Jcrespo) [16:11:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1470 to wikikube-worker1123 [16:12:21] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113167 (owner: 10Muehlenhoff) [16:12:52] (03CR) 10Jcrespo: [C:03+2] wikipedia.org: Add AT Protocol/Bluesky verification [dns] - 10https://gerrit.wikimedia.org/r/1113170 (https://phabricator.wikimedia.org/T384332) (owner: 10Jcrespo) [16:13:13] (03PS1) 10Brouberol: airflow: add missing airflow.worker.extra-config-volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113172 (https://phabricator.wikimedia.org/T380619) [16:13:26] !log jynus@dns1004 START - running authdns-update [16:14:01] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1471 to wikikube-worker1124 - kamila@cumin1002" [16:14:07] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: 
LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10480398 (10cmooney) Running the poller manually on netmon1003 I can also see it's getting the right value back, but i... [16:14:12] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1472 to wikikube-worker1125 [16:14:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1471 to wikikube-worker1124 - kamila@cumin1002" [16:14:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:18] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1124 [16:14:20] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:14:23] FIRING: [5x] ProbeDown: Service restbase2037-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:15:18] !log jynus@dns1004 END - running authdns-update [16:15:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1124 [16:15:42] (03PS2) 10Brouberol: airflow: add missing airflow.worker.extra-config-volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113172 (https://phabricator.wikimedia.org/T380619) [16:16:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1471 to wikikube-worker1124 [16:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:18] FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes 
(tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:17:21] (03CR) 10Brouberol: [C:03+2] airflow: add missing airflow.worker.extra-config-volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113172 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [16:18:18] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1472 to wikikube-worker1125 - kamila@cumin1002" [16:18:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on mw1473:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:18:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1472 to wikikube-worker1125 - kamila@cumin1002" [16:18:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:18:57] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1125 [16:19:10] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1473 to wikikube-worker1126 [16:19:30] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480414 (10jcrespo) @LPasqual_WMF The deploy for `@wikipedia.org` should already be working, but don't be surprised if you get an error (there could be ~5 minutes of cache), if it ha... 
[16:19:30] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:19:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2019.codfw.wmnet with reason: host reimage [16:19:42] !log power down ms-be2088 for maintenance [16:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [16:20:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1125 [16:20:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [16:20:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1472 to wikikube-worker1125 [16:22:02] PROBLEM - Host ms-be2088 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:13] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1473 to wikikube-worker1126 - kamila@cumin1002" [16:23:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2019.codfw.wmnet with reason: host reimage [16:23:26] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1474 to wikikube-worker1127 [16:23:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1473 to wikikube-worker1126 - kamila@cumin1002" [16:23:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:29] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1126 [16:23:40] FIRING: KubernetesRsyslogDown: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:23:46] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:24:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1126 [16:25:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1473 to wikikube-worker1126 [16:27:18] RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:27:56] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1474 to wikikube-worker1127 - kamila@cumin1002" [16:28:28] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1475 to wikikube-worker1128 [16:28:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1474 to wikikube-worker1127 - kamila@cumin1002" [16:28:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:28:31] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1127 [16:28:50] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:29:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1127 [16:30:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1474 to wikikube-worker1127 [16:31:19] (03CR) 10Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) 
(owner: 10Andrea Denisse) [16:31:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: 10ZhaoFJx) [16:32:31] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1475 to wikikube-worker1128 - kamila@cumin1002" [16:32:41] (03PS6) 10Andrea Denisse: wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) [16:33:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1475 to wikikube-worker1128 - kamila@cumin1002" [16:33:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:00] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1128 [16:33:05] 06SRE, 10DNS, 06Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480451 (10LPasqual_WMF) @jcrespo Happy to say it is already working! [[ https://bsky.app/profile/wikipedia.org | @wikipedia.org ]] is live. Thanks, Jaime and team. I'll follow up with a separate ticket... 
[16:33:16] (03CR) 10Andrea Denisse: wmcs: Migrate iowait stalling alerts to the alerts.git repository (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [16:34:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1128 [16:34:14] 10ops-codfw, 06SRE, 10Cassandra, 06DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10480460 (10Jhancock.wm) swapped B1 to A1. gotta let run and see if it crashes again. might not. sometimes that's all it needs. (Thanks for your patience, i was unexpectedly out the last half of la... [16:34:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1475 to wikikube-worker1128 [16:34:54] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase2037.codfw.wmnet [16:34:54] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2037.codfw.wmnet [16:35:09] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1123.eqiad.wmnet wikikube-worker1124.eqiad.wmnet wikikube-worker1125.eqiad.wmnet wikikube-worker1126.eqiad.wmnet wikikube-worker1127.eqiad.wmnet wikikube-worker1128.eqiad.wmnet on all recursors [16:35:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1123.eqiad.wmnet wikikube-worker1124.eqiad.wmnet wikikube-worker1125.eqiad.wmnet wikikube-worker1126.eqiad.wmnet wikikube-worker1127.eqiad.wmnet wikikube-worker1128.eqiad.wmnet on all recursors [16:35:57] (03CR) 10BCornwall: [C:03+1] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1113105 (https://phabricator.wikimedia.org/T384287) (owner: 10Gerrit maintenance bot) [16:36:10] 10ops-codfw, 06SRE, 10Cassandra, 06DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10480474 (10Eevans) >>! 
In T383820#10480460, @Jhancock.wm wrote: > swapped B1 to A1. gotta let run and see if it crashes again. might not. sometimes that's all it needs. (Thanks for your patience,... [16:36:17] 06SRE, 10DNS, 06Traffic: Verify Wikipedia's Bluesky account - https://phabricator.wikimedia.org/T384332#10480475 (10jcrespo) 05Open→03Resolved [16:38:02] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [16:39:57] 06SRE, 06Traffic, 10Data-Engineering (Q3 2024 January 1st - March 31th), 13Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10480486 (10Ahoelzl) [16:40:33] (03CR) 10BCornwall: [C:03+1] systemd: added option to remain after exit [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [16:41:08] (03PS1) 10Brouberol: Revert "global_config: add the IP of the dyna proxy" [puppet] - 10https://gerrit.wikimedia.org/r/1113176 (https://phabricator.wikimedia.org/T380619) [16:42:21] (03CR) 10BCornwall: [C:03+1] profile::acme_chief: Use Acme_chief::Account type [puppet] - 10https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [16:43:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2019.codfw.wmnet with OS bookworm [16:43:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10480502 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2019.codfw.wmnet with OS bookworm completed: - ganeti201... 
[16:44:02] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1123.eqiad.wmnet with OS bookworm [16:44:05] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1123 [16:44:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1123 [16:44:07] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1124.eqiad.wmnet with OS bookworm [16:44:10] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1124 [16:44:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1124 [16:44:15] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1125.eqiad.wmnet with OS bookworm [16:44:19] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1125 [16:44:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1125 [16:44:21] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1126.eqiad.wmnet with OS bookworm [16:44:25] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1126 [16:44:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1126 [16:44:27] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1127.eqiad.wmnet with OS bookworm [16:44:30] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1127 [16:44:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1127 [16:44:38] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1128.eqiad.wmnet with OS bookworm [16:44:42] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1128 [16:44:42] !log kamila@cumin1002 END (PASS) - Cookbook 
sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1128 [16:44:44] RECOVERY - Host ms-be2088 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms [16:44:45] (03CR) 10Brouberol: [C:03+2] Revert "global_config: add the IP of the dyna proxy" [puppet] - 10https://gerrit.wikimedia.org/r/1113176 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [16:44:58] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10480510 (10phaultfinder) [16:46:36] (03CR) 10Vgutierrez: [V:03+1 C:03+2] profile::acme_chief: Use Acme_chief::Account type [puppet] - 10https://gerrit.wikimedia.org/r/1113154 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [16:54:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:54:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:55:03] (03PS1) 10Btullis: Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113177 [16:55:03] (03CR) 10BCornwall: [C:03+1] hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [16:55:10] (03CR) 10CI reject: [V:04-1] Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113177 (owner: 10Btullis) [16:58:24] (03PS2) 10Btullis: Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113177 [16:59:58] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1123.eqiad.wmnet with reason: host reimage [17:00:03] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [17:00:05] jhathaway and rzl: Time to do 
the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:10] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1125.eqiad.wmnet with reason: host reimage [17:00:13] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1124.eqiad.wmnet with reason: host reimage [17:00:19] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1127.eqiad.wmnet with reason: host reimage [17:00:31] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1128.eqiad.wmnet with reason: host reimage [17:02:06] (03CR) 10Brouberol: [C:03+1] Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113177 (owner: 10Btullis) [17:02:10] (03CR) 10Btullis: [C:03+2] Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113177 (owner: 10Btullis) [17:03:13] (03Merged) 10jenkins-bot: Revert "airflow-analytics: migrate scheduler and database to Kubernetes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113177 (owner: 10Btullis) [17:03:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1123.eqiad.wmnet with reason: host reimage [17:04:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [17:05:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [17:06:43] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10480665 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm This has been 
completed. Thank you for your help! [17:06:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1124.eqiad.wmnet with reason: host reimage [17:08:06] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480676 (10Jhancock.wm) [17:08:08] (03PS1) 10Hnowlan: trafficserver: reoute testwiki citoid calls to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1113178 (https://phabricator.wikimedia.org/T361576) [17:08:29] (03PS1) 10Btullis: Revert "Temporarily disable gobblin timers on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/1113179 [17:09:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1125.eqiad.wmnet with reason: host reimage [17:09:35] 06SRE, 06Infrastructure-Foundations, 10netops: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345 (10cmooney) 03NEW p:05Triage→03Medium [17:10:38] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480706 (10Jhancock.wm) [17:10:56] (03CR) 10Btullis: [C:03+2] Revert "Temporarily disable gobblin timers on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/1113179 (owner: 10Btullis) [17:11:09] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480707 (10Jhancock.wm) [17:12:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1128.eqiad.wmnet with reason: host reimage [17:15:21] (03PS1) 10Hnowlan: trafficserver: route citoid via rest-gateway for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1113182 
(https://phabricator.wikimedia.org/T361576) [17:16:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1127.eqiad.wmnet with reason: host reimage [17:17:45] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480734 (10Jhancock.wm) [17:20:38] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10480754 (10fnegri) @RobH do you think that this can be done in the next one/two weeks? We need these servers to... [17:23:25] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480785 (10Papaul) @Jelto when do you think will be a best time for you or someone in your team to help us relocate some of those mw a... [17:24:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1123.eqiad.wmnet with OS bookworm [17:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10480793 (10phaultfinder) [17:24:51] (03PS1) 10Federico Ceratto: site.pp, db2133.yaml: Remove db2133 [puppet] - 10https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343) [17:26:16] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 06serviceops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480800 (10Jelto) >>! In T383709#10480784, @Papaul wrote: > @Jelto when do you think will be a best time for you or so... 
[17:26:26] (03CR) 10Marostegui: site.pp, db2133.yaml: Remove db2133 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343) (owner: 10Federico Ceratto) [17:28:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1124.eqiad.wmnet with OS bookworm [17:29:12] (03PS1) 10DCausse: wdqs: bump image to 0.3.153 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113184 (https://phabricator.wikimedia.org/T374919) [17:29:39] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10480813 (10RobH) >>! In T382412#10480754, @fnegri wrote: > @RobH do you think that this can be done in the next... [17:30:25] FIRING: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:30:34] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 06serviceops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480819 (10Papaul) @Jelto thanks please let us know when best works for you for the gerrit2002. Thanks [17:32:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1125.eqiad.wmnet with OS bookworm [17:33:11] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 10decommission-hardware: decommission mw2282.codfw.wmnet - https://phabricator.wikimedia.org/T384226#10480828 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:33:19] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10480831 (10thcipriani) >>! 
In T384018#10477272, @jcrespo wrote: > To try to speed up confirmations, 'restricted' is documented at data.yml to require @thcipriani approval. So asking... [17:34:50] (03CR) 10Vgutierrez: [C:03+2] hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [17:35:25] RESOLVED: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:35:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1128.eqiad.wmnet with OS bookworm [17:35:38] (03CR) 10DCausse: [C:03+2] wdqs: bump image to 0.3.153 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113184 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [17:36:42] (03Merged) 10jenkins-bot: wdqs: bump image to 0.3.153 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113184 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [17:37:53] 06SRE, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10480844 (10cmooney) >>! In T384288#10479894, @RobH wrote: > I'm assuming we need to schedule it, and we should give them a couple days notice if we want a set sched... 
[17:38:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1127.eqiad.wmnet with OS bookworm [17:38:56] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns2004.wikimedia.org [reason: T383709] [17:39:40] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:39:54] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [17:41:58] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1126.eqiad.wmnet with OS bookworm [17:42:36] !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2004.wikimedia.org with reason: T383709 [17:42:46] (03PS1) 10Vgutierrez: acme_chief: Allow specifying an account per certificate [puppet] - 10https://gerrit.wikimedia.org/r/1113187 (https://phabricator.wikimedia.org/T384195) [17:43:14] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 06serviceops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480855 (10JMeybohm) mw2259 and mw2278 are to be decommed (T354791, T384043) mw2355 is now wikikube-worker2229 (T383862... 
[17:46:49] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:46:50] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:47:52] ^ expected [17:52:57] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns2004 [17:53:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns2004 [17:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10480884 (10phaultfinder) [17:56:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112101 (https://phabricator.wikimedia.org/T383942) (owner: 10Jdlrobson) [17:56:52] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10480892 (10jcrespo) Is there anything else to do here (are there any concerns left?), other than fixing documenta... 
[17:58:01] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113187 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [17:58:36] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns2004.wikimedia.org [17:58:37] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2004.wikimedia.org [17:58:44] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 06serviceops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10480893 (10Jhancock.wm) [17:58:47] PROBLEM - Bird Internet Routing Daemon on dns2004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:58:49] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns2004 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:59:19] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#10480904 (10jcrespo) I hope the tagging is ok, as you are doing the work. Let me know if I can help with some reviews. 
[17:59:41] RECOVERY - Bird Internet Routing Daemon on dns2004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:59:41] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns2004 is OK: OK: UP (pid=2955) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:59:55] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:59:55] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:00:05] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1800). [18:00:08] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10480907 (10Ottomata) Olja approved, so no concerns left. Just needs to be implemented by fixing docs, etc. Than... 
[18:00:16] o/ [18:00:17] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1126.eqiad.wmnet with OS bookworm [18:00:21] I'll get started shortly [18:00:21] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1126 [18:00:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1126 [18:00:33] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10480909 (10jcrespo) a:03jcrespo [18:00:36] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10480910 (10jcrespo) p:05Triage→03Medium [18:02:42] !log disabling puppet on A:cp-text ahead of ATS mapping change - T377042 [18:03:02] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns2004.wikimedia.org [reason: T383709] [18:03:30] !log sukhe@dns1004 START - running authdns-update [18:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10480929 (10phaultfinder) [18:05:11] !log sukhe@dns1004 END - running authdns-update [18:06:08] (03CR) 10Scott French: [C:03+2] trafficserver: add mw-php-migration to mapping_rules [puppet] - 10https://gerrit.wikimedia.org/r/1082581 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [18:12:12] 06SRE: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350 (10LPasqual_WMF) 03NEW [18:12:39] 06SRE: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10480967 (10LPasqual_WMF) I am pasting below the DNS information, with a screenshot: Host: _atproto Type: TXT Value: did=did:plc:vwdzejaw4wkxh2wvkjlcubal {F58240567} [18:13:38] 06SRE: Verify the Foundation's Bluesky 
account - https://phabricator.wikimedia.org/T384350#10480971 (10ssingh) Hi @LPasqual_WMF: confirming that this an additional request for @wikimediafoundation.org, in addition to @wikipedia.org? [18:14:36] 06SRE: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10480982 (10LPasqual_WMF) @ssingh Hi, that's correct! [18:16:19] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1126.eqiad.wmnet with reason: host reimage [18:17:24] (03PS1) 10Ssingh: wikimediafoundation.org: add TXT record for Bluesky verification [dns] - 10https://gerrit.wikimedia.org/r/1113191 (https://phabricator.wikimedia.org/T384350) [18:19:15] (03CR) 10Jcrespo: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1113191 (https://phabricator.wikimedia.org/T384350) (owner: 10Ssingh) [18:19:50] (03CR) 10Ssingh: [V:03+2 C:03+2] "Thanks for the review Jaime." [dns] - 10https://gerrit.wikimedia.org/r/1113191 (https://phabricator.wikimedia.org/T384350) (owner: 10Ssingh) [18:19:52] !log validated routing behavior on cp4040 (applied at 18:10 UTC) - T377042 [18:19:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1126.eqiad.wmnet with reason: host reimage [18:20:11] !log sukhe@dns1004 START - running authdns-update [18:20:44] 06SRE, 10DNS, 13Patch-For-Review: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10481063 (10jcrespo) [18:21:07] 06SRE, 10DNS, 13Patch-For-Review: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10481077 (10jcrespo) p:05Triage→03Medium a:03ssingh [18:21:59] !log sukhe@dns1004 END - running authdns-update [18:23:07] 06SRE, 10DNS, 13Patch-For-Review: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10481147 (10ssingh) ` $ dig _atproto.wikimediafoundation.org TXT +short "did=did:plc:vwdzejaw4wkxh2wvkjlcubal" ` @LPasqual_WMF : Please try verifying 
now. [18:23:36] !log started incrementally running puppet on A:cp-text for ATS mapping change - T377042 [18:24:23] 06SRE, 10DNS, 13Patch-For-Review: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10481149 (10LPasqual_WMF) @ssingh Confirming it worked. Thank you so much for taking care of this so quickly! [18:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481151 (10phaultfinder) [18:25:18] 06SRE, 10DNS, 13Patch-For-Review: Verify the Foundation's Bluesky account - https://phabricator.wikimedia.org/T384350#10481156 (10ssingh) 05Open→03Resolved [18:27:03] !log disable-pupept on netflow7001 to test gnmic bgp endpoint [18:27:16] (03CR) 10BCornwall: [C:03+1] acme_chief: Allow specifying an account per certificate [puppet] - 10https://gerrit.wikimedia.org/r/1113187 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [18:28:23] !log restarting stashbot [18:33:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:35:42] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7001.magru.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [18:35:44] ^^ sry this was me forgetting to downtime [18:35:48] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10481199 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d0f01fc7-5a29-49c5-8292-aebad021ff73) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th... 
[18:38:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1126.eqiad.wmnet with OS bookworm [18:39:08] (03PS1) 10Clare Ming: Fix schema version for CTR instrument [extensions/WikimediaEvents] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113192 (https://phabricator.wikimedia.org/T384333) [18:39:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113192 (https://phabricator.wikimedia.org/T384333) (owner: 10Clare Ming) [18:40:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10481232 (10kamila) [18:44:06] (03CR) 10Kamila Součková: [C:03+1] trafficserver: reoute testwiki citoid calls to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1113178 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [18:45:57] (03PS1) 10Raymond Ndibe: [wmcs::kubeadm::core] remove kubeadm-flags.env [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) [18:49:34] !log finished running puppet on A:cp-text for ATS mapping change - T377042 [18:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:38] T377042: Support cookie-driven fractional migration to PHP 8.1 deployments of mw-web and mw-api-ext - https://phabricator.wikimedia.org/T377042 [18:53:12] (03CR) 10AOkoth: miscweb: support os-reports deployment (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [18:53:36] (03PS7) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [18:53:51] (03PS8) 
10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [18:54:43] (03CR) 10Scott French: [C:03+1] trafficserver: reoute testwiki citoid calls to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1113178 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [18:55:08] (03CR) 10CI reject: [V:04-1] miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [18:57:46] (03PS1) 10Brouberol: airflow: DRY extra volume mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113198 (https://phabricator.wikimedia.org/T380619) [18:58:25] (03PS2) 10Brouberol: airflow: DRY extra volume mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113198 (https://phabricator.wikimedia.org/T380619) [19:00:05] brennen and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T1900) [19:03:13] (03PS9) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [19:03:22] (03PS10) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [19:05:21] o/ [19:13:20] (03CR) 10Cwhite: "Hey folks, would you be willing to check this change set for accuracy and completeness? Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105972 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [19:14:01] (03CR) 10Kevin Bazira: "thank you for the comments and sharing the documentation." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[19:14:20] (PS1) TrainBranchBot: group0 to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113200 (https://phabricator.wikimedia.org/T382364)
[19:14:22] (CR) TrainBranchBot: [C:+2] group0 to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113200 (https://phabricator.wikimedia.org/T382364) (owner: TrainBranchBot)
[19:15:08] (Merged) jenkins-bot: group0 to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113200 (https://phabricator.wikimedia.org/T382364) (owner: TrainBranchBot)
[19:18:01] (CR) Ottomata: EventStreamConfig: Add mediawiki.article_country_prediction_change stream (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[19:21:55] (CR) Santiago Faci: [C:+1] "Looks good!" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113192 (https://phabricator.wikimedia.org/T384333) (owner: Clare Ming)
[19:23:55] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10481483 (KFrancis) Please provide Neslihan's WMDE email address. Thanks!
[19:25:00] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[19:25:20] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[19:25:23] dcausse@deploy2002: Failed to log message to wiki. Somebody should check the error logs.
[19:26:55] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.13 refs T382364
[19:26:59] T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364
[19:27:36] (CR) Scott French: "Thank you both for the reviews!
FYI, since the routing component of this now live, I'll move forward with deploying this tomorrow." [mediawiki-config] - https://gerrit.wikimedia.org/r/1080388 (https://phabricator.wikimedia.org/T377042) (owner: Scott French)
[19:31:52] !log jebe@deploy2002 Started deploy [airflow-dags/analytics_product@0aa9d7c]: (no justification provided)
[19:32:25] !log jebe@deploy2002 Finished deploy [airflow-dags/analytics_product@0aa9d7c]: (no justification provided) (duration: 00m 35s)
[19:40:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481542 (phaultfinder)
[19:43:43] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7001.magru.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[19:43:52] SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10481570 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=26b7dbb9-1906-4b10-a433-cc2ffb6bdb61) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[19:44:57] (CR) Andrea Denisse: [C:+2] wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: Andrea Denisse)
[19:45:24] (CR) Andrea Denisse: [C:+2] "Merging as it was already approved by @dcaro@wikimedia.org, I just removed leftover comments."
[alerts] - https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: Andrea Denisse)
[19:45:28] (CR) Andrea Denisse: [V:+2 C:+2] wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: Andrea Denisse)
[19:45:49] (CR) Andrea Denisse: [C:+2] "Merging as it was already approved by @dcaro@wikimedia.org, I just removed leftover comments." [alerts] - https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: Andrea Denisse)
[19:46:08] (CR) Andrea Denisse: [V:+2 C:+2] wmcs: Migrate network saturation alerts to the alerts.git repository (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: Andrea Denisse)
[19:46:30] (CR) Andrea Denisse: [C:+2] wmcs: Remove Puppet files for migrated Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/1111340 (https://phabricator.wikimedia.org/T328502) (owner: Andrea Denisse)
[19:54:40] (PS2) CDanis: draft: allow k8s NodeJS apps to opt-in to auto-ECS [puppet] - https://gerrit.wikimedia.org/r/1112295
[19:58:48] FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:59:12] (CR) Andrew Bogott: [C:-2] Remove nutcracker from cloudweb hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: Majavah)
[19:59:41] (Abandoned) Andrew Bogott: Remove nutcracker from cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: Majavah)
[20:01:17] (CR) Andrew Bogott: [C:+2] backy2: on Bullseye, hack around a silly package name mismatch (1
comment) [puppet] - https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) (owner: Andrew Bogott)
[20:01:52] (CR) Andrew Bogott: [C:+2] Add ceph config for cloudcephosd103[5-8] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
[20:02:22] (CR) CDanis: [C:+1] thanos: further reduce trace sampling [puppet] - https://gerrit.wikimedia.org/r/1112700 (https://phabricator.wikimedia.org/T378190) (owner: Filippo Giunchedi)
[20:05:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:06:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481628 (phaultfinder)
[20:11:24] PROBLEM - MariaDB Replica SQL: s2 #page on db2175 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki.
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:11:31] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:12:38] !incidents
[20:12:39] 5624 (UNACKED) db2175 (paged)/MariaDB Replica SQL: s2 (paged)
[20:12:39] 5623 (RESOLVED) Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[20:12:39] 5622 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out)
[20:12:39] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org)
[20:12:40] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[20:12:40] 5619 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[20:12:52] !ack 5624
[20:12:52] 5624 (ACKED) db2175 (paged)/MariaDB Replica SQL: s2 (paged)
[20:14:14] I have no access to pc right now. Can you depool it until I get back?
[20:14:21] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2175&from=now-30m&to=now
[20:14:45] !log herron@cumin1002 dbctl commit (dc=all): 'depool db2175', diff saved to https://phabricator.wikimedia.org/P72208 and previous config saved to /var/cache/conftool/dbconfig/20250121-201444-herron.json
[20:14:55] Amir1: you bet, just did
[20:15:03] I am fixing it
[20:15:11] Should be fixed now
[20:15:21] But let's leave it depooled so I can upgrade it tomorrow
[20:15:23] Oh thank you both!
[20:15:24] RECOVERY - MariaDB Replica SQL: s2 #page on db2175 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:15:29] thanks marostegui ok sounds good
[20:15:33] Thanks herron for the depool
[20:15:37] np!
[20:16:09] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:16:19] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:21:00] (CR) Ssingh: "Abandoning this because we are also updating eqiad in here, which is not required. Doing a new patch to make review easier, comparing agai" [dns] - https://gerrit.wikimedia.org/r/1101908 (https://phabricator.wikimedia.org/T380858) (owner: CDobbins)
[20:21:04] (Abandoned) Ssingh: Remove eqiad from public and private IP spaces [dns] - https://gerrit.wikimedia.org/r/1101908 (https://phabricator.wikimedia.org/T380858) (owner: CDobbins)
[20:21:40] (PS1) Ssingh: geo-maps: put eqiad at lowest priority for T380858 [dns] - https://gerrit.wikimedia.org/r/1113205 (https://phabricator.wikimedia.org/T380858)
[20:23:05] (CR) Ssingh: "For reviewers: the idea is to ensure that eqiad is lowest priority for non-eqiad DCs."
[dns] - https://gerrit.wikimedia.org/r/1113205 (https://phabricator.wikimedia.org/T380858) (owner: Ssingh)
[20:23:54] (CR) Herron: [C:+1] "good call" [puppet] - https://gerrit.wikimedia.org/r/1112700 (https://phabricator.wikimedia.org/T378190) (owner: Filippo Giunchedi)
[20:35:45] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[20:35:47] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[20:42:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481696 (phaultfinder)
[20:47:25] (CR) GergesShamon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/222255 (owner: Matanya)
[21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T2100).
[21:00:05] ZhaoFJx, Jdlrobson, and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:23] o/
[21:01:27] i can deploy
[21:02:50] o/
[21:05:05] ZhaoFJx: are you around?
[21:05:19] if not, i can start with your patch Jdlrobson
[21:06:09] (PS2) Jdlrobson: Enable Vector 2022 and dark mode on Azerbaijani wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1112101 (https://phabricator.wikimedia.org/T383942)
[21:07:26] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1112101 (https://phabricator.wikimedia.org/T383942) (owner: Jdlrobson)
[21:08:09] (Merged) jenkins-bot: Enable Vector 2022 and dark mode on Azerbaijani wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1112101 (https://phabricator.wikimedia.org/T383942) (owner: Jdlrobson)
[21:08:39] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1112101|Enable Vector 2022 and dark mode on Azerbaijani wikis (T383942)]]
[21:08:44] T383942: Jan 20, 2025: Vector 2022 and dark mode deployments - https://phabricator.wikimedia.org/T383942
[21:10:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481804 (phaultfinder)
[21:10:49] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:14:11] Jdlrobson: up on test servers if you want to check
[21:14:52] !log cjming@deploy2002 cjming, jdlrobson: Backport for [[gerrit:1112101|Enable Vector 2022 and dark mode on Azerbaijani wikis (T383942)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:14:56] T383942: Jan 20, 2025: Vector 2022 and dark mode deployments - https://phabricator.wikimedia.org/T383942
[21:16:04] cjming: on it
[21:16:38] LGTM cjming
[21:16:47] great!
[21:16:52] !log cjming@deploy2002 cjming, jdlrobson: Continuing with sync
[21:21:46] Sorry I think I am kind of late
[21:21:53] Is the deployment still ongoing?
[21:22:08] hi ZhaoFJx - yes i can do your patch next
[21:22:24] Thank you cjming
[21:23:16] np!
[21:23:42] thanks cjming ! Looks good!
[21:23:46] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112101|Enable Vector 2022 and dark mode on Azerbaijani wikis (T383942)]] (duration: 15m 06s)
[21:23:50] (PS3) ZhaoFJx: cawiki: Create templateeditor & protection level [mediawiki-config] - https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145)
[21:23:51] T383942: Jan 20, 2025: Vector 2022 and dark mode deployments - https://phabricator.wikimedia.org/T383942
[21:24:15] Jdlrobson: yay! should be live
[21:24:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:24:43] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: ZhaoFJx)
[21:25:26] (Merged) jenkins-bot: cawiki: Create templateeditor & protection level [mediawiki-config] - https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) (owner: ZhaoFJx)
[21:25:55] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1112838|cawiki: Create templateeditor & protection level (T384145)]]
[21:26:00] T384145: Create template editor user group and protection level in cawiki - https://phabricator.wikimedia.org/T384145
[21:28:14] (CR) Scott French: php8.1: introduce JIT (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113138 (https://phabricator.wikimedia.org/T384294) (owner: Effie Mouzeli)
[21:31:23] ZhaoFJx: on test servers if you'd like to test - lmk if/when to sync
[21:31:58] sure
[21:32:04] !log
cjming@deploy2002 zhaofjx, cjming: Backport for [[gerrit:1112838|cawiki: Create templateeditor & protection level (T384145)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:32:08] T384145: Create template editor user group and protection level in cawiki - https://phabricator.wikimedia.org/T384145
[21:32:10] all good in https://ca.wikipedia.org/wiki/Especial:Drets_dels_grups_d%27usuaris
[21:32:16] nice
[21:32:19] !log cjming@deploy2002 zhaofjx, cjming: Continuing with sync
[21:34:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10481941 (phaultfinder)
[21:35:37] (CR) BCornwall: [C:+1] geo-maps: put eqiad at lowest priority for T380858 [dns] - https://gerrit.wikimedia.org/r/1113205 (https://phabricator.wikimedia.org/T380858) (owner: Ssingh)
[21:39:20] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112838|cawiki: Create templateeditor & protection level (T384145)]] (duration: 13m 24s)
[21:39:24] T384145: Create template editor user group and protection level in cawiki - https://phabricator.wikimedia.org/T384145
[21:40:14] ZhaoFJx: should be live :)
[21:40:34] Checked, thank you!
[21:40:37] Have a good one
[21:40:37] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113192 (https://phabricator.wikimedia.org/T384333) (owner: Clare Ming)
[21:46:25] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:49:51] (Merged) jenkins-bot: Fix schema version for CTR instrument [extensions/WikimediaEvents] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113192 (https://phabricator.wikimedia.org/T384333) (owner: Clare Ming)
[21:50:23] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1113192|Fix schema version for CTR instrument (T384333)]]
[21:50:28] T384333: Wrong schema used in the CTR instrument (so experimentation fragment is empty for every event) - https://phabricator.wikimedia.org/T384333
[21:55:19] !log cjming@deploy2002 cjming: Backport for [[gerrit:1113192|Fix schema version for CTR instrument (T384333)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:55:33] !log cjming@deploy2002 cjming: Continuing with sync
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250121T2200)
[22:02:29] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113192|Fix schema version for CTR instrument (T384333)]] (duration: 12m 05s)
[22:02:33] T384333: Wrong schema used in the CTR instrument (so experimentation fragment is empty for every event) - https://phabricator.wikimedia.org/T384333
[22:03:21] (PS1) BCornwall: slo_template: update SLO dates to current window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1113212
[22:04:42] FIRING: [2x] JobUnavailable: Reduced
availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:06:42] !log end of UTC late backport window
[22:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:51:25] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:53:17] PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:01:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1121 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:12:09] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1136/IPv4: Connect - KPN, AS1136/IPv6: Connect - KPN https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:13:17] RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:14:32] (PS1) Scott French: shellbox-constraints: 1 eqiad replica on 8.1 (change 1/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113217 (https://phabricator.wikimedia.org/T377038)
[23:14:33] (PS1) Scott French: shellbox-constraints: all eqiad replicas on 8.1 (change 2/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113218 (https://phabricator.wikimedia.org/T377038)
[23:14:34] (PS1) Scott French: shellbox-constraints: all replicas on PHP 8.1 (change 3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113219 (https://phabricator.wikimedia.org/T377038)
[23:21:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:53:25] (CR) Dzahn: "collaboration-services-releng is supposed to work.
The "receiver" is "name: 'collaboration-services-releng-critical" which should be the c" [puppet] - https://gerrit.wikimedia.org/r/1113163 (owner: Jelto)
[23:55:07] !incidents
[23:55:08] 5624 (RESOLVED) db2175 (paged)/MariaDB Replica SQL: s2 (paged)
[23:55:08] 5623 (RESOLVED) Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[23:55:08] 5622 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out)
[23:55:08] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org)
[23:55:08] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[23:56:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:58:38] (CR) Scott French: "For clarity, I should probably mention explicitly:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1080388 (https://phabricator.wikimedia.org/T377042) (owner: Scott French)
[23:58:48] FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:58:52] (CR) Dzahn: "The alert that is being changed here is for the SSH port, not for Apache. When looking at incident history I see that the page was only a " [puppet] - https://gerrit.wikimedia.org/r/1113163 (owner: Jelto)