[00:04:53] (03CR) 10Cwhite: [C:03+1] Netbox alerting: Add remaining netbox alert. [alerts] - 10https://gerrit.wikimedia.org/r/1092779 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [00:17:09] (03PS1) 10EggRoll97: enwiki: Add abusefilter-access-protected-vars to EFH/EFM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092956 (https://phabricator.wikimedia.org/T380332) [00:38:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1092960 [00:38:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1092960 (owner: 10TrainBranchBot) [00:47:05] FIRING: [17x] ProbeDown: Service restbase2036-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:47:43] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:54:10] FIRING: [17x] ProbeDown: Service restbase2036-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:08:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1092964 [01:08:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1092964 (owner: 10TrainBranchBot) [01:09:16] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1092960 (owner: 10TrainBranchBot) [01:23:21] (03CR) 10Pppery: "recheck" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1092956 (https://phabricator.wikimedia.org/T380332) (owner: 10EggRoll97) [01:43:15] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1092964 (owner: 10TrainBranchBot) [01:45:03] PROBLEM - Disk space on Hadoop worker on an-worker1172 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 13 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [01:47:05] FIRING: [15x] ProbeDown: Service restbase2036-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:05:03] RECOVERY - Disk space on Hadoop worker on an-worker1172 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [02:19:03] PROBLEM - Disk space on Hadoop worker on an-worker1172 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 14 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [02:24:10] FIRING: [13x] ProbeDown: Service restbase2036-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:35:03] RECOVERY - Disk space on Hadoop worker on an-worker1172 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:29] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1092416 
(https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [02:55:56] (03PS6) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) [02:55:58] (03CR) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [02:56:11] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:11:13] (03PS1) 10Andrew Bogott: Initial insetup role for cloudcephosd2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1092983 (https://phabricator.wikimedia.org/T378825) [03:16:26] (03CR) 10Andrew Bogott: [C:03+2] Initial insetup role for cloudcephosd2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1092983 (https://phabricator.wikimedia.org/T378825) (owner: 10Andrew Bogott) [03:18:02] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10338946 (10Andrew) a:05Andrew→03None puppet is updated (although untested, for obvious reasons) [03:44:54] (03PS1) 10AikoChou: ml-services: update articlequality in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092994 (https://phabricator.wikimedia.org/T374034) [03:53:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092956 (https://phabricator.wikimedia.org/T380332) (owner: 10EggRoll97) [03:55:31] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 4730 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [04:18:58] (03PS1) 10Kevin Bazira: ml-services: update article-country response schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093006 (https://phabricator.wikimedia.org/T371897) [05:31:14] (03CR) 10Raymond Ndibe: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225) (owner: 10Raymond Ndibe) [05:36:21] (03PS2) 10Raymond Ndibe: profile::manifests::toolforge::bastion: harbor to /etc/toolforge/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225) [05:54:31] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [06:27:05] FIRING: [12x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:38:28] (03PS1) 10Thiemo Kreuz (WMDE): EditionLookup: Update EntityLookup calls [extensions/Wikisource] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093172 (https://phabricator.wikimedia.org/T380304) [06:39:07] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] EditionLookup: Update EntityLookup calls [extensions/Wikisource] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093172 
(https://phabricator.wikimedia.org/T380304) (owner: 10Thiemo Kreuz (WMDE)) [06:46:53] PROBLEM - Disk space on Hadoop worker on analytics1076 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T0700) [07:00:51] (03PS1) 10Brouberol: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284) [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:05:57] (03CR) 10Pppery: [C:04-1] "Stale, need to redo" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1089939 (owner: 10Pppery) [07:07:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:10:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:13:23] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T380341 (10phaultfinder) 03NEW [07:18:29] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T380341#10339044 (10phaultfinder) [07:27:50] (03PS2) 10Muehlenhoff: Remove legacy logstash IDP service [puppet] - 10https://gerrit.wikimedia.org/r/1088323 [07:28:38] (03PS3) 10Muehlenhoff: Remove legacy logstash IDP service [puppet] - 10https://gerrit.wikimedia.org/r/1088323 [07:28:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088323 (owner: 10Muehlenhoff) [07:49:45] (03CR) 10Slyngshede: [C:03+1] profile::ldap::bitu: Remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/1092845 (owner: 10Muehlenhoff) [07:56:43] (03PS1) 10Slyngshede: Release v0.1.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1093250 [08:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:01:33] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339063 (10MoritzMuehlenhoff) For Ganeti I propose the following plan. It allows us to keep all misc services running on magru, so no need to fiddle with... 
[08:03:00] (03CR) 10Muehlenhoff: [C:03+2] Remove legacy logstash IDP service [puppet] - 10https://gerrit.wikimedia.org/r/1088323 (owner: 10Muehlenhoff) [08:05:53] (03PS2) 10Brouberol: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284) [08:07:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:10:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:33] !log Restarted CI Jenkins to upgrade Leastload plugin and remove the SSH server plugin [08:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:17] (03CR) 10Muehlenhoff: [C:03+2] profile::ldap::bitu: Remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/1092845 (owner: 10Muehlenhoff) [08:21:09] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1093250 (owner: 10Slyngshede) [08:21:42] (03CR) 10Slyngshede: [C:03+2] Release v0.1.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1093250 (owner: 10Slyngshede) [08:24:24] (03CR) 10David Caro: profile::manifests::toolforge::bastion: harbor to /etc/toolforge/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225) (owner: 10Raymond Ndibe) [08:24:56] (03Merged) 10jenkins-bot: Release v0.1.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1093250 (owner: 10Slyngshede) [08:26:03] (03CR) 10Muehlenhoff: [C:03+1] "Oops, yes of course :-) This should be good to merge now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [08:26:20] (03PS1) 10Muehlenhoff: Revert "Remove obsolete package::builder role" [puppet] - 10https://gerrit.wikimedia.org/r/1093265 [08:26:29] (03PS2) 10Muehlenhoff: Revert "Remove obsolete package::builder role" [puppet] - 10https://gerrit.wikimedia.org/r/1093265 [08:27:08] (03CR) 10CI reject: [V:04-1] Revert "Remove obsolete package::builder role" [puppet] - 10https://gerrit.wikimedia.org/r/1093265 (owner: 10Muehlenhoff) [08:30:07] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:30:40] (03PS3) 10Muehlenhoff: Revert "Remove obsolete package::builder role" [puppet] - 10https://gerrit.wikimedia.org/r/1093265 [08:33:21] (03CR) 10Muehlenhoff: [C:03+2] Revert "Remove obsolete package::builder role" [puppet] - 10https://gerrit.wikimedia.org/r/1093265 (owner: 10Muehlenhoff) [08:34:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet [08:35:14] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339088 (10ops-monitoring-bot) Draining ganeti7004.magru.wmnet of running VMs [08:35:31] (03PS1) 10Slyngshede: Bitu version 0.1.2 [dns] - 10https://gerrit.wikimedia.org/r/1093266 [08:35:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7004.magru.wmnet [08:35:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet [08:35:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7004.magru.wmnet [08:36:00] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - 
https://phabricator.wikimedia.org/T376737#10339089 (10ops-monitoring-bot) Draining ganeti7004.magru.wmnet of running VMs [08:37:01] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1093266 (owner: 10Slyngshede) [08:37:57] (03CR) 10Slyngshede: [C:03+2] Bitu version 0.1.2 [dns] - 10https://gerrit.wikimedia.org/r/1093266 (owner: 10Slyngshede) [08:42:11] (03PS3) 10Brouberol: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284) [08:43:40] (03CR) 10Arnaudb: [C:03+1] "native methods have been extensively tested, some bugs have been fixed by @rcoccioli@wikimedia.org along the way." [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [08:44:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast7001.wikimedia.org to plain [08:46:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast7001.wikimedia.org to plain [08:48:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7002.magru.wmnet to plain [08:51:18] !log disabling puppet on all k8s control planes for rollout of T380142 [08:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:22] T380142: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it - https://phabricator.wikimedia.org/T380142 [08:52:24] (03PS4) 10Brouberol: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284) [08:53:57] (03CR) 10JMeybohm: [C:03+2] kubernetes::master: Don't override sa certificates on reimage [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [08:55:22] 06SRE, 10SRE-swift-storage, 10Ceph, 
06collaboration-services, 10GitLab (Infrastructure): Migrate gitlab storage to apus - https://phabricator.wikimedia.org/T378922#10339149 (10LSobanski) [08:55:23] (03Abandoned) 10LSobanski: Switch alerts deployment source to GitLab [puppet] - 10https://gerrit.wikimedia.org/r/982086 (https://phabricator.wikimedia.org/T349626) (owner: 10LSobanski) [08:56:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7002.magru.wmnet to plain [08:57:01] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T380341#10339152 (10LSobanski) [08:58:15] (03CR) 10Aklapper: [C:03+2] EditionLookup: Update EntityLookup calls [extensions/Wikisource] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093172 (https://phabricator.wikimedia.org/T380304) (owner: 10Thiemo Kreuz (WMDE)) [08:58:24] (03CR) 10Aklapper: [C:03+2] "As this very code change has already been merged into the master branch in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikisourc" [extensions/Wikisource] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093172 (https://phabricator.wikimedia.org/T380304) (owner: 10Thiemo Kreuz (WMDE)) [08:58:27] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:58:35] PROBLEM - Bird Internet Routing Daemon on durum7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [08:58:35] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [08:58:35] PROBLEM - BFD status on 
asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:00:05] andre and brennen: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T0900). [09:00:27] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:00:33] RECOVERY - Bird Internet Routing Daemon on durum7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:00:35] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7002 is OK: OK: UP (pid=2382) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [09:00:35] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:13:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7002.magru.wmnet to plain [09:13:25] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update article-country response schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093006 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [09:13:33] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092866 (https://phabricator.wikimedia.org/T378939) (owner: 10AikoChou) [09:13:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7002.magru.wmnet to plain [09:14:17] PROBLEM - Disk space on Hadoop worker on an-worker1087 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 16 GB (0% inode=99%): 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [09:15:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7002.wikimedia.org to plain [09:15:29] (03Merged) 10jenkins-bot: EditionLookup: Update EntityLookup calls [extensions/Wikisource] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093172 (https://phabricator.wikimedia.org/T380304) (owner: 10Thiemo Kreuz (WMDE)) [09:18:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7002.wikimedia.org to plain [09:19:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:19:33] PROBLEM - Bird Internet Routing Daemon on doh7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:19:35] PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [09:19:35] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:20:16] !log aklapper@deploy2002 Started scap sync-world: Backport for [[gerrit:1093172|EditionLookup: Update EntityLookup calls (T380304)]] [09:20:20] T380304: Wikisource extension: Error: Call to undefined method Wikibase\Client\WikibaseClient::getRestrictedEntityLookup() - https://phabricator.wikimedia.org/T380304 [09:20:27] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:20:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:20:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus7001.magru.wmnet to plain [09:21:27] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:21:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus7001.magru.wmnet to plain [09:21:33] RECOVERY - Bird Internet Routing Daemon on doh7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:21:35] RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7002 is OK: OK: UP (pid=2373) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [09:21:35] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:23:54] (03PS7) 10DCausse: Migrate package to opensearch [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) (owner: 10Ebernhardson) [09:23:55] (03PS1) 10DCausse: Update README and gitreview [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1093270 [09:24:58] (03CR) 10DCausse: [C:03+1] "nice!" 
[software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) (owner: 10Ebernhardson) [09:26:41] !log aklapper@deploy2002 aklapper, thiemowmde: Backport for [[gerrit:1093172|EditionLookup: Update EntityLookup calls (T380304)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:26:52] T380304: Wikisource extension: Error: Call to undefined method Wikibase\Client\WikibaseClient::getRestrictedEntityLookup() - https://phabricator.wikimedia.org/T380304 [09:27:15] !log aklapper@deploy2002 aklapper, thiemowmde: Continuing with sync [09:27:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:28:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:30:35] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country response schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093006 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [09:31:43] (03Merged) 10jenkins-bot: ml-services: update article-country response schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093006 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [09:32:55] FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:33:17] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams [09:33:31] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams [09:33:49] !log aklapper@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093172|EditionLookup: Update EntityLookup calls (T380304)]] (duration: 13m 33s) 
[09:33:53] T380304: Wikisource extension: Error: Call to undefined method Wikibase\Client\WikibaseClient::getRestrictedEntityLookup() - https://phabricator.wikimedia.org/T380304 [09:34:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:35:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:37:56] RESOLVED: MaxConntrack: Max conntrack at 99.34% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:38:15] !log decommission cxserver endpoints /api/rest_v1/list/(pair|tool|languagepairs) from RESTBase T375616 [09:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:19] T375616: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616 [09:40:44] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093271 (https://phabricator.wikimedia.org/T375663) [09:40:46] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093271 (https://phabricator.wikimedia.org/T375663) (owner: 10TrainBranchBot) [09:41:05] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:41:33] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093271 (https://phabricator.wikimedia.org/T375663) (owner: 10TrainBranchBot) [09:43:29] (03PS1) 10Muehlenhoff: Add cn=wmf as managed group for idm-test [puppet] - 10https://gerrit.wikimedia.org/r/1093272 [09:44:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:44:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:49:24] (03CR) 10Btullis: [C:03+1] "Let's go!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [09:51:45] (03CR) 10Btullis: [C:03+2] Canary cephosd1001 to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:52:12] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.4 refs T375663 [09:52:16] T375663: 1.44.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T375663 [09:55:25] FIRING: SystemdUnitFailed: kube-publish-sa-cert.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:55:43] (03CR) 10Fabfur: cache: install lshw from bullseye-backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [09:55:57] (03PS4) 10Fabfur: cache: install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) [09:58:14] (03PS1) 10Slyngshede: P:idm enable bitu-account-manager permission request. 
[puppet] - 10https://gerrit.wikimedia.org/r/1093275 [10:04:57] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [10:05:25] RESOLVED: SystemdUnitFailed: kube-publish-sa-cert.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:04] (03CR) 10Fabfur: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1092914 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [10:06:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline (feel free to ignore)" [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [10:10:23] (03CR) 10Fabfur: cache: install lshw from bullseye-backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [10:10:36] (03PS5) 10Fabfur: cache: install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) [10:10:45] (03CR) 10Slyngshede: Add cn=wmf as managed group for idm-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff) [10:11:25] FIRING: SystemdUnitFailed: kube-publish-sa-cert.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:50] (03PS1) 10JMeybohm: Fix permissions and notify of kube-publish-sa-cert [puppet] - 10https://gerrit.wikimedia.org/r/1093280 (https://phabricator.wikimedia.org/T380142) [10:14:19] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093280 
(https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [10:16:10] RESOLVED: SystemdUnitFailed: kube-publish-sa-cert.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:17:20] (03CR) 10JMeybohm: [C:03+2] Fix permissions and notify of kube-publish-sa-cert [puppet] - 10https://gerrit.wikimedia.org/r/1093280 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [10:17:21] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071610 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [10:18:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1014.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1014.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs10 [10:18:19] .wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:18:21] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1014.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1015.eqiad.wmnet, wdqs10 [10:18:21] .wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:18:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: 
apply [10:19:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:21:06] (03PS2) 10Muehlenhoff: Add cn=wmf as managed group for idm-test [puppet] - 10https://gerrit.wikimedia.org/r/1093272 [10:21:16] is the wdqs expected? [10:21:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:21:21] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:21:27] ok [10:21:51] jynus: no most likely due to a single client abusing the service, looking [10:22:01] I see [10:22:20] let me know if I can help in any way [10:22:40] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [10:22:46] !log removing leadership from kafka-main1001 - T363214 [10:22:48] jynus: sure, thanks for the offer! [10:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:51] T363214: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214 [10:22:58] (03CR) 10Muehlenhoff: Add cn=wmf as managed group for idm-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff) [10:23:09] going to wait a bit before investigating sparql query logs [10:24:03] seems like it's recovering... (crossing fingers) [10:26:05] (03CR) 10Muehlenhoff: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [10:27:05] FIRING: [12x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:28:30] (03CR) 10Fabfur: [C:03+2] cache: install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [10:32:05] FIRING: [14x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:33:13] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[1001,1006].eqiad.wmnet with reason: Hardware refresh [10:33:21] !log re-enabled puppet on all k8s controll planes for rollout of T380142 [10:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:25] T380142: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it - https://phabricator.wikimedia.org/T380142 [10:33:28] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[1001,1006].eqiad.wmnet with reason: Hardware refresh [10:33:30] !log btullis@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on P{cephosd1001.eqiad.wmnet} and (A:cephosd) [10:34:32] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams [10:35:17] (03CR) 10AikoChou: [C:03+2] ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092866 
(https://phabricator.wikimedia.org/T378939) (owner: 10AikoChou) [10:35:53] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:36:18] (03Merged) 10jenkins-bot: ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092866 (https://phabricator.wikimedia.org/T378939) (owner: 10AikoChou) [10:36:19] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:37:04] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams [10:38:26] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru [10:38:41] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru [10:39:55] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:42:19] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:43:40] !log btullis@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on P{cephosd1001.eqiad.wmnet} and (A:cephosd) [10:50:21] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet are mar [10:50:21] but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [10:50:49] sigh... [10:51:12] FIRING: [2x] ProbeDown: Service aux-k8s-ctrl1003:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:51:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled: aux-k8s-ctrl_6443: Serv [10:51:19] k8s-ctrl1002.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:53:02] !incidents [10:53:02] 5464 (ACKED) [2x] ProbeDown sre (aux-k8s-ctrl1003:6443 probes/custom eqiad) [10:53:02] dcausse: hey hey, you know what's up or should we start investigating? 
[10:53:06] I am here [10:53:09] I was acking it [10:53:16] probe down is potentially me [10:54:08] I am checking traffic impact [10:54:44] lots of probes at 50% [10:54:54] then it's not me [10:55:15] this one is the only one that goes at 60% [10:55:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:55:21] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:55:52] issues started around 10:15 [10:56:09] potentially before, around 8:48 [10:56:12] RESOLVED: [2x] ProbeDown: Service aux-k8s-ctrl1003:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:56:33] is this a network availability issue, a host availability issue? [10:56:58] XioNoX: re wdqs, most probably a single client, investigation might take time since I have to go through query logs :/ [10:57:05] or a monitoring issue? [10:58:17] does anyone see service problems other than the failing probes? [10:59:10] FIRING: [16x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:59:40] jayme: ^ is that you? [10:59:46] I don't think so [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1100) [11:00:06] all apiservers respond to queries just fine [11:00:49] XioNoX: on the layer that you know, do you see any connectivity issue between prometheus and those reported failed services?
[11:00:59] I will check prometheus meanwhile [11:02:05] FIRING: [16x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:02:52] jynus: looking [11:03:06] if the services themselves look fine, it is either the connectivity or the source host/service [11:03:42] XioNoX: there is increase in network activity on prometheus codfw hosts, maybe a clue [11:03:57] there are a ton of things in the probedown alert (36) - but even when expanding the card in the UI I don't see a kube apiserver there [11:04:46] maybe prometheus is getting saturated [11:05:45] there was a spike of writes on prometheus too [11:05:50] https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?orgId=1&from=now-12h&to=now [11:05:57] yeah something is up [11:06:49] but started a while ago [11:06:57] I'm getting intermittent connection issues between deploy and the kubemaster [11:06:58] "socket: permission denied" [11:07:03] and slowly went up [11:07:52] cgoubert@deploy2002:~$ kubectl get job mw-script.codfw.kftdqx7r -o yaml [11:07:54] The connection to the server kubemaster.svc.codfw.wmnet:6443 was refused - did you specify the right host or port? kj [11:07:58] was there any recent firewall update? 
[11:08:03] next invocation went through [11:08:41] there was an update to the envoy firewall rules yesterday [11:08:54] not sure if it's relevant, I'd expect it to have bitten us right away if so [11:08:55] this seems more recent, a few hours at most [11:09:04] yeah [11:09:30] from the monitoring stuff started between 7:30/8:30UTC [11:09:36] got a signal here, unsure if relevant: https://logstash.wikimedia.org/goto/987de189fd67c42d8080ae3ab22dbf54 [11:09:39] maybe a bit before [11:10:39] (03PS1) 10Tiziano Fogli: opensearch: reduce noise of PrometheusRuleEvaluationFailures [alerts] - 10https://gerrit.wikimedia.org/r/1093302 (https://phabricator.wikimedia.org/T374178) [11:10:44] nah, it happens all the time, it is noise [11:11:00] no smoking gun on the network metrics so far [11:11:26] claime: I did trigger a restart of all kube-apiserver processes via a puppet change ~30min ago [11:18:39] looking at https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2 (logstash logs) there are no smoking gun neither [11:22:39] !log decommission cxserver endpoints /api/rest_v1/transform/html/from, /api/rest_v1/transform/word/from from RESTBase T375616 [11:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:44] T375616: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616 [11:24:41] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru [11:27:32] So the only anomalies I can see are for small spikes of wdqs errors, but those happen before [11:28:22] the only issues I see ongoing are wdqs and etherpad not near 100% availability [11:28:44] (03CR) 10Máté Szabó: [C:03+1] prometheus-mcrouter-exporter: update to v0.4.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092338 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli) [11:28:46] but issues 
with those services wouldn't explain the probe failures [11:30:02] jynus: from there https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?orgId=1&from=now-24h&to=now&viewPanel=2 it looks like something is funky with prometheus itself [11:30:19] both eqiad and codfw prom endpoints [11:30:33] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru [11:31:06] XioNoX: I think you are right, some kind of overload [11:31:17] so probes could have failed from client side [11:31:38] let's ping observability and keep an eye on it [11:32:10] jynus: +1 [11:32:33] jynus: who from o11y is on the closest timezone? [11:32:51] I am talking on their channel [11:36:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2143.codfw.wmnet with OS bookworm [11:37:14] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2144.codfw.wmnet with OS bookworm [11:38:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2145.codfw.wmnet with OS bookworm [11:38:44] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2146.codfw.wmnet with OS bookworm [11:39:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2147.codfw.wmnet with OS bookworm [11:39:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2148.codfw.wmnet with OS bookworm [11:40:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:40:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2149.codfw.wmnet 
with OS bookworm [11:55:20] (03CR) 10Fabfur: [C:03+2] haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [11:56:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2144.codfw.wmnet with reason: host reimage [11:56:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2143.codfw.wmnet with reason: host reimage [11:57:25] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2145.codfw.wmnet with reason: host reimage [11:57:38] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2146.codfw.wmnet with reason: host reimage [11:58:20] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2147.codfw.wmnet with reason: host reimage [11:59:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2148.codfw.wmnet with reason: host reimage [11:59:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2149.codfw.wmnet with reason: host reimage [12:00:04] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1200). 
[12:00:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:01:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2144.codfw.wmnet with reason: host reimage [12:02:21] (03PS1) 10Fabfur: haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) [12:04:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2149.codfw.wmnet with reason: host reimage [12:05:52] (03PS1) 10Fabfur: Revert "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1093321 [12:06:36] (03CR) 10Vgutierrez: [C:04-1] "you need to pass the chosen user from haproxykafka profile to the haproxykafka class, where right now is taking the default hardcoded valu" [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [12:06:51] (03PS1) 10Cathal Mooney: Temporarily change cumin installserver alias to not include mgaru [puppet] - 10https://gerrit.wikimedia.org/r/1093322 (https://phabricator.wikimedia.org/T376737) [12:07:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2148.codfw.wmnet with reason: host reimage [12:08:12] !log disable puppet on cumin2002 to test cumin alias for A:installserver [12:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:19] (03CR) 10Fabfur: [C:03+2] Revert "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1093321 (owner: 10Fabfur) [12:09:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in 
ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:10:09] (03CR) 10Slyngshede: Add cn=wmf as managed group for idm-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff) [12:10:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:11:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2147.codfw.wmnet with reason: host reimage [12:14:43] !log sukhe@cumin1002 START - Cookbook sre.hosts.dhcp for host cp7007.magru.wmnet [12:15:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2145.codfw.wmnet with reason: host reimage [12:15:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:16:32] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cp7007.magru.wmnet [12:16:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.dhcp for host cp7007.magru.wmnet [12:18:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2146.codfw.wmnet with reason: host reimage [12:19:09] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cp7007.magru.wmnet [12:19:42] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:20:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS bullseye [12:21:13] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339739 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp7007.magru.wmnet with... [12:21:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2144.codfw.wmnet with OS bookworm [12:22:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2143.codfw.wmnet with reason: host reimage [12:22:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:23:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:23:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2149.codfw.wmnet with OS bookworm [12:26:13] (03CR) 10Effie Mouzeli: [C:03+2] prometheus-mcrouter-exporter: update to v0.4.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092338 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli) [12:26:29] (03CR) 10Effie Mouzeli: [V:03+2 C:03+2] prometheus-mcrouter-exporter: update to v0.4.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092338 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli) [12:26:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2148.codfw.wmnet with OS bookworm [12:28:23] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1093323 [12:31:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2147.codfw.wmnet with OS bookworm [12:33:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:33:59] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.57 ms [12:34:11] PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-Y on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:34:11] PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-Y on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:34:11] PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-X on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:34:11] PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-X on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:34:11] PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-Z on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:34:11] PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-Z on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:34:30] ? 
[12:34:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2145.codfw.wmnet with OS bookworm [12:34:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:36:27] (03PS3) 10Muehlenhoff: Add cn=wmf as managed group for idm-test [puppet] - 10https://gerrit.wikimedia.org/r/1093272 [12:36:47] (03CR) 10Muehlenhoff: Add cn=wmf as managed group for idm-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff) [12:38:25] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339847 (10RobH) Work rescheduled after conversation with both @ssingh and @MoritzMuehlenhoff regarding ganeti host cadence and swa... [12:38:27] !log re-enable puppet on cumin2002 [12:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2146.codfw.wmnet with OS bookworm [12:39:28] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093325 [12:40:23] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [12:40:45] RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:41:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2150.codfw.wmnet with OS bookworm [12:41:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2151.codfw.wmnet with OS bookworm [12:41:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host wikikube-worker2143.codfw.wmnet with OS bookworm [12:42:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2152.codfw.wmnet with OS bookworm [12:42:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2153.codfw.wmnet with OS bookworm [12:43:31] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2154.codfw.wmnet with OS bookworm [12:44:02] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339856 (10ssingh) Thanks @RobH , sounds good! >>! In T376737#10339847, @RobH wrote: > Work rescheduled after conversation with bo... [12:44:06] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2155.codfw.wmnet with OS bookworm [12:46:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7007.magru.wmnet with reason: host reimage [12:49:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1017.eqiad.wmnet [12:49:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10339906 (10ops-monitoring-bot) Draining ganeti1017.eqiad.wmnet of running VMs [12:50:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:50:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7007.magru.wmnet with reason: host reimage [12:51:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:53:45] Is enwiki down for anyone else? [12:53:57] Can't access it from Singapore atm [12:54:14] Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes. 
[12:54:36] Nvm, back up [12:54:57] FIRING: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:55:17] acked [12:55:21] * Emperor here [12:55:22] hey hey [12:55:23] Niharika: on it [12:55:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1017.eqiad.wmnet [12:55:43] !incidents [12:55:43] 5465 (ACKED) [4x] ProbeDown sre (text-https:443 probes/service) [12:55:43] 5464 (RESOLVED) [2x] ProbeDown sre (aux-k8s-ctrl1003:6443 probes/custom eqiad) [12:56:45] looks like that p.aged everyone immediately? [12:57:05] RESOLVED: [8x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:15] seems so, yes [12:57:16] I got a page yes [12:58:12] me too [12:58:15] was it a temp blip? [12:58:18] Yup [12:58:32] Yup to the "paged everyone" [12:58:46] Niharika: can you confirm it works now? [12:59:01] I acked it relatively fast [12:59:13] could be what I said about routing issues the other day?
[12:59:14] I'm going to update T371244 [12:59:14] T371244: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244 [12:59:32] I see in the VO log "User escalator_sysuser routed incident #5465 from SRE:SRE Business Hours (Escalation) to SRE:SRE Batphone (Escalation)" [12:59:34] which I attributed to a UI bug, but may be something ongoing [12:59:52] the VO app tells me that the Batphone is on-call now [12:59:57] RESOLVED: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:00:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2150.codfw.wmnet with reason: host reimage [13:00:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2151.codfw.wmnet with reason: host reimage [13:00:42] !oncall-now [13:00:43] Oncall now for team SRE, rotation business_hours: [13:00:43] X.ioNoX, j.ynus [13:01:20] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2152.codfw.wmnet with reason: host reimage [13:01:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2153.codfw.wmnet with reason: host reimage [13:01:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1017.eqiad.wmnet [13:01:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10339960 (10ops-monitoring-bot) Draining ganeti1017.eqiad.wmnet of running VMs [13:01:55] weird, the VO website doesn't [13:02:14] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2154.codfw.wmnet with reason: 
host reimage [13:02:21] ok maybe the very nice and intuitive UI of the app is fooling me, not sure [13:02:58] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2155.codfw.wmnet with reason: host reimage [13:03:40] elukey: I've added your observation to my note on T371244 [13:03:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2150.codfw.wmnet with reason: host reimage [13:04:27] 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10339967 (10MatthewVernon) I think this happened again today, with [[ https://portal.victorops.com/ui/wikimedia/incident/5465/details | incident 5465 ]] - everyone w... [13:04:46] elukey: I think batphone is always "on-call" for escalation purposes but I may be wrong [13:04:59] VO is so clear on what's happening /s [13:06:46] (03PS2) 10Anzx: knwiki: update portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093328 (https://phabricator.wikimedia.org/T380366) [13:07:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2153.codfw.wmnet with reason: host reimage [13:07:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093328 (https://phabricator.wikimedia.org/T380366) (owner: 10Anzx) [13:11:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2155.codfw.wmnet with reason: host reimage [13:12:22] (03CR) 10Muehlenhoff: [C:03+2] idm: Set Envoy firewall config in nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1092842 (owner: 10Muehlenhoff) [13:14:15] (03PS1) 10Effie Mouzeli: kafka-main: Replace kafka-main1001 with 
kafka-main1006 [puppet] - 10https://gerrit.wikimedia.org/r/1093330 (https://phabricator.wikimedia.org/T363214) [13:14:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2154.codfw.wmnet with reason: host reimage [13:16:01] (03PS2) 10Effie Mouzeli: kafka-main: Replace kafka-main1001 with kafka-main1006 [puppet] - 10https://gerrit.wikimedia.org/r/1093330 (https://phabricator.wikimedia.org/T363214) [13:17:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2152.codfw.wmnet with reason: host reimage [13:17:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7007.magru.wmnet with OS bullseye [13:17:42] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340002 (10RobH) [13:17:45] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340017 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp7007.magru.wmnet with OS... [13:18:45] (03CR) 10Clément Goubert: [C:03+1] kafka-main: Replace kafka-main1001 with kafka-main1006 [puppet] - 10https://gerrit.wikimedia.org/r/1093330 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [13:19:34] (03CR) 10Effie Mouzeli: [C:03+2] kafka-main: Replace kafka-main1001 with kafka-main1006 [puppet] - 10https://gerrit.wikimedia.org/r/1093330 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [13:20:19] (03CR) 10Btullis: [C:03+2] "I'm happy that this works as expected now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [13:20:47] (03PS1) 10Muehlenhoff: idp-test: Set Envoy firewall config in nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1093332 [13:21:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2151.codfw.wmnet with reason: host reimage [13:23:07] claime: yes yes definitely, the escalation is always "on-call", I thought I'd seen EMEA as well for batphone, but probably got fooled by the UI [13:23:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2150.codfw.wmnet with OS bookworm [13:24:39] (03PS1) 10Btullis: Upgrade the remainder of the cephosd cluster to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1093333 (https://phabricator.wikimedia.org/T327259) [13:26:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2153.codfw.wmnet with OS bookworm [13:26:00] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1093333 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [13:26:02] (03PS57) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [13:26:16] besides the VO issue, do we know what happened? 
[13:26:26] !log jiji@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad [13:28:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:28:41] !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. [13:29:25] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093333 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [13:29:51] (03CR) 10Btullis: [V:03+1 C:03+2] Upgrade the remainder of the cephosd cluster to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1093333 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [13:29:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:30:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093332 (owner: 10Muehlenhoff) [13:31:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2155.codfw.wmnet with OS bookworm [13:33:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2154.codfw.wmnet with OS bookworm [13:36:21] !log jiji@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad [13:38:00] jynus: sorry, I got pulled into a meeting. 
Yeah, it worked after a minute or so [13:38:10] Thanks for tackling it [13:38:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2152.codfw.wmnet with OS bookworm [13:38:46] !log putting kafka-main1006.eqiad.wmnet in production [13:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2151.codfw.wmnet with OS bookworm [13:44:35] !log homer 'lsw1-d1-codfw*' commit 'T377028' [13:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:39] T377028: wikikube-worker21[36-55] implementation tracking - https://phabricator.wikimedia.org/T377028 [13:45:34] !log homer 'lsw1-b2-codfw*' commit 'T377028' [13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:47] !log homer 'lsw1-d6-codfw*' commit 'T377028' [13:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:55] !log homer 'lsw1-c7-codfw*' commit 'T377028' [13:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:37] !log homer 'lsw1-b7-codfw*' commit 'T377028' [13:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:21] !log homer 'lsw1-d5-codfw*' commit 'T377028' [13:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:03] !log homer 'lsw1-c4-codfw*' commit 'T377028' [13:50:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:50:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:07] T377028: wikikube-worker21[36-55] implementation tracking - https://phabricator.wikimedia.org/T377028 [13:50:43] !log homer 'lsw1-d7-codfw*' commit 'T377028' [13:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:25] !log homer 'lsw1-c2-codfw*' commit 'T377028' [13:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:13] !log homer 'lsw1-d2-codfw*' commit 'T377028' [13:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:52:58] !log homer 'lsw1-b4-codfw*' commit 'T377028' [13:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:34] (03PS58) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [13:53:39] !log homer 'lsw1-d4-codfw*' commit 'T377028' [13:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:54] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2136-2139,2141-2155].codfw.wmnet [13:56:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2136-2139,2141-2155].codfw.wmnet [13:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:57:34] (03PS1) 10Effie Mouzeli: Update various kafka-main connection strings for kafka-main1006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093337 (https://phabricator.wikimedia.org/T363214) [13:58:10] (03PS1) 10David Caro: toolforge:haproxy: add api gateway health check [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633) [13:58:48] (03CR) 10CI reject: [V:04-1] toolforge:haproxy: add api gateway health check [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro) [13:59:34] (03CR) 10CI reject: [V:04-1] mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) (owner: 10Arnaudb) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1400). [14:00:05] albertoleoncio and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] (03PS2) 10David Caro: toolforge:haproxy: add api gateway health check [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633) [14:00:24] Hi! 
[14:00:31] (03CR) 10David Caro: "Note that this depends on https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/53" [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro) [14:02:18] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:02:45] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:02:47] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:02:56] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:02:58] !log jiji@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:03:32] !log jiji@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:03:34] !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:03:49] !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:03:51] !log jiji@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:04:28] !log jiji@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:04:30] !log jiji@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:04:46] !log jiji@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:04:47] !log jiji@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:04:56] (03PS2) 10Fabfur: haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) [14:05:05] !log jiji@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [14:05:06] !log jiji@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:05:07] (03CR) 10CI reject: [V:04-1] haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [14:05:44] !log jiji@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:05:45] !log jiji@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [14:06:23] !log jiji@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:07:47] Ping... [14:12:53] (03CR) 10Vgutierrez: [C:04-1] haproxykafka: fix permissions on ssl files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [14:15:38] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro) [14:16:18] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10340289 (10MoritzMuehlenhoff) [14:17:04] (03PS3) 10Fabfur: haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) [14:17:14] (03CR) 10CI reject: [V:04-1] haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [14:17:31] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff) [14:17:45] (03CR) 10Ssingh: "I think we should keep this if we need it next week but most likely we will use install7001." 
[puppet] - 10https://gerrit.wikimedia.org/r/1093322 (https://phabricator.wikimedia.org/T376737) (owner: 10Cathal Mooney) [14:17:52] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093332 (owner: 10Muehlenhoff) [14:18:03] (03PS1) 10Ssingh: Revert "Change insrallserver in magru to point to eqiad insrall server" [homer/public] - 10https://gerrit.wikimedia.org/r/1093340 [14:18:21] (03PS1) 10Ssingh: Revert "magru: use eqiad's installserver temporarily for testing" [puppet] - 10https://gerrit.wikimedia.org/r/1093341 [14:18:38] (03CR) 10Fabfur: haproxykafka: fix permissions on ssl files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [14:19:20] (03CR) 10Ssingh: [C:03+2] Revert "magru: use eqiad's installserver temporarily for testing" [puppet] - 10https://gerrit.wikimedia.org/r/1093341 (owner: 10Ssingh) [14:20:38] (03PS4) 10Fabfur: haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) [14:20:40] (03PS2) 10Papaul: Revert "Change insallserver in magru to point to eqiad install server" [homer/public] - 10https://gerrit.wikimedia.org/r/1093340 (owner: 10Ssingh) [14:21:11] (03CR) 10Papaul: [C:03+1] Revert "Change insallserver in magru to point to eqiad install server" [homer/public] - 10https://gerrit.wikimedia.org/r/1093340 (owner: 10Ssingh) [14:21:46] (03PS3) 10Papaul: Revert "Change installserver in magru to point to eqiad install server" [homer/public] - 10https://gerrit.wikimedia.org/r/1093340 (owner: 10Ssingh) [14:22:38] (03CR) 10Ssingh: [C:03+2] Revert "Change installserver in magru to point to eqiad install server" [homer/public] - 10https://gerrit.wikimedia.org/r/1093340 (owner: 10Ssingh) [14:23:05] !log running homer on asw*magru* [14:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:26] !log starting resharding of commons backup 
files into new host backup1010 T376892 [14:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:29] T376892: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892 [14:23:55] ^ XioNoX expect increase internal traffic for a few days (probably unnoticed, but FYI) [14:24:01] *increased [14:24:32] !log btullis@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on P{cephosd100[2-4].eqiad.wmnet} and (A:cephosd) [14:24:43] ok [14:25:14] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-11-12-161156 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093347 (https://phabricator.wikimedia.org/T377547) [14:25:18] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-11-13-145636 to 2024-11-18-142635 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093348 (https://phabricator.wikimedia.org/T376938) [14:25:19] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-11-18-142635 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093349 (https://phabricator.wikimedia.org/T378044) [14:25:25] jouncebot: nowandnext [14:25:25] For the next 0 hour(s) and 34 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1400) [14:25:26] In 0 hour(s) and 34 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1500) [14:25:43] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [reason: host reimaged] [14:26:12] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [14:26:56] Lucas_WMDE: urbanecm: TheresNoTime: is the backport window being used? 
[14:27:36] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:27:38] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:36] (03PS1) 10Muehlenhoff: debmonitor: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1093350 [14:29:46] !log T380226 💙cdanis@mwmaint2002.codfw.wmnet ~ 🕤☕ mwscript sql.php --wiki=commonswiki --cluster=extension1 /srv/mediawiki/php-1.44.0-wmf.4/extensions/JsonConfig/sql/mysql/tables-generated.sql [14:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:52] T380226: Install globaljsonlinks* tables on X1 for use with commons commons for Charts deployment - https://phabricator.wikimedia.org/T380226 [14:30:36] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:30:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093350 (owner: 10Muehlenhoff) [14:30:38] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:35:01] (03CR) 10Klausman: [C:03+1] ml-services: update articlequality in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092994 (https://phabricator.wikimedia.org/T374034) (owner: 10AikoChou) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:44] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, 
AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:42:36] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:43:44] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:45:36] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:44] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:48:19] (03PS1) 10Klausman: ml-staging/experimental: Fix wrongly-scoped limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093354 [14:48:37] (03PS1) 10Btullis: Failover analytics-hive to standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/1093355 (https://phabricator.wikimedia.org/T377938) [14:49:20] (03PS4) 10Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) [14:49:30] (03CR) 10Muehlenhoff: [C:03+2] Add cn=wmf as managed group for idm-test [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff) [14:49:30] (03PS1) 10Arlolra: Bump wikimedia/parsoid to 0.21.0-a7 [vendor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093356 (https://phabricator.wikimedia.org/T373776) [14:49:32] (03CR) 10Klausman: [V:03+1] "Verified by hotpatching:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093354 (owner: 10Klausman) [14:50:53] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hadoop.roll-restart-workers (exit_code=99) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. 
[14:53:53] !log power cycling unresponsive mgmt switch in codfw: msw-c3-codfw [14:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:20] (03PS1) 10Arlolra: Bump wikimedia/parsoid to 0.21.0-a7 [vendor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093358 (https://phabricator.wikimedia.org/T373776) [14:56:36] (03PS1) 10Arlolra: Bump wikimedia/parsoid to 0.21.0-a7 [core] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093359 (https://phabricator.wikimedia.org/T380333) [14:57:18] (03Abandoned) 10Arlolra: Bump wikimedia/parsoid to 0.21.0-a7 [vendor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093356 (https://phabricator.wikimedia.org/T373776) (owner: 10Arlolra) [14:57:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:57:34] PROBLEM - BGP status on lsw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:57:40] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:58:51] (03PS1) 10Muehlenhoff: idm-test: Fix syntax for wmf group config [puppet] - 10https://gerrit.wikimedia.org/r/1093360 [14:59:19] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10340445 (10elukey) @Jclark-ctr the host is provisioned, next step is the number 2 in T370453#10326159, lemme know if you want me to do it or not! [14:59:50] (03CR) 10Slyngshede: [C:03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1093360 (owner: 10Muehlenhoff) [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1500) [15:00:20] 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10340460 (10herron) escalator_sysuser is our account for the vo-escalate service which runs from the active alert host. vo-escalate checks every 15 seconds looking... [15:00:40] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:00:42] (03CR) 10Vgutierrez: [C:03+1] haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [15:03:34] RECOVERY - BGP status on lsw1-f1-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:04:01] (03CR) 10Muehlenhoff: [C:03+2] idm-test: Fix syntax for wmf group config [puppet] - 10https://gerrit.wikimedia.org/r/1093360 (owner: 10Muehlenhoff) [15:04:22] !log btullis@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on P{cephosd100[2-4].eqiad.wmnet} and (A:cephosd) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:37] (03CR) 10Fabfur: [C:04-1] "To be merged after https://phabricator.wikimedia.org/T380373" [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [15:07:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 20 UTC late 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093359 (https://phabricator.wikimedia.org/T380333) (owner: 10Arlolra) [15:07:51] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade evaluators from 2024-11-12-161156 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093347 (https://phabricator.wikimedia.org/T377547) (owner: 10Jforrester) [15:08:52] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-11-12-161156 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093347 (https://phabricator.wikimedia.org/T377547) (owner: 10Jforrester) [15:09:24] !log bootstrapping cassandra, restbase2037-{a,b,c} — T380236 [15:09:26] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:29] T380236: Refresh restbase202[1-3] w/ restbase203[6-8] - https://phabricator.wikimedia.org/T380236 [15:09:50] RECOVERY - ps1-c3-codfw-infeed-load-tower-A-phase-Z on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-A-phase-Z 432 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:10:14] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:12:30] PROBLEM - Host lsw1-c4-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:12:40] PROBLEM - Host ps1-c4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:13:00] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:13:30] 06SRE, 10Bitu, 06Infrastructure-Foundations: Add feature for a user to request removal of an LDAP group - https://phabricator.wikimedia.org/T380382 (10MoritzMuehlenhoff) 03NEW [15:13:36] 06SRE, 10Bitu, 06Infrastructure-Foundations: Add feature for a user to request removal of an LDAP group - 
https://phabricator.wikimedia.org/T380382#10340537 (10MoritzMuehlenhoff) p:05Triage→03Low [15:13:41] 07sre-alert-triage, 10SRE Observability (FY2024/2025-Q2): Alert in need of triage: JobUnavailable - https://phabricator.wikimedia.org/T380022#10340525 (10lmata) a:03tappof [15:13:51] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:14:28] (03PS4) 10Ssingh: wmflib: add facter data for lshw -class memory [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) [15:14:48] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:15:55] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:17:19] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-11-13-145636 to 2024-11-18-142635 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093348 (https://phabricator.wikimedia.org/T376938) (owner: 10Jforrester) [15:17:52] (03CR) 10David Caro: [C:03+2] toolforge:haproxy: add api gateway health check [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro) [15:18:03] (03PS5) 10Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) [15:18:29] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-11-13-145636 to 2024-11-18-142635 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093348 (https://phabricator.wikimedia.org/T376938) (owner: 10Jforrester) [15:18:49] (03CR) 10Ssingh: "Some updates:" [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [15:19:20] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:19:55] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:21:24] (03CR) 
10JMeybohm: [C:03+1] Update various kafka-main connection strings for kafka-main1006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093337 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [15:21:28] (03CR) 10Elukey: "test-cookbooked, it seems working fine, lemme know!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:22:03] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:22:56] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:23:01] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:23:48] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:24:37] (03PS5) 10Brouberol: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284) [15:24:37] (03PS1) 10Brouberol: Airflow: add missing hive connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093364 (https://phabricator.wikimedia.org/T380284) [15:24:39] (03PS1) 10Brouberol: airflow: upgrade base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093365 (https://phabricator.wikimedia.org/T380284) [15:24:40] (03PS1) 10Brouberol: airflow: allow multiple DAG folders to be pulled in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093366 (https://phabricator.wikimedia.org/T380284) [15:25:04] (03CR) 10JMeybohm: [C:04-1] sre.hosts.{dhcp,reimage}: force tftp as default option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:25:22] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-11-18-142635 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093349 (https://phabricator.wikimedia.org/T378044) 
(owner: 10Jforrester) [15:26:15] (03PS6) 10Ssingh: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) [15:26:35] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-11-18-142635 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093349 (https://phabricator.wikimedia.org/T378044) (owner: 10Jforrester) [15:26:39] (03CR) 10Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:26:55] (03CR) 10Ssingh: P:hardware::check: add profile to check HW configuration (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [15:27:10] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:28:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340624 (10Papaul) @Jhancock.wm @Clement_Goubert the interface on the switch side is up ` xe-0/0/26 up up wikikube-worker2140 [15:28:52] PROBLEM - Hadoop DataNode on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [15:29:11] (03CR) 10Subramanya Sastry: [C:03+1] Bump wikimedia/parsoid to 0.21.0-a7 [vendor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093358 (https://phabricator.wikimedia.org/T373776) (owner: 10Arlolra) [15:31:52] RECOVERY - Host ps1-c4-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms [15:31:53] !log starting resharding of commons backup files into new host backup2010 T376892 [15:31:54] PROBLEM - MariaDB Replica Lag: s1 on db1206 is 
CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:57] T376892: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892 [15:32:04] RECOVERY - Host lsw1-c4-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [15:32:32] I'll downtime the host, it serves a very tiny fraction of "legit" mysql traffic [15:32:41] and is impacted by running dumps [15:33:22] (03CR) 10Cathal Mooney: [C:03+1] hiera: set do_ipv6_primary_ra for all LVS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1091243 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [15:33:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: host overworked by dumps - T368098 [15:33:42] T368098: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098 [15:33:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: host overworked by dumps - T368098 [15:35:46] (03CR) 10Ebernhardson: [V:03+2 C:03+2] Repoint .gitreview at new repo [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1092936 (owner: 10Ebernhardson) [15:37:22] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:37:40] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:39:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340666 (10Clement_Goubert) i just managed to mount the ip adresses on the other interface `eno12399np0` and the link is up. Looks like the wrong one go... 
[15:39:56] PROBLEM - Host lsw1-c4-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:40:09] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:40:31] (03PS1) 10Brouberol: an-test-client1002: ensure that airflow services are absent [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) [15:40:40] PROBLEM - Host ps1-c4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:41:44] (03PS2) 10Brouberol: an-test-client1002: ensure that airflow services are absent [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) [15:41:57] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-11-19-132736 to 2024-11-19-140330 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093369 [15:42:19] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-11-19-132736 to 2024-11-19-140330 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093369 (owner: 10Jforrester) [15:43:26] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-11-19-132736 to 2024-11-19-140330 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093369 (owner: 10Jforrester) [15:43:44] RECOVERY - ps1-c3-codfw-infeed-load-tower-A-phase-Y on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-A-phase-Y 502 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:43:46] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.20 ms [15:43:52] RECOVERY - ps1-c3-codfw-infeed-load-tower-B-phase-Z on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-B-phase-Z 434 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:43:52] RECOVERY - ps1-c3-codfw-infeed-load-tower-A-phase-X on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-A-phase-X 473 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:43:52] RECOVERY - ps1-c3-codfw-infeed-load-tower-B-phase-Y on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-B-phase-Y 523 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:43:52] RECOVERY - ps1-c3-codfw-infeed-load-tower-B-phase-X on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-B-phase-X 483 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:43:52] RECOVERY - Hadoop DataNode on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [15:43:56] RECOVERY - Host lsw1-c3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.76 ms [15:44:02] RECOVERY - BGP status on lsw1-c3-codfw.mgmt is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:44:02] RECOVERY - Juniper alarms on lsw1-c3-codfw.mgmt is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:44:05] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:44:34] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:44:47] !log dancy@deploy2002 Started scap sync-world: no-op deployment for testing. 
[15:45:00] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (DIFF 6 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:45:09] RESOLVED: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:46:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340713 (10Papaul) @Clement_Goubert on your output below you was looking at the second interface (eno12409np1) ` root@wikikube-worker2140:~# ethtool en... [15:47:51] (03CR) 10Btullis: an-test-client1002: ensure that airflow services are absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:48:08] !log dancy@deploy2002 Finished scap sync-world: no-op deployment for testing. 
(duration: 03m 21s) [15:49:22] (03CR) 10Btullis: [C:03+1] Airflow: add missing hive connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093364 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:49:34] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:49:36] (03CR) 10Btullis: [C:03+1] airflow: upgrade base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093365 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:49:36] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340717 (10Clement_Goubert) Yes, `eno12409np1` was the one where the IPs were originally mounted when I encountered the issue. In order to troubleshoot,... [15:49:40] (03CR) 10Brouberol: [V:03+1] an-test-client1002: ensure that airflow services are absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:50:09] (03CR) 10Brouberol: [C:03+2] Airflow: add missing hive connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093364 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:50:13] (03CR) 10Brouberol: [C:03+2] airflow: upgrade base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093365 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:50:21] brouberol: We're in our window right now. 
[15:50:26] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:50:35] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:51:21] (03CR) 10Btullis: airflow: allow multiple DAG folders to be pulled in (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093366 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:51:23] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:51:32] James_F: oh so I should hold off all deployments, even to dse-k8s-eqiad? [15:51:32] (03Merged) 10jenkins-bot: Airflow: add missing hive connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093364 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:51:34] (03Merged) 10jenkins-bot: airflow: upgrade base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093365 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:51:43] brouberol: Ideally. We'll be done in 8 mins. [15:51:48] no worries [15:51:52] Well, we'll be done in a few seconds actually. :-) [15:51:54] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340716 (10RobH) Confirmed new window with Willy and sent update to ticket: > Support, Can we shift this to work on Monday, Nove... [15:52:29] brouberol: Over to you. [15:52:46] (03CR) 10Btullis: an-test-client1002: ensure that airflow services are absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:53:20] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340723 (10Papaul) 05Open→03Resolved glad all is working> I am resolving this task. Thank you [15:53:27] 👍 thanks! 
[15:55:31] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340731 (10Clement_Goubert) 05Resolved→03Open @Papaul sorry for the misunderstanding, but it's not resolved. The interface that is supposed to have... [15:55:45] (03CR) 10Brouberol: airflow: allow multiple DAG folders to be pulled in (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093366 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:56:51] (03PS3) 10Brouberol: an-test-client1002: ensure that airflow services are absent [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) [15:56:55] (03CR) 10Brouberol: an-test-client1002: ensure that airflow services are absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:57:35] (03CR) 10Btullis: [C:03+1] an-test-client1002: ensure that airflow services are absent [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:57:48] (03CR) 10Brouberol: [C:03+2] an-test-client1002: ensure that airflow services are absent [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [15:58:05] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340746 (10Papaul) @Clement_Goubert got you know i will fix it in netbox. Sorry i misunderstood you. 
[16:01:47] (03PS1) 10Brouberol: an-test-client1002: disable puppet management of airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) [16:02:06] (03PS1) 10Arnaudb: mariadb: db1246 temporary insetup [puppet] - 10https://gerrit.wikimedia.org/r/1093372 (https://phabricator.wikimedia.org/T374215) [16:02:07] (03CR) 10Arnaudb: "as discussed on SRE foundation https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-sre-foundations/20241120.txt" [puppet] - 10https://gerrit.wikimedia.org/r/1093372 (https://phabricator.wikimedia.org/T374215) (owner: 10Arnaudb) [16:04:14] (03CR) 10Btullis: "I don't think we need to do this. We can just manually delete the services." [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [16:04:49] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-staging/experimental: Fix wrongly-scoped limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093354 (owner: 10Klausman) [16:05:00] (03CR) 10Brouberol: "I think we do, as puppet is currently broken on the host:" [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [16:06:22] (03CR) 10Klausman: [V:03+1 C:03+2] ml-staging/experimental: Fix wrongly-scoped limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093354 (owner: 10Klausman) [16:07:49] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093372 (https://phabricator.wikimedia.org/T374215) (owner: 10Arnaudb) [16:08:02] (03CR) 10JHathaway: [C:03+1] wmflib: add facter data for lshw -class memory [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [16:09:59] (03Merged) 10jenkins-bot: ml-staging/experimental: Fix wrongly-scoped limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093354 (owner: 10Klausman) [16:10:12] !log jmm@cumin2002 END (PASS) 
- Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1017.eqiad.wmnet [16:10:41] (03CR) 10Btullis: [C:03+1] an-test-client1002: disable puppet management of airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [16:11:42] (03PS2) 10Brouberol: an-test-client1002: disable puppet management of airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) [16:12:11] (03PS3) 10Brouberol: an-test-client1002: disable puppet management of airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) [16:12:56] (03CR) 10Brouberol: "I've made it so that we can have an empty hash of airflow instances, which will sidestep the associated puppet resource creation." [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [16:14:49] (03CR) 10Brouberol: [C:03+2] an-test-client1002: disable puppet management of airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [16:15:49] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[16:17:55] (03CR) 10JHathaway: sre.hosts.{dhcp,reimage}: force tftp as default option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [16:17:58] (03CR) 10Effie Mouzeli: [C:03+2] Update various kafka-main connection strings for kafka-main1006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093337 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [16:19:33] (03Merged) 10jenkins-bot: Update various kafka-main connection strings for kafka-main1006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093337 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [16:19:36] (03PS1) 10Jcrespo: mediabackup: Setup backup1010 as the 6th media backup host in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1093377 (https://phabricator.wikimedia.org/T376892) [16:20:42] (03CR) 10Jcrespo: [C:04-2] "Do not merge- transfer is ongoing, and database and software package needs to be updated before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/1093377 (https://phabricator.wikimedia.org/T376892) (owner: 10Jcrespo) [16:21:03] (03PS1) 10Brouberol: hotfix: prevent puppet resource creation when no airflow instances are specified [puppet] - 10https://gerrit.wikimedia.org/r/1093378 (https://phabricator.wikimedia.org/T380284) [16:21:14] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [16:21:42] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [16:21:59] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [16:22:19] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [16:22:32] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:22:54] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [16:23:28] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:23:38] (03CR) 10Brouberol: [C:03+2] hotfix: prevent puppet resource creation when no airflow instances are specified [puppet] - 10https://gerrit.wikimedia.org/r/1093378 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [16:23:47] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [16:25:00] (03PS2) 10Jcrespo: mediabackup: Setup backup1010 as the 6th media backup host in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1093377 (https://phabricator.wikimedia.org/T376892) [16:25:19] (03PS1) 10Jcrespo: mediabackup: Setup backup2010 as the 6th media backup host in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093379 (https://phabricator.wikimedia.org/T376892) [16:25:33] (03CR) 10Jcrespo: [C:04-2] mediabackup: Setup backup2010 as the 6th media backup host in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093379 
(https://phabricator.wikimedia.org/T376892) (owner: 10Jcrespo) [16:25:51] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [16:26:01] (03PS2) 10Jcrespo: mediabackup: Setup backup2010 as the 6th media backup host in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093379 (https://phabricator.wikimedia.org/T376892) [16:26:07] (03CR) 10Jcrespo: "Do not merge- transfer is ongoing, and database and software package needs to be updated before merging." [puppet] - 10https://gerrit.wikimedia.org/r/1093379 (https://phabricator.wikimedia.org/T376892) (owner: 10Jcrespo) [16:27:19] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [16:27:20] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [16:28:37] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [16:28:54] RECOVERY - MariaDB Replica Lag: s1 on db1206 is OK: OK slave_sql_lag Replication lag: 55.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:30:55] FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [16:34:50] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:35:02] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [16:35:11] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[16:35:24] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [16:35:25] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [16:35:35] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [16:35:48] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:35:55] RESOLVED: MaxConntrack: Max conntrack at 98.77% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [16:35:57] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [16:36:38] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [16:37:25] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [16:37:48] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [16:38:10] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:38:11] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:38:42] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:48:16] idp.wikimedia.org is returning 503, anyone knows if there's an outage going? 
(two people known to have issues for now) [16:49:03] (03PS1) 10Effie Mouzeli: mw-debug: update prometheus-mcrouter-exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093382 (https://phabricator.wikimedia.org/T380212) [16:55:19] (03PS1) 10David Caro: toolforge:haproxy: monitor the https port, not the internal one [puppet] - 10https://gerrit.wikimedia.org/r/1093384 (https://phabricator.wikimedia.org/T348633) [16:55:57] (03CR) 10CI reject: [V:04-1] toolforge:haproxy: monitor the https port, not the internal one [puppet] - 10https://gerrit.wikimedia.org/r/1093384 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro) [16:56:38] (03PS2) 10David Caro: toolforge:haproxy: monitor the https port, not the internal one [puppet] - 10https://gerrit.wikimedia.org/r/1093384 (https://phabricator.wikimedia.org/T348633) [16:56:59] (03CR) 10Effie Mouzeli: [C:03+2] mw-debug: update prometheus-mcrouter-exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093382 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli) [16:58:28] 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#10340940 (10jijiki) p:05Triage→03Medium [16:59:42] (03Merged) 10jenkins-bot: mw-debug: update prometheus-mcrouter-exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093382 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli) [17:00:09] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:00:41] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:01:02] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:02:23] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:02:26] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:02:58] (03PS1) 10Reedy: UcfirstOverrides: Fix indenting of comment [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1093387 [17:03:24] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:04:06] !log restart tomcat on idp2004 [17:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:54] (03PS1) 10Máté Szabó: Configure instrument for the Incident Reporting System [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) [17:07:04] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:07:06] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:07:39] (03PS1) 10Effie Mouzeli: mw-debug: enable mcrouter container in pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093390 [17:08:06] (03CR) 10Ssingh: [C:03+2] wmflib: add facter data for lshw -class memory [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [17:09:41] (03CR) 10Ssingh: [C:03+2] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [17:12:46] (03CR) 10Effie Mouzeli: [C:03+2] mw-debug: enable mcrouter container in pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093390 (owner: 10Effie Mouzeli) [17:12:53] !log joal@deploy2002 Started deploy [analytics/refinery@295d5a4]: Regular analytics weekly train BIS [analytics/refinery@295d5a44] [17:14:01] (03Merged) 10jenkins-bot: mw-debug: enable mcrouter container in pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093390 (owner: 10Effie Mouzeli) [17:16:34] !log joal@deploy2002 Finished deploy [analytics/refinery@295d5a4]: 
Regular analytics weekly train BIS [analytics/refinery@295d5a44] (duration: 03m 41s) [17:16:42] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:17:01] !log joal@deploy2002 Started deploy [analytics/refinery@295d5a4] (thin): Regular analytics weekly train BIS THIN [analytics/refinery@295d5a44] [17:19:41] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:19:58] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:20:02] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:20:13] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:22:03] !log joal@deploy2002 Finished deploy [analytics/refinery@295d5a4] (thin): Regular analytics weekly train BIS THIN [analytics/refinery@295d5a44] (duration: 05m 02s) [17:22:58] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10341044 (10elukey) Quick note about the reimage step - due to a bug in Supermicro's BMC firmware (at least, this is what we suspect) the first reimage ru... [17:27:17] !jouncebot now [17:27:17] a Python reminder bot for deployments. 
see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [17:27:31] jouncebot: now [17:27:31] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [17:27:45] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:28:15] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:28:49] !log joal@deploy2002 Started deploy [analytics/refinery@295d5a4] (hadoop-test): Regular analytics weekly train BIS TEST [analytics/refinery@295d5a44] [17:28:54] (03CR) 10David Caro: [C:03+2] toolforge:haproxy: monitor the https port, not the internal one [puppet] - 10https://gerrit.wikimedia.org/r/1093384 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro) [17:32:25] !log joal@deploy2002 Finished deploy [analytics/refinery@295d5a4] (hadoop-test): Regular analytics weekly train BIS TEST [analytics/refinery@295d5a44] (duration: 03m 36s) [17:38:45] (03CR) 10Máté Szabó: [C:04-2] "DNM due to pending L3SC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [17:43:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) (owner: 10NMW03) [17:45:29] (03PS1) 10Btullis: Update spark shufflers on the test cluster to deploy version 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/1093394 (https://phabricator.wikimedia.org/T380040) [17:47:34] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1093394 (https://phabricator.wikimedia.org/T380040) (owner: 10Btullis) [17:49:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures 
(centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [17:50:13] (03PS1) 10David Caro: toolforge:haproxy: use the external name and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) [17:50:50] (03CR) 10CI reject: [V:04-1] toolforge:haproxy: use the external name and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro) [17:52:08] (03PS1) 10Joal: Bump gobblin-wmf jar to newest version [puppet] - 10https://gerrit.wikimedia.org/r/1093396 (https://phabricator.wikimedia.org/T376144) [17:52:16] (03PS2) 10David Caro: toolforge:haproxy: use the external name and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) [17:52:58] (03CR) 10CI reject: [V:04-1] toolforge:haproxy: use the external name and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro) [17:54:15] (03CR) 10Btullis: [C:03+2] Bump gobblin-wmf jar to newest version [puppet] - 10https://gerrit.wikimedia.org/r/1093396 (https://phabricator.wikimedia.org/T376144) (owner: 10Joal) [17:55:20] (03CR) 10BCornwall: [C:03+1] Failover analytics-hive to standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/1093355 (https://phabricator.wikimedia.org/T377938) (owner: 10Btullis) [17:55:50] (03CR) 10Btullis: [C:03+2] Failover analytics-hive to standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/1093355 (https://phabricator.wikimedia.org/T377938) (owner: 10Btullis) [17:59:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [17:59:25] (03PS3) 10David Caro: toolforge:haproxy: use the external name and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1800) [18:03:53] (03PS4) 10David Caro: toolforge:haproxy: use the external name and ip and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) [18:04:31] (03CR) 10CI reject: [V:04-1] toolforge:haproxy: use the external name and ip and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro) [18:05:07] (03PS5) 10David Caro: toolforge:haproxy: use the external name and ip and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) [18:06:25] (03PS6) 10David Caro: toolforge:haproxy: use the external name and ip and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) [18:07:12] (03PS1) 10C. 
Scott Ananian: Turn on Parsoid fragment support everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) [18:09:47] (03PS1) 10Greg Grossmeier: CSP for banner preview: allow remind me later SMS host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232) [18:09:54] (03CR) 10David Caro: [C:03+2] "I should add a test to this :/, tested in toolsbeta now" [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro) [18:26:57] (03CR) 10Herron: [C:03+2] "self-merging to get these VM builds started" [puppet] - 10https://gerrit.wikimedia.org/r/1092922 (https://phabricator.wikimedia.org/T378986) (owner: 10Herron) [18:30:28] (03PS1) 10C. Scott Ananian: Deploy Parsoid Read Views to de/ja/ru wikivoyage, incubatorwiki and dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) [18:50:25] (03PS1) 10Herron: site: fix site prefix typo in hostname [puppet] - 10https://gerrit.wikimedia.org/r/1093407 [18:51:34] (03CR) 10Herron: [C:03+2] site: fix site prefix typo in hostname [puppet] - 10https://gerrit.wikimedia.org/r/1093407 (owner: 10Herron) [18:52:25] 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10341512 (10lmata) Case 3622388 created! on splunk support added @andrea.denisse @colewhite and @herron as watchers in case support responds while i'm away. 
[18:52:44] (03PS1) 10Jdlrobson: Temporarily disable dark mode for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093408 (https://phabricator.wikimedia.org/T379765) [18:53:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093408 (https://phabricator.wikimedia.org/T379765) (owner: 10Jdlrobson) [18:58:44] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-ctrl2002.codfw.wmnet [18:58:45] !log herron@cumin1002 START - Cookbook sre.dns.netbox [18:59:17] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10341561 (10RobH) New work window confirmed by ascenty: Comentário gerado em Smart Hands: Hello, > We received the Ticket and sc... [19:00:05] andre and brennen: Time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1900). [19:00:32] nothing for this window. 
[19:03:16] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-ctrl2002.codfw.wmnet - herron@cumin1002" [19:03:20] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-ctrl2002.codfw.wmnet - herron@cumin1002" [19:03:20] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:03:20] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl2002.codfw.wmnet on all recursors [19:03:23] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl2002.codfw.wmnet on all recursors [19:03:49] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-ctrl2002.codfw.wmnet - herron@cumin1002" [19:03:53] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-ctrl2002.codfw.wmnet - herron@cumin1002" [19:04:33] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-ctrl2002.codfw.wmnet with OS bookworm [19:04:43] 06SRE, 10vm-requests, 07Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986#10341601 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-ctrl2002.codfw.wmnet with OS bookworm [19:08:28] !log bking@krb1001 add kerberos keytab for blunderbuss https://phabricator.wikimedia.org/P71106 T371994 [19:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:36] T371994: Deploy the HDFS synchronizer (blunderbuss) service to the dse-k8s cluster - https://phabricator.wikimedia.org/T371994 [19:12:15] !log bootstrapping 
cassandra, restbase2038-{a,b,c} — T380236 [19:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:21] T380236: Refresh restbase202[1-3] w/ restbase203[6-8] - https://phabricator.wikimedia.org/T380236 [19:14:35] (03CR) 10Wfan: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232) (owner: 10Greg Grossmeier) [19:17:55] !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl2002.codfw.wmnet with reason: host reimage [19:20:42] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl2002.codfw.wmnet with reason: host reimage [19:35:01] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-ctrl2002.codfw.wmnet with OS bookworm [19:35:02] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-ctrl2002.codfw.wmnet [19:35:08] 06SRE, 10vm-requests, 07Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986#10341673 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-ctrl2002.codfw.wmnet with OS bookworm completed: - aux-k8s-ctrl2002 (**PASS**) - R... 
[19:41:24] jouncebot nowandnext [19:41:24] For the next 1 hour(s) and 18 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1900) [19:41:24] In 1 hour(s) and 18 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T2100) [19:42:06] !log dancy@deploy2002 Installing scap version "4.126.0" for 209 hosts [19:47:17] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [19:47:32] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341702 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [19:51:08] !log dancy@deploy2002 Installing scap version "4.126.0" for 1 hosts [19:52:32] !log hashar@deploy2002 Started deploy [integration/docroot@1627206]: build: update mediawiki-codesniffer to 45.0.0 & prevent LibUp from removing a phpcs rule [19:52:43] !log hashar@deploy2002 Finished deploy [integration/docroot@1627206]: build: update mediawiki-codesniffer to 45.0.0 & prevent LibUp from removing a phpcs rule (duration: 00m 10s) [19:59:53] PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8699 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:03:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2041.codfw.wmnet with OS bookworm [20:03:40] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10341785 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm [20:04:53] 
PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8681 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:05:39] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bullseye [20:05:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341789 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex... [20:08:33] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [20:08:39] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341806 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [20:10:13] !log dancy@deploy2002 Installing scap version "4.126.0" for 1 hosts [20:11:50] (03PS1) 10Fabfur: benthos: WIP for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [20:13:13] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-ctrl2003.codfw.wmnet [20:13:15] !log herron@cumin1002 START - Cookbook sre.dns.netbox [20:13:53] PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8670 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:14:53] RECOVERY - Cassandra instance data free space on restbase2025 is OK: DISK OK - free space: /srv/cassandra/instance-data 14024 MB (33% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:26:48] 
!log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-ctrl2003.codfw.wmnet - herron@cumin1002" [20:28:02] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-ctrl2003.codfw.wmnet - herron@cumin1002" [20:28:02] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:28:02] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl2003.codfw.wmnet on all recursors [20:28:05] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl2003.codfw.wmnet on all recursors [20:28:32] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-ctrl2003.codfw.wmnet - herron@cumin1002" [20:28:36] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-ctrl2003.codfw.wmnet - herron@cumin1002" [20:30:01] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bullseye [20:30:25] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341887 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex... 
[20:30:28] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [20:30:42] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341889 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [20:32:47] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-ctrl2003.codfw.wmnet with OS bookworm [20:32:53] 06SRE, 10vm-requests, 07Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986#10341893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-ctrl2003.codfw.wmnet with OS bookworm [20:39:42] !log dancy@deploy2002 Installing scap version "4.126.0" for 1 hosts [20:40:38] !log dancy@deploy2002 Installation of scap version "4.126.0" completed for 1 hosts [20:44:29] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [20:47:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2041.codfw.wmnet with OS bookworm [20:47:33] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10341929 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm executed... 
[20:48:00] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [20:48:02] !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl2003.codfw.wmnet with reason: host reimage [20:48:05] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [20:48:42] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [20:49:07] o/ [20:49:41] jouncebot: next [20:49:41] In 0 hour(s) and 10 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T2100) [20:49:49] * anzx 👋 [20:51:41] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl2003.codfw.wmnet with reason: host reimage [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T2100). [21:00:05] albertoleoncio, arlolra, Nemoralis, anzx, and bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:10] o/ [21:00:20] Hi! [21:00:21] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bullseye [21:00:26] o/ [21:00:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341947 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex... [21:01:11] quick question, my patch depends on another patch (on the WikimediaMessages repo). Should I add that to the deployment window? 
[21:01:22] https://phabricator.wikimedia.org/T379317 [21:03:39] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [21:03:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [21:05:08] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-ctrl2003.codfw.wmnet with OS bookworm [21:05:08] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-ctrl2003.codfw.wmnet [21:05:14] 06SRE, 10vm-requests, 07Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986#10341979 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-ctrl2003.codfw.wmnet with OS bookworm completed: - aux-k8s-ctrl2003 (**PASS**) - R... [21:05:18] Nemoralis: I think you need to ask someone to merge the messages first, wait about a week, and then deploy the config change [21:05:42] But I'm not really sure [21:06:00] !log dancy@deploy2002 Installing scap version "4.124.0" for 209 hosts [21:07:01] i'm here for the arlolra patch (arlo will probably also be around) [21:08:29] hi - pardon lateness - i can deploy if needed [21:08:47] !log dancy@deploy2002 Installing scap version "4.124.0" for 209 hosts [21:08:56] Thanks! [21:10:09] (03PS3) 10Albertoleoncio: [ptwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) [21:10:15] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10342002 (10phaultfinder) [21:10:56] cscott: presumably the 2 backports can go out together? 
[21:11:31] cjming: yes, arlo/i are backporting a mediawiki-vendor patch to bump parsoid. i believe that's done by putting the two patches together on the same `scap backport` command [21:12:07] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) (owner: 10Albertoleoncio) [21:12:21] (03CR) 10TrainBranchBot: "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) (owner: 10Albertoleoncio) [21:12:53] PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8673 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:13:25] (03Merged) 10jenkins-bot: [ptwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) (owner: 10Albertoleoncio) [21:13:54] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1091810|[ptwiki] Enable the CampaignEvents extension (T380090)]] [21:13:58] T380090: Enable CampaignEvents Extension on ptwiki - https://phabricator.wikimedia.org/T380090 [21:14:13] albertoleoncio: on test servers if you'd like to verify [21:14:18] lmk if/when to sync [21:14:29] Let me check [21:14:44] oh whoops - sorry - just a sec [21:14:57] hello! sorry im late , but im ready to deploy [21:15:27] albertoleoncio: hold on a sec - should be ready soon [21:15:45] ok [21:15:53] * cjming waves to bwang [21:16:39] bwang: no worries - i'm running thru the queue in order [21:19:20] cjming: Working now! [21:19:30] really? 
[21:19:37] Yep [21:19:54] !log cjming@deploy2002 cjming, albertoleoncio: Backport for [[gerrit:1091810|[ptwiki] Enable the CampaignEvents extension (T380090)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:19:55] cool - so i will sync! [21:19:58] T380090: Enable CampaignEvents Extension on ptwiki - https://phabricator.wikimedia.org/T380090 [21:20:13] !log cjming@deploy2002 cjming, albertoleoncio: Continuing with sync [21:20:14] On k8s-mwdebug, I mean =D [21:23:44] Nemoralis: regarding your dependent patch - i'm not sure either - hopefully someone here can confirm [21:28:58] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091810|[ptwiki] Enable the CampaignEvents extension (T380090)]] (duration: 15m 04s) [21:29:02] T380090: Enable CampaignEvents Extension on ptwiki - https://phabricator.wikimedia.org/T380090 [21:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10342045 (10phaultfinder) [21:30:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [vendor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093358 (https://phabricator.wikimedia.org/T373776) (owner: 10Arlolra) [21:30:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093359 (https://phabricator.wikimedia.org/T380333) (owner: 10Arlolra) [21:30:14] cjming: I mean, we can deploy it but only the message names (keys) will be visible until the messages are synchronized [21:30:22] unless we backport messages too lol [21:30:51] Nemoralis: do you want to add the messages patch to the queue? [21:31:06] albertoleoncio: should be live! 
[21:31:06] I can if it is possible [21:31:30] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bullseye [21:31:34] np - i'm doing cscott's/arlo's backports now - we can do yours after [21:31:40] cjming: Its live already, since some minutes ago :-) [21:31:41] thanks! [21:31:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342060 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex... [21:32:12] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [21:32:13] cjming: I am adding then [21:32:20] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342063 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [21:32:21] sure - np [21:33:39] done [21:34:24] cool [21:38:00] just lmk when its my turn :) [21:38:17] will do! [21:40:28] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:41:09] oof - cscott your patches are going to take a while to merge -- sorry everyone - i should have merged them at the top of the hour [21:41:55] no worries i'm here [21:42:53] RECOVERY - Cassandra instance data free space on restbase2025 is OK: DISK OK - free space: /srv/cassandra/instance-data 13401 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:43:40] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage [21:44:16] is it because it is dependency update? 
[21:46:12] not sure what you're asking but the vendor/core backports merge estimates are 20+ minutes (shorter now) [21:46:37] . [21:47:04] once they're thru tho, the rest of the config patches in the queue should be zippy [21:47:30] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage [21:50:38] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:52:49] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:55:37] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.21.0-a7 [vendor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093358 (https://phabricator.wikimedia.org/T373776) (owner: 10Arlolra) [21:55:54] boom [21:56:00] lol [21:56:50] the other one still has a bit to go [21:57:53] fwiw i'm happy to go long and do the rest of the patches if no one needs the window after this [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T2200) [22:00:53] UTC late backport window is running a little over - is that ok? 
[22:01:05] no problem for me [22:01:05] ok with me [22:02:30] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.21.0-a7 [core] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093359 (https://phabricator.wikimedia.org/T380333) (owner: 10Arlolra) [22:02:31] Abstract Wikipedia team - lmk if you are waiting -- otherwise i'll continue onward [22:02:38] finally [22:02:59] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [22:03:02] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1093358|Bump wikimedia/parsoid to 0.21.0-a7 (T373776 T380333)]], [[gerrit:1093359|Bump wikimedia/parsoid to 0.21.0-a7 (T380333)]] [22:03:08] T373776: Parsoid does not correctly render if used with templates - https://phabricator.wikimedia.org/T373776 [22:03:09] T380333: CTT midweek deploy - https://phabricator.wikimedia.org/T380333 [22:03:48] cjming: the patch that bwang is deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1093408 is pretty important to go out today - dark mode for anonymous users is rendering the wrong colors in certain places which is important we fix. 
I was in meetings but can also take over for bwang if he needs to run [22:04:34] Jdlrobson: sounds good [22:05:54] thanks cjming :) [22:06:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:08:47] !log cjming@deploy2002 arlolra, cjming: Backport for [[gerrit:1093358|Bump wikimedia/parsoid to 0.21.0-a7 (T373776 T380333)]], [[gerrit:1093359|Bump wikimedia/parsoid to 0.21.0-a7 (T380333)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:08:49] arlolra, cscott - on mwdebug - lmk when i can sync [22:08:51] T373776: Parsoid does not correctly render if used with templates - https://phabricator.wikimedia.org/T373776 [22:08:52] T380333: CTT midweek deploy - https://phabricator.wikimedia.org/T380333 [22:09:01] ok, thanks, testing [22:09:34] !log jhathaway@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhathaway@cumin2002" [22:11:20] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhathaway@cumin2002" [22:11:21] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2005.codfw.wmnet with OS bullseye [22:11:28] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye co... 
[22:11:55] PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8702 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [22:12:16] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [22:12:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [22:13:02] cjming: ok, lgtm [22:13:07] oh good [22:13:11] !log cjming@deploy2002 arlolra, cjming: Continuing with sync [22:14:00] (03PS3) 10NMW03: Add contact form for U4C [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) [22:16:15] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:16:24] im still here as well to help test [22:17:22] great - i'm just going to keep plowing thru the queue - should be relatively quick [22:17:29] cscott: could you review https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/1093422?forceReload=true should this be removed after namespace is live on wiki https://github.com/wikimedia/operations-mediawiki-config/blob/ebe7b5ea3a09cd6e334dda4128df5c7e9f45e2b3/wmf-config/core-Namespaces.php#L2906 [22:18:45] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:18:49] FIRING: HelmReleaseBadStatus: Helm release blunderbuss/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=blunderbuss - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:20:14] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093358|Bump wikimedia/parsoid to 0.21.0-a7 (T373776 T380333)]], [[gerrit:1093359|Bump wikimedia/parsoid to 0.21.0-a7 (T380333)]] (duration: 17m 11s) [22:20:19] T373776: Parsoid does not correctly render if used with templates - https://phabricator.wikimedia.org/T373776 [22:20:19] T380333: CTT midweek deploy - https://phabricator.wikimedia.org/T380333 [22:20:55] PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8715 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [22:21:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) (owner: 10NMW03) [22:21:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:22:13] thanks cjming [22:22:24] (03Merged) 10jenkins-bot: Add contact form for U4C [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) (owner: 10NMW03) [22:22:36] arlolra: yw! glad it worked out [22:22:51] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1091868|Add contact form for U4C (T379317)]] [22:22:51] cjming: don't forget the strings lol [22:23:01] anzx: yes, i think the mediawiki-config clause can be deleted once scribunto is defining the namespace itself. 
[22:23:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [22:23:05] T379317: Contact form requested - U4C - https://phabricator.wikimedia.org/T379317 [22:23:26] Nemoralis: i think your other patch just needs a merge -- and then if you want to backport, you need to set those patches up [22:23:49] RESOLVED: HelmReleaseBadStatus: Helm release blunderbuss/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=blunderbuss - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:24:16] cscott: thanks [22:24:53] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage [22:25:08] cjming: really? I don't think so. messages are usually updated during the mediawiki train. I don't think it will work with just merging [22:25:09] Nemoralis: i'm going to go ahead and sync your config patch and move on -- lmk if you need backports for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1091869 [22:25:42] if you do ^^, please create those patches and i can do those when they're ready [22:25:42] I think I do, otherwise contact form will display message keys instead of its actual content [22:25:58] what patches? [22:26:06] That config patch shouldn't have been merged without the strings being merged... [22:26:22] as it's completely useless standalone [22:26:38] yes, that's what I was asking at first [22:27:03] oh whoops - can we just merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1091869? and do backports for 1.44.0-wmf.4 and 3? 
[22:27:36] Well, no one has even reviewed the master patch yet [22:27:37] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage [22:28:08] !log cjming@deploy2002 nmw03, cjming: Backport for [[gerrit:1091868|Add contact form for U4C (T379317)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:28:13] T379317: Contact form requested - U4C - https://phabricator.wikimedia.org/T379317 [22:28:51] ok - then i'm going to not sync the config patch [22:29:28] Nemoralis: if you can get your master patch merged and backports set up, i'm happy to do those after i finish up the rest of the queue [22:29:38] contact form works fine btw, we just need strings https://i.imgur.com/BrmouqZ.png [22:29:49] cjming: do you mean strings by master patch? [22:31:23] Nemoralis: master patch is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1091869 << this needs merging and then if you want 1.44.0-wmf.4 and 1.44.0-wmf.3 to have the changes, we need those patches too [22:31:35] alright [22:31:37] !log cjming@deploy2002 Sync cancelled. [22:32:45] (PS1) TrainBranchBot: Revert "Add contact form for U4C" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093446 [22:32:46] (CR) TrainBranchBot: "cjming@deploy2002 created a revert of this change as I91382012f2d2d4cc23f4c8f6699d7bffc0be2462" [mediawiki-config] - https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) (owner: NMW03) [22:33:24] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093446 (owner: TrainBranchBot) [22:33:26] anzx: are you still around? 
[22:33:39] cjming: yes [22:34:23] cool - will do yours now, then bwang's, then Nemoralis' if they're ready [22:34:26] (Merged) jenkins-bot: Revert "Add contact form for U4C" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093446 (owner: TrainBranchBot) [22:34:44] (PS3) Anzx: knwiki: update portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1093328 (https://phabricator.wikimedia.org/T380366) [22:34:57] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1093446|Revert "Add contact form for U4C"]] [22:35:53] I am not sure yet, let me check if I can find someone from the LPL [22:37:06] Nemoralis: apologies if i led you astray [22:37:16] no worries :) [22:37:55] RECOVERY - Cassandra instance data free space on restbase2025 is OK: DISK OK - free space: /srv/cassandra/instance-data 14918 MB (35% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [22:39:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:39:20] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [22:40:37] !log cjming@deploy2002 trainbranchbot, cjming: Backport for [[gerrit:1093446|Revert "Add contact form for U4C"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:40:42] !log cjming@deploy2002 trainbranchbot, cjming: Continuing with sync [22:41:19] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:41:55] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [22:49:15] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2005.codfw.wmnet with OS bullseye [22:49:20] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093446|Revert "Add contact form for U4C"]] (duration: 14m 22s) [22:49:53] (CR) TrainBranchBot: [C:+2] "Approved by 
cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093328 (https://phabricator.wikimedia.org/T380366) (owner: Anzx) [22:50:21] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342292 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye co... [22:50:35] (Merged) jenkins-bot: knwiki: update portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1093328 (https://phabricator.wikimedia.org/T380366) (owner: Anzx) [22:51:00] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1093328|knwiki: update portal namespace (T380366)]] [22:51:04] T380366: knwiki: update portal namespace - https://phabricator.wikimedia.org/T380366 [22:52:29] (PS6) Bking: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) [22:52:58] !log Import libvmod-querysort 0.4-3 into varnish-staging apt component [22:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:09] (PS2) Jdlrobson: Temporarily disable dark mode for anonymous users [mediawiki-config] - https://gerrit.wikimedia.org/r/1093408 (https://phabricator.wikimedia.org/T379765) [22:55:25] !log cjming@deploy2002 cjming, anzx: Backport for [[gerrit:1093328|knwiki: update portal namespace (T380366)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:55:28] cjming: checking [22:55:36] anzx: ty [22:55:41] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342303 (jhathaway) @elukey thanos-be2005 is now re-imaging without any user intervention. It wasn't quite as easy as just running the re-image script... 
[22:55:55] FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [22:56:22] cjming: looks good [22:56:25] !log cjming@deploy2002 cjming, anzx: Continuing with sync [22:56:41] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:57:41] RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 74 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [23:00:56] RESOLVED: MaxConntrack: Max conntrack at 91.24% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [23:03:18] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093328|knwiki: update portal namespace (T380366)]] (duration: 12m 17s) [23:03:22] T380366: knwiki: update portal namespace - https://phabricator.wikimedia.org/T380366 [23:03:24] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093408 (https://phabricator.wikimedia.org/T379765) (owner: Jdlrobson) [23:03:41] cjming: thank you for deploy [23:03:50] @jdlrobson, can you take over the deploy validation? [23:03:55] anzx: yw! 
[23:04:06] (Merged) jenkins-bot: Temporarily disable dark mode for anonymous users [mediawiki-config] - https://gerrit.wikimedia.org/r/1093408 (https://phabricator.wikimedia.org/T379765) (owner: Jdlrobson) [23:04:36] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1093408|Temporarily disable dark mode for anonymous users (T379765)]] [23:04:40] T379765: Nov 19: Vector 2022 Deployments - https://phabricator.wikimedia.org/T379765 [23:04:49] bwang: sorry it's taken so long - should be verifiable soon on test servers - probably in a minute or 2 cc Jdlrobson [23:04:58] sounds good i can stay on then [23:05:31] 👍 [23:08:32] !log cjming@deploy2002 jdlrobson, cjming: Backport for [[gerrit:1093408|Temporarily disable dark mode for anonymous users (T379765)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:08:35] bwang: please check on test servers and lmk when to sync [23:09:58] which server? [23:10:14] mwdebug [23:10:30] do you have the browser extension? [23:10:41] ok i got it! thank you [23:10:43] it looks good [23:10:53] awesome [23:10:57] syncing [23:10:59] !log cjming@deploy2002 jdlrobson, cjming: Continuing with sync [23:11:01] cjming: yep lgtm too [23:11:15] should be live shortly [23:14:09] Nemoralis: looks like the master patch is still being reviewed - is it safe to say it's not going to be ready? [23:16:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:17:29] i can stay on for a bit longer if you think it (and backports) will be ready -- otherwise i might suggest getting the backports set up for the next available window... 
i am also unclear if there needs to be time for the strings to propagate before the config is merged [23:17:42] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093408|Temporarily disable dark mode for anonymous users (T379765)]] (duration: 13m 06s) [23:17:47] T379765: Nov 19: Vector 2022 Deployments - https://phabricator.wikimedia.org/T379765 [23:19:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:22:02] !log end of UTC late backport window [23:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:55] thanks cjming [23:49:47] PROBLEM - Disk space on Hadoop worker on an-worker1143 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration