[00:03:44] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1052444 (owner: 10TrainBranchBot) [00:25:33] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:25:33] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:26:17] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:26:17] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:33:33] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:42:33] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:55:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T367856)', diff saved to https://phabricator.wikimedia.org/P65912 and previous config saved to /var/cache/conftool/dbconfig/20240708-005501-marostegui.json [00:55:05] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [01:10:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P65913 and previous config saved to /var/cache/conftool/dbconfig/20240708-011008-marostegui.json [01:25:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P65914 and previous config saved to 
/var/cache/conftool/dbconfig/20240708-012515-marostegui.json [01:40:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T367856)', diff saved to https://phabricator.wikimedia.org/P65915 and previous config saved to /var/cache/conftool/dbconfig/20240708-014022-marostegui.json [01:40:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [01:40:26] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [01:40:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [01:40:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T367856)', diff saved to https://phabricator.wikimedia.org/P65916 and previous config saved to /var/cache/conftool/dbconfig/20240708-014044-marostegui.json [01:48:05] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:49:44] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [02:00:35] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:17] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:49:17] FIRING: [2x] SystemdUnitFailed: 
generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:59:17] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:59:21] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:41] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:44:17] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:11:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [04:37:39] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1200 (T367856)', diff saved to https://phabricator.wikimedia.org/P65917 and previous config saved to /var/cache/conftool/dbconfig/20240708-043738-marostegui.json [04:37:42] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:52:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P65918 and previous config saved to /var/cache/conftool/dbconfig/20240708-045246-marostegui.json [05:02:10] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2123 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1052446 (https://phabricator.wikimedia.org/T369478) [05:02:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T369478 [05:02:58] T369478: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T369478 [05:03:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2123 with weight 0 T369478', diff saved to https://phabricator.wikimedia.org/P65919 and previous config saved to /var/cache/conftool/dbconfig/20240708-050301-root.json [05:03:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T369478 [05:04:13] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2123 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1052446 (https://phabricator.wikimedia.org/T369478) (owner: 10Gerrit maintenance bot) [05:05:35] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:15:37] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration 
changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [05:16:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2123 from dump/slow', diff saved to https://phabricator.wikimedia.org/P65920 and previous config saved to /var/cache/conftool/dbconfig/20240708-051605-marostegui.json [05:16:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P65921 and previous config saved to /var/cache/conftool/dbconfig/20240708-051615-marostegui.json [05:18:13] !log Starting s5 codfw failover from db2213 to db2123 - T369478 [05:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:16] T369478: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T369478 [05:18:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2123 to s5 primary T369478', diff saved to https://phabricator.wikimedia.org/P65922 and previous config saved to /var/cache/conftool/dbconfig/20240708-051840-root.json [05:19:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2213 T369478', diff saved to https://phabricator.wikimedia.org/P65923 and previous config saved to /var/cache/conftool/dbconfig/20240708-051935-root.json [05:20:39] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [05:24:29] (03PS1) 10Marostegui: db2213: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1052447 [05:24:35] !log Deploy schema change on s5 codfw db2213 dbmaint T367856 [05:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:38] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [05:24:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Long 
schema change [05:24:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Long schema change [05:25:05] (03CR) 10Marostegui: [C:03+2] db2213: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1052447 (owner: 10Marostegui) [05:29:49] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 28306 [05:30:06] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28306 [05:31:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T367856)', diff saved to https://phabricator.wikimedia.org/P65925 and previous config saved to /var/cache/conftool/dbconfig/20240708-053122-marostegui.json [05:31:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [05:31:25] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [05:31:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [05:31:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T367856)', diff saved to https://phabricator.wikimedia.org/P65926 and previous config saved to /var/cache/conftool/dbconfig/20240708-053133-marostegui.json [05:32:20] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 6447 [05:33:38] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6447 [05:33:49] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 132167 [05:34:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 132167 [05:34:42] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 4788 
[05:35:48] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4788 [05:36:00] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 999 [05:36:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 999 [05:36:18] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 28352 [05:36:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28352 [05:36:46] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61672 [05:37:00] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61672 [05:37:11] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 268248 [05:37:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268248 [05:37:54] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 18013 [05:38:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 18013 [05:38:24] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61942 [05:38:38] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61942 [05:38:52] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 263522 [05:39:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263522 [05:39:10] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 17072 [05:39:42] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 17072 [05:48:05] FIRING: PuppetFailure: Puppet has 
failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:49:44] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [05:59:21] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 28008 [05:59:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28008 [06:01:47] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 270052 [06:02:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270052 [06:03:18] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 52468 [06:04:17] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:04:42] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52468 [06:04:46] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 7738 [06:05:23] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7738 [06:05:42] !log ayounsi@cumin1002 START - Cookbook 
sre.network.peering with action 'email' for AS: 52320 [06:06:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52320 [06:08:27] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 269783 [06:08:37] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 269783 [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:28] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61512 [06:11:25] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61512 [06:13:21] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 27768 [06:13:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:13:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 27768 [06:14:00] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:14:02] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 137409 [06:15:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 137409 [06:16:13] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 52455 [06:16:48] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.905 
second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:16:52] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52338 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:17:39] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52455 [06:28:04] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 13Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794#9959652 (10Joe) [07:00:05] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:01:47] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 270052 [07:02:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 270052 [07:06:41] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:12:30] (03PS1) 10Slyngshede: Template: Fix missing success styling on logout. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1052580 [07:16:31] (03CR) 10Hashar: [C:04-1] gerrit: Add if statement for reason in PatchSetAbandoned (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1051134 (owner: 10Paladox) [07:18:13] (03CR) 10Slyngshede: [V:03+2 C:03+1] Template: Fix missing success styling on logout. 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1052580 (owner: 10Slyngshede) [07:18:19] (03CR) 10Slyngshede: [V:03+2 C:03+2] Template: Fix missing success styling on logout. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1052580 (owner: 10Slyngshede) [07:36:36] (03PS2) 10Hashar: gerrit: enable built-in log rotation [puppet] - 10https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) [07:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:38:56] (03CR) 10Hashar: [C:03+1] "We have successfully upgraded to Gerrit 3.10 and can now configure it to handle the logrotation instead of using a home made systemd timer" [puppet] - 10https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: 10Hashar) [07:47:48] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:47:48] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:48:12] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:48:22] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:02:39] (03CR) 10Arnaudb: [C:03+2] mariadb: recording rules to monitor [puppet] - 10https://gerrit.wikimedia.org/r/1050376 (https://phabricator.wikimedia.org/T367283) (owner: 10Arnaudb) [08:16:58] PROBLEM - Disk space on mw1446 is CRITICAL: DISK CRITICAL - free space: / 1445 MB (0% inode=99%): /tmp 1445 MB (0% inode=99%): /var/tmp 1445 MB (0% inode=99%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops [08:26:56] !log re-enable business hours americas oncall - T369122 [08:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:59] T369122: On-call batphone escalation configuration holidays FY2024/25 - https://phabricator.wikimedia.org/T369122 [08:27:40] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: add new component thirdparty/kubeadm-k8s-1-25 [puppet] - 10https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163) [08:29:21] (03Abandoned) 10Arturo Borrero Gonzalez: aptrepro: enable thirdparty/kubeadm-k8s-1-24 for buster and bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1010906 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [08:31:07] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: add new component thirdparty/kubeadm-k8s-1-25 [puppet] - 10https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163) [08:31:52] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163) (owner: 10Arturo Borrero Gonzalez) [08:33:07] (03PS3) 10Arturo Borrero Gonzalez: aptrepo: add new component thirdparty/kubeadm-k8s-1-25 [puppet] - 10https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163) [08:33:15] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163) (owner: 10Arturo Borrero Gonzalez) [08:35:26] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:35:28] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:35:50] PROBLEM - 
OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:35:50] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:38:11] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] aptrepo: add new component thirdparty/kubeadm-k8s-1-25 [puppet] - 10https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163) (owner: 10Arturo Borrero Gonzalez) [08:42:21] !log update packages for thirdparty/kubeadm-k8s-1-25 bookworm-wikimedia in apt1002 (T369163) [08:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:24] 06SRE, 10LDAP-Access-Requests: Grant access to wmf to lferreira - https://phabricator.wikimedia.org/T369348#9959948 (10Lferreira) @Aklapper Done! [08:42:27] T369163: toolforge: prepare deb packages for k8s 1.25 - https://phabricator.wikimedia.org/T369163 [08:44:48] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9959951 (10elukey) Folks today I found snapshot1017 with puppet disable for mo... [08:46:19] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9959956 (10Marostegui) I don't think we should be leaving a host with puppet d... 
[08:47:39] (03PS1) 10Marostegui: installserver: Allow pc2017 reimage [puppet] - 10https://gerrit.wikimedia.org/r/1052671 (https://phabricator.wikimedia.org/T368919) [08:49:10] (03Abandoned) 10JMeybohm: Remove kubetcd200[4-6] from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1034447 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [08:50:27] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [08:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:43] !log Running `foreachwikiindblist group1.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200` in a tmux session [08:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:52] (03CR) 10Marostegui: [C:03+2] installserver: Allow pc2017 reimage [puppet] - 10https://gerrit.wikimedia.org/r/1052671 (https://phabricator.wikimedia.org/T368919) (owner: 10Marostegui) [08:56:18] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [08:57:30] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:57:32] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:57:52] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:57:52] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:59:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [09:06:57] jouncebot: nowandnext [09:06:57] No deployments scheduled for the next 0 
hour(s) and 53 minute(s) [09:06:57] In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1000) [09:07:04] (03PS2) 10Elukey: redfish: add the add_account function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) [09:07:31] (03CR) 10Elukey: "Still testing if the code works on Dells" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:08:23] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: introduce component thirdparty/k9s for bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1052677 (https://phabricator.wikimedia.org/T366061) [09:10:50] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] aptrepo: introduce component thirdparty/k9s for bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1052677 (https://phabricator.wikimedia.org/T366061) (owner: 10Arturo Borrero Gonzalez) [09:14:44] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#9959990 (10MatthewVernon) p:05Unbreak!→03Medium [09:16:18] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#9959995 (10MatthewVernon) [this is likely related to T360913] [09:17:31] !log aborrero@apt1002:~$ sudo -i reprepro --component thirdparty/k9s includedeb bookworm-wikimedia /home/aborrero/k9s_linux_amd64.deb (T366061) [09:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:33] T366061: [infra,k8s] package k9s for use in kubernetes - https://phabricator.wikimedia.org/T366061 [09:18:02] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: control: install k9s [puppet] - 10https://gerrit.wikimedia.org/r/1052678 (https://phabricator.wikimedia.org/T366061) [09:21:12] (03PS2) 
10Hnowlan: shellbox-video: increase replicas, namespace resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050375 (https://phabricator.wikimedia.org/T356241) [09:21:37] (03CR) 10Volans: "Nice addition! One main comment inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:22:40] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: k8s: control: install k9s [puppet] - 10https://gerrit.wikimedia.org/r/1052678 (https://phabricator.wikimedia.org/T366061) (owner: 10Arturo Borrero Gonzalez) [09:23:17] (03CR) 10Elukey: [C:03+2] services: lower mesh's envoy concurrency to 8 for Wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052262 (https://phabricator.wikimedia.org/T368238) (owner: 10Elukey) [09:23:24] (03CR) 10Elukey: [C:03+2] services: upgrade mesh's envoy Docker version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [09:23:30] (03PS2) 10Elukey: services: upgrade mesh's envoy Docker version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) [09:23:30] (03CR) 10CI reject: [V:04-1] services: upgrade mesh's envoy Docker version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [09:23:39] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [09:24:44] (03Merged) 10jenkins-bot: services: upgrade mesh's envoy Docker version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [09:26:32] PROBLEM - Disk space on mw1445 is CRITICAL: DISK CRITICAL - free space: / 11869 MB (2% inode=99%): /tmp 11869 MB (2% inode=99%): /var/tmp 11869 MB (2% inode=99%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops [09:31:36] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: sync [09:31:45] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: sync [09:32:32] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: sync [09:32:59] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: sync [09:36:23] (03CR) 10Btullis: [C:03+2] cephcsi: Grant the provisioner access to the ceph userID secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052341 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:38:34] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync [09:38:58] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync [09:39:53] (03Merged) 10jenkins-bot: cephcsi: Grant the provisioner access to the ceph userID secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052341 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:41:34] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [09:41:38] (03CR) 10Jelto: [C:03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1051208 (owner: 10Ahmon Dancy) [09:41:52] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [09:42:58] (03CR) 10Lucas Werkmeister (WMDE): "AFAICT this only implements half of T368632 – what about the Wikiproiektu namespace?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) (owner: 10GergesShamon) [09:44:34] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: sync [09:44:42] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: sync [09:46:01] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: 10Hashar) [09:48:05] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:49:34] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [09:49:44] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:49:51] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [09:50:17] (03PS1) 10Filippo Giunchedi: clinic-duty: update equinix parsing [software] - 10https://gerrit.wikimedia.org/r/1052688 [09:50:27] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: sync [09:50:33] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: sync [09:55:06] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [09:58:15] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [09:58:34] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.76 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1000) [10:00:20] !log btullis@deploy1002 
helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:00:51] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:02:49] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [10:04:22] (03PS2) 10Filippo Giunchedi: clinic-duty: update equinix parsing [software] - 10https://gerrit.wikimedia.org/r/1052688 [10:05:41] (03CR) 10Filippo Giunchedi: [C:03+2] clinic-duty: update equinix parsing [software] - 10https://gerrit.wikimedia.org/r/1052688 (owner: 10Filippo Giunchedi) [10:06:34] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [10:06:38] (03PS1) 10Elukey: role::deployment_server::kubernetes: update Envoy's version [puppet] - 10https://gerrit.wikimedia.org/r/1052691 (https://phabricator.wikimedia.org/T368366) [10:08:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T367856)', diff saved to https://phabricator.wikimedia.org/P65927 and previous config saved to /var/cache/conftool/dbconfig/20240708-100804-marostegui.json [10:08:07] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:10:35] (03CR) 10Elukey: "My plan is to send an email to ops@ announcing the diff, so people will be able to rollout the new envoy version during next deployments (" [puppet] - 10https://gerrit.wikimedia.org/r/1052691 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:15:34] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 20.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P65928 and previous config saved to /var/cache/conftool/dbconfig/20240708-102311-marostegui.json [10:23:40] (03PS1) 10Marostegui: Revert "db2213: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/1052693 [10:23:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65929 and previous config saved to /var/cache/conftool/dbconfig/20240708-102347-root.json [10:24:06] (03CR) 10Marostegui: [C:03+2] Revert "db2213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1052693 (owner: 10Marostegui) [10:26:07] (03PS2) 10GergesShamon: [euwiki] Enable Visual Editor in namespace Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) [10:26:26] FIRING: [3x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:26:45] (03CR) 10CI reject: [V:04-1] [euwiki] Enable Visual Editor in namespace Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) (owner: 10GergesShamon) [10:27:44] (03PS3) 10GergesShamon: [euwiki] Enable Visual Editor in namespace Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) [10:29:09] (03PS4) 10GergesShamon: [euwiki] Enable Visual Editor in namespaces Project and Wikiproiektu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) [10:31:09] (03PS4) 10Paladox: gerrit: Add if statement for reason in PatchSetAbandoned [puppet] - 10https://gerrit.wikimedia.org/r/1051134 [10:31:50] (03CR) 10Paladox: gerrit: Add if statement for reason in PatchSetAbandoned (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1051134 (owner: 10Paladox) [10:32:29] (03PS1) 10Btullis: Disable monitoring on clouddb1021 prior to decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052696 
(https://phabricator.wikimedia.org/T368518) [10:33:10] (03PS2) 10Btullis: Disable monitoring on clouddb1021 prior to decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052696 (https://phabricator.wikimedia.org/T368518) [10:33:56] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3172/console" [puppet] - 10https://gerrit.wikimedia.org/r/1052696 (https://phabricator.wikimedia.org/T368518) (owner: 10Btullis) [10:35:11] (03CR) 10Marostegui: [C:03+1] Disable monitoring on clouddb1021 prior to decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052696 (https://phabricator.wikimedia.org/T368518) (owner: 10Btullis) [10:38:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P65930 and previous config saved to /var/cache/conftool/dbconfig/20240708-103818-marostegui.json [10:38:37] (03PS1) 10Lucas Werkmeister (WMDE): Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) [10:38:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65931 and previous config saved to /var/cache/conftool/dbconfig/20240708-103854-root.json [10:39:36] (03PS1) 10JMeybohm: aux: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052700 (https://phabricator.wikimedia.org/T362978) [10:39:40] (03PS1) 10JMeybohm: dse: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) [10:39:44] (03PS1) 10JMeybohm: ml: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) [10:40:39] (03CR) 
10Btullis: [V:03+1 C:03+2] Disable monitoring on clouddb1021 prior to decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052696 (https://phabricator.wikimedia.org/T368518) (owner: 10Btullis) [10:40:49] (03CR) 10JMeybohm: "I'm not sure this is completely correct as the config structure differs from what we use on wikikube (and CNI is configured). So please do" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [10:40:56] (03CR) 10JMeybohm: "I'm not sure this is completely correct as the config structure differs from what we use on wikikube (and CNI is configured). So please do" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [10:41:49] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 272432 [10:42:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 272432 [10:42:36] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 262476 [10:42:52] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262476 [10:42:53] (03PS2) 10JMeybohm: dse: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) [10:43:00] (03PS2) 10JMeybohm: ml: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) [10:43:28] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 268248 [10:43:38] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268248 [10:43:41] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 270359 [10:43:52] !log ayounsi@cumin1002 END 
(PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270359 [10:45:01] !log rebooting A:cp-esams (T366555) [10:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:12] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_esams [10:45:13] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_esams [10:52:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [10:53:17] fabfur: expected? [10:53:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T367856)', diff saved to https://phabricator.wikimedia.org/P65932 and previous config saved to /var/cache/conftool/dbconfig/20240708-105325-marostegui.json [10:53:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [10:53:29] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:53:40] !incidents [10:53:40] 4840 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [10:53:40] 4839 (RESOLVED) ATSBackendErrorsHigh cache_text sre (phabricator.discovery.wmnet eqiad) [10:53:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [10:53:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T367856)', diff saved to https://phabricator.wikimedia.org/P65933 and previous config saved to /var/cache/conftool/dbconfig/20240708-105348-marostegui.json [10:53:49] !ack 4840 [10:53:49] 4840 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [10:53:52] it looks like the availability is 
recovering again [10:54:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65934 and previous config saved to /var/cache/conftool/dbconfig/20240708-105400-root.json [10:54:01] 3min into lunch >D [10:54:04] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9960396 (10Lucas_Werkmeister_WMDE) {T355292} should probably be a subtask of this (or maybe a subtask of T321899)? At least I’ve been told th... [10:54:37] I just cut my thumb cooking :D [10:54:45] ouch [10:55:12] arnaudb: I think you can go plaster yourself ;) [10:55:29] oh it's done, it was just before the phone rang :D [10:55:48] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3074.esams.wmnet [10:55:56] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3066.esams.wmnet [10:56:14] ah, thought you got terrified by it ringing :) [10:56:23] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9960410 (10Clement_Goubert) [10:56:25] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9960411 (10Clement_Goubert) [10:56:27] FIRING: [3x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:56:49] a string of bad luck I'd say [10:57:06] we had a bump on CDN but it seems gone [10:57:14] hey fabfur - there was an availability blip during your cp reboots, could that be related? 
[10:57:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [10:58:15] jayme: it's a bit early to get plastered though [10:58:16] :p [10:58:42] claime: what took you so long?! [10:59:42] Blame painkillers [10:59:47] * claime is not touching production today [11:00:03] ok, fair. You're excused this time [11:03:04] BTW currently running a script that is deleting a lot of rows from `cu_changes` and is currently on `s4`, which might explain the replication lag for `s4` cloud DBs. [11:04:02] It seems to be resolved based on the log, so I am not going to stop my script at the moment. [11:04:47] Dreamy_Jazz: they aren't lagging at the moment [11:04:50] arnaudb: ^ [11:05:16] AFAIK `cu_changes` is excluded from the cloud DBs by the sanitarium hosts, but I presume that the deletion statements still need to be filtered somehow. [11:06:48] there was a bit of replag on clouddb Dreamy_Jazz but this is expected on that host, threshold tweaking is currently ongoing [11:06:58] 👍 [11:06:59] unless you saw something hidden? 
[11:07:05] 👀 [11:07:18] No, just was looking at the scroll-back and saw a replication lag alert [11:07:25] ack thanks [11:09:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65935 and previous config saved to /var/cache/conftool/dbconfig/20240708-110905-root.json [11:09:23] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9960448 (10Clement_Goubert) [11:16:33] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9960456 (10Clement_Goubert) [11:20:25] (03PS4) 10Jforrester: [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) [11:20:30] (03PS5) 10Jforrester: [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) [11:20:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) (owner: 10Jforrester) [11:20:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) (owner: 10Jforrester) [11:21:19] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9960472 (10Clement_Goubert) 05Open→03Resolved The work this task 
tracked is now completed. Remaining migrations {T352650}, {T355292}, {T355292... [11:22:27] (03PS2) 10Jforrester: wikifunctions: Raise CPU limit in orchestrator from 200m to 400m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051813 (https://phabricator.wikimedia.org/T368892) [11:22:39] (03CR) 10Jforrester: [C:03+2] wikifunctions: Raise CPU limit in orchestrator from 200m to 400m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051813 (https://phabricator.wikimedia.org/T368892) (owner: 10Jforrester) [11:23:49] (03Merged) 10jenkins-bot: wikifunctions: Raise CPU limit in orchestrator from 200m to 400m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051813 (https://phabricator.wikimedia.org/T368892) (owner: 10Jforrester) [11:24:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65936 and previous config saved to /var/cache/conftool/dbconfig/20240708-112411-root.json [11:24:14] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:24:16] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:24:39] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:24:42] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:25:11] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:25:28] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:25:59] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [11:26:49] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [11:26:54] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [11:27:41] !log jforrester@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/wikifunctions: apply [11:29:07] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9960486 (10Clement_Goubert) 05Open→03In progress [11:34:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [11:34:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [11:34:26] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9960507 (10Clement_Goubert) 05In progress→03Resolved All internal traffic has been migrated. [11:36:02] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786#9960518 (10Clement_Goubert) [11:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:37:14] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:37:14] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:37:18] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:37:22] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:37:52] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - 
https://phabricator.wikimedia.org/T290536#9960524 (10Clement_Goubert) >>! In T290536#9960396, @Lucas_Werkmeister_WMDE wrote: > {T355292} should probably be a subtask of this (or maybe... [11:39:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65937 and previous config saved to /var/cache/conftool/dbconfig/20240708-113917-root.json [11:41:40] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9960525 (10Clement_Goubert) [11:42:10] (03PS1) 10Phuedx: lib: Update metrics-platform to 84ed8dcbe7c9 [extensions/EventLogging] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052711 [11:42:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/EventLogging] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052711 (owner: 10Phuedx) [11:46:02] (03CR) 10Ayounsi: [V:03+1] "No rush at all. I'm fine deploying it in a few weeks as it's a small edge case of the full routed ganeti setup." 
[puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [11:47:30] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 262476 [11:47:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 262476 [11:54:04] (03PS1) 10Btullis: Configure reuse-parts for an-mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/1052712 (https://phabricator.wikimedia.org/T365503) [11:54:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65938 and previous config saved to /var/cache/conftool/dbconfig/20240708-115422-root.json [11:58:16] (03CR) 10Btullis: [C:03+2] Configure reuse-parts for an-mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/1052712 (https://phabricator.wikimedia.org/T365503) (owner: 10Btullis) [12:00:28] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9960571 (10Lucas_Werkmeister_WMDE) /me shakes fist at Phorge for not letting me award this task another token 🪙🪙🪙🪙🪙 [12:17:52] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Uniquemia - https://phabricator.wikimedia.org/T369500 (10EUwandu-WMF) 03NEW [12:19:02] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-mariadb1002.eqiad.wmnet with OS bookworm [12:20:46] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052723 [12:27:30] !log btullis@deploy1002 Started deploy [airflow-dags/analytics@a2faba7]: (no justification provided) [12:27:57] !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@a2faba7]: (no justification provided) (duration: 00m 27s) [12:28:13] (03PS1) 10Slyngshede: data.yaml: Remove shell access for ezachte and chelsyx. 
[puppet] - 10https://gerrit.wikimedia.org/r/1052728 [12:29:29] (03CR) 10JMeybohm: [C:03+1] role::deployment_server::kubernetes: update Envoy's version [puppet] - 10https://gerrit.wikimedia.org/r/1052691 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [12:29:37] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:32:53] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-mariadb1002.eqiad.wmnet with reason: host reimage [12:35:14] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-mariadb1002.eqiad.wmnet with reason: host reimage [12:36:13] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3075.esams.wmnet [12:36:31] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3067.esams.wmnet [12:43:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool with small weight T365805', diff saved to https://phabricator.wikimedia.org/P65939 and previous config saved to /var/cache/conftool/dbconfig/20240708-124310-marostegui.json [12:43:14] T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805 [12:44:19] (03PS1) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) [12:44:21] (03PS1) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640) [12:44:43] (03CR) 10CI reject: [V:04-1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [12:46:36] (03CR) 10CDanis: [C:03+1] haproxy,hiera: Test bwlimit per url on cp4051 [puppet] - 
10https://gerrit.wikimedia.org/r/1052064 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [12:47:48] (03CR) 10Vgutierrez: [C:03+2] haproxy,hiera: Test bwlimit per url on cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/1052064 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [12:48:26] !log test bwlimit per url on cp4051 - T317799 [12:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:29] T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 [12:49:02] (03CR) 10Elukey: [C:03+2] role::deployment_server::kubernetes: update Envoy's version [puppet] - 10https://gerrit.wikimedia.org/r/1052691 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [12:49:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052165 (https://phabricator.wikimedia.org/T9496) (owner: 10Pppery) [12:50:23] (03CR) 10CDanis: [C:03+1] conftool/cli: add option to log actions with a reason string [software/conftool] - 10https://gerrit.wikimedia.org/r/1052307 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [12:51:48] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:51:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-mariadb1002.eqiad.wmnet with OS bookworm [12:51:56] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:57:14] (03CR) 10Cathal Mooney: [C:03+1] "There is a subtle difference here in terms of what Bird does with the information. 
With the address%interface syntax that just adds the i" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1300). [13:00:05] Gerges, tchin, James_F, phuedx, and pppery: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] Here [13:00:14] Hi [13:01:42] Hey. [13:02:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:03:06] hello [13:03:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:03:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:03:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:03:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T367781)', diff saved to https://phabricator.wikimedia.org/P65940 and previous config saved to /var/cache/conftool/dbconfig/20240708-130333-arnaudb.json [13:03:36] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:04:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T367781)', diff saved to https://phabricator.wikimedia.org/P65941 and previous config saved to 
/var/cache/conftool/dbconfig/20240708-130441-arnaudb.json [13:06:31] RECOVERY - Disk space on mw1445 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops [13:06:53] sorry jynus, apparently irccloud stopped alerting me about mentions [13:11:21] (03PS2) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) [13:11:21] (03PS2) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640) [13:11:42] I missed the ping somehow [13:11:46] is anyone deploying? [13:12:03] Looks like no [13:12:14] let's get started then [13:13:43] hello Gerges, do we have a 👍 for the VE enabling from someone on the Editing team (as the VE maintainers)? AFAIK, they'd like to review before a deployment like this one happens. 
[13:14:06] (03PS3) 10TChin: EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) [13:14:13] (03CR) 10Urbanecm: [C:03+2] EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [13:14:57] (03CR) 10Urbanecm: [C:03+2] [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) (owner: 10Jforrester) [13:15:10] (03Merged) 10jenkins-bot: EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [13:15:24] (03CR) 10Urbanecm: [C:03+2] lib: Update metrics-platform to 84ed8dcbe7c9 [extensions/EventLogging] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052711 (owner: 10Phuedx) [13:15:45] (03CR) 10CI reject: [V:04-1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:15:48] (03Merged) 10jenkins-bot: [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) (owner: 10Jforrester) [13:16:24] (03PS1) 10Vgutierrez: hiera: Fix cp4051 bwlimit configuration [puppet] - 10https://gerrit.wikimedia.org/r/1052736 (https://phabricator.wikimedia.org/T317799) [13:16:25] Pppery: would you mind securing a +1 on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1052165, please? 
[13:16:59] RECOVERY - Disk space on mw1446 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops [13:17:10] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1050596|EventStreamConfig: Add hive ingestion defaults (T367134)]], [[gerrit:1010270|[wikifunctionswiki] Disable MobileFrontend in production (T349408)]] [13:17:14] T367134: [Refine Refactoring] Integrate Refine workflow configuration into ESC - https://phabricator.wikimedia.org/T367134 [13:17:15] T349408: Use responsive Vector-2022 instead of Minerva for Wikifunctions Mobile and drop the secondary domain/MobileFrontend part - https://phabricator.wikimedia.org/T349408 [13:17:20] (03CR) 10CDanis: [C:03+1] hiera: Fix cp4051 bwlimit configuration [puppet] - 10https://gerrit.wikimedia.org/r/1052736 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [13:17:41] (03CR) 10Vgutierrez: [C:03+2] hiera: Fix cp4051 bwlimit configuration [puppet] - 10https://gerrit.wikimedia.org/r/1052736 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [13:19:25] actually... phuedx doesn't appear to be around, removing the +2 [13:19:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P65942 and previous config saved to /var/cache/conftool/dbconfig/20240708-131948-arnaudb.json [13:20:35] sent them a slack message, they're joining [13:20:38] (03CR) 10Urbanecm: [C:03+2] lib: Update metrics-platform to 84ed8dcbe7c9 [extensions/EventLogging] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052711 (owner: 10Phuedx) [13:20:58] oops, I also totally missed the ping [13:21:09] * Lucas_WMDE lets urbanecm deploy [13:21:20] o/ [13:21:22] hi phuedx! 
[13:21:31] waiting on CI currently [13:22:53] (03PS1) 10Effie Mouzeli: memcached: enable extstore to eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/1052739 (https://phabricator.wikimedia.org/T352885) [13:23:16] (03CR) 10CI reject: [V:04-1] memcached: enable extstore to eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/1052739 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [13:24:02] (03PS2) 10Effie Mouzeli: memcached: enable extstore to eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/1052739 (https://phabricator.wikimedia.org/T352885) [13:24:25] (03CR) 10Ssingh: "Thanks for the review folks!" [software/conftool] - 10https://gerrit.wikimedia.org/r/1052307 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [13:24:26] (03CR) 10Ssingh: [C:03+2] conftool/cli: add option to log actions with a reason string [software/conftool] - 10https://gerrit.wikimedia.org/r/1052307 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [13:24:48] (03PS3) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) [13:24:48] (03PS3) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640) [13:25:33] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9960830 (10Papaul) @cmooney the 18th works for me thanks. 
[13:26:58] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9960835 (10ssingh) [13:27:08] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504 (10cmooney) 03NEW p:05Triage→03Medium [13:27:41] Pppery: Gerges: reminding about my pings from above, can you take a look please? [13:28:13] I saw that ping. Was thinking about who to add as reviewers, though. [13:28:52] (03CR) 10Arnaudb: [C:03+2] mariadb: add monitoring on io pressure for mariadb hosts [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [13:29:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T367856)', diff saved to https://phabricator.wikimedia.org/P65943 and previous config saved to /var/cache/conftool/dbconfig/20240708-132911-marostegui.json [13:29:12] (03CR) 10Arnaudb: [V:03+1 C:03+2] mariadb: add monitoring on io pressure for mariadb hosts [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [13:29:15] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:29:18] (03CR) 10CI reject: [V:04-1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:30:28] (03Merged) 10jenkins-bot: mariadb: add monitoring on io pressure for mariadb hosts [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [13:31:56] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security update - bking@cumin2002 - T366555 [13:31:58] !log urbanecm@deploy1002 tchin, jforrester, 
urbanecm: Backport for [[gerrit:1050596|EventStreamConfig: Add hive ingestion defaults (T367134)]], [[gerrit:1010270|[wikifunctionswiki] Disable MobileFrontend in production (T349408)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:32:02] finally [13:32:03] T367134: [Refine Refactoring] Integrate Refine workflow configuration into ESC - https://phabricator.wikimedia.org/T367134 [13:32:03] T349408: Use responsive Vector-2022 instead of Minerva for Wikifunctions Mobile and drop the secondary domain/MobileFrontend part - https://phabricator.wikimedia.org/T349408 [13:32:17] tchin: James_F: please take a look at the first two changes at mwdebug, if possible :) [13:32:19] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security update - bking@cumin2002 - T366555 [13:32:32] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security update - bking@cumin2002 - T366555 [13:32:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [13:32:59] urbanecm: LGTM. 
[13:33:08] ty [13:34:04] (03PS4) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) [13:34:04] (03PS4) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640) [13:34:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P65944 and previous config saved to /var/cache/conftool/dbconfig/20240708-133456-arnaudb.json [13:35:01] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 236 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 236, active_shards: 236, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 236, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number [13:35:01] light_fetch: 0, task_max_waiting_in_queue_millis: 527, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:35:01] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 7 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 9, active_shards: 9, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 7, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, [13:35:01] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 56.25 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:35:05] (03CR) 10Pppery: "Adding the author and approver of the original patch that added the functionality I'm fixing 
(https://gerrit.wikimedia.org/r/c/operations/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052165 (https://phabricator.wikimedia.org/T9496) (owner: 10Pppery) [13:36:01] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 9, active_shards: 16, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [13:36:01] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:36:33] tchin: what about you? [13:36:48] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1052728 (owner: 10Slyngshede) [13:37:01] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 252, active_shards: 472, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [13:37:01] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:37:14] (03CR) 10Marostegui: mediawiki: Start the table catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:37:37] I added some people as reviewers to my missing.php patch, but that probably won't take place during this backport window, so call it not done today and I am going to reschedule it for a later window [13:38:24] (03PS1) 10Effie Mouzeli: mw-mcrouter: rollout to eqiad mw-web 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1052741 (https://phabricator.wikimedia.org/T346690) [13:38:31] (03CR) 10CI reject: [V:04-1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:38:39] Pppery: thanks and sorry for the delay :). [13:38:58] looks good [13:39:03] !log urbanecm@deploy1002 tchin, jforrester, urbanecm: Continuing with sync [13:39:06] proceeding then, thanks [13:39:15] (03PS1) 10Ssingh: Release 3.0.2 [software/conftool] - 10https://gerrit.wikimedia.org/r/1052742 [13:40:03] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 16 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 16, delayed_unassigned_shards: 0, number_of_pending_tasks: 4, [13:40:03] f_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1436, active_shards_percent_as_number: 0.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:40:23] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 336 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 136, active_shards: 136, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 334, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number [13:40:23] light_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 28.8135593220339 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:41:03] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch 
status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 9, active_shards: 16, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [13:41:03] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:41:03] (03CR) 10Ssingh: "I am not sure about the conftool release cycle and if this warrants a new release or not so I will leave that to you. Please feel free to " [software/conftool] - 10https://gerrit.wikimedia.org/r/1052742 (owner: 10Ssingh) [13:41:23] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 252, active_shards: 472, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [13:41:23] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:41:43] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on relforge1003 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:42:02] (03Merged) 10jenkins-bot: lib: Update metrics-platform to 84ed8dcbe7c9 [extensions/EventLogging] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052711 (owner: 10Phuedx) [13:42:07] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security update - bking@cumin2002 - T366555 [13:42:32] (03CR) 
10Elukey: [C:03+1] aux: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052700 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [13:42:46] (03CR) 10Elukey: [C:03+2] aux: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052700 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [13:43:29] (03PS1) 10Alexandros Kosiaris: ats: Route /api/ to /w/rest.php on mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/1052745 (https://phabricator.wikimedia.org/T364400) [13:44:05] (03PS5) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) [13:44:06] (03PS5) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640) [13:44:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P65945 and previous config saved to /var/cache/conftool/dbconfig/20240708-134418-marostegui.json [13:47:16] (03CR) 10Ladsgroup: mediawiki: Start the table catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:47:48] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1050596|EventStreamConfig: Add hive ingestion defaults (T367134)]], [[gerrit:1010270|[wikifunctionswiki] Disable MobileFrontend in production (T349408)]] (duration: 30m 38s) [13:47:52] T367134: [Refine Refactoring] Integrate Refine workflow configuration into ESC - https://phabricator.wikimedia.org/T367134 [13:47:53] T349408: Use responsive Vector-2022 instead of Minerva for Wikifunctions Mobile and drop the secondary domain/MobileFrontend part - https://phabricator.wikimedia.org/T349408 [13:47:56] and synced! 
[13:47:56] (03CR) 10Marostegui: mediawiki: Start the table catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:48:15] Whee. [13:48:15] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1052711|lib: Update metrics-platform to 84ed8dcbe7c9]] [13:48:22] continuing with the last one [13:48:40] (03CR) 10CI reject: [V:04-1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:48:53] (03CR) 10Marostegui: [C:03+1] "It looks good for now, maybe once we start using it, we'll notice stuff that needs changing to adapt more to our needs." [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:50:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T367781)', diff saved to https://phabricator.wikimedia.org/P65946 and previous config saved to /var/cache/conftool/dbconfig/20240708-135002-arnaudb.json [13:50:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:50:06] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:50:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:50:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T367781)', diff saved to https://phabricator.wikimedia.org/P65947 and previous config saved to /var/cache/conftool/dbconfig/20240708-135024-arnaudb.json [13:50:33] !log urbanecm@deploy1002 phuedx, urbanecm: Backport for [[gerrit:1052711|lib: Update metrics-platform to 84ed8dcbe7c9]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:50:44] phuedx: can you take a 
look at mwdebug, please? [13:50:55] urbanecm: On it [13:51:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T367781)', diff saved to https://phabricator.wikimedia.org/P65948 and previous config saved to /var/cache/conftool/dbconfig/20240708-135132-arnaudb.json [13:51:43] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on relforge1003 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:52:03] urbanecm: How about https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1052372 ? [13:52:26] Gerges: i asked a question a couple of lines above, but did not receive a response :). [13:52:28] let me repaste [13:52:39] 15:13 hello Gerges, do we have a 👍 for the VE enabling from someone on the Editing team (as the VE maintainers)? AFAIK, they'd like to review before a deployment like this one happens. [13:53:34] urbanecm: LGTM [13:53:39] thanks, proceeding [13:53:41] !log urbanecm@deploy1002 phuedx, urbanecm: Continuing with sync [13:53:51] So what do I do? [13:54:08] Gerges: do we have the plus one from Editing team, or not? [13:55:19] Do I need to wait for the review editing team? [13:55:20] (03CR) 10Btullis: dse-k8s-services: Add net-new chart for Airflow (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [13:55:42] (03PS6) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) [13:55:42] (03PS6) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640) [13:56:37] Gerges: if it didn't happen already, yes. it might be a good idea to ping them on the task (I can ask once I'm done with the deployment). 
[13:57:11] Okay [13:58:51] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1052711|lib: Update metrics-platform to 84ed8dcbe7c9]] (duration: 10m 36s) [13:58:56] and synced [13:58:59] anything else? [13:59:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P65949 and previous config saved to /var/cache/conftool/dbconfig/20240708-135925-marostegui.json [13:59:56] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1052748 (https://phabricator.wikimedia.org/T369514) [14:00:01] (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1052749 (https://phabricator.wikimedia.org/T369514) [14:00:33] Gerges: I commented on the task: https://phabricator.wikimedia.org/T368632#9961075. Let's see what they say. [14:01:11] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1052751 (https://phabricator.wikimedia.org/T369515) [14:05:43] (03CR) 10Filippo Giunchedi: [C:03+2] mobileapps: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043107 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:05:55] (03PS3) 10Filippo Giunchedi: mobileapps: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043107 (https://phabricator.wikimedia.org/T320563) [14:06:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P65950 and previous config saved to /var/cache/conftool/dbconfig/20240708-140640-arnaudb.json [14:06:59] PROBLEM - Disk space on mw1446 is CRITICAL: DISK CRITICAL - free space: / 5095 MB (1% inode=99%): /tmp 5095 MB (1% inode=99%): /var/tmp 5095 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops [14:09:50] (03PS1) 10Btullis: Puppetize the disabling of the misc dumps on snapshot1017 [puppet] - 10https://gerrit.wikimedia.org/r/1052752 (https://phabricator.wikimedia.org/T368098) [14:10:44] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3173/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052752 (https://phabricator.wikimedia.org/T368098) (owner: 10Btullis) [14:13:06] !log cleaning up old shellbox files on mw1446 [14:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T367856)', diff saved to https://phabricator.wikimedia.org/P65951 and previous config saved to /var/cache/conftool/dbconfig/20240708-141432-marostegui.json [14:14:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [14:14:36] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:14:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [14:16:53] RECOVERY - Disk space on mw1446 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops [14:16:58] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3076.esams.wmnet [14:17:07] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3068.esams.wmnet [14:17:24] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [14:17:25] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:17:56] !log bking@cumin2002 START - 
Cookbook sre.wdqs.reboot [14:17:56] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:18:03] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for xiaoxiao - https://phabricator.wikimedia.org/T369519 (10XiaoXiao-WMF) 03NEW [14:18:37] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 24.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:20:13] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [14:20:14] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:20:25] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [14:20:26] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:20:42] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] mobileapps: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043107 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:20:50] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [14:20:50] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:21:19] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [14:21:19] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:21:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P65952 and previous config saved to /var/cache/conftool/dbconfig/20240708-142147-arnaudb.json [14:21:50] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [14:21:50] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:21:57] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:22:34] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:22:35] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:22:48] (03PS3) 10Alexandros Kosiaris: Update 
modules/README.md [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 [14:23:09] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:23:45] (03CR) 10Alexandros Kosiaris: "Revisit this. I 've added a few more stuff and I 'll take a look at some point into what sextant does to fix the incompatibility issues." [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 (owner: 10Alexandros Kosiaris) [14:25:57] (03CR) 10Herron: [V:03+1] "Thanks! Great points and agreed overall. I'm hoping to revisit this to see how the metrics behave in Pyrra today, and assuming we can leav" [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [14:27:42] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [14:27:43] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:31:45] !log root@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1011.eqiad.wmnet [14:34:01] !log root@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1011.eqiad.wmnet [14:36:21] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2161 - https://phabricator.wikimedia.org/T369229#9961274 (10Jhancock.wm) request submitted with Dell. SR193625600. might have a spare on hand to get it up now. the SR will allow us to replace the spare. 
will lyk [14:36:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T367781)', diff saved to https://phabricator.wikimedia.org/P65953 and previous config saved to /var/cache/conftool/dbconfig/20240708-143654-arnaudb.json [14:36:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1180.eqiad.wmnet with reason: Maintenance [14:36:58] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:37:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1180.eqiad.wmnet with reason: Maintenance [14:37:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T367781)', diff saved to https://phabricator.wikimedia.org/P65954 and previous config saved to /var/cache/conftool/dbconfig/20240708-143716-arnaudb.json [14:37:37] (03CR) 10Ebernhardson: [C:03+2] "seems reasonably, looks to already be applied in prod." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051358 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [14:38:48] (03Merged) 10jenkins-bot: Search update pipeline: reduce client-side rate-limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051358 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [14:39:17] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T367781)', diff saved to https://phabricator.wikimedia.org/P65955 and previous config saved to /var/cache/conftool/dbconfig/20240708-143925-arnaudb.json [14:42:30] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Uniquemia - https://phabricator.wikimedia.org/T369500#9961302 
(10fgiunchedi) Hello @EUwandu-WMF, I couldn't find the uniquemia account on wikitech, or at least one with `euwandu-ctr@wikimedia.org` as its email, what wikitech account should we be using? tha... [14:43:50] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1011.eqiad.wmnet [14:43:51] !log root@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cloudcephosd1011.eqiad.wmnet [14:44:23] (03PS1) 10TChin: EventStreamConfig: Enable hive ingestion for mediawiki.page-delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052762 (https://phabricator.wikimedia.org/T367134) [14:46:36] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2161 - https://phabricator.wikimedia.org/T369229#9961346 (10Marostegui) Thank you! [14:49:09] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for xiaoxiao - https://phabricator.wikimedia.org/T369519#9961374 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Hello @XiaoXiao-WMF; I've added you to `wmf` ldap group. I'm tentatively resolving the task though please reopen if sth is amiss [14:49:59] PROBLEM - Disk space on mw1438 is CRITICAL: DISK CRITICAL - free space: / 10721 MB (2% inode=99%): /tmp 10721 MB (2% inode=99%): /var/tmp 10721 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1438&var-datasource=eqiad+prometheus/ops [14:51:14] (03CR) 10Clément Goubert: [C:03+1] "This will shift around 100rps to from mw-web to mw-api-ext. 
It shouldn't need a replica bump, but we should still keep an eye on latency a" [puppet] - 10https://gerrit.wikimedia.org/r/1052745 (https://phabricator.wikimedia.org/T364400) (owner: 10Alexandros Kosiaris) [14:51:17] (03CR) 10Ebernhardson: [C:03+1] cirrus: add cirrussearch-legacy-updater dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052135 (owner: 10DCausse) [14:51:35] !log cleaning up old shellbox files on mw1438 [14:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:56] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host search-loader1002.eqiad.wmnet [14:52:11] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host search-loader1002.eqiad.wmnet [14:53:22] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host search-loader1002.eqiad.wmnet [14:53:26] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host search-loader1002.eqiad.wmnet [14:53:38] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9961390 (10elukey) [14:53:45] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host search-loader1002.eqiad.wmnet [14:54:10] (03CR) 10Clément Goubert: [C:03+1] mw-mcrouter: rollout to eqiad mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052741 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:54:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P65956 and previous config saved to /var/cache/conftool/dbconfig/20240708-145432-arnaudb.json [14:56:37] RECOVERY - Disk space on mw1438 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1438&var-datasource=eqiad+prometheus/ops [14:57:21] !log bking@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader1002.eqiad.wmnet [14:59:07] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1011.eqiad.wmnet with OS bullseye [14:59:17] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052285 (https://phabricator.wikimedia.org/T369342) (owner: 10NMW03) [15:04:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:04:29] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9961423 (10xcollazo) >>! In T368098#9959951, @elukey wrote: > Folks today I fo... 
[15:07:31] (03PS4) 10Ladsgroup: mediawiki: Start the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) [15:07:36] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Start the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [15:07:45] (03CR) 10Btullis: [V:03+1 C:03+2] Puppetize the disabling of the misc dumps on snapshot1017 [puppet] - 10https://gerrit.wikimedia.org/r/1052752 (https://phabricator.wikimedia.org/T368098) (owner: 10Btullis) [15:07:49] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Start the table catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [15:09:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P65957 and previous config saved to /var/cache/conftool/dbconfig/20240708-150939-arnaudb.json [15:11:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [15:12:46] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9961448 (10xcollazo) >>! In T368098#9953045, @Ladsgroup wrote: >>>! In T368098... 
[15:12:50] (03PS1) 10Scott French: commons-impact-analytics: bump image to v1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052765 (https://phabricator.wikimedia.org/T361835) [15:13:36] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1011.eqiad.wmnet with reason: host reimage [15:13:47] (03CR) 10Volans: "Approach looks good to me, question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:14:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:16:44] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [15:16:46] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1011.eqiad.wmnet with reason: host reimage [15:20:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [15:21:05] (03CR) 10Volans: [C:03+1] "makes sense to me (to be tested ;) )" [cookbooks] - 10https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:22:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Bumping db1227 weight (T366852)', diff saved to https://phabricator.wikimedia.org/P65958 and previous config saved to /var/cache/conftool/dbconfig/20240708-152222-ladsgroup.json [15:22:26] T366852: Discover and fix under-utilized 
replicas - https://phabricator.wikimedia.org/T366852 [15:24:16] (03CR) 10Volans: "One concern inline" [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:24:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T367781)', diff saved to https://phabricator.wikimedia.org/P65959 and previous config saved to /var/cache/conftool/dbconfig/20240708-152446-arnaudb.json [15:24:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:24:50] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:25:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:25:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T367781)', diff saved to https://phabricator.wikimedia.org/P65960 and previous config saved to /var/cache/conftool/dbconfig/20240708-152508-arnaudb.json [15:25:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [15:27:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T367781)', diff saved to https://phabricator.wikimedia.org/P65961 and previous config saved to /var/cache/conftool/dbconfig/20240708-152717-arnaudb.json [15:30:04] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1530). Please do the needful. 
[15:30:48] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052766 (https://phabricator.wikimedia.org/T128546) [15:34:19] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052766 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:34:57] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052766 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:36:31] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#9961563 (10Volans) @elukey do you know how much of an effort would it be to change library ba... [15:36:44] (03CR) 10Volans: [C:03+1] "LGTM, question inline" [software/homer] - 10https://gerrit.wikimedia.org/r/1050262 (owner: 10Ayounsi) [15:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:38:07] (03CR) 10RLazarus: [C:03+1] commons-impact-analytics: bump image to v1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052765 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:38:14] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [15:38:18] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [15:39:08] (03PS1) 10JHathaway: wikipedia.org spf: indicate mail is sent from this domain. 
[dns] - 10https://gerrit.wikimedia.org/r/1052768 (https://phabricator.wikimedia.org/T369341) [15:41:53] (03CR) 10Scott French: [C:03+2] commons-impact-analytics: bump image to v1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052765 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:42:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P65962 and previous config saved to /var/cache/conftool/dbconfig/20240708-154224-arnaudb.json [15:42:25] (03CR) 10EoghanGaffney: [C:03+1] "This covers a good number of the affected domains but there are some others, we can deal with them in a separate patch!" [dns] - 10https://gerrit.wikimedia.org/r/1052768 (https://phabricator.wikimedia.org/T369341) (owner: 10JHathaway) [15:42:43] (03CR) 10JHathaway: [C:03+2] wikipedia.org spf: indicate mail is sent from this domain. [dns] - 10https://gerrit.wikimedia.org/r/1052768 (https://phabricator.wikimedia.org/T369341) (owner: 10JHathaway) [15:42:58] (03Merged) 10jenkins-bot: commons-impact-analytics: bump image to v1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052765 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:43:23] (03PS1) 10Arnaudb: bashrc: change option on alias [puppet] - 10https://gerrit.wikimedia.org/r/1052769 [15:43:25] (03CR) 10Arnaudb: [C:03+2] bashrc: change option on alias [puppet] - 10https://gerrit.wikimedia.org/r/1052769 (owner: 10Arnaudb) [15:44:12] (03CR) 10Dwisehaupt: "Yes, there should be a new endpoint to check. 
I brought it up with fr-tech last week before the US holiday and plan to have an answer soon" [puppet] - 10https://gerrit.wikimedia.org/r/1052062 (https://phabricator.wikimedia.org/T368114) (owner: 10Filippo Giunchedi) [15:44:18] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:44:36] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [15:44:49] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 07m 54s) [15:44:56] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:45:01] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [15:45:11] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:45:30] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:45:52] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:46:48] !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:47:29] !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:48:53] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:51:18] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 06m 28s) [15:51:21] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:51:26] FIRING: [2x] 
SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:54:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:55:02] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 16), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9961705 (10Scott_French) Thanks, @SGupta-WMF! @mforns - The v1.0.1 image is n... [15:57:19] (03PS1) 10Ahmon Dancy: class scap::scripts: Drop logstash_checker.py, phase 1 [puppet] - 10https://gerrit.wikimedia.org/r/1052771 [15:57:19] (03PS1) 10Ahmon Dancy: class scap::scripts: Drop logstash_checker.py, phase 2 [puppet] - 10https://gerrit.wikimedia.org/r/1052772 [15:57:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P65963 and previous config saved to /var/cache/conftool/dbconfig/20240708-155731-arnaudb.json [15:57:42] (03CR) 10CI reject: [V:04-1] class scap::scripts: Drop logstash_checker.py, phase 1 [puppet] - 10https://gerrit.wikimedia.org/r/1052771 (owner: 10Ahmon Dancy) [15:57:43] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3069.esams.wmnet [15:57:54] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3077.esams.wmnet [15:58:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:59:13] RECOVERY - Check unit status 
of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:59:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [16:00:14] (03PS2) 10Ahmon Dancy: class scap::scripts: Drop logstash_checker.py, phase 1 [puppet] - 10https://gerrit.wikimedia.org/r/1052771 [16:00:14] (03PS2) 10Ahmon Dancy: class scap::scripts: Drop logstash_checker.py, phase 2 [puppet] - 10https://gerrit.wikimedia.org/r/1052772 [16:01:26] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:02] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9961749 (10cmooney) [16:02:43] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052772 (owner: 10Ahmon Dancy) [16:03:18] (03PS1) 10Alexandros Kosiaris: mesh: Add faultinjection capabilities (c/p part) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052775 [16:03:18] (03PS1) 10Alexandros Kosiaris: mesh: Add faultinjection capabilities [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052776 [16:03:54] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052771 (owner: 10Ahmon Dancy) [16:04:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - 
https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [16:05:23] (03PS1) 10David Caro: cloudcephosd1011: update interface names [puppet] - 10https://gerrit.wikimedia.org/r/1052777 (https://phabricator.wikimedia.org/T309789) [16:06:12] (03CR) 10David Caro: [C:03+2] cloudcephosd1011: update interface names [puppet] - 10https://gerrit.wikimedia.org/r/1052777 (https://phabricator.wikimedia.org/T309789) (owner: 10David Caro) [16:06:47] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9961762 (10cmooney) [16:07:18] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9961779 (10cmooney) [16:08:41] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1011.eqiad.wmnet with OS bullseye [16:09:11] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1011.eqiad.wmnet [16:09:21] (03CR) 10Ahmon Dancy: "Not sure what's up with PCC" [puppet] - 10https://gerrit.wikimedia.org/r/1052771 (owner: 10Ahmon Dancy) [16:10:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [16:12:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T367781)', diff saved to https://phabricator.wikimedia.org/P65964 and previous config saved to /var/cache/conftool/dbconfig/20240708-161238-arnaudb.json [16:12:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:12:45] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [16:12:55] !log arnaudb@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:13:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T367781)', diff saved to https://phabricator.wikimedia.org/P65965 and previous config saved to /var/cache/conftool/dbconfig/20240708-161302-arnaudb.json [16:15:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T367781)', diff saved to https://phabricator.wikimedia.org/P65966 and previous config saved to /var/cache/conftool/dbconfig/20240708-161510-arnaudb.json [16:15:19] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1011.eqiad.wmnet [16:15:35] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:44] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [16:20:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [16:25:43] RESOLVED: [2x] OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [16:26:40] jouncebot: nowandnext [16:26:40] No deployments scheduled for the next 0 hour(s) and 33 minute(s) [16:26:41] In 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1700) 
[16:26:41] In 0 hour(s) and 33 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1700) [16:30:15] (03PS2) 10Ladsgroup: Reduce frequency of two query pages in commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052058 (https://phabricator.wikimedia.org/T369024) [16:30:18] (03CR) 10Ladsgroup: [C:03+2] Reduce frequency of two query pages in commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052058 (https://phabricator.wikimedia.org/T369024) (owner: 10Ladsgroup) [16:30:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P65967 and previous config saved to /var/cache/conftool/dbconfig/20240708-163017-arnaudb.json [16:30:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052058 (https://phabricator.wikimedia.org/T369024) (owner: 10Ladsgroup) [16:31:03] (03Merged) 10jenkins-bot: Reduce frequency of two query pages in commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052058 (https://phabricator.wikimedia.org/T369024) (owner: 10Ladsgroup) [16:31:18] !log ladsgroup@deploy1002 Started scap sync-world: Backport for [[gerrit:1052058|Reduce frequency of two query pages in commonswiki (T369024)]] [16:31:27] T369024: SpecialUncategorizedPages slow query - https://phabricator.wikimedia.org/T369024 [16:33:35] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1052058|Reduce frequency of two query pages in commonswiki (T369024)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:34:02] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [16:36:10] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for xiaoxiao - https://phabricator.wikimedia.org/T369519#9961950 (10Aklapper) [16:39:08] !log ladsgroup@deploy1002 Finished scap: Backport for 
[[gerrit:1052058|Reduce frequency of two query pages in commonswiki (T369024)]] (duration: 07m 50s) [16:39:11] T369024: SpecialUncategorizedPages slow query - https://phabricator.wikimedia.org/T369024 [16:45:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P65968 and previous config saved to /var/cache/conftool/dbconfig/20240708-164524-arnaudb.json [16:50:57] (03PS1) 10Ottomata: EventLoggingLegacyProxy - move endpoint to w/beacon/event.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052782 (https://phabricator.wikimedia.org/T353817) [16:51:04] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2161 - https://phabricator.wikimedia.org/T369229#9962007 (10Jhancock.wm) no spare, but got confirmation that the replacement is ordered. Should be here very soon. [16:51:56] (03CR) 10Ottomata: EventLoggingLegacyProxy - move endpoint to w/beacon/event.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052782 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [16:54:47] (03PS1) 10Herron: wip [alerts] - 10https://gerrit.wikimedia.org/r/1052784 [16:55:55] (03PS2) 10Anzx: jawiki: add throttle rule for edit-a-thon July 11-18, 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052781 (https://phabricator.wikimedia.org/T369522) [16:56:18] (03PS62) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [16:56:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052781 (https://phabricator.wikimedia.org/T369522) (owner: 10Anzx) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1700) [17:00:04] ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1700). Please do the needful. [17:00:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T367781)', diff saved to https://phabricator.wikimedia.org/P65969 and previous config saved to /var/cache/conftool/dbconfig/20240708-170031-arnaudb.json [17:00:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1224.eqiad.wmnet with reason: Maintenance [17:00:36] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:00:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1224.eqiad.wmnet with reason: Maintenance [17:00:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T367781)', diff saved to https://phabricator.wikimedia.org/P65970 and previous config saved to /var/cache/conftool/dbconfig/20240708-170053-arnaudb.json [17:01:26] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:02:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:02:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:03:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T367781)', diff 
saved to https://phabricator.wikimedia.org/P65971 and previous config saved to /var/cache/conftool/dbconfig/20240708-170302-arnaudb.json [17:07:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:14:17] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:14:35] (03CR) 10Scott French: [C:03+1] Remove legacy appserver and api records [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [17:15:22] (03CR) 10Scott French: [C:03+1] service.yaml: Switch api and appserver to lvs_setup 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050381 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [17:18:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P65972 and previous config saved to /var/cache/conftool/dbconfig/20240708-171810-arnaudb.json [17:18:52] (03CR) 10Scott French: "From reading [0], it sounds like the `service::catalog` entries need to move to `service_setup` in this step as well (i.e., before the PyB" [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [17:23:12] (03CR) 10Scott French: [C:03+1] Remove conftool-data and service catalog for legacy appservers 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050383 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [17:23:20] (03CR) 10JHathaway: [C:03+1] data.yaml: Remove shell access for ezachte and chelsyx. 
[puppet] - 10https://gerrit.wikimedia.org/r/1052728 (owner: 10Slyngshede) [17:24:00] (03PS63) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [17:32:20] (03PS2) 10Herron: istio_sli_avail: alert if metric goes absent [alerts] - 10https://gerrit.wikimedia.org/r/1052784 (https://phabricator.wikimedia.org/T352756) [17:33:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P65973 and previous config saved to /var/cache/conftool/dbconfig/20240708-173316-arnaudb.json [17:34:44] (03CR) 10Hashar: [C:03+1] "This can be merged anytime :)" [puppet] - 10https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: 10Hashar) [17:34:45] 06SRE, 10DNS, 10fundraising-tech-ops, 06Traffic, 13Patch-For-Review: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9962213 (10Dzahn) Thanks @AKanji-WMF Are you still using http://mandrillapp.com/ / MailChimp for fundraising emails with benefactors.wikimedia.org ? 
[17:34:56] (03PS1) 10Ottomata: mediawiki.org - Apache rewrite /beacon/event -> /w/beacon/event.php [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) [17:35:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:35:47] (03PS2) 10Ottomata: mediawiki.org - Apache rewrite /beacon/event -> /w/beacon/event.php [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) [17:35:57] (03CR) 10Dzahn: [C:03+2] gerrit: enable built-in log rotation [puppet] - 10https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: 10Hashar) [17:37:21] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9962238 (10jhathaway) I agree that decoupling makes sense and that it is worth the effort to try and run the current script on the puppets... [17:38:21] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3070.esams.wmnet [17:38:44] (03CR) 10Bking: "Acknowledged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [17:40:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:40:51] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3078.esams.wmnet [17:41:28] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9962325 (10VRiley-WMF) @Eevans It did. I was planning on swapping the unit back. Is there a good time to proceed with this? 
[17:41:37] 06SRE, 06collaboration-services, 06DBA, 13Patch-For-Review: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9962322 (10Ladsgroup) 05Open→03Resolved ^ dropped the user in production on m5. [17:45:30] (03CR) 10Dzahn: [C:03+2] "how are we handling the service restart and make sure it's not forgotten? We have removed the logrotation now so let's not run out of disk" [puppet] - 10https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: 10Hashar) [17:48:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T367781)', diff saved to https://phabricator.wikimedia.org/P65974 and previous config saved to /var/cache/conftool/dbconfig/20240708-174823-arnaudb.json [17:48:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [17:48:27] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:48:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [17:48:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1231.eqiad.wmnet with reason: Maintenance [17:49:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1231.eqiad.wmnet with reason: Maintenance [17:49:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T367781)', diff saved to https://phabricator.wikimedia.org/P65975 and previous config saved to /var/cache/conftool/dbconfig/20240708-174918-arnaudb.json [17:50:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T367781)', diff saved to https://phabricator.wikimedia.org/P65976 and previous config saved to /var/cache/conftool/dbconfig/20240708-175026-arnaudb.json [17:50:53] (03PS1) 10JHathaway: wikipedia.org spf: add a comment [dns] - 
10https://gerrit.wikimedia.org/r/1052792 (https://phabricator.wikimedia.org/T369341) [17:51:36] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9962378 (10CDanis) +1 [17:52:40] (03CR) 10CDanis: [C:03+1] merge_cli: fix a puppet-merge.sh comment [puppet] - 10https://gerrit.wikimedia.org/r/1052260 (https://phabricator.wikimedia.org/T366355) (owner: 10Elukey) [17:52:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:54:21] (03CR) 10JHathaway: [C:03+1] MediaWiki: Allow Bitu to be used as a 2FA proxy. [software/bitu] - 10https://gerrit.wikimedia.org/r/1052085 (https://phabricator.wikimedia.org/T359551) (owner: 10Slyngshede) [17:54:32] (03CR) 10JHathaway: [C:03+2] wikipedia.org spf: add a comment [dns] - 10https://gerrit.wikimedia.org/r/1052792 (https://phabricator.wikimedia.org/T369341) (owner: 10JHathaway) [17:56:26] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:56:55] (03CR) 10JHathaway: [C:03+1] "looks good, root@wikimedia.org is also another option" [puppet] - 10https://gerrit.wikimedia.org/r/1051846 (owner: 10Dzahn) [17:57:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:58:15] (03CR) 10JHathaway: [C:03+1] puppetmaster::gitclone: disarm pre-commit and post-commit hooks [puppet] - 10https://gerrit.wikimedia.org/r/1052261 
(https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [17:58:24] (03PS1) 10Ssingh: P:dns::auth: indentation clean-up, no code change [puppet] - 10https://gerrit.wikimedia.org/r/1052793 [17:59:24] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3174/console" [puppet] - 10https://gerrit.wikimedia.org/r/1052793 (owner: 10Ssingh) [18:01:32] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9962431 (10CDanis) >>! In T366355#9954359, @elukey wrote: > I've also checked what puppet-merge does behind the scenes, and the gist of it... [18:02:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:02:18] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns::auth: indentation clean-up, no code change [puppet] - 10https://gerrit.wikimedia.org/r/1052793 (owner: 10Ssingh) [18:02:49] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host search-loader2002.codfw.wmnet [18:05:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P65977 and previous config saved to /var/cache/conftool/dbconfig/20240708-180533-arnaudb.json [18:05:35] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:06:37] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader2002.codfw.wmnet [18:09:17] RESOLVED: [3x] JobUnavailable: Reduced availability 
for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:14:17] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:16:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [18:20:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P65978 and previous config saved to /var/cache/conftool/dbconfig/20240708-182041-arnaudb.json [18:21:44] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [18:34:30] (03CR) 10Andrea Denisse: [C:03+1] "I just added a nit regarding variable naming but I see that other parts of the code (unrelated to this patch) use the same variable name (" [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [18:35:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T367781)', diff saved to https://phabricator.wikimedia.org/P65979 and previous config saved to /var/cache/conftool/dbconfig/20240708-183548-arnaudb.json [18:35:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [18:35:52] T367781: Drop deprecated abuse filter fields on wmf 
wikis - https://phabricator.wikimedia.org/T367781 [18:36:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [18:36:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2114.codfw.wmnet with reason: Maintenance [18:36:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2114.codfw.wmnet with reason: Maintenance [18:36:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2124.codfw.wmnet with reason: Maintenance [18:36:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2124.codfw.wmnet with reason: Maintenance [18:36:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T367781)', diff saved to https://phabricator.wikimedia.org/P65980 and previous config saved to /var/cache/conftool/dbconfig/20240708-183658-arnaudb.json [18:38:27] (03CR) 10Dzahn: [C:03+2] puppetmaster: change git sender email address to git@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1051846 (owner: 10Dzahn) [18:39:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367781)', diff saved to https://phabricator.wikimedia.org/P65981 and previous config saved to /var/cache/conftool/dbconfig/20240708-183923-arnaudb.json [18:39:32] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [18:44:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853 (owner: 10Catrope) [18:45:34] (03CR) 10Dzahn: [C:03+2] "changed but I think we now need a list admin to allow the "non-member sender address". 
Trying to find out who that is." [puppet] - 10https://gerrit.wikimedia.org/r/1051846 (owner: 10Dzahn) [18:49:12] (03PS1) 10Ssingh: conftool-data: add geodns schema [puppet] - 10https://gerrit.wikimedia.org/r/1052803 (https://phabricator.wikimedia.org/T369366) [18:49:13] (03PS1) 10Ssingh: P:dns::auth::update: maintain admin_state via confd [puppet] - 10https://gerrit.wikimedia.org/r/1052804 (https://phabricator.wikimedia.org/T369366) [18:50:14] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1052804 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [18:54:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P65982 and previous config saved to /var/cache/conftool/dbconfig/20240708-185430-arnaudb.json [19:02:41] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [19:02:47] (03PS1) 10Dzahn: Revert "puppetmaster: change git sender email address to git@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1052805 [19:05:20] (03CR) 10Dzahn: [C:03+2] Revert "puppetmaster: change git sender email address to git@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1052805 (owner: 10Dzahn) [19:06:46] (03CR) 10Krinkle: EventLoggingLegacyProxy - move endpoint to w/beacon/event.php (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052782 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:09:33] 06SRE, 10Incident Tooling: wikimediastatus.net help popups are mobile-unfriendly and keyboard-inaccessible - https://phabricator.wikimedia.org/T327201#9962670 (10CDanis) >>! In T327201#9958666, @DMacks wrote: > It is still not fixed on my desktop-Mac Firefox. There is no longer a scrollbar, but the box is stil... 
[19:09:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P65983 and previous config saved to /var/cache/conftool/dbconfig/20240708-190937-arnaudb.json [19:12:46] (03PS1) 10Dzahn: Revert^2 "puppetmaster: change git sender email address to git@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1052807 [19:20:50] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [19:20:59] (03CR) 10Dzahn: [C:03+2] Revert^2 "puppetmaster: change git sender email address to git@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1052807 (owner: 10Dzahn) [19:21:02] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [19:21:10] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3071.esams.wmnet [19:21:26] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3079.esams.wmnet [19:24:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367781)', diff saved to https://phabricator.wikimedia.org/P65984 and previous config saved to /var/cache/conftool/dbconfig/20240708-192444-arnaudb.json [19:24:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2129.codfw.wmnet with reason: Maintenance [19:24:54] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:25:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2129.codfw.wmnet with reason: Maintenance [19:25:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2129 (T367781)', diff saved to https://phabricator.wikimedia.org/P65985 and previous config saved to /var/cache/conftool/dbconfig/20240708-192508-arnaudb.json [19:27:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T367781)', diff saved to https://phabricator.wikimedia.org/P65986 and 
previous config saved to /var/cache/conftool/dbconfig/20240708-192735-arnaudb.json [19:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:37:31] (03CR) 10Aude: [C:03+1] "looks good. tested this locally" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853 (owner: 10Catrope) [19:38:36] (03CR) 10Ottomata: "Abandoning this based on discussion:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052782 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:38:39] (03Abandoned) 10Ottomata: EventLoggingLegacyProxy - move endpoint to w/beacon/event.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052782 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:39:02] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [19:42:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P65987 and previous config saved to /var/cache/conftool/dbconfig/20240708-194242-arnaudb.json [19:44:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [19:44:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [19:44:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T367856)', diff saved to https://phabricator.wikimedia.org/P65988 and previous config saved to /var/cache/conftool/dbconfig/20240708-194435-marostegui.json [19:44:39] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [19:45:31] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:46:23] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:57:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P65989 and previous config saved to /var/cache/conftool/dbconfig/20240708-195749-arnaudb.json [19:58:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051709 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [19:59:41] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T2000). [20:00:05] Nemoralis, anzx, and RoanKattouw: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:01:26] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:08:55] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [20:12:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T367781)', diff saved to https://phabricator.wikimedia.org/P65990 and previous config saved to /var/cache/conftool/dbconfig/20240708-201256-arnaudb.json [20:12:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2151.codfw.wmnet with reason: Maintenance [20:13:00] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [20:13:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2151.codfw.wmnet with reason: Maintenance [20:13:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T367781)', diff saved to https://phabricator.wikimedia.org/P65991 and previous config saved to /var/cache/conftool/dbconfig/20240708-201318-arnaudb.json [20:14:06] (03CR) 10Dzahn: [C:03+2] "This package is actually installed on every single machine. (" [puppet] - 10https://gerrit.wikimedia.org/r/1052383 (https://phabricator.wikimedia.org/T369322) (owner: 10Urbanecm) [20:15:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T367781)', diff saved to https://phabricator.wikimedia.org/P65992 and previous config saved to /var/cache/conftool/dbconfig/20240708-201545-arnaudb.json [20:17:21] (03CR) 10Dzahn: [C:03+2] stewards: Add Phabricator API configuration [puppet] - 10https://gerrit.wikimedia.org/r/1052185 (https://phabricator.wikimedia.org/T369322) (owner: 10Urbanecm) [20:18:27] I guess nobody is doing the deployment yet? 
I can start [20:19:22] And nobody is here for the other patches? [20:19:37] Alright well then I'll finish my lunch and then deploy my patch [20:27:53] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [20:28:37] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [20:30:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P65993 and previous config saved to /var/cache/conftool/dbconfig/20240708-203052-arnaudb.json [20:35:28] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [20:35:33] (03PS1) 10Btullis: Allow dse-k8s-worker hosts to access ceph ports [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) [20:35:41] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052771 (owner: 10Ahmon Dancy) [20:36:04] (03PS2) 10Btullis: Allow dse-k8s-worker hosts to access ceph ports [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) [20:36:49] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3176/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [20:38:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853 (owner: 10Catrope) [20:38:28] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [20:38:44] (03PS2) 10Catrope: Graph extension: Add tracking for data sources used in tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853 [20:38:50] (03CR) 10TrainBranchBot: "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853 (owner: 10Catrope) [20:38:57] PROBLEM - IPv4 ping to eqsin on 
ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 41 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:39:27] (03Merged) 10jenkins-bot: Graph extension: Add tracking for data sources used in tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853 (owner: 10Catrope) [20:39:44] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1051853|Graph extension: Add tracking for data sources used in tags]] [20:40:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [20:40:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [20:40:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T367856)', diff saved to https://phabricator.wikimedia.org/P65994 and previous config saved to /var/cache/conftool/dbconfig/20240708-204042-marostegui.json [20:40:46] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [20:42:06] !log catrope@deploy1002 catrope: Backport for [[gerrit:1051853|Graph extension: Add tracking for data sources used in tags]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:42:47] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [20:43:49] (03PS3) 10Btullis: Allow dse-k8s-worker hosts to access ceph ports [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) [20:43:58] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1022.eqiad.wmnet [20:46:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P65995 and previous config saved to 
/var/cache/conftool/dbconfig/20240708-204559-arnaudb.json [20:46:21] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3178/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [20:47:21] (03CR) 10Btullis: Allow dse-k8s-worker hosts to access ceph ports [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [20:47:33] !log catrope@deploy1002 catrope: Continuing with sync [20:47:49] (03PS1) 10Andrew Bogott: trove guest agent: look for cinder volume on /sdb [puppet] - 10https://gerrit.wikimedia.org/r/1052814 [20:47:56] (03CR) 10Btullis: [C:03+2] Allow dse-k8s-worker hosts to access ceph ports [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [20:48:23] (03PS2) 10Andrew Bogott: trove guest agent: look for cinder volume on /sdb [puppet] - 10https://gerrit.wikimedia.org/r/1052814 [20:48:57] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:49:00] (03CR) 10Andrew Bogott: [C:03+2] trove guest agent: look for cinder volume on /sdb [puppet] - 10https://gerrit.wikimedia.org/r/1052814 (owner: 10Andrew Bogott) [20:49:31] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 225, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:49:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:50:23] o/ [20:50:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1022.eqiad.wmnet [20:50:32] I forgot that I have a deployment [20:52:45] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1051853|Graph extension: Add tracking for data sources used in tags]] (duration: 13m 00s) [20:55:19] catrope: are you able to deploy my patch too? [20:55:45] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1023.eqiad.wmnet [20:56:26] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:56:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [20:59:20] 06SRE, 06Traffic: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9963012 (10bd808) I expect that the `?action=raw` query string is what is causing you to run into a rate limit. I think you will have a better... [20:59:41] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:00:05] Reedy, sbassett, Maryum, and manfredi: Time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T2100). 
[21:01:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T367781)', diff saved to https://phabricator.wikimedia.org/P65996 and previous config saved to /var/cache/conftool/dbconfig/20240708-210106-arnaudb.json [21:01:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2158.codfw.wmnet with reason: Maintenance [21:01:10] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [21:01:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2158.codfw.wmnet with reason: Maintenance [21:01:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [21:01:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [21:01:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [21:01:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T367781)', diff saved to https://phabricator.wikimedia.org/P65997 and previous config saved to /var/cache/conftool/dbconfig/20240708-210144-arnaudb.json [21:01:45] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3072.esams.wmnet [21:02:02] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3080.esams.wmnet [21:02:43] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1023.eqiad.wmnet [21:04:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T367781)', diff saved to https://phabricator.wikimedia.org/P65998 and previous config saved to /var/cache/conftool/dbconfig/20240708-210410-arnaudb.json [21:05:17] !log bking@cumin2002 START 
- Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1093-1095].eqiad.wmnet with reason: T348977 [21:05:23] Nemoralis: Sorry for the delay, yes I'll deploy yours now [21:05:28] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 [21:05:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1093-1095].eqiad.wmnet with reason: T348977 [21:05:41] (03PS2) 10NMW03: Enable VisualEditor by default on Italian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052285 (https://phabricator.wikimedia.org/T369342) [21:05:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic109[3-5]* for T348977 - bking@cumin2002 [21:05:50] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic109[3-5]* for T348977 - bking@cumin2002 [21:05:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052285 (https://phabricator.wikimedia.org/T369342) (owner: 10NMW03) [21:06:30] (03Merged) 10jenkins-bot: Enable VisualEditor by default on Italian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052285 (https://phabricator.wikimedia.org/T369342) (owner: 10NMW03) [21:06:46] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1052285|Enable VisualEditor by default on Italian Wikibooks (T369342)]] [21:06:48] T369342: Enable VisualEditor by default on Italian Wikibooks - https://phabricator.wikimedia.org/T369342 [21:07:15] Nemoralis1: Hi, just in case you missed it, I started deploying your patch [21:07:42] thank you, I can test it when it is available [21:09:21] !log catrope@deploy1002 catrope, nmw03: Backport for [[gerrit:1052285|Enable VisualEditor by default on Italian Wikibooks (T369342)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:40] 
testing... [21:10:44] RoanKattouw: LGTM [21:10:53] Thanks, continuing [21:10:56] !log catrope@deploy1002 catrope, nmw03: Continuing with sync [21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:40] (03CR) 10Ebernhardson: [C:03+1] wdqs: enable throttling only for requests coming from the CDN [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [21:14:21] (03PS1) 10Btullis: cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052818 (https://phabricator.wikimedia.org/T327259) [21:14:31] 10SRE-swift-storage, 07Wikimedia-production-error: Unable to undelete File:Boston_Bruins.svg - https://phabricator.wikimedia.org/T369299#9963067 (10Sreejithk2000) It worked today when i tried. Closing the bug hence. [21:14:41] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Uniquemia - https://phabricator.wikimedia.org/T369500#9963066 (10EUwandu-WMF) Hello @fgiunchedi , Please can you check again to see if it works now? 
Here is the screenshot of my sign-in with Uniquemia on Wikitech as well if it is helpful{F56297464} [21:15:12] 10SRE-swift-storage, 07Wikimedia-production-error: Unable to undelete File:Boston_Bruins.svg - https://phabricator.wikimedia.org/T369299#9963068 (10Sreejithk2000) 05Open→03Resolved a:03Sreejithk2000 [21:16:08] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1052285|Enable VisualEditor by default on Italian Wikibooks (T369342)]] (duration: 09m 23s) [21:16:11] T369342: Enable VisualEditor by default on Italian Wikibooks - https://phabricator.wikimedia.org/T369342 [21:16:36] (03CR) 10Dzahn: admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [21:18:05] Nemoralis: All done [21:18:12] thank you! [21:18:25] (03CR) 10Btullis: [C:03+2] cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052818 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [21:19:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P65999 and previous config saved to /var/cache/conftool/dbconfig/20240708-211918-arnaudb.json [21:20:09] (03PS4) 10Dzahn: admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) [21:20:44] (03CR) 10Dzahn: [C:03+2] "Miriam agreed on the ticket and also confirmed Xiao Xiao" [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [21:21:36] (03Merged) 10jenkins-bot: cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052818 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [21:21:58] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix 
groups - https://phabricator.wikimedia.org/T276465#9963101 (Dzahn)
[21:23:53] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[21:24:29] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: add approvers to analytics-research-admins - https://phabricator.wikimedia.org/T368435#9963118 (Dzahn) Also thanks @Volans for the details and suggesting to add docs to Wikitech
[21:24:49] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[21:27:07] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: add approvers to analytics-research-admins - https://phabricator.wikimedia.org/T368435#9963110 (Dzahn) Open→Resolved a:Dzahn Thank you @Miriam! You and Xiao Xiao have been added to the code base. So far this isn't happe...
[21:28:56] (CR) Dzahn: [C:+2] "https://puppet-compiler.wmflabs.org/output/1046121/3179/miscweb1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367) (owner: Stevemunene)
[21:34:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P66000 and previous config saved to /var/cache/conftool/dbconfig/20240708-213425-arnaudb.json
[21:37:11] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:37:11] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:37:19] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:38:03] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:41:52] (Abandoned) Jforrester: Drop experimental mediawiki-dev chart,
unused(?) [deployment-charts] - https://gerrit.wikimedia.org/r/953633 (owner: Jforrester)
[21:42:52] (CR) Bartosz Dziewoński: Handle sso.wikimedia.org domain (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[21:46:01] SRE, serviceops, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9963189 (mforns) @Scott_French Thank you! We would like to bring up the prod...
[21:48:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[21:49:12] FIRING: ProbeDown: Service miscweb2003:443 has failed probes (http_query_scholarly_wikidata_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:49:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T367781)', diff saved to https://phabricator.wikimedia.org/P66001 and previous config saved to /var/cache/conftool/dbconfig/20240708-214932-arnaudb.json
[21:49:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2169.codfw.wmnet with reason: Maintenance
[21:49:36] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[21:49:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2169.codfw.wmnet with reason: Maintenance
[21:49:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T367781)', diff saved to https://phabricator.wikimedia.org/P66002 and
previous config saved to /var/cache/conftool/dbconfig/20240708-214954-arnaudb.json
[21:51:12] (PS1) Bking: elastic: test envoy TLS terminator in relforge [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[21:51:45] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: Bking)
[21:52:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367781)', diff saved to https://phabricator.wikimedia.org/P66003 and previous config saved to /var/cache/conftool/dbconfig/20240708-215220-arnaudb.json
[21:52:54] (PS2) Bking: elastic: test envoy TLS terminator in relforge [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[21:53:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[21:54:12] FIRING: [4x] ProbeDown: Service miscweb1003:443 has failed probes (http_query_main_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:55:24] (CR) CI reject: [V:-1] elastic: test envoy TLS terminator in relforge [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: Bking)
[21:59:13] (PS3) Bking: elastic: test envoy TLS terminator in relforge [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[21:59:35] (CR) CI reject: [V:-1] elastic: test envoy TLS terminator in relforge [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: Bking)
[22:02:29]
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/elasticsearch/manifests/tlsproxy.pp
[22:05:38] (PS4) Bking: elastic: test envoy TLS terminator in relforge [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[22:06:00] (CR) CI reject: [V:-1] elastic: test envoy TLS terminator in relforge [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: Bking)
[22:07:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P66004 and previous config saved to /var/cache/conftool/dbconfig/20240708-220727-arnaudb.json
[22:07:32] (PS5) Bking: elastic: test envoy TLS terminator in relforge [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[22:09:12] (PS6) Bking: elastic: test envoy TLS terminator in relforge [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[22:10:35] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: Bking)
[22:11:42] SRE-swift-storage, Wikimedia-production-error: Unable to undelete File:Boston_Bruins.svg - https://phabricator.wikimedia.org/T369299#9963274 (Aklapper) a:Sreejithk2000→None
[22:18:26] (PS7) Bking: elastic: test envoy TLS terminator in relforge [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[22:21:05] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: Bking)
[22:22:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P66005 and previous config saved to
/var/cache/conftool/dbconfig/20240708-222234-arnaudb.json
[22:25:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[22:26:42] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[22:29:31] ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565 (RobH) NEW
[22:30:18] ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#9963351 (RobH)
[22:30:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[22:30:51] ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#9963359 (RobH)
[22:31:30] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:32:30] ops-codfw, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566 (RobH) NEW
[22:32:46] ops-codfw, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#9963385 (RobH)
[22:37:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367781)', diff saved to
https://phabricator.wikimedia.org/P66006 and previous config saved to /var/cache/conftool/dbconfig/20240708-223741-arnaudb.json
[22:37:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2180.codfw.wmnet with reason: Maintenance
[22:37:45] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[22:37:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2180.codfw.wmnet with reason: Maintenance
[22:37:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T367781)', diff saved to https://phabricator.wikimedia.org/P66007 and previous config saved to /var/cache/conftool/dbconfig/20240708-223752-arnaudb.json
[22:38:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[22:40:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T367781)', diff saved to https://phabricator.wikimedia.org/P66008 and previous config saved to /var/cache/conftool/dbconfig/20240708-224006-arnaudb.json
[22:42:53] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3081.esams.wmnet
[22:42:53] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_esams
[22:43:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[22:45:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[22:46:03] (CR) Herron: [C:+1]
pontoon: support puppet 7 / puppetserver and openstack API [puppet] - https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: Filippo Giunchedi)
[22:46:36] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[22:52:49] PROBLEM - Host cp3073 is DOWN: PING CRITICAL - Packet loss = 100%
[22:55:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P66009 and previous config saved to /var/cache/conftool/dbconfig/20240708-225513-arnaudb.json
[22:55:44] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:03:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:08:01] (PS2) Dwisehaupt: prometheus: adjust fr payments-listener endpoint monitoring [puppet] - https://gerrit.wikimedia.org/r/1052062 (https://phabricator.wikimedia.org/T368114) (owner: Filippo Giunchedi)
[23:08:09] SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9963466 (ATsay-WMF) I approve this, thanks!
[23:08:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:09:49] (CR) Dwisehaupt: "@fgiunchedi@wikimedia.org I have updated the URL to the new endpoint we can test. It should be clear to roll out when you are ready."
[puppet] - https://gerrit.wikimedia.org/r/1052062 (https://phabricator.wikimedia.org/T368114) (owner: Filippo Giunchedi)
[23:10:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P66010 and previous config saved to /var/cache/conftool/dbconfig/20240708-231020-arnaudb.json
[23:24:18] (CR) Dzahn: [C:+2] "Merging this before the sites actually existed in DNS caused 12 monitoring alerts. 8 for search-platform and 4 for collab." [puppet] - https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367) (owner: Stevemunene)
[23:25:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T367781)', diff saved to https://phabricator.wikimedia.org/P66011 and previous config saved to /var/cache/conftool/dbconfig/20240708-232527-arnaudb.json
[23:25:30] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2193.codfw.wmnet with reason: Maintenance
[23:25:32] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[23:25:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2193.codfw.wmnet with reason: Maintenance
[23:25:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T367781)', diff saved to https://phabricator.wikimedia.org/P66012 and previous config saved to /var/cache/conftool/dbconfig/20240708-232549-arnaudb.json
[23:27:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T367856)', diff saved to https://phabricator.wikimedia.org/P66013 and previous config saved to /var/cache/conftool/dbconfig/20240708-232728-marostegui.json
[23:27:32] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[23:28:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T367781)', diff saved to
https://phabricator.wikimedia.org/P66014 and previous config saved to /var/cache/conftool/dbconfig/20240708-232803-arnaudb.json
[23:29:12] (CR) Dzahn: [C:+2] "I'll revert for now. This will need DNS changes and ATS config changes first." [puppet] - https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367) (owner: Stevemunene)
[23:29:53] (PS1) Dzahn: Revert "wdqs: microsites for wdqs graph split" [puppet] - https://gerrit.wikimedia.org/r/1052826
[23:30:48] (PS2) Dzahn: Revert "wdqs: microsites for wdqs graph split" [puppet] - https://gerrit.wikimedia.org/r/1052826 (https://phabricator.wikimedia.org/T364367)
[23:32:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:33:47] (CR) Dzahn: [C:+2] Revert "wdqs: microsites for wdqs graph split" [puppet] - https://gerrit.wikimedia.org/r/1052826 (https://phabricator.wikimedia.org/T364367) (owner: Dzahn)
[23:34:26] (CR) Dzahn: [V:+1 C:+2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1052826 (https://phabricator.wikimedia.org/T364367) (owner: Dzahn)
[23:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:38:23] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052827
[23:38:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052827 (owner: TrainBranchBot)
[23:42:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to
https://phabricator.wikimedia.org/P66015 and previous config saved to /var/cache/conftool/dbconfig/20240708-234235-marostegui.json
[23:42:44] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:43:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P66016 and previous config saved to /var/cache/conftool/dbconfig/20240708-234310-arnaudb.json
[23:52:10] !log fabfur@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-reboot (exit_code=1) rolling reboot on A:cp-text_esams
[23:57:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P66017 and previous config saved to /var/cache/conftool/dbconfig/20240708-235742-marostegui.json
[23:58:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P66018 and previous config saved to /var/cache/conftool/dbconfig/20240708-235817-arnaudb.json
[23:59:12] RESOLVED: [4x] ProbeDown: Service miscweb1003:443 has failed probes (http_query_main_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:59:33] ^ reverted a change to resolve those