[00:02:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054682 (owner: TrainBranchBot)
[00:05:09] (CR) Jdlrobson: [C:+1] skin-themes dblist is expanded to include tier 2 wikis as well as tier 1. [mediawiki-config] - https://gerrit.wikimedia.org/r/1054685 (https://phabricator.wikimedia.org/T367150) (owner: Kimberly Sarabia)
[00:13:23] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[00:26:23] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:40:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[00:42:46] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad
[00:42:49] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad
[00:54:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[00:58:17] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:10:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[01:13:39] RESOLVED: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1098-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[01:15:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[01:16:15] (PS7) Amire80: planet: add various feeds, reorganize [puppet] - https://gerrit.wikimedia.org/r/988001 (owner: EpicPupper)
[01:20:49] (PS1) Amire80: Add muddyb255 to Planet [puppet] - https://gerrit.wikimedia.org/r/1054688
[01:33:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:40:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[01:50:17] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 12.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:08:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:08:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[02:10:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:10:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[02:17:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[02:28:23] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:39:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[02:44:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[02:59:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:25] RESOLVED: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:09:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[03:10:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:14:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[03:18:23] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:40:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[03:52:16] (PS3) Ebrahim: Enable ICU provided alphabetical order in the Kurdish wikis categories [mediawiki-config] - https://gerrit.wikimedia.org/r/1054641 (https://phabricator.wikimedia.org/T48235)
[04:05:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:06:07] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:14:17] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 483.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:17:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[04:29:47] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:30:15] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 125, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:40:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[04:42:23] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:46:21] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29687 bytes in 7.797 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:04:17] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:05:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[05:10:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[05:13:40] jouncebot: next
[05:13:40] In 0 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T0600)
[05:14:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T370121
[05:14:17] T370121: Switchover s7 master (db1181 -> db1236) - https://phabricator.wikimedia.org/T370121
[05:14:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1236 with weight 0 T370121', diff saved to https://phabricator.wikimedia.org/P66700 and previous config saved to /var/cache/conftool/dbconfig/20240717-051419-root.json
[05:14:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T370121
[05:14:59] (CR) Marostegui: [C:+2] mariadb: Promote db1236 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1054413 (https://phabricator.wikimedia.org/T370121) (owner: Gerrit maintenance bot)
[05:17:04] (PS1) Marostegui: db1181: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1054698
[05:17:30] (CR) Marostegui: [C:+2] db1181: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1054698 (owner: Marostegui)
[05:19:54] (PS1) Abijeet Patro: TranslatablePageState: Check if banner namespaces are configured [extensions/Translate] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1054699 (https://phabricator.wikimedia.org/T370219)
[05:32:17] !log Starting s7 eqiad failover from db1181 to db1236 - T370121
[05:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:21] T370121: Switchover s7 master (db1181 -> db1236) - https://phabricator.wikimedia.org/T370121
[05:32:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T370121', diff saved to https://phabricator.wikimedia.org/P66701 and previous config saved to /var/cache/conftool/dbconfig/20240717-053230-root.json
[05:33:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1236 to s7 primary and set section read-write T370121', diff saved to https://phabricator.wikimedia.org/P66702 and previous config saved to /var/cache/conftool/dbconfig/20240717-053302-root.json
[05:33:22] (CR) Marostegui: [C:+2] wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1054414 (https://phabricator.wikimedia.org/T370121) (owner: Gerrit maintenance bot)
[05:34:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1181 T370121', diff saved to https://phabricator.wikimedia.org/P66703 and previous config saved to /var/cache/conftool/dbconfig/20240717-053359-marostegui.json
[05:34:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[05:35:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Long schema change
[05:35:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Long schema change
[05:36:35] !log Deploy schema change on s7 eqiad db1181 dbmaint T367856
[05:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:40] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:39:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[05:44:44] (CR) CI reject: [V:-1] TranslatablePageState: Check if banner namespaces are configured [extensions/Translate] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1054699 (https://phabricator.wikimedia.org/T370219) (owner: Abijeet Patro)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T0600)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:17:53] (PS1) Marostegui: filtered_tables.txt: Remove old columns [puppet] - https://gerrit.wikimedia.org/r/1054791 (https://phabricator.wikimedia.org/T86338)
[06:18:20] (CR) Marostegui: [C:+2] filtered_tables.txt: Remove old columns [puppet] - https://gerrit.wikimedia.org/r/1054791 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:21:28] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[06:22:01] (PS1) Marostegui: filtered_tables.txt: Remove dropped column [puppet] - https://gerrit.wikimedia.org/r/1054795 (https://phabricator.wikimedia.org/T85757)
[06:24:44] (CR) Marostegui: [C:+2] filtered_tables.txt: Remove dropped column [puppet] - https://gerrit.wikimedia.org/r/1054795 (https://phabricator.wikimedia.org/T85757) (owner: Marostegui)
[06:27:21] (CR) Abijeet Patro: "recheck" [extensions/Translate] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1054699 (https://phabricator.wikimedia.org/T370219) (owner: Abijeet Patro)
[06:30:31] (PS1) Marostegui: filtered_tables.txt: Remove old table [puppet] - https://gerrit.wikimedia.org/r/1054796 (https://phabricator.wikimedia.org/T54930)
[06:32:06] (CR) Marostegui: [C:+2] filtered_tables.txt: Remove old table [puppet] - https://gerrit.wikimedia.org/r/1054796 (https://phabricator.wikimedia.org/T54930) (owner: Marostegui)
[06:36:58] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Translate] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1054699 (https://phabricator.wikimedia.org/T370219) (owner: Abijeet Patro)
[06:38:16] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Translate] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1054699 (https://phabricator.wikimedia.org/T370219) (owner: Abijeet Patro)
[06:40:28] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[06:41:35] ops-eqiad, DBA, DC-Ops, Patch-For-Review: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855#9989119 (ABran-WMF) This server has been down for a few days, @wiki_willy please let me know if I can help
[06:41:41] (PS1) Marostegui: filtered_tables.txt: Remove non existing tables [puppet] - https://gerrit.wikimedia.org/r/1054797
[06:42:10] (CR) Marostegui: "Amir can you please confirm these tables are no longer?" [puppet] - https://gerrit.wikimedia.org/r/1054797 (owner: Marostegui)
[06:43:20] (CR) Ayounsi: [C:+1] "Awesome!" [cookbooks] - https://gerrit.wikimedia.org/r/1054618 (https://phabricator.wikimedia.org/T355750) (owner: Elukey)
[06:43:26] (PS2) Slyngshede: data.yaml: Extend MOU for dalezhou [puppet] - https://gerrit.wikimedia.org/r/1054427
[06:48:00] (CR) Slyngshede: [C:+2] data.yaml: Extend MOU for dalezhou [puppet] - https://gerrit.wikimedia.org/r/1054427 (owner: Slyngshede)
[06:48:11] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'clear' for AS: 17072
[06:48:51] (PS2) Abijeet Patro: TranslatablePageState: Check if banner namespaces are configured [extensions/Translate] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1054699 (https://phabricator.wikimedia.org/T370219)
[06:48:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'clear' for AS: 17072
[06:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:59:28] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[07:00:04] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T0700). nyaa~
[07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:49] * kart_ here. Will wait till CI is OK.
[07:01:29] o/
[07:01:59] abijeet: We will wait for CI to pass and then proceed.
[07:09:28] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[07:11:00] ETA: 2 min.
[07:16:53] (CR) Slyngshede: [V:+1 C:+2] C:idm configure 2FA proxy endpoint. [puppet] - https://gerrit.wikimedia.org/r/1054502 (owner: Slyngshede)
[07:17:28] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:19:24] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29689 bytes in 4.591 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:24:24] (CR) JMeybohm: [C:+2] New upstream version 3.11.3 [debs/helm3] - https://gerrit.wikimedia.org/r/1053934 (https://phabricator.wikimedia.org/T368251) (owner: JMeybohm)
[07:24:57] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [extensions/Translate] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1054699 (https://phabricator.wikimedia.org/T370219) (owner: Abijeet Patro)
[07:25:51] abijeet: we can finish lunch while waiting for CI :D
[07:25:57] Yup :-D
[07:27:28] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[07:33:45] (CR) Elukey: [C:+2] sre.network.tls: use a different client certificate to authenticate [cookbooks] - https://gerrit.wikimedia.org/r/1054618 (https://phabricator.wikimedia.org/T355750) (owner: Elukey)
[07:36:21] !log elukey@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d1-codfw
[07:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:37:37] !log imported helm3 3.11.3 to bullseye-wikimedia and buster-wikimedia
[07:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:42] !log elukey@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d1-codfw
[07:39:23] (PS1) Slyngshede: C:idm Typo in settings. [puppet] - https://gerrit.wikimedia.org/r/1054844
[07:40:06] (CR) Slyngshede: [C:+2] C:idm Typo in settings. [puppet] - https://gerrit.wikimedia.org/r/1054844 (owner: Slyngshede)
[07:40:28] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[07:44:28] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[07:46:59] (PS1) Elukey: reporter.py: fix warning log [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1054845 (https://phabricator.wikimedia.org/T367427)
[07:49:17] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[07:49:28] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[07:49:30] !log restart hadoop-mapreduce-historyserver.service on an-master1003 - failed for Java OOM
[07:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:43] Cc: btullis, stevemunene --^
[07:50:00] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[07:50:08] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[07:50:28] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[07:50:41] (Merged) jenkins-bot: TranslatablePageState: Check if banner namespaces are configured [extensions/Translate] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1054699 (https://phabricator.wikimedia.org/T370219) (owner: Abijeet Patro)
[07:51:23] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) (owner: Gmodena)
[07:51:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) (owner: Gmodena)
[07:51:34] !log kartik@deploy1002 Started scap sync-world: Backport for [[gerrit:1054699|TranslatablePageState: Check if banner namespaces are configured (T370219)]]
[07:51:38] T370219: DBQueryError marking page for translation: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T370219
[07:52:07] abijeet: I'll ping when patch is available to test on mwdebug servers.
[07:54:12] !log kartik@deploy1002 abi, kartik: Backport for [[gerrit:1054699|TranslatablePageState: Check if banner namespaces are configured (T370219)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:54:37] abijeet: can you test the patch?
[07:56:49] kart_, testing
[07:57:20] cool
[08:00:04] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T0800). nyaa~
[08:00:42] kart_, looks good
[08:00:52] Nice. Going ahead.
[08:00:57] !log kartik@deploy1002 abi, kartik: Continuing with sync
[08:03:29] elukey: Many thanks. Already being tracked at T369278
[08:03:29] T369278: MapReduce history server is repeatedly crashing - https://phabricator.wikimedia.org/T369278
[08:03:36] ack thanks!
[08:06:00] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1054699|TranslatablePageState: Check if banner namespaces are configured (T370219)]] (duration: 14m 26s)
[08:06:04] T370219: DBQueryError marking page for translation: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T370219
[08:06:18] woof. Finally.
[08:09:56] thanks kart_
[08:32:29] (CR) Filippo Giunchedi: [C:+2] data-engineering: disable promql/rate lint for MediawikiPageContentChangeEnrichAvailability [alerts] - https://gerrit.wikimedia.org/r/1054540 (https://phabricator.wikimedia.org/T354255) (owner: Filippo Giunchedi)
[08:32:30] (CR) Effie Mouzeli: [C:+1] changeprop: Disable pregeneration for mobile-sections [deployment-charts] - https://gerrit.wikimedia.org/r/1054512 (https://phabricator.wikimedia.org/T328036) (owner: Jgiannelos)
[08:33:06] (CR) Filippo Giunchedi: [C:+1] "Amazing! LGTM" [puppet] - https://gerrit.wikimedia.org/r/1054647 (https://phabricator.wikimedia.org/T359033) (owner: Bking)
[08:33:21] (CR) Lucas Werkmeister (WMDE): "I guess I can send an email to ops-l when this merges and CC some Cloud Services people (but I don’t think it’s relevant to cloud-announce" [puppet] - https://gerrit.wikimedia.org/r/1054603 (https://phabricator.wikimedia.org/T370171) (owner: Lucas Werkmeister (WMDE))
[08:34:25] jouncebot: now
[08:34:25] For the next 0 hour(s) and 25 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T0800)
[08:35:24] (CR) Lucas Werkmeister (WMDE): "I’ve added it to the Puppet request window [next Tuesday](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T1600) (I" [puppet] - https://gerrit.wikimedia.org/r/1054603 (https://phabricator.wikimedia.org/T370171) (owner: Lucas Werkmeister (WMDE))
[08:38:21] (CR) DCausse: [C:+1] elasticsearch: remove obsolete alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1054647 (https://phabricator.wikimedia.org/T359033) (owner: Bking)
[08:43:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66704 and previous config saved to /var/cache/conftool/dbconfig/20240717-084351-root.json
[08:44:05] (PS1) Marostegui: Revert "db1181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1054847
[08:44:33] (CR) Marostegui: [C:+2] Revert "db1181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1054847 (owner: Marostegui)
[08:47:17] !log elukey@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[08:48:05] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4037.ulsfo.wmnet
[08:52:33] (PS1) Slyngshede: MediaWiki: Exception handling of MediaWiki API requests. [software/bitu] - https://gerrit.wikimedia.org/r/1054848
[08:53:33] (CR) Stevemunene: "Ack, many thanks Daniel" [puppet] - https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367) (owner: Stevemunene)
[08:57:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4037.ulsfo.wmnet
[08:57:16] (CR) Slyngshede: [C:+2] MediaWiki: Exception handling of MediaWiki API requests. [software/bitu] - https://gerrit.wikimedia.org/r/1054848 (owner: Slyngshede)
[08:58:27] (PS1) Peter Fischer: Search update pipeline: let wikidatawiki bypass optimization (deduplication) [deployment-charts] - https://gerrit.wikimedia.org/r/1054850 (https://phabricator.wikimedia.org/T365831)
[08:58:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66705 and previous config saved to /var/cache/conftool/dbconfig/20240717-085857-root.json
[09:02:02] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[09:05:30] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:08:12] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cr1-magru
[09:08:24] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29641 bytes in 3.698 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:08:46] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9989347 (ayounsi) We will need to migrate the whole range to a new prefix :( Running 2 ranges is going to be a pain long term, and would n...
[09:13:01] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-magru [09:14:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66706 and previous config saved to /var/cache/conftool/dbconfig/20240717-091402-root.json [09:14:06] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cr2-magru [09:17:10] (03PS1) 10David Caro: p:toolforge::bastion: cleanup tekton repo [puppet] - 10https://gerrit.wikimedia.org/r/1054852 [09:17:10] (03PS1) 10David Caro: p:toolforge::bastion: add helm [puppet] - 10https://gerrit.wikimedia.org/r/1054853 [09:17:39] (03CR) 10CI reject: [V:04-1] p:toolforge::bastion: add helm [puppet] - 10https://gerrit.wikimedia.org/r/1054853 (owner: 10David Caro) [09:18:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:18:39] (03PS2) 10David Caro: p:toolforge::bastion: add helm [puppet] - 10https://gerrit.wikimedia.org/r/1054853 [09:18:48] (03CR) 10David Caro: [C:03+2] p:toolforge::bastion: cleanup tekton repo [puppet] - 10https://gerrit.wikimedia.org/r/1054852 (owner: 10David Caro) [09:18:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-magru [09:23:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:24:47] (03PS3) 10David Caro: p:toolforge::bastion: add helm [puppet] - 10https://gerrit.wikimedia.org/r/1054853 [09:27:47] (03CR) 10Vgutierrez: Add public suffix 
list module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [09:29:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66708 and previous config saved to /var/cache/conftool/dbconfig/20240717-092907-root.json [09:29:15] 06SRE, 06Infrastructure-Foundations, 10netops: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9989379 (10ayounsi) @cmooney @fgiunchedi I'm wondering if the probe could/should be changed to a TCP handshake only or totally... [09:29:33] (03PS4) 10David Caro: p:toolforge::bastion: add helm [puppet] - 10https://gerrit.wikimedia.org/r/1054853 [09:33:57] (03CR) 10Vgutierrez: [C:04-1] "VCL still needs to be fixed." [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [09:35:19] (03CR) 10Ayounsi: [C:03+1] "lgtm, if a python expert can have a look it would be ideal, but fine to merge it as it." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [09:41:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091#9989407 (10Clement_Goubert) a:05KStineRowe_WMF→03Clement_Goubert [09:41:20] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091#9989408 (10Clement_Goubert) [09:41:43] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9989405 (10Clement_Goubert) @Milimetric Can you sign L3 so we can move forward with this? 
[09:44:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66709 and previous config saved to /var/cache/conftool/dbconfig/20240717-094412-root.json [09:45:54] (03CR) 10Urbanecm: "will be deployed in a couple of hours" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053884 (https://phabricator.wikimedia.org/T366458) (owner: 10Urbanecm) [09:45:57] (03PS2) 10Urbanecm: CommunityConfiguration: Release to all Growth wikis, except frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053884 (https://phabricator.wikimedia.org/T366458) [09:46:15] 06SRE, 06Infrastructure-Foundations, 10netops: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9989413 (10cmooney) >>! In T368513#9938867, @fgiunchedi wrote: > Those are SSH probes from local prometheus hosts indeed, in t... [09:56:04] (03PS5) 10David Caro: p:toolforge::bastion: add helm [puppet] - 10https://gerrit.wikimedia.org/r/1054853 [09:56:04] (03PS1) 10David Caro: thirdparty/helm3: update the k8s version to latest [puppet] - 10https://gerrit.wikimedia.org/r/1054856 (https://phabricator.wikimedia.org/T370252) [09:58:07] (03PS1) 10Clément Goubert: admin: add quiddity to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1054857 (https://phabricator.wikimedia.org/T370091) [09:58:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:58:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:58:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T367781)', diff saved to https://phabricator.wikimedia.org/P66710 and previous config saved to /var/cache/conftool/dbconfig/20240717-095845-arnaudb.json [09:58:52] T367781: Drop deprecated abuse 
filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1000) [10:00:18] (03CR) 10Slavina Stefanova: [C:03+1] thirdparty/helm3: update the k8s version to latest [puppet] - 10https://gerrit.wikimedia.org/r/1054856 (https://phabricator.wikimedia.org/T370252) (owner: 10David Caro) [10:01:59] (03CR) 10David Caro: [C:03+2] thirdparty/helm3: update the k8s version to latest [puppet] - 10https://gerrit.wikimedia.org/r/1054856 (https://phabricator.wikimedia.org/T370252) (owner: 10David Caro) [10:04:41] (03CR) 10Clément Goubert: [C:03+2] parsoid::testing: remove unused file [puppet] - 10https://gerrit.wikimedia.org/r/1054607 (owner: 10Clément Goubert) [10:08:46] (03PS1) 10Volans: netbox: refactor tests to be more flexible [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054860 [10:11:36] (03CR) 10Volans: "This is my proposal for test refactoring that should enable the Netbox 4 migration in an easy way." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054860 (owner: 10Volans) [10:11:57] (03PS2) 10Volans: netbox: refactor tests to be more flexible [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054860 [10:21:15] (03CR) 10David Caro: "Tested in toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/1054853 (owner: 10David Caro) [10:21:30] (03CR) 10David Caro: [C:03+2] p:toolforge::bastion: add helm [puppet] - 10https://gerrit.wikimedia.org/r/1054853 (owner: 10David Caro) [10:29:33] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [10:29:39] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device asw1-b3-magru [10:30:12] (03CR) 10Ayounsi: [C:03+1] "thanks !" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1054860 (owner: 10Volans) [10:32:11] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru [10:32:13] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device asw1-b4-magru [10:34:45] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b4-magru [10:34:47] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device ssw1-d1-codfw [10:37:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-d1-codfw [10:37:11] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device ssw1-d8-codfw [10:37:42] (03CR) 10Hashar: [C:03+1] delete integration.mediawiki.org [dns] - 10https://gerrit.wikimedia.org/r/1054646 (https://phabricator.wikimedia.org/T361250) (owner: 10Dzahn) [10:39:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-d8-codfw [10:39:34] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-c1-codfw [10:41:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c1-codfw [10:41:56] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-c2-codfw [10:42:30] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1054857 (https://phabricator.wikimedia.org/T370091) (owner: 10Clément Goubert) [10:42:49] (03CR) 10Effie Mouzeli: [C:03+1] admin: add quiddity to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1054857 (https://phabricator.wikimedia.org/T370091) (owner: 10Clément Goubert) [10:43:36] (03CR) 10Clément Goubert: [C:03+2] admin: add quiddity to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1054857 (https://phabricator.wikimedia.org/T370091) (owner: 10Clément Goubert) [10:44:17] !log 
cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c2-codfw [10:44:20] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-c3-codfw [10:44:42] 06SRE, 10SRE-swift-storage: podman-auto-update failures - https://phabricator.wikimedia.org/T370255 (10MatthewVernon) 03NEW [10:46:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c3-codfw [10:46:42] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-c4-codfw [10:47:31] (03PS15) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [10:47:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:47:52] (03CR) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [10:48:34] * volans looking [10:49:04] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c4-codfw [10:49:06] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-c5-codfw [10:51:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c5-codfw [10:51:29] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-c6-codfw [10:53:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c6-codfw [10:53:53] !log cmooney@cumin1002 START - Cookbook 
sre.network.tls for network device lsw1-c7-codfw [10:54:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T367856)', diff saved to https://phabricator.wikimedia.org/P66711 and previous config saved to /var/cache/conftool/dbconfig/20240717-105411-marostegui.json [10:54:16] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:56:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c7-codfw [10:56:16] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d2-codfw [10:56:23] it all started around 9:35 UTC [10:56:41] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091#9989556 (10Clement_Goubert) 05In progress→03Resolved I have merged the access change, puppet... [10:58:15] effie: around? [10:58:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d2-codfw [10:58:39] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d3-codfw [10:58:42] 06SRE, 10SRE-swift-storage: podman-auto-update failures - https://phabricator.wikimedia.org/T370255#9989564 (10MatthewVernon) The problem is that there's a (short-lived) container that exists when podman-auto-update starts, and is removed whilst podman-auto-update is running. I.e. this is a race condition. P66... 
[10:59:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T367781)', diff saved to https://phabricator.wikimedia.org/P66712 and previous config saved to /var/cache/conftool/dbconfig/20240717-105904-arnaudb.json [10:59:09] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [10:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:05] urbanecm, MichaelG_WMF, and sergi0: May I have your attention please! Deploy CommunityConfiguration to all Wikipedias. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1100) [11:00:07] logs of the related issue https://logstash.wikimedia.org/goto/9c7086e0ce2a253a2ad35eb088a89960 [11:00:10] o/ [11:01:01] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d3-codfw [11:01:03] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d4-codfw [11:01:36] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [11:02:12] * volans moving debugging to -sre [11:02:30] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29641 bytes in 4.478 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [11:02:34] * MichaelG_WMF is here as well [11:02:40] volans: just in case, is there any problem with starting the window i have scheduled? or should we wait? 
[11:02:44] (03PS1) 10MVernon: cephadm::target mask the podman-auto-update service [puppet] - 10https://gerrit.wikimedia.org/r/1054864 (https://phabricator.wikimedia.org/T370255) [11:03:06] urbanecm: what is that modifying? we have some issue limited to mw-api-ext [11:03:13] thanks for asking [11:03:24] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d4-codfw [11:03:24] volans: it enables a new MediaWiki extension at (almost) all Wikipedias [11:03:26] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d5-codfw [11:03:44] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054864 (https://phabricator.wikimedia.org/T370255) (owner: 10MVernon) [11:03:52] affecting also API I guess right? [11:03:57] correct [11:04:06] (assuming the MW container needs to be there too) [11:04:12] then if possible maybe wait a bit that we try to debug [11:04:48] ack, i'll wait for a green light then [11:05:08] thanks, sorry about that [11:05:47] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d5-codfw [11:05:49] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d6-codfw [11:06:52] (03CR) 10MVernon: "PCC failure is because T366387 is still unfixed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1054864 (https://phabricator.wikimedia.org/T370255) (owner: 10MVernon) [11:08:11] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d6-codfw [11:08:13] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d7-codfw [11:08:15] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9989577 (10cmooney) 05Open→03Resolved [11:08:50] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054866 [11:09:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P66713 and previous config saved to /var/cache/conftool/dbconfig/20240717-110918-marostegui.json [11:10:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d7-codfw [11:10:37] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d8-codfw [11:10:51] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054867 [11:12:19] (03CR) 10NMW03: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) (owner: 10NMW03) [11:12:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d8-codfw [11:13:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) (owner: 10NMW03) [11:14:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to 
https://phabricator.wikimedia.org/P66714 and previous config saved to /var/cache/conftool/dbconfig/20240717-111412-arnaudb.json [11:15:24] (03PS2) 10NMW03: Add Portal namespace for Ingush Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) [11:15:32] (03PS3) 10NMW03: Add Portal namespace for Ingush Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) [11:16:14] (03CR) 10CI reject: [V:04-1] Add Portal namespace for Ingush Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) (owner: 10NMW03) [11:17:43] (03PS4) 10NMW03: Add Portal namespace for Ingush Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) [11:22:20] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 623.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:22:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw2432.codfw.wmnet with reason: RAID conversion testing [11:23:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw2432.codfw.wmnet with reason: RAID conversion testing [11:24:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P66715 and previous config saved to /var/cache/conftool/dbconfig/20240717-112425-marostegui.json [11:24:36] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [11:27:44] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2432 [11:29:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P66716 and previous 
config saved to /var/cache/conftool/dbconfig/20240717-112919-arnaudb.json [11:29:23] ACKNOWLEDGEMENT - MD RAID on mw2432 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T370258 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:30:14] That's me ^ [11:30:36] it should have been downtimed though [11:31:11] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#9989694 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=db2972bf-cd24-4ee8-ba43-a5d1d6710956) set by cgoubert@cumin1002 for 7 days, 0:00:00... [11:31:20] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 28.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:32:26] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2432 - https://phabricator.wikimedia.org/T370258 (10ops-monitoring-bot) 03NEW [11:37:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:38:58] <_joe_> !log deleted pod that was reportedly returning 5xx to the cdn for mw-api-ext [11:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T367856)', diff saved to https://phabricator.wikimedia.org/P66717 and previous config saved to /var/cache/conftool/dbconfig/20240717-113932-marostegui.json [11:39:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [11:39:37] T367856: Cleanup 
revision table schema - https://phabricator.wikimedia.org/T367856 [11:39:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [11:39:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T367856)', diff saved to https://phabricator.wikimedia.org/P66718 and previous config saved to /var/cache/conftool/dbconfig/20240717-113954-marostegui.json [11:40:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase db2136's weight - testing 10.11 T365805', diff saved to https://phabricator.wikimedia.org/P66719 and previous config saved to /var/cache/conftool/dbconfig/20240717-114032-marostegui.json [11:40:37] T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805 [11:42:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:43:54] <_joe_> ok, good [11:44:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T367781)', diff saved to https://phabricator.wikimedia.org/P66720 and previous config saved to /var/cache/conftool/dbconfig/20240717-114426-arnaudb.json [11:44:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:44:31] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [11:44:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:44:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1174.eqiad.wmnet with 
reason: Maintenance [11:45:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:45:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T367781)', diff saved to https://phabricator.wikimedia.org/P66721 and previous config saved to /var/cache/conftool/dbconfig/20240717-114510-arnaudb.json [11:46:23] volans: effie: (moving back from -sre) since there are no objections, and volans was ok with the MW deployment starting, about to proceed unless someone says i shouldn't [11:46:49] ack, all recovered, go ahead [11:47:06] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [11:47:53] (03CR) 10Urbanecm: [C:03+2] CommunityConfiguration: Release to all Growth wikis, except frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053884 (https://phabricator.wikimedia.org/T366458) (owner: 10Urbanecm) [11:48:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053884 (https://phabricator.wikimedia.org/T366458) (owner: 10Urbanecm) [11:48:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T367781)', diff saved to https://phabricator.wikimedia.org/P66722 and previous config saved to /var/cache/conftool/dbconfig/20240717-114820-arnaudb.json [11:48:35] (03Merged) 10jenkins-bot: CommunityConfiguration: Release to all Growth wikis, except frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053884 (https://phabricator.wikimedia.org/T366458) (owner: 10Urbanecm) [11:48:47] * MichaelG_WMF is still here and looking forward to seeing this going ahead :) [11:49:00] MichaelG_WMF: me too! 
the most exciting part of working on something new :) [11:49:06] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1053884|CommunityConfiguration: Release to all Growth wikis, except frwiktionary (T366458)]] [11:49:12] T366458: CommunityConfiguration: Release extension to all Wikipedias with GrowthExperiments - https://phabricator.wikimedia.org/T366458 [11:49:24] (03PS1) 10Wangombe: Update reference to ElasticSearchTtmServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054869 (https://phabricator.wikimedia.org/T335342) [11:51:38] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1053884|CommunityConfiguration: Release to all Growth wikis, except frwiktionary (T366458)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:51:49] okay, we're at mwdebug [11:51:51] running the script [11:52:36] !log [urbanecm@mwdebug1001 ~]$ foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/migrateCommunityConfig.php # T366458; output logged to migrateCommunityConfig.log in my home [11:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:33] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [11:57:09] MichaelG_WMF: i'm spot checking the individual wikis, so far so good [11:57:28] urbanecm: YaY! [11:57:36] https://guc.toolforge.org/?by=date&user=Maintenance+script is probably the best approximation for watching what the maint script did so far [11:57:39] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2432 [11:58:52] urbanecm: I'm keeping an eye on logstash. So far all is clear [11:59:05] thanks! [12:01:26] dewiki seems to have configurable structured add a link, which is not expected. 
sergi0 is looking into that [12:03:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P66723 and previous config saved to /var/cache/conftool/dbconfig/20240717-120327-arnaudb.json [12:12:00] (03CR) 10JMeybohm: [C:03+1] reporter.py: fix warning log [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1054845 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [12:16:14] (03PS1) 10Urbanecm: dewiki: Disable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054874 (https://phabricator.wikimedia.org/T366458) [12:16:36] MichaelG_WMF: fyi ^^ [12:17:47] as far as I can tell, add a link (structured) is disabled in the backend/serverside of dewiki. Is that not correct? [12:18:10] MichaelG_WMF: user-facing, yes. backend-facing, it is enabled. and Special:CommunityConfiguration currently exposes it as "enabled", which is not true at all. [12:18:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P66725 and previous config saved to /var/cache/conftool/dbconfig/20240717-121834-arnaudb.json [12:18:36] gotcha [12:18:51] !log migrateCommunityConfig.php finished, logs are available at https://phabricator.wikimedia.org/P66724 [12:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:08] urbanecm: seems we still don't have our feature flags straight then :/ [12:19:11] !log (relogging to attach to the task) migrateCommunityConfig.php finished, logs are available at https://phabricator.wikimedia.org/P66724 (T366458) [12:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:15] T366458: CommunityConfiguration: Release extension to all Wikipedias with GrowthExperiments - https://phabricator.wikimedia.org/T366458 [12:19:23] (03PS2) 10D3r1ck01: [SUL3] Enable SUL3 on Beta Cluster for testing 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054863 (https://phabricator.wikimedia.org/T370254) [12:19:25] MichaelG_WMF: yep. sergi0 is filing followup(s) in Phab now [12:19:38] (we're both in the meeting room, but not mandatory to join :)) [12:19:44] !log urbanecm@deploy1002 Sync cancelled. [12:19:56] (03CR) 10Urbanecm: [C:03+2] dewiki: Disable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054874 (https://phabricator.wikimedia.org/T366458) (owner: 10Urbanecm) [12:20:33] (03Merged) 10jenkins-bot: dewiki: Disable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054874 (https://phabricator.wikimedia.org/T366458) (owner: 10Urbanecm) [12:21:10] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1053884|CommunityConfiguration: Release to all Growth wikis, except frwiktionary (T366458)]], [[gerrit:1054874|dewiki: Disable CommunityConfiguration (T366458)]] [12:23:42] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1053884|CommunityConfiguration: Release to all Growth wikis, except frwiktionary (T366458)]], [[gerrit:1054874|dewiki: Disable CommunityConfiguration (T366458)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:24:44] !log urbanecm@deploy1002 urbanecm: Continuing with sync [12:25:04] tested it is not deployed to de, works everywhere else, proceeding [12:27:10] (03CR) 10Urbanecm: [C:03+2] "follow-up work filed as T370261" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054874 (https://phabricator.wikimedia.org/T366458) (owner: 10Urbanecm) [12:29:41] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1053884|CommunityConfiguration: Release to all Growth wikis, except frwiktionary (T366458)]], [[gerrit:1054874|dewiki: Disable CommunityConfiguration (T366458)]] (duration: 08m 30s) [12:29:45] T366458: CommunityConfiguration: Release extension to all Wikipedias with GrowthExperiments - 
https://phabricator.wikimedia.org/T366458 [12:30:28] (03CR) 10Giuseppe Lavagetto: [C:03+2] varnish: add requestctl filters for cache hits [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: 10Giuseppe Lavagetto) [12:31:11] !log Community configuration deployment finished [12:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T367781)', diff saved to https://phabricator.wikimedia.org/P66728 and previous config saved to /var/cache/conftool/dbconfig/20240717-123341-arnaudb.json [12:33:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1181.eqiad.wmnet with reason: Maintenance [12:33:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1181.eqiad.wmnet with reason: Maintenance [12:33:46] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [12:33:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T367781)', diff saved to https://phabricator.wikimedia.org/P66729 and previous config saved to /var/cache/conftool/dbconfig/20240717-123352-arnaudb.json [12:35:29] (03CR) 10Elukey: [C:03+1] netbox: refactor tests to be more flexible [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054860 (owner: 10Volans) [12:35:44] (03CR) 10Elukey: [C:03+2] reporter.py: fix warning log [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1054845 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [12:36:31] (03Merged) 10jenkins-bot: reporter.py: fix warning log [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1054845 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [12:37:38] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikitech-static [12:39:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29641 bytes in 3.825 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [12:41:48] (03CR) 10Volans: [C:03+2] netbox: refactor tests to be more flexible [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054860 (owner: 10Volans) [12:41:52] (03CR) 10Giuseppe Lavagetto: [C:03+2] varnish: add support for hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054082 (https://phabricator.wikimedia.org/T369480) (owner: 10Giuseppe Lavagetto) [12:43:33] (03PS15) 10Hashar: git: remove umask from git::clone [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) [12:43:46] (03CR) 10Hashar: git: remove umask from git::clone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [12:47:44] (03Merged) 10jenkins-bot: netbox: refactor tests to be more flexible [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054860 (owner: 10Volans) [12:49:13] (03CR) 10Giuseppe Lavagetto: [C:03+2] varnish: actually include the requestctl hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054083 (https://phabricator.wikimedia.org/T369480) (owner: 10Giuseppe Lavagetto) [12:54:25] (03CR) 10Hashar: "The Puppet catalogue compilation is at https://puppet-compiler.wmflabs.org/output/927986/1397/" [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [12:58:06] PROBLEM - Confd vcl based reload on cp7008 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [12:59:19] FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1300). [13:00:05] gmodena and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] o/ [13:00:15] I probably can’t deploy today, sorry [13:00:37] i can deploy today [13:00:42] hey Lucas_WMDE :) [13:00:47] gmodena: are you around? [13:01:14] hi urbanecm :) [13:01:20] urbanecm o/ [13:01:22] (03PS5) 10NMW03: Add Portal namespace for Ingush Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) [13:01:30] (03CR) 10Urbanecm: [C:03+2] Add Portal namespace for Ingush Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) (owner: 10NMW03) [13:01:32] (03PS5) 10Gmodena: eventbus: enable instrumentation on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) [13:01:42] (03CR) 10Urbanecm: [C:03+2] eventbus: enable instrumentation on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [13:02:11] (03Merged) 10jenkins-bot: Add Portal namespace for Ingush Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) (owner: 10NMW03) [13:02:23] 
(03Merged) 10jenkins-bot: eventbus: enable instrumentation on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [13:02:58] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1054084|Add Portal namespace for Ingush Wikipedia (T326089)]], [[gerrit:1054357|eventbus: enable instrumentation on group 0 (T363587)]] [13:03:04] T326089: Creating a namespace for portals in InhWiki - https://phabricator.wikimedia.org/T326089 [13:03:04] T363587: [Event Platform] Instrument EventBus with prometheus MW Statslib - https://phabricator.wikimedia.org/T363587 [13:03:08] RECOVERY - Confd vcl based reload on cp7008 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:04:26] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:33] (03PS1) 10Giuseppe Lavagetto: Revert "varnish: actually include the requestctl hit rules" [puppet] - 10https://gerrit.wikimedia.org/r/1054877 [13:05:05] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Revert "varnish: actually include the requestctl hit rules" [puppet] - 10https://gerrit.wikimedia.org/r/1054877 (owner: 10Giuseppe Lavagetto) [13:05:28] !log urbanecm@deploy1002 nmw03, gmodena, urbanecm: Backport for [[gerrit:1054084|Add Portal namespace for Ingush Wikipedia (T326089)]], [[gerrit:1054357|eventbus: enable instrumentation on group 0 (T363587)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:05:44] gmodena: Nemoralis: can you take a look and test via mwdebug? 
[13:05:47] (03PS11) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [13:05:47] (03PS4) 10Ayounsi: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 [13:05:47] (03PS8) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [13:05:58] urbanecm on it [13:06:01] urbanecm: sure [13:06:38] LGTM, https://inh.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&formatversion=2 [13:06:52] (03PS1) 10Giuseppe Lavagetto: Revert^2 "varnish: actually include the requestctl hit rules" [puppet] - 10https://gerrit.wikimedia.org/r/1054878 [13:07:04] thanks [13:07:23] !log [intentional] stop nginx.service on durum1001 [13:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:57] urbanecm lgtm [13:08:02] thanks [13:08:09] !log urbanecm@deploy1002 nmw03, gmodena, urbanecm: Continuing with sync [13:09:50] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:09:54] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:09:55] ^ expected [13:10:00] (03PS12) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [13:10:00] (03PS5) 10Ayounsi: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 [13:10:00] (03PS9) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [13:10:09] (03PS16) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [13:10:22] PROBLEM - Bird Internet Routing Daemon on durum1001 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:10:24] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum1001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:10:29] (03CR) 10CI reject: [V:04-1] mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [13:11:22] RECOVERY - Bird Internet Routing Daemon on durum1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:11:24] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum1001 is OK: OK: UP (pid=1600754) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:11:52] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:11:52] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:12:54] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=azwiki --all --verbose # T370262 [13:12:56] (03PS1) 10Elukey: Release version 0.5.0-1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) [13:12:57] Nemoralis: fyi ^^ [13:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:59] T370262: Run revalidateLinkRecommendations script for azwiki - https://phabricator.wikimedia.org/T370262 [13:13:04] !log urbanecm@deploy1002 Finished scap: Backport for 
[[gerrit:1054084|Add Portal namespace for Ingush Wikipedia (T326089)]], [[gerrit:1054357|eventbus: enable instrumentation on group 0 (T363587)]] (duration: 10m 06s) [13:13:12] T326089: Creating a namespace for portals in InhWiki - https://phabricator.wikimedia.org/T326089 [13:13:12] thanks! urbanecm [13:13:12] T363587: [Event Platform] Instrument EventBus with prometheus MW Statslib - https://phabricator.wikimedia.org/T363587 [13:13:20] Nemoralis: gmodena: and both deployed :) [13:13:21] anything else? [13:14:08] urbanecm awesome! Thanks a lot for the help [13:14:12] no problem [13:16:07] (03CR) 10CI reject: [V:04-1] Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:16:26] (03CR) 10CI reject: [V:04-1] Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [13:16:28] (03CR) 10CI reject: [V:04-1] Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [13:19:01] (03CR) 10Giuseppe Lavagetto: [C:03+2] Revert^2 "varnish: actually include the requestctl hit rules" [puppet] - 10https://gerrit.wikimedia.org/r/1054878 (owner: 10Giuseppe Lavagetto) [13:19:40] !log Stop revalidateLinkRecommendation for azwiki; restart as `[urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=azwiki --olderThan=20240104000000 --verbose` instead (T370262) [13:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:44] T370262: Run revalidateLinkRecommendations script for azwiki - https://phabricator.wikimedia.org/T370262 [13:23:56] * Lucas_WMDE around now if needed [13:25:45] Lucas_WMDE: window's done :) [13:26:02] (03PS21) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) 
[13:26:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbproxy2008.codfw.wmnet with OS bookworm [13:26:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9990010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm [13:26:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [13:27:06] 06SRE, 06Infrastructure-Foundations, 10netops: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9990011 (10fgiunchedi) So I looked where the probes come from, and they are part of the generic "probe mgmt network hosts for... [13:27:20] (03PS22) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [13:28:35] jouncebot: now [13:28:35] For the next 0 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1300) [13:29:57] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2432 [13:30:28] urbanecm: Lucas_WMDE may I and nemo-yiannis use the rest of your window? [13:30:42] effie: sure, go for it :) [13:30:47] cheers [13:31:14] (03CR) 10Jgiannelos: [C:03+2] changeprop: Disable pregeneration for mobile-sections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054512 (https://phabricator.wikimedia.org/T328036) (owner: 10Jgiannelos) [13:31:45] sure! 
[13:31:51] jouncebot: next [13:31:51] In 0 hour(s) and 28 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1400) [13:32:26] ok there’s a wikifunctions window afterwards [13:32:31] (03Merged) 10jenkins-bot: changeprop: Disable pregeneration for mobile-sections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054512 (https://phabricator.wikimedia.org/T328036) (owner: 10Jgiannelos) [13:32:47] I might do some experiments at https://phabricator.wikimedia.org/T368523 after that… we’ll see [13:33:24] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [13:33:26] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [13:34:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T367781)', diff saved to https://phabricator.wikimedia.org/P66730 and previous config saved to /var/cache/conftool/dbconfig/20240717-133408-arnaudb.json [13:34:13] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:34:38] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [13:34:56] (03CR) 10Gergő Tisza: [SUL3] Enable SUL3 on Beta Cluster for testing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054863 (https://phabricator.wikimedia.org/T370254) (owner: 10D3r1ck01) [13:36:01] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dbproxy2008.codfw.wmnet with OS bookworm [13:36:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9990040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm executed with errors... 
[13:37:18] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [13:40:20] (03PS1) 10Ssingh: team-traffic: add alerting for when anycast-healthchecker is restarted [alerts] - 10https://gerrit.wikimedia.org/r/1054881 [13:40:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [13:43:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbproxy2008.codfw.wmnet with OS bookworm [13:43:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9990082 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm [13:43:41] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2432 [13:49:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P66732 and previous config saved to /var/cache/conftool/dbconfig/20240717-134916-arnaudb.json [13:51:56] (03PS1) 10Elukey: admin_ng: allow more cpus for ml-serve's revscoring-articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054882 [13:53:00] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1054864 (https://phabricator.wikimedia.org/T370255) (owner: 10MVernon) [13:53:41] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [13:53:43] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2432 [13:53:51] (03CR) 10Arnaudb: "🤿" [puppet] - 10https://gerrit.wikimedia.org/r/1054864 (https://phabricator.wikimedia.org/T370255) (owner: 10MVernon) [13:54:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [13:54:37] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2432 [13:55:45] (03CR) 
10D3r1ck01: [SUL3] Enable SUL3 on Beta Cluster for testing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054863 (https://phabricator.wikimedia.org/T370254) (owner: 10D3r1ck01) [13:55:46] (03PS3) 10D3r1ck01: [SUL3] Enable SUL3 on Beta Cluster for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054863 (https://phabricator.wikimedia.org/T370254) [13:56:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [13:56:28] (03CR) 10Ilias Sarantopoulos: [C:03+1] admin_ng: allow more cpus for ml-serve's revscoring-articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054882 (owner: 10Elukey) [13:57:57] (03CR) 10Filippo Giunchedi: [C:03+1] team-traffic: add alerting for when anycast-healthchecker is restarted [alerts] - 10https://gerrit.wikimedia.org/r/1054881 (owner: 10Ssingh) [13:58:36] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 13Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794#9990341 (10Joe) 05Open→03Resolved [13:58:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T367856)', diff saved to https://phabricator.wikimedia.org/P66733 and previous config saved to /var/cache/conftool/dbconfig/20240717-135854-marostegui.json [13:58:59] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:59:02] (03CR) 10Filippo Giunchedi: mysqld-exporter: hotfix config for es1 to es5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [13:59:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2008.codfw.wmnet with reason: host reimage [13:59:33] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2432 [14:00:05] Deploy window Wikifunctions Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1400) [14:00:26] jouncebot: nowandnext [14:00:26] For the next 0 hour(s) and 59 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1400) [14:00:27] In 2 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1700) [14:00:40] OK, it didn't forget about me. [14:00:52] But maybe I missed the announcement when IRC flaked. [14:01:14] (03PS1) 10Jgiannelos: Revert "changeprop: Disable pregeneration for mobile-sections" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054885 [14:02:03] (03PS1) 10Arnaudb: mariadb: observability - adds shard information on recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1054884 (https://phabricator.wikimedia.org/T367283) [14:03:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2008.codfw.wmnet with reason: host reimage [14:03:32] (03PS23) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [14:03:46] (03CR) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [14:04:10] (03CR) 10Jgiannelos: [C:03+2] Revert "changeprop: Disable pregeneration for mobile-sections" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054885 (owner: 10Jgiannelos) [14:04:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P66734 and previous config saved to /var/cache/conftool/dbconfig/20240717-140423-arnaudb.json [14:05:14] (03CR) 10Klausman: [C:03+1] admin_ng: allow more cpus for ml-serve's revscoring-articlequality [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1054882 (owner: 10Elukey) [14:05:25] (03Merged) 10jenkins-bot: Revert "changeprop: Disable pregeneration for mobile-sections" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054885 (owner: 10Jgiannelos) [14:06:10] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9990408 (10elukey) [14:06:40] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9990413 (10elukey) 05Open→03Resolved Spicerack 8.7.0 was released by me, we made it :) [14:06:48] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [14:07:03] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [14:07:24] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-07-09-155027 to 2024-07-17-140123 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054886 (https://phabricator.wikimedia.org/T364413) [14:07:35] (03CR) 10Ssingh: [C:03+2] team-traffic: add alerting for when anycast-healthchecker is restarted [alerts] - 10https://gerrit.wikimedia.org/r/1054881 (owner: 10Ssingh) [14:08:25] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-07-09-155027 to 2024-07-17-140123 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054886 (https://phabricator.wikimedia.org/T364413) (owner: 10Jforrester) [14:08:44] (03CR) 10Volans: "as per IRC discussion" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:09:33] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-07-09-155027 to 2024-07-17-140123 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054886 (https://phabricator.wikimedia.org/T364413) (owner: 10Jforrester) [14:10:53] (03CR) 10Arnaudb: "> I looked at the expression with the current 
thresholds and I doubt as it stands there is enough signal (and/or the alerts to fire at all" [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [14:11:19] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:11:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [14:11:50] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:12:22] (03Abandoned) 10Arnaudb: mariadb: monitoring memory pressure [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [14:14:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P66735 and previous config saved to /var/cache/conftool/dbconfig/20240717-141401-marostegui.json [14:14:35] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2432 [14:15:56] (03CR) 10Gergő Tisza: [C:03+1] [SUL3] Enable SUL3 on Beta Cluster for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054863 (https://phabricator.wikimedia.org/T370254) (owner: 10D3r1ck01) [14:16:10] (03PS4) 10Gergő Tisza: [beta] Enable SUL3 on Beta Cluster for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054863 (https://phabricator.wikimedia.org/T370254) (owner: 10D3r1ck01) [14:16:13] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:16:55] !log [durum3003] upgrade anycast-healthchecker to 0.9.8-1+wmf12u1: T370068 [14:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:59] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [14:17:05] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:17:10] !log 
jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:17:56] (03PS1) 10Effie Mouzeli: changeprop: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054888 [14:17:59] (03PS1) 10Hashar: statistics: remove git::clone file mode [puppet] - 10https://gerrit.wikimedia.org/r/1054889 (https://phabricator.wikimedia.org/T338277) [14:18:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:19:29] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:19:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T367781)', diff saved to https://phabricator.wikimedia.org/P66736 and previous config saved to /var/cache/conftool/dbconfig/20240717-141929-arnaudb.json [14:19:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1191.eqiad.wmnet with reason: Maintenance [14:19:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1191.eqiad.wmnet with reason: Maintenance [14:19:37] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:19:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T367781)', diff saved to https://phabricator.wikimedia.org/P66737 and previous config saved to /var/cache/conftool/dbconfig/20240717-141939-arnaudb.json [14:19:44] (03CR) 10Jgiannelos: [C:03+1] changeprop: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054888 (owner: 10Effie Mouzeli) [14:20:08] (03CR) 10Effie Mouzeli: [C:03+2] changeprop: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054888 (owner: 10Effie Mouzeli) [14:20:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:20:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2008.codfw.wmnet with OS bookworm [14:21:28] (03PS13) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [14:21:28] (03PS6) 10Ayounsi: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 [14:21:28] (03PS10) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [14:21:40] (03Merged) 10jenkins-bot: changeprop: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054888 (owner: 10Effie Mouzeli) [14:22:02] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on durum3003.esams.wmnet with reason: testing anycast-healthchecker 0.9.8 [14:22:04] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on durum3003.esams.wmnet with reason: testing anycast-healthchecker 0.9.8 [14:22:21] urbanecm: any update for script? [14:22:26] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9990461 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm completed: - dbproxy... 
[14:22:35] jouncebot: next [14:22:35] In 2 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1700) [14:22:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T367781)', diff saved to https://phabricator.wikimedia.org/P66738 and previous config saved to /var/cache/conftool/dbconfig/20240717-142249-arnaudb.json [14:23:23] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.88 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:24:33] (03PS14) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [14:24:33] (03PS7) 10Ayounsi: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 [14:24:33] (03PS11) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [14:26:28] (03CR) 10BCornwall: Add public suffix list module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [14:26:44] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for durum3003.esams.wmnet [14:26:44] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for durum3003.esams.wmnet [14:26:50] (03PS1) 10Hashar: openstack: remove OpenTofu git::clone file mode [puppet] - 10https://gerrit.wikimedia.org/r/1054890 (https://phabricator.wikimedia.org/T338277) [14:27:06] (03PS2) 10Hashar: openstack: remove OpenTofu git::clone file mode [puppet] - 10https://gerrit.wikimedia.org/r/1054890 (https://phabricator.wikimedia.org/T338277) [14:27:12] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [14:27:26] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2432 - https://phabricator.wikimedia.org/T370258#9990490 
(10Jhancock.wm) submitted a service request for dell to replace the drive. will update if they do. I'm pretty certain that the drive in question is the second one (in drive bay 1) but would appreciate... [14:27:26] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [14:27:33] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [14:27:57] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [14:28:04] (03PS2) 10Hashar: statistics: remove git::clone file mode [puppet] - 10https://gerrit.wikimedia.org/r/1054889 (https://phabricator.wikimedia.org/T338277) [14:29:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P66739 and previous config saved to /var/cache/conftool/dbconfig/20240717-142908-marostegui.json [14:30:34] (03PS4) 10JMeybohm: Initial commit validating-admission-policies chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) [14:30:34] (03PS1) 10JMeybohm: Add policy to allow only SYS_PTRACE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054891 (https://phabricator.wikimedia.org/T368251) [14:30:38] (03CR) 10CI reject: [V:04-1] Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [14:30:47] (03CR) 10CI reject: [V:04-1] Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [14:31:51] (03Abandoned) 10Kamila Součková: service catalog: remove mw-api-async-transition [puppet] - 10https://gerrit.wikimedia.org/r/991394 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [14:32:56] (03CR) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) 
[14:33:51] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2432 - https://phabricator.wikimedia.org/T370258#9990505 (10Clement_Goubert) Hi @Jhancock.wm very sorry for the noise, this is me trying to automate turning the RAID controller to HBA mode, there are no actual issues with the disk. I didn't know it would creat... [14:34:53] (03CR) 10Scott French: [C:03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054609 (https://phabricator.wikimedia.org/T369745) (owner: 10Mforns) [14:34:56] (03CR) 10Scott French: [C:03+2] commons-impact-analytics: bump image to v1.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054609 (https://phabricator.wikimedia.org/T369745) (owner: 10Mforns) [14:35:36] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:36:03] (03Merged) 10jenkins-bot: commons-impact-analytics: bump image to v1.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054609 (https://phabricator.wikimedia.org/T369745) (owner: 10Mforns) [14:36:53] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [14:37:07] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [14:37:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P66740 and previous config saved to /var/cache/conftool/dbconfig/20240717-143756-arnaudb.json [14:39:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:09] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply [14:40:22] (03PS1) 10Hashar: 
grafana: clone grafana-grizzly with default parameters [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) [14:40:29] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply [14:42:17] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2432 - https://phabricator.wikimedia.org/T370258#9990542 (10Jhancock.wm) all good! I can cancel the ticket. [14:43:23] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:44:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T367856)', diff saved to https://phabricator.wikimedia.org/P66741 and previous config saved to /var/cache/conftool/dbconfig/20240717-144415-marostegui.json [14:44:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [14:44:20] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:44:30] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3284/console" [puppet] - 10https://gerrit.wikimedia.org/r/1054661 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [14:44:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [14:46:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbproxy2007.codfw.wmnet with OS bookworm [14:46:11] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9990554 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2007.codfw.wmnet with OS bookworm [14:46:37] !log 
swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply [14:46:52] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply [14:47:31] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9990558 (10VRiley-WMF) Hey @fgiunchedi Just wanted to verify, since we would have to physically move this server into another rack (and in turn, have to change the IP) this activity is no longer nee... [14:47:52] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855#9990556 (10wiki_willy) Hi @ABran-WMF - can you work with the onsite engineers on this? cc'ing @VRiley-WMF & @Jclark-ctr >>! In T369855#9989118, @ABran-WMF wrote: > This server has bee... [14:48:16] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#9990559 (10bking) FYI, there is [[ https://github.com/StephenSorriaux/ansible-kafka-admin | an ansible library ]] that claims to `... 
[14:50:39] (03PS8) 10Ayounsi: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 [14:50:39] (03PS12) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [14:53:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P66742 and previous config saved to /var/cache/conftool/dbconfig/20240717-145303-arnaudb.json [14:53:27] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 374.88 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:54:22] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855#9990596 (10ABran-WMF) sure thing! @VRiley-WMF @Jclark-ctr the host has been depooled and is downtimed, you should be able to take it from here. Feel free to ping if needed! [14:56:17] (03PS1) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [14:56:48] (03CR) 10CI reject: [V:04-1] admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [14:56:57] (03PS1) 10Jgiannelos: changeprop: Disable pregeneration for mobile-sections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054895 [14:57:31] (03CR) 10CI reject: [V:04-1] Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [14:57:32] (03CR) 10CI reject: [V:04-1] Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [14:59:06] (03CR) 10Filippo Giunchedi: mysqld-exporter: hotfix config for es1 to es5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053698 
(https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [14:59:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2007.codfw.wmnet with reason: host reimage [15:00:27] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:01:00] (03PS24) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [15:01:16] (03CR) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [15:01:24] (03CR) 10CI reject: [V:04-1] mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [15:02:08] (03PS25) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [15:02:15] (03CR) 10Clément Goubert: [C:03+1] changeprop: Disable pregeneration for mobile-sections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054895 (owner: 10Jgiannelos) [15:03:19] (03CR) 10MVernon: [C:03+2] hiera: mark apus service as in production [puppet] - 10https://gerrit.wikimedia.org/r/1054344 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:03:21] (03CR) 10Jgiannelos: [C:03+2] changeprop: Disable pregeneration for mobile-sections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054895 (owner: 
10Jgiannelos) [15:03:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2007.codfw.wmnet with reason: host reimage [15:04:19] (03Merged) 10jenkins-bot: changeprop: Disable pregeneration for mobile-sections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054895 (owner: 10Jgiannelos) [15:05:45] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-07-17-140123 to 2024-07-17-145014 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054896 [15:05:52] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-07-17-140123 to 2024-07-17-145014 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054896 (owner: 10Jforrester) [15:07:04] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-07-17-140123 to 2024-07-17-145014 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054896 (owner: 10Jforrester) [15:07:36] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [15:07:42] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [15:08:01] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:08:05] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [15:08:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T367781)', diff saved to https://phabricator.wikimedia.org/P66743 and previous config saved to /var/cache/conftool/dbconfig/20240717-150811-arnaudb.json [15:08:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1194.eqiad.wmnet with reason: Maintenance [15:08:15] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:08:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1194.eqiad.wmnet with reason: Maintenance [15:08:35] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T367781)', diff saved to https://phabricator.wikimedia.org/P66744 and previous config saved to /var/cache/conftool/dbconfig/20240717-150833-arnaudb.json [15:08:37] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:08:50] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [15:09:58] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [15:10:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T367781)', diff saved to https://phabricator.wikimedia.org/P66745 and previous config saved to /var/cache/conftool/dbconfig/20240717-151045-arnaudb.json [15:11:20] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:12:42] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [15:12:45] (03CR) 10Ahmon Dancy: [C:03+1] git: remove umask from git::clone [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [15:13:21] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [15:13:36] (03PS2) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [15:15:01] (03PS2) 10MVernon: apus: add active/active geoip service record [dns] - 10https://gerrit.wikimedia.org/r/1054346 (https://phabricator.wikimedia.org/T279621) [15:15:01] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3286/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [15:16:37] !log cumin 'A:dnsbox' 'run-puppet-agent': T279621 [15:16:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:42] T279621: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621 [15:16:43] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [15:18:00] 06SRE, 06Traffic-Icebox, 06Web-Team-Backlog, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#9990617 (10Krinkle) [15:18:12] (03CR) 10Ssingh: apus: add active/active geoip service record [dns] - 10https://gerrit.wikimedia.org/r/1054346 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:18:43] (03CR) 10MVernon: [C:03+2] apus: add active/active geoip service record [dns] - 10https://gerrit.wikimedia.org/r/1054346 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:18:49] !log running authdns-update for CR 1054346 [15:18:51] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:02] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [15:20:52] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:20:59] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache apus.discovery.wmnet on all recursors [15:21:02] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) apus.discovery.wmnet on all recursors [15:21:20] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:21:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:21:31] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host dbproxy2007.codfw.wmnet with OS bookworm [15:22:22] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:22:35] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9990691 (10elukey) Filed a proposal in https://gerrit.wikimedia.org/r/1054894 @wiki_willy I reviewed the list of commands, most of them were already available with no pr... [15:22:36] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9990693 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2007.codfw.wmnet with OS bookworm completed: - dbproxy... [15:22:56] (03CR) 10Elukey: [V:03+1 C:04-1] "May need some rework, need to verify if ssh access would be granted." [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [15:23:11] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:23:20] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:24:12] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:24:32] (03CR) 10MVernon: [C:03+2] hiera: use discovery hostname in apus probes [puppet] - 10https://gerrit.wikimedia.org/r/1054347 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:25:21] Nemoralis: it is now finished, it says Done; replaced 4597, discarded 73 [15:25:33] (03PS1) 10Cathal Mooney: Prefer AWS routes from direct peer in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1054899 (https://phabricator.wikimedia.org/T370297) [15:25:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P66747 and previous config saved 
to /var/cache/conftool/dbconfig/20240717-152552-arnaudb.json [15:26:00] Nemoralis: i closed the task and published logs there [15:26:11] urbanecm: thanks! [15:26:23] (03CR) 10Cathal Mooney: [C:03+2] Adjust route generation for Anycast ranges at eqord [homer/public] - 10https://gerrit.wikimedia.org/r/1053935 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [15:26:53] (03Merged) 10jenkins-bot: Adjust route generation for Anycast ranges at eqord [homer/public] - 10https://gerrit.wikimedia.org/r/1053935 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [15:26:56] 06SRE, 06Traffic-Icebox, 06Web-Team-Backlog, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#9990688 (10Jdforrester-WMF) Note for those interested that this went live for Wikifunctions two weeks ago,... [15:28:35] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs [15:28:44] no problem Nemoralis , thanks for the ping [15:29:00] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs [15:29:42] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855#9990711 (10VRiley-WMF) Hey @ABran-WMF Thanks. I will be looking into this now. 
[15:30:39] !log sudo cumin "A:lvs" "run-puppet-agent" to pick up apus change [15:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:42] (03PS3) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [15:32:29] !log Adjust anycast route policy at Chicago Network POP cr2-eqord to announce anycast ranges T367439 [15:32:31] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3288/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [15:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:33] T367439: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439 [15:32:41] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs [15:32:44] (03PS8) 10Arnaudb: mariadb: tweaks monitoring thresholds for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/1054893 (https://phabricator.wikimedia.org/T367279) [15:33:06] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs [15:35:02] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs [15:35:15] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9990764 (10fgiunchedi) >>! In T369825#9990558, @VRiley-WMF wrote: > Hey @fgiunchedi Just wanted to verify, since we would have to physically move this server into another rack (and in turn, have to... 
[15:35:33] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs [15:37:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:37:23] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs [15:38:05] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs [15:38:46] (03PS9) 10Ayounsi: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 [15:38:46] (03PS13) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [15:39:38] (03PS1) 10Gergő Tisza: SUL3: Fix cookie names on the SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054901 (https://phabricator.wikimedia.org/T365162) [15:40:30] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3290/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [15:41:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P66748 and previous config saved to /var/cache/conftool/dbconfig/20240717-154059-arnaudb.json [15:41:50] (03CR) 10Elukey: [V:03+1] "I see the new users in the change catalog, but pcc seems not showing anything ssh-related in the diff. Lemme know your thoughts!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [15:42:54] !log otto@deploy1002 Started deploy [analytics/refinery@0b53772] (hadoop-test): TEST [analytics/refinery@0b53772e] [15:45:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054901 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [15:46:21] !log otto@deploy1002 Finished deploy [analytics/refinery@0b53772] (hadoop-test): TEST [analytics/refinery@0b53772e] (duration: 03m 27s) [15:50:11] !log otto@deploy1002 Started deploy [analytics/refinery@8f00c85] (hadoop-test): - take 2 - TEST [analytics/refinery@8f00c859] [15:51:35] (03CR) 10D3r1ck01: [C:03+1] SUL3: Fix cookie names on the SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054901 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [15:52:12] (03CR) 10Volans: Adapt tests for Netbox 4 (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [15:53:45] !log otto@deploy1002 Finished deploy [analytics/refinery@8f00c85] (hadoop-test): - take 2 - TEST [analytics/refinery@8f00c859] (duration: 03m 33s) [15:54:49] I just saw a burst of 14,363 instances of `LoadMonitor:124 Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to database....` in logspam. [15:55:29] (03PS2) 10JMeybohm: Add policy to allow only SYS_PTRACE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054891 (https://phabricator.wikimedia.org/T368251) [15:55:29] (03PS1) 10JMeybohm: Add policy to allow GeoIP hostPath volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054905 (https://phabricator.wikimedia.org/T368251) [15:55:29] It did stop so I guess the circuit breaking did its job. 
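Note on the 15:54:49 to 15:55:29 messages: the quoted LoadMonitor warning refers to circuit breaking, where calls to an overloaded backend are rejected outright for a while so that callers fail fast instead of piling up. The sketch below is a generic illustration of that pattern only, and is not MediaWiki's LoadMonitor code. The class name, the threshold, the timeout and the injectable clock are all hypothetical choices made for the example.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker (hypothetical; not MediaWiki's LoadMonitor)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def allow(self):
        """Return True if a call may proceed."""
        if self.opened_at is None:
            return True
        # Once the timeout has elapsed, let a trial call through (half-open state).
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        """Close the circuit and forget earlier failures."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def call(self, func, *args, **kwargs):
        """Run func through the breaker, rejecting it while the circuit is open."""
        if not self.allow():
            raise RuntimeError("circuit open: call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result
```

While the circuit is open, rejected calls cost almost nothing, which is consistent with the observation in the channel that the burst of warnings stopped on its own.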
[15:56:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T367781)', diff saved to https://phabricator.wikimedia.org/P66750 and previous config saved to /var/cache/conftool/dbconfig/20240717-155606-arnaudb.json [15:56:07] just saw a user report about it as well (#wikimedia-hackathon) [15:56:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:56:11] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:56:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:56:22] (sorry, it was #wikimedia-cloud actually) [15:56:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T367781)', diff saved to https://phabricator.wikimedia.org/P66751 and previous config saved to /var/cache/conftool/dbconfig/20240717-155628-arnaudb.json [15:59:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T367781)', diff saved to https://phabricator.wikimedia.org/P66752 and previous config saved to /var/cache/conftool/dbconfig/20240717-155937-arnaudb.json [16:01:16] (03PS1) 10Ottomata: refinery::job::test::gobblin - use gobbin-wmf 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054906 (https://phabricator.wikimedia.org/T370199) [16:01:18] (03CR) 10Gergő Tisza: [C:03+2] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054863 (https://phabricator.wikimedia.org/T370254) (owner: 10D3r1ck01) [16:02:00] (03Merged) 10jenkins-bot: [beta] Enable SUL3 on Beta Cluster for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054863 (https://phabricator.wikimedia.org/T370254) (owner: 10D3r1ck01) [16:04:14] (03PS2) 10Ottomata: refinery::job::test::gobblin - use gobbin-wmf 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054906 (https://phabricator.wikimedia.org/T370199) [16:04:14] 
(03CR) 10Ayounsi: Adapt tests for Netbox 4 (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [16:04:39] (03PS10) 10Ayounsi: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 [16:04:39] (03PS14) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [16:05:05] (03PS1) 10Brennen Bearnes: phabricator: apache2: add UnsafeAllow3F to RewriteRules [puppet] - 10https://gerrit.wikimedia.org/r/1054907 (https://phabricator.wikimedia.org/T370110) [16:05:17] (03PS2) 10JMeybohm: Add policy to allow GeoIP hostPath volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054905 (https://phabricator.wikimedia.org/T368251) [16:05:26] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1054906 (https://phabricator.wikimedia.org/T370199) (owner: 10Ottomata) [16:05:54] (03PS16) 10Hashar: git: remove umask from git::clone [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) [16:06:42] (03CR) 10Ottomata: [V:03+1 C:03+2] refinery::job::test::gobblin - use gobbin-wmf 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054906 (https://phabricator.wikimedia.org/T370199) (owner: 10Ottomata) [16:06:46] (03CR) 10Ottomata: [V:03+2 C:03+2] refinery::job::test::gobblin - use gobbin-wmf 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054906 (https://phabricator.wikimedia.org/T370199) (owner: 10Ottomata) [16:08:03] !log bking@kafka-main1005 `kafka topics --create --topic ${TOPIC} --partitions 1 --replication-factor 3; kafka configs --entity-type topics --entity-name ${TOPIC} --alter --add-config retention.ms=2592000000` T367510 [16:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:07] T367510: Request permission to 
create 4 kafka topics in kafka-main (WDQS graph split) - https://phabricator.wikimedia.org/T367510 [16:08:21] (03CR) 10Ayounsi: [C:03+1] Prefer AWS routes from direct peer in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1054899 (https://phabricator.wikimedia.org/T370297) (owner: 10Cathal Mooney) [16:12:22] (03PS1) 10Ottomata: refinery::job::gobblin - use gobbin-wmf 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054908 (https://phabricator.wikimedia.org/T370199) [16:12:43] (03CR) 10Hashar: [C:04-1] "After talking with Jelto about it, the patch affects different area of the infrastructure spanning multiple teams. I thus went to send som" [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:12:44] (03CR) 10CI reject: [V:04-1] refinery::job::gobblin - use gobbin-wmf 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054908 (https://phabricator.wikimedia.org/T370199) (owner: 10Ottomata) [16:13:00] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054889 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:13:02] (03CR) 10Hashar: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:13:04] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:13:06] !log otto@deploy1002 Started deploy [analytics/refinery@8f00c85]: [analytics/refinery@8f00c859] [16:13:23] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-07-09-154549 to 2024-07-17-145805 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054909 (https://phabricator.wikimedia.org/T364413) [16:14:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P66754 and previous config saved to 
/var/cache/conftool/dbconfig/20240717-161445-arnaudb.json [16:21:05] !log otto@deploy1002 Finished deploy [analytics/refinery@8f00c85]: [analytics/refinery@8f00c859] (duration: 07m 59s) [16:22:25] (03PS2) 10Ottomata: refinery::job::gobblin - use gobbin-wmf 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054908 (https://phabricator.wikimedia.org/T370199) [16:23:14] (03PS3) 10Hashar: statistics: remove git::clone file mode [puppet] - 10https://gerrit.wikimedia.org/r/1054889 (https://phabricator.wikimedia.org/T338277) [16:23:20] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054889 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:23:45] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1054908 (https://phabricator.wikimedia.org/T370199) (owner: 10Ottomata) [16:24:10] (03CR) 10Ottomata: [V:03+1 C:03+2] refinery::job::gobblin - use gobbin-wmf 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054908 (https://phabricator.wikimedia.org/T370199) (owner: 10Ottomata) [16:24:11] (03CR) 10Ottomata: [V:03+2 C:03+2] refinery::job::gobblin - use gobbin-wmf 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054908 (https://phabricator.wikimedia.org/T370199) (owner: 10Ottomata) [16:24:56] (03PS17) 10Hashar: git: remove umask from git::clone [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) [16:26:15] (03CR) 10Klausman: [C:03+2] admin_ng: allow more cpus for ml-serve's revscoring-articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054882 (owner: 10Elukey) [16:26:16] !log otto@deploy1002 Started deploy [analytics/refinery@8f00c85] (thin): THIN [analytics/refinery@8f00c859] [16:29:26] (03Merged) 10jenkins-bot: admin_ng: allow more cpus for ml-serve's revscoring-articlequality [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1054882 (owner: 10Elukey) [16:29:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P66755 and previous config saved to /var/cache/conftool/dbconfig/20240717-162952-arnaudb.json [16:29:54] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:30:13] (03PS2) 10Hashar: grafana: clone grafana-grizzly with default parameters [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) [16:30:24] !log otto@deploy1002 Finished deploy [analytics/refinery@8f00c85] (thin): THIN [analytics/refinery@8f00c859] (duration: 04m 08s) [16:30:27] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:30:31] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:30:49] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054890 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:31:21] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:31:30] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:31:42] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [16:32:45] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[16:33:32] (03PS1) 10Gergő Tisza: SUL3: Fix URL handling for the SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054911 (https://phabricator.wikimedia.org/T365162) [16:34:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054911 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [16:34:18] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [16:34:36] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [16:35:34] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [16:36:18] (03CR) 10Ayounsi: [C:03+2] "<3" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [16:42:43] (03Merged) 10jenkins-bot: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [16:43:27] (03CR) 10RobH: [C:03+1] admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [16:45:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T367781)', diff saved to https://phabricator.wikimedia.org/P66756 and previous config saved to /var/cache/conftool/dbconfig/20240717-164459-arnaudb.json [16:45:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1227.eqiad.wmnet with reason: Maintenance [16:45:05] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [16:45:08] (03CR) 10D3r1ck01: [C:03+1] "WFM locally." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054911 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [16:45:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1227.eqiad.wmnet with reason: Maintenance [16:45:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T367781)', diff saved to https://phabricator.wikimedia.org/P66757 and previous config saved to /var/cache/conftool/dbconfig/20240717-164521-arnaudb.json [16:45:59] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [16:46:16] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2432 - https://phabricator.wikimedia.org/T370258#9991147 (10Clement_Goubert) Please do :) I'll leave the task open so it doesn't open a new one when I inevitably break it again. For the record, any raid issue you'll get for mw2432, mw2433, mw2438, mw2439 until... [16:46:52] PROBLEM - MegaRAID on an-worker1127 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:47:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T367781)', diff saved to https://phabricator.wikimedia.org/P66758 and previous config saved to /var/cache/conftool/dbconfig/20240717-164736-arnaudb.json [16:58:47] !log btullis@deploy1002 Started deploy [airflow-dags/analytics@ca21d05]: (no justification provided) [16:59:38] !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@ca21d05]: (no justification provided) (duration: 00m 51s) [17:00:05] swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1700). [17:02:39] here - holding for the moment; we're likely to defer this work to another day [17:02:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P66759 and previous config saved to /var/cache/conftool/dbconfig/20240717-170243-arnaudb.json [17:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:59] (03CR) 10JHathaway: "I think datacenter-ops would need to be added to `profile::admin::always_groups`?" [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [17:12:13] (03PS2) 10Brennen Bearnes: phabricator: apache2: add UnsafeAllow3F to RewriteRules [puppet] - 10https://gerrit.wikimedia.org/r/1054907 (https://phabricator.wikimedia.org/T370110) [17:12:30] (03PS3) 10Brennen Bearnes: phabricator: apache2: add UnsafeAllow3F to RewriteRules [puppet] - 10https://gerrit.wikimedia.org/r/1054907 (https://phabricator.wikimedia.org/T370110) [17:13:01] (03CR) 10Ayounsi: [C:03+2] Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [17:13:32] !log bking@kafka-main2005 `kafka topics --create --topic ${TOPIC} --partitions 1 --replication-factor 3; kafka configs --entity-type topics --entity-name ${TOPIC} --alter --add-config retention.ms=2592000000 T367510` [17:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:36] T367510: Request permission to create 4 kafka topics in kafka-main (WDQS graph split) - https://phabricator.wikimedia.org/T367510 [17:13:51] (03PS1) 10Dreamrimmer: Allow Bureaucrats on Foundation Wiki to be able to remove Sysop rights 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054914 (https://phabricator.wikimedia.org/T370097) [17:17:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P66760 and previous config saved to /var/cache/conftool/dbconfig/20240717-171750-arnaudb.json [17:19:43] (03Merged) 10jenkins-bot: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [17:20:28] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [17:24:44] following up, we're going to defer this phase of the turndown to a future mediawiki infra window (TBD) [17:24:57] no changes planned to this window [17:25:46] 👍 [17:26:47] (03CR) 10Dzahn: [C:03+2] delete integration.mediawiki.org [dns] - 10https://gerrit.wikimedia.org/r/1054646 (https://phabricator.wikimedia.org/T361250) (owner: 10Dzahn) [17:27:12] !log removing integration.mediawikia.org from DNS - T361250 [17:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:17] T361250: Decommission integration.mediawiki.org - https://phabricator.wikimedia.org/T361250 [17:27:17] argg [17:27:23] !log removing integration.mediawiki.org from DNS - T361250 [17:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:35] (03PS2) 10Dzahn: delete integration.mediawiki.org [dns] - 10https://gerrit.wikimedia.org/r/1054646 (https://phabricator.wikimedia.org/T361250) [17:30:25] (03CR) 10Hashar: labs_lvm: pass shellcheck on scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [17:30:25] (03CR) 10Dzahn: [V:03+1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1054646 (https://phabricator.wikimedia.org/T361250) (owner: 10Dzahn) [17:30:29] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag 
Replication lag: 336.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:32:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T367781)', diff saved to https://phabricator.wikimedia.org/P66761 and previous config saved to /var/cache/conftool/dbconfig/20240717-173257-arnaudb.json [17:32:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:33:09] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:33:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:33:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:33:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:33:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T367781)', diff saved to https://phabricator.wikimedia.org/P66762 and previous config saved to /var/cache/conftool/dbconfig/20240717-173336-arnaudb.json [17:33:51] (03PS1) 10Hashar: labs_lvm: fix volume creation using relative size [puppet] - 10https://gerrit.wikimedia.org/r/1054916 (https://phabricator.wikimedia.org/T370312) [17:35:09] (03PS1) 10CDobbins: purged: revert use_pki flag [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) [17:35:29] (03CR) 10CI reject: [V:04-1] purged: revert use_pki flag [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [17:36:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T367781)', diff saved to https://phabricator.wikimedia.org/P66763 and previous config saved to 
/var/cache/conftool/dbconfig/20240717-173603-arnaudb.json [17:37:19] (03CR) 10Hashar: [C:03+1] "I have cherry picked it on the integration Puppet server `integration-puppetserver-01.integration.eqiad1.wikimedia.cloud` and that fixed t" [puppet] - 10https://gerrit.wikimedia.org/r/1054916 (https://phabricator.wikimedia.org/T370312) (owner: 10Hashar) [17:38:17] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:40:29] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:41:59] (03PS1) 10Dzahn: redirects.dat: delete integration.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1054919 (https://phabricator.wikimedia.org/T361250) [17:46:17] (03PS1) 10Tchanders: Set Flow to read only on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054921 (https://phabricator.wikimedia.org/T370322) [17:46:47] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#9991687 (10Dzahn) [17:47:03] (03CR) 10Dreamy Jazz: [C:03+1] Set Flow to read only on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054921 (https://phabricator.wikimedia.org/T370322) (owner: 10Tchanders) [17:47:34] (03CR) 10Dreamy Jazz: [C:03+1] Set Flow to read only on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054921 (https://phabricator.wikimedia.org/T370322) (owner: 10Tchanders) [17:47:54] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#9991681 (10Dzahn) @KOfori Would you be ok with becoming the approver for the 
group `dns-admins`? This group is "people allowe... [17:51:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P66764 and previous config saved to /var/cache/conftool/dbconfig/20240717-175110-arnaudb.json [17:52:54] (03PS3) 10Ryan Kemper: wdqs: store metadata about graph split type [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) [17:53:40] (03CR) 10Andrew Bogott: [C:03+2] labs_lvm: fix volume creation using relative size [puppet] - 10https://gerrit.wikimedia.org/r/1054916 (https://phabricator.wikimedia.org/T370312) (owner: 10Hashar) [17:53:54] (03CR) 10Ryan Kemper: wdqs: store metadata about graph split type (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [17:54:23] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:54:53] (03CR) 10Cathal Mooney: [C:03+2] Prefer AWS routes from direct peer in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1054899 (https://phabricator.wikimedia.org/T370297) (owner: 10Cathal Mooney) [17:54:55] 06SRE, 10Continuous-Integration-Infrastructure, 06Infrastructure-Foundations, 06Release-Engineering-Team: package_builder python-all conflicts with base::standard_packages python2.7 removal - https://phabricator.wikimedia.org/T370337 (10hashar) 03NEW [17:55:53] (03Merged) 10jenkins-bot: Prefer AWS routes from direct peer in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1054899 (https://phabricator.wikimedia.org/T370297) (owner: 10Cathal Mooney) [17:59:28] 06SRE, 10Continuous-Integration-Infrastructure, 06Infrastructure-Foundations, 06Release-Engineering-Team: package_builder python-all conflicts with base::standard_packages python2.7 removal - 
https://phabricator.wikimedia.org/T370337#9991821 (10hashar) That might be related to 7b7d0be4c03f12ee045e95d8826ca... [18:00:05] dancy and andre: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T1800). [18:00:11] o/ [18:01:23] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 5.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:01:23] !log adjust route preference for traffic to AWS on Eqiad core routers T370297 [18:01:25] Pressing the button. [18:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:47] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054922 (https://phabricator.wikimedia.org/T366959) [18:01:49] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054922 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [18:02:34] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054922 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [18:06:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P66765 and previous config saved to /var/cache/conftool/dbconfig/20240717-180617-arnaudb.json [18:07:30] (03PS2) 10Tchanders: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) [18:08:18] (03CR) 10CI reject: [V:04-1] Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [18:09:29] RECOVERY - 
MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 36.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:09:50] (03CR) 10Tchanders: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [18:10:12] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.14 refs T366959 [18:10:17] T366959: 1.43.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T366959 [18:10:26] (03PS3) 10Tchanders: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) [18:21:23] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:21:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T367781)', diff saved to https://phabricator.wikimedia.org/P66766 and previous config saved to /var/cache/conftool/dbconfig/20240717-182125-arnaudb.json [18:21:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2122.codfw.wmnet with reason: Maintenance [18:21:30] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:21:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2122.codfw.wmnet with reason: Maintenance [18:21:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T367781)', diff saved to https://phabricator.wikimedia.org/P66767 and previous config saved to /var/cache/conftool/dbconfig/20240717-182147-arnaudb.json [18:25:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T367781)', diff saved to https://phabricator.wikimedia.org/P66768 and previous 
config saved to /var/cache/conftool/dbconfig/20240717-182514-arnaudb.json [18:30:16] PROBLEM - MariaDB Replica Lag: s1 #page on db1219 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 347.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:30:22] (03PS1) 10Dzahn: httpbb: remove tests for git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1054924 (https://phabricator.wikimedia.org/T323073) [18:31:07] !incidents [18:31:07] 4859 (UNACKED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [18:31:08] 4858 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [18:31:27] !ack 4859 [18:31:28] 4859 (ACKED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [18:31:51] is ^ expected at all? [18:32:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855#9992160 (10VRiley-WMF) Noted that the server doesn't want to power on. Tried to power cycle it, attempted a flea power drain. Reseated the power cable from the motherboard. Removed all t... [18:33:55] (03CR) 10CI reject: [V:04-1] httpbb: remove tests for git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1054924 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [18:35:27] herron: IMO we should depool it at the very least? [18:36:01] sukhe: yes sgtm too I've got the command staged up on cumin host [18:36:13] wasn't sure if related to the above logs [18:36:16] anything I can help with? 
[18:36:35] but yeah, +1 depool [18:36:51] ack, doing [18:37:03] No [18:37:06] Don't depool [18:37:06] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3295/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054661 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [18:37:10] See _security [18:37:27] wow just in the nick of time [18:37:31] aborting :) [18:37:56] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054926 (https://phabricator.wikimedia.org/T366959) [18:37:58] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054926 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [18:38:16] RECOVERY - MariaDB Replica Lag: s1 #page on db1219 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:38:38] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054926 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [18:38:40] marostegui: I ran the dbctl instance depool but did not commit, FYI [18:39:05] herron: You can just do dbctl instance pool [18:39:20] And dbctl config diff should show nothing [18:39:21] marostegui: ok, done [18:39:34] yes looks good, no diff [18:39:42] excellent thank you [18:39:50] herron: fyi dancy is in the middle of running the train [18:40:04] RhinosF1: thank you [18:40:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P66769 and previous config saved to /var/cache/conftool/dbconfig/20240717-184021-arnaudb.json [18:40:23] Rolling the train back to group0 at the moment. 
[18:43:06] cccccctrnruvrebuieelcnjnjdfjrbunltlfbbtelcri [18:43:13] sorry :) [18:43:47] bblack: cat or yubikey? [18:43:54] yubikey [18:43:58] :) [18:46:07] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.14 refs T366959 [18:46:11] T366959: 1.43.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T366959 [18:46:39] Rollback completed. [18:55:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P66770 and previous config saved to /var/cache/conftool/dbconfig/20240717-185528-arnaudb.json [18:56:40] i am missing a couple email notifications from Phabricator. is this a known issue? [18:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:00:23] MatmaRex: Not that I'm aware of. 
[19:00:46] filed as https://phabricator.wikimedia.org/T370352 now [19:10:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T367781)', diff saved to https://phabricator.wikimedia.org/P66771 and previous config saved to /var/cache/conftool/dbconfig/20240717-191035-arnaudb.json [19:10:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2150.codfw.wmnet with reason: Maintenance [19:10:40] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:10:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2150.codfw.wmnet with reason: Maintenance [19:10:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T367781)', diff saved to https://phabricator.wikimedia.org/P66772 and previous config saved to /var/cache/conftool/dbconfig/20240717-191057-arnaudb.json [19:13:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T367781)', diff saved to https://phabricator.wikimedia.org/P66773 and previous config saved to /var/cache/conftool/dbconfig/20240717-191324-arnaudb.json [19:14:19] FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:17:23] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 50.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:19:19] FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:25:45] 10ops-eqiad, 06SRE, 06DC-Ops, 
10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9992386 (10VRiley-WMF) @Jclark-ctr and @cmooney I have plugged in a 2nd network cable. Here is that information cloudcephosd1035 - CableID 5328 : Port 42... [19:28:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P66774 and previous config saved to /var/cache/conftool/dbconfig/20240717-192830-arnaudb.json [19:29:19] FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:34:36] (03CR) 10CDobbins: [V:03+1 C:03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3296/console" [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [19:37:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:43:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P66775 and previous config saved to /var/cache/conftool/dbconfig/20240717-194337-arnaudb.json [19:47:13] (03CR) 10JHathaway: [V:03+1 C:03+2] pcc-puppetdb: remove java pinning [puppet] - 10https://gerrit.wikimedia.org/r/1054661 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [19:58:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T367781)', diff saved to https://phabricator.wikimedia.org/P66776 and previous config saved to 
/var/cache/conftool/dbconfig/20240717-195844-arnaudb.json [19:58:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2159.codfw.wmnet with reason: Maintenance [19:58:50] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:59:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2159.codfw.wmnet with reason: Maintenance [19:59:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:59:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:59:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T367781)', diff saved to https://phabricator.wikimedia.org/P66777 and previous config saved to /var/cache/conftool/dbconfig/20240717-195921-arnaudb.json [19:59:31] (03PS4) 10Ebrahim: Enable ICU provided alphabetical order in the Kurdish wikis categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054641 (https://phabricator.wikimedia.org/T48235) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T2000). nyaa~ [20:00:04] tgr and seddon: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:42] * Seddon appears [20:00:48] o/ [20:01:42] I can deploy [20:01:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T367781)', diff saved to https://phabricator.wikimedia.org/P66778 and previous config saved to /var/cache/conftool/dbconfig/20240717-200147-arnaudb.json [20:02:35] (03PS2) 10Gergő Tisza: SUL3: Fix cookie names on the SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054901 (https://phabricator.wikimedia.org/T365162) [20:03:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054901 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [20:04:28] (03Merged) 10jenkins-bot: SUL3: Fix cookie names on the SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054901 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [20:04:42] Hi sorry. I posted the deployment request in the wrong window. I just added one for the current window. [20:04:58] !log tgr@deploy1002 Started scap sync-world: Backport for [[gerrit:1054901|SUL3: Fix cookie names on the SSO domain (T365162)]] [20:05:01] T365162: Set up sso.wikimedia.beta.wmflabs.org with config-layer routing to other wikis - https://phabricator.wikimedia.org/T365162 [20:05:04] tgr|away: ^ [20:06:17] Seddon: those are not config patches [20:06:24] is that a bug with the scheduler tool? [20:06:30] Ah oops. [20:06:38] Nope I added them manually [20:06:48] By manually I mean.... 
copy paste
[20:07:33] (CR) Dreamy Jazz: Enable temporary accounts on testwiki and loginwiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: Tchanders)
[20:07:35] !log tgr@deploy1002 tgr: Backport for [[gerrit:1054901|SUL3: Fix cookie names on the SSO domain (T365162)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:08:06] generally the idea is that it gets merged to master and then a cherry-pick to the wmf deploy branch is scheduled for backport
[20:08:18] are those patches urgent?
[20:08:29] I'll sort it out and it can go out tomorrow
[20:08:50] thx
[20:09:21] !log tgr@deploy1002 tgr: Continuing with sync
[20:12:04] (PS2) Gergő Tisza: SUL3: Fix URL handling for the SSO domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1054911 (https://phabricator.wikimedia.org/T365162)
[20:12:20] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on cr2-codfw,ssw1-a[1,8]-codfw.mgmt with reason: Rebooting ssw1-d8-codfw to try and fix gnmi telemtry
[20:12:35] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr2-codfw,ssw1-a[1,8]-codfw.mgmt with reason: Rebooting ssw1-d8-codfw to try and fix gnmi telemtry
[20:12:43] ops-eqiad, SRE, DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9992536 (VRiley-WMF) There is another rack in row B that we can use for this server. However, moving it to another rack will require an IP change when I last confirmed it.
[20:12:56] !log rebooting unused switch ssw1-d8-codfw in an effort to troubleshoot gnmic errors
[20:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:20] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1054901|SUL3: Fix cookie names on the SSO domain (T365162)]] (duration: 09m 23s)
[20:14:25] T365162: Set up sso.wikimedia.beta.wmflabs.org with config-layer routing to other wikis - https://phabricator.wikimedia.org/T365162
[20:14:46] (PS4) Ahmon Dancy: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115)
[20:15:28] (CR) CI reject: [V:-1] MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: Ahmon Dancy)
[20:16:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P66779 and previous config saved to /var/cache/conftool/dbconfig/20240717-201655-arnaudb.json
[20:17:12] ops-eqiad, SRE, DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9992543 (cmooney) >>! In T369825#9992536, @VRiley-WMF wrote: > There is another rack in row B that we can use for this server. However, moving it to another rack will require an IP change when I l...
[20:17:21] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054911 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[20:17:47] (PS5) Ahmon Dancy: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115)
[20:18:01] (Merged) jenkins-bot: SUL3: Fix URL handling for the SSO domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1054911 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[20:18:26] (CR) CI reject: [V:-1] MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: Ahmon Dancy)
[20:18:29] !log tgr@deploy1002 Started scap sync-world: Backport for [[gerrit:1054911|SUL3: Fix URL handling for the SSO domain (T365162)]]
[20:19:01] (PS6) Ahmon Dancy: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115)
[20:26:31] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:27:40] (CR) Ahmon Dancy: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: Ahmon Dancy)
[20:30:31] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:32:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after
maintenance db2159', diff saved to https://phabricator.wikimedia.org/P66780 and previous config saved to /var/cache/conftool/dbconfig/20240717-203202-arnaudb.json
[20:32:08] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9992603 (cmooney) >>! In T370164#9989347, @ayounsi wrote: > We will need to migrate the whole range to a new prefix :( Running 2 ranges is...
[20:37:34] scap is taking its sweet time
[20:38:02] 20 min and it's not even on the debug hosts yet
[20:40:04] tgr|away: no worries.
[20:40:37] FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:47:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T367781)', diff saved to https://phabricator.wikimedia.org/P66781 and previous config saved to /var/cache/conftool/dbconfig/20240717-204709-arnaudb.json
[20:47:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2168.codfw.wmnet with reason: Maintenance
[20:47:14] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[20:47:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2168.codfw.wmnet with reason: Maintenance
[20:47:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T367781)', diff saved to https://phabricator.wikimedia.org/P66782 and previous config saved to /var/cache/conftool/dbconfig/20240717-204731-arnaudb.json
[20:50:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T367781)', diff saved to https://phabricator.wikimedia.org/P66783 and previous config saved to
/var/cache/conftool/dbconfig/20240717-205058-arnaudb.json
[20:53:31] !log tgr@deploy1002 tgr: Backport for [[gerrit:1054911|SUL3: Fix URL handling for the SSO domain (T365162)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:53:35] T365162: Set up sso.wikimedia.beta.wmflabs.org with config-layer routing to other wikis - https://phabricator.wikimedia.org/T365162
[20:54:37] !log tgr@deploy1002 tgr: Continuing with sync
[20:55:03] (PS2) Kimberly Sarabia: skin-themes dblist is expanded to include tier 2 wikis as well as tier 1. [mediawiki-config] - https://gerrit.wikimedia.org/r/1054685 (https://phabricator.wikimedia.org/T367150)
[21:00:06] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240717T2100)
[21:01:03] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1054911|SUL3: Fix URL handling for the SSO domain (T365162)]] (duration: 42m 33s)
[21:01:07] T365162: Set up sso.wikimedia.beta.wmflabs.org with config-layer routing to other wikis - https://phabricator.wikimedia.org/T365162
[21:01:23] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054685 (https://phabricator.wikimedia.org/T367150) (owner: Kimberly Sarabia)
[21:02:03] (Merged) jenkins-bot: skin-themes dblist is expanded to include tier 2 wikis as well as tier 1. [mediawiki-config] - https://gerrit.wikimedia.org/r/1054685 (https://phabricator.wikimedia.org/T367150) (owner: Kimberly Sarabia)
[21:02:31] !log tgr@deploy1002 Started scap sync-world: Backport for [[gerrit:1054685|skin-themes dblist is expanded to include tier 2 wikis as well as tier 1.
(T367150)]]
[21:02:35] T367150: Deploy dark mode to logged-out users in tier 1 and 2 wikis on the Vector2022 and Minerva skin - https://phabricator.wikimedia.org/T367150
[21:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:06:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P66784 and previous config saved to /var/cache/conftool/dbconfig/20240717-210605-arnaudb.json
[21:07:26] SRE, Infrastructure-Foundations, netops: Issue with subscribing to GNMI telemetry on certain QFX5120 devices - https://phabricator.wikimedia.org/T370366 (cmooney) NEW p:Triage→Low
[21:08:00] SRE, Infrastructure-Foundations, netops: Issue with subscribing to GNMI telemetry on certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992754 (cmooney)
[21:08:03] SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#9992755 (cmooney)
[21:08:39] !log tgr@deploy1002 tgr, ksarabia: Backport for [[gerrit:1054685|skin-themes dblist is expanded to include tier 2 wikis as well as tier 1. (T367150)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:08:43] T367150: Deploy dark mode to logged-out users in tier 1 and 2 wikis on the Vector2022 and Minerva skin - https://phabricator.wikimedia.org/T367150
[21:09:09] kimberly_sarabia: it's on mwdebug if you need to test it
[21:09:12] LGTM
[21:14:30] !log tgr@deploy1002 tgr, ksarabia: Continuing with sync
[21:19:30] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1054685|skin-themes dblist is expanded to include tier 2 wikis as well as tier 1.
(T367150)]] (duration: 16m 59s)
[21:19:34] T367150: Deploy dark mode to logged-out users in tier 1 and 2 wikis on the Vector2022 and Minerva skin - https://phabricator.wikimedia.org/T367150
[21:21:07] SRE, Infrastructure-Foundations, netops: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992812 (cmooney)
[21:21:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P66785 and previous config saved to /var/cache/conftool/dbconfig/20240717-212112-arnaudb.json
[21:21:25] ops-eqiad, SRE, DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9992811 (VRiley-WMF) >>! In T369825#9992543, @cmooney wrote: >>>! In T369825#9992536, @VRiley-WMF wrote: >> There is another rack in row B that we can use for this server. However, moving it to an...
[21:22:32] kimberly_sarabia: it's live. Sorry, usually it doesn't take this long.
[21:22:53] tgr|away: No worries!
Thanks so much
[21:23:13] !log UTC late deploys done
[21:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T367781)', diff saved to https://phabricator.wikimedia.org/P66786 and previous config saved to /var/cache/conftool/dbconfig/20240717-213619-arnaudb.json
[21:36:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2182.codfw.wmnet with reason: Maintenance
[21:36:24] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[21:36:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2182.codfw.wmnet with reason: Maintenance
[21:36:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T367781)', diff saved to https://phabricator.wikimedia.org/P66787 and previous config saved to /var/cache/conftool/dbconfig/20240717-213641-arnaudb.json
[21:40:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T367781)', diff saved to https://phabricator.wikimedia.org/P66788 and previous config saved to /var/cache/conftool/dbconfig/20240717-214008-arnaudb.json
[21:48:03] ops-eqiad, SRE, DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9992885 (Papaul) I asked this same question on July 15th on IRC and i didn't get any response ` 12:20 < papaul> VRiley: godog: which rack are you moving centrallog1002 to ? ` as @cmooney mentione...
[21:55:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P66789 and previous config saved to /var/cache/conftool/dbconfig/20240717-215516-arnaudb.json
[21:56:46] (PS1) Cathal Mooney: Add identifiers for ESI-LAGs to legacy switches on codfw row D spines [homer/public] - https://gerrit.wikimedia.org/r/1054942 (https://phabricator.wikimedia.org/T366941)
[22:05:52] FIRING: GitLabCIPipelineErrors: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[22:07:31] !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[22:10:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P66790 and previous config saved to /var/cache/conftool/dbconfig/20240717-221023-arnaudb.json
[22:10:52] RESOLVED: GitLabCIPipelineErrors: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[22:13:07] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephmon1004-6 - jclark@cumin1002"
[22:14:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephmon1004-6 - jclark@cumin1002"
[22:14:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:17:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephmon1004.mgmt.eqiad.wmnet with reboot policy FORCED
[22:17:47] !log jclark@cumin1002 START - Cookbook
sre.hosts.provision for host cloudcephmon1006.mgmt.eqiad.wmnet with reboot policy FORCED
[22:17:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephmon1005.mgmt.eqiad.wmnet with reboot policy FORCED
[22:18:59] SRE, Infrastructure-Foundations, netops: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992974 (cmooney)
[22:24:51] SRE, Infrastructure-Foundations, netops: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992984 (cmooney)
[22:25:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T367781)', diff saved to https://phabricator.wikimedia.org/P66791 and previous config saved to /var/cache/conftool/dbconfig/20240717-222530-arnaudb.json
[22:25:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance
[22:25:35] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[22:25:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance
[22:26:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance
[22:26:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance
[22:26:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2208.codfw.wmnet with reason: Maintenance
[22:26:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2208.codfw.wmnet with reason: Maintenance
[22:27:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T367781)', diff saved to https://phabricator.wikimedia.org/P66792 and previous config saved to
/var/cache/conftool/dbconfig/20240717-222701-arnaudb.json
[22:28:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephmon1005.mgmt.eqiad.wmnet with reboot policy FORCED
[22:28:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephmon1006.mgmt.eqiad.wmnet with reboot policy FORCED
[22:28:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephmon1004.mgmt.eqiad.wmnet with reboot policy FORCED
[22:30:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T367781)', diff saved to https://phabricator.wikimedia.org/P66793 and previous config saved to /var/cache/conftool/dbconfig/20240717-223028-arnaudb.json
[22:30:37] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[22:37:27] !log zabe@mwmaint1002:~$ mwscript createAndPromote.php aewikimedia "Reda Kerbouche" REDACTED --bureaucrat --sysop # T362529
[22:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:32] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529
[22:39:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye
[22:39:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1005.eqiad.wmnet with OS bullseye
[22:39:52] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1006.eqiad.wmnet with OS bullseye
[22:40:10] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9993042 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye
[22:40:15] ops-eqiad, SRE, DC-Ops, cloud-services-team
(Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9993043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye
[22:40:19] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9993044 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1006.eqiad.wmnet with OS bullseye
[22:45:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P66794 and previous config saved to /var/cache/conftool/dbconfig/20240717-224536-arnaudb.json
[22:45:43] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9993068 (Jclark-ctr)
[23:00:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P66795 and previous config saved to /var/cache/conftool/dbconfig/20240717-230043-arnaudb.json
[23:08:03] (PS1) JHathaway: expose_agent_certs: use ssldir exclusively [puppet] - https://gerrit.wikimedia.org/r/1054951 (https://phabricator.wikimedia.org/T367547)
[23:08:35] (CR) CI reject: [V:-1] expose_agent_certs: use ssldir exclusively [puppet] - https://gerrit.wikimedia.org/r/1054951 (https://phabricator.wikimedia.org/T367547) (owner: JHathaway)
[23:10:37] (PS2) JHathaway: expose_agent_certs: use ssldir exclusively [puppet] - https://gerrit.wikimedia.org/r/1054951 (https://phabricator.wikimedia.org/T367547)
[23:11:45] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1054951 (https://phabricator.wikimedia.org/T367547) (owner: JHathaway)
[23:11:58] !log jclark@cumin1002 START - Cookbook
sre.network.configure-switch-interfaces for host cloudcephmon1005
[23:12:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephmon1005
[23:13:24] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephmon1004
[23:13:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephmon1004
[23:13:43] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephmon1006
[23:13:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephmon1006
[23:14:10] !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[23:15:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T367781)', diff saved to https://phabricator.wikimedia.org/P66796 and previous config saved to /var/cache/conftool/dbconfig/20240717-231550-arnaudb.json
[23:15:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2220.codfw.wmnet with reason: Maintenance
[23:15:55] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[23:16:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:16:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2220.codfw.wmnet with reason: Maintenance
[23:16:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T367781)', diff saved to https://phabricator.wikimedia.org/P66797 and previous config saved to /var/cache/conftool/dbconfig/20240717-231612-arnaudb.json
[23:16:29] (PS1) Dwisehaupt: crm: switch civicrm to use smarty4 and don't pull extensions [puppet] - https://gerrit.wikimedia.org/r/1054952 (https://phabricator.wikimedia.org/T343486)
[23:18:17] (PS1) Dwisehaupt: crm: add gnupg to crm role [puppet] -
https://gerrit.wikimedia.org/r/1054953 (https://phabricator.wikimedia.org/T343486)
[23:19:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T367781)', diff saved to https://phabricator.wikimedia.org/P66798 and previous config saved to /var/cache/conftool/dbconfig/20240717-231939-arnaudb.json
[23:34:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P66799 and previous config saved to /var/cache/conftool/dbconfig/20240717-233446-arnaudb.json
[23:37:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:38:38] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054955
[23:38:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054955 (owner: TrainBranchBot)
[23:41:12] (PS1) Dzahn: httpbb: add a redirect test for phabricator [puppet] - https://gerrit.wikimedia.org/r/1054956
[23:45:31] (CR) Dzahn: [V:+1 C:+2] "tested with httpbb" [puppet] - https://gerrit.wikimedia.org/r/1054907 (https://phabricator.wikimedia.org/T370110) (owner: Brennen Bearnes)
[23:45:52] (PS2) Dzahn: httpbb: add a test for phabricator [puppet] - https://gerrit.wikimedia.org/r/1054956
[23:47:23] (PS2) Dzahn: httpbb: remove tests for git.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1054924 (https://phabricator.wikimedia.org/T323073)
[23:48:04] (CR) Dzahn: [C:+2] httpbb: remove tests for git.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1054924 (https://phabricator.wikimedia.org/T323073) (owner: Dzahn)
[23:49:54] !log arnaudb@cumin1002 dbctl
commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P66800 and previous config saved to /var/cache/conftool/dbconfig/20240717-234953-arnaudb.json
[23:50:39] !log phabricator (phab1004) - deployed gerrit:1054907 ; restarted apache
[23:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:26] (PS3) Dzahn: httpbb: add a test for phabricator [puppet] - https://gerrit.wikimedia.org/r/1054956 (https://phabricator.wikimedia.org/T370110)
[23:54:19] (CR) Dzahn: [C:+2] httpbb: add a test for phabricator [puppet] - https://gerrit.wikimedia.org/r/1054956 (https://phabricator.wikimedia.org/T370110) (owner: Dzahn)
[23:54:57] (CR) Dzahn: [V:+1 C:+2] "[deploy1002:~] $ cat test_phabricator.yaml" [puppet] - https://gerrit.wikimedia.org/r/1054956 (https://phabricator.wikimedia.org/T370110) (owner: Dzahn)