[00:01:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [00:02:15] (03CR) 10RLazarus: otelcol: Stop hardcoding k8s master IP addresses (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [00:02:32] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1054400 (owner: 10TrainBranchBot) [00:02:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P66563 and previous config saved to /var/cache/conftool/dbconfig/20240716-000255-arnaudb.json [00:10:11] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [00:18:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P66564 and previous config saved to /var/cache/conftool/dbconfig/20240716-001802-arnaudb.json [00:22:32] !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiktionary --logwiki=metawiki 'Dodo cham' 'Le GlitcheurHD' # T369777 [00:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:36] T369777: Unblock stuck global rename of Le GlitcheurHD - https://phabricator.wikimedia.org/T369777 [00:26:18] !log zabe@mwmaint1002:/tmp/upload$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Trade . 
# T369998 [00:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:22] T369998: Server side upload for Trade - https://phabricator.wikimedia.org/T369998 [00:33:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T367781)', diff saved to https://phabricator.wikimedia.org/P66565 and previous config saved to /var/cache/conftool/dbconfig/20240716-003310-arnaudb.json [00:33:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2189.codfw.wmnet with reason: Maintenance [00:33:15] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [00:33:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2189.codfw.wmnet with reason: Maintenance [00:33:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T367781)', diff saved to https://phabricator.wikimedia.org/P66566 and previous config saved to /var/cache/conftool/dbconfig/20240716-003331-arnaudb.json [00:36:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T367781)', diff saved to https://phabricator.wikimedia.org/P66567 and previous config saved to /var/cache/conftool/dbconfig/20240716-003604-arnaudb.json [00:40:13] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.92 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:51:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P66568 and previous config saved to /var/cache/conftool/dbconfig/20240716-005111-arnaudb.json [00:56:06] (03PS1) 10BCornwall: ncredir: Reformat/sort the redirects file [puppet] - 10https://gerrit.wikimedia.org/r/1054405 [00:56:20] (03CR) 10CI reject: [V:04-1] ncredir: Reformat/sort the redirects file [puppet] - 
10https://gerrit.wikimedia.org/r/1054405 (owner: 10BCornwall) [00:56:38] (03Abandoned) 10BCornwall: ncredir: Reformat/sort the redirects file [puppet] - 10https://gerrit.wikimedia.org/r/1054405 (owner: 10BCornwall) [00:57:06] (03PS5) 10BCornwall: ncredir: Reformat/sort the redirects file [puppet] - 10https://gerrit.wikimedia.org/r/1025875 (https://phabricator.wikimedia.org/T355189) [00:57:19] (03CR) 10CI reject: [V:04-1] ncredir: Reformat/sort the redirects file [puppet] - 10https://gerrit.wikimedia.org/r/1025875 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [00:59:59] (03PS6) 10BCornwall: ncredir: Reformat/sort the redirects file [puppet] - 10https://gerrit.wikimedia.org/r/1025875 (https://phabricator.wikimedia.org/T355189) [01:06:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P66569 and previous config saved to /var/cache/conftool/dbconfig/20240716-010618-arnaudb.json [01:08:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.14 [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1054406 (https://phabricator.wikimedia.org/T366959) [01:08:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.14 [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1054406 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [01:21:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T367781)', diff saved to https://phabricator.wikimedia.org/P66570 and previous config saved to /var/cache/conftool/dbconfig/20240716-012125-arnaudb.json [01:21:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2197.codfw.wmnet with reason: Maintenance [01:21:33] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [01:21:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2197.codfw.wmnet with reason: Maintenance [01:30:43] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [01:32:33] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29688 bytes in 0.662 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [01:32:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [01:33:05] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.14 [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1054406 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [01:37:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [01:45:10] (03CR) 10Krinkle: mediawiki: Refactor and improve captchaloop (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993010 (owner: 10Reedy) [01:48:40] (03CR) 10Krinkle: mediawiki: Refactor and improve captchaloop (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993010 (owner: 10Reedy) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0200) [02:07:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2207.codfw.wmnet with reason: Maintenance [02:07:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2207.codfw.wmnet with reason: Maintenance [02:07:51] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T367781)', diff saved to https://phabricator.wikimedia.org/P66572 and previous config saved to /var/cache/conftool/dbconfig/20240716-020751-arnaudb.json [02:07:55] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [02:10:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T367781)', diff saved to https://phabricator.wikimedia.org/P66573 and previous config saved to /var/cache/conftool/dbconfig/20240716-021023-arnaudb.json [02:25:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P66574 and previous config saved to /var/cache/conftool/dbconfig/20240716-022531-arnaudb.json [02:30:41] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:32:41] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:37:41] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:38:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [02:39:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:39] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P66575 and previous config saved to /var/cache/conftool/dbconfig/20240716-024038-arnaudb.json [02:43:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [02:55:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T367781)', diff saved to https://phabricator.wikimedia.org/P66576 and previous config saved to /var/cache/conftool/dbconfig/20240716-025545-arnaudb.json [02:55:50] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [02:59:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0300) [03:01:51] (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054410 (https://phabricator.wikimedia.org/T366959) [03:01:52] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054410 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [03:02:31] (03Merged) 10jenkins-bot: testwikis wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054410 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [03:03:00] !log mwpresync@deploy1002 Started scap 
sync-world: testwikis wikis to 1.43.0-wmf.14 refs T366959 [03:03:03] T366959: 1.43.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T366959 [03:17:41] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 45.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:45:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [03:53:56] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.14 refs T366959 (duration: 50m 56s) [03:54:00] T366959: 1.43.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T366959 [04:01:01] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.11 (duration: 00m 58s) [04:03:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0400) [04:41:41] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:57:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [04:57:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s3 T370019 [04:57:42] T370019: Switchover s3 master 
(db1157 -> db1223) - https://phabricator.wikimedia.org/T370019 [04:57:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T370019 [04:58:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P66577 and previous config saved to /var/cache/conftool/dbconfig/20240716-045807-marostegui.json [04:58:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Long schema change [04:58:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Long schema change [04:58:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1223 with weight 0 T370019', diff saved to https://phabricator.wikimedia.org/P66578 and previous config saved to /var/cache/conftool/dbconfig/20240716-045839-root.json [04:59:41] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1223 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1054076 (https://phabricator.wikimedia.org/T370019) (owner: 10Gerrit maintenance bot) [05:07:15] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 40.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:14:32] (03PS1) 10Marostegui: db1174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054411 [05:15:00] !log Starting s3 eqiad failover from db1157 to db1223 - T370019 [05:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:03] T370019: Switchover s3 master (db1157 -> db1223) - https://phabricator.wikimedia.org/T370019 [05:15:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T370019', diff saved to https://phabricator.wikimedia.org/P66579 and previous config saved to 
/var/cache/conftool/dbconfig/20240716-051516-root.json [05:15:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1223 to s3 primary and set section read-write T370019', diff saved to https://phabricator.wikimedia.org/P66580 and previous config saved to /var/cache/conftool/dbconfig/20240716-051538-root.json [05:16:00] (03PS2) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1054077 (https://phabricator.wikimedia.org/T370019) [05:16:17] (03CR) 10Marostegui: [C:03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1054077 (https://phabricator.wikimedia.org/T370019) (owner: 10Gerrit maintenance bot) [05:16:18] (03CR) 10Marostegui: [V:03+2 C:03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1054077 (https://phabricator.wikimedia.org/T370019) (owner: 10Gerrit maintenance bot) [05:16:31] (03CR) 10Marostegui: [C:03+2] db1174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054411 (owner: 10Marostegui) [05:17:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1157 T370019', diff saved to https://phabricator.wikimedia.org/P66581 and previous config saved to /var/cache/conftool/dbconfig/20240716-051718-root.json [05:17:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Long schema change [05:17:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Long schema change [05:17:51] 10ops-codfw, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#9983942 (10andrea.denisse) a:03andrea.denisse [05:19:23] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:19:37] 
PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:19:57] (03PS1) 10Marostegui: db1157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054412 [05:22:18] (03CR) 10Marostegui: [C:03+2] db1157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054412 (owner: 10Marostegui) [05:25:51] I have a patch scheduled in the next back port window but I won't have power then so I won't be around. fyi urbanecm [05:25:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Long schema change [05:25:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Long schema change [05:27:23] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:27:41] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:43:12] !log Deploy schema change on s3 eqiad db1157 dbmaint T367856 [05:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:15] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [05:43:20] !log Deploy schema change on s7 eqiad db1174 dbmaint T367856 [05:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:00] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1236 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1054413 (https://phabricator.wikimedia.org/T370121) [05:44:05] (03PS1) 10Gerrit 
maintenance bot: wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1054414 (https://phabricator.wikimedia.org/T370121) [05:50:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:50:44] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:56:37] marostegui: let me know if it is Ok to deploy cxserver. [05:56:44] kart_: go for it! [05:57:27] thanks! [05:58:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:58:46] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0600) [06:00:05] marostegui, Amir1, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0600). 
[06:02:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:02:31] (03Merged) 10jenkins-bot: Update cxserver to 2024-07-15-100650-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054340 (https://phabricator.wikimedia.org/T354666) (owner: 10KartikMistry) [06:02:49] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:05:53] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:06:16] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:07:54] ACKNOWLEDGEMENT - MariaDB Replica Lag: s7 on dbstore1008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 232932.07 seconds Marostegui https://phabricator.wikimedia.org/T370122 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:11:11] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:11:41] !log 
kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:11:45] FIRING: [2x] Processor usage over 85%: Alert for device cr1-eqiad.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [06:12:23] (03PS1) 10Marostegui: dbstore1008: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054421 [06:12:32] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [06:12:52] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [06:12:59] (03CR) 10Marostegui: [C:03+2] dbstore1008: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054421 (owner: 10Marostegui) [06:16:03] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:16:33] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:16:45] RESOLVED: [2x] Processor usage over 85%: Device cr1-eqiad.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [06:18:35] !log Updated cxserver to 2024-07-15-100650-production (T354666) [06:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:39] T354666: Enable MADLAD-400 in MinT test instance and Production for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666 [06:34:02] jouncebot: nowandnext [06:34:02] For the next 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0600) [06:34:02] In 0 hour(s) and 25 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0700) [06:40:03] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb 
(host:localhost) 77868944 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:41:03] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:54:50] jouncebot: nowandnext [06:54:50] For the next 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0600) [06:54:50] In 0 hour(s) and 5 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0700) [06:57:55] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9984008 (10ayounsi) 05Open→03Resolved a:03ayounsi Closing this task in favor of {T364092}. [06:58:03] (03PS1) 10Slyngshede: data.yaml: Offboarding fjoseph [puppet] - 10https://gerrit.wikimedia.org/r/1054425 [06:58:11] (03CR) 10CI reject: [V:04-1] data.yaml: Offboarding fjoseph [puppet] - 10https://gerrit.wikimedia.org/r/1054425 (owner: 10Slyngshede) [06:58:15] (03PS2) 10Slyngshede: data.yaml: Offboarding fjoseph [puppet] - 10https://gerrit.wikimedia.org/r/1054425 [06:59:00] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 52999 [06:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:59:25] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 52999 [07:00:04] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0700). 
[07:00:04] Seawolf35 and Dreamy_Jazz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:11] \o [07:01:14] (03PS2) 10Dreamy Jazz: [CheckUser] Remove wgCheckUserEventTablesMigrationStage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053297 (https://phabricator.wikimedia.org/T366546) [07:01:18] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 22.4R3 - https://phabricator.wikimedia.org/T364092#9984011 (10ayounsi) There has been a spike of CPU usage on cr1-eqiad (with no impact), not sure if just a coincidence. [07:01:21] (03Abandoned) 10Ayounsi: python_deploy_venv.sh enable proxy support [puppet] - 10https://gerrit.wikimedia.org/r/1053000 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:03:39] I see that Seawolf35 isn't around for this window based on an above message [07:03:59] I will therefore deploy my patch [07:05:44] (03PS3) 10Dreamy Jazz: [CheckUser] Remove wgCheckUserEventTablesMigrationStage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053297 (https://phabricator.wikimedia.org/T366546) [07:06:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053297 (https://phabricator.wikimedia.org/T366546) (owner: 10Dreamy Jazz) [07:06:59] (03Merged) 10jenkins-bot: [CheckUser] Remove wgCheckUserEventTablesMigrationStage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053297 (https://phabricator.wikimedia.org/T366546) (owner: 10Dreamy Jazz) [07:07:08] !log volans@cumin1002 START - Cookbook sre.dns.netbox [07:07:43] !log dreamyjazz@deploy1002 Started scap sync-world: Backport for [[gerrit:1053297|[CheckUser] Remove wgCheckUserEventTablesMigrationStage config (T366546)]] [07:07:46] T366546: Remove wgCheckUserEventTablesMigrationStage and related 
migration code - https://phabricator.wikimedia.org/T366546 [07:10:51] !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Merging pending changes for frack hosts as per IRC discussion - volans@cumin1002" [07:12:12] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [07:13:13] !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Merging pending changes for frack hosts as per IRC discussion - volans@cumin1002" [07:13:13] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:14:48] !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1053297|[CheckUser] Remove wgCheckUserEventTablesMigrationStage config (T366546)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:14:53] T366546: Remove wgCheckUserEventTablesMigrationStage and related migration code - https://phabricator.wikimedia.org/T366546 [07:14:56] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1054425 (owner: 10Slyngshede) [07:14:57] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync [07:16:18] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding fjoseph [puppet] - 10https://gerrit.wikimedia.org/r/1054425 (owner: 10Slyngshede) [07:19:07] (03PS1) 10Slyngshede: data.yaml: Extend MOU for dalezhou [puppet] - 10https://gerrit.wikimedia.org/r/1054427 [07:19:52] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1053297|[CheckUser] Remove wgCheckUserEventTablesMigrationStage config (T366546)]] (duration: 12m 09s) [07:19:56] T366546: Remove wgCheckUserEventTablesMigrationStage and related migration code - https://phabricator.wikimedia.org/T366546 [07:22:47] (03CR) 10Slyngshede: "Extension requested 
by mgerlach@ who also provided the new email address for the user." [puppet] - 10https://gerrit.wikimedia.org/r/1054427 (owner: 10Slyngshede) [07:25:08] I'm not sure I will be able to deploy the other change, considering that Seawolf35 isn't here for the window. [07:25:12] (03PS1) 10Marostegui: filtered_tables: Remove columns [puppet] - 10https://gerrit.wikimedia.org/r/1054428 (https://phabricator.wikimedia.org/T367781) [07:28:46] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve1002.eqiad.wmnet [07:29:32] !log Restarted MediaModeration scanning script [07:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:45] (03CR) 10Arnaudb: [C:03+1] filtered_tables: Remove columns [puppet] - 10https://gerrit.wikimedia.org/r/1054428 (https://phabricator.wikimedia.org/T367781) (owner: 10Marostegui) [07:30:50] (03CR) 10Marostegui: [C:03+2] filtered_tables: Remove columns [puppet] - 10https://gerrit.wikimedia.org/r/1054428 (https://phabricator.wikimedia.org/T367781) (owner: 10Marostegui) [07:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:38:10] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1002.eqiad.wmnet [07:38:37] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve1006.eqiad.wmnet [07:40:33] !log Morning UTC backport window done [07:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:16] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:45:16] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK 
- up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:45:58] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:46:18] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1006.eqiad.wmnet [07:52:25] 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9984050 (10fgiunchedi) >>! In T369826#9982422, @Jhancock.wm wrote: > We won't need to move racks. But because of the way the switches are, we can't reuse the same port on the switch. we'll be moving... [07:52:33] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: switch gitlab from iptables to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [08:07:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66584 and previous config saved to /var/cache/conftool/dbconfig/20240716-080707-root.json [08:07:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66585 and previous config saved to /var/cache/conftool/dbconfig/20240716-080720-root.json [08:07:45] (03PS1) 10Marostegui: Revert "db1157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054483 [08:07:47] (03PS1) 10Volans: mysql_legacy: update core sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054484 (https://phabricator.wikimedia.org/T367496) [08:07:49] (03PS1) 10Marostegui: Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054485 [08:08:12] (03CR) 10Marostegui: [C:03+2] Revert "db1157: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/1054483 (owner: 10Marostegui) [08:08:21] (03CR) 10Marostegui: [C:03+2] Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054485 (owner: 10Marostegui) [08:08:42] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:09:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:09:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:10:31] (03CR) 10Marostegui: [C:03+1] mysql_legacy: update core sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054484 (https://phabricator.wikimedia.org/T367496) (owner: 10Volans) [08:11:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66586 and previous config saved to /var/cache/conftool/dbconfig/20240716-081129-root.json [08:11:43] (03CR) 10Arnaudb: [C:03+1] mysql_legacy: update core sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054484 (https://phabricator.wikimedia.org/T367496) (owner: 10Volans) [08:13:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:13:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:14:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T367781)', diff saved to https://phabricator.wikimedia.org/P66587 and previous config saved to /var/cache/conftool/dbconfig/20240716-081401-arnaudb.json [08:14:06] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [08:14:25] (03CR) 10Volans: [C:03+2] mysql_legacy: 
update core sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054484 (https://phabricator.wikimedia.org/T367496) (owner: 10Volans) [08:15:02] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:15:06] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:20:54] (03Merged) 10jenkins-bot: mysql_legacy: update core sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054484 (https://phabricator.wikimedia.org/T367496) (owner: 10Volans) [08:22:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66588 and previous config saved to /var/cache/conftool/dbconfig/20240716-082213-root.json [08:25:19] (03CR) 10Filippo Giunchedi: [C:03+2] o11y: disable pint promql/series for BenthosKafkaConsumerLag + webrequest [alerts] - 10https://gerrit.wikimedia.org/r/1054363 (https://phabricator.wikimedia.org/T369737) (owner: 10Filippo Giunchedi) [08:27:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P66589 and previous config saved to /var/cache/conftool/dbconfig/20240716-082727-root.json [08:28:03] (03PS1) 10Marostegui: Revert^2 "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054489 [08:28:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Long schema change [08:28:51] (03CR) 10Marostegui: [C:03+2] Revert^2 "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054489 (owner: 10Marostegui) [08:28:56] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Long schema change [08:31:16] !log Clone dbstore1008:3317 from db1174 T370122 [08:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:19] T370122: dbstore1008:3317 (s7) crashed - https://phabricator.wikimedia.org/T370122 [08:32:53] !log root@kafka-logging1001:~# kafka topics --alter --topic mediawiki.httpd.accesslog --partitions 12 - T369256 [08:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:57] T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic - https://phabricator.wikimedia.org/T369256 [08:49:44] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - No response from remote host 185.15.58.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:49:44] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - No response from remote host 185.15.58.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:50:28] !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[08:51:06] (03PS2) 10Effie Mouzeli: mcrouter: test bookworm image on mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) [08:51:15] (03PS1) 10Marostegui: filtered_tables.txt: Remove old columns [puppet] - 10https://gerrit.wikimedia.org/r/1054495 (https://phabricator.wikimedia.org/T343718) [08:51:20] 06SRE, 06collaboration-services, 06Release-Engineering-Team, 06Traffic, 13Patch-For-Review: implement anti-abuse features for GitLAb (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9984285 (10Jelto) [08:51:47] (03CR) 10Marostegui: [C:03+2] filtered_tables.txt: Remove old columns [puppet] - 10https://gerrit.wikimedia.org/r/1054495 (https://phabricator.wikimedia.org/T343718) (owner: 10Marostegui) [08:53:22] 06SRE, 06collaboration-services, 06Release-Engineering-Team, 06Traffic, 13Patch-For-Review: implement anti-abuse features for GitLAb (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9984291 (10Jelto) I migrated the GitLab hosts to nftables which unblocks us using nftables built-in... 
[08:53:35] 06SRE, 06collaboration-services, 06Release-Engineering-Team, 06Traffic, 13Patch-For-Review: implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9984292 (10Jelto) [08:54:21] jouncebot: now [08:54:21] No deployments scheduled for the next 1 hour(s) and 5 minute(s) [08:54:29] jouncebot: next [08:54:29] In 1 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1000) [08:55:55] (03PS1) 10Marostegui: filtered_tables.txt: Remove unused columns [puppet] - 10https://gerrit.wikimedia.org/r/1054496 (https://phabricator.wikimedia.org/T318955) [08:55:55] 06SRE, 06collaboration-services, 06Release-Engineering-Team, 06Traffic, 13Patch-For-Review: implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9984312 (10Jelto) [08:56:32] (03CR) 10Marostegui: [C:03+2] filtered_tables.txt: Remove unused columns [puppet] - 10https://gerrit.wikimedia.org/r/1054496 (https://phabricator.wikimedia.org/T318955) (owner: 10Marostegui) [08:58:05] (03CR) 10JMeybohm: [C:03+1] mcrouter: test bookworm image on mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [09:00:58] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: test bookworm image on mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [09:02:15] (03Merged) 10jenkins-bot: mcrouter: test bookworm image on mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [09:02:57] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:03:00] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:03:05] !log jiji@deploy1002 helmfile 
[eqiad] START helmfile.d/services/mw-debug: apply [09:03:08] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:03:32] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:04:03] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:04:37] (03CR) 10Elukey: mcrouter: test bookworm image on mw-debug (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [09:06:21] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database aewikimedia (T362529) [09:06:25] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529 [09:06:34] (03PS1) 10Marostegui: filtered_tables.txt: Remove unused columns [puppet] - 10https://gerrit.wikimedia.org/r/1054498 (https://phabricator.wikimedia.org/T314041) [09:07:24] (03CR) 10Marostegui: [C:03+2] filtered_tables.txt: Remove unused columns [puppet] - 10https://gerrit.wikimedia.org/r/1054498 (https://phabricator.wikimedia.org/T314041) (owner: 10Marostegui) [09:08:55] (03PS1) 10Effie Mouzeli: mw-debug: fix mcrouter image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054499 [09:09:11] (03CR) 10Vgutierrez: "looking good, as mentioned on the inline comment it would be great if we don't need root privileges to fetch the suffix list file from the" [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [09:10:25] (03CR) 10Effie Mouzeli: [C:03+2] mw-debug: fix mcrouter image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054499 (owner: 10Effie Mouzeli) [09:11:08] !log update docker-report to 0.0.14-1 on bullseye-wikimedia [09:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:24] (03Merged) 10jenkins-bot: mw-debug: fix mcrouter image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054499 
(owner: 10Effie Mouzeli) [09:11:59] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:12:01] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:12:11] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:12:45] (03PS1) 10Marostegui: filtered_tables.txt: Drop unused columns [puppet] - 10https://gerrit.wikimedia.org/r/1054501 (https://phabricator.wikimedia.org/T300774) [09:12:48] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:12:56] !log update docker-registry to 0.0.14-1 on build2001 [09:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:40] (03CR) 10Marostegui: [C:03+2] filtered_tables.txt: Drop unused columns [puppet] - 10https://gerrit.wikimedia.org/r/1054501 (https://phabricator.wikimedia.org/T300774) (owner: 10Marostegui) [09:14:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T367781)', diff saved to https://phabricator.wikimedia.org/P66591 and previous config saved to /var/cache/conftool/dbconfig/20240716-091418-arnaudb.json [09:14:24] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [09:16:25] (03PS3) 10Effie Mouzeli: mcrouter: test bookworm image on mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054368 (https://phabricator.wikimedia.org/T368366) [09:20:51] !log bounce benthos@mw_accesslog_sampler - T369256 [09:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:56] T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic - https://phabricator.wikimedia.org/T369256 [09:22:58] (03PS4) 10Effie Mouzeli: mcrouter: test bookworm image on mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054368 (https://phabricator.wikimedia.org/T368366) [09:23:51] !log isaranto@deploy1002 helmfile [ml-staging-codfw] 
'sync' command on namespace 'recommendation-api-ng' for release 'main' . [09:29:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P66592 and previous config saved to /var/cache/conftool/dbconfig/20240716-092924-arnaudb.json [09:29:30] (03PS4) 10Ayounsi: Extend STORAGE_BACKEND config to support Swift (#16319) [software/netbox] - 10https://gerrit.wikimedia.org/r/980908 (https://phabricator.wikimedia.org/T310717) [09:30:50] (03CR) 10DCausse: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1054392 (owner: 10Ryan Kemper) [09:30:59] (03PS1) 10Slyngshede: C:idm configure 2FA proxy endpoint. [puppet] - 10https://gerrit.wikimedia.org/r/1054502 [09:31:49] (03PS5) 10Effie Mouzeli: mcrouter: test bookworm image on mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054368 (https://phabricator.wikimedia.org/T368366) [09:32:01] !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database aewikimedia (T362529) [09:32:04] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529 [09:33:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:10] (03CR) 10Elukey: [C:03+1] mcrouter: test bookworm image on mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054368 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [09:37:01] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [09:37:05] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [09:37:21] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: test bookworm image on mw-api-int [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1054368 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [09:38:12] (03Merged) 10jenkins-bot: mcrouter: test bookworm image on mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054368 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [09:39:13] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [09:42:25] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [09:44:31] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [09:44:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P66593 and previous config saved to /var/cache/conftool/dbconfig/20240716-094432-arnaudb.json [09:46:49] (03CR) 10Clément Goubert: [C:03+1] wikifeeds: update references to deprecated services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053808 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [09:47:52] (03PS2) 10Ayounsi: Upgrade Netbox to 4.0.7 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1053243 (https://phabricator.wikimedia.org/T336275) [09:48:05] (03PS2) 10Gmodena: eventbus: enable instrumentation on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) [09:49:22] (03PS2) 10Slyngshede: C:idm configure 2FA proxy endpoint. 
[puppet] - 10https://gerrit.wikimedia.org/r/1054502 [09:50:04] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3237/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054502 (owner: 10Slyngshede) [09:50:12] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [09:50:16] (03CR) 10Elukey: [C:03+1] Upgrade Netbox to 4.0.7 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1053243 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:50:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [09:52:01] !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [09:52:31] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [09:53:52] !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:54:44] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[09:59:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T367781)', diff saved to https://phabricator.wikimedia.org/P66594 and previous config saved to /var/cache/conftool/dbconfig/20240716-095939-arnaudb.json [09:59:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:59:44] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [09:59:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:59:57] (03PS1) 10Effie Mouzeli: kubernetes: update mcrouter images to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1054507 (https://phabricator.wikimedia.org/T368366) [10:00:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T367781)', diff saved to https://phabricator.wikimedia.org/P66595 and previous config saved to /var/cache/conftool/dbconfig/20240716-100002-arnaudb.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1000) [10:00:20] (03CR) 10CI reject: [V:04-1] kubernetes: update mcrouter images to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1054507 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [10:00:22] (03PS2) 10Effie Mouzeli: kubernetes: update mcrouter images to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1054507 (https://phabricator.wikimedia.org/T368366) [10:01:31] (03PS1) 10Effie Mouzeli: mw-mcrouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054511 (https://phabricator.wikimedia.org/T368366) [10:05:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T367781)', diff saved to https://phabricator.wikimedia.org/P66597 and previous config saved to /var/cache/conftool/dbconfig/20240716-100556-arnaudb.json 
[10:06:00] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [10:10:20] !log T362529: creating aewikimedia CirrusSearch indices with 'mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=aewikimedia --cluster=all' [10:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:23] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529 [10:16:57] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091#9984595 (10Clement_Goubert) a:03KStineRowe_WMF Hi, Can you please read and sign the L3 document, as well as read the Dat... [10:21:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P66598 and previous config saved to /var/cache/conftool/dbconfig/20240716-102103-arnaudb.json [10:22:30] (03PS2) 10Effie Mouzeli: mw-mcrouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054511 (https://phabricator.wikimedia.org/T368366) [10:23:16] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9984608 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium [10:25:49] (03PS1) 10Jgiannelos: changeprop: Disable pregeneration for mobile-sections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054512 (https://phabricator.wikimedia.org/T328036) [10:29:10] (03PS2) 10Jgiannelos: changeprop: Disable pregeneration for mobile-sections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054512 (https://phabricator.wikimedia.org/T328036) [10:33:17] (03CR) 10JMeybohm: [C:03+1] mw-mcrouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054511 
(https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [10:33:29] (03CR) 10JMeybohm: [C:03+1] kubernetes: update mcrouter images to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1054507 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [10:33:49] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054511 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [10:34:40] (03Merged) 10jenkins-bot: mw-mcrouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054511 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [10:35:32] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [10:36:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P66599 and previous config saved to /var/cache/conftool/dbconfig/20240716-103610-arnaudb.json [10:41:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:46:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:47:04] (03CR) 10Vgutierrez: [C:03+1] "so, after taking a deeper look to traffic-puppetserver-bookworm, when installed by Andrew Bogott it looks like he took care of migrating t" [puppet] - 10https://gerrit.wikimedia.org/r/1053937 
(https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [10:47:47] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:47:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [10:48:47] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29685 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [10:50:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66600 and previous config saved to /var/cache/conftool/dbconfig/20240716-105006-root.json [10:50:28] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3238/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054516 (https://phabricator.wikimedia.org/T368518) (owner: 10Btullis) [10:51:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:51:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T367781)', diff saved to https://phabricator.wikimedia.org/P66601 and previous config saved to /var/cache/conftool/dbconfig/20240716-105117-arnaudb.json 
[10:51:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:51:21] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [10:51:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:51:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T367781)', diff saved to https://phabricator.wikimedia.org/P66602 and previous config saved to /var/cache/conftool/dbconfig/20240716-105139-arnaudb.json [10:53:20] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [10:54:46] (03PS1) 10Marostegui: Revert "dbstore1008: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054517 [10:55:09] (03PS1) 10Marostegui: Revert^3 "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054518 [10:56:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:56:18] (03CR) 10Marostegui: [C:03+2] Revert^3 "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054518 (owner: 10Marostegui) [10:56:27] (03CR) 10Marostegui: [C:03+2] Revert "dbstore1008: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054517 (owner: 10Marostegui) [10:57:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T367781)', diff saved to https://phabricator.wikimedia.org/P66603 and previous config saved to /var/cache/conftool/dbconfig/20240716-105732-arnaudb.json [10:57:36] T367781: Drop deprecated abuse filter fields on wmf wikis - 
https://phabricator.wikimedia.org/T367781 [10:57:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:05:04] jouncebot: now [11:05:04] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [11:05:07] jouncebot: next [11:05:07] In 0 hour(s) and 54 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1200) [11:05:09] (03PS1) 10Slyngshede: P:idm_test add dummy secrets for mediawiki integration. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1054519 [11:05:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66604 and previous config saved to /var/cache/conftool/dbconfig/20240716-110512-root.json [11:07:44] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [11:08:02] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [11:11:17] (03PS1) 10Stevemunene: [WIP] wdqs: create wdqs split pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1054520 (https://phabricator.wikimedia.org/T364368) [11:12:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:12:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P66605 and previous config saved to /var/cache/conftool/dbconfig/20240716-111239-arnaudb.json [11:17:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:20:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66606 and previous config saved to /var/cache/conftool/dbconfig/20240716-112017-root.json [11:20:39] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [11:20:41] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [11:23:37] memcached errors are due to deployment 
[11:27:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P66607 and previous config saved to /var/cache/conftool/dbconfig/20240716-112746-arnaudb.json [11:35:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66608 and previous config saved to /var/cache/conftool/dbconfig/20240716-113523-root.json [11:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:42:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T367781)', diff saved to https://phabricator.wikimedia.org/P66610 and previous config saved to /var/cache/conftool/dbconfig/20240716-114254-arnaudb.json [11:42:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:42:58] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [11:43:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:43:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T367781)', diff saved to https://phabricator.wikimedia.org/P66611 and previous config saved to /var/cache/conftool/dbconfig/20240716-114315-arnaudb.json [11:49:18] !log drain mw1496.eqiad.wmnet [11:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66613 and previous config saved to /var/cache/conftool/dbconfig/20240716-115028-root.json [11:59:21] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T367856)', diff saved to https://phabricator.wikimedia.org/P66614 and previous config saved to /var/cache/conftool/dbconfig/20240716-115920-marostegui.json [11:59:25] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:00:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2116.codfw.wmnet with reason: Maintenance [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1200) [12:00:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2116.codfw.wmnet with reason: Maintenance [12:00:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T367856)', diff saved to https://phabricator.wikimedia.org/P66615 and previous config saved to /var/cache/conftool/dbconfig/20240716-120012-marostegui.json [12:00:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T367856)', diff saved to https://phabricator.wikimedia.org/P66616 and previous config saved to /var/cache/conftool/dbconfig/20240716-120021-marostegui.json [12:05:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:05:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66617 and previous config saved to /var/cache/conftool/dbconfig/20240716-120534-root.json [12:06:26] ^ me for the memcached errors [12:09:11] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to 
netbox-dev2003.codfw.wmnet with reason: Release v4.0.7 to netbox-next - ayounsi@cumin1002 - T336275 [12:09:15] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [12:10:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:10:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.7 to netbox-next - ayounsi@cumin1002 - T336275 [12:14:45] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 444.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:15:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P66618 and previous config saved to /var/cache/conftool/dbconfig/20240716-121528-marostegui.json [12:17:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:20:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66619 and previous config saved to /var/cache/conftool/dbconfig/20240716-122039-root.json [12:30:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P66620 and previous config saved to /var/cache/conftool/dbconfig/20240716-123035-marostegui.json [12:34:45] RESOLVED: 
MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:38:21] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 396.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:39:21] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:43:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T367781)', diff saved to https://phabricator.wikimedia.org/P66621 and previous config saved to /var/cache/conftool/dbconfig/20240716-124332-arnaudb.json [12:43:37] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [12:45:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T367856)', diff saved to https://phabricator.wikimedia.org/P66622 and previous config saved to /var/cache/conftool/dbconfig/20240716-124543-marostegui.json [12:45:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2130.codfw.wmnet with reason: Maintenance [12:45:49] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:45:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2130.codfw.wmnet with reason: Maintenance [12:46:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T367856)', diff saved to https://phabricator.wikimedia.org/P66623 and previous config saved to /var/cache/conftool/dbconfig/20240716-124604-marostegui.json 
[12:58:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P66624 and previous config saved to /var/cache/conftool/dbconfig/20240716-125839-arnaudb.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1300). [13:00:04] tchin and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:29] hola [13:00:55] o/ [13:01:09] tchin: do you want to self-serve or should I deploy? [13:01:20] (and same question to tgr|away ^^) [13:01:42] * urbanecm waves [13:01:47] Lucas_WMDE: lemme know if you want my help [13:01:57] * Lucas_WMDE waves back [13:01:58] I can self-serve [13:02:38] I'm not at a pc with ssh right now can you deploy? [13:02:44] sure! [13:03:01] will you still be able to test the change on mwdebug? 
[13:03:20] yes should be fine [13:03:24] ok [13:04:51] * Lucas_WMDE wonders where wikibugs is [13:05:04] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1052762|EventStreamConfig: Enable hive ingestion for mediawiki.page-delete (T367134)]] [13:05:08] T367134: [Refine Refactoring] Integrate Refine workflow configuration into ESC - https://phabricator.wikimedia.org/T367134 [13:09:11] !log lucaswerkmeister-wmde@deploy1002 tchin, lucaswerkmeister-wmde: Backport for [[gerrit:1052762|EventStreamConfig: Enable hive ingestion for mediawiki.page-delete (T367134)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:10:12] * tchin Looks good on mwdebug [13:10:16] !log lucaswerkmeister-wmde@deploy1002 tchin, lucaswerkmeister-wmde: Continuing with sync [13:10:20] ok, thanks for testing! [13:10:35] (I’ve asked around in #wikimedia-cloud about wikibugs btw) [13:13:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P66625 and previous config saved to /var/cache/conftool/dbconfig/20240716-131346-arnaudb.json [13:15:20] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1052762|EventStreamConfig: Enable hive ingestion for mediawiki.page-delete (T367134)]] (duration: 10m 15s) [13:15:24] T367134: [Refine Refactoring] Integrate Refine workflow configuration into ESC - https://phabricator.wikimedia.org/T367134 [13:15:33] tgr|away: all yours :) [13:16:11] thx [13:19:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [13:19:16] yay, wikibugs is back [13:19:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - 
https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [13:19:45] (03Merged) 10jenkins-bot: Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [13:20:15] !log tgr@deploy1002 Started scap sync-world: Backport for [[gerrit:1036245|Handle sso.wikimedia.org domain (T365162)]] [13:20:19] T365162: Set up sso.wikimedia.beta.wmflabs.org with config-layer routing to other wikis - https://phabricator.wikimedia.org/T365162 [13:21:39] * Lucas_WMDE 👀 at the “MariaDb running with --read-only” errors in logspam-watch [13:21:58] (03CR) 10CI reject: [V:04-1] Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [13:22:03] (03CR) 10CI reject: [V:04-1] Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:22:09] Lucas_WMDE: there was a switchover today, and those scripts apparently haven't reloaded the config [13:22:23] ah, long-running maintenance scripts [13:22:26] how we love them [13:22:31] Yeah.. [13:22:43] 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9985187 (10Papaul) @fgiunchedi yes the server will keep the same IP since we will just relocate it within the same rack. please see step below - power of the server - plug the 10G card - move the... [13:22:44] !log tgr@deploy1002 tgr: Backport for [[gerrit:1036245|Handle sso.wikimedia.org domain (T365162)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:22:47] I should actually kill them, it is not nice to keep trying to write to a host that is RO [13:23:28] some euwiki eval.php [13:23:31] yeah [13:23:34] it is always euwiki [13:23:50] what debug host am I supposed to use these days? just k8s-mwdebug? 
[13:24:04] yeah [13:24:20] Lucas_WMDE: I just killed them [13:24:41] marostegui: IIRC that eval.php by catrope caused some other errors the other day [13:24:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [13:25:03] Lucas_WMDE: You think there's a task somewhere about that? [13:25:05] hopefully he’ll know to restart it if he needs it [13:25:09] (03CR) 10Bking: [C:03+1] team-search-platform: migrate cirrus_cluster_checks [alerts] - 10https://gerrit.wikimedia.org/r/1054317 (https://phabricator.wikimedia.org/T359033) (owner: 10DCausse) [13:25:13] marostegui: https://phabricator.wikimedia.org/T369600#9965707 is what I remembered [13:25:30] Lucas_WMDE: thank you, I will check [13:25:37] if that was still the same process then it was running for over a week now ._. [13:26:06] Lucas_WMDE: the process was from 12th july [13:26:23] (03PS8) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [13:26:23] (03PS4) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [13:27:42] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v8.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054561 [13:28:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T367781)', diff saved to https://phabricator.wikimedia.org/P66626 and previous config saved to /var/cache/conftool/dbconfig/20240716-132853-arnaudb.json [13:28:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:28:59] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:28:59] (03PS5) 10Ayounsi: Tox: add Python3.12 
support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [13:28:59] (03PS1) 10Ayounsi: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 [13:29:08] (03PS2) 10Elukey: CHANGELOG: add changelogs for release v8.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054561 [13:29:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:29:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T367781)', diff saved to https://phabricator.wikimedia.org/P66627 and previous config saved to /var/cache/conftool/dbconfig/20240716-132915-arnaudb.json [13:29:23] Normally maint scripts should call Maintenance::waitForReplication which calls $lbFactory->autoReconfigure(); which should prevent issues like this [13:29:25] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@1ee55b8]: (no justification provided) [13:29:55] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@1ee55b8]: (no justification provided) (duration: 00m 30s) [13:31:02] (03PS3) 10Elukey: CHANGELOG: add changelogs for release v8.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054561 [13:32:16] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054561 (owner: 10Elukey) [13:32:31] (03CR) 10CI reject: [V:04-1] Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:32:57] (03CR) 10CI reject: [V:04-1] Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [13:33:38] zabe: not sure that’s possible in eval.php [13:33:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:57] but maybe the correct answer there is “please don’t run week-long maintenance scripts in eval.php”… [13:34:11] (03CR) 10FNegri: [C:03+1] Switch the rols of clouddb1021 to insetup::data_engineering [puppet] - 10https://gerrit.wikimedia.org/r/1054516 (https://phabricator.wikimedia.org/T368518) (owner: 10Btullis) [13:34:32] !log tgr@deploy1002 tgr: Continuing with sync [13:35:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T367781)', diff saved to https://phabricator.wikimedia.org/P66628 and previous config saved to /var/cache/conftool/dbconfig/20240716-133508-arnaudb.json [13:35:12] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:35:31] (03CR) 10Btullis: [V:03+1 C:03+2] Switch the rols of clouddb1021 to insetup::data_engineering [puppet] - 10https://gerrit.wikimedia.org/r/1054516 (https://phabricator.wikimedia.org/T368518) (owner: 10Btullis) [13:35:32] (03CR) 10CI reject: [V:04-1] Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [13:35:33] (03CR) 10CI reject: [V:04-1] Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [13:35:55] it's possible, but not straightforward [13:36:11] you'd need to create an anonymous Maintenance subclass or something [13:36:21] jouncebot: now [13:36:21] For the next 0 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1300) [13:36:25] or just call autoReconfigure directly [13:37:01] but yeah seems like a pretty bad idea to do anything important and long-running from a throwaway eval loop [13:37:10] I was wondering if eval.php should do this in its while loop, but presumably the script is running one long statement, not a 
series of statements being read from stdin [13:37:50] yeah the loop would have to be implemented in the code that gets eval'd [13:38:50] (03CR) 10JMeybohm: otelcol: Stop hardcoding k8s master IP addresses (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [13:39:02] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v8.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054561 (owner: 10Elukey) [13:39:22] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1036245|Handle sso.wikimedia.org domain (T365162)]] (duration: 19m 07s) [13:39:26] T365162: Set up sso.wikimedia.beta.wmflabs.org with config-layer routing to other wikis - https://phabricator.wikimedia.org/T365162 [13:40:40] !log UTC afternoon deploys done [13:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:02] cc effie [13:41:21] (19 minutes left before urbanecm et al have another window booked ^^) [13:41:43] tgr|away: good luck with sso.w.o btw! [13:41:48] ^^ [13:41:57] thanks! 
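The point made in the 13:29 to 13:37 exchange (a long-running script should refresh its configuration inside its own loop rather than once at startup) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not MediaWiki code: `run_batches`, `load_config`, `process_batch`, and the `read_only` key are hypothetical names standing in for the kind of refresh the chat attributes to `Maintenance::waitForReplication` and `$lbFactory->autoReconfigure()`.

```python
from typing import Any, Callable, Dict, Iterable, List


def run_batches(
    batches: Iterable[Any],
    load_config: Callable[[], Dict[str, Any]],
    process_batch: Callable[[Any, Dict[str, Any]], None],
) -> List[Any]:
    """Process batches, re-reading configuration before each one.

    Returns the batches that were skipped because the freshly loaded
    configuration reported the target as read-only.
    """
    skipped: List[Any] = []
    for batch in batches:
        # Hypothetical refresh step: fetch config on every iteration
        # instead of caching the value read at startup.
        config = load_config()
        if config.get("read_only", False):
            # Do not keep writing to a host that is now read-only.
            skipped.append(batch)
            continue
        process_batch(batch, config)
    return skipped
```

As noted in the chat, this only helps when the loop lives in the code being run; a single long statement handed to an eval shell gives the shell no chance to refresh anything between steps.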
[13:43:33] (03CR) 10Gergő Tisza: "FWIW I tested during deployment and it seems you can't fake a request to an unknown domain in production because it does not get baked int" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [13:43:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [13:44:59] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054561 (owner: 10Elukey) [13:45:45] 10SRE-swift-storage, 13Patch-For-Review: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621#9985424 (10MatthewVernon) Task updated to reflect name change, updates to technology and scope, and to update to state of progress. [13:46:24] 10SRE-swift-storage, 13Patch-For-Review: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621#9985417 (10MatthewVernon) 05Stalled→03Open [13:48:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [13:50:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P66629 and previous config saved to /var/cache/conftool/dbconfig/20240716-135015-arnaudb.json [13:52:45] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:52:54] (03PS9) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [13:52:54] (03PS2) 10Ayounsi: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 [13:52:54] (03PS6) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [13:53:55] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2432.codfw.wmnet [13:54:25] 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9985484 (10fgiunchedi) Thank you @Papaul that is quite helpful! The steps make sense to me, I'm happy to take care of the server configuration (adjusting configuration). I'd even simplify those as... [13:57:13] (03PS1) 10Elukey: Upstream release v8.7.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1054569 [13:57:56] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1054569 (owner: 10Elukey) [13:59:43] (03CR) 10Ssingh: Release 0.9.8-1+wmf12u1 (032 comments) [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [14:00:05] seddon, urbanecm, and dbrant: Account Vanishing deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1400). Please do the needful. [14:00:12] o/ [14:00:17] (03CR) 10CI reject: [V:04-1] Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:00:18] (03CR) 10CI reject: [V:04-1] Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [14:00:19] Seddon: dbrant: Hey! 
[14:00:20] (03CR) 10CI reject: [V:04-1] Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [14:00:22] o/ [14:00:33] Dmitry might not be in here [14:00:47] One second [14:00:51] yep [14:02:24] hello dbrant! [14:02:28] o/ [14:03:01] (03CR) 10Effie Mouzeli: [C:03+2] mw-debug and mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054556 (owner: 10Effie Mouzeli) [14:03:04] (03PS10) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [14:03:04] (03PS3) 10Ayounsi: Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 [14:03:04] (03PS7) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [14:03:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2432.codfw.wmnet [14:03:42] effie: should i wait for your mw changes to finish before i start with my window? 
[14:03:58] (03Merged) 10jenkins-bot: mw-debug and mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054556 (owner: 10Effie Mouzeli) [14:04:04] (happy to, just let me know when i can start) [14:04:51] (03CR) 10Ayounsi: [C:03+1] Release 0.9.8-1+wmf12u1 (032 comments) [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [14:05:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P66630 and previous config saved to /var/cache/conftool/dbconfig/20240716-140522-arnaudb.json [14:05:31] (03PS1) 10Urbanecm: Introduce Vanish Request Flow [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054571 (https://phabricator.wikimedia.org/T367329) [14:05:43] (03PS10) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [14:06:08] (03Abandoned) 10Urbanecm: Introduce Vanish Request Flow [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054571 (https://phabricator.wikimedia.org/T367329) (owner: 10Urbanecm) [14:06:15] urbanecm: yes please if possible, I checked for the backport window only sigh [14:06:25] it will be quick I reckon [14:06:37] effie: no worries. we're prepping for the release now, i'll wait for your go ahead before touching prod :) [14:06:37] (03PS4) 10Dbrant: Enable account vanishing in CentralAuth. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053373 (https://phabricator.wikimedia.org/T369141) [14:06:41] (03CR) 10Elukey: [C:03+2] Upstream release v8.7.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1054569 (owner: 10Elukey) [14:06:45] urbanecm: cool tx [14:07:21] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:07:44] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:08:32] (03PS1) 10Urbanecm: Introduce Vanish Request Flow [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054572 (https://phabricator.wikimedia.org/T367329) [14:08:54] (03PS1) 10Urbanecm: Pass wiki id to actor store for cross-db hasPublicLogs query [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054573 (https://phabricator.wikimedia.org/T370059) [14:08:55] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:08:57] (03PS1) 10Urbanecm: Properly set automatic vanish performer on GlobalRenameUser [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054574 (https://phabricator.wikimedia.org/T368177) [14:09:17] (03CR) 10CI reject: [V:04-1] Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:10:03] (03CR) 10CI reject: [V:04-1] Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [14:10:05] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:10:31] (03CR) 10CI reject: [V:04-1] Adapt tests for Netbox 4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1054562 (owner: 10Ayounsi) [14:11:20] (03CR) 10Cwhite: [C:03+1] "Found the `pint file/disable promql/series` on line 9." 
[alerts] - 10https://gerrit.wikimedia.org/r/1054555 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [14:11:23] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:12:38] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:13:23] urbanecm: done, tx [14:13:27] thanks! [14:13:32] (03CR) 10Urbanecm: [C:03+2] Introduce Vanish Request Flow [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054572 (https://phabricator.wikimedia.org/T367329) (owner: 10Urbanecm) [14:13:37] (03CR) 10Urbanecm: [C:03+2] Pass wiki id to actor store for cross-db hasPublicLogs query [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054573 (https://phabricator.wikimedia.org/T370059) (owner: 10Urbanecm) [14:13:41] (03CR) 10Urbanecm: [C:03+2] Properly set automatic vanish performer on GlobalRenameUser [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054574 (https://phabricator.wikimedia.org/T368177) (owner: 10Urbanecm) [14:14:07] (03CR) 10Urbanecm: [C:03+2] Enable account vanishing in CentralAuth. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053373 (https://phabricator.wikimedia.org/T369141) (owner: 10Dbrant) [14:14:53] (03Merged) 10jenkins-bot: Enable account vanishing in CentralAuth. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053373 (https://phabricator.wikimedia.org/T369141) (owner: 10Dbrant) [14:15:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054572 (https://phabricator.wikimedia.org/T367329) (owner: 10Urbanecm) [14:15:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054573 (https://phabricator.wikimedia.org/T370059) (owner: 10Urbanecm) [14:15:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054574 (https://phabricator.wikimedia.org/T368177) (owner: 10Urbanecm) [14:18:07] (03PS2) 10Filippo Giunchedi: o11y: disable promql/series for BenthosKafkaConsumerLag [alerts] - 10https://gerrit.wikimedia.org/r/1054555 (https://phabricator.wikimedia.org/T354255) [14:18:11] (03PS1) 10Bking: relforge: remove non-functional TLS termination changes [puppet] - 10https://gerrit.wikimedia.org/r/1054578 (https://phabricator.wikimedia.org/T368950) [14:20:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T367781)', diff saved to https://phabricator.wikimedia.org/P66631 and previous config saved to /var/cache/conftool/dbconfig/20240716-142029-arnaudb.json [14:20:34] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:21:25] 10ops-codfw, 06SRE, 06DC-Ops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164 (10Jhancock.wm) 03NEW [14:22:02] (03PS3) 10Effie Mouzeli: kubernetes: update mcrouter images to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1054507 (https://phabricator.wikimedia.org/T368366) [14:22:34] (03CR) 10Effie Mouzeli: [C:03+2] 
kubernetes: update mcrouter images to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1054507 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [14:22:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1212.eqiad.wmnet with reason: Maintenance [14:22:44] (03CR) 10Filippo Giunchedi: [C:03+2] o11y: disable promql/series for BenthosKafkaConsumerLag [alerts] - 10https://gerrit.wikimedia.org/r/1054555 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [14:22:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1212.eqiad.wmnet with reason: Maintenance [14:22:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:23:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:23:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T367781)', diff saved to https://phabricator.wikimedia.org/P66632 and previous config saved to /var/cache/conftool/dbconfig/20240716-142321-arnaudb.json [14:24:09] (03CR) 10Ssingh: "Thanks for the review!" 
[debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [14:24:13] (03CR) 10Ssingh: [C:03+2] Release 0.9.8-1+wmf12u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [14:24:17] (03PS8) 10CDobbins: purged: set use_pki to true for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) [14:24:58] (03CR) 10Ottomata: eventbus: enable instrumentation on group 0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [14:25:00] (03Merged) 10jenkins-bot: Introduce Vanish Request Flow [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054572 (https://phabricator.wikimedia.org/T367329) (owner: 10Urbanecm) [14:25:10] (03Merged) 10jenkins-bot: Pass wiki id to actor store for cross-db hasPublicLogs query [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054573 (https://phabricator.wikimedia.org/T370059) (owner: 10Urbanecm) [14:25:11] (03Merged) 10jenkins-bot: Properly set automatic vanish performer on GlobalRenameUser [extensions/CentralAuth] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054574 (https://phabricator.wikimedia.org/T368177) (owner: 10Urbanecm) [14:25:47] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1054572|Introduce Vanish Request Flow (T367329 T367726 T367728 T367729 T367744 T368177 T368285 T368368 T368372 T368611 T369489)]], [[gerrit:1054573|Pass wiki id to actor store for cross-db hasPublicLogs query (T370059)]], [[gerrit:1054574|Properly set automatic vanish performer on GlobalRenameUser (T368177)]], [[gerrit:1053373|Enable account vanishing [14:25:47] in CentralAuth. 
(T369141)]] [14:26:29] (03CR) 10Ottomata: [C:03+1] eventbus: enable instrumentation on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [14:26:52] T367329: Create Special:AccountVanishRequest page - https://phabricator.wikimedia.org/T367329 [14:26:52] T367726: Initiate Global Rename queue from `Special:AccountVanishRequestPage` - https://phabricator.wikimedia.org/T367726 [14:26:53] T367728: Customise "status" page for Vanishing Account - https://phabricator.wikimedia.org/T367728 [14:26:53] T367729: Customise Vanishing account Approval/Decline email - https://phabricator.wikimedia.org/T367729 [14:26:54] T367744: [EPIC] Phase 3 - Enable Global Rename Queue with Account Vanishing - https://phabricator.wikimedia.org/T367744 [14:26:54] T368177: Automatically accept vanishing requests if the user has no activity - https://phabricator.wikimedia.org/T368177 [14:26:54] T368285: Update Special:GlobalRenameQueue request view to work for vanish requests - https://phabricator.wikimedia.org/T368285 [14:26:55] T368368: Create Zendesk ticket when vanishing is declined - https://phabricator.wikimedia.org/T368368 [14:26:55] T368372: Define list for "appeal for a block" - https://phabricator.wikimedia.org/T368372 [14:26:55] T368611: Update Copy in the "alert" popup - https://phabricator.wikimedia.org/T368611 [14:26:56] T369489: Enhance the auto-vanish maintenance script - https://phabricator.wikimedia.org/T369489 [14:26:56] T370059: Auto-vanishing failing with error InvalidArgumentException: DB connection domain 'loginwiki' does not match 'metawiki' - https://phabricator.wikimedia.org/T370059 [14:26:57] T369141: Setup live configuration for account vanishing - https://phabricator.wikimedia.org/T369141 [14:27:12] (03PS1) 10Effie Mouzeli: mw-mcrouter: use puppet defined image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054580 [14:28:23] (03CR) 10Ssingh: "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:29:48] (03CR) 10CDobbins: [C:03+2] purged: set use_pki to true for all sites (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:29:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T367781)', diff saved to https://phabricator.wikimedia.org/P66633 and previous config saved to /var/cache/conftool/dbconfig/20240716-142953-arnaudb.json [14:29:59] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:31:59] (03PS3) 10Gmodena: eventbus: enable instrumentation on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) [14:32:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [14:33:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T365997 - depool db1194-s7,db1200-s5,db1201-s6', diff saved to https://phabricator.wikimedia.org/P66634 and previous config saved to /var/cache/conftool/dbconfig/20240716-143306-arnaudb.json [14:33:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db[1194,1200-1201].eqiad.wmnet,dbstore1009.eqiad.wmnet with reason: T365997 [14:33:22] T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997 [14:33:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[1194,1200-1201].eqiad.wmnet,dbstore1009.eqiad.wmnet with reason: T365997 [14:34:06] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.22 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:34:07] !log Cordoning kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet - T365997 [14:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054578 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [14:36:51] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: name=(kubernetes1062.eqiad.wmnet|mw1494.eqiad.wmnet|mw1495.eqiad.wmnet),cluster=kubernetes,service=kubesvc [14:37:23] I've silenced clouddb1019 alert dhinus marostegui ( 53432765-2729-4b06-9198-a04d03c9966c ) → this "false positive" should be fixed when we finish T369715 [14:37:23] T369715: Gather all mariadb host under the same prometheus label - https://phabricator.wikimedia.org/T369715 [14:37:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [14:37:58] arnaudb: what do you mean false positive? [14:38:36] you forgot the quotes! :D it's a threshold that should be adjusted as it's too generic for that specific host (given the criticality of the alert etc.) [14:39:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:27] arnaudb: But that alert comes from icinga I believe [14:39:47] You mean you'll adjust the prometheus future one? 
[14:40:09] yep this will be fixed during the migration indeed, that was my original meaning :) [14:40:11] (03PS2) 10Filippo Giunchedi: data-engineering: disable promql/rate lint for MediawikiPageContentChangeEnrichAvailability [alerts] - 10https://gerrit.wikimedia.org/r/1054540 (https://phabricator.wikimedia.org/T354255) [14:40:11] (03PS2) 10Filippo Giunchedi: data-platform: fix datahub availability [alerts] - 10https://gerrit.wikimedia.org/r/1054551 (https://phabricator.wikimedia.org/T354255) [14:40:55] (03CR) 10Gmodena: eventbus: enable instrumentation on group 0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [14:41:01] (03PS4) 10Gmodena: eventbus: enable instrumentation on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) [14:41:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: 10Seawolf35gerrit) [14:42:08] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 24.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:43:48] (03CR) 10Filippo Giunchedi: [C:04-1] "doesn't pass PCC https://puppet-compiler.wmflabs.org/output/1053698/3246/" [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [14:44:55] !log reprepro -C main include bookworm-wikimedia anycast-healthchecker_0.9.8-1+wmf12u1_amd64.changes: T370068 [14:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to 
https://phabricator.wikimedia.org/P66635 and previous config saved to /var/cache/conftool/dbconfig/20240716-144500-arnaudb.json [14:45:02] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [14:46:16] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:50:00 on lsw1-f2-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f2-eqiad [14:46:31] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:50:00 on lsw1-f2-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f2-eqiad [14:46:44] (03PS4) 10CDanis: otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) [14:46:45] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9985956 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=36afd2cf-508d-4c02-a8cc-afb66ea29242) set by cmooney@... [14:46:49] (03CR) 10CDanis: otelcol: Stop hardcoding k8s master IP addresses (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [14:46:53] (03PS1) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054582 (https://phabricator.wikimedia.org/T368010) [14:47:48] (03CR) 10Bking: [C:03+2] "self-merging, as this only affects a test environment." [puppet] - 10https://gerrit.wikimedia.org/r/1054578 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [14:47:55] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#9985959 (10Jhancock.wm) a:05Jhancock.wm→03Papaul got the servers set up with temp idrac IPs. all yours. 
[14:49:36] !log [durum1001] upgrade anycast-healthchecker to 0.9.8-1+wmf12u1: T370068 [14:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:02] (03CR) 10Kamila Součková: [C:03+1] changeprop: Disable pregeneration for mobile-sections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054512 (https://phabricator.wikimedia.org/T328036) (owner: 10Jgiannelos) [14:50:30] (03PS2) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054582 (https://phabricator.wikimedia.org/T368010) [14:50:52] (03CR) 10Jgiannelos: "I double checked turnilo for traffic. Last reference from MWOffliner related traffic to mobile-sections is on 1st of July and before that " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054512 (https://phabricator.wikimedia.org/T328036) (owner: 10Jgiannelos) [14:51:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66636 and previous config saved to /var/cache/conftool/dbconfig/20240716-145159-root.json [14:53:17] !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on centrallog2002.codfw.wmnet with reason: network upgrade [14:53:31] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on centrallog2002.codfw.wmnet with reason: network upgrade [14:53:35] 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9985983 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0bfb0df8-b693-4fb7-8581-00886bab46c6) set by filippo@cumin1002 for 3:00:00 on 1 host(s) and their services with reason: ne... 
[14:53:37] !log urbanecm@deploy1002 dbrant, urbanecm: Backport for [[gerrit:1054572|Introduce Vanish Request Flow (T367329 T367726 T367728 T367729 T367744 T368177 T368285 T368368 T368372 T368611 T369489)]], [[gerrit:1054573|Pass wiki id to actor store for cross-db hasPublicLogs query (T370059)]], [[gerrit:1054574|Properly set automatic vanish performer on GlobalRenameUser (T368177)]], [[gerrit:1053373|Enable account vanishing in Cen [14:53:37] tralAuth. (T369141)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:53:41] !log urbanecm@deploy1002 dbrant, urbanecm: Continuing with sync [14:53:55] T367329: Create Special:AccountVanishRequest page - https://phabricator.wikimedia.org/T367329 [14:53:55] T367726: Initiate Global Rename queue from `Special:AccountVanishRequestPage` - https://phabricator.wikimedia.org/T367726 [14:53:55] T367728: Customise "status" page for Vanishing Account - https://phabricator.wikimedia.org/T367728 [14:53:56] T367729: Customise Vanishing account Approval/Decline email - https://phabricator.wikimedia.org/T367729 [14:53:56] T367744: [EPIC] Phase 3 - Enable Global Rename Queue with Account Vanishing - https://phabricator.wikimedia.org/T367744 [14:53:57] T368177: Automatically accept vanishing requests if the user has no activity - https://phabricator.wikimedia.org/T368177 [14:53:57] T368285: Update Special:GlobalRenameQueue request view to work for vanish requests - https://phabricator.wikimedia.org/T368285 [14:53:58] T368368: Create Zendesk ticket when vanishing is declined - https://phabricator.wikimedia.org/T368368 [14:53:58] T368372: Define list for "appeal for a block" - https://phabricator.wikimedia.org/T368372 [14:53:58] T368611: Update Copy in the "alert" popup - https://phabricator.wikimedia.org/T368611 [14:53:59] T369489: Enhance the auto-vanish maintenance script - https://phabricator.wikimedia.org/T369489 [14:53:59] T370059: Auto-vanishing failing with error InvalidArgumentException: DB connection domain 
'loginwiki' does not match 'metawiki' - https://phabricator.wikimedia.org/T370059 [14:53:59] T369141: Setup live configuration for account vanishing - https://phabricator.wikimedia.org/T369141 [14:55:48] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:55:48] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:56:28] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:36] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:38] hmm? [14:57:07] looking [14:57:30] oh that might be me with centrallog2002 sukhe [14:57:44] in which case, expected as part of https://phabricator.wikimedia.org/T369826 [14:57:48] PROBLEM - Bird Internet Routing Daemon on durum1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:57:52] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum1001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:58:01] or maybe not! [14:58:06] (03PS1) 10DCausse: rdf-streaming-updater: configure the split graph updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054584 (https://phabricator.wikimedia.org/T361935) [14:58:15] yeah, maybe not! 
[14:58:26] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:58:26] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:58:43] eqiad is probably expected because of durum1001 [14:58:44] codfw, no [14:58:46] so looking [14:59:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1500). [15:00:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P66637 and previous config saved to /var/cache/conftool/dbconfig/20240716-150007-arnaudb.json [15:00:36] FIRING: [4x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:49] * urbanecm still deploying MW [15:01:40] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1054572|Introduce Vanish Request Flow (T367329 T367726 T367728 T367729 T367744 T368177 T368285 T368368 T368372 T368611 T369489)]], [[gerrit:1054573|Pass wiki id to actor store for cross-db hasPublicLogs query (T370059)]], [[gerrit:1054574|Properly set automatic vanish performer on GlobalRenameUser (T368177)]], [[gerrit:1053373|Enable account vanishing in Centra [15:01:40] lAuth. 
(T369141)]] (duration: 35m 52s) [15:01:46] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:01:58] T367329: Create Special:AccountVanishRequest page - https://phabricator.wikimedia.org/T367329 [15:01:58] T367726: Initiate Global Rename queue from `Special:AccountVanishRequestPage` - https://phabricator.wikimedia.org/T367726 [15:01:59] T367728: Customise "status" page for Vanishing Account - https://phabricator.wikimedia.org/T367728 [15:01:59] T367729: Customise Vanishing account Approval/Decline email - https://phabricator.wikimedia.org/T367729 [15:01:59] T367744: [EPIC] Phase 3 - Enable Global Rename Queue with Account Vanishing - https://phabricator.wikimedia.org/T367744 [15:02:00] T368177: Automatically accept vanishing requests if the user has no activity - https://phabricator.wikimedia.org/T368177 [15:02:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:02:00] T368285: Update Special:GlobalRenameQueue request view to work for vanish requests - https://phabricator.wikimedia.org/T368285 [15:02:01] T368368: Create Zendesk ticket when vanishing is declined - https://phabricator.wikimedia.org/T368368 [15:02:01] T368372: Define list for "appeal for a block" - https://phabricator.wikimedia.org/T368372 [15:02:01] T368611: Update Copy in the "alert" popup - https://phabricator.wikimedia.org/T368611 [15:02:02] T369489: Enhance the auto-vanish maintenance script - https://phabricator.wikimedia.org/T369489 [15:02:02] T370059: Auto-vanishing failing with error InvalidArgumentException: DB connection domain 'loginwiki' does not match 'metawiki' - https://phabricator.wikimedia.org/T370059 [15:02:03] T369141: Setup live configuration for account vanishing - https://phabricator.wikimedia.org/T369141 [15:02:13] poor bot :D [15:02:19] !log jelto@cumin1002 START - Cookbook 
sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:02:27] (03CR) 10JMeybohm: otelcol: Stop hardcoding k8s master IP addresses (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [15:02:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:03:35] hashar: lol [15:03:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [15:04:12] !log brennen@deploy1002 Started deploy [phabricator/deployment@7335128]: test deploy phab2002 for T370109 [15:04:15] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:04:16] T370109: Deploy Phabricator/Phorge 2024-07-16 - https://phabricator.wikimedia.org/T370109 [15:04:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:04:46] !log brennen@deploy1002 Finished deploy [phabricator/deployment@7335128]: test deploy phab2002 for T370109 (duration: 00m 34s) [15:05:17] !log brennen@deploy1002 Started deploy [phabricator/deployment@7335128]: deploy phab1004 for T370109 [15:05:50] !log silence OtelCollectorRefusedSpans in codfw for 7d [15:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:58] !log silence OtelCollectorRefusedSpans in codfw for 7d - T370043 [15:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:09] !log brennen@deploy1002 Finished deploy [phabricator/deployment@7335128]: deploy phab1004 for T370109 (duration: 00m 52s) [15:06:24] godog: [15:06:26] 
sukhe@re0.cr2-codfw> show bgp summary | match 10.192.16.35 [15:06:26] 10.192.16.35 64605 0 0 0 25 12:29 Connect [15:06:48] so eqiad was durum1001 (me) and you were right about centrallog2002, just as an FYI for awareness [15:06:53] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-f2-eqiad,lsw1-f2-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f2-eqiad [15:07:03] sukhe: hah! thank you, makes sense [15:07:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66638 and previous config saved to /var/cache/conftool/dbconfig/20240716-150704-root.json [15:07:09] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-f2-eqiad,lsw1-f2-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f2-eqiad [15:07:16] (03CR) 10Effie Mouzeli: "diff looks ok" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [15:07:22] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9986058 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=81c0aaa1-44d2-4d05-942a-66bcdfb90d2d) set by cmooney@... 
[15:07:31] (03CR) 10Effie Mouzeli: [C:03+1] otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [15:07:51] (03PS5) 10CDanis: otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) [15:07:58] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 21 hosts with reason: JunOS upgrade lsw1-f2-eqiad [15:08:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 21 hosts with reason: JunOS upgrade lsw1-f2-eqiad [15:08:26] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9986071 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=58bc700a-b84d-4058-9776-9f6510239089) set by cmooney@... 
[15:08:32] !log Rebooting lsw1-f2-eqiad to complete JunOS upgrade T365997 [15:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:35] T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997 [15:09:52] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum1001 is OK: OK: UP (pid=208272) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:09:58] (03CR) 10DCausse: [C:04-1] "image & kafka topics not yet ready" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054584 (https://phabricator.wikimedia.org/T361935) (owner: 10DCausse) [15:10:26] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:10:28] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:10:48] RECOVERY - Bird Internet Routing Daemon on durum1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:15:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T367781)', diff saved to https://phabricator.wikimedia.org/P66640 and previous config saved to /var/cache/conftool/dbconfig/20240716-151516-arnaudb.json [15:15:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [15:15:20] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:15:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [15:15:36] FIRING: [4x] ProbeDown: Service aqs1021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:19:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:20:26] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1054398 (owner: 10Dzahn) [15:21:21] (03CR) 10Dzahn: [C:03+2] gerrit: switch firewall provider to nftables at role level [puppet] - 10https://gerrit.wikimedia.org/r/1054398 (owner: 10Dzahn) [15:22:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66641 and previous config saved to /var/cache/conftool/dbconfig/20240716-152209-root.json [15:23:03] FIRING: [2x] KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:23:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2127.codfw.wmnet with reason: Maintenance [15:23:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2127.codfw.wmnet with reason: Maintenance [15:23:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2127 (T367781)', diff saved to https://phabricator.wikimedia.org/P66642 and previous config saved to /var/cache/conftool/dbconfig/20240716-152349-arnaudb.json [15:23:53] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:25:36] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:25:40] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:25:50] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 495, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:25:50] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 577, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:25:50] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:25:50] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:26:31] (03CR) 10Jelto: [V:03+1 C:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3250/console" [puppet] - 10https://gerrit.wikimedia.org/r/1054398 (owner: 10Dzahn) [15:26:32] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9986154 (10cmooney) Upgrade completed, all hosts back online and pinging ok. Thanks all for the assistance! 
[15:26:46] jouncebot: nowandnext [15:26:47] For the next 0 hour(s) and 33 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1500) [15:26:47] In 0 hour(s) and 33 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1600) [15:26:51] sukhe: centrallog2002 is back btw [15:26:55] (03CR) 10Jelto: [V:03+1 C:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1054398 (owner: 10Dzahn) [15:26:56] hence the recovery [15:27:03] godog: ok! [15:27:04] thanks! [15:27:12] !log Uncordoning kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet - T365997 [15:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:15] T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997 [15:27:21] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=(kubernetes1062.eqiad.wmnet|mw1494.eqiad.wmnet|mw1495.eqiad.wmnet),cluster=kubernetes,service=kubesvc [15:28:03] RESOLVED: [2x] KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:28:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 5%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66643 and previous config saved to /var/cache/conftool/dbconfig/20240716-152855-arnaudb.json [15:29:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 5%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66644 and previous config saved to /var/cache/conftool/dbconfig/20240716-152910-arnaudb.json [15:29:19] RESOLVED: [4x] ProbeDown: Service aqs1021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:29:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 5%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66645 and previous config saved to /var/cache/conftool/dbconfig/20240716-152918-arnaudb.json [15:29:21] (03PS1) 10Brennen Bearnes: logspam.pl: s/interests/interest/ [puppet] - 10https://gerrit.wikimedia.org/r/1054589 [15:29:38] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054582 (https://phabricator.wikimedia.org/T368010) (owner: 10DCausse) [15:30:05] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9986188 (10ABran-WMF) dbstore1009 has replication up to date on all 3 instances all 3 other nodes are repooling ↑ [15:30:33] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054582 (https://phabricator.wikimedia.org/T368010) (owner: 10DCausse) [15:30:53] (03Abandoned) 10Brennen Bearnes: logspam.pl: s/interests/interest/ [puppet] - 10https://gerrit.wikimedia.org/r/1054589 (owner: 10Brennen Bearnes) [15:31:47] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9986200 (10MatthewVernon) Swift looks good, thanks. 
[15:31:55] (03PS6) 10CDanis: otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) [15:31:55] (03PS1) 10CDanis: Fix opentelemetry-collector chart CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054594 (https://phabricator.wikimedia.org/T365855) [15:32:27] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:32:32] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:33:19] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.004e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:33:39] 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9986225 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done! We went with the procedure I suggested above, namely I took the host side configuration by logging back in via console... 
[15:34:51] (03CR) 10Ottomata: [C:03+1] eventbus: enable instrumentation on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [15:35:03] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:36:20] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:37:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66646 and previous config saved to /var/cache/conftool/dbconfig/20240716-153715-root.json [15:37:35] !log reboot fpc0 on fasw-c-codfw.mgmt.codfw.wmnet [15:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:39] PROBLEM - Druid coordinator on druid1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:39:08] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:39:22] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:39:48] (03CR) 10Clément Goubert: [C:03+2] parsoid testing: Switch api_proxy_uri [puppet] - 10https://gerrit.wikimedia.org/r/1053651 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [15:41:10] (03CR) 10JMeybohm: [C:03+1] otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [15:41:20] (03CR) 10JMeybohm: [C:03+1] Fix 
opentelemetry-collector chart CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054594 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [15:41:33] (03CR) 10CDanis: [C:03+2] Fix opentelemetry-collector chart CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054594 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [15:41:38] (03CR) 10CDanis: [C:03+2] otelcol: Stop hardcoding k8s master IP addresses (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [15:41:41] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 50, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:41:47] PROBLEM - Juniper virtual chassis ports on fasw-c-codfw is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [15:43:43] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:43:47] RECOVERY - Juniper virtual chassis ports on fasw-c-codfw is OK: OK: UP: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [15:44:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 10%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66647 and previous config saved to /var/cache/conftool/dbconfig/20240716-154401-arnaudb.json [15:44:09] T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997 [15:44:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 10%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66648 and previous config saved to 
/var/cache/conftool/dbconfig/20240716-154415-arnaudb.json [15:44:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66649 and previous config saved to /var/cache/conftool/dbconfig/20240716-154424-arnaudb.json [15:44:38] (03Merged) 10jenkins-bot: Fix opentelemetry-collector chart CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054594 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [15:44:58] (03Merged) 10jenkins-bot: otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [15:45:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T367781)', diff saved to https://phabricator.wikimedia.org/P66650 and previous config saved to /var/cache/conftool/dbconfig/20240716-154537-arnaudb.json [15:45:42] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:48:46] FIRING: Emergency syslog message: Alert for device fasw-c-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:49:10] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#9986373 (10Papaul) ` papaul@fasw-c-codfw# run show interfaces ge-[0-1]/0/17 descriptions Interface Admin Link Description ge-0/0/17 up up... [15:49:23] (03PS1) 10Lucas Werkmeister (WMDE): systemd::timer::job: Use TimeoutStartSec= [puppet] - 10https://gerrit.wikimedia.org/r/1054603 (https://phabricator.wikimedia.org/T370171) [15:51:27] (03CR) 10Lucas Werkmeister (WMDE): "CCing Bryan who added this in I7312a6130b. 
I opted not to rename the `max_runtime_seconds` parameter, as it already didn’t 100% match the " [puppet] - 10https://gerrit.wikimedia.org/r/1054603 (https://phabricator.wikimedia.org/T370171) (owner: 10Lucas Werkmeister (WMDE)) [15:52:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66651 and previous config saved to /var/cache/conftool/dbconfig/20240716-155221-root.json [15:53:41] (03Abandoned) 10JMeybohm: Add kyverno_policy_parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [15:53:46] RESOLVED: Emergency syslog message: Device fasw-c-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:58:19] !log uploaded spicerack_8.7.0 to apt.wikimedia.org bullseye-wikimedia [15:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:39] RECOVERY - Druid coordinator on druid1011 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:59:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 25%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66652 and previous config saved to /var/cache/conftool/dbconfig/20240716-155905-arnaudb.json [15:59:10] T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997 [15:59:19] (03CR) 10Lucas Werkmeister (WMDE): "I suppose this is a somewhat risky change… several services (`git grep max_runtime_seconds`) which previously declared a max runtime but d" [puppet] - 10https://gerrit.wikimedia.org/r/1054603 (https://phabricator.wikimedia.org/T370171) (owner: 10Lucas Werkmeister (WMDE)) [15:59:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 
(re)pooling @ 25%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66653 and previous config saved to /var/cache/conftool/dbconfig/20240716-155920-arnaudb.json [15:59:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66654 and previous config saved to /var/cache/conftool/dbconfig/20240716-155930-arnaudb.json [16:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P66655 and previous config saved to /var/cache/conftool/dbconfig/20240716-160044-arnaudb.json [16:02:55] (03PS1) 10Aklapper: Phabricator: Update recipients of quarterly metrics mail [puppet] - 10https://gerrit.wikimedia.org/r/1054605 (https://phabricator.wikimedia.org/T370167) [16:04:56] (03PS1) 10Clément Goubert: parsoid::testing: remove unused file [puppet] - 10https://gerrit.wikimedia.org/r/1054607 [16:05:22] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054607 (owner: 10Clément Goubert) [16:05:35] (03PS1) 10Mforns: commons-impact-analytics: bump image to v1.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054609 (https://phabricator.wikimedia.org/T369745) [16:05:40] (03PS1) 10Dzahn: lists: ensure list member sync only happens on the active server [puppet] - 10https://gerrit.wikimedia.org/r/1054610 (https://phabricator.wikimedia.org/T351202) [16:09:01] (03CR) 10Hashar: [C:03+1] "That is a great idea yes!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [16:11:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:11:52] (03CR) 10Scott French: [C:03+1] parsoid::testing: remove unused file [puppet] - 10https://gerrit.wikimedia.org/r/1054607 (owner: 10Clément Goubert) [16:13:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [16:13:25] +franio2001 1H IN A 10.195.0.99 [16:13:28] +franio2002 1H IN A 10.195.0.100 [16:13:31] +franio2003 1H IN A 10.195.0.101 [16:13:34] pending DNS changes ^ [16:13:39] (03PS5) 10Jdlrobson: [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) [16:14:10] does anyone know who is working on these? 
[16:14:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 50%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66656 and previous config saved to /var/cache/conftool/dbconfig/20240716-161411-arnaudb.json [16:14:16] T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997 [16:14:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66657 and previous config saved to /var/cache/conftool/dbconfig/20240716-161426-arnaudb.json [16:14:29] (03CR) 10CI reject: [V:04-1] [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [16:14:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66658 and previous config saved to /var/cache/conftool/dbconfig/20240716-161435-arnaudb.json [16:15:00] JennH: sorry, are you working on franio? [16:15:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P66659 and previous config saved to /var/cache/conftool/dbconfig/20240716-161552-arnaudb.json [16:16:32] Sukhe: not at the moment. I did set up the mgmt ips for them earlier. [16:16:49] ah OK that might be it then [16:17:14] I am going to merge the changes then if that's OK? 
because this will block any other DNS changes from being merged [16:18:18] (03CR) 10BryanDavis: [C:03+1] "I wonder if the "Note that this setting does not have any effect on Type=oneshot services, as they terminate immediately after activation " [puppet] - 10https://gerrit.wikimedia.org/r/1054603 (https://phabricator.wikimedia.org/T370171) (owner: 10Lucas Werkmeister (WMDE)) [16:18:32] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [16:19:45] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9986547 (10elukey) a:03elukey [16:20:46] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge DNS franio changes (add mgmt IPs) - sukhe@cumin1002" [16:21:42] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge DNS franio changes (add mgmt IPs) - sukhe@cumin1002" [16:21:42] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:21:47] JennH: merged. thanks! [16:23:06] JennH: what happened here was that we made changes in Netbox but we didn't run the cookbook (cookbook sre.dns.netbox) and hence the changes were pending [16:23:34] Oops my bad ty for getting that!
[16:24:15] np at all [16:26:11] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:29:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 75%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66660 and previous config saved to /var/cache/conftool/dbconfig/20240716-162916-arnaudb.json [16:29:24] T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997 [16:29:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66661 and previous config saved to /var/cache/conftool/dbconfig/20240716-162931-arnaudb.json [16:29:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66662 and previous config saved to /var/cache/conftool/dbconfig/20240716-162940-arnaudb.json [16:30:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T367781)', diff saved to https://phabricator.wikimedia.org/P66663 and previous config saved to /var/cache/conftool/dbconfig/20240716-163059-arnaudb.json [16:31:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:31:03] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [16:31:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:32:04] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091#9986638 (10Quiddity) I've read and signed the L3, and read the 
Responsibilities document. Thanks. [16:39:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbproxy2006.codfw.wmnet with OS bookworm [16:39:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9986718 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2006.codfw.wmnet with OS bookworm [16:41:34] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1054617 [16:42:42] (03PS6) 10Jdlrobson: [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) [16:42:49] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#9986734 (10Papaul) [16:43:20] (03CR) 10CI reject: [V:04-1] [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [16:44:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 100%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66664 and previous config saved to /var/cache/conftool/dbconfig/20240716-164422-arnaudb.json [16:44:26] T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997 [16:44:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66665 and previous config saved to /var/cache/conftool/dbconfig/20240716-164437-arnaudb.json [16:44:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66666 and previous config saved to 
/var/cache/conftool/dbconfig/20240716-164446-arnaudb.json [16:46:13] (03PS7) 10Jdlrobson: [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) [16:46:51] (03CR) 10CI reject: [V:04-1] [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [16:47:19] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 7771 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [16:47:33] (03CR) 10Clément Goubert: [C:03+2] verp_bounce_post_url: Switch to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1053650 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [16:48:43] (03PS8) 10Jdlrobson: [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) [16:50:08] (03PS1) 10Elukey: sre.network.tls: use a different client certificate to authenticate [cookbooks] - 10https://gerrit.wikimedia.org/r/1054618 (https://phabricator.wikimedia.org/T355750) [16:50:30] (03CR) 10Dzahn: [V:03+1 C:03+2] "needs follow-up to ensure it does NOT also run on the failover host: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054610" [puppet] - 10https://gerrit.wikimedia.org/r/1053399 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [16:51:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:51:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:51:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T367781)', diff saved to https://phabricator.wikimedia.org/P66667 and previous config saved to /var/cache/conftool/dbconfig/20240716-165135-arnaudb.json [16:51:39] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [16:51:51] (03CR) 10Elukey: "Folks I added more people as pebkac prevention scheme. This seems to work from a manual test on cumin1002, but lemme know if I got it wron" [cookbooks] - 10https://gerrit.wikimedia.org/r/1054618 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [16:53:32] (03CR) 10Dzahn: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1054610/3251/lists2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1054610 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [16:53:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2006.codfw.wmnet with reason: host reimage [16:53:52] (03CR) 10Dzahn: [C:03+2] "disables timers on lists2001" [puppet] - 10https://gerrit.wikimedia.org/r/1054610 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [16:56:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2006.codfw.wmnet with reason: host reimage [16:57:31] (03CR) 10EoghanGaffney: [C:03+1] lists: ensure list member sync only happens on the active server [puppet] - 10https://gerrit.wikimedia.org/r/1054610 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [17:00:04] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1700). 
[17:00:10] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091#9986828 (10Ottomata) Approved! [17:00:35] !log lists2001 - systemctl reset-failed after gerrit:1054610 to fix T370098 [17:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:38] T370098: SystemdUnitFailed - lists2001 - sync-list-members - https://phabricator.wikimedia.org/T370098 [17:02:53] (03CR) 10Dzahn: "service is already effectively disabled now since yesterday at DNS level - i'm just going to wait a bit before merging these" [puppet] - 10https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [17:03:04] here - still confirming a couple of remaining items before proceeding [17:03:24] (03CR) 10Dzahn: [C:03+2] Phabricator: Update recipients of quarterly metrics mail [puppet] - 10https://gerrit.wikimedia.org/r/1054605 (https://phabricator.wikimedia.org/T370167) (owner: 10Aklapper) [17:06:05] marostegui: Yes sorry I had a long-running script on euwiki, I'll see how far it got and decide whether to restart it. 
I was running these scripts on a per-wiki basis but apparently that isn't enough because some wikis are large enough that it takes multiple days [17:12:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T367781)', diff saved to https://phabricator.wikimedia.org/P66668 and previous config saved to /var/cache/conftool/dbconfig/20240716-171220-arnaudb.json [17:12:25] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:12:49] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:14:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:14:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2006.codfw.wmnet with OS bookworm [17:14:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9986905 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2006.codfw.wmnet with OS bookworm completed: - dbproxy... 
[17:15:48] (03CR) 10Dzahn: [C:04-1] "not decom'ed yet" [puppet] - 10https://gerrit.wikimedia.org/r/1053791 (https://phabricator.wikimedia.org/T363402) (owner: 10Dzahn) [17:18:01] (03CR) 10Dzahn: [C:03+2] wdqs graph split: route / to miscweb microsite [puppet] - 10https://gerrit.wikimedia.org/r/1053756 (https://phabricator.wikimedia.org/T364367) (owner: 10Ryan Kemper) [17:19:19] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:27:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P66669 and previous config saved to /var/cache/conftool/dbconfig/20240716-172727-arnaudb.json [17:28:50] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#9986950 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt those are ready for OS install [17:28:56] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9986955 (10Papaul) [17:33:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:36:13] (03PS1) 10Ottomata: Update refinery_version for canary_events, test refine, and test refine_sanitize.pp [puppet] - 10https://gerrit.wikimedia.org/r/1054623 (https://phabricator.wikimedia.org/T367949) [17:37:45] update - I'm going to proceed with a subset of the planned depools while the remaining analytics workload is investigated [17:38:40] (03CR) 10Ottomata: [C:03+2] Update refinery_version for canary_events, test 
refine, and test refine_sanitize.pp [puppet] - 10https://gerrit.wikimedia.org/r/1054623 (https://phabricator.wikimedia.org/T367949) (owner: 10Ottomata) [17:39:40] (03PS2) 10Ottomata: Update refinery_version for canary_events, test refine and refine_sanitize [puppet] - 10https://gerrit.wikimedia.org/r/1054623 (https://phabricator.wikimedia.org/T367949) [17:39:40] !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=appservers-ro,name=codfw [reason: Depooling ahead of turndown - T367949] [17:39:44] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [17:39:47] (03CR) 10CI reject: [V:04-1] Update refinery_version for canary_events, test refine and refine_sanitize [puppet] - 10https://gerrit.wikimedia.org/r/1054623 (https://phabricator.wikimedia.org/T367949) (owner: 10Ottomata) [17:40:06] (03CR) 10Ottomata: [V:03+2 C:03+2] Update refinery_version for canary_events, test refine and refine_sanitize [puppet] - 10https://gerrit.wikimedia.org/r/1054623 (https://phabricator.wikimedia.org/T367949) (owner: 10Ottomata) [17:40:11] !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=api-ro,name=codfw [reason: Depooling ahead of turndown - T367949] [17:42:03] (03PS1) 10Dzahn: Revert^2 "wdqs: microsites for wdqs graph split" [puppet] - 10https://gerrit.wikimedia.org/r/1054624 [17:42:14] (03PS1) 10Tchanders: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) [17:42:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P66670 and previous config saved to /var/cache/conftool/dbconfig/20240716-174235-arnaudb.json [17:43:37] (03CR) 10Ryan Kemper: [C:03+1] Revert^2 "wdqs: microsites for wdqs graph split" [puppet] - 10https://gerrit.wikimedia.org/r/1054624 (owner: 10Dzahn) [17:43:58] !log 
swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=appservers-rw,name=eqiad [reason: Depooling ahead of turndown - T367949] [17:44:15] !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=api-rw,name=eqiad [reason: Depooling ahead of turndown - T367949] [17:44:15] !log otto@deploy1002 Started deploy [analytics/refinery@f97900c] (hadoop-test): Deploy refinery with refinery-source version 0.2.44 for mw on k8s - TEST [analytics/refinery@f97900c9] [17:45:18] RoanKattouw: no worries, thanks for letting me know :) [17:45:56] (03CR) 10Dzahn: [C:03+2] Revert^2 "wdqs: microsites for wdqs graph split" [puppet] - 10https://gerrit.wikimedia.org/r/1054624 (owner: 10Dzahn) [17:46:11] !log appservers-rw and api-rw now resolve to failoid - T367949 [17:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:15] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [17:47:39] !log otto@deploy1002 Finished deploy [analytics/refinery@f97900c] (hadoop-test): Deploy refinery with refinery-source version 0.2.44 for mw on k8s - TEST [analytics/refinery@f97900c9] (duration: 03m 23s) [17:47:39] !log otto@deploy1002 Started deploy [analytics/refinery@f97900c]: Deploy refinery with refinery-source version 0.2.44 for mw on k8s [analytics/refinery@f97900c9] [17:53:14] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [17:53:14] (03PS1) 10Ottomata: Update refinery_version for refine and refine_sanitize [puppet] - 10https://gerrit.wikimedia.org/r/1054629 (https://phabricator.wikimedia.org/T367949) [17:55:20] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[17:55:35] !log otto@deploy1002 Finished deploy [analytics/refinery@f97900c]: Deploy refinery with refinery-source version 0.2.44 for mw on k8s [analytics/refinery@f97900c9] (duration: 08m 33s) [17:55:43] !log otto@deploy1002 Started deploy [analytics/refinery@f97900c]: Deploy refinery with refinery-source version 0.2.44 for mw on k8s - take 2 [analytics/refinery@f97900c9] [17:57:18] (03CR) 10Dzahn: [C:03+2] "@Steve your change is now effectively deployed (reverted the revert). and both new sites show the SPARQL input form. seems to all work fin" [puppet] - 10https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367) (owner: 10Stevemunene) [17:57:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T367781)', diff saved to https://phabricator.wikimedia.org/P66671 and previous config saved to /var/cache/conftool/dbconfig/20240716-175742-arnaudb.json [17:57:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:57:46] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:57:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:58:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:58:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:58:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T367781)', diff saved to https://phabricator.wikimedia.org/P66672 and previous config saved to /var/cache/conftool/dbconfig/20240716-175820-arnaudb.json [17:58:28] !log otto@deploy1002 Finished deploy [analytics/refinery@f97900c]: Deploy refinery with refinery-source version 0.2.44 for mw on k8s - take 2 [analytics/refinery@f97900c9] 
(duration: 02m 44s) [17:58:32] !log otto@deploy1002 Started deploy [analytics/refinery@f97900c]: Deploy refinery with refinery-source version 0.2.44 for mw on k8s - take 3 [analytics/refinery@f97900c9] [17:59:19] !log otto@deploy1002 Finished deploy [analytics/refinery@f97900c]: Deploy refinery with refinery-source version 0.2.44 for mw on k8s - take 3 [analytics/refinery@f97900c9] (duration: 00m 47s) [18:00:04] dancy and andre: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T1800). [18:00:12] o/ [18:00:19] fallback o/ [18:00:29] Andre! [18:00:46] Upgrading scap first... [18:00:50] no no, I'm just a bot account, I swear! :) [18:00:50] !log dancy@deploy1002 Installing scap version "4.92.0" for 232 hosts [18:01:55] (03CR) 10Kosta Harlan: Enable temporary accounts on testwiki and loginwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [18:02:00] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054630 (https://phabricator.wikimedia.org/T366959) [18:02:02] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054630 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [18:02:45] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054630 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [18:04:58] Hmm... docker_pull_k8s is hanging on 26 nodes. [18:05:47] ah, there it goes.. That was weird. [18:06:53] hmm. something's not right.
[18:08:04] (03CR) 10Ottomata: [C:03+2] Update refinery_version for refine and refine_sanitize [puppet] - 10https://gerrit.wikimedia.org/r/1054629 (https://phabricator.wikimedia.org/T367949) (owner: 10Ottomata) [18:09:45] (03PS13) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [18:11:19] (03PS1) 10Ottomata: Disable produce_canary_events systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1054633 (https://phabricator.wikimedia.org/T370186) [18:12:03] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1054633 (https://phabricator.wikimedia.org/T370186) (owner: 10Ottomata) [18:12:07] (03CR) 10Tchanders: Enable temporary accounts on testwiki and loginwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [18:14:10] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.14 refs T366959 [18:14:14] T366959: 1.43.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T366959 [18:15:45] (03CR) 10Ottomata: [C:03+2] Disable produce_canary_events systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1054633 (https://phabricator.wikimedia.org/T370186) (owner: 10Ottomata) [18:16:04] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9987115 (10Papaul) [18:16:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9987104 (10Papaul) [18:17:55] (03PS1) 10Pppery: Add extra date elements for arcanist [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1054634 (https://phabricator.wikimedia.org/T363188) [18:17:57] (03PS1) 10Pppery: Update source strings for 2024.19 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1054635 (https://phabricator.wikimedia.org/T363188) [18:19:23] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T367781)', diff saved to https://phabricator.wikimedia.org/P66674 and previous config saved to /var/cache/conftool/dbconfig/20240716-181942-arnaudb.json [18:19:47] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:21:47] (03PS1) 10Ottomata: refinery - Remove produce_canary_events code [puppet] - 10https://gerrit.wikimedia.org/r/1054636 (https://phabricator.wikimedia.org/T370186) [18:22:18] (03PS1) 10CDanis: otelcol: use proper Calico selector syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054637 (https://phabricator.wikimedia.org/T365855) [18:26:22] (03CR) 10Ahmon Dancy: git: remove umask from git::clone (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [18:27:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbproxy2007.codfw.wmnet with OS bookworm [18:27:37] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9987166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2007.codfw.wmnet with OS bookworm [18:30:14] (03PS2) 10Pppery: Add extra date elements for arcanist [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1054634 (https://phabricator.wikimedia.org/T363188) [18:32:28] (03PS3) 10Pppery: Add extra date elements for arcanist 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1054634 (https://phabricator.wikimedia.org/T363188) [18:34:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P66675 and previous config saved to /var/cache/conftool/dbconfig/20240716-183449-arnaudb.json [18:37:29] (03PS4) 10Pppery: Add extra date elements for arcanist [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1054634 (https://phabricator.wikimedia.org/T363188) [18:38:26] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9987283 (10brennen) [18:39:57] (03PS1) 10Ebrahim: Enable ICU provided alphabetical order in the Kurdish wiki categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054641 (https://phabricator.wikimedia.org/T48235) [18:42:28] (03PS2) 10Ebrahim: Enable ICU provided alphabetical order in the Kurdish wikis categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054641 (https://phabricator.wikimedia.org/T48235) [18:43:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054558 (owner: 10Michael Große) [18:43:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054553 (https://phabricator.wikimedia.org/T368606) (owner: 10Michael Große) [18:44:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 16 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1054554 (https://phabricator.wikimedia.org/T368606) (owner: 10Michael Große) [18:45:53] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dbproxy2007.codfw.wmnet with OS bookworm [18:46:00] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9987340 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2007.codfw.wmnet with OS bookworm executed with errors... [18:46:00] (03PS9) 10Kimberly Sarabia: [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [18:47:16] (03CR) 10CDanis: [C:03+2] otelcol: use proper Calico selector syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054637 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [18:49:34] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:49:46] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:49:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P66677 and previous config saved to /var/cache/conftool/dbconfig/20240716-184956-arnaudb.json [18:50:47] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [18:51:11] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[18:51:15] (03PS3) 10Bking: team-search-platform: migrate cirrus latencies & mem alert [alerts] - 10https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033) (owner: 10DCausse) [18:52:51] (03PS4) 10Bking: team-search-platform: migrate cirrus latencies & mem alert [alerts] - 10https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033) (owner: 10DCausse) [18:52:54] (03CR) 10Kosta Harlan: Enable temporary accounts on testwiki and loginwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [18:53:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091#9987384 (10KStineRowe_WMF) approved [18:54:20] (03PS10) 10Kimberly Sarabia: [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [18:56:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [18:56:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [18:56:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2140 (T367856)', diff saved to https://phabricator.wikimedia.org/P66678 and previous config saved to /var/cache/conftool/dbconfig/20240716-185657-marostegui.json [18:57:02] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [18:57:43] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9987404 (10Papaul) [18:57:44] (03PS11) 10Jdlrobson: [July 16th] Enable dark mode for logged out users (tier 1) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) [18:57:48] (03CR) 10Jdlrobson: [C:03+1] [July 16th] Enable dark mode for logged out users (tier 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [19:00:23] (03PS1) 10Cwhite: admin: remove unused ssh key for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1054645 [19:01:12] (03PS1) 10Dzahn: delete integration.mediawiki.org [dns] - 10https://gerrit.wikimedia.org/r/1054646 (https://phabricator.wikimedia.org/T361250) [19:01:38] (03PS1) 10Bking: elasticsearch: remove obsolete alerts [puppet] - 10https://gerrit.wikimedia.org/r/1054647 (https://phabricator.wikimedia.org/T359033) [19:02:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054647 (https://phabricator.wikimedia.org/T359033) (owner: 10Bking) [19:05:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T367781)', diff saved to https://phabricator.wikimedia.org/P66679 and previous config saved to /var/cache/conftool/dbconfig/20240716-190504-arnaudb.json [19:05:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:05:10] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:05:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:05:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T367781)', diff saved to https://phabricator.wikimedia.org/P66680 and previous config saved to /var/cache/conftool/dbconfig/20240716-190526-arnaudb.json [19:06:58] (03PS2) 10Bking: elasticsearch: remove obsolete alerts [puppet] - 10https://gerrit.wikimedia.org/r/1054647 (https://phabricator.wikimedia.org/T359033) [19:07:39] !log pt1979@cumin2002 START - 
Cookbook sre.hosts.reimage for host dbproxy2008.codfw.wmnet with OS bookworm [19:07:50] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9987455 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm [19:08:55] (03PS1) 10Herron: wip [alerts] - 10https://gerrit.wikimedia.org/r/1054649 [19:09:13] (03PS1) 10CDanis: otelcol: use proper Calico selector syntax part2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054650 (https://phabricator.wikimedia.org/T365855) [19:09:38] (03PS2) 10CDanis: otelcol: use proper Calico selector syntax part2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054650 (https://phabricator.wikimedia.org/T365855) [19:11:32] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054647 (https://phabricator.wikimedia.org/T359033) (owner: 10Bking) [19:12:58] (03CR) 10CDanis: [C:03+2] otelcol: use proper Calico selector syntax part2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054650 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [19:14:01] (03CR) 10JHathaway: [C:03+1] "I would add this info to the commit message, otherwise looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1054427 (owner: 10Slyngshede) [19:15:36] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:17:10] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:18:29] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [19:18:37] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[19:22:11] (03CR) 10AOkoth: [C:03+2] vrts: fix proxy for download [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [19:24:42] !log depooling appservers-ro in eqiad, which is not used by remaining analytics workloads - T367949 [19:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:48] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [19:25:24] !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=appservers-ro,name=eqiad [reason: Depooling ahead of turndown - T367949] [19:25:36] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:26:02] (03Merged) 10jenkins-bot: vrts: fix proxy for download [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [19:26:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T367781)', diff saved to https://phabricator.wikimedia.org/P66681 and previous config saved to /var/cache/conftool/dbconfig/20240716-192610-arnaudb.json [19:26:14] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:28:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1098-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [19:29:23] (03CR) 10JHathaway: [C:03+1] sre.network.tls: use a different client certificate to 
authenticate [cookbooks] - 10https://gerrit.wikimedia.org/r/1054618 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [19:30:08] (03CR) 10JHathaway: [C:03+1] C:idm configure 2FA proxy endpoint. [puppet] - 10https://gerrit.wikimedia.org/r/1054502 (owner: 10Slyngshede) [19:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:40:42] (03CR) 10CDanis: [C:03+1] "thank you, nice digging" [cookbooks] - 10https://gerrit.wikimedia.org/r/1054618 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [19:41:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P66682 and previous config saved to /var/cache/conftool/dbconfig/20240716-194117-arnaudb.json [19:43:49] (03CR) 10Ebrahim: [July 16th] Enable dark mode for logged out users (tier 1) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [19:56:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P66683 and previous config saved to /var/cache/conftool/dbconfig/20240716-195624-arnaudb.json [19:56:32] (03CR) 10Ebrahim: [July 16th] Enable dark mode for logged out users (tier 1) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T2000). 
[20:00:04] Seawolf35, jdlrobson, and MichaelG_WMF: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] i can deploy today [20:00:20] Seawolf35: Jdlrobson: MichaelG_WMF: around? [20:00:34] Hey. I'm deploying for Jobn [20:00:36] Around :) [20:00:39] Jon* [20:00:50] kimberly_sarabia: ack, i'll ping you once i'm done with the other two patches then? [20:00:54] unless you wanna drive the window [20:00:56] Here, but I am on a phone so not able to debug or anything. [20:01:17] yep ping me whenever [20:01:20] sounds good [20:01:48] Though I don’t think my patch should break anything spectacularly [20:02:08] probably not :). i can test for you, it's a change i asked for anyway :)) [20:02:12] (thanks for the patch!) [20:02:29] (03CR) 10Urbanecm: [C:03+2] Ensure every test-config has valid defaults [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054558 (owner: 10Michael Große) [20:02:38] (03CR) 10Urbanecm: [C:03+2] Merge partial config with defaults [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054553 (https://phabricator.wikimedia.org/T368606) (owner: 10Michael Große) [20:03:05] (03CR) 10Urbanecm: [C:03+2] Merge partial config with defaults [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1054554 (https://phabricator.wikimedia.org/T368606) (owner: 10Michael Große) [20:03:12] (03PS7) 10Seawolf35gerrit: foundationwiki: Restrict `unfuzzy` right to autoconfirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) [20:03:12] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: 10Seawolf35gerrit) [20:03:22] (03CR) 10Urbanecm: [C:03+2] foundationwiki: Restrict `unfuzzy` right to autoconfirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: 10Seawolf35gerrit) [20:03:45] Not exactly new, just lost access to my last gerrit account [20:04:06] (03Merged) 10jenkins-bot: foundationwiki: Restrict `unfuzzy` right to autoconfirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: 10Seawolf35gerrit) [20:04:21] Seawolf35: that's unfortunate :-( . email reset didn't work? [20:04:49] Uh, lost the email, that’s why I lost access after I forgot the password [20:05:11] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1054025|foundationwiki: Restrict `unfuzzy` right to autoconfirmed users (T369979)]] [20:05:16] T369979: foundationwiki: Restrict `unfuzzy` right to autoconfirmed users - https://phabricator.wikimedia.org/T369979 [20:05:27] that wasn't all tho :) [20:05:35] welcome back Seawolf35! [20:05:57] Phone problems [20:06:02] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T369829#9987783 (10Jclark-ctr) a:03Jclark-ctr [20:06:15] Every time I turn my phone off it disconnects me [20:06:23] Seawolf35: btw, I don't see why you can't be on the CI whitelist. I believe the standard is basically won't upload malicious stuff. [20:06:29] Seawolf35: ye irc does that [20:08:22] RhinosF1: would be nice to be on the CI whitelist, certainly more convenient than waiting for CI to decide it wants to look at my code.
[20:08:39] I'm proposing a patch [20:09:08] !log urbanecm@deploy1002 seawolf35gerrit, urbanecm: Backport for [[gerrit:1054025|foundationwiki: Restrict `unfuzzy` right to autoconfirmed users (T369979)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:09:22] Seawolf35: patch's on mwdebug :) [20:09:23] looking [20:09:50] does the trick [20:09:51] !log urbanecm@deploy1002 seawolf35gerrit, urbanecm: Continuing with sync [20:10:03] RhinosF1 My gerrit account is Seawolf35gerrit, not Seawolf35 just so you know [20:10:09] Seawolf35: https://gerrit.wikimedia.org/r/c/integration/config/+/1054657 [20:10:14] (03CR) 10Dreamy Jazz: [C:04-1] "The use of `wmgEnableIPMasking` is in `CommonSettings-labs.php`, which means that this will have no effect for production wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [20:10:17] I found you easy [20:10:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9987799 (10Jclark-ctr) @VRiley-WMF if you can update with 2nd network connection then hand over to @cmooney [20:10:48] Has.har will probably deploy it tomorrow [20:10:57] It's after 10 for him [20:10:59] Thanks! 
[20:11:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T367781)', diff saved to https://phabricator.wikimedia.org/P66684 and previous config saved to /var/cache/conftool/dbconfig/20240716-201131-arnaudb.json [20:11:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2190.codfw.wmnet with reason: Maintenance [20:11:37] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [20:11:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2190.codfw.wmnet with reason: Maintenance [20:11:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T367781)', diff saved to https://phabricator.wikimedia.org/P66685 and previous config saved to /var/cache/conftool/dbconfig/20240716-201153-arnaudb.json [20:12:10] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=appservers-ro,name=eqiad [reason: Repooling to concentrate clients in eqiad - T367949] [20:12:13] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [20:13:42] Are the wmf- queues slower, or is CommunityConfiguration usually at 25 minutes and I only never noticed? [20:14:00] urbanecm: probably something for you after the window ^ [20:14:20] MichaelG_WMF: gate-and-submit should be more or less the same speed for everything. it runs tests for (most) extensions. [20:14:34] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9987802 (10Scott_French) Current status: * appservers-rw and api-rw are depooled everywhere, and resolve to failoid as of 17:45 UTC * api-ro is serving only... 
[20:14:42] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1054025|foundationwiki: Restrict `unfuzzy` right to autoconfirmed users (T369979)]] (duration: 09m 31s) [20:14:46] T369979: foundationwiki: Restrict `unfuzzy` right to autoconfirmed users - https://phabricator.wikimedia.org/T369979 [20:15:01] MichaelG_WMF: the gate-and-submit for the same patch in master says https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php74/11754/console : SUCCESS in 24m 04s [20:15:58] anyway, waiting for rest of CI [20:15:58] (03PS1) 10BBlack: Add disc-appservers-ro to mock_etc metafo [dns] - 10https://gerrit.wikimedia.org/r/1054658 [20:15:58] (03PS1) 10BBlack: Switch appservers-ro to active/passive [dns] - 10https://gerrit.wikimedia.org/r/1054659 [20:15:59] (03PS1) 10BBlack: Remove disc-appservers-ro from mock_etc geo file [dns] - 10https://gerrit.wikimedia.org/r/1054660 [20:16:11] urbanecm: Mh. Thanks. Guess I just never noticed that and somehow associated CC with being faster due to its tests being faster than GrowthExperiments? [20:16:41] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:16:51] MichaelG_WMF: it is faster in the `test` run. that includes only that extension's tests, and CC's tests are faster than GE's. [20:17:29] but gate-and-submit runs more stuff (more tests, sometimes more PHP versions when we're switching, etc.) [20:21:04] Yeah, that I'm aware of. Though I think `test` also includes the extensions that are the dependencies for the tested extension, which is more for CC than GE. But I guess Gate-And-Submit might just be a strict superset of that? 
[20:21:28] *more for GrowthExperiments than CommunityConfiguration [20:21:41] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 28 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:21:59] MichaelG_WMF: yep, gate-and-submit should run https://gerrit.wikimedia.org/g/integration/config/+/327cd0d698cd8803f65b12891500cd4496dbf631/zuul/parameter_functions.py#956 [20:22:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054558 (owner: 10Michael Große) [20:22:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054553 (https://phabricator.wikimedia.org/T368606) (owner: 10Michael Große) [20:22:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1054554 (https://phabricator.wikimedia.org/T368606) (owner: 10Michael Große) [20:23:22] urbanecm: let's start with wmf.13? [20:23:39] MichaelG_WMF: i'm pulling all of them in at the same time [20:23:40] (03PS14) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [20:23:50] alright [20:23:50] just need CI to merge [20:25:23] oh right, I saw the additional +2 from TrainBranchBot and my mind somehow went Jenkins. [20:25:31] 2 more minutes then :) [20:25:37] yeah.
[20:25:56] (03PS6) 10BCornwall: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 [20:25:56] (03PS7) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) [20:27:16] (03Merged) 10jenkins-bot: Ensure every test-config has valid defaults [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054558 (owner: 10Michael Große) [20:27:17] (03Merged) 10jenkins-bot: Merge partial config with defaults [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054553 (https://phabricator.wikimedia.org/T368606) (owner: 10Michael Große) [20:27:32] finally [20:27:39] one more patch... [20:27:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy2008.codfw.wmnet with OS bookworm [20:28:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9987852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm executed with errors... 
[20:28:53] (03Merged) 10jenkins-bot: Merge partial config with defaults [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1054554 (https://phabricator.wikimedia.org/T368606) (owner: 10Michael Große) [20:29:25] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1054558|Ensure every test-config has valid defaults]], [[gerrit:1054553|Merge partial config with defaults (T368606)]], [[gerrit:1054554|Merge partial config with defaults (T368606)]] [20:29:30] T368606: Community configuration defaults are not merged with partially-specified objects - https://phabricator.wikimedia.org/T368606 [20:29:42] (03PS8) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) [20:30:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbproxy2008.codfw.wmnet with OS bookworm [20:30:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbproxy2008.codfw.wmnet with OS bookworm [20:31:07] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9987858 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm [20:31:09] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9987859 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm executed with errors... 
[20:32:51] (03PS1) 10JHathaway: pcc-puppetdb: remove java pinning [puppet] - 10https://gerrit.wikimedia.org/r/1054661 (https://phabricator.wikimedia.org/T367547) [20:33:12] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054661 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [20:33:13] !log urbanecm@deploy1002 urbanecm, migr: Backport for [[gerrit:1054558|Ensure every test-config has valid defaults]], [[gerrit:1054553|Merge partial config with defaults (T368606)]], [[gerrit:1054554|Merge partial config with defaults (T368606)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:33:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T367781)', diff saved to https://phabricator.wikimedia.org/P66686 and previous config saved to /var/cache/conftool/dbconfig/20240716-203331-arnaudb.json [20:33:33] MichaelG_WMF: can you take a look and test please? :) [20:33:35] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [20:34:11] urbanecm: works for both testwiki as well as eswiki \o/ [20:34:31] yay! 
[20:34:32] !log urbanecm@deploy1002 urbanecm, migr: Continuing with sync [20:34:33] that is, both wmf.14 as well as wmf.13 respectively [20:38:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbproxy2008.codfw.wmnet with OS bookworm [20:39:05] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9987937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm [20:39:20] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1054558|Ensure every test-config has valid defaults]], [[gerrit:1054553|Merge partial config with defaults (T368606)]], [[gerrit:1054554|Merge partial config with defaults (T368606)]] (duration: 09m 55s) [20:39:24] T368606: Community configuration defaults are not merged with partially-specified objects - https://phabricator.wikimedia.org/T368606 [20:39:36] MichaelG_WMF: and live! :) [20:39:39] kimberly_sarabia: over to you :) [20:40:17] urbanecm: Thanks, confirmed 👍 [20:40:28] Ok I'm here [20:42:52] kimberly_sarabia: I thought you were going to deploy your patch? [20:43:02] Or do you want me to deploy for you? [20:43:22] yes can you deploy for me? sorry for the confusion. I haven't been trained yet on that [20:43:52] Oh, no problem. Sorry, I misunderstood. [20:43:55] Let's get started! 
[20:44:08] (03PS7) 10BCornwall: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 [20:44:08] (03PS9) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) [20:44:20] (03PS12) 10Jdlrobson: [July 16th] Enable dark mode for logged out users (tier 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) [20:44:42] (03CR) 10Urbanecm: [C:03+2] [July 16th] Enable dark mode for logged out users (tier 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [20:45:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [20:45:23] (03Merged) 10jenkins-bot: [July 16th] Enable dark mode for logged out users (tier 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [20:45:53] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1050083|[July 16th] Enable dark mode for logged out users (tier 1) (T367150)]] [20:46:03] T367150: Deploy dark mode to logged-out users in tier 1 and 2 wikis on the Vector2022 and Minerva skin - https://phabricator.wikimedia.org/T367150 [20:48:27] !log urbanecm@deploy1002 urbanecm, jdlrobson: Backport for [[gerrit:1050083|[July 16th] Enable dark mode for logged out users (tier 1) (T367150)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:48:37] kimberly_sarabia: can you test at mwdebug, please? 
[20:48:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P66687 and previous config saved to /var/cache/conftool/dbconfig/20240716-204838-arnaudb.json [20:49:35] urbanecm: LGTM [20:49:43] proceeding [20:49:45] !log urbanecm@deploy1002 urbanecm, jdlrobson: Continuing with sync [20:54:36] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1050083|[July 16th] Enable dark mode for logged out users (tier 1) (T367150)]] (duration: 08m 43s) [20:54:41] T367150: Deploy dark mode to logged-out users in tier 1 and 2 wikis on the Vector2022 and Minerva skin - https://phabricator.wikimedia.org/T367150 [20:54:46] kimberly_sarabia: it's live :). [20:55:29] urbanecm: Great! Thanks [20:55:49] no problem! [21:03:14] (03PS8) 10BCornwall: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 [21:03:14] (03PS10) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) [21:03:18] Sorry for the dumb question but the changes I saw in mwdebug for enwiki, zhwiki, etc. are weirdly not showing in prod except for testwiki? Did we miss something? [21:03:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P66688 and previous config saved to /var/cache/conftool/dbconfig/20240716-210345-arnaudb.json [21:04:05] oops scratch that [21:04:51] oh never mind, still not seeing changes outside of mwdebug. 
let me know if anyone has ideas [21:05:14] (03PS9) 10BCornwall: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 [21:05:14] (03PS11) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) [21:18:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T367781)', diff saved to https://phabricator.wikimedia.org/P66689 and previous config saved to /var/cache/conftool/dbconfig/20240716-211852-arnaudb.json [21:18:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2194.codfw.wmnet with reason: Maintenance [21:18:57] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [21:19:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2194.codfw.wmnet with reason: Maintenance [21:19:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T367781)', diff saved to https://phabricator.wikimedia.org/P66690 and previous config saved to /var/cache/conftool/dbconfig/20240716-211914-arnaudb.json [21:21:00] (03PS15) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [21:33:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:35:56] (03PS1) 10Scott French: DNM: service: (appserver|api)-ro to active-passive [puppet] - 10https://gerrit.wikimedia.org/r/1054667 [21:37:12] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054667 (owner: 10Scott French) [21:40:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2194 (T367781)', diff saved to https://phabricator.wikimedia.org/P66691 and previous config saved to /var/cache/conftool/dbconfig/20240716-214054-arnaudb.json [21:40:59] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [21:46:55] (03PS16) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [21:56:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P66692 and previous config saved to /var/cache/conftool/dbconfig/20240716-215601-arnaudb.json [21:59:13] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy2008.codfw.wmnet with OS bookworm [21:59:26] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9988300 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm executed with errors... 
[22:11:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P66693 and previous config saved to /var/cache/conftool/dbconfig/20240716-221109-arnaudb.json [22:26:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T367781)', diff saved to https://phabricator.wikimedia.org/P66694 and previous config saved to /var/cache/conftool/dbconfig/20240716-222616-arnaudb.json [22:26:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance [22:26:20] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [22:26:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance [22:26:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T367781)', diff saved to https://phabricator.wikimedia.org/P66695 and previous config saved to /var/cache/conftool/dbconfig/20240716-222638-arnaudb.json [22:40:23] !log removing 9 files for legal compliance [22:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:36] (03PS11) 10BCornwall: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 [22:48:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T367781)', diff saved to https://phabricator.wikimedia.org/P66696 and previous config saved to /var/cache/conftool/dbconfig/20240716-224815-arnaudb.json [22:48:20] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [23:03:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P66697 and previous config saved to /var/cache/conftool/dbconfig/20240716-230322-arnaudb.json [23:18:29] !log arnaudb@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P66698 and previous config saved to /var/cache/conftool/dbconfig/20240716-231829-arnaudb.json [23:28:54] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1098-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:29:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:33:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T367781)', diff saved to https://phabricator.wikimedia.org/P66699 and previous config saved to /var/cache/conftool/dbconfig/20240716-233336-arnaudb.json [23:33:41] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [23:35:27] (03CR) 10Cwhite: [C:03+2] admin: remove unused ssh key for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1054645 (owner: 10Cwhite) [23:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1054682 [23:38:33] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1054682 (owner: 
10TrainBranchBot) [23:44:09] (03PS4) 10Pppery: Update source strings for 2024.19 [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1054635 (https://phabricator.wikimedia.org/T363188) [23:50:10] (03PS12) 10BCornwall: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 [23:57:36] (03CR) 10BCornwall: "I tested on ncmonitor1001 and verified functionality." [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [23:57:45] (03PS15) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) [23:58:53] (03PS1) 10Kimberly Sarabia: skin-themes dblist is expanded to include tier 2 wikis as well as tier 1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054685 (https://phabricator.wikimedia.org/T367150)