[00:39:17] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/931925
[00:39:19] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/931925 (owner: TrainBranchBot)
[01:01:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/931925 (owner: TrainBranchBot)
[01:05:47] (CR) Anzx: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/932284 (https://phabricator.wikimedia.org/T340276) (owner: Anzx)
[01:12:41] (PS3) Anzx: Rename namespace on extwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/932272 (https://phabricator.wikimedia.org/T337696)
[01:16:38] (PS4) Anzx: Change dewiki import sources [mediawiki-config] - https://gerrit.wikimedia.org/r/932283 (https://phabricator.wikimedia.org/T340264)
[01:43:29] PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100%
[01:44:25] RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[02:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:20] (ProbeDown) firing: (6) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:27:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:42:17] PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100%
[04:44:16] (CR) TChin: eventstreams use kafka egress and service mesh (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/932165 (https://phabricator.wikimedia.org/T335024) (owner: TChin)
[04:44:35] RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[05:13:03] (PS1) KartikMistry: Update cxserver to 2023-06-26-050753-production [deployment-charts] - https://gerrit.wikimedia.org/r/932683 (https://phabricator.wikimedia.org/T340236)
[05:20:41] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:07:46] * kart_ updating cxserver
[06:08:18] (CR) KartikMistry: [C: +2] Update cxserver to 2023-06-26-050753-production [deployment-charts] - https://gerrit.wikimedia.org/r/932683 (https://phabricator.wikimedia.org/T340236) (owner: KartikMistry)
[06:09:17] (Merged) jenkins-bot: Update cxserver to 2023-06-26-050753-production [deployment-charts] - https://gerrit.wikimedia.org/r/932683 (https://phabricator.wikimedia.org/T340236) (owner: KartikMistry)
[06:10:48] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:11:08] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:14:57] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:15:33] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:19:42] (PS1) Marostegui: instances.yaml: Remove db1118 from dbctl [puppet] - https://gerrit.wikimedia.org/r/932685 (https://phabricator.wikimedia.org/T326683)
[06:20:12] (CR) Marostegui: [C: +2] instances.yaml: Remove db1118 from dbctl [puppet] - https://gerrit.wikimedia.org/r/932685 (https://phabricator.wikimedia.org/T326683) (owner: Marostegui)
[06:20:25] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:20:34] (ProbeDown) firing: (6) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:20:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1118 from dbctl T326683', diff saved to https://phabricator.wikimedia.org/P49477 and previous config saved to /var/cache/conftool/dbconfig/20230626-062036-marostegui.json
[06:20:41] T326683: Decommission db1106-db1125 - https://phabricator.wikimedia.org/T326683
[06:20:48] dbproxy alerts is to be expected
[06:21:25] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:25:10] (PS1) Marostegui: mariadb: Move db1118 to m3 [puppet] - https://gerrit.wikimedia.org/r/932686 (https://phabricator.wikimedia.org/T335092)
[06:26:19] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:26:28] (CR) Marostegui: [C: +2] mariadb: Move db1118 to m3 [puppet] - https://gerrit.wikimedia.org/r/932686 (https://phabricator.wikimedia.org/T335092) (owner: Marostegui)
[06:26:36] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:26:43] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:26:55] (CR) Stevemunene: [C: +2] analytics: Decommission analytics106[4-6] from hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/930582 (https://phabricator.wikimedia.org/T317861) (owner: Stevemunene)
[06:27:14] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:27:22] (PS1) Muehlenhoff: Extend access for sannita [puppet] - https://gerrit.wikimedia.org/r/932687
[06:27:27] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:27:31] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:27:39] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:27:41] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:28:05] !log Updated cxserver to 2023-06-26-050753-production (T340236, T339896)
[06:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:10] T339896: Enable MinT for all languages supported by IndicTrans2 - https://phabricator.wikimedia.org/T339896
[06:28:11] T340236: MinT translates to English when Hindi-Santali or any other language-Santali is selected - https://phabricator.wikimedia.org/T340236
[06:30:24] (CR) Muehlenhoff: [C: +2] Extend access for sannita [puppet] - https://gerrit.wikimedia.org/r/932687 (owner: Muehlenhoff)
[06:30:45] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:30:47] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:30:57] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:31:23] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:32:05] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:32:09] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:32:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:34:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:07] PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:35:40] (CR) Elukey: [V: +1] profile::cassandra: allow Prometheus nodes to check ports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932663 (owner: Elukey)
[06:35:56] (Abandoned) Elukey: profile::cassandra: allow Prometheus nodes to check ports [puppet] - https://gerrit.wikimedia.org/r/932663 (owner: Elukey)
[06:36:04] (PS4) Elukey: cassandra::instance::monitoring: move alerts to prometheus [puppet] - https://gerrit.wikimedia.org/r/932427 (https://phabricator.wikimedia.org/T288470)
[06:53:45] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:54:17] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:56:06] (PS1) Marostegui: db1118: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/932690
[06:56:32] (CR) Marostegui: [C: +2] db1118: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/932690 (owner: Marostegui)
[07:00:05] Amir1, Urbanecm, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T0700).
[07:00:05] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:14] o/
[07:01:19] aanzx: ping
[07:01:20] 0/
[07:01:39] (CR) Majavah: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/932283 (https://phabricator.wikimedia.org/T340264) (owner: Anzx)
[07:02:19] aanzx: your last patch seems empty
[07:03:19] (PS4) Majavah: Rename namespace on extwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/932272 (https://phabricator.wikimedia.org/T337696) (owner: Anzx)
[07:03:23] (CR) Majavah: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/932272 (https://phabricator.wikimedia.org/T337696) (owner: Anzx)
[07:04:14] taavi: https://gerrit.wikimedia.org/r/c/932284 this one?
[07:04:22] yes
[07:05:33] aanzx: I'll deploy the first two, we can look at the last one after that
[07:05:52] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/932283 (https://phabricator.wikimedia.org/T340264) (owner: Anzx)
[07:05:56] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/932272 (https://phabricator.wikimedia.org/T337696) (owner: Anzx)
[07:05:57] do you have the x-wikimedia-debug browser extension installed?
[07:06:02] taavi: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/929742/2/dblists/mobile-anon-talk.dblist says file had to be auto generated
[07:06:06] Yes
[07:07:05] (Merged) jenkins-bot: Change dewiki import sources [mediawiki-config] - https://gerrit.wikimedia.org/r/932283 (https://phabricator.wikimedia.org/T340264) (owner: Anzx)
[07:07:09] (Merged) jenkins-bot: Rename namespace on extwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/932272 (https://phabricator.wikimedia.org/T337696) (owner: Anzx)
[07:07:39] !log taavi@deploy1002 Started scap: Backport for [[gerrit:932283|Change dewiki import sources (T340264)]], [[gerrit:932272|Rename namespace on extwiki (T337696)]]
[07:07:44] T337696: In ext.wiki, change namespace Güiquipeya to Güiquipedia - https://phabricator.wikimedia.org/T337696
[07:07:45] T340264: Change dewiki import sources - https://phabricator.wikimedia.org/T340264
[07:09:38] taavi: can I edit this manually https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/929742/2/dblists/mobile-anon-talk.dblist
[07:10:17] no, as the comment says you should use the `composer manage-dblist` command
[07:11:09] Ok , i will do it for afternoon backport
[07:16:38] !log taavi@deploy1002 anzx and taavi: Backport for [[gerrit:932283|Change dewiki import sources (T340264)]], [[gerrit:932272|Rename namespace on extwiki (T337696)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:16:43] T337696: In ext.wiki, change namespace Güiquipeya to Güiquipedia - https://phabricator.wikimedia.org/T337696
[07:16:44] T340264: Change dewiki import sources - https://phabricator.wikimedia.org/T340264
[07:16:45] aanzx: please test both of those patches on a mwdebug server
[07:16:54] Ok
[07:21:28] taavi: dewiki ok, extwiki name space change still shows guiquipeya instead of pedia
[07:21:54] hmm, let me see
[07:23:14] (CR) Elukey: "I found the real issue, finally:" [puppet] - https://gerrit.wikimedia.org/r/932663 (owner: Elukey)
[07:24:37] ah. I didn't spot this earlier but you've changed the wrong setting I think - the correct setting is wgMetaNamespace but you've changed wgSitename
[07:24:44] do you want to write a patch to fix it or should I?
[07:26:19] I will do it now
[07:29:06] (PS2) Alexandros Kosiaris: Git template: Clean up git commit template message [deployment-charts] - https://gerrit.wikimedia.org/r/921668
[07:30:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:00] taavi: can you write patch i couldn't find metanamespace
[07:31:07] !log taavi@deploy1002 Sync cancelled.
[07:31:11] sure, give me one second
[07:33:35] (PS1) Majavah: extwiki: Update project namespace name [mediawiki-config] - https://gerrit.wikimedia.org/r/932788 (https://phabricator.wikimedia.org/T337696)
[07:33:53] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/932788 (https://phabricator.wikimedia.org/T337696) (owner: Majavah)
[07:34:40] (Merged) jenkins-bot: extwiki: Update project namespace name [mediawiki-config] - https://gerrit.wikimedia.org/r/932788 (https://phabricator.wikimedia.org/T337696) (owner: Majavah)
[07:34:58] !log taavi@deploy1002 Started scap: Backport for [[gerrit:932788|extwiki: Update project namespace name (T337696)]]
[07:35:02] T337696: In ext.wiki, change namespace Güiquipeya to Güiquipedia - https://phabricator.wikimedia.org/T337696
[07:36:26] !log taavi@deploy1002 taavi: Backport for [[gerrit:932788|extwiki: Update project namespace name (T337696)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:36:56] aanzx: can you check if it works properly now?
[07:37:23] hmm, and I think we want to add an alias for the old name, otherwise links are going to break
[07:37:23] Thanks taavi , working now
[07:37:41] !log taavi@deploy1002 Sync cancelled.
[07:38:41] (PS1) Majavah: extwiki: Add an alias for old NS_PROJECT name [mediawiki-config] - https://gerrit.wikimedia.org/r/932789
[07:38:51] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/932789 (owner: Majavah)
[07:39:37] (Merged) jenkins-bot: extwiki: Add an alias for old NS_PROJECT name [mediawiki-config] - https://gerrit.wikimedia.org/r/932789 (owner: Majavah)
[07:39:54] !log taavi@deploy1002 Started scap: Backport for [[gerrit:932789|extwiki: Add an alias for old NS_PROJECT name]]
[07:41:04] (CR) Alexandros Kosiaris: [C: +2] Git template: Clean up git commit template message [deployment-charts] - https://gerrit.wikimedia.org/r/921668 (owner: Alexandros Kosiaris)
[07:41:23] !log taavi@deploy1002 taavi: Backport for [[gerrit:932789|extwiki: Add an alias for old NS_PROJECT name]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[07:41:49] (Merged) jenkins-bot: Git template: Clean up git commit template message [deployment-charts] - https://gerrit.wikimedia.org/r/921668 (owner: Alexandros Kosiaris)
[07:42:00] and syncing
[07:42:06] Thanks
[07:48:29] (CR) Jaime Nuche: [C: +1] releases: Fix alert for releases-jenkins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932414 (owner: EoghanGaffney)
[07:48:44] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:932789|extwiki: Add an alias for old NS_PROJECT name]] (duration: 08m 49s)
[07:48:48] all done
[08:00:59] (PS1) Arturo Borrero Gonzalez: reports/network: ignore IPv6 for cloudservices boxes [software/netbox-extras] - https://gerrit.wikimedia.org/r/932794 (https://phabricator.wikimedia.org/T307357)
[08:03:58] (PS1) Elukey: cassandra::instance::monitoring: remove wrong servername [puppet] - https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470)
[08:04:15] (CR) Ayounsi: [C: +1] reports/network: ignore IPv6 for cloudservices boxes [software/netbox-extras] - https://gerrit.wikimedia.org/r/932794 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[08:04:54] (PS1) Muehlenhoff: Remove access for paramd [puppet] - https://gerrit.wikimedia.org/r/932796
[08:05:30] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41993/console" [puppet] - https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:05:53] (CR) Elukey: [V: +1 C: +2] cassandra::instance::monitoring: remove wrong servername [puppet] - https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:06:33] (CR) Arturo Borrero Gonzalez: [C: +2] reports/network: ignore IPv6 for cloudservices boxes [software/netbox-extras] - https://gerrit.wikimedia.org/r/932794 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[08:06:56] !log aborrero@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[08:07:00] !log aborrero@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[08:07:20] !log aborrero@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[08:07:25] !log aborrero@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[08:12:22] (CR) Muehlenhoff: [C: +2] Remove access for paramd [puppet] - https://gerrit.wikimedia.org/r/932796 (owner: Muehlenhoff)
[08:14:52] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Paramita Das out of all services on: 1261 hosts
[08:15:29] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Paramita Das out of all services on: 1261 hosts
[08:17:44] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Paramita Das out of all services on: 771 hosts
[08:18:06] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Paramita Das out of all services on: 771 hosts
[08:18:57] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Paramita Das out of all services on: 19 hosts
[08:19:02] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Paramita Das out of all services on: 19 hosts
[08:26:23] jouncebot: nowandnext
[08:26:23] No deployments scheduled for the next 1 hour(s) and 33 minute(s)
[08:26:23] In 1 hour(s) and 33 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1000)
[08:34:34] (CR) Filippo Giunchedi: [C: +1] prometheus: replace Apache 2.2 access control syntax [puppet] - https://gerrit.wikimedia.org/r/932443 (https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[08:34:45] (CR) Filippo Giunchedi: [C: +1] thanos: replace Apache 2.2 with modern syntax for access control [puppet] - https://gerrit.wikimedia.org/r/932444 (https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[08:40:02] (CR) Kosta Harlan: "What else needs to happen to make mariadb images available in GitLab CI? I still see messages saying that the image is not available https" [puppet] - https://gerrit.wikimedia.org/r/932328 (https://phabricator.wikimedia.org/T339352) (owner: Kosta Harlan)
[08:41:55] (PS1) Elukey: cassandra::instance::monitoring: add 'cassandra' as servername [puppet] - https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470)
[08:43:22] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41994/console" [puppet] - https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:46:16] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:49:36] (PS1) Arturo Borrero Gonzalez: cloudservices2005-dev: give it proper role and name. [puppet] - https://gerrit.wikimedia.org/r/932800 (https://phabricator.wikimedia.org/T338779)
[08:51:13] (PS2) Arturo Borrero Gonzalez: cloudservices2005-dev: give it proper role and name. [puppet] - https://gerrit.wikimedia.org/r/932800 (https://phabricator.wikimedia.org/T338779)
[08:55:34] (PS2) Elukey: cassandra::instance::monitoring: add 'cassandra' as servername [puppet] - https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470)
[08:55:36] (PS1) Elukey: cassandra::instance: add CN:cassandra to all PKI certs [puppet] - https://gerrit.wikimedia.org/r/932801 (https://phabricator.wikimedia.org/T288470)
[08:56:56] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41995/console" [puppet] - https://gerrit.wikimedia.org/r/932801 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:58:13] (CR) Elukey: "@Jbond: IIRC in this way I'd still get the fqdn in the cert, but also CN:cassandra right? Basically like we do for Kafka." [puppet] - https://gerrit.wikimedia.org/r/932801 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[09:01:47] (CR) Vgutierrez: [C: +1] [beta] Update wgCdnServersNoPurge for new cache server (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/932380 (https://phabricator.wikimedia.org/T327742) (owner: Fabfur)
[09:02:57] (PS1) Muehlenhoff: Extend access for tandic [puppet] - https://gerrit.wikimedia.org/r/932802
[09:03:32] (PS1) MVernon: hiera: set ms-be1068 to be an object expirer [puppet] - https://gerrit.wikimedia.org/r/932803 (https://phabricator.wikimedia.org/T229584)
[09:03:38] /13
[09:04:26] SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (aborrero) {F37119766}
[09:06:22] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[09:08:28] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2005-dev - aborrero@cumin2002"
[09:08:32] (CR) Alexandros Kosiaris: [C: +1] mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[09:09:23] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2005-dev - aborrero@cumin2002"
[09:09:23] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:09:55] (CR) Muehlenhoff: [C: +2] Extend access for tandic [puppet] - https://gerrit.wikimedia.org/r/932802 (owner: Muehlenhoff)
[09:10:20] (CR) Alexandros Kosiaris: [C: +1] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[09:10:34] !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2005-dev.mgmt.codfw.wmnet on all recursors
[09:10:37] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudservices2005-dev.mgmt.codfw.wmnet on all recursors
[09:10:54] !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2005-dev.codfw.wmnet on all recursors
[09:10:57] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudservices2005-dev.codfw.wmnet on all recursors
[09:11:12] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2005-dev.codfw.wmnet with OS bullseye
[09:11:27] SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet wi...
[09:13:23] (PS3) Arturo Borrero Gonzalez: cloudservices2005-dev: give it proper role and name [puppet] - https://gerrit.wikimedia.org/r/932800 (https://phabricator.wikimedia.org/T338779)
[09:14:24] (CR) Arturo Borrero Gonzalez: [C: +2] cloudservices2005-dev: give it proper role and name [puppet] - https://gerrit.wikimedia.org/r/932800 (https://phabricator.wikimedia.org/T338779) (owner: Arturo Borrero Gonzalez)
[09:14:51] (PS9) Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[09:17:31] !log aborrero@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudservices2005-dev
[09:17:38] (CR) Muehlenhoff: [C: +2] Add missing types to ferm::service [puppet] - https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[09:17:41] !log aborrero@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudservices2005-dev
[09:17:48] !log aborrero@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudservices2005-dev.codfw.wmnet with OS bullseye
[09:18:00] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2005-dev.codfw.wmnet with OS bullseye
[09:18:02] SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet with O...
[09:18:14] SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet wi...
[09:18:48] (PS10) Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[09:20:35] (CR) Klausman: [C: +2] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/932804 (owner: Klausman)
[09:22:05] (PS2) Klausman: homedirs/klausman: clean up dotfiles [puppet] - https://gerrit.wikimedia.org/r/932804
[09:23:50] (PS16) Muehlenhoff: ferm: Allow passing sets to an srange or drange [puppet] - https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497)
[09:24:26] (CR) Klausman: [C: +2] homedirs/klausman: clean up dotfiles [puppet] - https://gerrit.wikimedia.org/r/932804 (owner: Klausman)
[09:29:00] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[09:29:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: Setup in progress
[09:29:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: Setup in progress
[09:30:34] (CR) Marostegui: [C: +1] hiera: set ms-be1068 to be an object expirer [puppet] - https://gerrit.wikimedia.org/r/932803 (https://phabricator.wikimedia.org/T229584) (owner: MVernon)
[09:32:49] jouncebot: nowandnext
[09:32:49] No deployments scheduled for the next 0 hour(s) and 27 minute(s)
[09:32:49] In 0 hour(s) and 27 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1000)
[09:32:55] fabfur: ^^
[09:32:59] tnx
[09:34:17] (PS1) Btullis: Add a workaround for a kerberos issue affecting Presto version 0.281 [puppet] - https://gerrit.wikimedia.org/r/932827 (https://phabricator.wikimedia.org/T337335)
[09:35:21] (PS2) Btullis: Add a workaround for a kerberos issue affecting Presto version 0.281 [puppet] - https://gerrit.wikimedia.org/r/932827 (https://phabricator.wikimedia.org/T337335)
[09:37:18] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2005-dev.codfw.wmnet with reason: host reimage
[09:38:02] (CR) Btullis: [C: +2] Add a workaround for a kerberos issue affecting Presto version 0.281 [puppet] - https://gerrit.wikimedia.org/r/932827 (https://phabricator.wikimedia.org/T337335) (owner: Btullis)
[09:40:01] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2005-dev.codfw.wmnet with reason: host reimage
[09:41:23] jouncebot: nowandnext
[09:41:23] No deployments scheduled for the next 0 hour(s) and 18 minute(s)
[09:41:23] In 0 hour(s) and 18 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1000)
[09:41:59] (CR) TrainBranchBot: [C: +2] "Approved by fabfur@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/932380 (https://phabricator.wikimedia.org/T327742) (owner: Fabfur)
[09:43:47] (Merged) jenkins-bot: [beta] Update wgCdnServersNoPurge for new cache server [mediawiki-config] - https://gerrit.wikimedia.org/r/932380 (https://phabricator.wikimedia.org/T327742) (owner: Fabfur)
[09:46:35] SRE, Data-Engineering, Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (MoritzMuehlenhoff)
[09:52:17] (Abandoned) Aqu: [WIP] Build spark yarn archive for Spark 3 from conda-analytics package [puppet] - https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: Aqu)
[09:53:50] SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (aborrero)
[09:54:10] SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (aborrero)
[09:54:53] (CR) Aqu: "This repo could be deprecated now that the migration to Airflow 2.5 is done." [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/881873 (https://phabricator.wikimedia.org/T326194) (owner: Aqu)
[09:55:19] (CR) EoghanGaffney: [C: +1] contint: replace Apache 2.2 with 2.4 syntax for access control [puppet] - https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[09:55:30] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (Clement_Goubert)
[09:55:41] (CR) EoghanGaffney: [C: +1] releases-jenkins: replace Apache 2.2 with 2.4 syntax for access control [puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[09:55:53] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/932400 (owner: Ayounsi)
[09:58:19] (PS2) D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/930798
[09:58:52] (CR) Volans: "I don't know the details to vote on this, but for me it's a +1 for fixing this at the apache layer for now." [puppet] - https://gerrit.wikimedia.org/r/932404 (owner: Slyngshede)
[10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1000)
[10:00:06] claime: A patch you scheduled for MediaWiki infrastucture (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:00:20] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [10:01:06] !log mw-on-k8s: Redirect closed wikis to mw-on-k8s - T337490 [10:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:16] T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 [10:01:26] (03PS1) 10Muehlenhoff: Point codfw URL downloader to new bullseye host [dns] - 10https://gerrit.wikimedia.org/r/932830 (https://phabricator.wikimedia.org/T329945) [10:01:55] (03PS1) 10Slyngshede: SUL Account: Allow users to dismiss account linking. [software/bitu] - 10https://gerrit.wikimedia.org/r/932831 [10:02:00] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:02:42] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:03:29] (03PS1) 10Arturo Borrero Gonzalez: cloudservices2005-dev: drop cloud-private base interface override [puppet] - 10https://gerrit.wikimedia.org/r/932832 (https://phabricator.wikimedia.org/T338779) [10:04:03] (03CR) 10Muehlenhoff: [C: 03+2] Point codfw URL downloader to new bullseye host [dns] - 10https://gerrit.wikimedia.org/r/932830 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [10:04:49] (03PS1) 10Arturo Borrero Gonzalez: acme_chief: allow cloudservices2005-dev to access ldap-codfw1dev cert [puppet] - 10https://gerrit.wikimedia.org/r/932833 (https://phabricator.wikimedia.org/T338779) [10:05:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices2005-dev: drop cloud-private base interface 
override [puppet] - 10https://gerrit.wikimedia.org/r/932832 (https://phabricator.wikimedia.org/T338779) (owner: 10Arturo Borrero Gonzalez) [10:05:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] acme_chief: allow cloudservices2005-dev to access ldap-codfw1dev cert [puppet] - 10https://gerrit.wikimedia.org/r/932833 (https://phabricator.wikimedia.org/T338779) (owner: 10Arturo Borrero Gonzalez) [10:07:07] (03PS1) 10Arturo Borrero Gonzalez: acme_chief: extend ldap-codfw1dev with cloudservices2005-dev SNI [puppet] - 10https://gerrit.wikimedia.org/r/932834 (https://phabricator.wikimedia.org/T338779) [10:08:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] acme_chief: extend ldap-codfw1dev with cloudservices2005-dev SNI [puppet] - 10https://gerrit.wikimedia.org/r/932834 (https://phabricator.wikimedia.org/T338779) (owner: 10Arturo Borrero Gonzalez) [10:08:50] (03CR) 10Volans: [C: 03+1] "Thanks for the fix! LGTM, couple of nits inline, no blockers." [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [10:16:13] (03PS8) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) [10:16:45] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts krb2001.codfw.wmnet [10:19:17] !log mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s - T337490 [10:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:21] T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 [10:19:27] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [10:19:32] (03CR) 10MVernon: [C: 03+2] hiera: set ms-be1068 to be an object expirer [puppet] - 10https://gerrit.wikimedia.org/r/932803 
(https://phabricator.wikimedia.org/T229584) (owner: 10MVernon) [10:21:53] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:24:10] PROBLEM - Check systemd state on ms-be1068 is CRITICAL: CRITICAL - degraded: The following units failed: swift-container-sharder.service,swift-object-reconstructor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:47] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002" [10:25:02] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:25:19] (03PS1) 10Arturo Borrero Gonzalez: pdns_server: db_backup: fix grant statement order [puppet] - 10https://gerrit.wikimedia.org/r/932838 [10:25:20] (ProbeDown) firing: (6) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:25:29] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002" [10:25:30] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices2005-dev.codfw.wmnet with OS bullseye [10:25:44] 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet with O... 
[10:25:47] (03PS1) 10Btullis: Revert "Enable the PRESTO_EXPAND_DATA feature flag in Superset" [puppet] - 10https://gerrit.wikimedia.org/r/932644 [10:26:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:26:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:26:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb2001.codfw.wmnet [10:26:11] (03PS2) 10Btullis: Revert "Enable the PRESTO_EXPAND_DATA feature flag in Superset" [puppet] - 10https://gerrit.wikimedia.org/r/932644 (https://phabricator.wikimedia.org/T340144) [10:28:08] (03PS1) 10Muehlenhoff: Remove krb2001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/932839 (https://phabricator.wikimedia.org/T340433) [10:29:59] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert) [10:30:53] (03CR) 10Muehlenhoff: [C: 03+2] Remove krb2001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/932839 (https://phabricator.wikimedia.org/T340433) (owner: 10Muehlenhoff) [10:32:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:32:39] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudservices2005-dev - aborrero@cumin2002" [10:32:52] (03CR) 10Muehlenhoff: "Rebased on top of the latest type changes, ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:33:23] !log aborrero@cumin2002 
END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudservices2005-dev - aborrero@cumin2002" [10:37:12] (03PS1) 10Clément Goubert: mw-on-k8s: Redirect officewiki to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/932857 (https://phabricator.wikimedia.org/T337490) [10:37:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] pdns_server: db_backup: fix grant statement order [puppet] - 10https://gerrit.wikimedia.org/r/932838 (owner: 10Arturo Borrero Gonzalez) [10:38:47] (03PS3) 10Daniel Kinzler: Parsoid: Disable PC writes on dewiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) [10:41:13] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41996/console" [puppet] - 10https://gerrit.wikimedia.org/r/932857 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [10:42:16] (03PS1) 10Muehlenhoff: Extend access for wangombe [puppet] - 10https://gerrit.wikimedia.org/r/933059 [10:44:10] (03PS1) 10AikoChou: changeprop: update page_change_kind for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/933060 (https://phabricator.wikimedia.org/T328899) [10:44:55] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for wangombe [puppet] - 10https://gerrit.wikimedia.org/r/933059 (owner: 10Muehlenhoff) [10:45:15] (03CR) 10Klausman: [C: 03+1] cassandra::instance: add CN:cassandra to all PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/932801 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [10:47:08] (03CR) 10Btullis: [C: 03+2] Revert "Enable the PRESTO_EXPAND_DATA feature flag in Superset" [puppet] - 10https://gerrit.wikimedia.org/r/932644 (https://phabricator.wikimedia.org/T340144) (owner: 10Btullis) [10:47:44] 10SRE, 10AbuseFilter, 10serviceops, 10PHP 7.4 support: Regular expression "х[ÿý]и" match "х и" in Abusefilter - 
https://phabricator.wikimedia.org/T340068 (10Clement_Goubert) >>! In T340068#8962701, @Daimona wrote: >>>! In T340068#8962700, @Reedy wrote: >>>>! In T340068#8962699, @Daimona wrote: >>>>>! In T3... [10:49:20] (03CR) 10Elukey: [C: 03+2] cassandra::instance: add CN:cassandra to all PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/932801 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [10:50:47] (03PS3) 10Elukey: cassandra::instance::monitoring: add 'cassandra' as servername [puppet] - 10https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470) [10:51:45] PROBLEM - puppet last run on an-tool1010 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:51:57] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/932389 (owner: 10Slyngshede) [10:53:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41997/console" [puppet] - 10https://gerrit.wikimedia.org/r/932395 (owner: 10Majavah) [10:53:41] (03PS4) 10Elukey: cassandra::instance::monitoring: add 'cassandra' as servername [puppet] - 10https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470) [10:54:34] (03PS1) 10Matthias Mullie: Section-level notifications [extensions/ImageSuggestions] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/933066 (https://phabricator.wikimedia.org/T330931) [10:54:38] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [10:55:02] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41998/console" [puppet] - 10https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [10:56:20] (03CR) 10Elukey: [V: 03+1 C: 03+2] 
cassandra::instance::monitoring: add 'cassandra' as servername [puppet] - 10https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [10:56:24] (03CR) 10Jbond: [V: 03+1 C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/932395 (owner: 10Majavah) [10:56:52] elukey: happy for me to merge yours [10:57:13] RECOVERY - puppet last run on an-tool1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:58:07] jbond: +1! [10:58:28] elukey: done [10:58:30] <3 [10:58:55] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/932396 (owner: 10Majavah) [11:00:02] !log installing libfastjson security updates [11:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:21] (03CR) 10Jbond: "lgtm but will wait on response to moritz q" [puppet] - 10https://gerrit.wikimedia.org/r/932397 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah) [11:03:01] (03CR) 10Cparle: [C: 03+1] Section-level notifications [extensions/ImageSuggestions] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/933066 (https://phabricator.wikimedia.org/T330931) (owner: 10Matthias Mullie) [11:03:42] (03CR) 10Jbond: jwt_authorizer: support templates for validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932398 (owner: 10Majavah) [11:05:03] (03CR) 10Majavah: P:toolforge: aptly: add a system user to own the repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932397 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah) [11:05:47] (03CR) 10Jbond: [C: 03+1] "lgtm (although i didn't test/render the go template)" [puppet] - 10https://gerrit.wikimedia.org/r/932399 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah) [11:09:09] PROBLEM - puppet last run on idp-test1002 is CRITICAL: CRITICAL: Puppet last ran 3 days ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:09:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [11:10:28] (03CR) 10Slyngshede: [C: 03+2] D:apereo_cas::service fix group membership validation [puppet] - 10https://gerrit.wikimedia.org/r/932389 (owner: 10Slyngshede) [11:12:56] (03PS1) 10Muehlenhoff: Add library hint for libfastjson [puppet] - 10https://gerrit.wikimedia.org/r/933070 [11:14:43] RECOVERY - puppet last run on idp-test1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:16:01] (03CR) 10Slyngshede: P:netbox Redirect to idp on OIDC auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932404 (owner: 10Slyngshede) [11:20:20] (ProbeDown) firing: (6) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:56] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/932808 [11:27:19] PROBLEM - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [11:29:14] (03CR) 10EoghanGaffney: [C: 03+2] releases: Move the primary releases host from 1002 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/932228 (owner: 10EoghanGaffney) [11:30:17] (03PS1) 10AikoChou: ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933080 (https://phabricator.wikimedia.org/T328899) [11:30:36] (ProbeDown) firing: (3) Service releases1002:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:34:44] (03PS2) 10EoghanGaffney: releases: Switch releases.d.w to releases1003 [dns] - 10https://gerrit.wikimedia.org/r/932230 [11:34:56] (03PS1) 10Hnowlan: api-gateway: set memory limit for ratelimit container [deployment-charts] - 10https://gerrit.wikimedia.org/r/933084 [11:36:54] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/932230 (owner: 10EoghanGaffney) [11:37:02] (03CR) 10EoghanGaffney: [C: 03+2] releases: Switch releases.d.w to releases1003 [dns] - 10https://gerrit.wikimedia.org/r/932230 (owner: 10EoghanGaffney) [11:40:05] (03CR) 10Muehlenhoff: P:toolforge: aptly: add a system user to own the repository (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/932397 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah) [11:40:07] (03PS1) 10Fabfur: hiera: Added new bullseye instance for cache-text in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/933085 (https://phabricator.wikimedia.org/T327742) [11:40:35] (03CR) 10AikoChou: "I'd like to wait until the new logging is deployed to LW and inspect that, before merging this (maybe will need to update other configs)." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/933060 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [11:41:09] PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:17] (03CR) 10Jbond: [C: 04-1] "oh the irony, pcc is failing due to your patch[1] ruby 2.5 (buster) use plain keywords vs ruby2.7 that use symbols" [puppet] - 10https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [11:41:25] (03CR) 10Jaime Nuche: [C: 03+1] "Haven't tested the change, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [11:42:03] (03PS2) 10AikoChou: ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933080 (https://phabricator.wikimedia.org/T328899) [11:42:19] (03CR) 10Vgutierrez: [C: 03+1] hiera: Added new bullseye instance for cache-text in deployment-prep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933085 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur) [11:42:55] RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [11:43:37] PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-patches-releases-primary.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:05] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-patches-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:52] (03CR) 10Fabfur: [C: 03+2] hiera: Added new bullseye instance for cache-text in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/933085 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur) [11:45:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715638 
(https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm) [11:50:30] (03CR) 10Jbond: [C: 03+1] Add library hint for libfastjson [puppet] - 10https://gerrit.wikimedia.org/r/933070 (owner: 10Muehlenhoff) [11:50:36] (ProbeDown) firing: (3) Service releases1002:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:31] (03PS1) 10EoghanGaffney: releases: Revert "releases: Add motd warning about upcoming host change" [puppet] - 10https://gerrit.wikimedia.org/r/933086 [11:53:39] (03CR) 10Jaime Nuche: [C: 03+1] releases: Revert "releases: Add motd warning about upcoming host change" [puppet] - 10https://gerrit.wikimedia.org/r/933086 (owner: 10EoghanGaffney) [11:59:18] !log aborrero@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudservices2005-dev [11:59:31] !log aborrero@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudservices2005-dev [11:59:38] (03PS1) 10Btullis: Upgrade the analytics airflow instance to 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933087 (https://phabricator.wikimedia.org/T336286) [11:59:40] (03PS1) 10Btullis: Upgrade the search instance of airflow to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933088 (https://phabricator.wikimedia.org/T336286) [11:59:42] (03PS1) 10Btullis: Upgrade the research instance of airflow to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933089 (https://phabricator.wikimedia.org/T336286) [11:59:44] (03PS1) 10Btullis: Update the platform_eng airflow instance to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933090 (https://phabricator.wikimedia.org/T336286) [11:59:46] (03PS1) 10Btullis: Upgrade the analytics_product airflow instance to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933091 
(https://phabricator.wikimedia.org/T336286) [12:00:31] RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:59] (03CR) 10EoghanGaffney: [C: 03+2] releases: Revert "releases: Add motd warning about upcoming host change" [puppet] - 10https://gerrit.wikimedia.org/r/933086 (owner: 10EoghanGaffney) [12:00:59] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:21] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:03:07] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:04:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932404 (owner: 10Slyngshede) [12:05:15] fabfur: Going to merge your puppet change, that ok? [12:06:20] (03CR) 10Slyngshede: P:netbox Redirect to idp on OIDC auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932404 (owner: 10Slyngshede) [12:08:15] PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1003.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:35] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libfastjson [puppet] - 10https://gerrit.wikimedia.org/r/933070 (owner: 10Muehlenhoff) [12:10:29] (03CR) 10Hashar: [C: 03+2] "I am not sure from where the error comes. Sorry for the typo! 
:)" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/932640 (owner: 10Paladox) [12:10:32] (03CR) 10Slyngshede: [C: 03+2] P:netbox Redirect to idp on OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/932404 (owner: 10Slyngshede) [12:11:05] (03Merged) 10jenkins-bot: Change attribution name [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/932640 (owner: 10Paladox) [12:11:17] RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41999/console" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:15:25] (03CR) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:18:53] PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1003.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:23] RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:01] PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1003.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:07] RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:44] (03CR) 10Alexandros 
Kosiaris: [C: 04-1] "I 'd split this in 2 patches, one for each wiki, to be merged at least a few hours apert. That way, if the experiment ends up having unint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [12:38:51] PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1003.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:25] RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:25] (03PS1) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [12:50:51] (03CR) 10CI reject: [V: 04-1] [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [12:53:09] (03CR) 10Joal: [C: 03+1] refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - 10https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236) (owner: 10Ottomata) [12:53:54] (03CR) 10CDanis: [C: 03+1] Probenet: Restore mapping for Nigeria [dns] - 10https://gerrit.wikimedia.org/r/932468 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1300). [13:00:04] matthiasmullie and duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:19] o/ [13:00:47] (03Abandoned) 10Anzx: Enable tabs for non logged in users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932284 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx) [13:01:29] o/ [13:01:39] PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100% [13:01:57] Unavailable today to deploy, I'm sure someone else will be along shortly [13:02:11] Erm wait a bit while I investigate what's happening to parse1012 [13:02:43] Or I can take it out of the pool so you don't get errors [13:02:45] I'll do that [13:02:49] I can also self-service as long as effi is around to help monitor the jobrunners. [13:02:58] duesen: I'm around too [13:03:06] ok cool [13:03:14] let me know when you are done [13:03:31] RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [13:03:53] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=parse1012.eqiad.wmnet [13:04:01] OBVIOUSLY [13:04:06] lol [13:04:09] matthiasmullie: your patch looks massive [13:04:18] I am around too duesen [13:04:36] go ahead [13:04:37] claime: parse1012 is flapping [13:04:41] RhinosF1: ack [13:04:42] has been for a few days [13:05:02] Thanks for the info, there's nothing in sel, I think it might be the network cable [13:05:11] I'll leave it inactive so deployment can proceed [13:05:14] duesen: yeah; most of it is just 1 patch, plus a lot of i18n that go along with it [13:05:29] !log parse1012 pooled inactive for flapping investigation [13:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:39] Y'all can go ahead [13:06:06] duesen: feel free to go first, I can wait [13:06:13] matthiasmullie: even without the i18n it's pretty big for a backport... 
I'm not complaining, just wondering if it might cause trouble if you have more backports and want to revert, etc [13:06:32] matthiasmullie: mine is a config patch, maybe merge yours while I deploy mine? [13:06:33] o/ is someone deploying already? [13:06:54] can I ask why that ImageSuggestions patch is being backported in the first place? [13:07:19] oh wait, my patch has a -1 from akosiaris [13:07:26] duesen: should be pretty safe; the only user-facing thing is Echo config; rest of the changes are a (not currently running) maint script, have another day to revert should that be needed [13:07:32] I'll do frwiki first. give me a couple of minutes to update [13:08:19] (03CR) 10Daniel Kinzler: Parsoid: Disable PC writes on dewiki and frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:08:48] akosiaris: The reason to go for something big right away is that a small wiki will not provide any new information. It won't be visible in the sum total of things [13:08:55] I need something that makes the metrics move [13:09:21] taavi: there's a maint script already running weekly (on Wed) that generates notifications (image suggestions) [13:09:49] by the end of this quarter, we're supposed to have it also send notifications for sections [13:09:53] (03PS4) 10Daniel Kinzler: Parsoid: Disable PC writes on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) [13:10:12] which is either this Wed, or too late :p [13:10:15] akosiaris, effie, claime: ok to go? 
--^ [13:11:07] (03CR) 10Jbond: [V: 03+1 C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:11:45] matthiasmullie: first of all, if deployed as is your patch would cause fatals since the extension.json change to add new hooks would be applied before php sees the new method [13:12:20] and it's a massive patch in general, so at least I don't feel comfortable backporting it, I'd much rather see it go out via the train [13:12:35] (03CR) 10Effie Mouzeli: [C: 03+1] Parsoid: Disable PC writes on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:13:12] effie: thanks, i'll deploy now [13:13:13] Needs a rebase afaict? [13:13:19] (03CR) 10Ottomata: [C: 03+2] eventstreams use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/932165 (https://phabricator.wikimedia.org/T335024) (owner: 10TChin) [13:13:24] (03PS5) 10Daniel Kinzler: Parsoid: Disable PC writes on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) [13:14:04] claime: config patches seem to be always marked as merge conflicts, even if they apply cleanly. 
I suspect something is just bailing because InitializeSettings is huge [13:14:13] Ah fair [13:14:23] (03Merged) 10jenkins-bot: eventstreams use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/932165 (https://phabricator.wikimedia.org/T335024) (owner: 10TChin) [13:14:29] (03CR) 10Clément Goubert: [C: 03+1] Parsoid: Disable PC writes on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:14:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:15:25] gah alright, guess we'll have to wait this one out then [13:15:33] (03CR) 10Klausman: [C: 03+1] ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933080 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:15:40] (03Merged) 10jenkins-bot: Parsoid: Disable PC writes on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:15:50] effie, claime : if frwiki has no impact, can we try dewiki or enwiki in a couple of hours? 
[13:15:53] (03CR) 10Klausman: [C: 03+1] changeprop: update page_change_kind for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/933060 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:15:57] !log daniel@deploy1002 Started scap: Backport for [[gerrit:932175|Parsoid: Disable PC writes on frwiki (T339867)]] [13:16:01] T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867 [13:16:08] duesen: sure [13:16:14] +1 [13:16:18] cool [13:16:20] let's see [13:17:12] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [13:17:23] !log daniel@deploy1002 daniel: Backport for [[gerrit:932175|Parsoid: Disable PC writes on frwiki (T339867)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:17:35] (03PS1) 10Kosta Harlan: ipoid: Use date/time image version name [deployment-charts] - 10https://gerrit.wikimedia.org/r/933096 (https://phabricator.wikimedia.org/T336163) [13:18:15] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@b3751e6]: (no justification provided) [13:18:24] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@b3751e6]: (no justification provided) (duration: 00m 09s) [13:20:00] (03CR) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:22:28] !log sudo cumin 'A:dns-auth' 'disable-puppet "merging CR 932248"' [13:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:44] (03PS2) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [13:23:03] (03CR) 10Ssingh: [V: 03+1 C: 03+2] O:dnsbox: clean-up dnsbox role and dns::recursor [puppet] - 
10https://gerrit.wikimedia.org/r/932248 (owner: 10Ssingh) [13:23:17] effie, claime: job queue wait time has been going up for the past two hours already, any idea what's going on there? [13:23:29] I guess I should have checked that before pushing the patch... [13:23:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:24:59] duesen: anything could have done that, as it is related to traffic etc [13:25:10] effie, claime: looking at the long term trend, long wait times seem to have started end of april. [13:25:18] was fine before that [13:25:20] (ProbeDown) firing: (5) Service ml-cache1002:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:25:33] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [13:25:54] Looks like a sharp rise of insertion rates in codfw around 1130 [13:26:06] Can't see what job caused it yet [13:26:08] (fpm restart running) [13:26:18] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:932175|Parsoid: Disable PC writes on frwiki (T339867)]] (duration: 10m 20s) [13:26:22] T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867 [13:26:40] (03CR) 10Jbond: [V: 03+1 C: 04-1] "sorry for sending comments over two runs i forget that pcc sends any draft comments when it adds the report" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:28:30] duesen: https://grafana.wikimedia.org/goto/zw-0PqXVz?orgId=1 [13:28:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PUT 
deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:28:51] Looks like a rise in the prewarm jobs [13:29:30] !log sudo cumin 'A:dns-auth' 'enable-puppet "merging CR 932248"' [13:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:35] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [13:30:16] Corresponding rise in job processing rate though [13:30:20] (ProbeDown) resolved: (5) Service ml-cache1002:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:23] claime: yes, but not beyond what seems normal looking back a week [13:30:29] duesen: yep [13:30:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:30:46] (03PS14) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 [13:31:00] claime: if this is normal, but it causes the queue to back up, that sounds like we need to add capacity... [13:31:03] (03PS3) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [13:31:30] ok, the patch disabling the parser cache writes from restbase updates has landed. 
[13:31:33] (03CR) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:31:41] duesen: thing is we never go below like 500 idle workers [13:31:50] We're not that saturated on the jobrunner [13:31:51] s [13:32:03] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudservices2005-dev), No backups: 2 (cloudservices2005-dev, ...), Fresh: 130 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:32:04] something must be saturated... [13:32:16] I agree, I just can't find what ~_~ [13:32:24] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42000/console" [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [13:34:22] (03PS6) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [13:34:45] (03CR) 10CI reject: [V: 04-1] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:35:18] claime: explanation for the sharp rise is a template edit. It causes pages to be invalidated, but not rendered on eqiad. If the template is used on a lot of pages that have a decent number of viewers, each of these pages will trigger a job on codfw over the next few hours. 
[13:35:29] (03CR) 10FNegri: cumin: Properly set connect_timeout (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:35:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:36:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:36:27] (03PS7) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [13:36:50] duesen: I figured it was something like that [13:36:50] (03CR) 10CI reject: [V: 04-1] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:37:41] I'm looking at logstash and there's not that many errors for JobExecutor on parsoidCachePrewarm (like 8 in the past 3 hours) [13:37:44] duesen: those are normal operations though, whatever might cause a surge of jobs, while we want to have capacity for this case too [13:38:20] what claime and I are suggesting is that, lets not look at this as a problem until it becomes a problem [13:38:25] claime, effie: i see the processing rate for the prewarming jobs go up. That's unexpected, I'd expect it to go down - it's the same number of jobs, but fewer of them would now be able to exit early because the cached entry is up to date. 
[13:38:37] (03PS17) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [13:38:41] Maybe it's just noise, that curve has a lot of jitter. [13:39:29] let's give it some time, and we can regroup and see where things are, if there are still things in the graphs you can't explain, we revert [13:39:34] (03PS8) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [13:39:36] until we find an explanation [13:39:44] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [13:40:04] yea, i'm not hearing any explosions ;) [13:40:21] what shall we try to add next? dewiki? enwiki? [13:40:30] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [13:40:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:41:01] uh... [13:41:20] effie: I just realized that any effect will be delayed by whatever the jobqueue backlog is. 
[13:41:30] yeah we need to wait a bit [13:41:40] I'll check back in 20 minutes [13:41:43] Job *insertion* rate should not be delayed though [13:41:49] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:41:51] (03PS1) 10Btullis: Bump the version of the datahub image that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/933097 (https://phabricator.wikimedia.org/T329514) [13:44:13] (03PS9) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [13:44:21] (03CR) 10Ayounsi: [C: 03+2] Ignore LAGs from test_port_block_consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/932400 (owner: 10Ayounsi) [13:44:39] 10SRE, 10Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (10Sai10ukazuki) p:05Triage→03High [13:45:19] (03CR) 10Btullis: [C: 03+2] Bump the version of the datahub image that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/933097 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:45:26] (03PS1) 10TChin: eventstreams use latest mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/933098 [13:45:31] (03Merged) 10jenkins-bot: Ignore LAGs from test_port_block_consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/932400 (owner: 10Ayounsi) [13:45:51] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:45:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:46:04] (03PS1) 10Jbond: cassandra: add both fqdn and cassandra to sni [puppet] - 10https://gerrit.wikimedia.org/r/933099 [13:46:06] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [13:46:11] !log 
ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [13:46:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:46:42] (03Merged) 10jenkins-bot: Bump the version of the datahub image that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/933097 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:47:50] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:48:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:48:31] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:50:36] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [13:52:21] (03Abandoned) 10Matthias Mullie: Section-level notifications [extensions/ImageSuggestions] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/933066 (https://phabricator.wikimedia.org/T330931) (owner: 10Matthias Mullie) [13:53:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:56:20] (03PS10) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [13:56:37] (03CR) 10Elukey: [C: 03+2] cassandra: add both fqdn and cassandra to sni [puppet] - 10https://gerrit.wikimedia.org/r/933099 (owner: 10Jbond) [13:58:27] (03CR) 10FNegri: 
"check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:58:41] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:01:03] 10SRE, 10Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (10Aklapper) p:05High→03Triage @Sai10ukazuki: Do you [plan to work on fixing this task](https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities), as you [increased the pr... [14:01:38] claime, effie: I'm not seeign any impact on the jobrunenr cluster. [14:02:22] ...the queue backlog is not looking good though. the enqueue rate is slowly coming down (i assume more and more of the pages that contain the template have been re-parsed now) [14:04:59] PROBLEM - mediawiki-installation DSH group on parse1012 is CRITICAL: Host parse1012 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:05:20] (03CR) 10Elukey: [V: 03+1 C: 03+2] Move esams varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932218 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [14:05:22] (03PS3) 10Jameel Kaisar: Probenet: Restore mapping for Nigeria [dns] - 10https://gerrit.wikimedia.org/r/932468 (https://phabricator.wikimedia.org/T337318) [14:05:25] duesen: we are looking into state of things with claime [14:05:49] (03CR) 10FNegri: cumin: Properly set connect_timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [14:06:04] !log move varnishkafka instances in esams to pki [14:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:53] (03PS3) 10Jameel Kaisar: Update mappings for subregions of CA/US based on the Probenet data [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) [14:07:36] 
(JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:34] (03PS2) 10TChin: eventstreams use latest mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/933098 [14:09:59] (03PS11) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [14:10:23] (03CR) 10CI reject: [V: 04-1] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [14:11:27] (03PS12) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [14:11:51] (03CR) 10CI reject: [V: 04-1] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [14:12:28] (03PS13) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [14:12:58] (03CR) 10Ottomata: [C: 03+2] eventstreams use latest mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/933098 (owner: 10TChin) [14:13:24] (03CR) 10AikoChou: [C: 03+2] ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933080 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:13:49] (03Merged) 10jenkins-bot: eventstreams use latest mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/933098 (owner: 10TChin) [14:13:58] (03PS1) 10JMeybohm: wikikube: Switch to new IPv6 service ip ranges [puppet] - 10https://gerrit.wikimedia.org/r/933100 (https://phabricator.wikimedia.org/T335285) [14:14:22] (03Merged) 
10jenkins-bot: ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933080 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:15:01] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [14:15:25] (03PS1) 10JMeybohm: Revert "Revert "k8s: Configure the IPv6 service ip range for apiserver"" [puppet] - 10https://gerrit.wikimedia.org/r/933101 [14:16:44] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:17:02] !log sudo cumin 'P{C:bird::anycast_healthchecker}' 'disable-puppet "merging CR 922514"' [14:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:18:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [14:18:33] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:bird::anycast_healthchecker: allow binding to multiple services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [14:19:00] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[14:19:45] 10SRE, 10Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (10Sai10ukazuki) >>! In T340380#8963987, @Aklapper wrote: > @Sai10ukazuki: Do you [plan to work on fixing this task](https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities),... [14:20:02] (03CR) 10Eevans: "I don't see how this will work. At least on the multi-instance configuration, $listen_address is different from the main host." [puppet] - 10https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:23:07] !log restart pdns-rec.service on doh6001 to test systemd binding to anycast-hc [14:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:33] (03CR) 10Elukey: [V: 03+1 C: 03+2] cassandra::instance::monitoring: remove wrong servername (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:24:25] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:24:31] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:27:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:28:00] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:28:01] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [14:29:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:30:03] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:30:08] !log rolling out CR 922514 to A:wikidough (-s1 -b30): T336792 [14:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:13] T336792: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792 [14:31:56] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:32:20] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [14:32:28] (03CR) 10Eevans: cassandra::instance::monitoring: remove wrong servername (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:32:44] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:34:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:36:55] (03PS1) 10JMeybohm: envoyproxy: Add type URL to http and listener filters [puppet] - 10https://gerrit.wikimedia.org/r/933112 (https://phabricator.wikimedia.org/T337405) [14:37:22] !log rolling out CR 922514 to A:dns-auth: T336792 [14:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:25] T336792: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792 [14:40:20] !log 
tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:40:33] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [14:40:56] !log rolling out CR 922514 to A:durum: T336792 [14:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:27] (03PS2) 10JMeybohm: modules.mesh.configuration: Copy 1.3.0 to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923303 [14:41:29] (03PS2) 10JMeybohm: mesh.configuration: Add type URL to http and listener filters [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) [14:42:56] (03PS1) 10TChin: eventstreams add schema listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/933114 [14:43:44] (03CR) 10CI reject: [V: 04-1] eventstreams add schema listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/933114 (owner: 10TChin) [14:44:01] RECOVERY - Hadoop NodeManager on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:45:14] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (10elukey) All varnishkafkas on PKI! Remaining steps: * clean up the old certificate from puppet private and puppet CA. [14:46:06] (03CR) 10Hashar: "Oops. I guess the invoked methods are not the proper one or the registered component should be a bit more than just an element. 
I am also " [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/932641 (https://phabricator.wikimedia.org/T340372) (owner: 10Paladox) [14:46:56] !log hashar@deploy1002 Started deploy [gerrit/gerrit@7db3f9b]: Fix up attribution name in wm-app-theme.js plugin [14:47:04] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@7db3f9b]: Fix up attribution name in wm-app-theme.js plugin (duration: 00m 08s) [14:47:07] (03PS2) 10TChin: eventstreams add schema listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/933114 [14:49:09] 10SRE, 10Traffic, 10Patch-For-Review: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792 (10ssingh) 05Open→03Resolved a:03ssingh ` sukhe@doh1001:~$ systemctl show anycast-healthchecker.service | grep -i pdns BindsTo=dnsdist.service pdns-recursor.service Aft... [14:49:16] (03CR) 10Ottomata: [C: 03+2] eventstreams add schema listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/933114 (owner: 10TChin) [14:49:18] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841 (10jbond) now targeted for cas 7.0 [14:49:32] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [14:51:40] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:51:57] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [14:53:03] !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [14:53:36] !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [14:53:50] !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [14:54:09] !log tchin@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/eventstreams-internal: apply [14:55:19] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [14:55:45] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [14:58:41] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10dancy) [15:00:30] !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [15:00:54] !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [15:01:04] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [15:01:37] !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [15:01:45] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudservices2005-dev), No backups: 2 (cloudservices2005-dev, ...), Fresh: 130 jobs Jcrespo T339894 - The acknowledgement expires at: 2023-06-27 15:01:14. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [15:01:57] !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [15:05:41] (03PS1) 10Clément Goubert: changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 100 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933117 (https://phabricator.wikimedia.org/T339867) [15:05:58] (03CR) 10Alexandros Kosiaris: Parsoid: Disable PC writes on frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [15:06:28] duesen: I 've commented on the patch, thanks for splitting it in 2. Sorry for not answering sooner. [15:06:33] How does it look for frwiki ? 
[15:08:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 100 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933117 (https://phabricator.wikimedia.org/T339867) (owner: 10Clément Goubert) [15:08:45] effie: You can deploy ^ [15:08:59] cool [15:10:06] akosiaris: no visible impact whatsoever. [15:10:16] cool [15:10:19] But there is an unrelated problem with parsoidCachePrewarmJob that started about 11:20 utc. [15:10:28] a template, right ? [15:10:30] The jobqueue backlog is >45min now [15:10:43] A template is my guess, yes. [15:11:05] We need to be able to cope with template edits without causing this kind of backlog in the queue... [15:11:34] Apparently it's unclear why we aren't processing enough, as the jobrunners have plenty free capacity [15:11:40] !log re-enable puppet on P{C:bird::anycast_healthchecker} and finish rolling out CR 922514 [15:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:01] something is throttling the job throughput, but I know too little about how changeprop-jobqueue works [15:12:21] s/too little/nothing/ [15:12:31] (03Abandoned) 10Ssingh: sre.hosts.reboot-cluster: fix-ups for Traffic/SRE usage [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh) [15:13:17] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:23] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:15:18] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy amd rocm package for llm server [deployment-charts] - 10https://gerrit.wikimedia.org/r/933119 (https://phabricator.wikimedia.org/T334583) [15:18:17] (03PS2) 10Effie Mouzeli: changeprop-jobqueue: Bump the concurrency for parsoidCachePrewarm to 100 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/933117 (https://phabricator.wikimedia.org/T339867) (owner: 10Clément Goubert) [15:18:20] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841 (10MoritzMuehlenhoff) [15:19:55] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10User-jbond: Validate user lockout - https://phabricator.wikimedia.org/T233946 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This has been implemented a while ago the sre.idm.logout cookbook. I runs various logout scripts (e.g. one whic... [15:19:59] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10Security-Team, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10MoritzMuehlenhoff) [15:21:33] (03CR) 10Effie Mouzeli: [C: 03+2] changeprop-jobqueue: Bump the concurrency for parsoidCachePrewarm to 100 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933117 (https://phabricator.wikimedia.org/T339867) (owner: 10Clément Goubert) [15:22:07] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: ldap: enable mirror mode [puppet] - 10https://gerrit.wikimedia.org/r/933120 [15:22:44] (03Merged) 10jenkins-bot: changeprop-jobqueue: Bump the concurrency for parsoidCachePrewarm to 100 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933117 (https://phabricator.wikimedia.org/T339867) (owner: 10Clément Goubert) [15:23:52] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: ldap: drop overrided hiera key [puppet] - 10https://gerrit.wikimedia.org/r/933121 [15:24:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: ldap: enable mirror mode [puppet] - 10https://gerrit.wikimedia.org/r/933120 (owner: 10Arturo Borrero Gonzalez) [15:24:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: ldap: drop overrided hiera key [puppet] - 10https://gerrit.wikimedia.org/r/933121 (owner: 10Arturo Borrero Gonzalez) [15:25:49] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply 
[15:26:26] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10User-jbond: Document IDP MFA policy and processes - https://phabricator.wikimedia.org/T284725 (10MoritzMuehlenhoff) [15:26:29] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:26:39] !log upgrade dns5003 to gdnsd 3.99.0~alpha2 [15:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:18] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:27:27] (03PS4) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [15:28:04] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:30:04] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1530). Please do the needful. [15:34:24] (03CR) 10JMeybohm: "PCC fails for deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud and vrts-1002.devtools.eqiad1.wikimedia.cloud (but those fail for p" [puppet] - 10https://gerrit.wikimedia.org/r/933112 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [15:34:26] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [15:34:36] 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10aborrero) 05In progress→03Resolved [15:35:51] (03CR) 10RLazarus: [C: 03+1] envoyproxy: Add type URL to http and listener filters [puppet] - 10https://gerrit.wikimedia.org/r/933112 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [15:40:36] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy amd rocm package 
for llm server [deployment-charts] - 10https://gerrit.wikimedia.org/r/933119 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [15:41:32] (03Merged) 10jenkins-bot: ml-services: deploy amd rocm package for llm server [deployment-charts] - 10https://gerrit.wikimedia.org/r/933119 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [15:41:52] !log installing Java 8 security updates on stat* hosts [15:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:14] (03CR) 10RLazarus: "Both envoy.filters.{http.router,listener.tls_inspector} also show up under charts/*/templates/vendor/mesh for a lot of different charts --" [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [15:45:23] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:47:33] (03PS1) 10KartikMistry: Enable Content and Section Translation for 4 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933125 (https://phabricator.wikimedia.org/T338123) [15:48:05] (03CR) 10Effie Mouzeli: [C: 03+1] "It looks ok, but take this with a grain of salt, or sugar" [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [15:50:10] (03CR) 10JMeybohm: mesh.configuration: Add type URL to http and listener filters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [15:50:36] (ProbeDown) firing: Service releases1002:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:39] !log isaranto@deploy1002 
helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:52:43] (03CR) 10RLazarus: [C: 03+1] mesh.configuration: Add type URL to http and listener filters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [15:53:05] (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Add type URL to http and listener filters [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [15:53:11] (03CR) 10JMeybohm: [C: 03+2] modules.mesh.configuration: Copy 1.3.0 to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923303 (owner: 10JMeybohm) [15:53:59] (03Merged) 10jenkins-bot: modules.mesh.configuration: Copy 1.3.0 to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923303 (owner: 10JMeybohm) [15:54:10] (03Merged) 10jenkins-bot: mesh.configuration: Add type URL to http and listener filters [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [15:54:41] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:07] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [15:55:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/932358 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris) [15:56:13] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_jenkins.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:43] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:09] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [15:58:18] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Jdforrester-WMF) [15:58:50] (03PS2) 10Alexandros Kosiaris: helmfile.d: Add wikifunctions stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/932357 (https://phabricator.wikimedia.org/T340041) [15:58:52] (03CR) 10Alexandros Kosiaris: helmfile.d: Add wikifunctions stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/932357 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris) [15:59:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks for the comments and +1s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/932357 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris) [15:59:40] claime: i see that job processing rate nearly doubled half an hour ago. I am curious what made this happen. Do you know? 
[15:59:41] 10SRE, 10Article-Recommendation: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10MatthewVernon) [15:59:59] duesen: yep, we raised concurrency in cp-jobqueue to 100 for this job [16:00:12] 10SRE, 10SRE-swift-storage, 10Analytics-Radar, 10Data-Engineering-Icebox, 10Recommendation-API: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Done (though we might want to think about refactori... [16:00:26] duesen: the concurrency graph is misleading because it's a max of averages, we were hitting the concurrency cap [16:01:49] (03Merged) 10jenkins-bot: helmfile.d: Add wikifunctions stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/932357 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris) [16:03:33] RECOVERY - jenkins_service_running on releases1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [16:08:07] PROBLEM - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [16:11:25] (03PS1) 10EoghanGaffney: releases: Move jenkins ensure lines from old to new primary [puppet] - 10https://gerrit.wikimedia.org/r/933130 [16:12:58] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42005/console" [puppet] - 10https://gerrit.wikimedia.org/r/933130 (owner: 10EoghanGaffney) [16:14:02] (03CR) 10Andrew Bogott: [C: 03+1] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [16:18:27] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 
'apply'. [16:19:54] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:21:01] !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:21:43] !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:22:17] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:22:52] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:23:39] (03PS2) 10Hashar: ci/zuul: switch gearman server from contint2001 to contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [16:23:42] (03PS3) 10Hashar: ci: make contint2002 the new rsync source, remove contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/867712 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [16:24:35] (03CR) 10Hashar: "Rebased to clear conflict." [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [16:27:44] (03CR) 10FNegri: cumin: Properly set connect_timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [16:31:52] (03PS3) 10Jforrester: deployment_server: Add stanzas for wikifunctions k8s [puppet] - 10https://gerrit.wikimedia.org/r/932358 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris) [16:37:14] (03Abandoned) 10Elukey: cassandra::instance::monitoring: move alerts to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/932427 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [16:37:24] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10akosiaris) [16:41:00] (03CR) 10Dzahn: [C: 03+1] releases: Move jenkins ensure lines from old to new 
primary [puppet] - 10https://gerrit.wikimedia.org/r/933130 (owner: 10EoghanGaffney) [16:41:09] (03PS1) 10Elukey: cassandra::instance::monitoring: move cql check to Prometheus for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) [16:41:50] (03CR) 10Effie Mouzeli: [C: 03+1] ipoid: Use date/time image version name [deployment-charts] - 10https://gerrit.wikimedia.org/r/933096 (https://phabricator.wikimedia.org/T336163) (owner: 10Kosta Harlan) [16:42:12] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] releases: Move jenkins ensure lines from old to new primary [puppet] - 10https://gerrit.wikimedia.org/r/933130 (owner: 10EoghanGaffney) [16:42:25] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42006/console" [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [16:42:57] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10akosiaris) [16:44:04] (03PS4) 10Alexandros Kosiaris: service::catalog: Deduplicate search service IPs [puppet] - 10https://gerrit.wikimedia.org/r/930175 [16:50:26] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [16:50:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add support for knams as PoP in tooling and automation - https://phabricator.wikimedia.org/T340465 (10Volans) p:05Triage→03Medium [16:50:36] (ProbeDown) resolved: Service releases1002:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#releases1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:52] 10SRE-Sprint-Week-Sustainability-March2023, 10Beta-Cluster-Infrastructure, 10DBA, 10MediaWiki-libs-Rdbms, and 2 others: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (10Reedy) [16:52:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks Ryan!" [puppet] - 10https://gerrit.wikimedia.org/r/930175 (owner: 10Alexandros Kosiaris) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1700) [17:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1700). [17:08:44] (03PS1) 10Elukey: cassandra::instance: use the instance's fqdn as TLS cert's CN for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933139 [17:10:53] (03Abandoned) 10Elukey: cassandra::instance: use the instance's fqdn as TLS cert's CN for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933139 (owner: 10Elukey) [17:11:03] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:43] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:56] (03PS1) 10Gmodena: page_content_change: version bump docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933143 (https://phabricator.wikimedia.org/T338380) [17:35:03] (03CR) 10Ebernhardson: [C: 03+1] "looks ready to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833861 
(https://phabricator.wikimedia.org/T318270) (owner: 10Ryan Kemper) [17:36:33] (03CR) 10Gmodena: "This patch is not yet ready to be merged. It depends on https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/m" [deployment-charts] - 10https://gerrit.wikimedia.org/r/933143 (https://phabricator.wikimedia.org/T338380) (owner: 10Gmodena) [17:48:09] (03PS1) 10Ottomata: eventgate - enable use of remote schema repos for main and logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/933166 (https://phabricator.wikimedia.org/T340166) [17:50:54] (03CR) 10Ottomata: [C: 03+2] eventgate - enable use of remote schema repos for main and logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/933166 (https://phabricator.wikimedia.org/T340166) (owner: 10Ottomata) [17:51:46] (03Merged) 10jenkins-bot: eventgate - enable use of remote schema repos for main and logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/933166 (https://phabricator.wikimedia.org/T340166) (owner: 10Ottomata) [17:53:29] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [17:53:58] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [18:02:54] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [18:03:06] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@32b4b99]: update dags to use discolytics 0.15.0 [18:03:24] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@32b4b99]: update dags to use discolytics 0.15.0 (duration: 00m 17s) [18:03:52] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [18:04:13] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [18:04:50] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [18:05:10] !log 
otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [18:05:38] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [18:05:59] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [18:06:41] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [18:07:00] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [18:07:42] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [18:08:53] (03CR) 10Ottomata: [C: 03+2] refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - 10https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236) (owner: 10Ottomata) [18:16:13] (03PS1) 10Ryan Kemper: [WIP] Dashboard for query service update lag [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) [18:17:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:17:58] 10SRE, 10Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (10ssingh) [18:18:06] 10SRE, 10Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (10ssingh) p:05Triage→03High [18:21:56] (03PS2) 10Ryan Kemper: [WIP] Dashboard for query service update lag [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) [18:25:46] 10SRE, 10Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating 
generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (10ssingh) [18:26:42] 10SRE, 10Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (10ssingh) [18:26:46] 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh) [18:33:07] !log depooling sessionstore/codfw for bullseye upgrades — T340043 [18:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:12] T340043: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 [18:33:22] !log eevans@cumin2002 START - Cookbook sre.discovery.service-route depool sessionstore in codfw: maintenance [18:38:26] !log eevans@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in codfw: maintenance [18:42:48] !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2002.codfw.wmnet with OS bullseye [18:42:56] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2002.codfw.wmnet with OS bullseye [18:52:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (10nshahquinn-wmf) a:05MatthewVernon→03MoritzMuehlenhoff Everything is now migrated to the new account. It's safe to remove access from the o... 
[18:57:57] !log eevans@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2002.codfw.wmnet with reason: host reimage [19:02:11] !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2002.codfw.wmnet with reason: host reimage [19:05:24] (03PS1) 10Kosta Harlan: gitlab runner: Allow mariadb:* images for allowed_docker_services [puppet] - 10https://gerrit.wikimedia.org/r/933175 (https://phabricator.wikimedia.org/T339352) [19:06:10] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab runner: Allow mariadb:* images for allowed_docker_services [puppet] - 10https://gerrit.wikimedia.org/r/933175 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan) [19:12:41] 10SRE, 10Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (10Arnoldokoth) ` aokoth@lists1001:~$ sudo mailman-wrapper create --owner kazuki-s@wikiusers.jp wikija-g@lists.wikimedia.org Created mailing list: wikija-g@lists.wikimedia.org ` [19:13:02] 10SRE, 10Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (10Arnoldokoth) 05Open→03In progress p:05Triage→03Medium [19:18:32] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] Change type of 'age-factor-decay' from non-existing float to wild [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/929744 (https://phabricator.wikimedia.org/T338970) (owner: 10Aklapper) [19:24:16] !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2002.codfw.wmnet with OS bullseye [19:24:22] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2002.codfw.wmnet with OS bullseye completed: - sessionstore2002... 
[19:31:47] (03CR) 10Brennen Bearnes: [C: 03+1] gitlab runner: Allow mariadb:* images for allowed_docker_services [puppet] - 10https://gerrit.wikimedia.org/r/933175 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan) [19:38:36] (03PS1) 10Reedy: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933151 (https://phabricator.wikimedia.org/T340483) [19:38:45] (03PS1) 10Majavah: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933152 (https://phabricator.wikimedia.org/T340483) [19:38:47] (03PS2) 10Reedy: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933151 (https://phabricator.wikimedia.org/T340483) [19:39:17] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:39:28] (03Abandoned) 10Majavah: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933152 (https://phabricator.wikimedia.org/T340483) (owner: 10Majavah) [19:39:37] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:47:04] 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10Andrew) 05Resolved→03Open [19:47:07] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10Andrew) [19:47:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - 
10https://gerrit.wikimedia.org/r/933151 (https://phabricator.wikimedia.org/T340483) (owner: 10Reedy) [19:48:37] !log revert "Redirect www.mediawiki.org to mw-on-k8s", debugging T340483 [19:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:41] T340483: ExtensionDistributor is broken - https://phabricator.wikimedia.org/T340483 [19:49:17] !log force puppet run on cp hosts T340483 [19:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:42] (03PS1) 10Reedy: CommonSettings.php: Set a proxy for $wgExtDistAPIConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933179 (https://phabricator.wikimedia.org/T340483) [19:52:07] 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: codfw1dev: OpenStack services can only sort of talk to memacached on cloudcontrols - https://phabricator.wikimedia.org/T340488 (10Andrew) [19:57:48] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10Eevans) [20:00:07] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T2000). [20:00:07] No Gerrit patches in the queue for this window AFAICS. [20:00:57] !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye [20:01:02] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye [20:02:20] (03CR) 10Reedy: "Caused T340483." 
[puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [20:03:48] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Reedy) >>! In T337490#8963478, @gerritbot wrote: > Change 923385 **merged** by Clément Goubert: > %%%[operations/puppet@production] mw-on-k8s: Redirect www.mediawiki.org to... [20:07:54] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Reedy) [20:10:11] jouncebot nowandnext [20:10:11] For the next 0 hour(s) and 49 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T2000) [20:10:11] In 0 hour(s) and 49 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T2100) [20:10:47] mutante, andre: i have that phab deploy prepped, now would probably be a reasonable time to push it out, i think [20:12:23] I'm in :D [20:12:36] (not that I had to do anything anyway, ahem) [20:13:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Dwisehaupt) @Jclark-ctr Just wanted to follow up and see if this has been checked yet. Thanks! [20:13:44] brennen: there's also a good bunch more Phab patches awaiting but I guess you have more important things to do :) [20:14:17] i grabbed a couple of the extremely low-stakes ones [20:14:39] others looked like i should probably do a bit more actual testing. [20:16:00] !log eevans@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2001.codfw.wmnet with reason: host reimage [20:16:37] brennen, up to your judgement :) https://phabricator.wikimedia.org/maniphest/query/MtNPMfa5ac0C/#R would be my list [20:17:39] anyway. 
Happy to get that non-public issue deployed <3 [20:18:43] !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2001.codfw.wmnet with reason: host reimage [20:20:04] (03PS1) 10JHathaway: admin: ensure dates are quoted [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) [20:21:07] (03PS3) 10Ryan Kemper: [WIP] Dashboard for query service update lag [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) [20:21:20] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [20:27:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on phab1004.eqiad.wmnet with reason: first setup [20:27:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1004.eqiad.wmnet with reason: first setup [20:27:55] !log deploying minor phabricator updates shortly [20:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on phab1004.eqiad.wmnet with reason: patch application [20:28:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1004.eqiad.wmnet with reason: patch application [20:29:17] !log brennen@deploy1002 Started deploy [phabricator/deployment@a25a737]: deploy latest state to phab2002 [20:29:30] doing phab2002 then 1004 [20:29:55] andre: will round up the rest of the small stuff later this week [20:29:55] !log brennen@deploy1002 Finished deploy [phabricator/deployment@a25a737]: deploy latest state to phab2002 (duration: 00m 38s) [20:30:06] brennen, thanks [20:30:13] !log brennen@deploy1002 Started deploy [phabricator/deployment@a25a737]: deploy latest state to phab1004 [20:30:20] !log dzahn@cumin1001 START - Cookbook 
sre.hosts.downtime for 0:15:00 on phab2002.codfw.wmnet with reason: patch application
[20:30:24] downtimed phab2002
[20:30:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab2002.codfw.wmnet with reason: patch application
[20:30:47] !log brennen@deploy1002 Finished deploy [phabricator/deployment@a25a737]: deploy latest state to phab1004 (duration: 00m 34s)
[20:30:49] ah, thx - i don't think anything would normally trigger there anyway, but you never know
[20:33:46] !log brennen@deploy1002 Started deploy [phabricator/deployment@0529926]: deploy latest state to phab1004
[20:33:57] grr, reverting here.
[20:34:17] !log brennen@deploy1002 Finished deploy [phabricator/deployment@0529926]: deploy latest state to phab1004 (duration: 00m 31s)
[20:35:32] (PS2) JHathaway: stdlib: upgrade to v8.6.2 [puppet] - https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972)
[20:39:23] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[20:39:50] (PS4) Ryan Kemper: [WIP] Dashboard for query service update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811)
[20:40:05] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (Jhancock.wm)
[20:40:12] SRE, ops-codfw, DC-Ops: sessionstore2001.codfw.wmnet unable to PXE boot - https://phabricator.wikimedia.org/T340055 (Jhancock.wm) Open→Resolved a:Jhancock.wm replaced with a different brand optic (Wave2Wave 77J-S010-T) and now the scripts run without downing the port on the switch.
[20:42:31] !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2001.codfw.wmnet with OS bullseye
[20:42:37] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye completed: - sessionstore2001...
[20:45:07] !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2003.codfw.wmnet with OS bullseye
[20:45:14] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2003.codfw.wmnet with OS bullseye
[20:47:23] (PS1) Daniel Kinzler: Parsoid: Disable PC writes on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867)
[20:47:30] (CR) CI reject: [V: -1] Parsoid: Disable PC writes on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867) (owner: Daniel Kinzler)
[20:48:45] (PS2) Daniel Kinzler: Parsoid: Disable PC writes on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867)
[20:49:00] (PS5) Ryan Kemper: Dashboard for wdqs update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811)
[20:50:37] (CR) Ryan Kemper: "See the following preview dashboard for what the result looks like:" [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) (owner: Ryan Kemper)
[20:51:43] (CR) Bking: [C: +1] sre.wdqs.data-transfer: fix broken logic [cookbooks] - https://gerrit.wikimedia.org/r/932324
(https://phabricator.wikimedia.org/T321605) (owner: Ryan Kemper)
[20:53:59] (PS2) JHathaway: site.pp: Drop wmnet domain and always use regexes [puppet] - https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972)
[20:54:21] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[20:55:25] !log eevans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2003.codfw.wmnet with OS bullseye
[20:55:31] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2003.codfw.wmnet with OS bullseye executed with errors: - sessi...
[21:00:04] Reedy, sbassett, Maryum, and manfredi: That opportune time is upon us again. Time for a Weekly Security deployment window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T2100).
[21:02:16] !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2003.codfw.wmnet with OS bullseye
[21:02:22] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2003.codfw.wmnet with OS bullseye
[21:06:45] (CR) Ryan Kemper: [C: +2] sre.wdqs.data-transfer: fix broken logic [cookbooks] - https://gerrit.wikimedia.org/r/932324 (https://phabricator.wikimedia.org/T321605) (owner: Ryan Kemper)
[21:07:22] SRE, SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (Ottomata) Approved.
[21:09:28] SRE, SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (cjming)
[21:10:09] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (Eevans)
[21:13:07] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:13:43] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2021.*
[21:13:48] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2022.*
[21:15:09] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[21:18:46] !log eevans@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: host reimage
[21:21:40] (PS1) Btullis: Specify the schema registry type for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933187 (https://phabricator.wikimedia.org/T329514)
[21:21:57] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[21:22:13] !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: host reimage
[21:22:28] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[21:23:50] (CR) Btullis: [C: +2] Specify the schema registry type for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933187 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[21:24:56] (Merged) jenkins-bot: Specify the schema registry type for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933187 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[21:25:14] (CR) BCornwall: [C: +2] pybal: Fix hostnames not being sent on alert [puppet] - https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: BCornwall)
[21:26:57] !log btullis@deploy1002 helmfile [staging] START
helmfile.d/services/datahub: apply on main
[21:27:47] (PS22) BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531)
[21:27:52] (CR) BCornwall: Create cookbook to upgrade Apache Traffic Server (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: BCornwall)
[21:36:22] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[21:36:54] (CR) BCornwall: [C: -1] "Looks like profile::tlsproxy::envoy::cfssl_label needs to be defined. Should it be set to "discovery"?" [puppet] - https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: Jbond)
[21:39:10] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[21:39:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:40:42] (SystemdUnitFailed) resolved: (4) wcqs-updater.service Failed on wcqs1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:42:00] (PS1) Btullis: Enable the service mesh for the top-level datahub deployment [deployment-charts] - https://gerrit.wikimedia.org/r/933188 (https://phabricator.wikimedia.org/T329514)
[21:43:10] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[21:43:32] (CR) Btullis: [C: +2] Enable the service mesh for the top-level datahub deployment [deployment-charts] - https://gerrit.wikimedia.org/r/933188
(https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[21:44:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:44:35] !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2003.codfw.wmnet with OS bullseye
[21:44:36] (Merged) jenkins-bot: Enable the service mesh for the top-level datahub deployment [deployment-charts] - https://gerrit.wikimedia.org/r/933188 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[21:44:42] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2003.codfw.wmnet with OS bullseye completed: - sessionstore2003...
[21:45:34] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[21:50:56] (CR) Dzahn: [C: +2] releases-jenkins: replace Apache 2.2 with 2.4 syntax for access control [puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[21:53:16] !log pooling sessionstore/codfw for bullseye upgrades — T340043
[21:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:20] T340043: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043
[21:53:47] !log eevans@cumin2002 START - Cookbook sre.discovery.service-route pool sessionstore in codfw: maintenance
[21:54:53] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[21:55:10] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[21:57:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:57:34] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[21:58:51] !log eevans@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool sessionstore in codfw: maintenance
[21:59:22] (PS1) Btullis: Permit datahub batch jobs to contact the GMS service [deployment-charts] - https://gerrit.wikimedia.org/r/933190 (https://phabricator.wikimedia.org/T329514)
[22:00:24] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (Eevans)
[22:01:30] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (Eevans) p:Triage→Medium
[22:01:41] (CR) Btullis: [C: +2] Permit datahub batch jobs to contact the GMS service [deployment-charts] - https://gerrit.wikimedia.org/r/933190 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:02:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:02:52] (Merged) jenkins-bot: Permit datahub batch jobs to contact the GMS service [deployment-charts] - https://gerrit.wikimedia.org/r/933190 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:05:00] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[22:07:26] (CR) Dzahn: [C: +2] "So.. this does not break it.. but also I don't get blocked if I set my user agent manually to one of the blocked ones. But also..
this see" [puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[22:11:50] (PS1) Ahmon Dancy: Add 'tag' argument to git::clone [puppet] - https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900)
[22:12:13] (CR) CI reject: [V: -1] Add 'tag' argument to git::clone [puppet] - https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900) (owner: Ahmon Dancy)
[22:16:13] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[22:17:04] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[22:17:10] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[22:17:14] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.restart (exit_code=97)
[22:18:52] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[22:22:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:24:18] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[22:27:23] (PS1) Btullis: Bump datahub top-level chart version [deployment-charts] - https://gerrit.wikimedia.org/r/933195 (https://phabricator.wikimedia.org/T329514)
[22:29:19] (CR) Btullis: [C: +2] Bump datahub top-level chart version [deployment-charts] - https://gerrit.wikimedia.org/r/933195 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:30:29] (Merged) jenkins-bot: Bump datahub top-level chart version [deployment-charts] - https://gerrit.wikimedia.org/r/933195 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:31:13] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[22:31:36] (PS1)
Dzahn: switch contint.wikimedia.org from contint2001 to contint2002 [dns] - https://gerrit.wikimedia.org/r/933196 (https://phabricator.wikimedia.org/T324659)
[22:33:46] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[22:46:15] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[22:46:33] (CR) Dzahn: [C: +2] graphite: replace Apache 2.2 access control syntax [puppet] - https://gerrit.wikimedia.org/r/932445 (https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[22:46:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:48:48] (PS1) Btullis: Revert changes to the GMS networkpolicy in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933197 (https://phabricator.wikimedia.org/T329514)
[22:49:11] (CR) Dzahn: [C: +2] "wtf, nothing changes whatsoever on the machine called "primary graphite host".
makes no sense" [puppet] - https://gerrit.wikimedia.org/r/932445 (https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[22:51:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:51:35] (CR) Btullis: [C: +2] Revert changes to the GMS networkpolicy in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933197 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:51:51] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[22:51:54] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2022 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:52:45] (Merged) jenkins-bot: Revert changes to the GMS networkpolicy in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933197 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:53:24] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2022 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:53:26] Hey all - I’d like to deploy a quick update for T336027 to PrivateSettings.php during the last few mins of the weekly security window here. Let me know if I shouldn't.
[22:55:23] SRE, ops-eqiad, Data-Platform-SRE: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (Jclark-ctr) @BTullis would like to take care of tomorrow when would be a good time with you to do this?
[22:55:23] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[22:58:59] SRE, ops-eqiad, DC-Ops, serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (Jclark-ctr) Performed stresstest on cpu for additional 24 hours with no errors restarting 3rd time
[23:01:12] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[23:01:33] (PS1) Dwisehaupt: Remove hosts to be decommissioned. [puppet] - https://gerrit.wikimedia.org/r/933198 (https://phabricator.wikimedia.org/T340155)
[23:01:36] (PS1) Dwisehaupt: Add frmon1002 to monitoring [puppet] - https://gerrit.wikimedia.org/r/933199 (https://phabricator.wikimedia.org/T319460)
[23:02:39] !log Deployed updated mitigation for T336027
[23:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:46] (CR) Dwisehaupt: "for when we are ready to decom the hosts."
[puppet] - https://gerrit.wikimedia.org/r/933198 (https://phabricator.wikimedia.org/T340155) (owner: Dwisehaupt)
[23:07:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:07:37] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[23:12:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:13:16] SRE, ops-eqiad, DC-Ops, serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Jclark-ctr) Open→Resolved updated docs
[23:21:23] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1092.eqiad.wmnet with reason: Replacing RAID controller battery
[23:21:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1092.eqiad.wmnet with reason: Replacing RAID controller battery
[23:21:53] SRE, ops-eqiad, Data-Platform-SRE: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=36858c2c-bae0-4a63-9ac9-19916c27613e) set by btullis@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their se...
[23:23:52] SRE, ops-eqiad, Data-Platform-SRE: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (BTullis) Hi @Jclark-ctr - Many thanks. I've shut down the machine ready for you, so you can replace it whenever is convenient. Feel free to boot the host again when finishe...
[23:29:41] SRE, SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (BTullis) a:BTullis
[23:30:51] SRE, SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (BTullis)
[23:35:55] SRE, SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (BTullis) I've created the principal. @cjming - please would you check your email **spam folder** because your welcome email and initial kerberos setup instructions are almost certainly in ther...
[23:39:25] (PS1) Btullis: Record that fact that cjming is now kerberos enabled [puppet] - https://gerrit.wikimedia.org/r/933202 (https://phabricator.wikimedia.org/T340491)
[23:39:48] (PS2) Btullis: Record the fact that cjming is now kerberos enabled [puppet] - https://gerrit.wikimedia.org/r/933202 (https://phabricator.wikimedia.org/T340491)