[00:39:17] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/931925
[00:39:19] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/931925 (owner: TrainBranchBot)
[01:01:57] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/931925 (owner: TrainBranchBot)
[01:05:47] <wikibugs>	 (CR) Anzx: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/932284 (https://phabricator.wikimedia.org/T340276) (owner: Anzx)
[01:12:41] <wikibugs>	 (PS3) Anzx: Rename namespace on extwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/932272 (https://phabricator.wikimedia.org/T337696)
[01:16:38] <wikibugs>	 (PS4) Anzx: Change dewiki import sources [mediawiki-config] - https://gerrit.wikimedia.org/r/932283 (https://phabricator.wikimedia.org/T340264)
[01:43:29] <icinga-wm>	 PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100%
[01:44:25] <icinga-wm>	 RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[02:07:35] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:20] <jinxer-wm>	 (ProbeDown) firing: (6) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:27:35] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:35] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:42:17] <icinga-wm>	 PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100%
[04:44:16] <wikibugs>	 (CR) TChin: eventstreams use kafka egress and service mesh (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/932165 (https://phabricator.wikimedia.org/T335024) (owner: TChin)
[04:44:35] <icinga-wm>	 RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[05:13:03] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-06-26-050753-production [deployment-charts] - https://gerrit.wikimedia.org/r/932683 (https://phabricator.wikimedia.org/T340236)
[05:20:41] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:07:46] * kart_ updating cxserver
[06:08:18] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-06-26-050753-production [deployment-charts] - https://gerrit.wikimedia.org/r/932683 (https://phabricator.wikimedia.org/T340236) (owner: KartikMistry)
[06:09:17] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-06-26-050753-production [deployment-charts] - https://gerrit.wikimedia.org/r/932683 (https://phabricator.wikimedia.org/T340236) (owner: KartikMistry)
[06:10:48] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:11:08] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:14:57] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:15:33] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:19:42] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db1118 from dbctl [puppet] - https://gerrit.wikimedia.org/r/932685 (https://phabricator.wikimedia.org/T326683)
[06:20:12] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db1118 from dbctl [puppet] - https://gerrit.wikimedia.org/r/932685 (https://phabricator.wikimedia.org/T326683) (owner: Marostegui)
[06:20:25] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:20:34] <jinxer-wm>	 (ProbeDown) firing: (6) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:20:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1118 from dbctl T326683', diff saved to https://phabricator.wikimedia.org/P49477 and previous config saved to /var/cache/conftool/dbconfig/20230626-062036-marostegui.json
[06:20:41] <stashbot>	 T326683: Decommission db1106-db1125 - https://phabricator.wikimedia.org/T326683
[06:20:48] <marostegui>	 dbproxy alerts are to be expected
[06:21:25] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:25:10] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1118 to m3 [puppet] - https://gerrit.wikimedia.org/r/932686 (https://phabricator.wikimedia.org/T335092)
[06:26:19] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:26:28] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1118 to m3 [puppet] - https://gerrit.wikimedia.org/r/932686 (https://phabricator.wikimedia.org/T335092) (owner: Marostegui)
[06:26:36] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:26:43] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:26:55] <wikibugs>	 (CR) Stevemunene: [C: +2] analytics: Decommission analytics106[4-6] from hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/930582 (https://phabricator.wikimedia.org/T317861) (owner: Stevemunene)
[06:27:14] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:27:22] <wikibugs>	 (PS1) Muehlenhoff: Extend access for sannita [puppet] - https://gerrit.wikimedia.org/r/932687
[06:27:27] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:27:31] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:27:39] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:27:41] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:28:05] <kart_>	 !log Updated cxserver to 2023-06-26-050753-production (T340236, T339896)
[06:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:10] <stashbot>	 T339896: Enable MinT for all languages supported by IndicTrans2 - https://phabricator.wikimedia.org/T339896
[06:28:11] <stashbot>	 T340236: MinT translates to English when Hindi-Santali or any other language-Santali is selected - https://phabricator.wikimedia.org/T340236
[06:30:24] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend access for sannita [puppet] - https://gerrit.wikimedia.org/r/932687 (owner: Muehlenhoff)
[06:30:45] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:30:47] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:30:57] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:31:23] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:32:05] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:32:09] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:32:35] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:34:05] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:07] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:35:40] <wikibugs>	 (CR) Elukey: [V: +1] profile::cassandra: allow Prometheus nodes to check ports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932663 (owner: Elukey)
[06:35:56] <wikibugs>	 (Abandoned) Elukey: profile::cassandra: allow Prometheus nodes to check ports [puppet] - https://gerrit.wikimedia.org/r/932663 (owner: Elukey)
[06:36:04] <wikibugs>	 (PS4) Elukey: cassandra::instance::monitoring: move alerts to prometheus [puppet] - https://gerrit.wikimedia.org/r/932427 (https://phabricator.wikimedia.org/T288470)
[06:53:45] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:54:17] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:56:06] <wikibugs>	 (PS1) Marostegui: db1118: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/932690
[06:56:32] <wikibugs>	 (CR) Marostegui: [C: +2] db1118: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/932690 (owner: Marostegui)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T0700).
[07:00:05] <jouncebot>	 aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:14] <taavi>	 o/
[07:01:19] <taavi>	 aanzx: ping
[07:01:20] <aanzx>	 o/
[07:01:39] <wikibugs>	 (CR) Majavah: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/932283 (https://phabricator.wikimedia.org/T340264) (owner: Anzx)
[07:02:19] <taavi>	 aanzx: your last patch seems empty
[07:03:19] <wikibugs>	 (PS4) Majavah: Rename namespace on extwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/932272 (https://phabricator.wikimedia.org/T337696) (owner: Anzx)
[07:03:23] <wikibugs>	 (CR) Majavah: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/932272 (https://phabricator.wikimedia.org/T337696) (owner: Anzx)
[07:04:14] <aanzx>	 taavi: https://gerrit.wikimedia.org/r/c/932284 this one?
[07:04:22] <taavi>	 yes
[07:05:33] <taavi>	 aanzx: I'll deploy the first two, we can look at the last one after that
[07:05:52] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/932283 (https://phabricator.wikimedia.org/T340264) (owner: Anzx)
[07:05:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/932272 (https://phabricator.wikimedia.org/T337696) (owner: Anzx)
[07:05:57] <taavi>	 do you have the x-wikimedia-debug browser extension installed?
[07:06:02] <aanzx>	 taavi: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/929742/2/dblists/mobile-anon-talk.dblist says file had to be auto generated 
[07:06:06] <aanzx>	 Yes
[07:07:05] <wikibugs>	 (Merged) jenkins-bot: Change dewiki import sources [mediawiki-config] - https://gerrit.wikimedia.org/r/932283 (https://phabricator.wikimedia.org/T340264) (owner: Anzx)
[07:07:09] <wikibugs>	 (Merged) jenkins-bot: Rename namespace on extwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/932272 (https://phabricator.wikimedia.org/T337696) (owner: Anzx)
[07:07:39] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:932283|Change dewiki import sources (T340264)]], [[gerrit:932272|Rename namespace on extwiki (T337696)]]
[07:07:44] <stashbot>	 T337696: In ext.wiki, change namespace Güiquipeya to Güiquipedia - https://phabricator.wikimedia.org/T337696
[07:07:45] <stashbot>	 T340264: Change dewiki import sources - https://phabricator.wikimedia.org/T340264
[07:09:38] <aanzx>	 taavi: can I edit this manually https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/929742/2/dblists/mobile-anon-talk.dblist
[07:10:17] <taavi>	 no, as the comment says you should use the `composer manage-dblist` command
[07:11:09] <aanzx>	 Ok, I will do it for the afternoon backport
[07:16:38] <logmsgbot>	 !log taavi@deploy1002 anzx and taavi: Backport for [[gerrit:932283|Change dewiki import sources (T340264)]], [[gerrit:932272|Rename namespace on extwiki (T337696)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:16:43] <stashbot>	 T337696: In ext.wiki, change namespace Güiquipeya to Güiquipedia - https://phabricator.wikimedia.org/T337696
[07:16:44] <stashbot>	 T340264: Change dewiki import sources - https://phabricator.wikimedia.org/T340264
[07:16:45] <taavi>	 aanzx: please test both of those patches on a mwdebug server
[07:16:54] <aanzx>	 Ok
[07:21:28] <aanzx>	 taavi: dewiki ok, extwiki namespace change still shows Güiquipeya instead of Güiquipedia
[07:21:54] <taavi>	 hmm, let me see
[07:23:14] <wikibugs>	 (CR) Elukey: "I found the real issue, finally:" [puppet] - https://gerrit.wikimedia.org/r/932663 (owner: Elukey)
[07:24:37] <taavi>	 ah. I didn't spot this earlier but you've changed the wrong setting I think - the correct setting is wgMetaNamespace but you've changed wgSitename
[07:24:44] <taavi>	 do you want to write a patch to fix it or should I?
[07:26:19] <aanzx>	 I will do it now
[07:29:06] <wikibugs>	 (PS2) Alexandros Kosiaris: Git template: Clean up git commit template message [deployment-charts] - https://gerrit.wikimedia.org/r/921668
[07:30:26] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:00] <aanzx>	 taavi: can you write the patch? I couldn't find wgMetaNamespace
[07:31:07] <logmsgbot>	 !log taavi@deploy1002 Sync cancelled.
[07:31:11] <taavi>	 sure, give me one second
[07:33:35] <wikibugs>	 (PS1) Majavah: extwiki: Update project namespace name [mediawiki-config] - https://gerrit.wikimedia.org/r/932788 (https://phabricator.wikimedia.org/T337696)
[07:33:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/932788 (https://phabricator.wikimedia.org/T337696) (owner: Majavah)
[07:34:40] <wikibugs>	 (Merged) jenkins-bot: extwiki: Update project namespace name [mediawiki-config] - https://gerrit.wikimedia.org/r/932788 (https://phabricator.wikimedia.org/T337696) (owner: Majavah)
[07:34:58] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:932788|extwiki: Update project namespace name (T337696)]]
[07:35:02] <stashbot>	 T337696: In ext.wiki, change namespace Güiquipeya to Güiquipedia - https://phabricator.wikimedia.org/T337696
[07:36:26] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:932788|extwiki: Update project namespace name (T337696)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:36:56] <taavi>	 aanzx: can you check if it works properly now?
[07:37:23] <taavi>	 hmm, and I think we want to add an alias for the old name, otherwise links are going to break
[07:37:23] <aanzx>	 Thanks taavi, working now
[07:37:41] <logmsgbot>	 !log taavi@deploy1002 Sync cancelled.
[07:38:41] <wikibugs>	 (PS1) Majavah: extwiki: Add an alias for old NS_PROJECT name [mediawiki-config] - https://gerrit.wikimedia.org/r/932789
[07:38:51] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/932789 (owner: Majavah)
[07:39:37] <wikibugs>	 (Merged) jenkins-bot: extwiki: Add an alias for old NS_PROJECT name [mediawiki-config] - https://gerrit.wikimedia.org/r/932789 (owner: Majavah)
[07:39:54] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:932789|extwiki: Add an alias for old NS_PROJECT name]]
[07:41:04] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Git template: Clean up git commit template message [deployment-charts] - https://gerrit.wikimedia.org/r/921668 (owner: Alexandros Kosiaris)
[07:41:23] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:932789|extwiki: Add an alias for old NS_PROJECT name]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[07:41:49] <wikibugs>	 (Merged) jenkins-bot: Git template: Clean up git commit template message [deployment-charts] - https://gerrit.wikimedia.org/r/921668 (owner: Alexandros Kosiaris)
[07:42:00] <taavi>	 and syncing
[07:42:06] <aanzx>	 Thanks
[07:48:29] <wikibugs>	 (CR) Jaime Nuche: [C: +1] releases: Fix alert for releases-jenkins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932414 (owner: EoghanGaffney)
[07:48:44] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:932789|extwiki: Add an alias for old NS_PROJECT name]] (duration: 08m 49s)
[07:48:48] <taavi>	 all done
[08:00:59] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: reports/network: ignore IPv6 for cloudservices boxes [software/netbox-extras] - https://gerrit.wikimedia.org/r/932794 (https://phabricator.wikimedia.org/T307357)
[08:03:58] <wikibugs>	 (PS1) Elukey: cassandra::instance::monitoring: remove wrong servername [puppet] - https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470)
[08:04:15] <wikibugs>	 (CR) Ayounsi: [C: +1] reports/network: ignore IPv6 for cloudservices boxes [software/netbox-extras] - https://gerrit.wikimedia.org/r/932794 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[08:04:54] <wikibugs>	 (PS1) Muehlenhoff: Remove access for paramd [puppet] - https://gerrit.wikimedia.org/r/932796
[08:05:30] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41993/console" [puppet] - https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:05:53] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] cassandra::instance::monitoring: remove wrong servername [puppet] - https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:06:33] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] reports/network: ignore IPv6 for cloudservices boxes [software/netbox-extras] - https://gerrit.wikimedia.org/r/932794 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[08:06:56] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[08:07:00] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[08:07:20] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[08:07:25] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[08:12:22] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for paramd [puppet] - https://gerrit.wikimedia.org/r/932796 (owner: Muehlenhoff)
[08:14:52] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Paramita Das out of all services on: 1261 hosts
[08:15:29] <logmsgbot>	 !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Paramita Das out of all services on: 1261 hosts
[08:17:44] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Paramita Das out of all services on: 771 hosts
[08:18:06] <logmsgbot>	 !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Paramita Das out of all services on: 771 hosts
[08:18:57] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Paramita Das out of all services on: 19 hosts
[08:19:02] <logmsgbot>	 !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Paramita Das out of all services on: 19 hosts
[08:26:23] <claime>	 jouncebot: nowandnext
[08:26:23] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 33 minute(s)
[08:26:23] <jouncebot>	 In 1 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1000)
[08:34:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: replace Apache 2.2 access control syntax [puppet] - https://gerrit.wikimedia.org/r/932443 (https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[08:34:45] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] thanos: replace Apache 2.2 with modern syntax for access control [puppet] - https://gerrit.wikimedia.org/r/932444 (https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[08:40:02] <wikibugs>	 (CR) Kosta Harlan: "What else needs to happen to make mariadb images available in GitLab CI? I still see messages saying that the image is not available https" [puppet] - https://gerrit.wikimedia.org/r/932328 (https://phabricator.wikimedia.org/T339352) (owner: Kosta Harlan)
[08:41:55] <wikibugs>	 (PS1) Elukey: cassandra::instance::monitoring: add 'cassandra' as servername [puppet] - https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470)
[08:43:22] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41994/console" [puppet] - https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:46:16] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:49:36] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudservices2005-dev: give it proper role and name. [puppet] - https://gerrit.wikimedia.org/r/932800 (https://phabricator.wikimedia.org/T338779)
[08:51:13] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudservices2005-dev: give it proper role and name. [puppet] - https://gerrit.wikimedia.org/r/932800 (https://phabricator.wikimedia.org/T338779)
[08:55:34] <wikibugs>	 (PS2) Elukey: cassandra::instance::monitoring: add 'cassandra' as servername [puppet] - https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470)
[08:55:36] <wikibugs>	 (PS1) Elukey: cassandra::instance: add CN:cassandra to all PKI certs [puppet] - https://gerrit.wikimedia.org/r/932801 (https://phabricator.wikimedia.org/T288470)
[08:56:56] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41995/console" [puppet] - https://gerrit.wikimedia.org/r/932801 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:58:13] <wikibugs>	 (CR) Elukey: "@Jbond: IIRC in this way I'd still get the fqdn in the cert, but also CN:cassandra right? Basically like we do for Kafka." [puppet] - https://gerrit.wikimedia.org/r/932801 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[09:01:47] <wikibugs>	 (CR) Vgutierrez: [C: +1] [beta] Update wgCdnServersNoPurge for new cache server (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/932380 (https://phabricator.wikimedia.org/T327742) (owner: Fabfur)
[09:02:57] <wikibugs>	 (PS1) Muehlenhoff: Extend access for tandic [puppet] - https://gerrit.wikimedia.org/r/932802
[09:03:32] <wikibugs>	 (PS1) MVernon: hiera: set ms-be1068 to be an object expirer [puppet] - https://gerrit.wikimedia.org/r/932803 (https://phabricator.wikimedia.org/T229584)
[09:03:38] <claime>	  /13
[09:04:26] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (aborrero) {F37119766}
[09:06:22] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[09:08:28] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2005-dev - aborrero@cumin2002"
[09:08:32] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[09:09:23] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2005-dev - aborrero@cumin2002"
[09:09:23] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:09:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend access for tandic [puppet] - https://gerrit.wikimedia.org/r/932802 (owner: Muehlenhoff)
[09:10:20] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[09:10:34] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2005-dev.mgmt.codfw.wmnet on all recursors
[09:10:37] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudservices2005-dev.mgmt.codfw.wmnet on all recursors
[09:10:54] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2005-dev.codfw.wmnet on all recursors
[09:10:57] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudservices2005-dev.codfw.wmnet on all recursors
[09:11:12] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2005-dev.codfw.wmnet with OS bullseye
[09:11:27] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet wi...
[09:13:23] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cloudservices2005-dev: give it proper role and name [puppet] - https://gerrit.wikimedia.org/r/932800 (https://phabricator.wikimedia.org/T338779)
[09:14:24] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudservices2005-dev: give it proper role and name [puppet] - https://gerrit.wikimedia.org/r/932800 (https://phabricator.wikimedia.org/T338779) (owner: Arturo Borrero Gonzalez)
[09:14:51] <wikibugs>	 (PS9) Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[09:17:31] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudservices2005-dev
[09:17:38] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add missing types to ferm::service [puppet] - https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[09:17:41] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudservices2005-dev
[09:17:48] <logmsgbot>	 !log aborrero@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudservices2005-dev.codfw.wmnet with OS bullseye
[09:18:00] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2005-dev.codfw.wmnet with OS bullseye
[09:18:02] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet with O...
[09:18:14] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet wi...
[09:18:48] <wikibugs>	 (PS10) Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[09:20:35] <wikibugs>	 (CR) Klausman: [C: +2] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/932804 (owner: Klausman)
[09:22:05] <wikibugs>	 (PS2) Klausman: homedirs/klausman: clean up dotfiles [puppet] - https://gerrit.wikimedia.org/r/932804
[09:23:50] <wikibugs>	 (PS16) Muehlenhoff: ferm: Allow passing sets to an srange or drange [puppet] - https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497)
[09:24:26] <wikibugs>	 (CR) Klausman: [C: +2] homedirs/klausman: clean up dotfiles [puppet] - https://gerrit.wikimedia.org/r/932804 (owner: Klausman)
[09:29:00] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[09:29:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: Setup in progress
[09:29:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: Setup in progress
[09:30:34] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] hiera: set ms-be1068 to be an object expirer [puppet] - 10https://gerrit.wikimedia.org/r/932803 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon)
[09:32:49] <vgutierrez>	 jouncebot: nowandnext
[09:32:49] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 27 minute(s)
[09:32:49] <jouncebot>	 In 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1000)
[09:32:55] <vgutierrez>	 fabfur: ^^
[09:32:59] <fabfur>	 tnx
[09:34:17] <wikibugs>	 (03PS1) 10Btullis: Add a workaround for a kerberos issue affecting Presto version 0.281 [puppet] - 10https://gerrit.wikimedia.org/r/932827 (https://phabricator.wikimedia.org/T337335)
[09:35:21] <wikibugs>	 (03PS2) 10Btullis: Add a workaround for a kerberos issue affecting Presto version 0.281 [puppet] - 10https://gerrit.wikimedia.org/r/932827 (https://phabricator.wikimedia.org/T337335)
[09:37:18] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2005-dev.codfw.wmnet with reason: host reimage
[09:38:02] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add a workaround for a kerberos issue affecting Presto version 0.281 [puppet] - 10https://gerrit.wikimedia.org/r/932827 (https://phabricator.wikimedia.org/T337335) (owner: 10Btullis)
[09:40:01] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2005-dev.codfw.wmnet with reason: host reimage
[09:41:23] <fabfur>	 jouncebot: nowandnext
[09:41:23] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 18 minute(s)
[09:41:23] <jouncebot>	 In 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1000)
[09:41:59] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by fabfur@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932380 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur)
[09:43:47] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] Update wgCdnServersNoPurge for new cache server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932380 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur)
[09:46:35] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff)
[09:52:17] <wikibugs>	 (03Abandoned) 10Aqu: [WIP] Build spark yarn archive for Spark 3 from conda-analytics package [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu)
[09:53:50] <wikibugs>	 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10aborrero)
[09:54:10] <wikibugs>	 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10aborrero)
[09:54:53] <wikibugs>	 (03CR) 10Aqu: "This repo could be deprecated now that the migration to Airflow 2.5 is done." [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/881873 (https://phabricator.wikimedia.org/T326194) (owner: 10Aqu)
[09:55:19] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] contint: replace Apache 2.2 with 2.4 syntax for access control [puppet] - 10https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn)
[09:55:30] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert)
[09:55:41] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] releases-jenkins: replace Apache 2.2 with 2.4 syntax for access control [puppet] - 10https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn)
[09:55:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/932400 (owner: 10Ayounsi)
[09:58:19] <wikibugs>	 (03PS2) 10D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798
[09:58:52] <wikibugs>	 (03CR) 10Volans: "I don't know the details to vote on this, but for me it's a +1 for fixing this at the apache layer for now." [puppet] - 10https://gerrit.wikimedia.org/r/932404 (owner: 10Slyngshede)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1000)
[10:00:06] <jouncebot>	 claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:00:20] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[10:01:06] <claime>	 !log mw-on-k8s: Redirect closed wikis to mw-on-k8s - T337490
[10:01:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:16] <stashbot>	 T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490
[10:01:26] <wikibugs>	 (03PS1) 10Muehlenhoff: Point codfw URL downloader to new bullseye host [dns] - 10https://gerrit.wikimedia.org/r/932830 (https://phabricator.wikimedia.org/T329945)
[10:01:55] <wikibugs>	 (03PS1) 10Slyngshede: SUL Account: Allow users to dismiss account linking. [software/bitu] - 10https://gerrit.wikimedia.org/r/932831
[10:02:00] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:02:42] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:03:29] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices2005-dev: drop cloud-private base interface override [puppet] - 10https://gerrit.wikimedia.org/r/932832 (https://phabricator.wikimedia.org/T338779)
[10:04:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Point codfw URL downloader to new bullseye host [dns] - 10https://gerrit.wikimedia.org/r/932830 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff)
[10:04:49] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: acme_chief: allow cloudservices2005-dev to access ldap-codfw1dev cert [puppet] - 10https://gerrit.wikimedia.org/r/932833 (https://phabricator.wikimedia.org/T338779)
[10:05:05] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices2005-dev: drop cloud-private base interface override [puppet] - 10https://gerrit.wikimedia.org/r/932832 (https://phabricator.wikimedia.org/T338779) (owner: 10Arturo Borrero Gonzalez)
[10:05:41] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] acme_chief: allow cloudservices2005-dev to access ldap-codfw1dev cert [puppet] - 10https://gerrit.wikimedia.org/r/932833 (https://phabricator.wikimedia.org/T338779) (owner: 10Arturo Borrero Gonzalez)
[10:07:07] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: acme_chief: extend ldap-codfw1dev with cloudservices2005-dev SNI [puppet] - 10https://gerrit.wikimedia.org/r/932834 (https://phabricator.wikimedia.org/T338779)
[10:08:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] acme_chief: extend ldap-codfw1dev with cloudservices2005-dev SNI [puppet] - 10https://gerrit.wikimedia.org/r/932834 (https://phabricator.wikimedia.org/T338779) (owner: 10Arturo Borrero Gonzalez)
[10:08:50] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Thanks for the fix! LGTM, couple of nits inline, no blockers." [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[10:16:13] <wikibugs>	 (03PS8) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490)
[10:16:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts krb2001.codfw.wmnet
[10:19:17] <claime>	 !log mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s - T337490
[10:19:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:21] <stashbot>	 T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490
[10:19:27] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[10:19:32] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] hiera: set ms-be1068 to be an object expirer [puppet] - 10https://gerrit.wikimedia.org/r/932803 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon)
[10:21:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:24:10] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1068 is CRITICAL: CRITICAL - degraded: The following units failed: swift-container-sharder.service,swift-object-reconstructor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:24:47] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002"
[10:25:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:25:19] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: pdns_server: db_backup: fix grant statement order [puppet] - 10https://gerrit.wikimedia.org/r/932838
[10:25:20] <jinxer-wm>	 (ProbeDown) firing: (6) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:25:29] <logmsgbot>	 !log aborrero@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002"
[10:25:30] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices2005-dev.codfw.wmnet with OS bullseye
[10:25:44] <wikibugs>	 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet with O...
[10:25:47] <wikibugs>	 (03PS1) 10Btullis: Revert "Enable the PRESTO_EXPAND_DATA feature flag in Superset" [puppet] - 10https://gerrit.wikimedia.org/r/932644
[10:26:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:26:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:26:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb2001.codfw.wmnet
[10:26:11] <wikibugs>	 (03PS2) 10Btullis: Revert "Enable the PRESTO_EXPAND_DATA feature flag in Superset" [puppet] - 10https://gerrit.wikimedia.org/r/932644 (https://phabricator.wikimedia.org/T340144)
[10:28:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove krb2001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/932839 (https://phabricator.wikimedia.org/T340433)
[10:29:59] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert)
[10:30:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove krb2001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/932839 (https://phabricator.wikimedia.org/T340433) (owner: 10Muehlenhoff)
[10:32:35] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:32:39] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudservices2005-dev - aborrero@cumin2002"
[10:32:52] <wikibugs>	 (03CR) 10Muehlenhoff: "Rebased on top of the latest type changes, ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[10:33:23] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudservices2005-dev - aborrero@cumin2002"
[10:37:12] <wikibugs>	 (03PS1) 10Clément Goubert: mw-on-k8s: Redirect officewiki to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/932857 (https://phabricator.wikimedia.org/T337490)
[10:37:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] pdns_server: db_backup: fix grant statement order [puppet] - 10https://gerrit.wikimedia.org/r/932838 (owner: 10Arturo Borrero Gonzalez)
[10:38:47] <wikibugs>	 (03PS3) 10Daniel Kinzler: Parsoid: Disable PC writes on dewiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867)
[10:41:13] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41996/console" [puppet] - 10https://gerrit.wikimedia.org/r/932857 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[10:42:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Extend access for wangombe [puppet] - 10https://gerrit.wikimedia.org/r/933059
[10:44:10] <wikibugs>	 (03PS1) 10AikoChou: changeprop: update page_change_kind for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/933060 (https://phabricator.wikimedia.org/T328899)
[10:44:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Extend access for wangombe [puppet] - 10https://gerrit.wikimedia.org/r/933059 (owner: 10Muehlenhoff)
[10:45:15] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] cassandra::instance: add CN:cassandra to all PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/932801 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey)
[10:47:08] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Revert "Enable the PRESTO_EXPAND_DATA feature flag in Superset" [puppet] - 10https://gerrit.wikimedia.org/r/932644 (https://phabricator.wikimedia.org/T340144) (owner: 10Btullis)
[10:47:44] <wikibugs>	 10SRE, 10AbuseFilter, 10serviceops, 10PHP 7.4 support: Regular expression "х[ÿý]и" match "х и" in Abusefilter - https://phabricator.wikimedia.org/T340068 (10Clement_Goubert) >>! In T340068#8962701, @Daimona wrote: >>>! In T340068#8962700, @Reedy wrote: >>>>! In T340068#8962699, @Daimona wrote: >>>>>! In T3...
[10:49:20] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] cassandra::instance: add CN:cassandra to all PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/932801 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey)
[10:50:47] <wikibugs>	 (03PS3) 10Elukey: cassandra::instance::monitoring: add 'cassandra' as servername [puppet] - 10https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470)
[10:51:45] <icinga-wm>	 PROBLEM - puppet last run on an-tool1010 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:51:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/932389 (owner: 10Slyngshede)
[10:53:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41997/console" [puppet] - 10https://gerrit.wikimedia.org/r/932395 (owner: 10Majavah)
[10:53:41] <wikibugs>	 (03PS4) 10Elukey: cassandra::instance::monitoring: add 'cassandra' as servername [puppet] - 10https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470)
[10:54:34] <wikibugs>	 (03PS1) 10Matthias Mullie: Section-level notifications [extensions/ImageSuggestions] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/933066 (https://phabricator.wikimedia.org/T330931)
[10:54:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff)
[10:55:02] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41998/console" [puppet] - 10https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey)
[10:56:20] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] cassandra::instance::monitoring: add 'cassandra' as servername [puppet] - 10https://gerrit.wikimedia.org/r/932799 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey)
[10:56:24] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/932395 (owner: 10Majavah)
[10:56:52] <jbond>	 elukey: happy for me to merge yours
[10:57:13] <icinga-wm>	 RECOVERY - puppet last run on an-tool1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:58:07] <elukey>	 jbond: +1!
[10:58:28] <jbond>	 elukey: done
[10:58:30] <elukey>	 <3
[10:58:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/932396 (owner: 10Majavah)
[11:00:02] <moritzm>	 !log installing libfastjson security updates
[11:00:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:21] <wikibugs>	 (03CR) 10Jbond: "lgtm but will wait on response to moritz q" [puppet] - 10https://gerrit.wikimedia.org/r/932397 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah)
[11:03:01] <wikibugs>	 (03CR) 10Cparle: [C: 03+1] Section-level notifications [extensions/ImageSuggestions] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/933066 (https://phabricator.wikimedia.org/T330931) (owner: 10Matthias Mullie)
[11:03:42] <wikibugs>	 (03CR) 10Jbond: jwt_authorizer: support templates for validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932398 (owner: 10Majavah)
[11:05:03] <wikibugs>	 (03CR) 10Majavah: P:toolforge: aptly: add a system user to own the repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932397 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah)
[11:05:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm (although i didn't test/render the go template)" [puppet] - 10https://gerrit.wikimedia.org/r/932399 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah)
[11:09:09] <icinga-wm>	 PROBLEM - puppet last run on idp-test1002 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:09:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[11:10:28] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] D:apereo_cas::service fix group membership validation [puppet] - 10https://gerrit.wikimedia.org/r/932389 (owner: 10Slyngshede)
[11:12:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for libfastjson [puppet] - 10https://gerrit.wikimedia.org/r/933070
[11:14:43] <icinga-wm>	 RECOVERY - puppet last run on idp-test1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:16:01] <wikibugs>	 (03CR) 10Slyngshede: P:netbox Redirect to idp on OIDC auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932404 (owner: 10Slyngshede)
[11:20:20] <jinxer-wm>	 (ProbeDown) firing: (6) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:24:56] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/932808
[11:27:19] <icinga-wm>	 PROBLEM - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[11:29:14] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] releases: Move the primary releases host from 1002 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/932228 (owner: 10EoghanGaffney)
[11:30:17] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933080 (https://phabricator.wikimedia.org/T328899)
[11:30:36] <jinxer-wm>	 (ProbeDown) firing: (3) Service releases1002:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:34:44] <wikibugs>	 (03PS2) 10EoghanGaffney: releases: Switch releases.d.w to releases1003 [dns] - 10https://gerrit.wikimedia.org/r/932230
[11:34:56] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: set memory limit for ratelimit container [deployment-charts] - 10https://gerrit.wikimedia.org/r/933084
[11:36:54] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/932230 (owner: 10EoghanGaffney)
[11:37:02] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] releases: Switch releases.d.w to releases1003 [dns] - 10https://gerrit.wikimedia.org/r/932230 (owner: 10EoghanGaffney)
[11:40:05] <wikibugs>	 (03CR) 10Muehlenhoff: P:toolforge: aptly: add a system user to own the repository (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/932397 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah)
[11:40:07] <wikibugs>	 (03PS1) 10Fabfur: hiera: Added new bullseye instance for cache-text in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/933085 (https://phabricator.wikimedia.org/T327742)
[11:40:35] <wikibugs>	 (03CR) 10AikoChou: "I'd like to wait until the new logging is deployed to LW and inspect that, before merging this (maybe will need to update other configs)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/933060 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[11:41:09] <icinga-wm>	 PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100%
[11:41:17] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "oh the irony, pcc is failing due to your patch[1] ruby 2.5 (buster) use plain keywords vs ruby2.7 that use symbols" [puppet] - 10https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[11:41:25] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] "Haven't tested the change, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn)
[11:42:03] <wikibugs>	 (03PS2) 10AikoChou: ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933080 (https://phabricator.wikimedia.org/T328899)
[11:42:19] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hiera: Added new bullseye instance for cache-text in deployment-prep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933085 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur)
[11:42:55] <icinga-wm>	 RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[11:43:37] <icinga-wm>	 PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-patches-releases-primary.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:05] <icinga-wm>	 PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-patches-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:52] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] hiera: Added new bullseye instance for cache-text in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/933085 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur)
[11:45:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm)
[11:50:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add library hint for libfastjson [puppet] - 10https://gerrit.wikimedia.org/r/933070 (owner: 10Muehlenhoff)
[11:50:36] <jinxer-wm>	 (ProbeDown) firing: (3) Service releases1002:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:31] <wikibugs>	 (03PS1) 10EoghanGaffney: releases: Revert "releases: Add motd warning about upcoming host change" [puppet] - 10https://gerrit.wikimedia.org/r/933086
[11:53:39] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] releases: Revert "releases: Add motd warning about upcoming host change" [puppet] - 10https://gerrit.wikimedia.org/r/933086 (owner: 10EoghanGaffney)
[11:59:18] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudservices2005-dev
[11:59:31] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudservices2005-dev
[11:59:38] <wikibugs>	 (03PS1) 10Btullis: Upgrade the analytics airflow instance to 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933087 (https://phabricator.wikimedia.org/T336286)
[11:59:40] <wikibugs>	 (03PS1) 10Btullis: Upgrade the search instance of airflow to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933088 (https://phabricator.wikimedia.org/T336286)
[11:59:42] <wikibugs>	 (03PS1) 10Btullis: Upgrade the research instance of airflow to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933089 (https://phabricator.wikimedia.org/T336286)
[11:59:44] <wikibugs>	 (03PS1) 10Btullis: Update the platform_eng airflow instance to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933090 (https://phabricator.wikimedia.org/T336286)
[11:59:46] <wikibugs>	 (03PS1) 10Btullis: Upgrade the analytics_product airflow instance to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933091 (https://phabricator.wikimedia.org/T336286)
[12:00:31] <icinga-wm>	 RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:59] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] releases: Revert "releases: Add motd warning about upcoming host change" [puppet] - 10https://gerrit.wikimedia.org/r/933086 (owner: 10EoghanGaffney)
[12:00:59] <icinga-wm>	 RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:21] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:03:07] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:04:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932404 (owner: 10Slyngshede)
[12:05:15] <eoghan>	 fabfur: Going to merge your puppet change, that ok?
[12:06:20] <wikibugs>	 (03CR) 10Slyngshede: P:netbox Redirect to idp on OIDC auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932404 (owner: 10Slyngshede)
[12:08:15] <icinga-wm>	 PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1003.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:09:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libfastjson [puppet] - 10https://gerrit.wikimedia.org/r/933070 (owner: 10Muehlenhoff)
[12:10:29] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "I am not sure from where the error comes. Sorry for the typo! :)" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/932640 (owner: 10Paladox)
[12:10:32] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] P:netbox Redirect to idp on OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/932404 (owner: 10Slyngshede)
[12:11:05] <wikibugs>	 (03Merged) 10jenkins-bot: Change attribution name [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/932640 (owner: 10Paladox)
[12:11:17] <icinga-wm>	 RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:19] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41999/console" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[12:15:25] <wikibugs>	 (03CR) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[12:18:53] <icinga-wm>	 PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1003.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:20:23] <icinga-wm>	 RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:28:01] <icinga-wm>	 PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1003.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:07] <icinga-wm>	 RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:36:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "I'd split this in 2 patches, one for each wiki, to be merged at least a few hours apart. That way, if the experiment ends up having unint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler)
[12:38:51] <icinga-wm>	 PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1003.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:25] <icinga-wm>	 RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:25] <wikibugs>	 (03PS1) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594)
[12:50:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi)
[12:53:09] <wikibugs>	 (03CR) 10Joal: [C: 03+1] refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - 10https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236) (owner: 10Ottomata)
[12:53:54] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] Probenet: Restore mapping for Nigeria [dns] - 10https://gerrit.wikimedia.org/r/932468 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1300).
[13:00:04] <jouncebot>	 matthiasmullie and duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:19] <matthiasmullie>	 o/
[13:00:47] <wikibugs>	 (03Abandoned) 10Anzx: Enable tabs for non logged in users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932284 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx)
[13:01:29] <duesen>	 o/
[13:01:39] <icinga-wm>	 PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100%
[13:01:57] <TheresNoTime>	 Unavailable today to deploy, I'm sure someone else will be along shortly 
[13:02:11] <claime>	 Erm wait a bit while I investigate what's happening to parse1012
[13:02:43] <claime>	 Or I can take it out of the pool so you don't get errors
[13:02:45] <claime>	 I'll do that
[13:02:49] <duesen>	 I can also self-service as long as effie is around to help monitor the jobrunners.
[13:02:58] <claime>	 duesen: I'm around too
[13:03:06] <duesen>	 ok cool
[13:03:14] <duesen>	 let me know when you are done
[13:03:31] <icinga-wm>	 RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[13:03:53] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=parse1012.eqiad.wmnet
[13:04:01] <claime>	 OBVIOUSLY
[13:04:06] <claime>	 lol
[13:04:09] <duesen>	 matthiasmullie: your patch looks massive
[13:04:18] <effie>	 I am around too duesen 
[13:04:36] <effie>	 go ahead
[13:04:37] <RhinosF1>	 claime: parse1012 is flapping
[13:04:41] <claime>	 RhinosF1: ack
[13:04:42] <RhinosF1>	 has been for a few days
[13:05:02] <claime>	 Thanks for the info, there's nothing in sel, I think it might be the network cable
[13:05:11] <claime>	 I'll leave it inactive so deployment can proceed
[13:05:14] <matthiasmullie>	 duesen: yeah; most of it is just 1 patch, plus a lot of i18n that go along with it
[13:05:29] <claime>	 !log parse1012 pooled inactive for flapping investigation
[13:05:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:39] <claime>	 Y'all can go ahead
[13:06:06] <matthiasmullie>	 duesen: feel free to go first, I can wait
[13:06:13] <duesen>	 matthiasmullie: even without the i18n it's pretty big for a backport... I'm not complaining, just wondering if it might cause trouble if you have more backports and want to revert, etc.
[13:06:32] <duesen>	 matthiasmullie: mine is a config patch, maybe merge yours while I deploy mine?
[13:06:33] <taavi>	 o/ is someone deploying already?
[13:06:54] <taavi>	 can I ask why that ImageSuggestions patch is being backported in the first place?
[13:07:19] <duesen>	 oh wait, my patch has a -1 from akosiaris
[13:07:26] <matthiasmullie>	 duesen: should be pretty safe; the only user-facing thing is Echo config; rest of the changes are a (not currently running) maint script, have another day to revert should that be needed
[13:07:32] <duesen>	 I'll do frwiki first. give me a couple of minutes to update
[13:08:19] <wikibugs>	 (03CR) 10Daniel Kinzler: Parsoid: Disable PC writes on dewiki and frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler)
[13:08:48] <duesen>	 akosiaris: The reason to go for something big right away is that a small wiki will not provide any new information. It won't be visible in the sum total of things
[13:08:55] <duesen>	 I need something that makes the metrics move
[13:09:21] <matthiasmullie>	 taavi: there's a maint script already running weekly (on Wed) that generates notifications (image suggestions)
[13:09:49] <matthiasmullie>	 by the end of this quarter, we're supposed to have it also send notifications for sections
[13:09:53] <wikibugs>	 (03PS4) 10Daniel Kinzler: Parsoid: Disable PC writes on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867)
[13:10:12] <matthiasmullie>	 which is either this Wed, or too late :p
[13:10:15] <duesen>	 akosiaris, effie, claime: ok to go? --^
[13:11:07] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[13:11:45] <taavi>	 matthiasmullie: first of all, if deployed as is your patch would cause fatals since the extension.json change to add new hooks would be applied before php sees the new method
[13:12:20] <taavi>	 and it's a massive patch in general, so at least I don't feel comfortable backporting it, I'd much rather see it go out via the train
[13:12:35] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] Parsoid: Disable PC writes on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler)
[13:13:12] <duesen>	 effie: thanks, i'll deploy now
[13:13:13] <claime>	 Needs a rebase afaict?
[13:13:19] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventstreams use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/932165 (https://phabricator.wikimedia.org/T335024) (owner: 10TChin)
[13:13:24] <wikibugs>	 (03PS5) 10Daniel Kinzler: Parsoid: Disable PC writes on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867)
[13:14:04] <duesen>	 claime: config patches seem to be always marked as merge conflicts, even if they apply cleanly. I suspect something is just bailing because InitializeSettings is huge
[13:14:13] <claime>	 Ah fair
[13:14:23] <wikibugs>	 (03Merged) 10jenkins-bot: eventstreams use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/932165 (https://phabricator.wikimedia.org/T335024) (owner: 10TChin)
[13:14:29] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Parsoid: Disable PC writes on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler)
[13:14:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler)
[13:15:25] <matthiasmullie>	 gah alright, guess we'll have to wait this one out then
[13:15:33] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933080 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[13:15:40] <wikibugs>	 (03Merged) 10jenkins-bot: Parsoid: Disable PC writes on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler)
[13:15:50] <duesen>	 effie, claime : if frwiki has no impact, can we try dewiki or enwiki in a couple of hours?
[13:15:53] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] changeprop: update page_change_kind for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/933060 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[13:15:57] <logmsgbot>	 !log daniel@deploy1002 Started scap: Backport for [[gerrit:932175|Parsoid: Disable PC writes on frwiki (T339867)]]
[13:16:01] <stashbot>	 T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867
[13:16:08] <claime>	 duesen: sure
[13:16:14] <effie>	 +1
[13:16:18] <duesen>	 cool
[13:16:20] <duesen>	 let's see
[13:17:12] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[13:17:23] <logmsgbot>	 !log daniel@deploy1002 daniel: Backport for [[gerrit:932175|Parsoid: Disable PC writes on frwiki (T339867)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:17:35] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Use date/time image version name [deployment-charts] - 10https://gerrit.wikimedia.org/r/933096 (https://phabricator.wikimedia.org/T336163)
[13:18:15] <logmsgbot>	 !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@b3751e6]: (no justification provided)
[13:18:24] <logmsgbot>	 !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@b3751e6]: (no justification provided) (duration: 00m 09s)
[13:20:00] <wikibugs>	 (03CR) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[13:22:28] <sukhe>	 !log sudo cumin 'A:dns-auth' 'disable-puppet "merging CR 932248"'
[13:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:44] <wikibugs>	 (03PS2) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594)
[13:23:03] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] O:dnsbox: clean-up dnsbox role and dns::recursor [puppet] - 10https://gerrit.wikimedia.org/r/932248 (owner: 10Ssingh)
[13:23:17] <duesen>	 effie, claime: job queue wait time has been going up for the past two hours already, any idea what's going on there?
[13:23:29] <duesen>	 I guess I should have checked that before pushing the patch...
[13:23:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:24:59] <effie>	 duesen: anything could have done that, as it is related to traffic etc
[13:25:10] <duesen>	 effie, claime: looking at the long term trend, long wait times seem to have started end of april.
[13:25:18] <duesen>	 was fine before that
[13:25:20] <jinxer-wm>	 (ProbeDown) firing: (5) Service ml-cache1002:7001 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:25:33] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[13:25:54] <claime>	 Looks like a sharp rise in insertion rates in codfw around 1130
[13:26:06] <claime>	 Can't see what job caused it yet
[13:26:08] <duesen>	 (fpm restart running)
[13:26:18] <logmsgbot>	 !log daniel@deploy1002 Finished scap: Backport for [[gerrit:932175|Parsoid: Disable PC writes on frwiki (T339867)]] (duration: 10m 20s)
[13:26:22] <stashbot>	 T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867
[13:26:40] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 04-1] "sorry for sending comments over two runs i forget that pcc sends any draft comments when it adds the report" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[13:28:30] <claime>	 duesen: https://grafana.wikimedia.org/goto/zw-0PqXVz?orgId=1
[13:28:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:28:51] <claime>	 Looks like a rise in the prewarm jobs
[13:29:30] <sukhe>	 !log sudo cumin 'A:dns-auth' 'enable-puppet "merging CR 932248"'
[13:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:35] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[13:30:16] <claime>	 Corresponding rise in job processing rate though
[13:30:20] <jinxer-wm>	 (ProbeDown) resolved: (5) Service ml-cache1002:7001 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:30:23] <duesen>	 claime: yes, but not beyond what seems normal looking back a week
[13:30:29] <claime>	 duesen: yep
[13:30:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:30:46] <wikibugs>	 (03PS14) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514
[13:31:00] <duesen>	 claime: if this is normal, but it causes the queue to back up, that sounds like we need to add capacity...
[13:31:03] <wikibugs>	 (03PS3) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594)
[13:31:30] <duesen>	 ok, the patch disabling the parser cache writes from restbase updates has landed. 
[13:31:33] <wikibugs>	 (03CR) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[13:31:41] <claime>	 duesen: thing is we never go below like 500 idle workers
[13:31:50] <claime>	 We're not that saturated on the jobrunner
[13:31:51] <claime>	 s
[13:32:03] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudservices2005-dev), No backups: 2 (cloudservices2005-dev, ...), Fresh: 130 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[13:32:04] <duesen>	 something must be saturated... 
[13:32:16] <claime>	 I agree, I just can't find what ~_~
[13:32:24] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42000/console" [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh)
[13:34:22] <wikibugs>	 (03PS6) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484)
[13:34:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[13:35:18] <duesen>	 claime: explanation for the sharp rise is a template edit. It causes pages to be invalidated, but not rendered on eqiad. If the template is used on a lot of pages that have a decent number of viewers, each of these pages will trigger a job on codfw over the next few hours.
[13:35:29] <wikibugs>	 (03CR) 10FNegri: cumin: Properly set connect_timeout (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[13:35:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:36:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:36:27] <wikibugs>	 (03PS7) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484)
[13:36:50] <claime>	 duesen: I figured it was something like that
[13:36:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[13:37:41] <claime>	 I'm looking at logstash and there's not that many errors for JobExecutor on parsoidCachePrewarm (like 8 in the past 3 hours)
[13:37:44] <effie>	 duesen: those are normal operations though, whatever might cause a surge of jobs, while we want to have capacity for this case too 
[13:38:20] <effie>	 what claime and I are suggesting is that, let's not look at this as a problem until it becomes a problem
[13:38:25] <duesen>	 claime, effie: i see the processing rate for the prewarming jobs go up. That's unexpected, I'd expect it to go down - it's the same number of jobs, but fewer of them would now be able to exit early because the cached entry is up to date. 
[13:38:37] <wikibugs>	 (03PS17) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497)
[13:38:41] <duesen>	 Maybe it's just noise, that curve has a lot of jitter.
[13:39:29] <effie>	 let's give it some time, and we can regroup and see where things are, if there are still things in the graphs you can't explain, we revert
[13:39:34] <wikibugs>	 (03PS8) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484)
[13:39:36] <effie>	 until we find an explanation
[13:39:44] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[13:40:04] <duesen>	 yea, i'm not hearing any explosions ;)
[13:40:21] <duesen>	 what shall we try to add next? dewiki? enwiki?
[13:40:30] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[13:40:52] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:41:01] <duesen>	 uh...
[13:41:20] <duesen>	 effie: I just realized that any effect will be delayed by whatever the jobqueue backlog is.
[13:41:30] <claime>	 yeah we need to wait a bit
[13:41:40] <duesen>	 I'll check back in 20 minutes
[13:41:43] <claime>	 Job *insertion* rate should not be delayed though
[13:41:49] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[13:41:51] <wikibugs>	 (03PS1) 10Btullis: Bump the version of the datahub image that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/933097 (https://phabricator.wikimedia.org/T329514)
[13:44:13] <wikibugs>	 (03PS9) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484)
[13:44:21] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Ignore LAGs from test_port_block_consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/932400 (owner: 10Ayounsi)
[13:44:39] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (10Sai10ukazuki) p:05Triage→03High
[13:45:19] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Bump the version of the datahub image that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/933097 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[13:45:26] <wikibugs>	 (03PS1) 10TChin: eventstreams use latest mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/933098
[13:45:31] <wikibugs>	 (03Merged) 10jenkins-bot: Ignore LAGs from test_port_block_consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/932400 (owner: 10Ayounsi)
[13:45:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[13:45:55] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[13:46:04] <wikibugs>	 (03PS1) 10Jbond: cassandra: add both fqdn and cassandra to sni [puppet] - 10https://gerrit.wikimedia.org/r/933099
[13:46:06] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[13:46:11] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[13:46:20] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[13:46:42] <wikibugs>	 (03Merged) 10jenkins-bot: Bump the version of the datahub image that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/933097 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[13:47:50] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[13:48:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:48:31] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[13:50:36] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[13:52:21] <wikibugs>	 (03Abandoned) 10Matthias Mullie: Section-level notifications [extensions/ImageSuggestions] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/933066 (https://phabricator.wikimedia.org/T330931) (owner: 10Matthias Mullie)
[13:53:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:56:20] <wikibugs>	 (03PS10) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484)
[13:56:37] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] cassandra: add both fqdn and cassandra to sni [puppet] - 10https://gerrit.wikimedia.org/r/933099 (owner: 10Jbond)
[13:58:27] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[13:58:41] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[14:01:03] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (10Aklapper) p:05High→03Triage @Sai10ukazuki: Do you [plan to work on fixing this task](https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities), as you [increased the pr...
[14:01:38] <duesen>	 claime, effie: I'm not seeing any impact on the jobrunner cluster.
[14:02:22] <duesen>	 ...the queue backlog is not looking good though. the enqueue rate is slowly coming down (i assume more and more of the pages that contain the template have been re-parsed now)
[14:04:59] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on parse1012 is CRITICAL: Host parse1012 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:05:20] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] Move esams varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932218 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey)
[14:05:22] <wikibugs>	 (03PS3) 10Jameel Kaisar: Probenet: Restore mapping for Nigeria [dns] - 10https://gerrit.wikimedia.org/r/932468 (https://phabricator.wikimedia.org/T337318)
[14:05:25] <effie>	 duesen: we are looking into state of things with claime
[14:05:49] <wikibugs>	 (03CR) 10FNegri: cumin: Properly set connect_timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[14:06:04] <elukey>	 !log move varnishkafka instances in esams to pki
[14:06:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:53] <wikibugs>	 (03PS3) 10Jameel Kaisar: Update mappings for subregions of CA/US based on the Probenet data [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318)
[14:07:36] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:34] <wikibugs>	 (03PS2) 10TChin: eventstreams use latest mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/933098
[14:09:59] <wikibugs>	 (03PS11) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484)
[14:10:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[14:11:27] <wikibugs>	 (03PS12) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484)
[14:11:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[14:12:28] <wikibugs>	 (03PS13) 10FNegri: cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484)
[14:12:58] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventstreams use latest mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/933098 (owner: 10TChin)
[14:13:24] <wikibugs>	 (03CR) 10AikoChou: [C: 03+2] ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933080 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[14:13:49] <wikibugs>	 (03Merged) 10jenkins-bot: eventstreams use latest mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/933098 (owner: 10TChin)
[14:13:58] <wikibugs>	 (03PS1) 10JMeybohm: wikikube: Switch to new IPv6 service ip ranges [puppet] - 10https://gerrit.wikimedia.org/r/933100 (https://phabricator.wikimedia.org/T335285)
[14:14:22] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933080 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[14:15:01] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[14:15:25] <wikibugs>	 (03PS1) 10JMeybohm: Revert "Revert "k8s: Configure the IPv6 service ip range for apiserver"" [puppet] - 10https://gerrit.wikimedia.org/r/933101
[14:16:44] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[14:17:02] <sukhe>	 !log sudo cumin 'P{C:bird::anycast_healthchecker}' 'disable-puppet "merging CR 922514"'
[14:17:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:35] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:18:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh)
[14:18:33] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:bird::anycast_healthchecker: allow binding to multiple services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh)
[14:19:00] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[14:19:45] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (10Sai10ukazuki) >>! In T340380#8963987, @Aklapper wrote: > @Sai10ukazuki: Do you [plan to work on fixing this task](https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities),...
[14:20:02] <wikibugs>	 (03CR) 10Eevans: "I don't see how this will work.  At least on the multi-instance configuration, $listen_address is different from the main host." [puppet] - 10https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey)
[14:23:07] <sukhe>	 !log restart pdns-rec.service on doh6001 to test systemd binding to anycast-hc
[14:23:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:33] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] cassandra::instance::monitoring: remove wrong servername (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey)
[14:24:25] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:24:31] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:27:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:28:00] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[14:28:01] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[14:29:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:30:03] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[14:30:08] <sukhe>	 !log rolling out CR 922514 to A:wikidough (-s1 -b30): T336792
[14:30:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:13] <stashbot>	 T336792: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792
[14:31:56] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[14:32:20] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[14:32:28] <wikibugs>	 (03CR) 10Eevans: cassandra::instance::monitoring: remove wrong servername (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932795 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey)
[14:32:44] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[14:34:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:36:55] <wikibugs>	 (03PS1) 10JMeybohm: envoyproxy: Add type URL to http and listener filters [puppet] - 10https://gerrit.wikimedia.org/r/933112 (https://phabricator.wikimedia.org/T337405)
[14:37:22] <sukhe>	 !log rolling out CR 922514 to A:dns-auth: T336792
[14:37:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:25] <stashbot>	 T336792: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792
[14:40:20] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[14:40:33] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[14:40:56] <sukhe>	 !log rolling out CR 922514 to A:durum: T336792
[14:40:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:27] <wikibugs>	 (03PS2) 10JMeybohm: modules.mesh.configuration: Copy 1.3.0 to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923303
[14:41:29] <wikibugs>	 (03PS2) 10JMeybohm: mesh.configuration: Add type URL to http and listener filters [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405)
[14:42:56] <wikibugs>	 (03PS1) 10TChin: eventstreams add schema listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/933114
[14:43:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] eventstreams add schema listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/933114 (owner: 10TChin)
[14:44:01] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:45:14] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (10elukey) All varnishkafkas on PKI!  Remaining steps:  * clean up the old certificate from puppet private and puppet CA.
[14:46:06] <wikibugs>	 (03CR) 10Hashar: "Oops. I guess the invoked methods are not the proper one or the registered component should be a bit more than just an element. I am also " [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/932641 (https://phabricator.wikimedia.org/T340372) (owner: 10Paladox)
[14:46:56] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@7db3f9b]: Fix up attribution name in wm-app-theme.js plugin
[14:47:04] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@7db3f9b]: Fix up attribution name in wm-app-theme.js plugin (duration: 00m 08s)
[14:47:07] <wikibugs>	 (03PS2) 10TChin: eventstreams add schema listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/933114
[14:49:09] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792 (10ssingh) 05Open→03Resolved a:03ssingh ` sukhe@doh1001:~$ systemctl show anycast-healthchecker.service | grep -i pdns BindsTo=dnsdist.service pdns-recursor.service Aft...
[14:49:16] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventstreams add schema listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/933114 (owner: 10TChin)
[14:49:18] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841 (10jbond) now targeted for cas 7.0
[14:49:32] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero)
[14:51:40] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[14:51:57] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[14:53:03] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[14:53:36] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[14:53:50] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[14:54:09] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[14:55:19] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[14:55:45] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[14:58:41] <wikibugs>	 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10dancy)
[15:00:30] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[15:00:54] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[15:01:04] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[15:01:37] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[15:01:45] <icinga-wm>	 ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudservices2005-dev), No backups: 2 (cloudservices2005-dev, ...), Fresh: 130 jobs Jcrespo T339894 - The acknowledgement expires at: 2023-06-27 15:01:14. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[15:01:57] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[15:05:41] <wikibugs>	 (03PS1) 10Clément Goubert: changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 100 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933117 (https://phabricator.wikimedia.org/T339867)
[15:05:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Parsoid: Disable PC writes on frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler)
[15:06:28] <akosiaris>	 duesen: I've commented on the patch, thanks for splitting it in 2. Sorry for not answering sooner.
[15:06:33] <akosiaris>	 How does it look for frwiki?
[15:08:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 100 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933117 (https://phabricator.wikimedia.org/T339867) (owner: 10Clément Goubert)
[15:08:45] <claime>	 effie: You can deploy ^
[15:08:59] <effie>	 cool
[15:10:06] <duesen>	 akosiaris: no visible impact whatsoever. 
[15:10:16] <akosiaris>	 cool
[15:10:19] <duesen>	 But there is an unrelated problem with parsoidCachePrewarmJob that started about 11:20 utc. 
[15:10:28] <akosiaris>	 a template, right?
[15:10:30] <duesen>	 The jobqueue backlog is >45min now
[15:10:43] <duesen>	 A template is my guess, yes.
[15:11:05] <duesen>	 We need to be able to cope with template edits without causing this kind of backlog in the queue... 
[15:11:34] <duesen>	 Apparently it's unclear why we aren't processing enough, as the jobrunners have plenty of free capacity
[15:11:40] <sukhe>	 !log re-enable puppet on P{C:bird::anycast_healthchecker} and finish rolling out CR 922514
[15:11:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:01] <duesen>	 something is throttling the job throughput, but I know too little about how changeprop-jobqueue works
[15:12:21] <duesen>	 s/too little/nothing/
[15:12:31] <wikibugs>	 (03Abandoned) 10Ssingh: sre.hosts.reboot-cluster: fix-ups for Traffic/SRE usage [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh)
[15:13:17] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:13:23] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:15:18] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: deploy amd rocm package for llm server [deployment-charts] - 10https://gerrit.wikimedia.org/r/933119 (https://phabricator.wikimedia.org/T334583)
[15:18:17] <wikibugs>	 (03PS2) 10Effie Mouzeli: changeprop-jobqueue: Bump the concurrency for parsoidCachePrewarm to 100 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933117 (https://phabricator.wikimedia.org/T339867) (owner: 10Clément Goubert)
[15:18:20] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841 (10MoritzMuehlenhoff)
[15:19:55] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10User-jbond: Validate user lockout - https://phabricator.wikimedia.org/T233946 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This has been implemented a while ago in the sre.idm.logout cookbook. It runs various logout scripts (e.g. one whic...
[15:19:59] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10Security-Team, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10MoritzMuehlenhoff)
[15:21:33] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] changeprop-jobqueue: Bump the concurrency for parsoidCachePrewarm to 100 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933117 (https://phabricator.wikimedia.org/T339867) (owner: 10Clément Goubert)
[15:22:07] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: ldap: enable mirror mode [puppet] - 10https://gerrit.wikimedia.org/r/933120
[15:22:44] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop-jobqueue: Bump the concurrency for parsoidCachePrewarm to 100 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933117 (https://phabricator.wikimedia.org/T339867) (owner: 10Clément Goubert)
[15:23:52] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: ldap: drop overrided hiera key [puppet] - 10https://gerrit.wikimedia.org/r/933121
[15:24:26] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: ldap: enable mirror mode [puppet] - 10https://gerrit.wikimedia.org/r/933120 (owner: 10Arturo Borrero Gonzalez)
[15:24:32] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: ldap: drop overrided hiera key [puppet] - 10https://gerrit.wikimedia.org/r/933121 (owner: 10Arturo Borrero Gonzalez)
[15:25:49] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[15:26:26] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10User-jbond: Document IDP MFA policy and processes - https://phabricator.wikimedia.org/T284725 (10MoritzMuehlenhoff)
[15:26:29] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:26:39] <sukhe>	 !log upgrade dns5003 to gdnsd 3.99.0~alpha2
[15:26:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:18] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[15:27:27] <wikibugs>	 (03PS4) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594)
[15:28:04] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:30:04] <jouncebot>	 jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1530). Please do the needful.
[15:34:24] <wikibugs>	 (03CR) 10JMeybohm: "PCC fails for deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud and vrts-1002.devtools.eqiad1.wikimedia.cloud (but those fail for p" [puppet] - 10https://gerrit.wikimedia.org/r/933112 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm)
[15:34:26] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero)
[15:34:36] <wikibugs>	 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10aborrero) 05In progress→03Resolved
[15:35:51] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] envoyproxy: Add type URL to http and listener filters [puppet] - 10https://gerrit.wikimedia.org/r/933112 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm)
[15:40:36] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy amd rocm package for llm server [deployment-charts] - 10https://gerrit.wikimedia.org/r/933119 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos)
[15:41:32] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: deploy amd rocm package for llm server [deployment-charts] - 10https://gerrit.wikimedia.org/r/933119 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos)
[15:41:52] <moritzm>	 !log installing Java 8 security updates on stat* hosts
[15:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:14] <wikibugs>	 (03CR) 10RLazarus: "Both envoy.filters.{http.router,listener.tls_inspector} also show up under charts/*/templates/vendor/mesh for a lot of different charts --" [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm)
[15:45:23] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:47:33] <wikibugs>	 (03PS1) 10KartikMistry: Enable Content and Section Translation for 4 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933125 (https://phabricator.wikimedia.org/T338123)
[15:48:05] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "It looks ok, but take this with a grain of salt, or sugar" [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm)
[15:50:10] <wikibugs>	 (03CR) 10JMeybohm: mesh.configuration: Add type URL to http and listener filters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm)
[15:50:36] <jinxer-wm>	 (ProbeDown) firing: Service releases1002:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:52:39] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:52:43] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] mesh.configuration: Add type URL to http and listener filters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm)
[15:53:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Add type URL to http and listener filters [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm)
[15:53:11] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] modules.mesh.configuration: Copy 1.3.0 to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923303 (owner: 10JMeybohm)
[15:53:59] <wikibugs>	 (03Merged) 10jenkins-bot: modules.mesh.configuration: Copy 1.3.0 to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923303 (owner: 10JMeybohm)
[15:54:10] <wikibugs>	 (03Merged) 10jenkins-bot: mesh.configuration: Add type URL to http and listener filters [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm)
[15:54:41] <icinga-wm>	 PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:55:07] <icinga-wm>	 PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[15:55:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/932358 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris)
[15:56:13] <icinga-wm>	 PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_jenkins.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:43] <icinga-wm>	 RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:09] <icinga-wm>	 RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[15:58:18] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Jdforrester-WMF)
[15:58:50] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: helmfile.d: Add wikifunctions stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/932357 (https://phabricator.wikimedia.org/T340041)
[15:58:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: helmfile.d: Add wikifunctions stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/932357 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris)
[15:59:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks for the comments and +1s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/932357 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris)
[15:59:40] <duesen>	 claime: I see that the job processing rate nearly doubled half an hour ago. I am curious what made this happen. Do you know?
[15:59:41] <wikibugs>	 10SRE, 10Article-Recommendation: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10MatthewVernon)
[15:59:59] <claime>	 duesen: yep, we raised concurrency in cp-jobqueue to 100 for this job
[16:00:12] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Analytics-Radar, 10Data-Engineering-Icebox, 10Recommendation-API: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Done (though we might want to think about refactori...
[16:00:26] <claime>	 duesen: the concurrency graph is misleading because it's a max of averages; we were actually hitting the concurrency cap
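(Annotation, not part of the original log.) A minimal sketch of why a "max of averages" graph can hide a saturated concurrency cap. All numbers here are made up for illustration: the cap value, the sample windows, and the function names are assumptions, not taken from changeprop-jobqueue or its dashboards.

```python
from statistics import mean

# Hypothetical concurrency cap; the real pre-change value is not in the log.
CAP = 50

# Hypothetical in-flight job counts, grouped into averaging windows.
# Several individual samples sit exactly at the cap.
windows = [
    [50, 50, 10, 10],
    [50, 20, 50, 20],
    [30, 50, 50, 10],
]


def max_of_averages(windows):
    """What a 'max of averages' panel would plot: average each window, then take the max."""
    return max(mean(w) for w in windows)


def true_peak(windows):
    """The highest individual sample across all windows."""
    return max(max(w) for w in windows)


def fraction_at_cap(windows, cap):
    """Share of individual samples that reached the cap."""
    samples = [s for w in windows for s in w]
    return sum(1 for s in samples if s >= cap) / len(samples)
```

With this made-up data the graph would top out well below the cap, while half of the raw samples are pinned at it, which is the kind of gap described above.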
[16:01:49] <wikibugs>	 (03Merged) 10jenkins-bot: helmfile.d: Add wikifunctions stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/932357 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris)
[16:03:33] <icinga-wm>	 RECOVERY - jenkins_service_running on releases1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[16:08:07] <icinga-wm>	 PROBLEM - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[16:11:25] <wikibugs>	 (03PS1) 10EoghanGaffney: releases: Move jenkins ensure lines from old to new primary [puppet] - 10https://gerrit.wikimedia.org/r/933130
[16:12:58] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42005/console" [puppet] - 10https://gerrit.wikimedia.org/r/933130 (owner: 10EoghanGaffney)
[16:14:02] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] cumin: Properly set connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[16:18:27] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[16:19:54] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:21:01] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[16:21:43] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[16:22:17] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[16:22:52] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[16:23:39] <wikibugs>	 (03PS2) 10Hashar: ci/zuul: switch gearman server from contint2001 to contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[16:23:42] <wikibugs>	 (03PS3) 10Hashar: ci: make contint2002 the new rsync source, remove contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/867712 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[16:24:35] <wikibugs>	 (03CR) 10Hashar: "Rebased to clear conflict." [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[16:27:44] <wikibugs>	 (03CR) 10FNegri: cumin: Properly set connect_timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri)
[16:31:52] <wikibugs>	 (03PS3) 10Jforrester: deployment_server: Add stanzas for wikifunctions k8s [puppet] - 10https://gerrit.wikimedia.org/r/932358 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris)
[16:37:14] <wikibugs>	 (03Abandoned) 10Elukey: cassandra::instance::monitoring: move alerts to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/932427 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey)
[16:37:24] <wikibugs>	 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10akosiaris)
[16:41:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] releases: Move jenkins ensure lines from old to new primary [puppet] - 10https://gerrit.wikimedia.org/r/933130 (owner: 10EoghanGaffney)
[16:41:09] <wikibugs>	 (03PS1) 10Elukey: cassandra::instance::monitoring: move cql check to Prometheus for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470)
[16:41:50] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] ipoid: Use date/time image version name [deployment-charts] - 10https://gerrit.wikimedia.org/r/933096 (https://phabricator.wikimedia.org/T336163) (owner: 10Kosta Harlan)
[16:42:12] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] releases: Move jenkins ensure lines from old to new primary [puppet] - 10https://gerrit.wikimedia.org/r/933130 (owner: 10EoghanGaffney)
[16:42:25] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42006/console" [puppet] - 10https://gerrit.wikimedia.org/r/933134 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey)
[16:42:57] <wikibugs>	 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10akosiaris)
[16:44:04] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: service::catalog: Deduplicate search service IPs [puppet] - 10https://gerrit.wikimedia.org/r/930175
[16:50:26] <wikibugs>	 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF)
[16:50:34] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add support for knams as PoP in tooling and automation - https://phabricator.wikimedia.org/T340465 (10Volans) p:05Triage→03Medium
[16:50:36] <jinxer-wm>	 (ProbeDown) resolved: Service releases1002:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:52] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Beta-Cluster-Infrastructure, 10DBA, 10MediaWiki-libs-Rdbms, and 2 others: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (10Reedy)
[16:52:22] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks Ryan!" [puppet] - 10https://gerrit.wikimedia.org/r/930175 (owner: 10Alexandros Kosiaris)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1700)
[17:00:05] <jouncebot>	 ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T1700).
[17:08:44] <wikibugs>	 (03PS1) 10Elukey: cassandra::instance: use the instance's fqdn as TLS cert's CN for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933139
[17:10:53] <wikibugs>	 (03Abandoned) 10Elukey: cassandra::instance: use the instance's fqdn as TLS cert's CN for PKI [puppet] - 10https://gerrit.wikimedia.org/r/933139 (owner: 10Elukey)
[17:11:03] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:14:43] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:56] <wikibugs>	 (03PS1) 10Gmodena: page_content_change: version bump docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933143 (https://phabricator.wikimedia.org/T338380)
[17:35:03] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] "looks ready to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T318270) (owner: 10Ryan Kemper)
[17:36:33] <wikibugs>	 (03CR) 10Gmodena: "This patch is not yet ready to be merged. It depends on https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/m" [deployment-charts] - 10https://gerrit.wikimedia.org/r/933143 (https://phabricator.wikimedia.org/T338380) (owner: 10Gmodena)
[17:48:09] <wikibugs>	 (03PS1) 10Ottomata: eventgate - enable use of remote schema repos for main and logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/933166 (https://phabricator.wikimedia.org/T340166)
[17:50:54] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate - enable use of remote schema repos for main and logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/933166 (https://phabricator.wikimedia.org/T340166) (owner: 10Ottomata)
[17:51:46] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate - enable use of remote schema repos for main and logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/933166 (https://phabricator.wikimedia.org/T340166) (owner: 10Ottomata)
[17:53:29] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[17:53:58] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[18:02:54] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[18:03:06] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@32b4b99]: update dags to use discolytics 0.15.0
[18:03:24] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@32b4b99]: update dags to use discolytics 0.15.0 (duration: 00m 17s)
[18:03:52] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[18:04:13] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[18:04:50] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[18:05:10] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[18:05:38] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[18:05:59] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[18:06:41] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[18:07:00] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[18:07:42] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[18:08:53] <wikibugs>	 (CR) Ottomata: [C: +2] refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236) (owner: Ottomata)
[18:16:13] <wikibugs>	 (PS1) Ryan Kemper: [WIP] Dashboard for query service update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811)
[18:17:51] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:17:58] <wikibugs>	 SRE, Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (ssingh)
[18:18:06] <wikibugs>	 SRE, Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (ssingh) p:Triage→High
[18:21:56] <wikibugs>	 (PS2) Ryan Kemper: [WIP] Dashboard for query service update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811)
[18:25:46] <wikibugs>	 SRE, Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (ssingh)
[18:26:42] <wikibugs>	 SRE, Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (ssingh)
[18:26:46] <wikibugs>	 SRE, Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ssingh)
[18:33:07] <urandom>	 !log depooling sessionstore/codfw for bullseye upgrades — T340043
[18:33:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:12] <stashbot>	 T340043: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043
[18:33:22] <logmsgbot>	 !log eevans@cumin2002 START - Cookbook sre.discovery.service-route depool sessionstore in codfw: maintenance
[18:38:26] <logmsgbot>	 !log eevans@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in codfw: maintenance
[18:42:48] <logmsgbot>	 !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2002.codfw.wmnet with OS bullseye
[18:42:56] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2002.codfw.wmnet with OS bullseye
[18:52:42] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (nshahquinn-wmf) a:MatthewVernon→MoritzMuehlenhoff Everything is now migrated to the new account. It's safe to remove access from the o...
[18:57:57] <logmsgbot>	 !log eevans@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2002.codfw.wmnet with reason: host reimage
[19:02:11] <logmsgbot>	 !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2002.codfw.wmnet with reason: host reimage
[19:05:24] <wikibugs>	 (PS1) Kosta Harlan: gitlab runner: Allow mariadb:* images for allowed_docker_services [puppet] - https://gerrit.wikimedia.org/r/933175 (https://phabricator.wikimedia.org/T339352)
[19:06:10] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] gitlab runner: Allow mariadb:* images for allowed_docker_services [puppet] - https://gerrit.wikimedia.org/r/933175 (https://phabricator.wikimedia.org/T339352) (owner: Kosta Harlan)
[19:12:41] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (Arnoldokoth) ` aokoth@lists1001:~$ sudo mailman-wrapper create --owner kazuki-s@wikiusers.jp wikija-g@lists.wikimedia.org Created mailing list: wikija-g@lists.wikimedia.org `
[19:13:02] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Create wikija-g mailing list - https://phabricator.wikimedia.org/T340380 (Arnoldokoth) Open→In progress p:Triage→Medium
[19:18:32] <wikibugs>	 (CR) Brennen Bearnes: [V: +2 C: +2] Change type of 'age-factor-decay' from non-existing float to wild [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/929744 (https://phabricator.wikimedia.org/T338970) (owner: Aklapper)
[19:24:16] <logmsgbot>	 !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2002.codfw.wmnet with OS bullseye
[19:24:22] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2002.codfw.wmnet with OS bullseye completed: - sessionstore2002...
[19:31:47] <wikibugs>	 (CR) Brennen Bearnes: [C: +1] gitlab runner: Allow mariadb:* images for allowed_docker_services [puppet] - https://gerrit.wikimedia.org/r/933175 (https://phabricator.wikimedia.org/T339352) (owner: Kosta Harlan)
[19:38:36] <wikibugs>	 (PS1) Reedy: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/933151 (https://phabricator.wikimedia.org/T340483)
[19:38:45] <wikibugs>	 (PS1) Majavah: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/933152 (https://phabricator.wikimedia.org/T340483)
[19:38:47] <wikibugs>	 (PS2) Reedy: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/933151 (https://phabricator.wikimedia.org/T340483)
[19:39:17] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:39:28] <wikibugs>	 (Abandoned) Majavah: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/933152 (https://phabricator.wikimedia.org/T340483) (owner: Majavah)
[19:39:37] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:47:04] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (Andrew) Resolved→Open
[19:47:07] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (Andrew)
[19:47:55] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/933151 (https://phabricator.wikimedia.org/T340483) (owner: Reedy)
[19:48:37] <akosiaris>	 !log revert "Redirect www.mediawiki.org to mw-on-k8s", debugging T340483
[19:48:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:41] <stashbot>	 T340483: ExtensionDistributor is broken - https://phabricator.wikimedia.org/T340483
[19:49:17] <akosiaris>	 !log force puppet run on cp hosts T340483
[19:49:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:42] <wikibugs>	 (PS1) Reedy: CommonSettings.php: Set a proxy for $wgExtDistAPIConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/933179 (https://phabricator.wikimedia.org/T340483)
[19:52:07] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, cloud-services-team, User-aborrero: codfw1dev: OpenStack services can only sort of talk to memacached on cloudcontrols - https://phabricator.wikimedia.org/T340488 (Andrew)
[19:57:48] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (Eevans)
[20:00:07] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T2000).
[20:00:07] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:00:57] <logmsgbot>	 !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye
[20:01:02] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye
[20:02:20] <wikibugs>	 (CR) Reedy: "Caused T340483." [puppet] - https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[20:03:48] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (Reedy) >>! In T337490#8963478, @gerritbot wrote: > Change 923385 **merged** by Clément Goubert: > %%%[operations/puppet@production] mw-on-k8s: Redirect www.mediawiki.org to...
[20:07:54] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (Reedy)
[20:10:11] <brennen>	 jouncebot nowandnext
[20:10:11] <jouncebot>	 For the next 0 hour(s) and 49 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T2000)
[20:10:11] <jouncebot>	 In 0 hour(s) and 49 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T2100)
[20:10:47] <brennen>	 mutante, andre: i have that phab deploy prepped, now would probably be a reasonable time to push it out, i think
[20:12:23] <andre>	 I'm in :D
[20:12:36] <andre>	 (not that I had to do anything anyway, ahem)
[20:13:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Dwisehaupt) @Jclark-ctr Just wanted to follow up and see if this has been checked yet. Thanks!
[20:13:44] <andre>	 brennen: there's also a good bunch more Phab patches awaiting but I guess you have more important things to do :)
[20:14:17] <brennen>	 i grabbed a couple of the extremely low-stakes ones
[20:14:39] <brennen>	 others looked like i should probably do a bit more actual testing.
[20:16:00] <logmsgbot>	 !log eevans@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2001.codfw.wmnet with reason: host reimage
[20:16:37] <andre>	 brennen, up to your judgement :) https://phabricator.wikimedia.org/maniphest/query/MtNPMfa5ac0C/#R would be my list
[20:17:39] <andre>	 anyway. Happy to get that non-public issue deployed <3
[20:18:43] <logmsgbot>	 !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2001.codfw.wmnet with reason: host reimage
[20:20:04] <wikibugs>	 (PS1) JHathaway: admin: ensure dates are quoted [puppet] - https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972)
[20:21:07] <wikibugs>	 (PS3) Ryan Kemper: [WIP] Dashboard for query service update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811)
[20:21:20] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/933180 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[20:27:33] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on phab1004.eqiad.wmnet with reason: first setup
[20:27:47] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1004.eqiad.wmnet with reason: first setup
[20:27:55] <brennen>	 !log deploying minor phabricator updates shortly
[20:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:02] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on phab1004.eqiad.wmnet with reason: patch application
[20:28:05] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1004.eqiad.wmnet with reason: patch application
[20:29:17] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@a25a737]: deploy latest state to phab2002
[20:29:30] <brennen>	 doing phab2002 then 1004
[20:29:55] <brennen>	 andre: will round up the rest of the small stuff later this week
[20:29:55] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@a25a737]: deploy latest state to phab2002 (duration: 00m 38s)
[20:30:06] <andre>	 brennen, thanks
[20:30:13] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@a25a737]: deploy latest state to phab1004
[20:30:20] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on phab2002.codfw.wmnet with reason: patch application
[20:30:24] <mutante>	 downtimed phab2002 
[20:30:33] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab2002.codfw.wmnet with reason: patch application
[20:30:47] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@a25a737]: deploy latest state to phab1004 (duration: 00m 34s)
[20:30:49] <brennen>	 ah, thx - i don't think anything would normally trigger there anyway, but you never know
[20:33:46] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@0529926]: deploy latest state to phab1004
[20:33:57] <brennen>	 grr, reverting here.
[20:34:17] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@0529926]: deploy latest state to phab1004 (duration: 00m 31s)
[20:35:32] <wikibugs>	 (PS2) JHathaway: stdlib: upgrade to v8.6.2 [puppet] - https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972)
[20:39:23] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[20:39:50] <wikibugs>	 (PS4) Ryan Kemper: [WIP] Dashboard for query service update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811)
[20:40:05] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (Jhancock.wm)
[20:40:12] <wikibugs>	 SRE, ops-codfw, DC-Ops: sessionstore2001.codfw.wmnet unable to PXE boot - https://phabricator.wikimedia.org/T340055 (Jhancock.wm) Open→Resolved a:Jhancock.wm replaced with a different brand optic (Wave2Wave 77J-S010-T) and now the scripts run without downing the port on the switch.
[20:42:31] <logmsgbot>	 !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2001.codfw.wmnet with OS bullseye
[20:42:37] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye completed: - sessionstore2001...
[20:45:07] <logmsgbot>	 !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2003.codfw.wmnet with OS bullseye
[20:45:14] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2003.codfw.wmnet with OS bullseye
[20:47:23] <wikibugs>	 (PS1) Daniel Kinzler: Parsoid: Disable PC writes on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867)
[20:47:30] <wikibugs>	 (CR) CI reject: [V: -1] Parsoid: Disable PC writes on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867) (owner: Daniel Kinzler)
[20:48:45] <wikibugs>	 (PS2) Daniel Kinzler: Parsoid: Disable PC writes on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/933184 (https://phabricator.wikimedia.org/T339867)
[20:49:00] <wikibugs>	 (PS5) Ryan Kemper: Dashboard for wdqs update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811)
[20:50:37] <wikibugs>	 (CR) Ryan Kemper: "See the following preview dashboard for what the result looks like:" [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) (owner: Ryan Kemper)
[20:51:43] <wikibugs>	 (CR) Bking: [C: +1] sre.wdqs.data-transfer: fix broken logic [cookbooks] - https://gerrit.wikimedia.org/r/932324 (https://phabricator.wikimedia.org/T321605) (owner: Ryan Kemper)
[20:53:59] <wikibugs>	 (PS2) JHathaway: site.pp: Drop wmnet domain and always use regexes [puppet] - https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972)
[20:54:21] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[20:55:25] <logmsgbot>	 !log eevans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2003.codfw.wmnet with OS bullseye
[20:55:31] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2003.codfw.wmnet with OS bullseye executed with errors: - sessi...
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: That opportune time is upon us again. Time for a Weekly Security deployment window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230626T2100).
[21:02:16] <logmsgbot>	 !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2003.codfw.wmnet with OS bullseye
[21:02:22] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2003.codfw.wmnet with OS bullseye
[21:06:45] <wikibugs>	 (CR) Ryan Kemper: [C: +2] sre.wdqs.data-transfer: fix broken logic [cookbooks] - https://gerrit.wikimedia.org/r/932324 (https://phabricator.wikimedia.org/T321605) (owner: Ryan Kemper)
[21:07:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (Ottomata) Approved.
[21:09:28] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (cjming)
[21:10:09] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (Eevans)
[21:13:07] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:13:43] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2021.*
[21:13:48] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2022.*
[21:15:09] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[21:18:46] <logmsgbot>	 !log eevans@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: host reimage
[21:21:40] <wikibugs>	 (PS1) Btullis: Specify the schema registry type for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933187 (https://phabricator.wikimedia.org/T329514)
[21:21:57] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[21:22:13] <logmsgbot>	 !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: host reimage
[21:22:28] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[21:23:50] <wikibugs>	 (CR) Btullis: [C: +2] Specify the schema registry type for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933187 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[21:24:56] <wikibugs>	 (Merged) jenkins-bot: Specify the schema registry type for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933187 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[21:25:14] <wikibugs>	 (CR) BCornwall: [C: +2] pybal: Fix hostnames not being sent on alert [puppet] - https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: BCornwall)
[21:26:57] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[21:27:47] <wikibugs>	 (PS22) BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531)
[21:27:52] <wikibugs>	 (CR) BCornwall: Create cookbook to upgrade Apache Traffic Server (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: BCornwall)
[21:36:22] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[21:36:54] <wikibugs>	 (CR) BCornwall: [C: -1] "Looks like profile::tlsproxy::envoy::cfssl_label needs to be defined. Should it be set to "discovery"?" [puppet] - https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: Jbond)
[21:39:10] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[21:39:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:40:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (4) wcqs-updater.service Failed on wcqs1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:42:00] <wikibugs>	 (PS1) Btullis: Enable the service mesh for the top-level datahub deployment [deployment-charts] - https://gerrit.wikimedia.org/r/933188 (https://phabricator.wikimedia.org/T329514)
[21:43:10] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[21:43:32] <wikibugs>	 (CR) Btullis: [C: +2] Enable the service mesh for the top-level datahub deployment [deployment-charts] - https://gerrit.wikimedia.org/r/933188 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[21:44:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:44:35] <logmsgbot>	 !log eevans@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2003.codfw.wmnet with OS bullseye
[21:44:36] <wikibugs>	 (Merged) jenkins-bot: Enable the service mesh for the top-level datahub deployment [deployment-charts] - https://gerrit.wikimedia.org/r/933188 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[21:44:42] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2003.codfw.wmnet with OS bullseye completed: - sessionstore2003...
[21:45:34] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[21:50:56] <wikibugs>	 (CR) Dzahn: [C: +2] releases-jenkins: replace Apache 2.2 with 2.4 syntax for access control [puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[21:53:16] <urandom>	 !log pooling sessionstore/codfw for bullseye upgrades — T340043
[21:53:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:20] <stashbot>	 T340043: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043
[21:53:47] <logmsgbot>	 !log eevans@cumin2002 START - Cookbook sre.discovery.service-route pool sessionstore in codfw: maintenance
[21:54:53] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[21:55:10] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[21:57:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:57:34] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[21:58:51] <logmsgbot>	 !log eevans@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool sessionstore in codfw: maintenance
[21:59:22] <wikibugs>	 (PS1) Btullis: Permit datahub batch jobs to contact the GMS service [deployment-charts] - https://gerrit.wikimedia.org/r/933190 (https://phabricator.wikimedia.org/T329514)
[22:00:24] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (Eevans)
[22:01:30] <wikibugs>	 SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (Eevans) p:Triage→Medium
[22:01:41] <wikibugs>	 (CR) Btullis: [C: +2] Permit datahub batch jobs to contact the GMS service [deployment-charts] - https://gerrit.wikimedia.org/r/933190 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:02:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:02:52] <wikibugs>	 (Merged) jenkins-bot: Permit datahub batch jobs to contact the GMS service [deployment-charts] - https://gerrit.wikimedia.org/r/933190 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:05:00] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[22:07:26] <wikibugs>	 (CR) Dzahn: [C: +2] "So.. this does not break it.. but also I don't get blocked if I set my user agent manually to one of the blocked ones. But also.. this see" [puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[22:11:50] <wikibugs>	 (PS1) Ahmon Dancy: Add 'tag' argument to git::clone [puppet] - https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900)
[22:12:13] <wikibugs>	 (CR) CI reject: [V: -1] Add 'tag' argument to git::clone [puppet] - https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900) (owner: Ahmon Dancy)
[22:16:13] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[22:17:04] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[22:17:10] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[22:17:14] <logmsgbot>	 !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.restart (exit_code=97)
[22:18:52] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[22:22:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:24:18] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[22:27:23] <wikibugs>	 (PS1) Btullis: Bump datahub top-level chart version [deployment-charts] - https://gerrit.wikimedia.org/r/933195 (https://phabricator.wikimedia.org/T329514)
[22:29:19] <wikibugs>	 (CR) Btullis: [C: +2] Bump datahub top-level chart version [deployment-charts] - https://gerrit.wikimedia.org/r/933195 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:30:29] <wikibugs>	 (Merged) jenkins-bot: Bump datahub top-level chart version [deployment-charts] - https://gerrit.wikimedia.org/r/933195 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:31:13] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[22:31:36] <wikibugs>	 (PS1) Dzahn: switch contint.wikimedia.org from contint2001 to contint2002 [dns] - https://gerrit.wikimedia.org/r/933196 (https://phabricator.wikimedia.org/T324659)
[22:33:46] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[22:46:15] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[22:46:33] <wikibugs>	 (CR) Dzahn: [C: +2] graphite: replace Apache 2.2 access control syntax [puppet] - https://gerrit.wikimedia.org/r/932445 (https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[22:46:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:48:48] <wikibugs>	 (PS1) Btullis: Revert changes to the GMS networkpolicy in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933197 (https://phabricator.wikimedia.org/T329514)
[22:49:11] <wikibugs>	 (CR) Dzahn: [C: +2] "wtf, nothing changes whatsoever on the machine called "primary graphite host". makes no sense" [puppet] - https://gerrit.wikimedia.org/r/932445 (https://phabricator.wikimedia.org/T258686) (owner: Dzahn)
[22:51:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:51:35] <wikibugs>	 (CR) Btullis: [C: +2] Revert changes to the GMS networkpolicy in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933197 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:51:51] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[22:51:54] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs2022 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:52:45] <wikibugs>	 (Merged) jenkins-bot: Revert changes to the GMS networkpolicy in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/933197 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[22:53:24] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-categories on wdqs2022 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:53:26] <sbassett>	 Hey all - I’d like to deploy a quick update for T336027 to PrivateSettings.php during the last few mins of the weekly security window here.  Let me know if I shouldn't.
[22:55:23] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (Jclark-ctr) @BTullis  would like to take care of tomorrow when would be a good time with you to do this?
[22:55:23] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[22:58:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (Jclark-ctr) Performed stresstest on cpu for additional 24 hours with no errors restarting 3rd time
[23:01:12] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[23:01:33] <wikibugs>	 (PS1) Dwisehaupt: Remove hosts to be decommissioned. [puppet] - https://gerrit.wikimedia.org/r/933198 (https://phabricator.wikimedia.org/T340155)
[23:01:36] <wikibugs>	 (PS1) Dwisehaupt: Add frmon1002 to monitoring [puppet] - https://gerrit.wikimedia.org/r/933199 (https://phabricator.wikimedia.org/T319460)
[23:02:39] <sbassett>	 !log Deployed updated mitigation for T336027
[23:02:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:46] <wikibugs>	 (CR) Dwisehaupt: "for when we are ready to decom the hosts." [puppet] - https://gerrit.wikimedia.org/r/933198 (https://phabricator.wikimedia.org/T340155) (owner: Dwisehaupt)
[23:07:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:07:37] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[23:12:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:13:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Jclark-ctr) Open→Resolved updated docs
[23:21:23] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1092.eqiad.wmnet with reason: Replacing RAID controller battery
[23:21:48] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1092.eqiad.wmnet with reason: Replacing RAID controller battery
[23:21:53] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=36858c2c-bae0-4a63-9ac9-19916c27613e) set by btullis@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their se...
[23:23:52] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (BTullis) Hi @Jclark-ctr - Many thanks. I've shut down the machine ready for you, so you can replace it whenever is convenient. Feel free to boot the host again when finishe...
[23:29:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (BTullis) a:BTullis
[23:30:51] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (BTullis)
[23:35:55] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (BTullis) I've created the principal.  @cjming - please would you check your email **spam folder** because your welcome email and initial kerberos setup instructions are almost certainly in ther...
[23:39:25] <wikibugs>	 (PS1) Btullis: Record that fact that cjming is now kerberos enabled [puppet] - https://gerrit.wikimedia.org/r/933202 (https://phabricator.wikimedia.org/T340491)
[23:39:48] <wikibugs>	 (PS2) Btullis: Record the fact that cjming is now kerberos enabled [puppet] - https://gerrit.wikimedia.org/r/933202 (https://phabricator.wikimedia.org/T340491)